A MUTUAL INFORMATION DUALITY ALGORITHM FOR MULTI-AGENT SPECIALIZATION Anonymous

Abstract

Social behavior change in a heterogeneous population is an essential subject of multi-agent learning. Interactions between unique agents involve not only the optimization of single-agent learning; agents' behavioral changes also depend on the mutual similarity and dissimilarity between the agents. Our study provides a theoretical derivation of policy interactions under the formulation of joint policy optimization. We discover that joint policy optimization in a heterogeneous population affects population behaviors through mutual information (MI) maximization. We introduce a minimax formulation of MI (M&M) that optimizes population behavior under MI minimization against the joint policy optimization. Our main findings show that MI minimization can reduce the behavioral similarity between agents and enable agents to develop individualized policy specialization. Empirically, M&M demonstrates a substantial gain in average population performance and diversity, and narrows the performance gap among the agents.

1. INTRODUCTION

Following the success of multi-agent game play [(OpenAI et al., 2019), (Vinyals et al., 2019)], heterogeneous multi-agent learning is actively studied in areas such as AI game design (Contributors, 2022), embedded IoT (Toyama et al., 2021), and research toward future human-AI interaction. With the unique physiques and specialized character attributes in a heterogeneous population, the ability to optimize policies for specific purposes is essential. The difficulty, as well as the goal, of heterogeneous population learning research is to define a general learning algorithm that optimizes an agent population such that the uniqueness of each agent is fully utilized. To approach the problem, our research aims to understand the symmetric and asymmetric social behavior changes that occur during population learning through a mutual information formulation.

To study learning and behavioral change, multi-agent learning has formed two branches of study. In the simplest form of multi-agent RL, individualized learning is performed to learn separate behavioral policies for each agent. Prior works include Independent Q-Learning (IQL) (Tan, 1993), Policy Space Response Oracle (PSRO) (Lanctot et al., 2017), and AlphaStar (Vinyals et al., 2019). However, the empirical success of these approaches has excluded knowledge sharing: since training is done independently, one agent's learned experiences do not transfer to another. The individualized behaviors result in highly redundant re-exploration and weakly socialized behaviors among the agents. In contrast, joint policy optimization has been proposed as a solution to these problems. It utilizes a single conditioned policy to learn the diverse skill set and character attributes of the population via distributed RL optimization.
Through shared population experiences and joint policy optimization, a single conditioned policy network can learn a set of generalized policy skills that transfer across different agents. Notable examples include OpenAIFive (OpenAI et al., 2019), HAPPO (Kuba et al., 2021), and NeuPL (Liu et al., 2022), which optimize multi-agent behaviors under the expected accumulated rewards of the population. Through knowledge sharing and joint policy optimization, population learning in interactive games has benefited from increased learning efficiency and better generalization of agents' social behaviors. Our research focuses on this latter branch, where we analyze the cause of social behavior change. By analyzing a heterogeneous population with pairwise interactions, we discover that the joint policy optimization of a two-player competitive game converges to MI maximization among the agents. Unfortunately, MI maximization optimizes for the commonality in a population. We find that in a heterogeneous population where agents are individually unique, MI maximization benefits agents whose characteristics are close to the population norm, but severely degrades the performance of unique agents whose character attributes are dissimilar to the average population. To address this drawback of MI maximization, our main contribution is a novel minimax MI (M&M) formulation of population learning that enables individual agents to learn specialization from the dual formulation of MI maximization and minimization.
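As a toy illustration of why MI maximization drives behavioral similarity (a minimal sketch over hypothetical discrete action distributions, not the paper's estimator), the MI between two agents' actions is maximal when their behaviors are fully redundant and zero when they act independently:

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]                  # marginal of agent X
    py = [sum(col) for col in zip(*joint)]            # marginal of agent Y
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[x] * py[y]))
    return mi

# Two agents that always choose the same action: behavior is fully redundant.
identical = [[0.5, 0.0], [0.0, 0.5]]
# Two agents acting independently and uniformly: no behavioral similarity.
independent = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information(identical))    # ≈ log 2 ≈ 0.693
print(mutual_information(independent))  # 0.0
```

Under this reading, a joint objective that implicitly maximizes I(X;Y) pushes agents toward the first table (shared behavior), while the minimization side of a minimax formulation pushes toward the second (individualized behavior).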

2. BACKGROUND AND RELATED WORK

Multi-agent learning is a broad field of study that covers not only the intelligence necessary to achieve individual agent reward maximization, but also the social behavior change when agents interact with other agents. Prior research in population learning has studied the performance of agent populations in competitive and cooperative environments.

2.1. COMPETITIVE BEHAVIORS LEARNING

In a competitive environment, the goal of multi-agent research is to utilize competition to optimize the performance of an agent population. One approach to the learning iteration is individualized learning: each agent improves its policy by learning a best response (BR) against prior iterations of agents. The iterated elimination of dominated strategies optimizes a population of policies under game theory. Prior studies such as (Jaderberg et al., 2018), PSRO, PS-TRPO (Gupta et al., 2017), (Vinitsky et al., 2020), and AlphaStar utilize different variants of self-play (Heinrich et al., 2015) to learn competitive Nash equilibrium behaviors for a population. These variants address the stability (TRPO constraint), robustness (adversarial population), and diversity (leagues of policies pretrained on human data) of individualized learning. In contrast, a joint policy optimization framework proposed by (Foerster et al., 2016) and (Lowe et al., 2017) optimizes the population with Centralized Learning, Decentralized Execution (CLDE). Joint optimization enables common skills to transfer across policies: the commonality among agents can be learned once, and the learned behavior can be reused by different agents of the population. This form of joint policy optimization treats the population as a one-body problem under Mean Field Theory (Yang et al., 2018), in which the variations of individual agents are averaged, reducing the modeling of population behaviors from a many-body problem to a one-body problem. Prior works include MADDPG, HAPPO (Kuba et al., 2021), OpenAIFive (OpenAI et al., 2019), and NeuPL (Liu et al., 2022); OpenAIFive and NeuPL have further developed efficient graph solvers based on (Shoham & Leyton-Brown, 2008) to optimize the social graphs of the population. The graph solver F optimizes the match pairing of agents (x, y) with weighted edges Σ_(x,y). The objective of F commonly optimizes policy learning to be robust against adversarial exploitation, or against the agents with the most similar performance strengths.
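One way to picture such a graph solver is opponent sampling over weighted edges. The sketch below is a hedged illustration only: the function names, the ratings input, and the exponential similarity kernel are our assumptions, standing in for the actual pairing objectives used by OpenAIFive and NeuPL.

```python
import math
import random

def match_weights(ratings, temperature=1.0):
    """Weight each directed edge (x, y) by closeness in performance strength,
    mirroring solvers that favor opponents of similar skill.
    The exponential kernel is an illustrative choice, not the solvers' rule."""
    return {(x, y): math.exp(-abs(rx - ry) / temperature)
            for x, rx in enumerate(ratings)
            for y, ry in enumerate(ratings)
            if x != y}

def sample_match(ratings, rng=random.Random(0)):
    """Draw one match (x, y) with probability proportional to its edge weight."""
    weights = match_weights(ratings)
    pairs = list(weights)
    return rng.choices(pairs, weights=[weights[p] for p in pairs], k=1)[0]

# Agents 0 and 1 are close in strength, so they are paired far more often
# than either is with the much stronger agent 2.
x, y = sample_match([1200.0, 1210.0, 1800.0])
```

A robustness-oriented objective would instead weight edges toward the opponents that exploit a policy the most; the sampling machinery stays the same.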

2.2. COOPERATIVE BEHAVIORS LEARNING

To develop social behaviors of cooperation, prior studies have proposed auxiliary rewards and mutual information regularization as part of the objective function. These include OpenAIFive's team-spirit reward, the Q-value of (Chenghao et al., 2021), the PPO of (Cuervo & Alzate, 2020), and the latent-variable MI-maximization regularization of (Mahajan et al., 2019). While the above studies suggest that learning cooperative social behaviors requires auxiliary modifications with mutual information, (Dobbe et al., 2017) presents an interesting analysis of the distortion rate of joint policy optimization versus individualized learning with MI. The study shows that even without auxiliary modification, there is a significant difference in distortion-rate stability between joint optimization and individualized learning, an issue shown to negatively impact cooperative learning.
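The team-spirit idea can be sketched as a blending coefficient between individual and shared team reward. This is an illustrative reconstruction in the spirit of OpenAIFive's team-spirit reward, not its exact published rule:

```python
def team_spirit_rewards(rewards, tau):
    """Blend individual and team-mean reward with team-spirit coefficient tau:
    tau = 0 keeps rewards fully individual, tau = 1 replaces each agent's
    reward with the shared team mean. (Illustrative sketch.)"""
    mean_r = sum(rewards) / len(rewards)
    return [(1.0 - tau) * r + tau * mean_r for r in rewards]

team_spirit_rewards([1.0, 0.0], 0.5)  # -> [0.75, 0.25]
```

Annealing tau from 0 toward 1 over training lets agents first learn individually useful behavior, then shift credit toward cooperative team outcomes.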

