POPULATION-SIZE-AWARE POLICY OPTIMIZATION FOR MEAN-FIELD GAMES

Abstract

In this work, we attempt to bridge the two fields of finite-agent and infinite-agent games by studying how the optimal policies of agents evolve with the number of agents (population size) in mean-field games. This agent-centric perspective contrasts with existing works, which typically focus on the convergence of the empirical distribution of the population. To this end, the premise is to obtain the optimal policies of a set of finite-agent games with different population sizes. However, deriving the closed-form solution for each game is theoretically intractable, training a distinct policy for each game is computationally intensive, and directly applying a policy trained in one game to other games is sub-optimal. We address these challenges through Population-size-Aware Policy Optimization (PAPO). Our contributions are three-fold. First, to efficiently generate efficient policies for games with different population sizes, we propose PAPO, which unifies two natural options (augmentation and hypernetwork) and achieves significantly better performance. PAPO consists of three components: i) the population-size encoding, which transforms the original value of the population size into an equivalent encoding to avoid training collapse; ii) a hypernetwork that generates a distinct policy for each game conditioned on the population size; and iii) the population size as an additional input to the generated policy. Next, we construct a multi-task-based training procedure to efficiently train the neural networks of PAPO by sampling data from multiple games with different population sizes. Finally, extensive experiments on multiple environments show the significant superiority of PAPO over baselines, and our analysis of the evolution of the generated policies further deepens the understanding of the two fields of finite-agent and infinite-agent games.



1 INTRODUCTION

Games involving a finite number of agents have been extensively investigated, ranging from board games such as Go (Silver et al., 2016; 2018), Poker (Brown & Sandholm, 2018; 2019; Moravčík et al., 2017), and Chess (Campbell et al., 2002) to real-time strategy games such as StarCraft II (Vinyals et al., 2019) and Dota 2 (Berner et al., 2019). However, existing works are typically limited to a handful of agents, which hinders them from broader applications. To break the curse of many agents (Wang et al., 2020), the mean-field game (MFG) (Huang et al., 2006; Lasry & Lions, 2007) was introduced to study games that involve an infinite number of agents. Recently, benefiting from reinforcement learning (RL) (Sutton & Barto, 2018) and deep RL (Lillicrap et al., 2016; Mnih et al., 2015), the MFG provides a versatile framework to model games with large populations of agents (Cui & Koeppl, 2022; Fu et al., 2019; Guo et al., 2019; Laurière et al., 2022; Perolat et al., 2021; Perrin et al., 2022; 2020; Yang et al., 2018). Despite the successes of finite-agent games and infinite-agent games, the two fields are largely evolving independently. Establishing the connection between an MFG and the corresponding finite-agent Markov (stochastic) games[1] has been a research hotspot, and it is typically done via the convergence of the empirical distribution of the population to the mean-field (Saldi et al., 2018; Cui & Koeppl, 2022; Cui et al., 2022; Fabian et al., 2022). However, few results have been achieved from an agent-centric perspective. Specifically, a fundamental question is: how do the optimal policies of agents evolve with the population size? As the population size increases, the finite-agent games approximate, though never equal, their infinite-agent counterparts (Cui & Koeppl, 2021; Mguni et al., 2018).
Therefore, the solutions returned by methods in finite-agent games should be consistent with those returned by methods in infinite-agent games. As we can never generate finite-agent games with an infinite number of agents, we need to investigate the evolution of the optimal policies of agents, i.e., scaling laws[2], to check the consistency of the methods. However, theoretically investigating the scaling laws is infeasible, as obtaining the closed-form solutions of a set of finite-agent games is typically intractable except in some special cases (Guo & Xu, 2019). Hence, another natural question is: how can we efficiently generate efficient policies for a set of finite-agent games with different population sizes? Most methods in finite-agent games can only return the solution of the game with a given number of agents (Bai & Jin, 2020; Jia et al., 2019; Littman et al., 2001). Unfortunately, the number of agents varies dramatically and rapidly in many real-world scenarios. For example, in the Taxi Matching environment (Nguyen et al., 2018; Alonso-Mora et al., 2017), the number of taxis could be several hundred during rush hour but only a handful at midnight. Fig. 1 demonstrates the failure of two naive options in this environment: i) directly applying the policy trained for a given population size to other population sizes (PPO-Naive), and ii) training a single policy on data sampled from multiple games with different population sizes (PPO). Furthermore, computing the optimal policies for games with different population sizes is computationally intensive. In this work, we propose a novel approach to efficiently generate efficient policies for games with different population sizes, and then investigate the scaling laws of the generated policies. Our main contributions are three-fold. First, we propose PAPO, which unifies two natural methods, augmentation and hypernetwork, and thus achieves better performance.
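As a rough illustration, the simpler of the two options, augmentation, amounts to conditioning a single shared policy on the population size by appending it to the observation. The sketch below shows this idea only; the normalization constant and dimensions are illustrative assumptions, not values from this paper.

```python
import numpy as np

def augment_obs(obs, n, n_max=1000):
    """Append the (normalized) population size n to the observation so
    that one shared policy can be trained across games of different
    sizes. The normalization constant n_max is an illustrative
    assumption, not a value from the paper."""
    return np.concatenate([np.asarray(obs, dtype=np.float32),
                           np.array([n / n_max], dtype=np.float32)])

# A single shared policy network would then consume the augmented input.
obs = np.zeros(4, dtype=np.float32)   # toy 4-dimensional observation
x = augment_obs(obs, n=200)           # x has shape (5,)
```

A policy trained this way sees the population size but must encode all size-dependent behavior in one fixed set of weights, which is exactly the limitation the hypernetwork route tries to lift.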
Specifically, PAPO consists of three components: i) the population-size encoding, which transforms the original value of the population size into an equivalent encoding to avoid training collapse; ii) a hypernetwork that generates a distinct policy for each game conditioned on the population size; and iii) the population size as an additional input to the generated policy. Next, to efficiently train the neural networks of PAPO, we construct a multi-task-based training procedure where the networks are trained using data sampled from games with different population sizes. Finally, extensive experiments on multiple widely used game environments demonstrate the superiority of PAPO over several naive and strong baselines. Furthermore, with a proper similarity measure, centered kernel alignment (Kornblith et al., 2019), we show the scaling laws of the policies generated by PAPO, which deepens our understanding of the two fields of finite-agent and infinite-agent games. By establishing the state of the art for bridging the two research fields, we believe that this work contributes to accelerating research in both fields from a new and unified perspective.
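The three components above can be sketched in a few lines. The following is a minimal, hypothetical illustration of the structure only: a binary stand-in for the population-size encoding, a single linear hypernetwork layer, and a linear generated policy. The paper's actual encoding scheme, network depths, and initializations are not reproduced here.

```python
import numpy as np

def encode_n(n, d=10):
    """Binary population-size encoding: an illustrative stand-in for
    the paper's encoding, whose purpose is to avoid feeding the raw,
    potentially large integer n directly into the networks."""
    return np.array([(n >> i) & 1 for i in range(d)], dtype=np.float32)

class PAPOSketch:
    """Minimal sketch of the PAPO structure: a hypernetwork maps the
    population-size encoding to the weights of a (here, linear)
    policy, and n is also appended to the generated policy's input.
    All shapes and initializations are illustrative assumptions."""
    def __init__(self, obs_dim, act_dim, enc_dim=10, seed=0):
        rng = np.random.default_rng(seed)
        self.obs_dim, self.act_dim = obs_dim, act_dim
        self.in_dim = obs_dim + 1                   # obs plus n itself
        n_params = self.in_dim * act_dim + act_dim  # policy W and b
        # hypernetwork: one linear map, encoding -> policy parameters
        self.H = rng.normal(0.0, 0.1, size=(enc_dim, n_params))

    def generate_policy(self, n):
        theta = encode_n(n) @ self.H                # hypernetwork output
        k = self.in_dim * self.act_dim
        W = theta[:k].reshape(self.in_dim, self.act_dim)
        b = theta[k:]
        return W, b                                 # distinct per n

    def logits(self, obs, n):
        W, b = self.generate_policy(n)
        x = np.concatenate([obs, [n / 1000.0]])     # n as extra input
        return x @ W + b
```

Under the multi-task training procedure described above, a population size n would be sampled for each batch of episodes and the same hypernetwork parameters updated from trajectories of the corresponding n-agent game, so that one set of weights serves all population sizes.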

2 RELATED WORKS

Our work lies at the intersection of two research fields: learning in Markov games (MGs) and learning in mean-field games (MFGs). Numerous works have studied the connection between an MFG and the corresponding finite-agent MGs from a theoretical or computational viewpoint, such as (Saldi et al., 2018; Doncel et al., 2019; Cabannes et al., 2021), to name a few. The general result achieved is either that the empirical distribution of the population converges to the mean-field as the number of players goes to infinity, or that the Nash equilibrium (NE) of an MFG is an approximate NE of the finite-agent game for a sufficiently large number of players, under different conditions such as the Lipschitz continuity of the reward/cost and transition functions (Saldi et al., 2018; Cui & Koeppl, 2021; 2022; Cui et al., 2022) and/or the convergence of the sequence of step-graphons (Cui & Koeppl, 2022; Cui et al., 2022). Though the advancements in these works provide a theoretical or



[1] In this work, we focus on the finite-agent Markov games sharing a similar structure with the MFG; see Sec. 3 and Appendix A.2 for more details and discussion.
[2] We use the term "scaling laws" to refer to the evolution of agents' optimal policies with the population size, which is different from that of Kaplan et al. (2020). See Appendix A.4 for a more detailed discussion.



Figure 1: Experiments on the Taxi Matching environment show the failure of two naive methods and the success of our PAPO. ↓ indicates that lower is better. See Sec. 5.1 for details.
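For completeness, the policy-similarity measure mentioned in Sec. 1, centered kernel alignment (Kornblith et al., 2019), has a standard linear variant that is straightforward to compute. The sketch below shows that linear form; whether the analysis in this paper uses the linear or a kernelized variant is not restated here, so treat this as a reference implementation of the standard definition only.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation
    matrices X (m x d1) and Y (m x d2), whose rows correspond to the
    same m examples. Returns a similarity in [0, 1]; 1 means the
    representations match up to an orthogonal transform and
    isotropic scaling."""
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(X.T @ Y, ord='fro') ** 2
    den = (np.linalg.norm(X.T @ X, ord='fro')
           * np.linalg.norm(Y.T @ Y, ord='fro'))
    return float(num / den)
```

Applied here, X and Y would be the activations of two generated policies on a common batch of observations, so that the similarity of policies across population sizes can be traced.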

