PROVABLE FICTITIOUS PLAY FOR GENERAL MEAN-FIELD GAMES

Abstract

We propose a reinforcement learning algorithm for stationary mean-field games, where the goal is to learn a pair of mean-field state and stationary policy that constitutes the Nash equilibrium. Viewing the mean-field state and the policy as two players, we propose a fictitious play algorithm which alternately updates the mean-field state and the policy via gradient descent and proximal policy optimization, respectively. Our algorithm is in stark contrast with the previous literature, which solves each single-agent reinforcement learning problem induced by the iterate mean-field states to optimality. Furthermore, we prove that our fictitious play algorithm converges to the Nash equilibrium at a sublinear rate. To the best of our knowledge, this is the first provably convergent reinforcement learning algorithm for mean-field games based on iterative updates of both the mean-field state and the policy.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) (Shoham et al., 2007; Busoniu et al., 2008; Hernandez-Leal et al., 2017; Hernandez-Leal et al.; Zhang et al., 2019) aims to tackle sequential decision-making problems in multi-agent systems (Wooldridge, 2009) by integrating the classical reinforcement learning framework (Sutton & Barto, 2018) with game-theoretic thinking (Başar & Olsder, 1998). Powered by deep learning (Goodfellow et al., 2016), MARL has recently achieved striking empirical successes in games (Silver et al., 2016; 2017; Vinyals et al., 2019; Berner et al., 2019; Schrittwieser et al., 2019), robotics (Yang & Gu, 2004; Busoniu et al., 2006; Leottau et al., 2018), transportation (Kuyer et al., 2008; Mannion et al., 2016), and social science (Leibo et al., 2017; Jaques et al., 2019; Cao et al., 2018; McKee et al., 2020). Despite these empirical successes, MARL is known to suffer from a scalability issue. Specifically, in a multi-agent system, each agent interacts with the other agents as well as the environment, with the goal of maximizing its own expected total return. As a result, for each agent, the reward function and the transition kernel of its local state also involve the local states and actions of all the other agents. Hence, as the number of agents increases, the capacity of the joint state-action space grows exponentially, which brings tremendous difficulty to reinforcement learning algorithms due to the need to handle high-dimensional input spaces. Such a curse of dimensionality caused by the large number of agents in the system is termed the "curse of many agents" (Sonu et al., 2017).
To circumvent such a notorious curse, a popular approach is mean-field approximation, which imposes symmetry among the agents and specifies that, for each agent, the joint effect of all the other agents is summarized by a population quantity, which is oftentimes given by the empirical distribution of the local states and actions of all the other agents, or a functional of such an empirical distribution. Specifically, to obtain symmetry, the reward and local state transition functions are the same for each agent and depend only on the local state-action pair and the population quantity. Thanks to mean-field approximation, such a multi-agent system, known as the mean-field game (MFG) (Huang et al., 2003; Lasry & Lions, 2006a; b; 2007; Huang et al., 2007; Guéant et al., 2011; Carmona & Delarue, 2018), is readily scalable to an arbitrary number of agents. In this work, we aim to find the Nash equilibrium (Nash, 1950) of an MFG with an infinite number of agents via reinforcement learning. By mean-field approximation, such a game consists of a population of symmetric agents, each of which has an infinitesimal effect on the whole population. By symmetry, it suffices to find a symmetric Nash equilibrium where each agent adopts the same policy. Under such a consideration, we can focus on a single agent, also known as the representative agent, and view the MFG as a game between the representative agent's local policy π and the mean-field state L, which aggregates the collective effect of the population. Specifically, the representative agent aims to find the optimal policy π when the mean-field state is fixed to L, which reduces to solving a Markov decision process (MDP) induced by L. Simultaneously, we aim to let L be the mean-field state that arises when all the agents adopt policy π. The Nash equilibrium of such a two-player game, (π*, L*), yields a symmetric Nash equilibrium π* of the original MFG.
Under proper conditions, the Nash equilibrium (π*, L*) can be obtained via fixed-point updates, which generate a sequence {π_t, L_t} as follows. For any t ≥ 0, in the t-th iteration, we solve the MDP induced by L_t and let π_t be the optimal policy. Then we update the mean-field state by letting L_{t+1} be the mean-field state obtained by letting every agent follow π_t. Under appropriate assumptions, the mapping from L_t to L_{t+1} is a contraction, and thus such an iterative algorithm converges to the unique fixed point of this contractive mapping, which corresponds to L* (Guo et al., 2019). Based on this contraction property, various reinforcement learning methods have been proposed to approximately implement the fixed-point updates and find the Nash equilibrium (π*, L*) (Guo et al., 2019; 2020; Anahtarci et al., 2019b; a; 2020). However, such an approach requires approximately solving a standard reinforcement learning problem within each iteration, which is itself solved by an iterative algorithm such as Q-learning (Watkins & Dayan, 1992; Mnih et al., 2015; Bellemare et al., 2017) or actor-critic methods (Konda & Tsitsiklis, 2000; Haarnoja et al., 2018; Schulman et al., 2015; 2017). As a result, this approach leads to a double-loop iterative algorithm for solving the MFG. When the state space S is enormous, function approximation tools such as deep neural networks are employed to represent the value and policy functions in the reinforcement learning algorithm, making each inner subproblem computationally demanding to solve. To obtain a computationally efficient algorithm for MFG, we consider the following question: Can we design a single-loop reinforcement learning algorithm for solving MFG which updates the policy and mean-field state simultaneously in each iteration?
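The double-loop fixed-point scheme above can be illustrated numerically. The sketch below is not the paper's algorithm but a minimal toy instance: a hypothetical two-state congestion game in which the reward of occupying state s is -L[s] (a crowding penalty), the entropy-regularized best response to a fixed L is available in closed form (so the inner loop is trivial), and the induced next-state distribution equals the policy itself. The names `best_response` and `induced_mean_field` are illustrative, not from the source.

```python
import numpy as np

def best_response(L, tau=1.0):
    # Entropy-regularized best response to a fixed mean-field state L,
    # for the toy crowding reward r(s, L) = -L[s]: a softmax over -L / tau.
    logits = -L / tau
    p = np.exp(logits - logits.max())
    return p / p.sum()

def induced_mean_field(pi):
    # In this toy, every agent's next-state distribution is just its policy,
    # so the induced mean-field state equals pi.
    return pi

L = np.array([0.9, 0.1])            # initial mean-field state over two states
for t in range(100):
    pi = best_response(L)           # "inner loop": solve the MDP induced by L (closed form here)
    L = induced_mean_field(pi)      # outer fixed-point update of the mean-field state
# The update L -> softmax(-L) is a contraction here, so L converges to the
# uniform equilibrium [0.5, 0.5].
```

In a genuine MFG the inner step is a full reinforcement learning problem rather than one softmax, which is exactly the computational burden that motivates the single-loop question above.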
We provide an affirmative answer to this question by proposing a fictitious play (Brown, 1951) policy optimization algorithm, where we view the policy π and the mean-field state L as the two players and update them simultaneously in each iteration. Fictitious play is a general algorithmic framework for solving games in which each player first infers the opponent's behavior and then improves its own policy based on the inferred information. When applied to MFG, in each iteration, the policy player π first infers the mean-field state implicitly by solving a policy evaluation problem associated with π on the MDP induced by L. Then the policy π is updated via a proximal policy optimization (PPO) (Schulman et al., 2017) step with entropy regularization, which is adopted to ensure the uniqueness of the Nash equilibrium. Meanwhile, the mean-field state L obtains its update direction by computing how the mean-field state evolves when all the agents execute policy π with their state distribution being L. Then L is updated towards this direction with some stepsize. Such an algorithm is single-loop, as the mean-field state L is updated immediately when π is updated. Furthermore, since L is a distribution over the state space S, when S is continuous, L lies in an infinite-dimensional space, which makes it computationally challenging to update. To overcome this challenge, we employ a succinct representation of L via kernel mean embedding, which maps L to an element of a reproducing kernel Hilbert space (RKHS) (Smola et al., 2007; Gretton et al., 2008; Sriperumbudur et al., 2010). Such a mechanism enables us to update the mean-field state within the RKHS, which can be computed efficiently. When the stepsizes for the policy and mean-field state updates are properly chosen, we prove that our single-loop fictitious play algorithm converges to the entropy-regularized Nash equilibrium at a sublinear Õ(T^{-1/5}) rate, where T is the total number of iterations and Õ(·) hides logarithmic terms.
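To contrast with the double-loop scheme, here is a hedged numerical sketch of the single-loop idea on the same hypothetical two-state congestion toy (crowding reward r(s, L) = -L[s]). The entropy-regularized proximal policy step is written in its mirror-descent form on the policy logits, and the mean-field state is nudged toward the one induced by the current policy with a small stepsize; the constants `tau`, `eta`, and `beta` are illustrative choices, not values from the paper's analysis.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

tau, eta, beta = 1.0, 0.5, 0.1   # entropy weight, policy stepsize, mean-field stepsize (illustrative)
L = np.array([0.9, 0.1])         # mean-field state over two states
logits = np.zeros(2)             # policy parameters, pi = softmax(logits)

for t in range(500):
    Q = -L                                            # evaluation step for the toy reward r(s, L) = -L[s]
    logits = (1.0 - eta * tau) * logits + eta * Q     # entropy-regularized proximal (mirror-descent) policy step
    pi = softmax(logits)
    L = (1.0 - beta) * L + beta * pi                  # nudge L toward the mean field induced by pi
# Both players move a little each iteration; in this symmetric toy, (pi, L)
# approaches the uniform entropy-regularized equilibrium.
```

Note that neither update is run to convergence within an iteration: the policy takes one proximal step and the mean-field state takes one damped step, which is the single-loop structure described above.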
To the best of our knowledge, we establish the first single-loop reinforcement learning algorithm for mean-field games with a finite-time convergence guarantee to the Nash equilibrium.
Our Contributions. Our contributions are two-fold. First, we propose a single-loop fictitious play algorithm that updates both the policy and the mean-field state simultaneously in each iteration, where the policy is updated via entropy-regularized proximal policy optimization. Moreover, we utilize kernel mean embedding to represent the mean-field states and the policy update subroutine

