PROVABLE FICTITIOUS PLAY FOR GENERAL MEAN-FIELD GAMES

Abstract

We propose a reinforcement learning algorithm for stationary mean-field games, where the goal is to learn a pair of mean-field state and stationary policy that constitutes the Nash equilibrium. Viewing the mean-field state and the policy as two players, we propose a fictitious play algorithm that alternately updates the mean-field state and the policy via gradient descent and proximal policy optimization, respectively. Our algorithm is in stark contrast with previous literature, which solves each single-agent reinforcement learning problem induced by the mean-field state iterates to optimality. Furthermore, we prove that our fictitious play algorithm converges to the Nash equilibrium at a sublinear rate. To the best of our knowledge, this is the first provably convergent reinforcement learning algorithm for mean-field games based on iterative updates of both the mean-field state and the policy.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) (Shoham et al., 2007; Busoniu et al., 2008; Hernandez-Leal et al., 2017; Hernandez-Leal et al.; Zhang et al., 2019) aims to tackle sequential decision-making problems in multi-agent systems (Wooldridge, 2009) by integrating the classical reinforcement learning framework (Sutton & Barto, 2018) with game-theoretic thinking (Başar & Olsder, 1998). Powered by deep learning (Goodfellow et al., 2016), MARL has recently achieved striking empirical successes in games (Silver et al., 2016; 2017; Vinyals et al., 2019; Berner et al., 2019; Schrittwieser et al., 2019), robotics (Yang & Gu, 2004; Busoniu et al., 2006; Leottau et al., 2018), transportation (Kuyer et al., 2008; Mannion et al., 2016), and social science (Leibo et al., 2017; Jaques et al., 2019; Cao et al., 2018; McKee et al., 2020). Despite these empirical successes, MARL is known to suffer from a scalability issue. Specifically, in a multi-agent system, each agent interacts with the other agents as well as the environment, with the goal of maximizing its own expected total return. Consequently, for each agent, the reward function and the transition kernel of its local state also involve the local states and actions of all the other agents. As the number of agents increases, the capacity of the joint state-action space thus grows exponentially, which brings tremendous difficulty to reinforcement learning algorithms due to the need to handle high-dimensional input spaces. This curse of dimensionality arising from a large number of agents in the system is known as the "curse of many agents" (Sonu et al., 2017).
To circumvent this notorious curse, a popular approach is mean-field approximation, which imposes symmetry among the agents and specifies that, for each agent, the joint effect of all the other agents is summarized by a population quantity. This quantity is oftentimes the empirical distribution of the local states and actions of all the other agents, or a functional of such an empirical distribution. Specifically, to obtain symmetry, the reward and local state transition functions are the same for each agent and are functions of the local state-action pair and the population quantity. Thanks to mean-field approximation, such a multi-agent system, known as the mean-field game (MFG) (Huang et al., 2003; Lasry & Lions, 2006a;b; 2007; Huang et al., 2007; Guéant et al., 2011; Carmona & Delarue, 2018), is readily scalable to an arbitrary number of agents. In this work, we aim to find the Nash equilibrium (Nash, 1950) of an MFG with an infinite number of agents via reinforcement learning. By mean-field approximation, such a game consists of a population of symmetric agents, among which each individual agent has an infinitesimal effect on the whole population. By symmetry, it suffices to find a symmetric Nash equilibrium where each agent adopts the same policy. Under such a consideration, we can focus on a single agent, also known as the representative agent, and view the MFG as a game between the representative agent's local policy π and the mean-field state L, which aggregates the collective effect of the population. Specifically, the representative agent aims to find the optimal policy π when the mean-field state is fixed to L, which reduces to solving the Markov decision process (MDP) induced by L. Simultaneously, we aim to let L be the mean-field state obtained when all the agents adopt policy π. The Nash equilibrium of such a two-player game, (π*, L*), yields a symmetric Nash equilibrium π* of the original MFG.
Under proper conditions, the Nash equilibrium (π*, L*) can be obtained via fixed-point updates, which generate a sequence {π_t, L_t} as follows. For any t ≥ 0, in the t-th iteration, we solve the MDP induced by L_t and let π_t be its optimal policy. Then we update the mean-field state by letting L_{t+1} be the mean-field state obtained when every agent follows π_t. Under appropriate assumptions, the mapping from L_t to L_{t+1} is a contraction, and thus this iterative algorithm converges to the unique fixed point of the contractive mapping, which corresponds to L* (Guo et al., 2019). Based on this contraction property, various reinforcement learning methods have been proposed to approximately implement the fixed-point updates and find the Nash equilibrium (π*, L*) (Guo et al., 2019; 2020; Anahtarci et al., 2019b;a; 2020). However, such an approach requires approximately solving a standard reinforcement learning problem within each iteration, which is itself solved by an iterative algorithm such as Q-learning (Watkins & Dayan, 1992; Mnih et al., 2015; Bellemare et al., 2017) or actor-critic methods (Konda & Tsitsiklis, 2000; Haarnoja et al., 2018; Schulman et al., 2015; 2017). As a result, this approach leads to a double-loop iterative algorithm for solving the MFG. When the state space S is enormous, function approximation tools such as deep neural networks are employed to represent the value and policy functions, making each inner subproblem computationally demanding to solve. To obtain a computationally efficient algorithm for MFG, we consider the following question: Can we design a single-loop reinforcement learning algorithm for solving MFG which updates the policy and mean-field state simultaneously in each iteration?
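On a finite toy model, the double-loop fixed-point scheme described above can be sketched as follows. The congestion-style reward, random transition kernel, and all sizes below are our own illustrative choices, not part of the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9                      # tiny toy sizes (illustrative only)
base_r = rng.uniform(size=(S, A))            # hypothetical base reward
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a] in Delta(S)

def reward(L):
    # congestion-style coupling: being in a crowded state is penalized
    return base_r - L[:, None]

def best_response(L, n_iter=500):
    # inner loop: solve the MDP induced by L to optimality (value iteration)
    r, Q = reward(L), np.zeros((S, A))
    for _ in range(n_iter):
        Q = r + gamma * P @ Q.max(axis=1)
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0
    return pi

def induced_mean_field(pi, L, n_iter=500):
    # outer update: push the state distribution through the chain induced by pi
    for _ in range(n_iter):
        L = np.einsum('s,sa,sat->t', L, pi, P)
    return L

L = np.ones(S) / S
for _ in range(30):                          # outer fixed-point loop
    pi = best_response(L)
    L = induced_mean_field(pi, L)
```

Each outer iteration requires fully solving an MDP, which is exactly the double-loop structure this paper seeks to avoid.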
We provide an affirmative answer to this question by proposing a fictitious play (Brown, 1951) policy optimization algorithm, where we view the policy π and the mean-field state L as the two players and update them simultaneously in each iteration. Fictitious play is a general algorithmic framework for solving games where each player first infers the opponent and then improves its own policy based on the inferred opponent information. When applied to MFG, in each iteration the policy player π first infers the mean-field state implicitly by solving a policy evaluation problem associated with π on the MDP induced by L. Then the policy π is updated via a proximal policy optimization (PPO) (Schulman et al., 2017) step with entropy regularization, which is adopted to ensure the uniqueness of the Nash equilibrium. Meanwhile, the mean-field state L obtains its update direction by computing how the mean-field state evolves when all agents execute policy π with their state distribution being L. Then L is updated towards this direction with some stepsize. Such an algorithm is single-loop, as the mean-field state L is updated immediately after π is updated. Furthermore, since L is a distribution over the state space S, when S is continuous, L lies in an infinite-dimensional space, which makes it computationally challenging to update. To overcome this challenge, we employ a succinct representation of L via kernel mean embedding, which maps L to an element in a reproducing kernel Hilbert space (RKHS) (Smola et al., 2007; Gretton et al., 2008; Sriperumbudur et al., 2010). This mechanism enables us to update the mean-field state within the RKHS, which can be done efficiently. When the stepsizes for the policy and mean-field state updates are properly chosen, we prove that our single-loop fictitious play algorithm converges to the entropy-regularized Nash equilibrium at a sublinear Õ(T^{-1/5}) rate, where T is the total number of iterations and Õ(·) hides logarithmic terms.
To the best of our knowledge, this is the first single-loop reinforcement learning algorithm for mean-field games with a finite-time convergence guarantee to the Nash equilibrium.

Our Contributions. Our contributions are two-fold. First, we propose a single-loop fictitious play algorithm that updates both the policy and the mean-field state simultaneously in each iteration, where the policy is updated via entropy-regularized proximal policy optimization. Moreover, we utilize kernel mean embedding to represent the mean-field states, and the policy update subroutine can readily incorporate any function approximation tools to represent both the value and policy functions, which makes our fictitious play method a general algorithmic framework able to handle MFGs with continuous state spaces. Second, we prove that the policy and mean-field state sequence generated by the proposed algorithm converges to the Nash equilibrium of the MFG at a sublinear Õ(T^{-1/5}) rate.

Related Works. Our work belongs to the literature on discrete-time MFGs. A variety of works have focused on the existence of a Nash equilibrium and the behavior of the Nash equilibrium as the number of agents goes to infinity under various settings of MFG. See, e.g., Gomes et al. (2010); Tembine & Huang (2011); Moon & Başar (2014); Biswas (2015); Saldi et al. (2018b;a; 2019); Więcek (2020) and the references therein. In addition, our work is more related to the line of research that aims to solve MFGs via reinforcement learning methods. Most of the existing works propose to find the Nash equilibrium via fixed-point iterations in the space of mean-field states, which requires solving an MDP induced by a mean-field state within each iteration (Guo et al., 2019; 2020; Anahtarci et al., 2019a;b; Fu et al., 2019; uz Zaman et al., 2020; Anahtarci et al., 2020), including Guo et al. (2019; 2020) and Anahtarci et al. (2019a;b; 2020). Most closely related to ours is Elie et al. (2020), which studies the convergence of a version of fictitious play for MFG.
Similar to our algorithm, their fictitious play also regards the policy and the mean-field state as the two players. However, for the policy update, they compute the best response policy to the current mean-field state by solving the MDP induced by the mean-field state to approximate optimality, and the obtained policy is added to the set of previous policy iterates to form a mixture policy. As a result, their algorithm is double-loop in essence, as it solves an MDP in each iteration. In contrast, our fictitious play is single-loop: the policy is updated via a single PPO step in each iteration, and the mean-field state is updated without the policy ever solving any MDP associated with a mean-field state.

Notations. We use ‖·‖₁ to denote the vector ℓ₁-norm, and ∆(D) the probability simplex over a set D. The Kullback-Leibler (KL) divergence between p₁, p₂ ∈ ∆(A) is defined as D_KL(p₁ ‖ p₂) := Σ_{a∈A} p₁(a) log(p₁(a)/p₂(a)). Let 1_n ∈ R^n denote the all-one vector. For two quantities x and y that may depend on problem parameters (|A|, γ, etc.), if x ≥ Cy holds for a universal constant C > 0, we write x ≳ y (equivalently, y ≲ x), x = Ω(y), and y = O(x). We use Õ(·) to denote O(·) ignoring logarithmic factors.

2. BACKGROUND AND PRELIMINARIES

In this section, we first review the standard setting of mean-field games (MFG) from Guo et al. (2019), and then introduce a more general MFG with mean embedding and entropy regularization.

2.1. MEAN-FIELD GAMES

Consider a discrete-time Markov game involving an infinite number of identical and interchangeable agents. Let S ⊆ R^d and A ⊆ R^p be the state space and action space, respectively, which are common to all agents. We assume that S is compact and A is finite. The reward and the state dynamics of each agent depend on the collective behavior of all agents through the mean-field state, i.e., the distribution of the states of all agents. As the agents are homogeneous and interchangeable, one can focus on a single agent representative of the population. Let r : S × A × ∆(S) → [0, R_max] be the (bounded) reward function and P : S × A × ∆(S) → ∆(S) be the state transition kernel. At each time t, the representative agent is in state s_t ∈ S, and the probability distribution of s_t, denoted by L_t ∈ ∆(S), corresponds to the mean-field state. Upon taking an action a_t ∈ A, the agent receives a reward r(s_t, a_t, L_t) and transitions to a new state s_{t+1} ∼ P(·|s_t, a_t, L_t). A Markovian policy for the agent is a function π : S → ∆(A) that maps her own state to a distribution over actions, i.e., π(a|s) is the probability of taking action a in state s. Let Π be the set of all Markovian policies. When an agent operates under a policy π ∈ Π and the mean-field population flow is L := (L_t)_{t≥0}, we define the expected cumulative discounted reward (or value function) of this agent as V^π(s, L) := E[ Σ_{t=0}^∞ γ^t r(s_t, a_t, L_t) | s_0 = s ], where a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t, L_t), and γ ∈ (0, 1) is the discount factor. The goal of this agent is to find a policy π that maximizes V^π(s, L) while interacting with the mean-field L. We are interested in finding a stationary (time-independent) Nash equilibrium (NE) of the game, which is a policy-population pair (π*, L*) ∈ Π × ∆(S) satisfying the following two properties:

• (Agent rationality) V^{π*}(s, L*) ≥ V^π(s, L*) for all π ∈ Π and s ∈ S.
• (Population consistency) L_t = L* for all t under policy π* with initial mean-field state L_0 = L*.

That is, π* is the optimal policy under the mean-field L*, and L* remains fixed under π*. We formalize the notion of NE in Section 2.3 after introducing a more general setting of MFG.
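When the mean-field flow is held fixed at a stationary L, the value function V^π(·, L) defined above satisfies a linear Bellman equation and can be computed exactly in a finite toy model. The sketch below uses random illustrative stand-ins for r, P, and π:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] in Delta(S), frozen at a fixed L
r = rng.uniform(size=(S, A))                # reward r(s, a, L) at that same fixed L
pi = rng.dirichlet(np.ones(A), size=S)      # an arbitrary Markovian policy pi(a|s)

# V^pi satisfies V = r_pi + gamma * P_pi V, with r_pi and P_pi averaged over pi(.|s),
# so exact policy evaluation reduces to a single linear solve.
r_pi = (pi * r).sum(axis=1)                          # shape (S,)
P_pi = np.einsum('sa,sat->st', pi, P)                # shape (S, S)
V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)  # V^pi(s, L) for each s
```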

2.2. MEAN EMBEDDING OF MEAN-FIELD STATES

Note that the mean-field state L* is a distribution over the states. When the state space is continuous, the NE (π*, L*) is an infinite-dimensional object, posing challenges for learning the NE. To overcome this challenge, we make use of a succinct representation of the mean-field via mean embedding, which embeds the mean-field states into a reproducing kernel Hilbert space (RKHS) (Smola et al., 2007; Gretton et al., 2008; Sriperumbudur et al., 2010). Specifically, given a positive definite kernel k : S × S → R, let H be the associated RKHS endowed with the inner product ⟨·, ·⟩_H and norm ‖·‖_H. For each L ∈ ∆(S), its mean embedding µ_L ∈ H is defined as µ_L(s) := E_{x∼L}[k(x, s)] for all s ∈ S. Let M := {µ_L : L ∈ ∆(S)} ⊆ H be the set of all possible mean embeddings. Note that when k is the identity kernel, we have µ_L = L and M = ∆(S). On the other hand, when k is more structured (e.g., with a fast-decaying eigenspectrum), M has significantly lower complexity than the set ∆(S) of raw mean-field states. We assume that the MFG respects the mean embedding structure, in the sense that the reward r : S × A × M → [0, R_max] and transition kernel P : S × A × M → ∆(S) (with a slight abuse of notation) depend on the mean-field state L through its mean embedding µ_L. In particular, at each time t with state s_t and mean-field state L_t, the representative agent takes an action a_t ∼ π(·|s_t), receives a reward r(s_t, a_t, µ_{L_t}), and then transitions to a new state s_{t+1} ∼ P(·|s_t, a_t, µ_{L_t}). The NE of the game is defined analogously. As mentioned, when k is the identity kernel, the above setting reduces to the standard setting in Section 2.1 with raw mean-field states. We impose a standard regularity condition on the kernel k.

Assumption 1. The kernel k is bounded and universal, in the sense that k(s, s) ≤ 1 for all s ∈ S and the corresponding RKHS H is dense, with respect to the L_∞ norm, in the space of continuous functions on S.
Assumption 1 is standard in the kernel learning literature (Caponnetto & De Vito, 2007; Muandet et al., 2012; Szabó et al., 2015; Lin et al., 2017). When the kernel is bounded, the embedding of each L ∈ ∆(S) satisfies ‖µ_L‖_H ≤ E_{x∼L}[ ‖k(x, ·)‖_H ] ≤ 1. When one uses a universal kernel (e.g., the Gaussian or Laplace kernel), the mean embedding mapping is injective, and hence each embedding µ ∈ M uniquely characterizes a distribution L ∈ ∆(S) (Gretton et al., 2008; 2012).
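As a concrete instance of the embedding, the sketch below uses a Gaussian kernel and an empirical sample from L; by the reproducing property, ‖µ_L‖²_H = E_{x,x′∼L}[k(x, x′)], so the bound ‖µ_L‖_H ≤ 1 can be checked directly. The sample size and bandwidth are arbitrary illustrative choices:

```python
import numpy as np

def gaussian_kernel(x, y, bw=1.0):
    # k(x, y) = exp(-(x - y)^2 / (2 bw^2)): bounded with k(s, s) <= 1, and universal
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2 * bw ** 2))

rng = np.random.default_rng(2)
samples = rng.normal(size=200)      # draws from some state distribution L over R

def mu_L(s):
    # empirical mean embedding mu_L(s) = E_{x ~ L}[k(x, s)], evaluated pointwise
    return gaussian_kernel(samples, np.atleast_1d(s)).mean(axis=0)

# |mu_L|_H^2 = E_{x, x' ~ L}[k(x, x')] via the reproducing property
sq_norm = gaussian_kernel(samples, samples).mean()
```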

2.3. ENTROPY REGULARIZATION

To ensure the uniqueness of the NE and achieve fast algorithmic convergence, we use an entropy regularization approach (Cen et al., 2020; Shani et al., 2019; Nachum et al., 2017), which augments the standard expected reward objective with an entropy term of the policy. In particular, we define the entropy-regularized value function as

V^{λ,π}_µ(s) := E[ Σ_{t=0}^∞ γ^t ( r(s_t, a_t, µ) − λ log π(a_t|s_t) ) | s_0 = s ],   where a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t, µ),

the parameter λ > 0 controls the regularization level, and µ is the mean embedding of some given mean-field state (fixed over time). Equivalently, one may view V^{λ,π}_µ as the usual value function of π with an entropy-regularized reward

r^{λ,π}_µ(s, a) := r(s, a, µ) − λ log π(a|s), ∀s ∈ S, a ∈ A.   (1)

Also define the Q-function of a policy π as

Q^{λ,π}_µ(s, a) := r(s, a, µ) + γ E[ V^{λ,π}_µ(s_1) | s_0 = s, a_0 = a ],   (2)

which is related to the value function as

V^{λ,π}_µ(s) = E_{a∼π(·|s)}[ Q^{λ,π}_µ(s, a) − λ log π(a|s) ] = ⟨Q^{λ,π}_µ(s, ·), π(·|s)⟩ + λ H(π(·|s)),   (3)

where H(π(·|s)) := −Σ_a π(a|s) log π(a|s) is the Shannon entropy of the distribution π(·|s). Since the reward function r is assumed to be R_max-bounded, it is easy to show that the Q-function is also bounded as ‖Q^{λ,π}_µ‖_∞ ≤ Q_max := (R_max + γλ log |A|)/(1 − γ); see Lemma 5.

Single-Agent MDP. When the mean-field state and its mean embedding remain fixed over time, i.e., L_t = L and µ_{L_t} = µ for all t, a representative agent aims to solve the optimization problem

max_{π : S → ∆(A)} V^{λ,π}_µ(s) for each s ∈ S.   (4)

This problem corresponds to finding the (entropy-regularized) optimal policy of a single-agent discounted MDP, denoted by MDP_µ := (S, A, P(·|·, ·, µ), r(·, ·, µ), γ), that is induced by µ ∈ M. Let π^{λ,*}_µ be the optimal solution to problem (4), that is, the optimal regularized policy of MDP_µ. The optimal policy is unique whenever λ > 0.
One can thus define a mapping Γ^λ_1 : M → Π via Γ^λ_1(µ) = π^{λ,*}_µ, which maps each embedded mean-field state µ to the optimal regularized policy π^{λ,*}_µ of MDP_µ. Let Q^{λ,*}_µ be the optimal regularized Q-function corresponding to the optimal policy π^{λ,*}_µ. Throughout the paper, we fix a state distribution ν_0 ∈ ∆(S), which serves as the initial state distribution of our policy optimization algorithm. For each µ ∈ M and each policy π : S → ∆(A), define J^λ_µ(π) := E_{s∼ν_0}[ V^{λ,π}_µ(s) ] as the expectation of the value function V^{λ,π}_µ(s) of policy π on the regularized MDP_µ. We define the discounted state visitation distribution ρ^π_µ induced by a policy π on MDP_µ as

ρ^π_µ(s) := (1 − γ) Σ_{t=0}^∞ γ^t P(s_t = s),   (6)

where P(s_t = s) is the state distribution at time t when s_0 ∼ ν_0 and the actions are chosen according to π.

Mean-field Dynamics. When all agents follow the same policy π, we can define another mapping Γ_2 : Π × M → M that describes the dynamics of the embedded mean-field state. In particular, given the current embedding µ corresponding to some mean-field state L, the next embedded mean-field state µ⁺ = Γ_2(π, µ) is given by

L⁺(s′) = ∫_S Σ_{a∈A} L(s) π(a|s) P(s′|s, a, µ) ds,   µ⁺ = µ_{L⁺}.

Note that the evolution of the mean-field depends on the agents' policy in a deterministic manner.

Entropy-regularized Mean-field Nash Equilibrium (NE). With the above notation, we can formally define our notion of equilibrium.

Definition 1. A stationary (time-independent) entropy-regularized Nash equilibrium of the MFG is a policy-population pair (π*, µ*) ∈ Π × M that satisfies

(agent rationality) π* = Γ^λ_1(µ*),   (population consistency) µ* = Γ_2(π*, µ*).

When λ = 0, the above definition reduces to that of the (unregularized) NE discussed in Section 2.1, which requires π* to be the unregularized optimal policy of MDP_{µ*}.
For general values of λ, the regularized NE (π*, µ*) approximates the unregularized NE (Geist et al., 2019), in the sense that π* is an approximately optimal policy of MDP_{µ*} satisfying max_{π∈Π} J^0_{µ*}(π) − J^λ_{µ*}(π*) ≤ λ log |A| / (1 − γ). One may further define the composite mapping Λ^λ : M → M as Λ^λ(µ) = Γ_2(Γ^λ_1(µ), µ). When Λ^λ is a contraction, the regularized NE exists and is unique (Guo et al., 2019). Moreover, the iterates {(π_t, µ_t)}_{t≥0} given by the two-step update π_t = Γ^λ_1(µ_t), µ_{t+1} = Γ_2(π_t, µ_t) converge to the regularized NE at a linear rate. Note that the first step above requires an oracle for computing the exact optimal policy π^{λ,*}_{µ_t}. In most cases, such an exact oracle is not available; various single-agent reinforcement learning algorithms have been considered for computing an approximately optimal policy, including Q-learning (Guo et al., 2019) and policy gradient methods (Guo et al., 2020; Subramanian & Mahajan, 2019). The recent work by Elie et al. (2019) considers a fictitious play iterative learning scheme. We remark that their convergence guarantee requires being able to compute an approximately optimal policy to arbitrary precision with high probability.
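On a finite toy model, the oracle Γ^λ_1 can be implemented with soft value iteration; the sketch below (our own illustrative construction, with random r and P standing in for a fixed embedding µ) computes the entropy-regularized optimal policy as the Boltzmann distribution of the soft Q-function and verifies the identity V = ⟨Q, π⟩ + λH(π):

```python
import numpy as np

S, A, gamma, lam = 3, 2, 0.9, 0.5
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(S), size=(S, A))  # MDP_mu frozen at a fixed embedding mu
r = rng.uniform(size=(S, A))

def soft_value_iteration(n_iter=500):
    # iterate the soft Bellman operator: V <- lam * log sum_a exp(Q(s, a) / lam)
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = r + gamma * P @ V
        V = lam * np.log(np.exp(Q / lam).sum(axis=1))
    pi = np.exp((Q - V[:, None]) / lam)     # Boltzmann optimal policy; rows sum to 1
    return pi, Q, V

pi_star, Q_star, V_star = soft_value_iteration()
```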

3. FICTITIOUS PLAY ALGORITHM FOR MFG

In this section, we present a fictitious play algorithm that simultaneously estimates the policy π* and the embedded mean-field state µ* of the NE. As given in Algorithm 1, each iteration of the algorithm involves three steps: policy evaluation (line 3), policy improvement (line 4), and an update of the embedded mean-field state (line 5). Below we explain each step in more detail.

Algorithm 1 Mean-Embedded Fictitious Play
1: Input: initial estimate (π_0, µ_0), stepsize sequences {α_t, β_t}_{t≥0}, mixing parameter η.
2: for iteration t = 0, 1, 2, ..., T − 1 do
3:   (Policy evaluation step) Compute an approximate version Q̂^λ_t : S × A → [0, Q_max] of the Q-function Q^{λ,π_t}_{µ_t} of policy π_t with respect to the entropy-regularized MDP_{µ_t}.
4:   (Policy improvement step) Update the policy by
       π̃_{t+1}(·|s) ∝ (π_t(·|s))^{1−α_tλ} exp( α_t Q̂^λ_t(s, ·) ),   (8)
       π_{t+1}(·|s) = (1 − η) π̃_{t+1}(·|s) + η / |A|.   (9)
5:   (Mean-field update step) Update the embedded mean-field state by
       µ_{t+1} = (1 − β_t) µ_t + β_t Γ_2(π_{t+1}, µ_t).   (10)
6: end for
7: Output: {(π_t, µ_t)}_{t=1,...,T}.

Policy Evaluation. In each iteration, we first evaluate the current policy π_t with respect to the regularized single-agent MDP_{µ_t} induced by the current mean-field estimate µ_t. In particular, we compute an approximation Q̂^λ_t of the true Q-function Q^λ_t := Q^{λ,π_t}_{µ_t}, which can be done using, e.g., TD(0) or LSTD methods. Our theorem characterizes how convergence depends on the policy evaluation error in this step.

Policy Improvement. To update our policy estimate π_t, we first compute an intermediate policy π̃_{t+1} by a single policy improvement step: for each s ∈ S,

π̃_{t+1}(·|s) = argmax_{p(·|s)∈∆(A)} { α_t ⟨Q̂^λ_t(s, ·) − λ log π_t(·|s), p(·|s) − π_t(·|s)⟩ − D_KL(p(·|s) ‖ π_t(·|s)) },   (11)

where α_t > 0 is the stepsize. This step corresponds to one iteration of Proximal Policy Optimization (PPO) (Schulman et al., 2017).
It can also be viewed as one mirror descent iteration, where the shifted Q-function Q̂^λ_t(s, ·) − λ log π_t(·|s) plays the role of the gradient. The maximizer π̃_{t+1} in equation (11) can be computed in closed form, as done in equation (8) in Algorithm 1. We then compute the new policy π_{t+1} by mixing π̃_{t+1} with a small amount of the uniform distribution, as done in equation (9). Mixing in a uniform distribution is a standard technique to prevent the policy from approaching the boundary of the probability simplex and becoming degenerate. Doing so allows us to upper bound a quantity of the form D_KL(p ‖ π_{t+1}(·|s)) (cf. Lemma 2), which may otherwise be infinite. It also ensures that the KL divergence satisfies a Lipschitz condition (cf. Lemma 3).

Mean-field Update. We next compute an updated (embedded) mean-field state µ_{t+1} as a weighted average of the current µ_t and the mean-field state Γ_2(π_{t+1}, µ_t) induced by the new policy π_{t+1}, namely, µ_{t+1} = (1 − β_t)µ_t + β_t Γ_2(π_{t+1}, µ_t), where β_t ∈ (0, 1) is the stepsize. This update can be viewed as a single step of the (soft) fixed-point iteration for the equation µ = Γ_2(π_{t+1}, µ). We remark that our algorithm is similar to the classical fictitious play approach for finding NEs, where each agent plays a response to the empirical average of its opponent's past behaviors. In our algorithm, the representative agent views the population of all agents collectively as an opponent. Expanding the recursion (8) and ignoring the difference between π̃_{t+1} and π_{t+1}, we can write the policy π_{t+1} as π_{t+1}(·|s) ∝ exp( Σ_{τ=0}^t w_τ Q̂^λ_τ(s, ·) ) for some positive weights {w_τ}. Therefore, the representative agent plays a policy that responds to the (weighted) average of all previous Q-functions, which reflects the representative agent's belief about the aggregate population policy. Also note that our algorithm performs only a single policy improvement step to compute the updated policy π_{t+1}.
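To see why the maximizer of the proximal objective in equation (11) takes the multiplicative form of equation (8), one can set the gradient of the Lagrangian to zero; this is a standard derivation, sketched here for completeness:

```latex
\[
\tilde\pi_{t+1}(\cdot|s)
  = \operatorname*{argmax}_{p \in \Delta(\mathcal{A})}
    \Big\{ \alpha_t \big\langle \widehat{Q}^{\lambda}_t(s,\cdot)
        - \lambda \log \pi_t(\cdot|s),\; p \big\rangle
      - D_{\mathrm{KL}}\big(p \,\big\|\, \pi_t(\cdot|s)\big) \Big\}.
\]
Stationarity in $p(a)$, with a Lagrange multiplier $\zeta$ for the constraint $\sum_a p(a) = 1$, gives
\[
\alpha_t \widehat{Q}^{\lambda}_t(s,a) - \alpha_t \lambda \log \pi_t(a|s)
  - \log \frac{p(a)}{\pi_t(a|s)} - 1 - \zeta = 0
\;\Longrightarrow\;
\log p(a) = (1-\alpha_t\lambda) \log \pi_t(a|s)
  + \alpha_t \widehat{Q}^{\lambda}_t(s,a) + \mathrm{const},
\]
which, after normalization, is exactly
$\tilde\pi_{t+1}(a|s) \propto (\pi_t(a|s))^{1-\alpha_t\lambda}
  \exp\big(\alpha_t \widehat{Q}^{\lambda}_t(s,a)\big)$.
```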
It is unnecessary to compute the exact optimal policy π*_{t+1} = Γ^λ_1(µ_t) under µ_t (which would require an inner loop for solving MDP_{µ_t}), as µ_t is itself only an approximation of the true NE mean-field state µ*. Our algorithm updates π_t and µ_t simultaneously within a single loop.
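Putting the three steps together, a minimal end-to-end sketch of Algorithm 1 on a finite toy MFG (identity kernel, so µ_L = L; the congestion-style reward and all constants below are our own illustrative choices) looks as follows:

```python
import numpy as np

S, A, gamma, lam = 3, 2, 0.9, 0.5
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(S), size=(S, A))   # here P is independent of L for simplicity
base_r = rng.uniform(size=(S, A))

def reward(L):
    return base_r - L[:, None]               # hypothetical congestion-style reward

def evaluate_Q(pi, L, n_iter=300):
    # line 3: policy evaluation on the entropy-regularized MDP induced by L
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        V = (pi * (Q - lam * np.log(pi))).sum(axis=1)
        Q = reward(L) + gamma * P @ V
    return Q

T = 200
alpha, beta, eta = 0.5 * T ** (-2 / 5), 0.5 * T ** (-4 / 5), 1.0 / T
pi, L = np.ones((S, A)) / A, np.ones(S) / S  # uniform initial policy and mean field
for t in range(T):
    Q = evaluate_Q(pi, L)
    logits = (1 - alpha * lam) * np.log(pi) + alpha * Q      # line 4, equation (8)
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    pi = (1 - eta) * pi + eta / A                            # equation (9): mix in uniform
    L_next = np.einsum('s,sa,sat->t', L, pi, P)              # Gamma_2 with identity kernel
    L = (1 - beta) * L + beta * L_next                       # line 5, equation (10)
```

Note that no inner MDP is ever solved to optimality: each iteration performs one policy evaluation, one proximal policy step, and one damped mean-field step.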

4. MAIN RESULTS

In this section, we establish theoretical guarantees for our fictitious play algorithm on learning the regularized NE (π*, µ*) of the MFG. To state our theorem, we first discuss several regularity assumptions on the MFG model. Recall the definition (6) of the discounted state visitation distribution and let ρ* := ρ^{π*}_{µ*} ∈ ∆(S) be the visitation distribution induced by the NE (π*, µ*). We make use of the following distance metric between two policies π, π′ ∈ Π:

D(π, π′) := E_{s∼ρ*}[ ‖π(·|s) − π′(·|s)‖₁ ].   (12)

As in the classical MFG literature (Guo et al., 2020; Saldi et al., 2018b), we assume certain Lipschitz properties of the two mappings Γ^λ_1 : M → Π and Γ_2 : Π × M → M defined in Section 2.3. The first assumption states that Γ^λ_1(µ) is Lipschitz in the embedded mean-field state µ with respect to the RKHS norm.

Assumption 2. There exists a constant d_1 > 0 such that for any µ, µ′ ∈ M, it holds that D(Γ^λ_1(µ), Γ^λ_1(µ′)) ≤ d_1 ‖µ − µ′‖_H.

The second assumption states that Γ_2(π, µ) is Lipschitz in each of its arguments when the other argument is fixed.

Assumption 3. There exist constants d_2 > 0, d_3 > 0 such that for any policies π, π′ ∈ Π and embedded mean-field states µ, µ′ ∈ M, it holds that

‖Γ_2(π, µ) − Γ_2(π′, µ)‖_H ≤ d_2 D(π, π′),   ‖Γ_2(π, µ) − Γ_2(π, µ′)‖_H ≤ d_3 ‖µ − µ′‖_H.

Assumptions 2 and 3 immediately imply Lipschitzness of the composite mapping Λ^λ : M → M, which we recall is defined as Λ^λ(µ) = Γ_2(Γ^λ_1(µ), µ). The proof is provided in Appendix D.1.

Lemma 1. Suppose Assumptions 2 and 3 hold. Then for each µ, µ′ ∈ M, it holds that ‖Λ^λ(µ) − Λ^λ(µ′)‖_H ≤ (d_1d_2 + d_3) ‖µ − µ′‖_H.

We next impose an assumption on the boundedness of certain concentrability coefficients.
This type of assumption, standard in the analysis of policy optimization algorithms (Kakade & Langford, 2002; Shani et al., 2019; Bhandari & Russo, 2019; Agarwal et al., 2020), allows one to measure the policy optimization error in an average-case sense with respect to appropriate distributions over the states.

Assumption 4 (Finite Concentrability Coefficients). There exist two constants C_ρ, C′_ρ > 0 such that for each µ ∈ M, it holds that

‖ρ^{π^{λ,*}_µ}_µ / ρ*‖_∞ := sup_s ρ^{π^{λ,*}_µ}_µ(s) / ρ*(s) ≤ C_ρ   and   E_{s∼ρ^{π^{λ,*}_µ}_µ}[ ( ρ*(s) / ρ^{π^{λ,*}_µ}_µ(s) )² ]^{1/2} ≤ C′_ρ.

Finally, our last assumption stipulates that the state visitation distributions are smooth with respect to the (embedded) mean-field states of the MFG. This assumption is analogous to those in the literature on MDPs and two-player games (Fei et al., 2020; Radanovic et al., 2019), which require the visitation distributions to be smooth with respect to the policy.

Assumption 5. There exists a constant d_0 > 0 such that for any µ, µ′ ∈ M, it holds that ‖ρ^{π^{λ,*}_µ}_µ − ρ^{π^{λ,*}_{µ′}}_{µ′}‖₁ ≤ d_0 ‖µ − µ′‖_H.

We now state our theoretical guarantees on the convergence of the policy-population sequence {π_t, µ_t} in Algorithm 1 to the NE (π*, µ*). For the estimates of the embedded mean-field states, it is natural to consider the distance ‖µ_t − µ*‖_H in the RKHS norm. For convergence to the NE policy π*, recall that π* is the optimal policy of MDP_{µ*}, and each iteration of our algorithm involves a single policy improvement step to compute π_{t+1}, rather than solving MDP_{µ_t} to its optimal policy π*_{t+1} := Γ^λ_1(µ_t). As such, we analyze the difference between these two policies in terms of D(π_{t+1}, π*_{t+1}), where the metric D is defined in equation (12). Also let ρ*_t := ρ^{π*_{t+1}}_{µ_t} denote the discounted visitation distribution induced by the optimal policy π*_{t+1} of MDP_{µ_t}. With the above considerations in mind, we have the following theorem, which is proved in Appendix B.

Theorem 1.
Suppose that Assumptions 1–5 hold with d_1d_2 + d_3 < 1, and that the error in the policy evaluation step in Algorithm 1 satisfies E_{s∼ρ*_t}[ ‖Q̂^λ_t(s, ·) − Q^λ_t(s, ·)‖²_∞ ] ≤ ε² for all t ∈ [T]. With the choice of η = c_η T^{-1}, α_t ≡ α = c_α T^{-2/5}, β_t ≡ β = c_β T^{-4/5} for some universal constants c_η > 0, c_α > 0, and c_β > 0 in Algorithm 1, the resulting policy and embedded mean-field state sequence {(π_t, µ_t)}_{t=1}^T satisfies

D( (1/T) Σ_{t=1}^T π_t, (1/T) Σ_{t=1}^T π*_t ) ≤ (1/T) Σ_{t=1}^T D(π_t, π*_t) ≲ (1/√λ) ( √(log T) · T^{-1/5} + √ε ),   (13)

‖(1/T) Σ_{t=1}^T µ_t − µ*‖_H ≤ (1/T) Σ_{t=1}^T ‖µ_t − µ*‖_H ≲ (1/√λ) ( √(log T) · T^{-1/5} + √ε ).   (14)

Theorem 1 bounds the distance between π_t and the optimal policy π*_t of MDP_{µ_{t−1}}. By directly measuring the distance between π_t and the NE policy π*, we can define the notion of a δ-approximate NE of the game.

Definition 2. For each δ > 0, a policy-population pair (π, µ) is called a δ-approximate (entropy-regularized) NE of the MFG if D(π, π*) ≤ δ and ‖µ − µ*‖_H ≤ δ.

The following corollary of Theorem 1 shows that after T iterations of our algorithm, the average policy-population pair ( (1/T) Σ_{t=1}^T π_t, (1/T) Σ_{t=1}^T µ_t ) is an Õ(T^{-1/5})-approximate NE.

Corollary 1. Under the assumptions of Theorem 1, we have

D( (1/T) Σ_{t=1}^T π_t, π* ) + ‖(1/T) Σ_{t=1}^T µ_t − µ*‖_H ≲ (1/√λ) ( √(log T) · T^{-1/5} + √ε ).

We prove this corollary in Appendix C. The above results require an ℓ2-error of ε for policy evaluation. A variety of algorithms have been shown to achieve such a guarantee, including TD(0) and LSTD (Bhandari et al., 2018). We also remark that the ℓ∞ condition on the concentrability coefficient in Assumption 4 can be relaxed to an ℓ2 condition of the form E_{s∼ρ*}[ ( ρ^{π^{λ,*}_µ}_µ(s) / ρ*(s) )² ]^{1/2} ≤ C_ρ.

Lemma 2. Let η ∈ [0, 1/2], p ∈ ∆(A), and p̃ := (1 − η)p + η · 1_{|A|}/|A|. Then for any q ∈ ∆(A),

D_KL(q ‖ p̃) − D_KL(q ‖ p) = Σ_{a∈A} q(a) log( p(a)/p̃(a) ) ≤ 2η.   (15)

Proof. For each a ∈ A,

log( p(a)/p̃(a) ) = log( p(a) / ((1 − η)p(a) + η/|A|) ) ≤ log( p(a) / ((1 − η)p(a)) ) = log( 1/(1 − η) ) ≤ η/(1 − η) ≤ 2η,

where the third step follows from the fact that log z ≤ z − 1 for all z > 0, and the last step holds as η ∈ [0, 1/2]. Therefore, we have max_{a∈A} log( p(a)/p̃(a) ) ≤ 2η. Applying Hölder's inequality to (15) completes the proof.

Lemma 3.
Let x, y, z ∈ ∆(A). If x(a) ≥ α_1, y(a) ≥ α_1, and z(a) ≥ α_2 for all a ∈ A, then

|D_KL(x ‖ z) − D_KL(y ‖ z)| ≤ ( 1 + log( 1/min{α_1, α_2} ) ) · ‖x − y‖₁.

Proof. Under the lower bound assumptions of the lemma, we have

dD_KL(x ‖ z)/dx(a) = 1 + log( x(a)/z(a) ) ≤ 1 + log(1/α_2)   and   −dD_KL(x ‖ z)/dx(a) ≤ −1 − log α_1.

It follows that

‖dD_KL(x ‖ z)/dx‖_∞ ≤ max{ 1 + log(1/α_2), −1 − log α_1 } ≤ 1 + log( 1/min{α_1, α_2} ).

Hence the function x ↦ D_KL(x ‖ z) is Lipschitz with respect to ‖·‖₁, the dual norm of ‖·‖_∞, with the claimed constant.
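The bound of Lemma 3 is easy to stress-test numerically; the sketch below draws random distributions with the stated floors (the floor values and trial count are arbitrary illustrative choices) and checks the Lipschitz ratio:

```python
import numpy as np

rng = np.random.default_rng(5)
A, a1, a2 = 4, 0.05, 0.05
lip = 1 + np.log(1 / min(a1, a2))   # Lipschitz constant claimed by Lemma 3

def kl(p, q):
    return float((p * np.log(p / q)).sum())

def sample_with_floor(alpha):
    # a random distribution on A atoms with every entry >= alpha
    p = rng.dirichlet(np.ones(A))
    return (1 - A * alpha) * p + alpha

ratios = []
for _ in range(1000):
    x, y, z = sample_with_floor(a1), sample_with_floor(a1), sample_with_floor(a2)
    gap = np.abs(x - y).sum()
    if gap > 1e-12:
        ratios.append(abs(kl(x, z) - kl(y, z)) / gap)
max_ratio = max(ratios)   # should never exceed lip
```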

B PROOF OF THEOREM 1

In order to obtain an upper bound on the optimality gap

σ^t_µ := ‖µ_t − µ*‖_H,   (16)

where µ* is the embedded mean-field state of the entropy-regularized NE, we also need to estimate the gap between π_{t+1} and the optimal solution of the entropy-regularized MDP_{µ_t}. We define

σ^{t+1}_π := E_{s∼ρ*_t}[ D_KL( π*_{t+1}(·|s) ‖ π_{t+1}(·|s) ) ]   (17)

to quantify the convergence of the policy sequence. Before proceeding, we establish the following properties of entropy-regularized MDPs, which are central to the convergence analysis.

Properties of Regularized MDPs. The following lemma quantifies the performance difference between two policies on a regularized MDP, measured in terms of the expected total reward, through the Q-function and their KL divergence. The proof is provided in Appendix D.2.

Lemma 4 (Performance Difference). For each µ ∈ M and policies π, π′ : S → ∆(A), it holds that

J^λ_µ(π′) − J^λ_µ(π) + (λ/(1 − γ)) E_{s∼ρ^{π′}_µ}[ D_KL( π′(·|s) ‖ π(·|s) ) ] = (1/(1 − γ)) E_{s∼ρ^{π′}_µ}[ ⟨Q^{λ,π}_µ(s, ·) − λ log π(·|s), π′(·|s) − π(·|s)⟩ ],

where ρ^{π′}_µ is the discounted state visitation distribution induced by the policy π′ on MDP_µ.

We can characterize the optimal policy π^{λ,*}_µ in terms of the optimal Q-function Q^{λ,*}_µ as a Boltzmann distribution of the form (Cen et al., 2020; Nachum et al., 2017)

π^{λ,*}_µ(a|s) ∝ exp( Q^{λ,*}_µ(s, a) / λ ).

For the setting where the reward function is bounded, we can then obtain a lower bound on π^{λ,*}_µ, as stated in the following lemma. The proof is provided in Appendix D.3.

Lemma 5. Suppose that there exists a constant R_max > 0 such that 0 ≤ sup_{(s,a,µ)∈S×A×M} r(s, a, µ) ≤ R_max. For each µ ∈ M and each policy π : S → ∆(A), we have

‖Q^{λ,π}_µ‖_∞ ≤ Q_max := (R_max + γλ log |A|)/(1 − γ).

Also, the optimal policy π^{λ,*}_µ of the regularized MDP_µ satisfies

π^{λ,*}_µ(a|s) ≥ 1/( e^{Q_max/λ} |A| ), ∀s ∈ S, a ∈ A.

Convergence Analysis. We now move to the convergence analysis.
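As a warm-up, the lower bound of Lemma 5 follows directly from the Boltzmann form of the optimal policy and can be checked numerically (all constants below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
lam, Rmax, gamma, A = 0.5, 1.0, 0.9, 4
Qmax = (Rmax + gamma * lam * np.log(A)) / (1 - gamma)  # bound from Lemma 5
Q = rng.uniform(0.0, Qmax, size=A)                     # any Q-values in [0, Qmax]
pi = np.exp(Q / lam)
pi /= pi.sum()                                         # Boltzmann policy pi ~ exp(Q/lam)
lower = np.exp(-Qmax / lam) / A                        # claimed uniform lower bound
```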
For clarity of exposition, we use $\mathbb{E}_\rho[\|\pi - \pi'\|_1]$ as shorthand for $\mathbb{E}_{s\sim\rho}[\|\pi(\cdot|s) - \pi'(\cdot|s)\|_1]$, where $\rho \in \Delta(S)$; we also use $\mathbb{E}_\rho[D_{\mathrm{KL}}(\pi \| \pi')]$ as shorthand for $\mathbb{E}_{s\sim\rho}[D_{\mathrm{KL}}(\pi(\cdot|s) \| \pi'(\cdot|s))]$. We recall that the step sizes are chosen as $\alpha_t \equiv \alpha = c_\alpha T^{-2/5}$ and $\beta_t \equiv \beta = c_\beta T^{-4/5}$, where the parameters $c_\alpha$ and $c_\beta$ satisfy
$$c_\alpha T^{-2/5}\lambda < 1, \qquad c_\beta T^{-4/5} d < 1.$$
Here $d := 1 - d_1 d_2 - d_3 > 0$, where $d_1$ appears in Assumption 2 and $d_2, d_3$ appear in Assumption 3.

Step 1: Convergence of Policy. To analyze the convergence of the optimality gap $\sigma_\mu^{t+1} = \|\mu^{t+1} - \mu^*\|_{\mathcal{H}}$, we first characterize the convergence behavior of the policy sequence $\{\pi_t\}_{t\ge 0}$. In particular, we establish a recursive relationship between $\sigma_\pi^{t+1} = \mathbb{E}_{s\sim\rho_t^*}\big[D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s) \,\|\, \pi_{t+1}(\cdot|s)\big)\big]$ and $\sigma_\pi^t$, as stated in the following lemma. The proof is provided in Section B.1.

Lemma 6. Under the setting of Theorem 1, for each $t \ge 1$, we have
$$\sigma_\pi^{t+1} \le (1 - \lambda\alpha_t)\sigma_\pi^t + (1 - \lambda\alpha_t)\Big(d_0\log\frac{|A|}{\eta} + \kappa C_\rho d_1\Big)\|\mu^{t-1} - \mu^t\|_{\mathcal{H}} + 2\varepsilon\alpha_t + \frac{Q_{\max}^2}{2}\alpha_t^2 + 2\eta, \qquad (21)$$
where $\kappa = \frac{4}{1-\gamma}\log\frac{|A|}{\eta} + \frac{2R_{\max}}{\lambda(1-\gamma)}$.

Recall that $\mu^t = (1 - \beta_{t-1})\mu^{t-1} + \beta_{t-1}\cdot\Gamma_2(\pi_t, \mu^{t-1})$. Under Assumption 1, we have
$$\|\mu^{t-1} - \mu^t\|_{\mathcal{H}} = \beta_{t-1}\,\|\mu^{t-1} - \Gamma_2(\pi_t, \mu^{t-1})\|_{\mathcal{H}} \le 2\beta_{t-1}. \qquad (22)$$
Lemma 6 then implies that
$$\sigma_\pi^{t+1} \le (1 - \lambda\alpha_t)\sigma_\pi^t + (1 - \lambda\alpha_t)\,C_1\beta_{t-1} + 2\varepsilon\alpha_t + \frac{Q_{\max}^2}{2}\alpha_t^2 + 2\eta, \qquad (23)$$
where we define $C_1 := 2\big(d_0\log\frac{|A|}{\eta} + \kappa C_\rho d_1\big)$. With $\alpha_t \equiv \alpha$ and $\beta_t \equiv \beta$, equation (23) gives
$$\sigma_\pi^t \le \frac{1}{\lambda\alpha}\big(\sigma_\pi^t - \sigma_\pi^{t+1}\big) + \Big(\frac{1}{\lambda\alpha} - 1\Big)C_1\beta + \frac{2\varepsilon}{\lambda} + \frac{Q_{\max}^2}{2\lambda}\alpha + \frac{2\eta}{\lambda\alpha}. \qquad (24)$$
Summing over $t = 0, 1, \ldots, T-1$ on both sides of (24) and dividing by $T$ gives
$$\frac{1}{T}\sum_{t=0}^{T-1}\sigma_\pi^t \le \frac{1}{T\lambda\alpha}\big(\sigma_\pi^0 - \sigma_\pi^T\big) + \Big(\frac{1}{\lambda\alpha} - 1\Big)C_1\beta + \frac{2\varepsilon}{\lambda} + \frac{Q_{\max}^2}{2\lambda}\alpha + \frac{2\eta}{\lambda\alpha} \le \frac{\sigma_\pi^0}{T\lambda\alpha} + \frac{C_1\beta}{\lambda\alpha} + \frac{2\varepsilon}{\lambda} + \frac{Q_{\max}^2}{2\lambda}\alpha + \frac{2\eta}{\lambda\alpha}. \qquad (25)$$
With the choice $\alpha = O(T^{-2/5})$, $\beta = O(T^{-4/5})$ and $\eta = O(T^{-1})$, we have $C_1 = O(\log T)$. Therefore, we obtain
$$\frac{1}{T}\sum_{t=0}^{T-1}\sigma_\pi^t \lesssim \frac{\log T}{\lambda T^{2/5}} + \frac{2\varepsilon}{\lambda}. \qquad (26)$$
If we let $\bar T$ be a random index sampled uniformly from $\{1, \ldots, T\}$, then the above bound can be written equivalently as
$$\mathbb{E}_{\bar T}\big[\sigma_\pi^{\bar T}\big] \lesssim \frac{\log T}{\lambda T^{2/5}} + \frac{2\varepsilon}{\lambda}. \qquad (27)$$

Step 2: Convergence of the Mean-Field Embedding. We now proceed to characterize the optimality gap for the embedded mean-field state. We obtain the following upper bound on $\sigma_\mu^{t+1} = \|\mu^{t+1} - \mu^*\|_{\mathcal{H}}$. The proof is provided in Section B.2.

Lemma 7. Under the setting of Theorem 1, for each $t \ge 0$, we have
$$\sigma_\mu^{t+1} \le (1 - \beta_t d)\,\sigma_\mu^t + d_2 C_\rho \beta_t\sqrt{2\sigma_\pi^{t+1}},$$
where $d = 1 - d_1 d_2 - d_3 > 0$. Lemma 7 implies that
$$\sigma_\mu^t \le \frac{1}{d\beta_t}\big(\sigma_\mu^t - \sigma_\mu^{t+1}\big) + \frac{d_2 C_\rho}{d}\sqrt{2\sigma_\pi^{t+1}}. \qquad (28)$$
With $\beta_t \equiv \beta = O(T^{-4/5})$, averaging equation (28) over iterations $t = 0, \ldots, T-1$, we obtain
$$\frac{1}{T}\sum_{t=0}^{T-1}\sigma_\mu^t \le \frac{\sigma_\mu^0 - \sigma_\mu^T}{d\beta T} + \frac{d_2 C_\rho}{dT}\sum_{t=0}^{T-1}\sqrt{2\sigma_\pi^{t+1}} \le \frac{\sigma_\mu^0}{d\beta T} + \frac{d_2 C_\rho}{d}\sqrt{\frac{2}{T}\sum_{t=0}^{T-1}\sigma_\pi^{t+1}},$$
where the last inequality follows from the Cauchy-Schwarz inequality. From equation (26), we have
$$\frac{1}{T}\sum_{t=0}^{T-1}\sigma_\mu^t \lesssim \frac{\sigma_\mu^0}{d}\,T^{-1/5} + \frac{d_2 C_\rho}{d}\sqrt{\frac{\log T}{\lambda T^{2/5}} + \frac{2\varepsilon}{\lambda}} \lesssim \frac{1}{\sqrt{\lambda}}\Big(\frac{\sqrt{\log T}}{T^{1/5}} + \sqrt{\varepsilon}\Big).$$
This bound, together with Jensen's inequality, proves equation (14) in Theorem 1.

Turning to equation (13) in Theorem 1, we have
$$\frac{1}{T}\sum_{t=1}^T D(\pi_t, \pi_t^*) = \mathbb{E}_{\bar T}\big[D(\pi_{\bar T}, \pi_{\bar T}^*)\big] = \mathbb{E}_{\bar T}\,\mathbb{E}_{s\sim\rho^*}\big[\|\pi_{\bar T}^*(\cdot|s) - \pi_{\bar T}(\cdot|s)\|_1\big] = \mathbb{E}_{\bar T}\,\mathbb{E}_{s\sim\rho^*_{\bar T-1}}\Big[\frac{\rho^*(s)}{\rho^*_{\bar T-1}(s)}\,\|\pi_{\bar T}^*(\cdot|s) - \pi_{\bar T}(\cdot|s)\|_1\Big]$$
$$\overset{(i)}{\le} \sqrt{\mathbb{E}_{\bar T}\,\mathbb{E}_{s\sim\rho^*_{\bar T-1}}\Big[\Big(\frac{\rho^*(s)}{\rho^*_{\bar T-1}(s)}\Big)^2\Big]}\cdot\sqrt{\mathbb{E}_{\bar T}\,\mathbb{E}_{s\sim\rho^*_{\bar T-1}}\big[\|\pi_{\bar T}^*(\cdot|s) - \pi_{\bar T}(\cdot|s)\|_1^2\big]} \overset{(ii)}{\le} \sqrt{C_\rho^2\cdot 2\,\mathbb{E}_{\bar T}\,\mathbb{E}_{s\sim\rho^*_{\bar T-1}}\big[D_{\mathrm{KL}}\big(\pi_{\bar T}^*(\cdot|s)\,\|\,\pi_{\bar T}(\cdot|s)\big)\big]} = C_\rho\sqrt{2\,\mathbb{E}_{\bar T}\big[\sigma_\pi^{\bar T}\big]} \overset{(iii)}{\lesssim} \frac{1}{\sqrt{\lambda}}\Big(\frac{\sqrt{\log T}}{T^{1/5}} + \sqrt{\varepsilon}\Big),$$
where step (i) follows from the Cauchy-Schwarz inequality, step (ii) follows from Assumption 4 and Pinsker's inequality, and step (iii) follows from the bound in equation (27). The above bound, together with Jensen's inequality, proves equation (13). We have completed the proof of Theorem 1.
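To make the alternating update analyzed above concrete, the following toy sketch runs the two coupled iterations (mirror-ascent policy step with uniform mixing, damped mean-field step) on a hypothetical single-state problem. The maps `q_values` and `gamma2` are illustrative stand-ins of our own, not the paper's model; the step sizes follow the $\alpha = O(T^{-2/5})$, $\beta = O(T^{-4/5})$ schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
A, lam, eta = 3, 0.5, 1e-3
T = 2000
alpha = 0.5 * T ** (-2 / 5)   # policy step size, O(T^{-2/5})
beta = 0.5 * T ** (-4 / 5)    # mean-field step size, O(T^{-4/5})

def q_values(mu):
    # hypothetical stand-in for Q^{lam,pi}_mu on a single state
    return np.array([1.0, 0.5, 0.2]) + 0.3 * mu

def gamma2(pi, mu):
    # hypothetical stand-in for the population update map Gamma_2
    return 0.5 * pi + 0.5 * mu

pi = np.full(A, 1 / A)   # policy over A actions
mu = np.full(A, 1 / A)   # embedded mean-field state (here also a distribution)
for t in range(T):
    q = q_values(mu)
    # mirror-ascent step: pi ∝ pi * exp(alpha * (Q - lam * log pi))
    logits = np.log(pi) + alpha * (q - lam * np.log(pi))
    pi_new = np.exp(logits - logits.max())
    pi_new /= pi_new.sum()
    pi = (1 - eta) * pi_new + eta / A              # mix with uniform
    mu = (1 - beta) * mu + beta * gamma2(pi, mu)   # damped population update
print(pi.round(3), mu.round(3))
```

Note that both iterates remain in the simplex throughout: the mixing step keeps every policy entry at least $\eta/|A|$, which is exactly what the $\mathrm{KL}_{\max}$ bound in the analysis relies on.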

B.1 PROOF OF LEMMA 6

The following lemma characterizes the policy improvement step. The proof is provided in Section D.4.

Lemma 8. For any distributions $p^*, p \in \Delta(A)$, state $s \in S$, and function $G : S \times A \to \mathbb{R}$, it holds for $p' \in \Delta(A)$ with $p'(\cdot) \propto p(\cdot)\exp[\alpha G(s,\cdot)]$ that
$$D_{\mathrm{KL}}(p^* \| p') \le D_{\mathrm{KL}}(p^* \| p) - \alpha\big\langle G(s,\cdot),\ p^* - p\big\rangle + \alpha^2\|G(s,\cdot)\|_\infty^2/2.$$

Recall that $\tilde\pi_{t+1}(\cdot|s) \propto \pi_t(\cdot|s)\cdot\exp\big(\alpha_t\big(\hat Q_t^\lambda(s,\cdot) - \lambda\log\pi_t(\cdot|s)\big)\big)$. Lemma 8 implies that for each $s \in S$, we have
$$D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\tilde\pi_{t+1}(\cdot|s)\big) \le D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\pi_t(\cdot|s)\big) - \alpha_t\big\langle \hat Q_t^\lambda(s,\cdot) - \lambda\log\pi_t(\cdot|s),\ \pi_{t+1}^*(\cdot|s) - \pi_t(\cdot|s)\big\rangle + \|\hat Q_t^\lambda\|_\infty^2\alpha_t^2/2$$
$$= D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\pi_t(\cdot|s)\big) - \alpha_t\big\langle Q_t^\lambda(s,\cdot) - \lambda\log\pi_t(\cdot|s),\ \pi_{t+1}^*(\cdot|s) - \pi_t(\cdot|s)\big\rangle + \alpha_t\big\langle Q_t^\lambda(s,\cdot) - \hat Q_t^\lambda(s,\cdot),\ \pi_{t+1}^*(\cdot|s) - \pi_t(\cdot|s)\big\rangle + \|\hat Q_t^\lambda\|_\infty^2\alpha_t^2/2$$
$$\le D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\pi_t(\cdot|s)\big) - \alpha_t\big\langle Q_t^\lambda(s,\cdot) - \lambda\log\pi_t(\cdot|s),\ \pi_{t+1}^*(\cdot|s) - \pi_t(\cdot|s)\big\rangle + 2\alpha_t\big\|\hat Q_t^\lambda(s,\cdot) - Q_t^\lambda(s,\cdot)\big\|_\infty + \|\hat Q_t^\lambda\|_\infty^2\alpha_t^2/2,$$
where the last step uses $\|\pi_{t+1}^*(\cdot|s) - \pi_t(\cdot|s)\|_1 \le 2$. Recall that $\pi_{t+1}(\cdot|s) = (1-\eta)\tilde\pi_{t+1}(\cdot|s) + \frac{\eta}{|A|}\mathbf{1}_{|A|}$. Lemma 2 implies that
$$D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\pi_{t+1}(\cdot|s)\big) \le D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\tilde\pi_{t+1}(\cdot|s)\big) + 2\eta \le D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\pi_t(\cdot|s)\big) - \alpha_t\big\langle Q_t^\lambda(s,\cdot) - \lambda\log\pi_t(\cdot|s),\ \pi_{t+1}^*(\cdot|s) - \pi_t(\cdot|s)\big\rangle + \underbrace{2\alpha_t\big\|\hat Q_t^\lambda(s,\cdot) - Q_t^\lambda(s,\cdot)\big\|_\infty + \|\hat Q_t^\lambda\|_\infty^2\alpha_t^2/2 + 2\eta}_{Y_t(s)}. \qquad (30)$$
Taking expectation over $\rho_t^*$ on both sides of (30) yields
$$\mathbb{E}_{\rho_t^*}\big[D_{\mathrm{KL}}(\pi_{t+1}^* \| \pi_{t+1})\big] \le \mathbb{E}_{\rho_t^*}\big[D_{\mathrm{KL}}(\pi_{t+1}^* \| \pi_t)\big] - \alpha_t\,\mathbb{E}_{s\sim\rho_t^*}\big[\big\langle Q_t^\lambda(s,\cdot) - \lambda\log\pi_t(\cdot|s),\ \pi_{t+1}^*(\cdot|s) - \pi_t(\cdot|s)\big\rangle\big] + \mathbb{E}_{s\sim\rho_t^*}[Y_t(s)]$$
$$\overset{(a)}{=} \mathbb{E}_{\rho_t^*}\big[D_{\mathrm{KL}}(\pi_{t+1}^* \| \pi_t)\big] - (1-\gamma)\alpha_t\big(J_{\mu^t}^\lambda(\pi_{t+1}^*) - J_{\mu^t}^\lambda(\pi_t)\big) - \alpha_t\lambda\,\mathbb{E}_{\rho_t^*}\big[D_{\mathrm{KL}}(\pi_{t+1}^* \| \pi_t)\big] + \mathbb{E}_{s\sim\rho_t^*}[Y_t(s)]$$
$$\overset{(b)}{\le} (1 - \alpha_t\lambda)\,\mathbb{E}_{\rho_t^*}\big[D_{\mathrm{KL}}(\pi_{t+1}^* \| \pi_t)\big] + \mathbb{E}_{s\sim\rho_t^*}[Y_t(s)]$$
$$\overset{(c)}{=} (1 - \alpha_t\lambda)\underbrace{\mathbb{E}_{\rho_t^*}\big[D_{\mathrm{KL}}(\pi_t^* \| \pi_t)\big]}_{B_1} + (1 - \alpha_t\lambda)\underbrace{\mathbb{E}_{\rho_t^*}\big[D_{\mathrm{KL}}(\pi_{t+1}^* \| \pi_t) - D_{\mathrm{KL}}(\pi_t^* \| \pi_t)\big]}_{B_2} + \mathbb{E}_{s\sim\rho_t^*}[Y_t(s)], \qquad (31)$$
where step (a) follows from Lemma 4; step (b) follows from the fact that $J_{\mu^t}^\lambda(\pi_t) \le J_{\mu^t}^\lambda(\pi_{t+1}^*)$, as $\pi_{t+1}^* = \Gamma_1^\lambda(\mu^t)$ is the optimal policy for the regularized $\mathrm{MDP}_{\mu^t}$; and step (c) holds by adding and subtracting $\mathbb{E}_{\rho_t^*}[D_{\mathrm{KL}}(\pi_t^* \| \pi_t)]$.
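Lemma 8 is the standard mirror-descent inequality for multiplicative-weights updates, and it can be verified numerically on random instances. The sketch below is our own check, not part of the paper:

```python
import numpy as np

def kl(p, q):
    # KL divergence on a finite alphabet
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)
A, alpha = 6, 0.1  # illustrative alphabet size and step size
for _ in range(1000):
    p_star = rng.dirichlet(np.ones(A))
    p = rng.dirichlet(np.ones(A))
    g = rng.uniform(-1, 1, size=A)        # plays the role of G(s, .) for a fixed s
    p_prime = p * np.exp(alpha * g)       # p'(.) proportional to p(.) exp(alpha g(.))
    p_prime /= p_prime.sum()
    lhs = kl(p_star, p_prime)
    rhs = kl(p_star, p) - alpha * np.dot(g, p_star - p) \
          + alpha ** 2 * np.max(np.abs(g)) ** 2 / 2
    assert lhs <= rhs + 1e-9
```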
Next we bound the terms $B_1$ and $B_2$ on the RHS of (31) separately.

• For the second term $B_2$: Note that $\pi_{t+1}^*$ and $\pi_t^*$ are the optimal policies for the regularized $\mathrm{MDP}_{\mu^t}$ and $\mathrm{MDP}_{\mu^{t-1}}$, respectively. Define
$$\tau := \frac{1}{|A|}\exp\Big(-\frac{R_{\max} + \gamma\lambda\log|A|}{\lambda(1-\gamma)}\Big).$$
By Lemma 5, for all $(s,a) \in S \times A$, we have $\pi_{t+1}^*(a|s) \ge \tau$ and $\pi_t^*(a|s) \ge \tau$; moreover, $\pi_t(a|s) \ge \eta/|A|$ by the uniform mixing step. Applying Lemma 3 yields
$$B_2 \le \kappa\,\mathbb{E}_{s\sim\rho_t^*}\big[\|\pi_t^*(\cdot|s) - \pi_{t+1}^*(\cdot|s)\|_1\big] = \kappa\,\mathbb{E}_{s\sim\rho^*}\Big[\frac{\rho_t^*(s)}{\rho^*(s)}\,\|\pi_t^*(\cdot|s) - \pi_{t+1}^*(\cdot|s)\|_1\Big] \le \kappa C_\rho\,\mathbb{E}_{s\sim\rho^*}\big[\|\pi_t^*(\cdot|s) - \pi_{t+1}^*(\cdot|s)\|_1\big] = \kappa C_\rho\,D\big(\Gamma_1^\lambda(\mu^{t-1}),\ \Gamma_1^\lambda(\mu^t)\big) \le \kappa C_\rho d_1\,\|\mu^{t-1} - \mu^t\|_{\mathcal{H}}, \qquad (32)$$
where the second inequality follows from Assumption 4 and the last from Assumption 2, and where
$$\kappa := 1 + \log\frac{1}{\min\{\tau,\ \eta/|A|\}} \le 2\max\Big\{\log\frac{|A|}{\eta},\ \frac{2}{1-\gamma}\log|A| + \frac{R_{\max}}{\lambda(1-\gamma)}\Big\} \le \frac{4}{1-\gamma}\log\frac{|A|}{\eta} + \frac{2R_{\max}}{\lambda(1-\gamma)} = \frac{4}{1-\gamma}\mathrm{KL}_{\max} + \frac{2R_{\max}}{\lambda(1-\gamma)}.$$

• For the first term $B_1$: We have
$$B_1 = \mathbb{E}_{\rho_{t-1}^*}\big[D_{\mathrm{KL}}(\pi_t^* \| \pi_t)\big] + \big(\mathbb{E}_{\rho_t^*} - \mathbb{E}_{\rho_{t-1}^*}\big)\big[D_{\mathrm{KL}}(\pi_t^* \| \pi_t)\big] = \mathbb{E}_{\rho_{t-1}^*}\big[D_{\mathrm{KL}}(\pi_t^* \| \pi_t)\big] + \mathbb{E}_{s\sim\rho^*}\Big[\frac{\rho_t^*(s) - \rho_{t-1}^*(s)}{\rho^*(s)}\,D_{\mathrm{KL}}\big(\pi_t^*(\cdot|s) \,\|\, \pi_t(\cdot|s)\big)\Big]$$
$$\overset{(a)}{\le} \mathbb{E}_{\rho_{t-1}^*}\big[D_{\mathrm{KL}}(\pi_t^* \| \pi_t)\big] + \mathbb{E}_{s\sim\rho^*}\Big[\Big|\frac{\rho_t^*(s) - \rho_{t-1}^*(s)}{\rho^*(s)}\Big|\Big]\cdot\mathrm{KL}_{\max} \overset{(b)}{\le} \mathbb{E}_{\rho_{t-1}^*}\big[D_{\mathrm{KL}}(\pi_t^* \| \pi_t)\big] + \mathrm{KL}_{\max}\cdot d_0\,\|\mu^t - \mu^{t-1}\|_{\mathcal{H}}, \qquad (33)$$
where step (a) uses $D_{\mathrm{KL}}(\pi_t^*(\cdot|s) \| \pi_t(\cdot|s)) \le \mathrm{KL}_{\max} := \log\frac{|A|}{\eta}$ (cf. Lemma 2) and step (b) follows from Assumption 5.

Combining (31), (32) and (33), we have
$$\mathbb{E}_{\rho_t^*}\big[D_{\mathrm{KL}}(\pi_{t+1}^* \| \pi_{t+1})\big] \le (1 - \lambda\alpha_t)\,\mathbb{E}_{\rho_{t-1}^*}\big[D_{\mathrm{KL}}(\pi_t^* \| \pi_t)\big] + (1 - \lambda\alpha_t)\big(d_0\cdot\mathrm{KL}_{\max} + \kappa C_\rho d_1\big)\,\|\mu^{t-1} - \mu^t\|_{\mathcal{H}} + \mathbb{E}_{s\sim\rho_t^*}[Y_t(s)]. \qquad (34)$$
Note that
$$\mathbb{E}_{s\sim\rho_t^*}[Y_t(s)] = 2\alpha_t\,\mathbb{E}_{s\sim\rho_t^*}\big[\|\hat Q_t^\lambda(s,\cdot) - Q_t^\lambda(s,\cdot)\|_\infty\big] + \frac{\|\hat Q_t^\lambda\|_\infty^2}{2}\alpha_t^2 + 2\eta \le 2\alpha_t\sqrt{\mathbb{E}_{s\sim\rho_t^*}\big[\|\hat Q_t^\lambda(s,\cdot) - Q_t^\lambda(s,\cdot)\|_\infty^2\big]} + \frac{\|\hat Q_t^\lambda\|_\infty^2}{2}\alpha_t^2 + 2\eta \le 2\varepsilon\alpha_t + \frac{Q_{\max}^2}{2}\alpha_t^2 + 2\eta,$$
where the last step holds by the assumption on the policy evaluation error and the fact that $\hat Q_t^\lambda : S \times A \to [0, Q_{\max}]$ satisfies $\|\hat Q_t^\lambda\|_\infty \le Q_{\max}$ by definition. Combining the last two displayed equations proves the lemma.

B.2 PROOF OF LEMMA 7

Proof. According to the update rule (10) for the embedded mean-field state, we have
$$\|\mu^{t+1} - \mu^*\|_{\mathcal{H}} = \big\|(1 - \beta_t)\mu^t + \beta_t\Gamma_2(\pi_{t+1}, \mu^t) - \mu^*\big\|_{\mathcal{H}} = \Big\|(1 - \beta_t)(\mu^t - \mu^*) + \beta_t\big[\Gamma_2\big(\Gamma_1^\lambda(\mu^t), \mu^t\big) - \mu^*\big] - \beta_t\big[\Gamma_2\big(\Gamma_1^\lambda(\mu^t), \mu^t\big) - \Gamma_2(\pi_{t+1}, \mu^t)\big]\Big\|_{\mathcal{H}}$$
$$\le (1 - \beta_t)\|\mu^t - \mu^*\|_{\mathcal{H}} + \beta_t\big\|\Gamma_2\big(\Gamma_1^\lambda(\mu^t), \mu^t\big) - \mu^*\big\|_{\mathcal{H}} + \beta_t\big\|\Gamma_2\big(\Gamma_1^\lambda(\mu^t), \mu^t\big) - \Gamma_2(\pi_{t+1}, \mu^t)\big\|_{\mathcal{H}}$$
$$\overset{(i)}{=} (1 - \beta_t)\|\mu^t - \mu^*\|_{\mathcal{H}} + \beta_t\underbrace{\big\|\Gamma_2\big(\Gamma_1^\lambda(\mu^t), \mu^t\big) - \Gamma_2\big(\Gamma_1^\lambda(\mu^*), \mu^*\big)\big\|_{\mathcal{H}}}_{(a)} + \beta_t\underbrace{\big\|\Gamma_2\big(\Gamma_1^\lambda(\mu^t), \mu^t\big) - \Gamma_2(\pi_{t+1}, \mu^t)\big\|_{\mathcal{H}}}_{(b)}, \qquad (35)$$
where the equality (i) follows from the fact that $\mu^* = \Gamma_2\big(\Gamma_1^\lambda(\mu^*), \mu^*\big)$. Lemma 1 implies that $\Lambda(\mu) = \Gamma_2\big(\Gamma_1^\lambda(\mu), \mu\big)$ is $(d_1 d_2 + d_3)$-Lipschitz. It follows that
$$(a) \le (d_1 d_2 + d_3)\,\|\mu^t - \mu^*\|_{\mathcal{H}}. \qquad (36)$$
By Assumption 3, we have
$$(b) \le d_2\,D\big(\Gamma_1^\lambda(\mu^t), \pi_{t+1}\big). \qquad (37)$$
Combining equations (35)-(37) yields
$$\|\mu^{t+1} - \mu^*\|_{\mathcal{H}} \le (1 - \beta_t d)\,\|\mu^t - \mu^*\|_{\mathcal{H}} + d_2\beta_t\,D\big(\Gamma_1^\lambda(\mu^t), \pi_{t+1}\big), \qquad (38)$$
where $d = 1 - d_1 d_2 - d_3 > 0$. It remains to bound the second term on the RHS. By the definition of the policy distance $D$ in equation (12), we have
$$D\big(\Gamma_1^\lambda(\mu^t), \pi_{t+1}\big) = \mathbb{E}_{\rho^*}\big[\|\Gamma_1^\lambda(\mu^t) - \pi_{t+1}\|_1\big] = \mathbb{E}_{s\sim\rho^*}\big[\|\pi_{t+1}^*(\cdot|s) - \pi_{t+1}(\cdot|s)\|_1\big] = \mathbb{E}_{s\sim\rho_t^*}\Big[\frac{\rho^*(s)}{\rho_t^*(s)}\,\|\pi_{t+1}^*(\cdot|s) - \pi_{t+1}(\cdot|s)\|_1\Big]$$
$$\le \Big(\mathbb{E}_{s\sim\rho_t^*}\Big[\Big(\frac{\rho^*(s)}{\rho_t^*(s)}\Big)^2\Big]\Big)^{1/2}\cdot\Big(\mathbb{E}_{s\sim\rho_t^*}\big[\|\pi_{t+1}^*(\cdot|s) - \pi_{t+1}(\cdot|s)\|_1^2\big]\Big)^{1/2} \le C_\rho\sqrt{2\,\mathbb{E}_{s\sim\rho_t^*}\big[D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\pi_{t+1}(\cdot|s)\big)\big]}, \qquad (39)$$
where the first inequality holds by the Cauchy-Schwarz inequality and the last inequality follows from Assumption 4 and Pinsker's inequality. Combining (38)-(39) gives
$$\|\mu^{t+1} - \mu^*\|_{\mathcal{H}} \le (1 - \beta_t d)\,\|\mu^t - \mu^*\|_{\mathcal{H}} + d_2\beta_t C_\rho\sqrt{2\,\mathbb{E}_{s\sim\rho_t^*}\big[D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\pi_{t+1}(\cdot|s)\big)\big]}.$$
This completes the proof.
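The recursion above says the damped update contracts toward $\mu^*$ at rate $1 - \beta_t d$ whenever the composite map $\Lambda$ is a $(d_1 d_2 + d_3)$-contraction. A minimal numeric sketch, using a hypothetical linear $\Lambda$ (`Lam` and its fixed point `target` are illustrative assumptions, not the paper's model):

```python
import numpy as np

c, b = 0.7, 0.1           # contraction factor c = d1*d2 + d3 < 1, step size beta
target = np.array([0.2, 0.3, 0.5])  # hypothetical fixed point mu*

def Lam(mu):
    # a c-Lipschitz map with fixed point `target`
    return target + c * (mu - target)

mu = np.array([1.0, 0.0, 0.0])
for _ in range(500):
    # damped update: error shrinks by (1 - b*(1 - c)) per iteration
    mu = (1 - b) * mu + b * Lam(mu)
assert np.linalg.norm(mu - target) < 1e-6
```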

C PROOF OF COROLLARY 1

Proof. Note that for each $t \in [T]$, we have
$$D(\pi_t, \pi^*) \le D(\pi_t, \pi_t^*) + D(\pi_t^*, \pi^*) = D(\pi_t, \pi_t^*) + D\big(\Gamma_1^\lambda(\mu^t), \Gamma_1^\lambda(\mu^*)\big) \le D(\pi_t, \pi_t^*) + d_1\,\|\mu^t - \mu^*\|_{\mathcal{H}},$$
where the last step follows from Assumption 2 on the Lipschitzness of $\Gamma_1^\lambda$. It follows that
$$D\Big(\frac{1}{T}\sum_{t=1}^T\pi_t,\ \pi^*\Big) + \Big\|\frac{1}{T}\sum_{t=1}^T\mu^t - \mu^*\Big\|_{\mathcal{H}} \le \frac{1}{T}\sum_{t=1}^T D(\pi_t, \pi^*) + \frac{1}{T}\sum_{t=1}^T\|\mu^t - \mu^*\|_{\mathcal{H}} \le \frac{1}{T}\sum_{t=1}^T\big(D(\pi_t, \pi_t^*) + d_1\|\mu^t - \mu^*\|_{\mathcal{H}}\big) + \frac{1}{T}\sum_{t=1}^T\|\mu^t - \mu^*\|_{\mathcal{H}} \lesssim \frac{1}{\sqrt{\lambda}}\Big(\frac{\sqrt{\log T}}{T^{1/5}} + \sqrt{\varepsilon}\Big),$$
where in the last step we apply the bounds (13) and (14) in Theorem 1.

For the second inequality of Lemma 5, since $0 \le Q_\mu^{\lambda,*}(s,a) \le Q_{\max}$, we have
$$\pi_\mu^{\lambda,*}(a|s) = \frac{\exp\big(Q_\mu^{\lambda,*}(s,a)/\lambda\big)}{\sum_{b\in A}\exp\big(Q_\mu^{\lambda,*}(s,b)/\lambda\big)} \ge \frac{1}{\sum_{b\in A}\exp(Q_{\max}/\lambda)} = \frac{1}{e^{Q_{\max}/\lambda}\,|A|}.$$

For the proof of Lemma 8, we have
$$\alpha\big\langle G(s,\cdot),\ p^* - p\big\rangle = \alpha\big\langle G(s,\cdot),\ p^* - p'\big\rangle + \alpha\big\langle G(s,\cdot),\ p' - p\big\rangle = D_{\mathrm{KL}}(p^* \| p) - D_{\mathrm{KL}}(p^* \| p') - D_{\mathrm{KL}}(p' \| p) + \alpha\big\langle G(s,\cdot),\ p' - p\big\rangle \le D_{\mathrm{KL}}(p^* \| p) - D_{\mathrm{KL}}(p^* \| p') - D_{\mathrm{KL}}(p' \| p) + \alpha\|G(s,\cdot)\|_\infty\cdot\|p' - p\|_1.$$
Rearranging terms yields
$$D_{\mathrm{KL}}(p^* \| p') \le D_{\mathrm{KL}}(p^* \| p) - \alpha\big\langle G(s,\cdot),\ p^* - p\big\rangle - D_{\mathrm{KL}}(p' \| p) + \alpha\|G(s,\cdot)\|_\infty\cdot\|p' - p\|_1. \qquad (45)$$
Meanwhile, by Pinsker's inequality, it holds that
$$D_{\mathrm{KL}}(p' \| p) \ge \|p' - p\|_1^2/2. \qquad (46)$$
By combining (45) and (46), we obtain
$$D_{\mathrm{KL}}(p^* \| p') \le D_{\mathrm{KL}}(p^* \| p) - \alpha\big\langle G(s,\cdot),\ p^* - p\big\rangle - \|p' - p\|_1^2/2 + \alpha\|G(s,\cdot)\|_\infty\cdot\|p' - p\|_1 \le D_{\mathrm{KL}}(p^* \| p) - \alpha\big\langle G(s,\cdot),\ p^* - p\big\rangle + \alpha^2\|G(s,\cdot)\|_\infty^2/2,$$
which concludes the proof.

E A WEAKER ASSUMPTION ON CONCENTRABILITY

In this section, we consider a weaker assumption on concentrability, under which Algorithm 1 learns a policy-population pair that is an $O(T^{-1/9})$-approximate NE after $T$ iterations. We consider the following distance metric between two policies $\pi, \pi' \in \Pi$:
$$W(\pi, \pi') := \Big(\mathbb{E}_{s\sim\rho^*}\big[\|\pi(\cdot|s) - \pi'(\cdot|s)\|_1^2\big]\Big)^{1/2}. \qquad (47)$$
Similarly as before, we assume certain Lipschitz properties for the two mappings $\Gamma_1^\lambda : \mathcal{M} \to \Pi$ and $\Gamma_2 : \Pi \times \mathcal{M} \to \mathcal{M}$ defined in Section 2.3. In particular, we impose the following two assumptions, both stated in terms of the new distance metric $W(\cdot,\cdot)$ defined in (47) above. Assumption 6.
There exists a constant $d_1 > 0$ such that for any $\mu, \mu' \in \mathcal{M}$, it holds that
$$W\big(\Gamma_1^\lambda(\mu),\ \Gamma_1^\lambda(\mu')\big) \le d_1\,\|\mu - \mu'\|_{\mathcal{H}}.$$
Assumption 7. There exist constants $d_2 > 0$ and $d_3 > 0$ such that for any policies $\pi, \pi' \in \Pi$ and embedded mean-field states $\mu, \mu' \in \mathcal{M}$, it holds that
$$\|\Gamma_2(\pi, \mu) - \Gamma_2(\pi', \mu)\|_{\mathcal{H}} \le d_2\,W(\pi, \pi'), \qquad \|\Gamma_2(\pi, \mu) - \Gamma_2(\pi, \mu')\|_{\mathcal{H}} \le d_3\,\|\mu - \mu'\|_{\mathcal{H}}.$$
Assumptions 6 and 7 immediately imply Lipschitzness of the composite mapping $\Lambda^\lambda : \mathcal{M} \to \mathcal{M}$, which we recall is defined as $\Lambda^\lambda(\mu) = \Gamma_2\big(\Gamma_1^\lambda(\mu), \mu\big)$.

Lemma 9. Suppose Assumptions 6 and 7 hold. Then for each $\mu, \mu' \in \mathcal{M}$, it holds that
$$\|\Lambda^\lambda(\mu) - \Lambda^\lambda(\mu')\|_{\mathcal{H}} \le (d_1 d_2 + d_3)\,\|\mu - \mu'\|_{\mathcal{H}}.$$
We also consider the following relaxed, $\ell_2$-type assumption on the concentrability coefficients.

Assumption 8 (Finite Concentrability Coefficients). There exist two constants $C_\rho, C_\rho' > 0$ such that for each $\mu \in \mathcal{M}$, it holds that
$$\Bigg(\mathbb{E}_{s\sim\rho_\mu^{\pi_\mu^{\lambda,*}}}\Bigg[\Bigg(\frac{\rho_\mu^{\pi_\mu^{\lambda,*}}(s)}{\rho^*(s)}\Bigg)^2\Bigg]\Bigg)^{1/2} \le C_\rho \qquad\text{and}\qquad \Bigg(\mathbb{E}_{s\sim\rho_\mu^{\pi_\mu^{\lambda,*}}}\Bigg[\Bigg(\frac{\rho^*(s)}{\rho_\mu^{\pi_\mu^{\lambda,*}}(s)}\Bigg)^2\Bigg]\Bigg)^{1/2} \le C_\rho'.$$
We establish the following convergence result for Algorithm 1.

Theorem 2. Suppose that Assumptions 1, 5, 6, 7 and 8 hold with $d_1 d_2 + d_3 < 1$, and that the error in the policy evaluation step in Algorithm 1 satisfies
$$\mathbb{E}_{s\sim\rho_t^*}\big[\|\hat Q_t^\lambda(s,\cdot) - Q_t^\lambda(s,\cdot)\|_\infty^2\big] \le \varepsilon^2, \qquad \forall t \in [T].$$
With the choice of $\eta = c_\eta T^{-1}$, $\alpha_t \equiv \alpha = c_\alpha T^{-4/9}$ and $\beta_t \equiv \beta = c_\beta T^{-8/9}$, for some universal constants $c_\eta > 0$, $c_\alpha > 0$ and $c_\beta > 0$ in Algorithm 1, the resulting policy and embedded mean-field state sequence $\{(\pi_t, \mu^t)\}_{t=1}^T$ satisfy
$$W\Big(\frac{1}{T}\sum_{t=1}^T\pi_t,\ \frac{1}{T}\sum_{t=1}^T\pi_t^*\Big) \le \frac{1}{T}\sum_{t=1}^T W(\pi_t, \pi_t^*) \lesssim \frac{1}{\lambda^{1/4}}\Big(\frac{(\log T)^{1/4}}{T^{1/9}} + \varepsilon^{1/4}\Big),$$
$$\Big\|\frac{1}{T}\sum_{t=1}^T\mu^t - \mu^*\Big\|_{\mathcal{H}} \le \frac{1}{T}\sum_{t=1}^T\|\mu^t - \mu^*\|_{\mathcal{H}} \lesssim \frac{1}{\lambda^{1/4}}\Big(\frac{(\log T)^{1/4}}{T^{1/9}} + \varepsilon^{1/4}\Big).$$
The following corollary of Theorem 2 shows that after $T$ iterations of our algorithm, the average policy-population pair is an approximate NE up to an error of order $\frac{(\log T)^{1/4}}{T^{1/9}} + \varepsilon^{1/4}$.

E.1 PROOFS OF THEOREM 2 AND COROLLARY 2

The proof follows similar lines as those of Theorem 1 and Corollary 1, with all appearances of the distance $D$ replaced by the new distance $W$. Below we only point out the modifications needed.

Lemma 6 remains valid as stated. In its proof, the only step that changes is the bound on the term $B_2$ in equation (31): the bounds in equation (32) should be replaced by
$$B_2 \le \kappa C_\rho\,W\big(\Gamma_1^\lambda(\mu^{t-1}),\ \Gamma_1^\lambda(\mu^t)\big) \le \kappa C_\rho d_1\,\|\mu^{t-1} - \mu^t\|_{\mathcal{H}}, \qquad (50)$$
where the first inequality follows from Assumption 8 and the second from Assumption 6.

Lemma 7 should be replaced by the following lemma.

Lemma 10. Under the setting of Theorem 2, for each $t \ge 0$, we have
$$\sigma_\mu^{t+1} \le (1 - \beta_t d)\,\sigma_\mu^t + d_2 C_\rho\beta_t\big(\sigma_\pi^{t+1}\big)^{1/4},$$
where $d = 1 - d_1 d_2 - d_3 > 0$.

The proof of Lemma 10 is similar to that of Lemma 7. The only step that changes is that the term $D\big(\Gamma_1^\lambda(\mu^t), \pi_{t+1}\big)$ in equation (38) should be replaced by $W\big(\Gamma_1^\lambda(\mu^t), \pi_{t+1}\big)$, which can be bounded as follows:
$$W\big(\Gamma_1^\lambda(\mu^t), \pi_{t+1}\big) = \Big(\mathbb{E}_{s\sim\rho^*}\big[\|\pi_{t+1}^*(\cdot|s) - \pi_{t+1}(\cdot|s)\|_1^2\big]\Big)^{1/2} \overset{(i)}{\le} \Big(2C_\rho\big(\mathbb{E}_{s\sim\rho_t^*}\big[\|\pi_{t+1}^*(\cdot|s) - \pi_{t+1}(\cdot|s)\|_1^2\big]\big)^{1/2}\Big)^{1/2} \overset{(ii)}{\le} (2C_\rho)^{1/2}\Big(2\,\mathbb{E}_{s\sim\rho_t^*}\big[D_{\mathrm{KL}}\big(\pi_{t+1}^*(\cdot|s)\,\|\,\pi_{t+1}(\cdot|s)\big)\big]\Big)^{1/4},$$
where step (i) holds by the Cauchy-Schwarz inequality, Assumption 8 and the fact that $\|\nu - \nu'\|_1 \le 2$ for all $\nu, \nu' \in \Delta(A)$, and step (ii) follows from Pinsker's inequality.

We now turn to the proof of Theorem 2. We first establish the convergence of $\sigma_\pi^t$ by following exactly the same steps from equation (21) up to equation (25). We restate the resulting bound on $\frac{1}{T}\sum_{t=0}^{T-1}\sigma_\pi^t$: with $\alpha = O(T^{-4/9})$, $\beta = O(T^{-8/9})$ and $\eta = O(T^{-1})$, for which $C_1 = O(\log T)$, we have
$$\frac{1}{T}\sum_{t=0}^{T-1}\sigma_\pi^t \lesssim \frac{\log T}{\lambda T^{4/9}} + \frac{2\varepsilon}{\lambda}. \qquad (53)$$
If we let $\bar T$ be a random index sampled uniformly from $\{1, \ldots, T\}$, then the above bound can be written equivalently as
$$\mathbb{E}_{\bar T}\big[\sigma_\pi^{\bar T}\big] \lesssim \frac{\log T}{\lambda T^{4/9}} + \frac{2\varepsilon}{\lambda}. \qquad (54)$$
We now proceed to bound the average optimality gap of the embedded mean-field state. With $\beta_t \equiv \beta = O(T^{-8/9})$, averaging the recursion of Lemma 10 (equation (55)) over iterations $t = 0, \ldots, T-1$ yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\sigma_\mu^t \le \frac{\sigma_\mu^0}{d\beta T} + \frac{d_2 C_\rho}{d}\cdot\frac{1}{T}\sum_{t=0}^{T-1}\big(\sigma_\pi^{t+1}\big)^{1/4} \overset{(i)}{\le} \frac{\sigma_\mu^0}{d\beta T} + \frac{d_2 C_\rho}{d}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\sigma_\pi^{t+1}\Big)^{1/4} \overset{(ii)}{\lesssim} \frac{1}{\lambda^{1/4}}\Big(\frac{(\log T)^{1/4}}{T^{1/9}} + \varepsilon^{1/4}\Big),$$
where step (i) follows from two applications of the Cauchy-Schwarz inequality and step (ii) follows from equation (53). Finally, we bound the average policy gap as
$$\frac{1}{T}\sum_{t=1}^T W(\pi_t, \pi_t^*) = \mathbb{E}_{\bar T}\big[W(\pi_{\bar T}, \pi_{\bar T}^*)\big] \overset{(i)}{\le} \Big(\mathbb{E}_{\bar T}\,\mathbb{E}_{s\sim\rho^*}\big[\|\pi_{\bar T}^*(\cdot|s) - \pi_{\bar T}(\cdot|s)\|_1^2\big]\Big)^{1/2} \overset{(ii)-(iv)}{\le} (2C_\rho)^{1/2}\big(2\,\mathbb{E}_{\bar T}[\sigma_\pi^{\bar T}]\big)^{1/4},$$
where step (i) holds due to Jensen's inequality, step (ii) follows from the Cauchy-Schwarz inequality, step (iii) follows from Assumption 8 and the fact that $\|\nu - \nu'\|_1 \le 2$ for all $\nu, \nu' \in \Delta(A)$, and step (iv) comes from Pinsker's inequality.



In general, the policy may be a function of the mean-field state $L_t$ as well. We have suppressed this dependency since our ultimate goal is to find a stationary equilibrium, under which the mean-field state remains fixed over time. See Guo et al. (2019); Saldi et al. (2018b) for a similar treatment.



Lemma 2. Let $p^*, p \in \Delta(A)$ and $\bar p = (1 - \eta)p + \frac{\eta}{|A|}\mathbf{1}_{|A|}$. Then
$$D_{\mathrm{KL}}(p^* \| \bar p) \le \log\frac{|A|}{\eta}, \qquad D_{\mathrm{KL}}(p^* \| \bar p) - D_{\mathrm{KL}}(p^* \| p) \le 2\eta.$$
Proof. By definition, since $\bar p(a) \ge \eta/|A|$ and $p^*(a) \le 1$ for all $a \in A$, we have
$$D_{\mathrm{KL}}(p^* \| \bar p) = \sum_{a\in A} p^*(a)\log\frac{p^*(a)}{\bar p(a)} \le \sum_{a\in A} p^*(a)\log\frac{|A|}{\eta} = \log\frac{|A|}{\eta}.$$
For the second claim, note that $\bar p(a) \ge (1 - \eta)p(a)$ for all $a \in A$, so that
$$D_{\mathrm{KL}}(p^* \| \bar p) - D_{\mathrm{KL}}(p^* \| p) = \sum_{a\in A} p^*(a)\log\frac{p(a)}{\bar p(a)} \le \log\frac{1}{1-\eta} \le 2\eta,$$
where the last inequality holds for $\eta \in [0, 1/2]$.
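Both claims of the uniform-mixing lemma can be checked numerically. The sketch below (our own illustration) samples random distributions and verifies the two bounds for a small mixing weight $\eta$:

```python
import numpy as np

def kl(p, q):
    # KL divergence on a finite alphabet
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(3)
A, eta = 8, 0.01  # illustrative alphabet size and mixing weight
for _ in range(1000):
    p_star = rng.dirichlet(np.ones(A))
    p = rng.dirichlet(np.ones(A))
    p_bar = (1 - eta) * p + eta / A  # mix p with the uniform distribution
    # claim 1: KL to the mixed distribution is at most log(|A|/eta)
    assert kl(p_star, p_bar) <= np.log(A / eta) + 1e-9
    # claim 2: mixing increases the KL by at most 2*eta
    assert kl(p_star, p_bar) - kl(p_star, p) <= 2 * eta + 1e-9
```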

where step (a) uses the fact that $D_{\mathrm{KL}}\big(\pi_t^*(\cdot|s) \,\|\, \pi_t(\cdot|s)\big) \le \mathrm{KL}_{\max} := \log\frac{|A|}{\eta}$ (cf. Lemma 2) and step (b) follows from Assumption 5.

For any function $g : A \to \mathbb{R}$ and distribution $p \in \Delta(A)$, let $z : A \to \mathbb{R}$ be the constant function defined by $z(a) = \log\big(\sum_{a'\in A} p(a')\exp(\alpha g(a'))\big)$. Note that for any distributions $p^*, p' \in \Delta(A)$, $\langle z, p^* - p'\rangle = 0$. Since $p'(\cdot) \propto p(\cdot)\exp(\alpha g(\cdot))$, we have $\alpha g(\cdot) = z(\cdot) + \log(p'(\cdot)/p(\cdot))$. Hence
$$\alpha\langle g,\ p^* - p'\rangle = \big\langle z + \log(p'/p),\ p^* - p'\big\rangle = \langle z,\ p^* - p'\rangle + \big\langle\log(p^*/p),\ p^*\big\rangle + \big\langle\log(p'/p^*),\ p^*\big\rangle + \big\langle\log(p'/p),\ -p'\big\rangle = D_{\mathrm{KL}}(p^* \| p) - D_{\mathrm{KL}}(p^* \| p') - D_{\mathrm{KL}}(p' \| p).$$
Therefore, for each state $s \in S$, taking $g = G(s,\cdot)$ yields the identity used in the derivation of equation (45).

With $\alpha = O(T^{-4/9})$, $\beta = O(T^{-8/9})$ and $\eta = O(T^{-1})$, we have $C_1 = O(\log T)$.

With $\beta_t \equiv \beta = O(T^{-8/9})$, we average equation (55) over iterations $t = 0, \ldots, T-1$.

$$\frac{1}{T}\sum_{t=1}^T W(\pi_t, \pi_t^*) = \mathbb{E}_{\bar T}\big[W(\pi_{\bar T}, \pi_{\bar T}^*)\big] = \mathbb{E}_{\bar T}\Big[\big(\mathbb{E}_{s\sim\rho^*}\big[\|\pi_{\bar T}^*(\cdot|s) - \pi_{\bar T}(\cdot|s)\|_1^2\big]\big)^{1/2}\Big] \le (2C_\rho)^{1/2}\Big(2\,\mathbb{E}_{\bar T}\,\mathbb{E}_{s\sim\rho^*_{\bar T-1}}\big[D_{\mathrm{KL}}\big(\pi_{\bar T}^*(\cdot|s)\,\|\,\pi_{\bar T}(\cdot|s)\big)\big]\Big)^{1/4} = (2C_\rho)^{1/2}\big(2\,\mathbb{E}_{\bar T}[\sigma_\pi^{\bar T}]\big)^{1/4} \lesssim \frac{1}{\lambda^{1/4}}\Big(\frac{(\log T)^{1/4}}{T^{1/9}} + \varepsilon^{1/4}\Big).$$


D ADDITIONAL PROOFS

D.1 PROOF OF LEMMA 1

Proof. By the definition of $\Lambda$, for any $\mu, \mu' \in \mathcal{M}$ we have
$$\|\Lambda(\mu) - \Lambda(\mu')\|_{\mathcal{H}} = \big\|\Gamma_2\big(\Gamma_1^\lambda(\mu), \mu\big) - \Gamma_2\big(\Gamma_1^\lambda(\mu'), \mu'\big)\big\|_{\mathcal{H}} \le \big\|\Gamma_2\big(\Gamma_1^\lambda(\mu), \mu\big) - \Gamma_2\big(\Gamma_1^\lambda(\mu'), \mu\big)\big\|_{\mathcal{H}} + \big\|\Gamma_2\big(\Gamma_1^\lambda(\mu'), \mu\big) - \Gamma_2\big(\Gamma_1^\lambda(\mu'), \mu'\big)\big\|_{\mathcal{H}} \le d_2\,D\big(\Gamma_1^\lambda(\mu), \Gamma_1^\lambda(\mu')\big) + d_3\,\|\mu - \mu'\|_{\mathcal{H}} \le (d_1 d_2 + d_3)\,\|\mu - \mu'\|_{\mathcal{H}},$$
where the second inequality follows from Assumption 3 and the last from Assumption 2, which proves the lemma.

D.2 PROOF OF LEMMA 4

Proof. By the definition of $V_\mu^{\lambda,\pi}$ in (4), we obtain the expansion in (40). Recall that the Q-function $Q_\mu^{\lambda,\pi}$ of a policy $\pi$ for the regularized $\mathrm{MDP}_\mu$ is related to $V_\mu^{\lambda,\pi}$ as in (41). Plugging (41) into (40), we obtain (42). Recalling the definition of $J_\mu^\lambda(\pi)$ in (5) and taking expectation with respect to $s \sim \nu_0$ on both sides of (42) yields (43). For the entropy term in (43), we have the identity (44). Substituting (44) into (43) yields the desired equation in Lemma 4.

D.3 PROOF OF LEMMA 5

Proof. Note that the value function $V_\mu^{\lambda,\pi}$ can be written as the expected discounted sum of the regularized rewards $r_\mu^{\lambda,\pi}$ defined in (1). By the definition of $r_\mu^{\lambda,\pi}$ in (1), we have $0 \le r_\mu^{\lambda,\pi}(s,a) \le R_{\max} + \lambda\log|A|$, so that $0 \le V_\mu^{\lambda,\pi}(s) \le \frac{R_{\max} + \lambda\log|A|}{1-\gamma}$ and hence
$$0 \le Q_\mu^{\lambda,\pi}(s,a) \le R_{\max} + \gamma\cdot\frac{R_{\max} + \lambda\log|A|}{1-\gamma} = \frac{R_{\max} + \gamma\lambda\log|A|}{1-\gamma} = Q_{\max}.$$

In the chain of inequalities above for the proof of Theorem 2, step (iv) comes from Pinsker's inequality, and step (v) follows from the bound in equation (54). The above bound, together with Jensen's inequality, proves equation (48). We have completed the proof of Theorem 2. The proof of Corollary 2 is the same as that of Corollary 1 and is omitted here.

