IMITATION LEARNING FOR MEAN FIELD GAMES WITH CORRELATED EQUILIBRIA

Abstract

Imitation learning (IL) aims at achieving optimal actions by learning from demonstrated behaviors without knowing the reward function and transition kernels. Conducting IL with a large population of agents is challenging as agents' interactions grow exponentially with respect to the population size. Mean field theory provides an efficient tool to study multi-agent problems by aggregating information at the population level. While the approximation is tractable, it is non-trivial to restore mean field Nash equilibria (MFNE) from demonstrations. Importantly, there are many real-world problems that cannot be explained by the classic MFNE concept; examples include the traffic network equilibrium induced by public routing recommendations and the pricing equilibrium of goods generated on an e-commerce platform. In both examples, correlation devices are introduced into the equilibrium due to the intervention of the platform. To accommodate this, we propose a novel solution concept named adaptive mean field correlated equilibrium (AMFCE) that generalizes MFNE. On the theory side, we first prove the existence of AMFCE, and then establish a novel framework based on IL and AMFCE with entropy regularization (MaxEnt-AMFCE) to recover the AMFCE policy from real-world demonstrations. Signatures from rough path theory are then applied to characterize the mean-field evolution. A significant benefit of our framework is that it can recover both the equilibrium policy and the correlation device from data. We test our framework against state-of-the-art IL algorithms for MFGs on several tasks (including a real-world traffic flow prediction problem); the results justify the effectiveness of our proposed method and show its potential for predicting and explaining large-population behavior under correlated signals.

1. INTRODUCTION

Imitation learning (IL) (Hussein et al., 2017) has been widely adopted to learn desired behaviors from expert demonstrations and has led to a series of impressive successes (Silver et al., 2016; Shi et al., 2019; Shang et al., 2019). Existing imitation learning algorithms cannot handle tasks with a large group of agents due to the curse of dimensionality and the exponential growth of agent interactions as the number of agents increases. However, many real-world scenarios require the algorithm to handle a large population. Examples include traffic management and control (Bazzan, 2009), ad auctions (Guo et al., 2019), online businesses with a large customer base (Ahn et al., 2007), and social behaviors between game bots and humans (Jeong et al., 2015). For systems with a large population of homogeneous agents, mean field theory provides a practically efficient and analytically feasible approach to analyzing otherwise challenging multi-agent games (Guo et al., 2019; Yang et al., 2018b). In the mean field game (MFG) setting, the states of the entire population can be sufficiently summarized by an empirical distribution of states thanks to the homogeneity property. Therefore it suffices to consider a game between a representative agent and an empirical distribution. The existing (and rather limited) literature on mean-field IL assumes that the expert demonstrations are sampled from the classic mean field Nash equilibrium (MFNE) (Yang et al., 2018a; Chen et al., 2022). The limitation of this framework is that it is not general enough to capture many real-world situations where external and correlated signals influence the behavior of the entire population. Examples include the behavior of drivers on a traffic network with routing recommendations from Google Maps or Apple Maps. Another example is an e-commerce platform recommending prices to individual sellers for their products. In both examples, a mediator or coordinator recommends decisions, but an individual agent who seeks a greedy decision can deviate from the recommendation if she/he finds a better option given the available information. The existence of the mediator introduces correlations among the behaviors of individual agents. Therefore, a more general equilibrium concept is needed before we take a step further to learn from expert demonstrations. Inspired by the concept of correlated equilibrium (CE) for stateless games (Aumann, 1974), there have been recent developments on mean field correlated equilibria (MFCE) with state dynamics. Campi and Fischer assume that a mediator recommends the same stochastic policy to the entire population, resulting in a limited equilibrium set which coincides with the classic MFNE (Campi & Fischer, 2022). In addition, it is often more practical for the mediator to recommend an action rather than a stochastic policy to individuals (see the traffic routing and e-commerce examples). Muller et al. assume that the mediator recommends a time-independent and deterministic policy (sampled from some distribution over the deterministic policy space) to each individual (Muller et al., 2022). This formulation is also rather limited in describing the behaviors of many real-world applications and in allowing sufficient flexibility of the population behavior.
A more general and practical setting is to establish a framework where the mediator samples a stochastic policy based on some time-dependent signals and recommends an action to each individual, which is exactly the framework investigated in this paper. (See Appendix H for a concrete example showing that our equilibrium concept is more general than the one proposed by Muller et al. (Muller et al., 2022).) Given the above limitations of existing MFCE concepts and mean-field IL approaches, we propose a new MFCE framework named adaptive mean field correlated equilibrium (AMFCE) with time-dependent correlated signals, in which an individual agent can adaptively update her belief on the unobserved correlated signal. We develop a method to recover the AMFCE policy based on maximum entropy regularization. Our framework has the following important and novel ingredients:
• Novel MFCE concept with time-dependent correlated signals and adaptive belief updates from individual agents. We propose a new MFCE framework (called AMFCE) in which the mediator recommends an action sampled from a stochastic policy to each agent at every time step. This is a more general and flexible framework compared to previous works on the MFCE (Muller et al., 2022; Campi & Fischer, 2022). We prove the existence of an AMFCE solution under mild conditions and prove that MFNE is a subclass of AMFCE.
• Entropy regularization to overcome the equilibrium selection difficulty. Most IL algorithms for games face an equilibrium selection (or identifiability) issue, as there often exist multiple equilibria. To bypass this difficulty, we further propose an entropy-regularized AMFCE (MaxEnt-AMFCE) framework which is shown to have a unique solution.
• Signatures from rough path theory to efficiently represent the mean-field evolution. Mean field information is often inaccessible in practice, and approximating it by its empirical distribution is computationally expensive. To overcome this difficulty, we adopt signatures from rough path theory to represent the mean-field evolution, which can be easily combined with neural network training architectures and keeps the resulting method computationally efficient.
With all these ingredients, our correlated mean field imitation learning (CMFIL) framework can recover not only the policy but also the correlation device, i.e., the distribution from which the correlated signal is sampled. To the best of our knowledge, this paper is the first to focus on MFCE with a correlation device providing time-dependent recommendations and allowing adaptive belief updates for individual agents. In addition, we illustrate the performance of our framework by comparing it against state-of-the-art imitation learning algorithms for MFGs on several tasks, including a real-world traffic flow prediction problem. The experimental results demonstrate that our framework outperforms the baselines in all tasks. As a by-product, our framework is also suitable for solving MFNE, since MFNE is a subclass of AMFCE.

2. PRELIMINARY: CLASSIC MEAN FIELD NASH EQUILIBRIUM

This section introduces the classic framework of MFG and the concept of MFNE. The classic MFG models a game between a representative agent and the state distribution of all the other agents. Denote P(X) as the set of probability distributions over a set X and denote T = {0, 1, ..., T} as the set of time indices. The mean field μ_t ∈ P(S) can be viewed as the limit of the empirical state distribution of a homogeneous N-agent game, μ_t(s) = lim_{N→∞} (1/N) Σ_{i=1}^N 1_{{s_t^i = s}}, where s_t^i is the state of agent i at time t and 1_{{e}} is an indicator function (with value 1 if expression e holds and 0 otherwise). The transition kernel is P : S × A × P(S) → P(S). For fixed mean-field information μ = {μ_t}_{t=0}^T, the objective of the representative agent is to solve the following decision-making problem over all admissible policies π = {π_t}_{t=0}^T:
\[
\begin{aligned}
\text{maximize}_{\boldsymbol{\pi}} \quad & V_k(s, \boldsymbol{\pi}, \boldsymbol{\mu}) := \mathbb{E}\!\left[\sum_{t=k}^{T} \gamma^t r(s_t, a_t, \mu_t) \,\Big|\, s_k = s\right] \\
\text{subject to} \quad & s_{t+1} \sim P(\cdot \mid s_t, a_t, \mu_t), \quad a_t \sim \pi_t(s_t).
\end{aligned}
\tag{Classic MFG}
\]
The Mean-field Nash Equilibrium (MFNE) is defined as follows.

Definition 1 (MFNE). In (Classic MFG), a player-population profile (π*, μ*) is called an MFNE (under initial distribution μ_0) if
1. (Single player side) For any policy π, any time index t ∈ T, and any initial state s ∈ S, V_t(s, π*, μ*) ≥ V_t(s, π, μ*).
2. (Population side) {μ*_t}_{t=0}^T satisfies μ*_t(·) = Σ_{s∈S, a∈A} P(·|s, a, μ*_{t-1}) π*_{t-1}(a|s) μ*_{t-1}(s) with initial condition μ*_0 = μ_0.

The single-player-side condition captures the optimality of π* when the population side is fixed. The population-side condition ensures the "consistency" of the solution: it guarantees that the state distribution flow of the single player matches the population state sequence μ* := {μ*_t}_{t=0}^T.
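For concreteness, the population-side consistency condition can be simulated as a forward update of the state distribution under a fixed policy. The following is a minimal sketch on finite state and action spaces; the function and variable names (e.g. forward_mean_field, transition) are ours and not from the paper.

import numpy as np

def forward_mean_field(mu0, policy, transition, T):
    """Propagate the population state distribution forward in time.

    mu0:        initial distribution over |S| states, shape (S,)
    policy:     policy[t][s, a] = pi_t(a | s), shape (T+1, S, A)
    transition: transition(mu)[s, a, s'] = P(s' | s, a, mu), shape (S, A, S)
    Returns [mu_0, mu_1, ..., mu_T] (the population side of Definition 1).
    """
    mus = [mu0]
    for t in range(T):
        mu_t = mus[-1]
        P_t = transition(mu_t)                 # mean-field-dependent kernel
        # mu_{t+1}(s') = sum_{s,a} mu_t(s) * pi_t(a|s) * P(s'|s,a,mu_t)
        joint = mu_t[:, None] * policy[t]      # shape (S, A)
        mu_next = np.einsum("sa,sap->p", joint, P_t)
        mus.append(mu_next)
    return mus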

3. PROBLEM FORMULATION

This section introduces a novel adaptive mean-field correlated equilibrium (AMFCE) framework and establishes the existence of equilibrium solutions under mild conditions. We prove that the solution set of AMFCE is richer than that of the well-known MFNE. Furthermore, the maximum entropy principle is adopted to select the solution with maximum entropy among the AMFCE solution set.

3.1. ADAPTIVE MEAN FIELD CORRELATED EQUILIBRIUM (AMFCE)

To incorporate the correlations introduced by the central platforms in the traffic network and e-commerce marketplace examples from Section 1, we consider a mediator (or a central agent) who samples a correlated signal z_t ∈ Z at each time t, where Z is a finite signal space. The signal z_t may represent some global condition such as the weather on day t in the traffic network example, or the supply-demand imbalance in month t in the e-commerce marketplace example. Before discussing the AMFCE, we first introduce the concepts of behavioral policy and correlation device.

Definition 2. For each time t, the behavioral policy π_t : Z × S → Δ(A) maps the signal and state spaces to the simplex over the action space. Given the correlated signal z ∈ Z, an action a ∼ π_t(·|s, z) is independently sampled as a private recommendation for each agent at state s.

Definition 3. The per-step correlation device ρ_t ∈ Δ(Z) is a publicly known distribution over the space of correlated signals, from which the mediator samples the correlated signal at time step t. Denote ρ = {ρ_t}_{t=0}^T as the correlation device over the entire horizon.

At every time step t, a correlated signal z_t is sampled from the per-step correlation device ρ_t. Then a recommended action a_t is sampled independently from the behavioral policy π_t(·|s_t, z_t) and sent to each agent at state s_t. This recommended action a_t is private and only available to the agent. Mathematically, denote I_t = {ρ_t, a_t, π_t(·,·,·), s_t, z_{t-1}, μ_{t-1}} as the information available to the agent at the beginning of step t, and I_0 = {ρ_0, a_0, π_0(·,·,·), s_0}. Note that the agent only observes the functional form of π_t but observes neither the correlated signal z_t nor the recommended actions of other agents. Based on the information I_t, the agent takes an action a_t (which may differ from the mediator's recommendation), and the agent at state s_t then transits to the next state according to the distribution P(·|s_t, a_t, μ_t) ∈ P(S) given the current mean field μ_t, which evolves as
\[
\mu_t(\cdot) = \sum_{a \in \mathcal{A}} \sum_{s \in \mathcal{S}} \mu_{t-1}(s)\, P(\cdot \mid s, a, \mu_{t-1})\, \pi_{t-1}(a \mid s, z_{t-1}). \tag{1}
\]
This implies that, given μ_{t-1} and π_{t-1}, μ_t is fully determined by z_{t-1}. After receiving the recommended action a_t, the agent can predict the correlated signal by
\[
\rho^{\mathrm{pred}}_t(z_t = z \mid I_t) = \frac{\rho_t(z)\, \pi_t(a_t \mid s_t, z)}{\sum_{z' \in \mathcal{Z}} \rho_t(z')\, \pi_t(a_t \mid s_t, z')}. \tag{2}
\]
Based on the available information I_t at time t, the agent can then update its prediction of the mean field distribution at the next time step for each possible signal z:
\[
\mu^{\mathrm{pred}}_{t+1}(\cdot \mid I_t, z) = \sum_{a \in \mathcal{A}} \sum_{s \in \mathcal{S}} \mu_t(s)\, P(\cdot \mid s, a, \mu_t)\, \pi_t(a \mid s, z) := \Phi(\mu_t, \pi_t, z). \tag{3}
\]
The Q function Q^{π}_t(s, a, μ, z; π'), for an individual agent who follows policy π while the population follows π', is defined as
\[
Q^{\boldsymbol{\pi}}_t(s, a, \mu, z; \boldsymbol{\pi}') = r(s, a, \mu) + \gamma\, \mathbb{E}_{\boldsymbol{\pi}}\!\left[\sum_{i=t+1}^{T} \gamma^{\,i-t-1} r(s_i, a_i, \mu_i) \,\Big|\, (s_t, a_t, \mu_t, z_t) = (s, a, \mu, z)\right],
\]
where E_π is the expectation taken with respect to z_i ∼ ρ_i(·), a_i ∼ π_i(·|s_i, z_i), s_{i+1} ∼ P(·|s_i, a_i, μ_i) for all i ∈ {t+1, t+2, ..., T}. We can verify that the Q function satisfies the following Bellman equation:
\[
Q^{\boldsymbol{\pi}}_t(s, a, \mu, z; \boldsymbol{\pi}') = r(s, a, \mu) + \gamma\, \mathbb{E}\!\left[Q^{\boldsymbol{\pi}}_{t+1}\big(s', a', \Phi(\mu, \pi'_t, z), z'; \boldsymbol{\pi}'\big) \,\Big|\, (s_t, a_t, \mu_t, z_t) = (s, a, \mu, z)\right], \tag{4}
\]
where the expectation is taken with respect to z' ∼ ρ_{t+1}(·), s' ∼ P(·|s, a, μ), a' ∼ π_{t+1}(·|s', z').
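As a concrete illustration of the belief update (2) and the mean-field prediction (3), the following sketch implements both on finite spaces. It is a minimal sketch under our own naming conventions (signal_posterior, predict_mean_field); nothing here is taken from the authors' implementation.

import numpy as np

def signal_posterior(rho_t, pi_t, s_t, a_t):
    """Eq. (2): posterior over the hidden signal z_t after seeing recommendation a_t.

    rho_t: per-step correlation device, shape (Z,)
    pi_t:  behavioral policy pi_t[z, s, a], shape (Z, S, A)
    """
    likelihood = pi_t[:, s_t, a_t]             # pi_t(a_t | s_t, z) for each z
    unnormalized = rho_t * likelihood
    return unnormalized / unnormalized.sum()

def predict_mean_field(mu_t, pi_t, transition, z):
    """Eq. (3): Phi(mu_t, pi_t, z), the predicted next mean field given signal z.

    transition(mu)[s, a, s'] = P(s' | s, a, mu), shape (S, A, S)
    """
    P_t = transition(mu_t)
    joint = mu_t[:, None] * pi_t[z]            # mu_t(s) * pi_t(a | s, z), shape (S, A)
    return np.einsum("sa,sap->p", joint, P_t)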
Similarly, we define the optimal Q-function Q*_t(s, a, μ, z; π') as the Q function associated with the optimal individual policy π* given population behavior π'. It is easy to show that Q* satisfies the following Bellman equation:
\[
Q^{*}_t(s, a, \mu, z; \boldsymbol{\pi}') = r(s, a, \mu) + \gamma \max_{a' \in \mathcal{A}} \mathbb{E}\!\left[Q^{*}_{t+1}\big(s', a', \Phi(\mu, \pi'_t, z), z'; \boldsymbol{\pi}'\big) \,\Big|\, (s_t, a_t, \mu_t, z_t) = (s, a, \mu, z)\right], \tag{5}
\]
where the expectation is taken with respect to z' ∼ ρ_{t+1}(·), s' ∼ P(·|s, a, μ). It is worth noting that if the population policy π' is fixed, Q*_T(s, a, μ, z; π') ≥ Q^{π}_T(s, a, μ, z; π') for any π. Then by induction, it holds that Q*_t(s, a, μ, z; π') ≥ Q^{π}_t(s, a, μ, z; π') for all t ∈ T. To introduce the concept of AMFCE, we define the set of swap functions U := {u : A → A}; namely, u is a function that modifies an action a into an action u(a). Let
\[
\Delta_t(s, \mu, u; \boldsymbol{\pi}, \boldsymbol{\rho}) = \mathbb{E}\big[Q^{\boldsymbol{\pi}}_t(s, u(a), \mu, z; \boldsymbol{\pi}) - Q^{\boldsymbol{\pi}}_t(s, a, \mu, z; \boldsymbol{\pi})\big], \quad u \in \mathcal{U},
\]
denote the margin of the Q function when the agent takes action u(a) while recommendation a is provided by the mediator, where the expectation is taken with respect to z ∼ ρ_t(·), a ∼ π_t(·|s, z).

Definition 4. The profile (π*, ρ*) composed of the time-varying stochastic policy π* = {π*_t}_{t=0}^T and the correlation device ρ* is an adaptive mean field correlated equilibrium (AMFCE) if
• (Single agent side) No agent has an incentive to unilaterally deviate from the recommended action after predicting z by (2), i.e., Δ_t(s, μ_t, u; π*, ρ*) ≤ 0, ∀u ∈ U, ∀s ∈ S, ∀t ∈ T.

A toy example named Ocean Ranch is provided below to demonstrate the concept of AMFCE.

Example 1. Suppose there exists a marine ranch with two sectors. The regulator of the marine ranch adjusts the number of fish entering the two sectors by giving recommendations to the fish. The state space of fish is S = {C, L, R}, the action space is A = {L, R}, and the initial mean field satisfies μ_0(C) = 1. The reward is r(s, a, μ) = 1_{{s=L}} μ(L) + 1_{{s=R}} μ(R) and T = {0, 1}. The environment dynamics are deterministic: P(s_{t+1} = R | s_t = ·, a = R) = 1, P(s_{t+1} = L | s_t = ·, a = R) = 0, P(s_{t+1} = R | s_t = ·, a = L) = 0, P(s_{t+1} = L | s_t = ·, a = L) = 1.

We prove that in Example 1, the regulator in an AMFCE gives recommendations as follows (see the detailed proof in Appendix C). First, a random variable z is sampled from the correlated signal space Z = {0, 1} with equal probability ρ_0(z = 0) = ρ_0(z = 1) = 0.5, and the regulator gives the action recommendation to each fish according to the policy π_0(a = L|z = 0) = 2/3, π_0(a = R|z = 0) = 1/3, π_0(a = L|z = 1) = 1/3, π_0(a = R|z = 1) = 2/3. Then no fish has an incentive to deviate from the recommendation, so an AMFCE is achieved. It is worth noting that the above AMFCE solution is not a classic MFNE, because there are only three MFNEs, which are shown in Table 1.

Table 1: AMFCE and MFNE in the Ocean Ranch. The AMFCE shown in this table is not an MFNE.
  MFNE:   π^0_0(a = L|s = ·) = 1;   π^1_0(a = L|s = ·) = 0;   π^2_0(a = L|s = ·) = 1/2
  AMFCE:  π_0(a = L|s = ·, z = 0) = 2/3;   π_0(a = L|s = ·, z = 1) = 1/3;   ρ_0(z = 0) = ρ_0(z = 1) = 1/2
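To make the incentive argument in Example 1 concrete, the short sketch below (our own illustration, not the proof in Appendix C) computes the expected reward of following versus swapping the recommendation under the stated correlation device, using the posterior from (2).

import numpy as np

rho = np.array([0.5, 0.5])                   # rho_0(z=0), rho_0(z=1)
pi_L = np.array([2/3, 1/3])                  # pi_0(a=L | z); pi_0(a=R | z) = 1 - pi_L
mu_L = pi_L                                  # deterministic dynamics: mu_1(L) = pi_0(L | z)

def expected_reward(rec, play):
    """Expected reward of playing `play` given recommendation `rec` (0 = L, 1 = R)."""
    lik = pi_L if rec == 0 else 1 - pi_L     # pi_0(rec | z) for each z
    post = rho * lik / (rho * lik).sum()     # eq. (2): posterior over z
    crowd = mu_L if play == 0 else 1 - mu_L  # mass ending up in the chosen sector
    return float(post @ crowd)               # reward = mass of own sector

for rec in (0, 1):
    follow = expected_reward(rec, rec)
    deviate = expected_reward(rec, 1 - rec)
    print(f"recommendation {'L' if rec == 0 else 'R'}: follow={follow:.3f}, deviate={deviate:.3f}")
# follow = 5/9 > deviate = 4/9 in both cases, so no fish benefits from deviating.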

3.2. PROPERTIES OF AMFCE

This section focuses on the properties of AMFCE, including conditions that guarantee its existence and its relationship to the classic MFNE. To establish the existence of AMFCE solutions, we define the best response operator
\[
\mathrm{BR}(\boldsymbol{\pi}; \boldsymbol{\rho}) = \arg\max_{\boldsymbol{\pi}'} \; \mathbb{E}_{\boldsymbol{\pi}', \boldsymbol{\rho}}\!\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t, \mu_t)\right],
\]
where the expectation is taken with respect to z_t ∼ ρ_t(·), s_t ∼ P(·|s_{t-1}, a_{t-1}, μ_{t-1}), a_t ∼ π'_t(·|s_t, z_t), and μ_t = Φ(μ_{t-1}, π_{t-1}, z_{t-1}). Unless otherwise stated, the expectation E_{π,ρ} is taken with respect to z_t ∼ ρ_t(·), s_t ∼ P(·|s_{t-1}, a_{t-1}, μ_{t-1}), a_t ∼ π_t(·|s_t, z_t), μ_t = Φ(μ_{t-1}, π_{t-1}, z_{t-1}). The existence of a solution is then derived by applying Kakutani's fixed point theorem (Kakutani, 1941) to the operator BR. We next provide a sufficient condition for the existence of AMFCE.

Theorem 1. If the functions r(s, a, μ) and P(s'|s, a, μ) are bounded and continuous with respect to μ, then there exists an AMFCE solution.

The AMFCE is a more general equilibrium concept than MFNE, as illustrated in Corollary 1.

Corollary 1. If (π, μ) is an MFNE, then it induces an AMFCE solution (π, ρ) with |Z| = 1 and ρ_t(z) = 1 for all t ∈ T, where z ∈ Z is the single element of the signal space.

The proof is deferred to Appendix D.3. This corollary implies that MFNE is a subset of AMFCE. Example 1 shows that an AMFCE may not be an MFNE.
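As a reading aid, the existence argument can be summarized by the fixed-point condition to which Kakutani's theorem is applied; the following restatement is ours and only sketches the logic:
\[
\boldsymbol{\pi}^* \in \mathrm{BR}(\boldsymbol{\pi}^*; \boldsymbol{\rho}),
\qquad
\mu_t = \Phi(\mu_{t-1}, \pi^*_{t-1}, z_{t-1}) \quad \text{for all } t \in \mathcal{T}.
\]
Any such fixed point leaves no profitable deviation (in particular no profitable swap u ∈ U), and therefore yields an AMFCE; the boundedness and continuity of r and P in μ assumed in Theorem 1 are used to verify the conditions of Kakutani's theorem for the map π ↦ BR(π; ρ).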

3.3. MAXIMUM ENTROPY MEAN FIELD CORRELATED EQUILIBRIUM

Similar to the classic MFG setting, there may be multiple AMFCEs in our setting. Consequently, AMFCE faces the equilibrium selection issue. One of the commonly used selection criteria is maximum entropy. For example, maximum entropy has been used to select correlated equilibria in normal-form games (Ortiz et al., 2007) and Markov games (Ziebart et al., 2011). We integrate the maximum entropy principle into the AMFCE as follows.

Definition 5. The maximum entropy AMFCE (MaxEnt-AMFCE) is the AMFCE that maximizes the entropy,
\[
(\boldsymbol{\pi}^*, \boldsymbol{\rho}^*) = \arg\max_{(\boldsymbol{\pi}, \boldsymbol{\rho}) \in \Pi_{\mathrm{AMFCE}}} H(\boldsymbol{\pi}, \boldsymbol{\rho}),
\quad \text{with} \quad
H(\boldsymbol{\pi}, \boldsymbol{\rho}) = \sum_{t=0}^{T} \mathbb{E}_{\boldsymbol{\pi}, \boldsymbol{\rho}}\big[-\log\big(\pi_t(a_t \mid s_t, z_t)\, \rho_t(z_t)\big)\big],
\]
where Π_AMFCE is the set of all AMFCE solutions.

The MaxEnt-AMFCE avoids the equilibrium selection problem because it is unique under certain conditions. Denote Δ(π, ρ) = max_{u,s,t} Δ_t(s, μ_t, u; π, ρ), where μ_t = Φ(μ_{t-1}, π_{t-1}, z_{t-1}).

Corollary 2. The MaxEnt-AMFCE is a unique equilibrium solution if Δ(π, ρ) is convex w.r.t. (π, ρ).

The proof is deferred to Appendix D.4. Directly optimizing the entropy is difficult because the policy π_t and the correlation device ρ_t are coupled, so we decouple this term via the following proposition.

Proposition 1. The entropy can be decoupled as
\[
H(\boldsymbol{\pi}, \boldsymbol{\rho}) = \sum_{t=0}^{T} \Big[ H(\rho_t) + \mathbb{E}_{\boldsymbol{\pi}, \boldsymbol{\rho}}\, H(\pi_t \mid s_t, z_t) \Big],
\]
where H(ρ_t) is the entropy of the correlation device and H(π_t|s_t, z_t) = -Σ_{a_t∈A} π_t(a_t|s_t, z_t) log(π_t(a_t|s_t, z_t)) is the entropy of π_t(·|s_t, z_t). (See the proof in Appendix D.5.)
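The decoupling in Proposition 1 follows from a one-line expansion of the joint log-probability; we sketch it here (our own restatement, using only that z_t ∼ ρ_t and a_t ∼ π_t(·|s_t, z_t)):
\[
\begin{aligned}
H(\boldsymbol{\pi}, \boldsymbol{\rho})
  &= \sum_{t=0}^{T} \mathbb{E}_{\boldsymbol{\pi}, \boldsymbol{\rho}}
     \big[-\log \rho_t(z_t) - \log \pi_t(a_t \mid s_t, z_t)\big] \\
  &= \sum_{t=0}^{T} \Big( \underbrace{\mathbb{E}_{z_t \sim \rho_t}\big[-\log \rho_t(z_t)\big]}_{H(\rho_t)}
     + \underbrace{\mathbb{E}_{s_t, z_t}\,\mathbb{E}_{a_t \sim \pi_t(\cdot \mid s_t, z_t)}
       \big[-\log \pi_t(a_t \mid s_t, z_t)\big]}_{\mathbb{E}_{\boldsymbol{\pi}, \boldsymbol{\rho}}\, H(\pi_t \mid s_t, z_t)} \Big).
\end{aligned}
\]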

4. IMITATION LEARNING FOR MEAN FIELD GAME

This section proposes a new framework based on imitation learning to recover an AMFCE from collected expert demonstrations. To avoid the equilibrium selection problem, we target the MaxEnt-AMFCE solution introduced in Section 3.3. To emphasize the role of the unknown reward function in imitation learning, we use MFRL(r, ρ) to denote the policy of the MaxEnt-AMFCE under the reward function r and correlation device ρ:
\[
\mathrm{MFRL}(r, \boldsymbol{\rho}) = \arg\min_{\boldsymbol{\pi}:\, (\boldsymbol{\pi}, \boldsymbol{\rho}) \in \Pi_{\mathrm{AMFCE}}} \; -\alpha \sum_{t=0}^{T} \mathbb{E}\, H(\pi_t \mid s_t, z_t). \tag{6}
\]
The temperature constant α ≥ 0 controls the strength of the entropy term. The constraint on the AMFCE set makes the optimization problem (6) challenging. To address this, we provide an equivalent formulation in Proposition 2 and derive a Lagrangian reformulation of (6).

4.1. CORRELATED MEAN FIELD IMITATION LEARNING

We denote J(π, ρ) = E[Σ_{t=0}^T γ^t r(s_t, a_t, μ_t)], and R(a_{0:T}, π, ρ) as the margin of expected return between choosing a fixed action sequence a_{0:T} := {a_t}_{t∈T} and following the policy π under the correlation device ρ:
\[
R(a_{0:T}, \boldsymbol{\pi}, \boldsymbol{\rho}) := \mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t, \mu_t) \,\Big|\, a_{0:T}\right] - J(\boldsymbol{\pi}, \boldsymbol{\rho}),
\]
where the expectation is taken with respect to z_t ∼ ρ_t(·), s_t ∼ P(·|s_{t-1}, a_{t-1}, μ_{t-1}), and μ_t = Φ(μ_{t-1}, π_{t-1}, z_{t-1}). Then we obtain an equivalent characterization of AMFCE.

Proposition 2. (π, ρ) is an AMFCE solution if and only if R(a_{0:T}, π, ρ) ≤ 0 for all action sequences a_{0:T}.

The proof is deferred to Appendix D.6. Compared to the original formulation (6), it is easier to work with a dual representation without constraints:
\[
L(\boldsymbol{\pi}, \boldsymbol{\rho}, \lambda, r) := \sum_{\tau_k \in D_E} \lambda(\tau_k)\, \mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t, \mu_t) \,\Big|\, \tau_k\right] - J(\boldsymbol{\pi}, \boldsymbol{\rho}) - \alpha \sum_{t=0}^{T} \mathbb{E}\, H(\pi_t \mid s_t, z_t), \tag{7}
\]
where D_E is a set of action-signal sequences τ_k = {a_0, z_0, a_1, z_1, a_2, z_2, ..., a_T, z_T}. We show that (7) captures the difference of expected returns between two policies by selecting λ as follows.

Theorem 2. For policy π and correlation device ρ, let λ*_π(τ_k) = Π_{t=0}^T ρ_t(z_t) π*_t(a_t|s_t, z_t) be the probability of generating the sequence τ_k when the individual policy is π*. Then we have
\[
L(\boldsymbol{\pi}, \boldsymbol{\rho}, \lambda^*_{\boldsymbol{\pi}}, r)
= \mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t, \mu_t)\right] - J(\boldsymbol{\pi}, \boldsymbol{\rho})
- \alpha \sum_{t=0}^{T} \mathbb{E}_{\boldsymbol{\pi}, \boldsymbol{\rho}}\, H(\pi_t \mid s_t, z_t),
\]
where the expectation is taken with respect to z_t ∼ ρ_t(·), s_t ∼ P(·|s_{t-1}, a_{t-1}, μ_{t-1}), a_t ∼ π*_t(·|s_t, z_t), μ_t = Φ(μ_{t-1}, π_{t-1}, z_{t-1}). The proof of Theorem 2 is deferred to Appendix D.7.

In the setting of imitation learning, the reward signal is not accessible. To construct a suitable reward function rationalizing the expert policy, we define an AMFCE inverse reinforcement learning (AMFCE-IRL) operator, which seeks a reward that maximizes the margin of expected return between the expert policy and all other policies:
\[
\text{AMFCE-IRL}_{\psi}(\boldsymbol{\pi}_E, \boldsymbol{\rho}_E)
= \arg\max_{r} \Big\{ -\psi(r) - \max_{\boldsymbol{\pi}} L(\boldsymbol{\pi}_E, \boldsymbol{\rho}_E, \lambda^*_{\boldsymbol{\pi}}, r) \Big\},
\]
where (π_E, ρ_E) ∈ Π_MaxEnt-AMFCE is the MaxEnt-AMFCE from which expert demonstrations are sampled. We choose a special regularizer (Ho & Ermon, 2016):
\[
\psi_{\mathrm{GA}}(r) =
\begin{cases}
\mathbb{E}\big[\sum_{t=0}^{T} \gamma^t g(r(s_t, a_t, \mu_t))\big] & \text{if } r > 0, \\
+\infty & \text{otherwise},
\end{cases}
\qquad
g(x) =
\begin{cases}
x - \log(1 - e^{-x}) & \text{if } x > 0, \\
+\infty & \text{otherwise}.
\end{cases}
\]
After obtaining the reward function r = AMFCE-IRL_ψ(π_E, ρ_E), we can characterize the AMFCE policy MFRL(r, ρ_E) under the learned r.

Proposition 3. The policy π learned on the reward function recovered by AMFCE-IRL can be characterized as follows:
\[
\mathrm{MFRL} \circ \text{AMFCE-IRL}_{\psi}(\boldsymbol{\pi}_E, \boldsymbol{\rho}_E)
:= \arg\min_{\boldsymbol{\pi}} \max_{r} \; J(\boldsymbol{\pi}_E, \boldsymbol{\rho}_E)
- \mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t, \mu_t)\right] - \psi_{\mathrm{GA}}(r),
\]
where the expectation is taken with respect to z_t ∼ ρ^E_t(·), s_t ∼ P(·|s_{t-1}, a_{t-1}, μ_{t-1}), a_t ∼ π_t(·|s_t, z_t), μ_t = Φ(μ_{t-1}, π^E_{t-1}, z_{t-1}). The objective to recover the MaxEnt-AMFCE is then
\[
\min_{\boldsymbol{\pi}} \max_{\omega} \;
\mathbb{E}_{\boldsymbol{\pi}, \boldsymbol{\rho}_E}\!\left[\sum_{t=0}^{T} \gamma^t \log D_\omega(s_t, a_t, \mu_t)\right]
+ \mathbb{E}_{\boldsymbol{\pi}_E, \boldsymbol{\rho}_E}\!\left[\sum_{t=0}^{T} \gamma^t \log\big(1 - D_\omega(s_t, a_t, \mu_t)\big)\right], \tag{9}
\]
where D_ω is the discriminator network parameterized by ω, with input (s_t, a_t, μ_t) and output a real number in (0, 1]. The first expectation is taken with respect to z_t ∼ ρ^E_t(·), s_t ∼ P(·|s_{t-1}, a_{t-1}, μ_{t-1}), a_t ∼ π_t(·|s_t, z_t), μ_t = Φ(μ_{t-1}, π^E_{t-1}, z_{t-1}). The proof is deferred to Appendix D.8. From a theoretical point of view, we assume that the neural network D_ω has the capacity to approximate the reward function.
Under this assumption, the AMFCE (π_E, ρ_E) can be recovered by optimizing the objective (9). Note that simply applying GAIL to solve the AMFCE cannot recover ρ_E, so we estimate ρ via a gradient-based method (with proof in Appendix D.9).

Proposition 4. If ρ_φ is parameterized by φ, the gradient used to optimize φ given state s is
\[
\mathbb{E}_{z \sim \rho^{\phi}_t(\cdot)}\Big[\nabla_{\phi} \log \rho^{\phi}_t(z)\Big(-\alpha \log \rho^{\phi}_t(z) + \alpha H(\pi_t(a \mid s, z)) + \mathbb{E}_{a \sim \pi_t(\cdot \mid s, z)}\, Q^{\boldsymbol{\pi}}_t(s, a, \mu, z; \boldsymbol{\pi})\Big)\Big]. \tag{10}
\]
We now propose the imitation learning algorithm for the AMFCE (Algorithm 1). It is worth noting that this algorithm can also recover an AMFCE that does not have maximum entropy, by setting α = 0.

Algorithm 1: Correlated mean field imitation learning (CMFIL)
Data: Expert trajectories D_E = {s_0, z_0, a_0, s_1, z_1, a_1, ..., s_T, z_T, a_T}; initial mean field μ_0; weight of gradient penalty β.
Result: Policy π_θ, correlation device ρ_φ.
Initialize the parameter θ of the policy π_θ and the parameter φ of the correlation device ρ_φ;
for each iteration do
    Obtain trajectories from (π_θ, ρ_φ) by the process s_0 ∼ μ_0, z_t ∼ ρ^φ_t(·), a_t ∼ π_θ(·|s_t, z_t), s_{t+1} ∼ P(·|s_t, a_t, μ_t);
    Approximate μ_t with the signature μ̂_t = S({z_i}_{i=0}^t) using (11);
    for i in {0, 1, 2, ...} do
        Update ω to increase the objective
        E_{π_θ, ρ_φ}[Σ_{t=0}^T γ^t log D_ω(s_t, a_t, μ̂_t)] + E_{π_E, ρ_E}[Σ_{t=0}^T γ^t log(1 − D_ω(s_t, a_t, μ̂_t))];
    end
    for t in {0, 1, 2, ...} do
        Update θ by SAC with a small step size:
        E[∇_θ ρ^φ_t(z_t) π^θ_t(a_t|s_t, z_t) Q^{π_θ}_t(s_t, a_t, μ̂_t, z_t; π_θ) + α ∇_θ H(π^θ_t|s_t, z_t)],
        where the expectation is taken with respect to s_0 ∼ μ_0, z_t ∼ ρ^φ_t(·), a_t ∼ π_θ(·|s_t, z_t), s_{t+1} ∼ P(·|s_t, a_t, μ_t);
        Update φ with (10);
    end
end
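To illustrate the core of Algorithm 1, the sketch below shows one discriminator update and one score-function update of the correlation device in the style of (9) and (10). It is a simplified PyTorch sketch under our own module and variable names (Discriminator, rho_logits, advantage), not the authors' code; the SAC policy update and the signature features are omitted.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D_omega(s, a, mu_hat) -> (0, 1), as used in objective (9)."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 1), nn.Sigmoid())
    def forward(self, s, a, mu_hat):
        return self.net(torch.cat([s, a, mu_hat], dim=-1))

def discriminator_step(disc, opt, learner_batch, expert_batch, eps=1e-8):
    """One ascent step on (9): push D up on learner samples, down on expert samples."""
    d_pi = disc(*learner_batch)
    d_exp = disc(*expert_batch)
    loss = -(torch.log(d_pi + eps).mean() + torch.log(1 - d_exp + eps).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()

def correlation_device_step(rho_logits, opt, advantage, alpha=0.1):
    """One step in the spirit of (10) for a single time step.

    rho_logits: learnable logits over Z; advantage[z] estimates
                alpha * H(pi_t(.|s,z)) + E_{a~pi_t}[Q_t(s, a, mu, z)] for each z.
    """
    log_rho = torch.log_softmax(rho_logits, dim=-1)
    weight = (-alpha * log_rho + advantage).detach()   # bracketed term treated as constant
    # descending this loss ascends E_{z~rho}[ grad log rho(z) * weight(z) ], matching (10)
    loss = -(log_rho.exp().detach() * log_rho * weight).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()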

4.2. REPRESENTATION OF THE MEAN FIELD INFORMATION

As the mean field appears in the input of the discriminator D_ω(s, a, μ) in (9), it is necessary to find an efficient way to represent the mean field information. By the Kolmogorov equation (1), the mean field flow {μ_t}_{t=0}^T is deterministic given a fixed correlated signal sequence {z_t}_{t=0}^T and the initial mean field distribution μ_0. Therefore, the mean field distribution μ_t can be characterized by z_{0:t} = {z_i}_{i=0}^t. Motivated by this, we use the signature of z_{0:t} from rough path theory (Kidger & Lyons, 2021; Min & Hu, 2021) to represent the signal sequence and hence to characterize the mean field flow via μ̂_t = S(z_{0:t}). The signature provides a graduated summary of the path z_{0:t}. Therefore, the input of the discriminator D_ω in (9) can be replaced with (s_t, a_t, μ̂_t). It is worth noting that the signature has recently been applied in machine learning to extract characteristic features of sequential data in a non-parametric fashion (Min & Ichiba, 2020; Ni et al., 2020). Using signatures to encode historical information avoids the heavy computational load often incurred in tasks such as training recurrent neural networks. In addition, the training stability can be significantly enhanced since the signature mapping is invariant to time reparameterization.

Definition 6. Let x = {x_1, ..., x_L} with x_i ∈ R^d for all i and L ≥ 2. Denote by f : [0, 1] → R^d the continuous piecewise affine function such that f((i-1)/(L-1)) = x_i for all i ∈ {1, 2, ..., L}. Define
\[
S(f)_{0,1} = (1, M_1, \ldots, M_n, \ldots), \tag{11}
\]
where
\[
M_n = \int_{0 < t_1 < \cdots < t_n < 1} \frac{df}{dt}(t_1) \otimes \cdots \otimes \frac{df}{dt}(t_n)\, dt_1 \cdots dt_n.
\]
The signature of the path x is defined to be S(f)_{0,1}, denoted by S(x). The signature of sequential data contains infinitely many terms, as shown in (11), but fortunately the terms M_n enjoy factorial decay. In practice we select the first n terms of the signature without losing crucial information about the data (Kidger et al., 2019).
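As an illustration of Definition 6, the sketch below computes the truncated (depth-2) signature of a discrete sequence by composing the per-segment signatures of the piecewise-linear interpolation with Chen's identity. It is our own minimal illustration; libraries such as signatory or esig provide optimized implementations, and the depth and feature layout here are illustrative choices rather than the paper's.

import numpy as np

def signature_depth2(x):
    """Depth-2 signature of the piecewise-linear path through the rows of x.

    x: array of shape (L, d), L >= 2.
    Returns (level1, level2): level1 has shape (d,), level2 has shape (d, d).
    Level-1 terms are the total increment; level-2 terms are the iterated
    integrals, built segment by segment via Chen's identity.
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[1]
    level1 = np.zeros(d)
    level2 = np.zeros((d, d))
    for i in range(len(x) - 1):
        delta = x[i + 1] - x[i]               # increment of the i-th linear segment
        # Chen's identity: S2(X*Y) = S2(X) + S1(X) (x) S1(Y) + S2(Y)
        level2 += np.outer(level1, delta) + 0.5 * np.outer(delta, delta)
        level1 += delta
    return level1, level2

# Example: a one-hot-encoded signal sequence z_{0:t} viewed as a path;
# the flattened truncated signature serves as the mean-field feature mu_hat_t.
z_onehot = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
lvl1, lvl2 = signature_depth2(z_onehot)
mu_hat = np.concatenate([lvl1, lvl2.ravel()])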

5. EXPERIMENTS

We evaluate the effectiveness of our algorithm in four environments: Sequential Squeeze (Squeeze for short), Rock-Paper-Scissors (RPS), Flock, and a real-world Traffic Flow Prediction task. We compare our CMFIL framework with MFIRL (Chen et al., 2021), as it is so far the only method that solves MFNE without requiring knowledge of the reward. Since MFIRL does not consider correlated signals, we treat the correlated signal as an extension of the global state for their framework. We also compare CMFIL with MaxEnt ICE, a smoothed multinomial distribution over the joint actions, and logistic regression (Waugh et al., 2013). As MaxEnt ICE is designed to recover correlated equilibria in matrix games, we only compare CMFIL with MaxEnt ICE on tasks that can be reduced to matrix games, such as RPS and Sequential Squeeze with T = {0, 1}. We use the log loss, E_{a∼π_E(·|s,z)}[-log(π̂(a|s,z))], to measure the difference between the recovered policy π̂ and the ground-truth policy π_E. Appendix F contains more details. The first three experiments are numerical experiments; the traffic flow prediction task is to predict the traffic flow of a complex traffic network based on real-world data. Details are presented in Appendix E.

Squeeze: Sequential Squeeze is a multi-step game. The purpose of this experiment is to verify the ability to recover the expert policy from demonstrations sampled from a multi-step game. The learning curve is shown in Fig. 3, and the results are shown in Table 2. The Ocean Ranch game in Example 1 is a special case of Sequential Squeeze where the horizon equals 2.
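Before turning to the individual tasks, we note that the log-loss metric above is a simple cross-entropy between the ground-truth and recovered policies; the sketch below (our own helper, policy_log_loss) averages it over states and signals.

import numpy as np

def policy_log_loss(pi_true, pi_hat, eps=1e-12):
    """Log loss E_{a ~ pi_true(.|s,z)}[-log pi_hat(a|s,z)], averaged over (s, z).

    pi_true, pi_hat: arrays of shape (S, Z, A) with rows summing to 1.
    """
    return float(np.mean(-(pi_true * np.log(pi_hat + eps)).sum(axis=-1)))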

RPS:

This experiment is a traditional mean field game task (Chen et al., 2021; Cui & Koeppl, 2021). The demonstrations are sampled from an MFNE, and the cardinality of the correlated signal set is one. We use RPS to verify that the proposed algorithm can recover expert behavior from demonstrations sampled from an MFNE, which also supports the result in Corollary 1.

Flock:

The experiment is based on the movement of fish (Perrin et al., 2021). In nature, fish spontaneously align their velocity with the overall movement of the school, so that the school eventually forms a stable collective velocity. The provided video shows the convergence process (https://sites.google.com/view/mean-field-imitation-learning/).

Traffic Flow Prediction: In the Traffic Flow Prediction task, we use traffic data of London from Uber Movement. The goal of this experiment is to predict the traffic flow of a real-world traffic network with six locations. Given the large scale and high complexity of this task, we compare CMFIL and MFIRL on it to test their scalability.

The results for the numerical tasks are shown in Table 2. CMFIL outperforms the other methods in general. Supervised learning methods such as logistic regression and the smoothed multinomial distribution easily overfit; they may outperform CMFIL on some metrics but suffer from a higher loss than CMFIL overall. MFIRL shows larger deviations and higher loss than CMFIL in Table 2 and Table 3. The reason is that MFIRL cannot recover the AMFCE and cannot handle correlated signals properly: although we treat the correlated signal as an extension of the state, the reward recovered by MFIRL is biased because the ground-truth reward is independent of the correlated signal. Furthermore, CMFIL adds a regularizer ψ on the reward function to avoid overfitting, so it also outperforms MFIRL on RPS, in which the expert demonstrations are sampled from an MFNE. MaxEnt ICE also performs poorly because it assumes a linear reward structure and therefore has a limited reward function class. Figure 1 shows that CMFIL can recover the correlation device with a fast convergence speed.



Figure 1: The distribution of correlation device ρ recovered by CMFIL.



Table 2: Results for numerical tasks.

Table 3: The results of predicted traffic flow for the Traffic Network.

