MULTI-AGENT DEEP FBSDE REPRESENTATION FOR LARGE SCALE STOCHASTIC DIFFERENTIAL GAMES

Abstract

In this paper we present a deep learning framework for solving large-scale multi-agent non-cooperative stochastic games using fictitious play. The Hamilton-Jacobi-Bellman (HJB) PDE associated with each agent is reformulated into a set of Forward-Backward Stochastic Differential Equations (FBSDEs) and solved via forward sampling on a suitably defined neural network architecture. Decision making in multi-agent systems suffers from the curse of dimensionality and strategy degeneration as the number of agents and the time horizon increase. We propose a novel Deep FBSDE controller framework which is shown to outperform the current state-of-the-art deep fictitious play algorithm on a high-dimensional interbank lending/borrowing problem. More importantly, our approach mitigates the curse of many agents and reduces computational and memory complexity, allowing us to scale up to 1,000 agents in simulation, a scale which, to the best of our knowledge, represents a new state of the art. Finally, we showcase the framework's applicability in robotics on a belief-space autonomous racing problem.

1. INTRODUCTION

Stochastic differential games provide a framework for investigating scenarios in which multiple players make decisions while operating in a dynamic and stochastic environment. The theory of differential games dates back to the seminal work of Isaacs (1965) studying two-player zero-sum dynamic games, with a first stochastic extension appearing in Kushner & Chamberlain (1969). A key step in the study of games is obtaining the Nash equilibrium among players (Osborne & Rubinstein, 1994). A Nash equilibrium represents the solution of a non-cooperative game involving two or more players: no player can benefit by unilaterally modifying his/her own strategy while the opponents hold their equilibrium strategies. In the context of adversarial multi-objective games, the Nash equilibrium can be represented as a system of coupled Hamilton-Jacobi-Bellman (HJB) equations when the system satisfies the Markovian property. Analytic solutions exist only for a few special cases. Therefore, the Nash equilibrium is usually obtained numerically, which can become challenging as the number of states/agents increases. Despite extensive theoretical work, the algorithmic side has received less attention and mainly addresses special cases of differential games (e.g., Duncan & Pasik-Duncan (2015)), or suffers from the curse of dimensionality (Kushner, 2002). Nevertheless, stochastic differential games have a variety of applications, including in robotics and autonomy, economics, and management. Relevant examples include Mataramvura & Øksendal (2008), who formulate portfolio management as a stochastic differential game in order to obtain a market portfolio that minimizes the convex risk measure of a terminal wealth index value, as well as Prasad & Sethi (2004), who investigate optimal advertising spending in duopolistic settings via stochastic differential games.
Reinforcement Learning (RL) aims at obtaining a policy which can generate optimal sequential decisions while interacting with the environment. Commonly, the policy is trained by collecting histories of states, actions, and rewards, and updating the policy accordingly. Multi-agent Reinforcement Learning (MARL) is an extension of RL in which several agents compete in a common environment, a more complex task due to the interaction of the agents with the environment as well as with each other. One approach is to treat the other agents as part of the environment (Tan, 1993), but this may lead to unstable learning during policy updates (Matignon et al., 2012). On the other hand, a centralized approach considers MARL through an augmented state and action system, reducing its training to that of a single-agent RL problem. Because of the combinatorial complexity, the centralized learning method cannot scale to more than 10 agents (Yang et al., 2019). Another method is centralized training with decentralized execution (CTDE); however, the challenge therein lies in how to decompose the value function in the execution phase for value-based MARL. Sunehag et al. (2018) and Zhou et al. (2019) decompose the joint value function into a summation of individual value functions. Rashid et al. (2018) keep the monotonic trends between centralized and decentralized value functions by augmenting the summation non-linearly and designing a mixing network (QMIX). Further modifications of QMIX include Son et al. (2019); Mahajan et al. (2019). The mathematical formulation of a differential game leads to a nonlinear PDE. This motivates algorithmic development for differential games that combines elements of PDE theory with deep learning. Recent encouraging results (Han et al., 2018; Raissi, 2018) in solving nonlinear PDEs within the deep learning community illustrate the scalability and numerical efficiency of neural networks.
The transition from a PDE formulation to a trainable neural network is done via the concept of a system of Forward-Backward Stochastic Differential Equations (FBSDEs). Specifically, certain PDE solutions are linked to solutions of FBSDEs, and the latter can be solved using a suitably defined neural network architecture. This is known in the literature as the deep FBSDE approach. Han et al. (2018); Pereira et al. (2019); Wang et al. (2019b) utilize various deep neural network architectures to solve such stochastic systems. However, these algorithms address single-agent dynamical systems. Two-player zero-sum games using FBSDEs were initially developed in Exarchos et al. (2019) and transferred to a deep learning setting in Wang et al. (2019a). Recently, Hu (2019) brought deep learning into fictitious play to solve multi-agent non-zero-sum games, Han & Hu (2019) introduced deep FBSDEs to the multi-agent scenario together with the concept of fictitious play, and Han et al. (2020) provided a convergence proof. In this work we propose an alternative deep FBSDE approach to multi-agent non-cooperative differential games, aiming at reducing complexity and increasing the number of agents the framework can handle. The main contribution of our work is threefold:
1. We introduce an efficient Deep FBSDE framework for solving stochastic multi-agent games via fictitious play that outperforms the current state of the art in Relative Square Error (RSE) and runtime/memory efficiency on an inter-bank lending/borrowing example.
2. We demonstrate that our approach scales to a much larger number of agents (up to 1,000 agents, compared to 50 in existing work). To the best of our knowledge, this represents a new state of the art.
3. We showcase the applicability of our framework to robotics on a belief-space autonomous racing problem with larger individual control and state spaces.
The experiments demonstrate that the decoupled BSDE structure opens the possibility of applications to competitive scenarios. The rest of the paper is organized as follows: in Section 2 we present the mathematical preliminaries. In Section 3 we introduce the Deep Fictitious Play Belief FBSDE, with simulation results following in Section 4. We conclude the paper and discuss some future directions in Section 5.

2. MULTI-AGENT FICTITIOUS PLAY FBSDE

Fictitious play is a learning rule first introduced in Brown (1951) in which each player presumes the other players' strategies to be fixed. An N-player game can then be decoupled into N individual decision-making problems which can be solved iteratively over M stages. When an agent converges to a stationary strategy at stage m, this strategy becomes the presumed stationary strategy of the other players at stage m + 1. We consider an N-player non-cooperative stochastic differential game with dynamics

dX(t) = \big[ f(X(t), t) + G(X(t), t)\, U(t) \big]\, dt + \Sigma(X(t), t)\, dW(t), \qquad X(0) = X_0,

where X = (x_1, x_2, \ldots, x_N) is a vector containing the state processes of all agents generated by their controls U = (u_1, u_2, \ldots, u_N), with x_i \in \mathbb{R}^{n_x} and u_i \in \mathbb{R}^{n_u}. Here, f : \mathbb{R}^{n_x} \times [0, T] \to \mathbb{R}^{n_x} represents the drift dynamics, G : \mathbb{R}^{n_x} \times [0, T] \to \mathbb{R}^{n_x \times n_u} represents the actuator dynamics, and \Sigma : \mathbb{R}^{n_x} \times [0, T] \to \mathbb{R}^{n_x \times n_w} represents the diffusion term. We assume that each agent is driven only by its own controls, so G is a block-diagonal matrix with G_i corresponding to the actuation of agent i. Each agent is also driven by its own n_w-dimensional independent Brownian motion W_i, and we denote W = (W_1, W_2, \ldots, W_N). Let \mathcal{U}_i be the set of admissible strategies for agent i \in I := \{1, 2, \ldots, N\} and \mathcal{U} = \otimes_{i=1}^N \mathcal{U}_i the product space of the \mathcal{U}_i. Given the other agents' strategies, the stochastic optimal control problem for agent i under the fictitious play assumption is defined as minimizing the expected cumulative cost

J^i_t(X, u_{i,m}; u_{-i,m-1}) = \mathbb{E}\Big[ g(X(T)) + \int_t^T C_i\big(X(\tau), u_{i,m}(X(\tau), \tau), \tau; u_{-i,m-1}\big)\, d\tau \Big],

where g : \mathbb{R}^{n_x} \to \mathbb{R}^+ is the terminal cost and C_i : [0, T] \times \mathbb{R}^{n_x} \times \mathcal{U} \to \mathbb{R}^+ is the running cost of the i-th player. In this paper we assume that the running cost is of the form

C(X, u_{i,m}, t) = q(X) + \tfrac{1}{2}\, u_{i,m}^\top R\, u_{i,m} + X^\top Q\, u_{i,m}.
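The joint dynamics above can be discretized with a standard Euler-Maruyama scheme. The following numpy sketch (the function name, toy drift, and coefficient choices are illustrative, not from the paper) shows one step for a two-agent system with block-diagonal actuation:

```python
import numpy as np

def euler_maruyama_step(X, U, f, G, Sigma, dt, rng):
    """One Euler-Maruyama step of the joint dynamics
    dX = (f(X) + G(X) U) dt + Sigma(X) dW  (illustrative helper)."""
    dW = rng.normal(0.0, np.sqrt(dt), size=Sigma(X).shape[1])
    return X + (f(X) + G(X) @ U) * dt + Sigma(X) @ dW

# Toy 2-agent example: each agent has a scalar state and scalar control;
# G is block diagonal, so each agent is driven only by its own control.
f = lambda X: -X                      # stable drift (placeholder)
G = lambda X: np.eye(2)               # block-diagonal actuation
Sigma = lambda X: 0.1 * np.eye(2)     # independent noise per agent
rng = np.random.default_rng(0)
X = np.array([1.0, -1.0])
X_next = euler_maruyama_step(X, np.zeros(2), f, G, Sigma, 1e-2, rng)
```

With the noise switched off, the step reduces to a deterministic Euler update, which makes the scheme easy to unit-test.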
We use the double subscript u_{i,m} to denote the control of agent i at stage m, and the negative subscript -i for the strategies excluding player i, u_{-i} = (u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_N). The value function of each player is defined as

V^i(t, X(t)) = \inf_{u_{i,m} \in \mathcal{U}_i} J^i_t(X, u_{i,m}; u_{-i,m-1}), \qquad V^i(T, X(T)) = g(X(T)).

Assume that the value function in eq. (3) is once differentiable w.r.t. t and twice differentiable w.r.t. X. Then, standard stochastic optimal control theory leads to the HJB PDE

V^i_t + h + V_x^{i\top}(f + G U_{0,-i}) + \tfrac{1}{2}\mathrm{tr}(V^i_{xx} \Sigma \Sigma^\top) = 0, \qquad V^i(T, X) = g(X(T)),

where h = C^{i*} + V_x^{i\top} G U_{*,0}. The double subscript of U_{*,0} denotes the augmentation of the optimal control u^*_{i,m} = -R^{-1}(G_i^\top V^i_x + Q_i^\top x) with the zero control u_{-i,m-1} = 0, and U_{0,-i} denotes the augmentation of u_{i,m} = 0 with u_{-i,m-1}. Here we drop the functional dependencies in the HJB equation for simplicity. The detailed proof is in Appendix A. The value function in the HJB PDE can be related to a set of FBSDEs,

dX = (f + G U_{*,-i})\, dt + \Sigma\, dW, \qquad X(0) = x_0,
dV^i = (-h + V_x^{i\top} G U_{*,0})\, dt + V_x^{i\top} \Sigma\, dW, \qquad V^i(T) = g(X(T)),

where the backward process corresponds to the value function. The detailed derivation can be found in Appendix B. Note that the FBSDEs here differ from those of Han & Hu (2019) by the inclusion of the optimal control of agent i in the forward drift (G U_{*,-i} instead of G U_{0,-i}) and the corresponding compensation, V_x^{i\top} G U_{*,0}, in the backward process. This is known as importance sampling for FBSDEs and allows the FBSDEs to be guided to explore the state space more efficiently.

3. DEEP FICTITIOUS PLAY FBSDE CONTROLLER

In this section, we introduce a novel and scalable Deep Fictitious Play FBSDE (SDFP) controller for the multi-agent stochastic optimal control problem. The framework can be extended to the partially observable scenario by combining it with an Extended Kalman Filter, whose belief propagation is described by an SDE for the mean and covariance (see the derivation in Appendix C). By the nature of the decoupled BSDE, the framework also extends to cooperative and competitive scenarios; in this paper, we demonstrate a competitive scenario.

3.1. NETWORK ARCHITECTURE AND ALGORITHM

Inspired by the success of LSTM-based deep FBSDE controllers (Wang et al., 2019b; Pereira et al., 2019), we propose an approach based on an LSTM architecture similar to Pereira et al. (2019). The benefits of introducing LSTM are two-fold: 1) LSTM can capture the features of sequential data; a performance comparison between LSTM and fully connected (FC) layers in the deep FBSDE framework is given in Wang et al. (2019b); 2) LSTM significantly reduces the memory complexity of our model, since its memory complexity with respect to time is O(1) in the inference phase, compared with O(T) in previous work (Han et al., 2018), where T is the number of time steps. The overall architecture of SDFP is shown in Fig. 1 and features the same time discretization scheme as Pereira et al. (2019). Each player's policy is characterized by its own copy of the network. During training within a stage, the initial value of each player is predicted by an FC layer parameterized by φ. At each timestep, the optimal policy for each player is computed using the value function gradient prediction V^i_x from the recurrent network (consisting of FC and LSTM layers), parameterized by θ. The FSDE and BSDE are then forward-propagated using the Euler integration scheme. At terminal time T, the loss function for each player is constructed as the mean squared error between the propagated terminal value V^i_T and the true terminal value V^{i*}_T computed from the terminal state. The parameters φ and θ of each player can be trained using any stochastic-gradient-descent-type optimizer such as Adam. The detailed training procedure is shown in Algorithm 2.

(Figure 1: the unrolled SDFP network, propagating X_t, U_t, and V_t through shared LSTM/FC blocks at each timestep, with the terminal loss L(V_T, V^*_T).)
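The Euler roll-out of the backward process and the terminal loss described above can be sketched in a few lines. This numpy fragment only mimics the forward pass (the actual implementation differentiates through the LSTM network with an optimizer such as Adam), and all names are illustrative:

```python
import numpy as np

def propagate_value(V0, Vx_path, Sigma, dW_path, h_path, dt):
    """Euler propagation of a backward process of the form
    dV = -h dt + Vx^T Sigma dW, rolled forward in time.
    The *_path arrays are (T, ...) sequences that a trained network
    would produce; here they are plain inputs (numpy sketch)."""
    V = V0
    for Vx, dW, h in zip(Vx_path, dW_path, h_path):
        V = V - h * dt + Vx @ (Sigma @ dW)
    return V

def terminal_loss(V_T, g_XT):
    """Mean squared error between propagated and true terminal values."""
    return np.mean((V_T - g_XT) ** 2)
```

With zero running cost and zero noise, the value stays at its initial prediction, and identical terminal values give zero loss, which is the fixed point the training drives toward.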

3.2. MITIGATING CURSE OF DIMENSIONALITY AND SAMPLE COMPLEXITY

Scalability and sample efficiency are two crucial criteria in reinforcement learning. In SDFP, as the number of agents increases, the number of neural network copies increases correspondingly. Meanwhile, the size of each neural network must grow to gain enough capacity to capture the representation of many agents, leading to the infamous curse of dimensionality; this limits the scalability of prior work. However, one can mitigate the curse of dimensionality by taking advantage of a symmetric game setup. We summarize the merits of a symmetric game as follows: 1. Since all agents have the same dynamics and cost function, only one copy of the network is needed; the strategies of the other agents can be inferred by applying the same network. 2. Thanks to the symmetry, we can apply an invariant layer to extract permutation-invariant features, accelerating training and improving performance with respect to the accumulated cost and RSE loss. Sharing one network: It is important to note that querying the other agents should not introduce additional gradient paths; this significantly reduces the memory complexity. When querying the other agents' strategies, one can either iterate through each agent or feed all agents' states to the network in a batch. The latter approach reduces time complexity by exploiting the parallel nature of modern GPUs, but requires O(N^2) memory rather than the O(N) of the first approach. Invariant layers: The memory complexity can be further reduced with an invariant layer embedding (Zaheer et al., 2017). The invariant layer applies a sum over the features of the elements of a set to render the network invariant to permutations of the agents. We apply the invariant layer to X_{-i} and concatenate the resulting features with the features extracted from X_i. However, a vanilla invariant layer embedding does not by itself reduce the memory complexity.
Thanks to the symmetric problem setup, one can apply a trick to reduce the invariant layer memory complexity from O(N^2) to O(N). A detailed introduction to the invariant layer and our implementation can be found in Appendices D and E. The full algorithm is outlined in Algorithm 1.

Algorithm 1 (sketch):
  while fictitious play stages remain do
    for all i ∈ I in parallel do
      Collect the opponent agents' policy, which is the same as the i-th policy: f^{m-1}_{LSTM_i}(·), f^{m-1}_{FC_i}(·)
      for l ← 1 to N_gd do
        for t ← 1 to T - 1 do
          if using the invariant layer then
            X_t = f_invariant(X_t)
          end if
          for j ← 1 to B in parallel do
            Compute the network prediction for the i-th player: V^j_{x_i,t} = f^m_{FC_i}(f^m_{LSTM_i}(X^j_t; θ^{l-1}_i))
            Compute the i-th optimal control: u^{j,*}_{i,t} = -R_i^{-1}(G_i^⊤ V^j_{x_i,t} + Q_i^⊤ x^j_i)
            Infer the -i-th players' network predictions with stopped gradients: V_{x_{-i},t} = f^{m-1}_{FC_i}(f^{m-1}_{LSTM_i}(X_t; θ_i))
            Compute the -i-th optimal controls with stopped gradients: u^{j,*}_{-i,t} = -R_{-i}^{-1}(G_{-i}^⊤ V^j_{x_{-i},t} + Q_{-i}^⊤ x^j_{-i})
            Sample noise ΔW^j ∼ N(0, Δt)
            Propagate the FSDE: X^j_{t+1} = f_FSDE(X^j_t, u^{j,*}_{i,t}, u^{j,*}_{-i,t}, ΔW^j, t)
            Propagate the BSDE: V^j_{i,t+1} = f_BSDE(X^j_t, u^{j,*}_{i,t}, ΔW^j, t)
          end for
        end for
        Compute the loss: L = (1/B) Σ_{j=1}^B (V^{j,*}_T - V^j_{i,T})^2
        Gradient update: θ^l, φ^l
      end for
    end for
  end while

4. SIMULATION RESULTS

In this section, we demonstrate the capability of SDFP on two different systems in simulation. We first apply the framework to an inter-bank lending/borrowing problem, a classical multi-player non-cooperative game with an analytic solution. We compare against both the analytic solution and prior work (Han & Hu, 2019). The different approaches introduced in Section 3.2 are compared empirically on this system. We also apply the framework to a variation of the problem for which no analytic solution exists. Finally, we showcase the general applicability of our framework on an autonomous racing problem in belief space. All experiment configurations can be found in the Appendix. We first consider an inter-bank lending and borrowing model (Carmona et al., 2013) in which the dynamics of the log-monetary reserves of N banks are described by the diffusion process

dX^i_t = \big[ a(\bar X_t - X^i_t) + u^i_t \big]\, dt + \sigma\big(\rho\, dW^0_t + \sqrt{1 - \rho^2}\, dW^i_t\big), \qquad \bar X_t = \frac{1}{N}\sum_{i=1}^N X^i_t, \qquad i \in I.

The state X^i_t \in \mathbb{R} denotes the log-monetary reserve of bank i at time t > 0. The control u^i_t denotes the cash flow to/from a central bank, whereas a(\bar X_t - X^i_t) denotes the lending/borrowing rate of bank i from all other banks. The system is driven by N independent standard Brownian motions W^i_t, which model the idiosyncratic noise, and a common noise W^0_t. The cost function has the form

C_{i,t}(X, u_i; u_{-i}) = \tfrac{1}{2} u_i^2 - q\, u_i (\bar X - X^i) + \tfrac{\epsilon}{2} (\bar X - X^i)^2.

The derivation of the FBSDEs and analytic solution can be found in Appendix F. We compare the results of the implementation corresponding to Algorithm 2 on a 10-agent problem with the analytic solution and previous work from Han & Hu (2019) under the same hyperparameters. Fig. 3 shows the performance of our method compared with the analytic solution: the state and control trajectories produced by the deep FBSDE solution align closely with the analytic solution. Table 1 shows the numerical performance compared with prior work by Han & Hu (2019).
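As a rough sanity check of the model, the log-reserve dynamics can be simulated with an Euler scheme. The following numpy sketch (function name and parameter values are illustrative, not the paper's experiment configuration) exhibits the mean-reverting behavior of the uncontrolled system:

```python
import numpy as np

def simulate_interbank(N=10, T=1.0, steps=100, a=1.0, sigma=0.2, rho=0.5,
                       u=None, seed=0):
    """Euler simulation of the inter-bank log-reserve dynamics
    dX_i = [a(Xbar - X_i) + u_i]dt + sigma(rho dW0 + sqrt(1-rho^2) dW_i).
    `u` is an optional feedback control u(X, t); illustrative sketch."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    X = rng.normal(0.0, 1.0, size=N)          # initial log reserves
    for k in range(steps):
        Xbar = X.mean()
        ctrl = np.zeros(N) if u is None else u(X, k * dt)
        dW0 = rng.normal(0.0, np.sqrt(dt))    # common noise
        dWi = rng.normal(0.0, np.sqrt(dt), size=N)  # idiosyncratic noise
        X = X + (a * (Xbar - X) + ctrl) * dt \
              + sigma * (rho * dW0 + np.sqrt(1 - rho**2) * dWi)
    return X
```

With the noise switched off, the lending/borrowing term contracts every reserve toward the population average, which is the qualitative behavior the analytic solution exhibits.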
Our method outperforms on the Relative Square Error (RSE) metric and computation wall time. The RSE is defined as

RSE = \frac{\sum_{i \in I} \sum_{1 \le j \le B} \big(\hat V^i(0, X^j(0)) - V^i(0, X^j(0))\big)^2}{\sum_{i \in I} \sum_{1 \le j \le B} \big(V^i(0, X^j(0)) - \bar V^i\big)^2},

where V^i is the analytic value function of the i-th agent at initial state X^j(0), \hat V^i is the value function approximated by the FBSDE controller, and \bar V^i is the average of the analytic solution for the i-th agent over the entire batch. The initial states X^j(0) form a new batch of data sampled from the same distribution as X(0) in the training phase. The batch size B is 256 for all inter-bank simulations. Time/memory complexity analysis: We empirically verify the time and memory complexity of the different implementation approaches introduced in Section 3.2, shown in Fig. 4.

(Figure 5: RSE and total loss comparison between our FBSDE framework and that of the baseline Han & Hu (2019), as a function of the number of agents.)

Note that all experiments from here on correspond to the symmetric SDFP implementation in Algorithm 1. We also test the sample efficiency and generalization capability of the invariant layer on a 50-agent problem trained over 100 stages. The number of initial states is limited during training, and the evaluation criterion is the terminal cost on a test set whose initial states differ from those seen during training. Fig. 2 showcases the improvement in sample efficiency and generalization performance due to the invariant layer. We suspect this is because without it the network must learn with respect to a specific permutation of the input, whereas permutation invariance is built into the network architecture with the invariant layer. Importance sampling: An important distinction of SDFP from the baseline in Han & Hu (2019) is the importance sampling scheme, which helps the LSTM architecture achieve a fast convergence rate during training.
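The RSE metric can be computed directly from batched value predictions. A minimal numpy implementation, assuming agents are stored along rows and batch samples along columns (the function name is illustrative):

```python
import numpy as np

def relative_square_error(V_hat, V_true):
    """RSE over agents (rows) and batch samples (columns):
    sum_ij (V_hat - V_true)^2 / sum_ij (V_true - Vbar_i)^2,
    where Vbar_i is the batch average of the analytic value for agent i."""
    V_hat, V_true = np.asarray(V_hat), np.asarray(V_true)
    num = np.sum((V_hat - V_true) ** 2)
    den = np.sum((V_true - V_true.mean(axis=1, keepdims=True)) ** 2)
    return num / den
```

A perfect prediction gives RSE = 0, and the normalization by per-agent batch variance makes the metric comparable across problem sizes.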
The baseline, however, which uses fully connected layers as its backbone, is not suitable for importance sampling, as this would lead to an extremely deep fully connected network from a gradient-topology perspective. Sandler et al. (2018) note that information loss is inevitable in such deep fully connected networks with nonlinear activations. LSTM, on the other hand, does not suffer from this problem thanks to its long- and short-term memory. We illustrate the benefits of importance sampling for the LSTM backbone and the gradient flow of the fully connected backbone in Appendix I.

High dimension experiment:

We also analyze the performance of our framework and that of Han & Hu (2019), both with and without the invariant layer, on high-dimensional problems. We first demonstrate how the invariant layer mitigates the curse of many agents. Fig. 5 shows the ablation experiment over the two deep FBSDE frameworks (SDFP and Han & Hu (2019)). To illustrate that the invariant layer can mitigate the curse of dimensionality, we also integrate invariant layers into Han & Hu (2019) and report the performance in the same figure. In this experiment, the weights of the FBSDE frameworks with the invariant layer are adjusted in order to rule out performance improvements resulting merely from the additional weights of the invariant layers. The total cost and RSE are computed by averaging the corresponding values over the last twenty stages of each run. It can be observed from the plot that without the invariant layer the framework suffers from the curse of many agents in the prediction of the initial value, as the RSE grows with the number of agents; with the invariant layer, the RSE increases at a much slower rate. In terms of total cost, computed from the cost function defined in eq. (2), our framework enjoys the benefits of importance sampling and the invariant layer, and achieves better numerical results across the number of agents. We further analyze the influence of the invariant layer during training in Fig. 7: the invariant layer helps mitigate over-fitting in the training and evaluation phases while accelerating the training process, even though both frameworks adopt the same feature-extracting backbone (LSTM) and importance sampling technique. We also show empirically that the invariant layer accelerates training on a 500-agent problem. Fig. 6 shows that the FBSDE frameworks converge much faster with the invariant layer than without it. We suspect this acceleration results from increased sample efficiency.
Note that the comparison is done on a 500-agent problem because the framework does not scale to 1,000 agents without the invariant layer. A comparison of the two frameworks, both with the invariant layer, on a 1,000-agent problem can be found in Fig. 16, which shows results similar to the 500-agent problem. Superlinear simulation: We also consider a variant of the dynamics in Section 4.1,

dX^i_t = \big[ a(\bar X_t - X^i_t)^3 + u^i_t \big]\, dt + \sigma\big(\rho\, dW^0_t + \sqrt{1 - \rho^2}\, dW^i_t\big), \qquad \bar X_t = \frac{1}{N}\sum_{i=1}^N X^i_t, \qquad i \in I.

Due to the nonlinearity in the drift term, no analytic solution or simple numerical representation of the Nash equilibrium exists (Han & Hu, 2019). The drift rate a is set to 1.0 to compensate for the vanishing drift caused by the super-linearity. Heuristically, the distributions of control and state should be more concentrated than those of the linear dynamics. We compare the state and control of a fixed agent i at terminal time against the analytic and deep FBSDE solutions of the linear dynamics with the same coefficients. Fig. 8 is generated by evaluating the trained deep FBSDE model with a batch size of 50,000. As expected, the solution under super-linear dynamics is more concentrated. The terminal control distribution verifies that the super-linear drift pushes the state back to the average faster than the linear dynamics and thus requires less control effort. Since a numerical solution is not available in the super-linear case, we compare the total loss and training loss between the baseline of Han & Hu (2019) and our algorithm in Fig. 12 in the appendix. The framework for the racing problem is trained with a batch size of 64 and 100 time steps over a time horizon of 10 seconds. Since all trials run over one lap of the circular track, we only show the first 8 seconds for clarity. Fig. 13 demonstrates the capability of our framework: when there is no competition loss, both cars stay on the track.
Since there is no competition between the two cars, they demonstrate similar behaviors. When we add a competition loss to both cars, both try to cut the corner in order to occupy the leading position, as shown in the second plot of Fig. 13. If the competition loss is present in only one of the two cars, the car with the competition loss dominates the game, as shown in the bottom subplots of Fig. 13. Notably, the simulation runs in belief space, where all states are estimated under observation noise and additive system noise. These results emphasize the generalization ability of our framework on more complex systems with higher state and control dimensions. Fig. 9 shows a single trajectory of each car's posterior distribution.
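Returning to the super-linear variant above: its concentration effect can be sanity-checked by integrating the two drifts deterministically (noise and control switched off to isolate the drift). In this illustrative numpy sketch, deviations from the mean larger than one contract faster under the cubic drift, consistent with the more concentrated terminal distribution:

```python
import numpy as np

def integrate_drift(x0, drift, T=1.0, steps=1000):
    """Deterministic Euler integration of dx/dt = drift(x) for a batch of
    initial deviations from the mean (illustrative helper)."""
    x, dt = np.array(x0, dtype=float), T / steps
    for _ in range(steps):
        x = x + drift(x) * dt
    return x

# Deviations from the population average; a = 1 as in the experiment.
dev0 = np.array([2.0, -2.0])
lin = integrate_drift(dev0, lambda x: -x)      # a(Xbar - X) with Xbar = 0
cub = integrate_drift(dev0, lambda x: -x**3)   # a(Xbar - X)^3
```

For |deviation| > 1 the cubic drift dominates the linear one, so the terminal deviations are smaller in magnitude; for small deviations the ordering reverses, which is why the overall effect on the full distribution is best seen empirically as in Fig. 8.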

5. CONCLUSION

In this paper, we propose a scalable deep learning framework for solving multi-agent stochastic differential games using fictitious play. The framework relies on the FBSDE formulation with importance sampling for sufficient exploration. In the symmetric game setup, an invariant layer is incorporated to render the framework agnostic to permutations of the agents and to further reduce the memory complexity. The scalability of this algorithm, along with a detailed sensitivity analysis, is demonstrated on an inter-bank borrowing/lending example. The framework achieves lower loss and scales to much higher dimensions than the state of the art. The general applicability of the framework is showcased on a belief-space autonomous racing problem in simulation.

APPENDIX A MULTI-AGENT HJB DERIVATION

Applying Bellman's principle to the value function in equation 3:

V^i(t, X(t)) = \inf_{u_i \in \mathcal{U}_i} \mathbb{E}\Big[ V^i(t + dt, X(t + dt)) + \int_t^{t+dt} C^i\, d\tau \Big]
= \inf_{u_i \in \mathcal{U}_i} \mathbb{E}\Big[ C^i\, dt + V^i(t, X(t)) + V^i_t(t, X(t))\, dt + V_x^{i\top}(t, X(t))\, dX + \tfrac{1}{2}\mathrm{tr}\big(V^i_{xx}(t, X(t)) \Sigma \Sigma^\top\big)\, dt \Big]
= \inf_{u_i \in \mathcal{U}_i} \mathbb{E}\Big[ C^i\, dt + V^i + V^i_t\, dt + V_x^{i\top}\big((f + GU)\, dt + \Sigma\, dW\big) + \tfrac{1}{2}\mathrm{tr}\big(V^i_{xx} \Sigma \Sigma^\top\big)\, dt \Big]
= \inf_{u_i \in \mathcal{U}_i} \Big[ C^i\, dt + V^i + V^i_t\, dt + V_x^{i\top}(f + GU)\, dt + \tfrac{1}{2}\mathrm{tr}\big(V^i_{xx} \Sigma \Sigma^\top\big)\, dt \Big]

\Rightarrow \quad 0 = V^i_t + \inf_{u_i \in \mathcal{U}_i} \Big[ C^i + V_x^{i\top}(f + GU) + \tfrac{1}{2}\mathrm{tr}\big(V^i_{xx} \Sigma \Sigma^\top\big) \Big].

Given the cost function assumption, the infimum is attained explicitly by the optimal control u^*_{i,m} = -R^{-1}(G_i^\top V^i_x + Q_i^\top x). With that we obtain the final form of the HJB PDE,

V^i_t + h + V_x^{i\top}(f + G U_{0,-i}) + \tfrac{1}{2}\mathrm{tr}(V^i_{xx} \Sigma \Sigma^\top) = 0, \qquad V^i(T, X) = g(X(T)).

B FBSDE DERIVATION

Given the HJB PDE in equation 4, one can apply the nonlinear Feynman-Kac lemma (Han & Hu, 2019) to obtain a set of FBSDEs,

dX(t) = (f + G U_{0,-i})\, dt + \Sigma\, dW, \qquad X(0) = x_0,
dV^i = -h\, dt + V_x^{i\top} \Sigma\, dW, \qquad V^i(T) = g(X(T)).

Note that the forward process X is driven by the controls of all agents other than i. This means that agent i searches the state space by Brownian motion alone while responding to the other agents' strategies. To increase the efficiency of the search, one can add any control for agent i to guide its exploration, as long as the backward process is compensated accordingly. In this work, since we consider problems with a closed-form solution for the optimal control u_{i,m}, we add it to the forward process, yielding the importance-sampled set of FBSDEs

dX = (f + G U_{*,-i})\, dt + \Sigma\, dW, \qquad X(0) = x_0,
dV^i = (-h + V_x^{i\top} G U_{*,0})\, dt + V_x^{i\top} \Sigma\, dW, \qquad V^i(T) = g(X(T)).

C CONTINOUS TIME EXTENDED KALMAN FILTER

Partially observable Markov decision processes are in general difficult to solve, since the belief lives in an infinite-dimensional space and the value function typically admits no explicit parameterized form. The Kalman filter overcomes this challenge by presuming the noise distributions to be Gaussian. In order to deploy the proposed FBSDE model in belief space, we correspondingly utilize the continuous-time extended Kalman filter (Jazwinski, 1970). Consider the partially observable stochastic system

\dot x = f(x, u, w, t), \qquad z = h(x, v, t),

where f is the stochastic state process driven by Gaussian noise w \sim N(0, Q), and h is the observation function with observation noise v \sim N(0, R). Next, we consider the linearization of the stochastic dynamics in equation 20:

A = \frac{\partial f}{\partial x}\Big|_{\hat x}, \quad L = \frac{\partial f}{\partial w}\Big|_{\hat x}, \quad C = \frac{\partial h}{\partial x}\Big|_{\hat x}, \quad M = \frac{\partial h}{\partial v}\Big|_{\hat x}, \quad \tilde Q = L Q L^\top, \quad \tilde R = M R M^\top. \tag{16}

One can then write the update rules for the posterior mean state \hat x and prior covariance matrix P^- following Simon (2006):

\hat x(0) = \mathbb{E}[x(0)], \qquad P^-(0) = \mathbb{E}\big[(x(0) - \hat x)(x(0) - \hat x)^\top\big],
K = P^- C^\top \tilde R^{-1},
\dot{\hat x} = f(\hat x, u, w_0, t) + K\big[z - h(\hat x, v_0, t)\big],
\dot P^- = A P^- + P^- A^\top + \tilde Q - P^- C^\top \tilde R^{-1} C P^-.

We follow the notation in Simon (2006), where x is the true state, \hat x is the mean of the state estimated by the Kalman filter from the noisy sensor observations, and P^- is the covariance matrix of the estimated state; the nominal noise values are w_0 = 0 and v_0 = 0, and the superscripts + and - denote the posterior and prior estimates, respectively.
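The covariance update above is a continuous-time Riccati equation. The following numpy sketch (function name and scalar example are illustrative; A and C are assumed pre-linearized) integrates it forward and checks the scalar steady-state condition 2aP + Q - P^2/R = 0:

```python
import numpy as np

def kalman_bucy_covariance(A, C, Q, R, P0, T=20.0, steps=20000):
    """Euler integration of the continuous-time Riccati equation
    Pdot = A P + P A^T + Q - P C^T R^{-1} C P (illustrative sketch)."""
    P, dt = np.array(P0, dtype=float), T / steps
    Rinv = np.linalg.inv(R)
    for _ in range(steps):
        P = P + (A @ P + P @ A.T + Q - P @ C.T @ Rinv @ C @ P) * dt
        P = 0.5 * (P + P.T)  # keep the iterate symmetric
    return P

# Scalar example: dx = a x dt + w, z = x + v, with a = -1, Q = R = 1.
A = np.array([[-1.0]]); C = np.array([[1.0]])
Q = np.array([[1.0]]);  R = np.array([[1.0]])
P_ss = kalman_bucy_covariance(A, C, Q, R, np.array([[5.0]]))
```

For the scalar case the equation settles at the positive root of 2aP + Q - P^2/R = 0 regardless of the initial covariance, which is what the test below verifies.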
We can then define Gaussian belief dynamics b(\hat x_k, P^-_k) through the mean \hat x and covariance P^- of the normal distribution N(\hat x_k, P^-_k). The belief dynamics result in a decoupled FBSDE system:

db_k = g(b_k, u_k)\, dt + \Sigma(b_k, u_k)\, dW, \qquad dW \sim N(0, I),
dV^i = -C^i\, dt + V_x^{i\top} \Sigma\, dW,

where

g(b_k, u_k) = \begin{bmatrix} f(\hat x_k, u_k, w_0, t) \\ \mathrm{vec}\big(A_k P^-_k + P^-_k A_k^\top + \tilde Q_k - P^-_k C_k^\top \tilde R_k^{-1} C_k P^-_k\big) \end{bmatrix}, \qquad
\Sigma(b_k, u_k) = \begin{bmatrix} K_k C_k P^-_k \\ 0 \end{bmatrix},

V(T) = g(X(T)), \qquad \hat X(0) = \mathbb{E}[X(0)], \qquad P^-(0) = \mathbb{E}\big[(X(0) - \hat X)(X(0) - \hat X)^\top\big].

D DEEP SETS

Consider a function f mapping a domain X to Y, where X is a vector space R^d and Y is the continuous space R. Assume the function takes a set as input, X = {x_1, ..., x_N}; then f is permutation invariant if it satisfies the following property (Zaheer et al., 2017). Property 1. A function f : X → Y defined on sets is permutation invariant to the order of objects in the set, i.e., for any permutation π: f({x_1, ..., x_N}) = f({x_{π(1)}, ..., x_{π(N)}}). In this paper we restrict attention to the case where f is a neural network. Theorem 1. Let X have elements from a countable universe. A function f(X) is a valid permutation invariant function, i.e., invariant to permutations of X, iff it can be decomposed in the form ρ(Σ_{x∈X} φ(x)) for appropriate functions ρ and φ. In a symmetric multi-agent system, the agents are indistinguishable. This property suggests how to extract the features of the -i-th agents using a neural network: the states of the -i-th agents can be represented as the set X = {X_1, X_2, ..., X_{i-1}, X_{i+1}, ..., X_N}, and we design a neural network f with the permutation invariance property. Specifically, φ is represented by a one-layer neural network and ρ is a common nonlinear activation function.
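A minimal numpy sketch of the ρ(Σ φ(x)) decomposition, with φ a random one-layer network (the weights are placeholders, not trained parameters) and ρ = tanh, illustrates the permutation invariance directly:

```python
import numpy as np

def invariant_embed(X_set, W, b):
    """Permutation-invariant embedding rho(sum_x phi(x)), where phi is a
    one-layer network (ReLU) and rho = tanh, mirroring Theorem 1."""
    phi = np.maximum(X_set @ W + b, 0.0)   # per-element features, shape (N, Nf)
    return np.tanh(phi.sum(axis=0))        # sum over the set, then nonlinearity

rng = np.random.default_rng(0)
X_set = rng.normal(size=(5, 3))            # a set of 5 agent states in R^3
W, b = rng.normal(size=(3, 8)), rng.normal(size=8)
out = invariant_embed(X_set, W, b)
perm_out = invariant_embed(X_set[::-1], W, b)  # reversed agent order
```

Because the sum commutes with any reordering of the set elements, the two outputs coincide exactly.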

E INVARIANT LAYER ARCHITECTURE

The architecture of the invariant layer is described in Fig. 10. The input of the layer is the states at time step t. The Invariant Model module in Fig. 10 is described in Appendix D, where φ is a neural network and ρ is a nonlinear activation function. The specific configuration of the neural network in the invariant model can be found in Appendix J. Noting that all agents have access to the global state, we define the state input features of the neural network for the i-th agent as X_{t,i} = {x_i, x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_N}, with shape [BS, N]; in other words, we always put the agent's own feature in first place. Since such a feature tensor exists for each agent i, the input tensor of the invariant layer has shape [BS, N, N]. In the invariant layer, we first separate the input feature X_t into two parts, X_{t,i} and X_{t,-i}. The features X_{t,-i} of the -i-th agents are then sent to the invariant model; the shape of X_{t,-i} is [BS, N, N-1], where N is the number of agents. First, we could use a neural network to map the features into an N_f-dimensional space, where N_f is the feature dimension; the tensor then has shape [BS, N, N-1, N_f]. After summing over all elements of the set, the tensor reduces to [BS, N, 1, N_f]; we denote this feature tensor by F_1. However, the memory complexity is O(N^2 × N_f), which is not tolerable as the number of agents N increases. Alternatively, we can simply map the feature tensor [BS, N] into the desired feature dimension N_f, obtaining a tensor of shape [BS, N, N_f] that we denote F_2. We then create the tensor \bar F_2 of shape [BS, 1, N_f] holding the average feature over the set, and define F'_2 = (\bar F_2 × N - F_2)/(N - 1), which has shape [BS, N, N_f]. One can verify that F'_2 recovers F_1 (up to the sum/average normalization), while the memory complexity of computing F'_2 is only O(N). The derivation holds because the system is symmetric and the agents are indistinguishable.
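The O(N) trick amounts to recovering each agent's exclude-self aggregate from the population average of already-computed features. A minimal numpy sketch (the function name is illustrative) compares it against the naive O(N^2) construction:

```python
import numpy as np

def exclude_self_mean(F):
    """Given per-agent features F of shape (N, Nf), return for each agent
    the mean of the OTHER agents' features in O(N) memory:
    F'_i = (N * Fbar - F_i) / (N - 1)."""
    N = F.shape[0]
    return (N * F.mean(axis=0, keepdims=True) - F) / (N - 1)

rng = np.random.default_rng(0)
F = rng.normal(size=(6, 4))
fast = exclude_self_mean(F)
# Naive O(N^2) reference: average the features of all agents j != i.
naive = np.stack([F[np.arange(6) != i].mean(axis=0) for i in range(6)])
```

The identity only requires linearity of the mean over the per-agent feature vectors, so it holds for any φ, as long as the agents share the same feature network.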
The trick can be extended to higher state dimensions for each individual agent.

The HJB equation for player i is
$$V_{i,t} + \inf_{u_i\in U_i}\Big\{\sum_{j=1}^{N}\big[a(\bar X - X_j) + u_j\big]V_{x_j,i} + \tfrac{1}{2}u_i^2 - q u_i(\bar X - X_i) + \tfrac{\epsilon}{2}(\bar X - X_i)^2\Big\} + \tfrac{1}{2}\mathrm{tr}(V_{xx,i}\Sigma\Sigma^{\mathrm T}) = 0.$$
By computing the infimum explicitly, the optimal control of player i is
$$u_i^*(X,t) = q(\bar X - X_i) - V_{x_i,i}(X,t).$$
The final form of the HJB equation can be obtained as
$$V_{i,t} + \tfrac{1}{2}\mathrm{tr}(V_{xx,i}\Sigma\Sigma^{\mathrm T}) + a(\bar X - X_i)V_{x_i,i} + \sum_{j\neq i}\big[a(\bar X - X_j) + u_j^*\big]V_{x_j,i} + \tfrac{\epsilon}{2}(\bar X - X_i)^2 - \tfrac{1}{2}\big(q(\bar X - X_i) - V_{x_i,i}\big)^2 = 0.$$
Applying the Feynman-Kac lemma to equation 22, the corresponding FBSDE system is
$$dX(t) = \big(f(X(t),t) + G(X(t),t)u(t)\big)dt + \Sigma(t,X(t))dW_t,\quad X(0) = x_0,$$
$$dV_i = -\Big[\tfrac{\epsilon}{2}(\bar X - X_i)^2 - \tfrac{1}{2}\big(q(\bar X - X_i) - V_{x_i,i}\big)^2\Big]dt + V_{x_i}^{\mathrm T}\Sigma\, dW,\quad V_i(T) = g(X(T)).$$

Figure 13: Two-car racing problem with an 8-second time horizon.

G BELIEF CAR RACING

The full stochastic model can be written as
$$dx = (f(x) + G(x)u)dt + \Sigma(x)dw,\qquad z = h(x) + m,$$
$$f(x) = \begin{pmatrix} v\cos\theta \\ v\sin\theta \\ -c_{\mathrm{drag}} v \\ 0 \end{pmatrix},\quad G(x) = \Sigma(x) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & v/L \end{pmatrix},\quad h(x) = x, \tag{24}$$
where dw is a standard Brownian motion. We consider the problem of two cars racing on a circular track. The cost function of each car is designed as
$$J_t = \underbrace{\exp\Big(\frac{x^2}{a^2} + \frac{y^2}{b^2} - 1\Big)}_{\text{track cost}} + \underbrace{\mathrm{ReLU}(-v)}_{\text{velocity cost}} + \underbrace{\exp(-d)}_{\text{collision cost}},$$
where d is the Euclidean distance between the two cars. In this showcase, we use a continuous-time extended Kalman filter to propagate the belief-space dynamics described in equation 19. The detailed algorithm for the belief-space deep fictitious play FBSDE can be found in the Appendix. We introduce the notion of competition through an additional cost:
$$J_{\text{competition}} = \exp\Big(-\begin{pmatrix}\cos\theta \\ \sin\theta\end{pmatrix}^{\mathrm T}\begin{pmatrix}x_1 - x_2 \\ y_1 - y_2\end{pmatrix}\Big),$$
where x_i, y_i denote the x, y position of the ith car. When the ith car is leading, the competition loss is minor, and it increases exponentially when the car is trailing. Thanks to the decoupled BSDE structure, each car can measure this competition loss separately and optimize its value function individually.
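To make the model concrete, here is a minimal NumPy sketch of the drift, control matrix, and costs above. It is a sketch under stated assumptions: the track semi-axes a_ax, b_ax and the ReLU(−v) reading of the velocity term are not fixed by the text, and the EKF belief propagation is omitted.

```python
import numpy as np

c_drag, L = 0.01, 0.1            # drag and wheelbase coefficients from Appendix J
a_ax, b_ax = 2.0, 1.0            # hypothetical semi-axes of the elliptic track

def f(x):
    """Drift of the single-car model; state x = [px, py, v, theta]."""
    px, py, v, th = x
    return np.array([v * np.cos(th), v * np.sin(th), -c_drag * v, 0.0])

def G(x):
    """Control (and diffusion) matrix; inputs u = [throttle, steering]."""
    v = x[2]
    return np.array([[0.0, 0.0],
                     [0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, v / L]])

def running_cost(x1, x2):
    """Track + velocity + collision cost for car 1 against car 2.
    The ReLU(-v) velocity term is an assumed reading of the paper's cost."""
    px, py, v, _ = x1
    d = np.linalg.norm(x1[:2] - x2[:2])
    return (np.exp(px ** 2 / a_ax ** 2 + py ** 2 / b_ax ** 2 - 1.0)  # track
            + max(-v, 0.0)                                           # velocity
            + np.exp(-d))                                            # collision

def competition_cost(x1, x2):
    """Small when car 1 leads along its heading, exponentially large when trailing."""
    th = x1[3]
    heading = np.array([np.cos(th), np.sin(th)])
    return np.exp(-heading @ (x1[:2] - x2[:2]))
```

Note how the competition term only depends on the relative position projected onto the car's own heading, which is what lets each car evaluate it independently.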

H ANALYTIC SOLUTION FOR INTER-BANK BORROWING/LENDING PROBLEM

The analytic solution for the linear inter-bank problem was derived in Carmona et al. (2013). We provide it here for completeness. Assume the value function satisfies the ansatz
$$V_i(t,X) = \frac{\eta(t)}{2}(\bar X - X_i)^2 + \mu(t),\quad i\in I,$$
where η(t), µ(t) are two scalar functions. The optimal control under this ansatz is
$$\alpha_i(t,X) = \Big[q + \eta(t)\Big(1 - \frac{1}{N}\Big)\Big](\bar X - X_i).$$
By plugging the ansatz into the HJB equation derived in 22, one obtains
$$\dot\eta(t) = 2(a+q)\eta(t) + \Big(1 - \frac{1}{N^2}\Big)\eta^2(t) - (\epsilon - q^2),\quad \eta(T) = c,$$
$$\dot\mu(t) = -\frac{1}{2}\sigma^2(1-\rho^2)\Big(1 - \frac{1}{N}\Big)\eta(t),\quad \mu(T) = 0.$$
The Riccati equation above admits the analytic solution
$$\eta(t) = \frac{-(\epsilon - q^2)\big(e^{(\delta^+ - \delta^-)(T-t)} - 1\big) - c\big(\delta^+ e^{(\delta^+ - \delta^-)(T-t)} - \delta^-\big)}{\big(\delta^- e^{(\delta^+ - \delta^-)(T-t)} - \delta^+\big) - c\big(1 - 1/N^2\big)\big(e^{(\delta^+ - \delta^-)(T-t)} - 1\big)},$$
where $\delta^\pm = -(a+q) \pm \sqrt{R}$ and $R = (a+q)^2 + (1 - 1/N^2)(\epsilon - q^2)$.

I IMPORTANCE SAMPLING

Fig. 14 demonstrates how fully connected layers combined with importance sampling lead to an extremely deep fully connected neural network. Fig. 15 demonstrates how importance sampling increases the convergence rate of the FBSDE with an LSTM backbone. The experiment is conducted with 50 agents and 50 stages. All configurations are identical except for the presence of importance sampling.
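As a quick sanity check on the closed-form η(t) from Appendix H, the sketch below evaluates it with the coefficients of eq. (29) and cross-checks it against a naive backward Euler integration of the Riccati ODE. The step size and tolerance are arbitrary choices for illustration.

```python
import numpy as np

# Coefficients from Appendix J, eq. (29); eps denotes the epsilon in the running cost.
a, q, c, eps, N, T = 0.1, 0.1, 0.5, 0.5, 10, 1.0

R = (a + q) ** 2 + (1 - 1 / N ** 2) * (eps - q ** 2)
dp = -(a + q) + np.sqrt(R)          # delta_plus
dm = -(a + q) - np.sqrt(R)          # delta_minus

def eta(t):
    """Closed-form solution of the Riccati equation (Carmona et al., 2013)."""
    e = np.exp((dp - dm) * (T - t))
    num = -(eps - q ** 2) * (e - 1) - c * (dp * e - dm)
    den = (dm * e - dp) - c * (1 - 1 / N ** 2) * (e - 1)
    return num / den

# Cross-check: integrate the Riccati ODE backward in time from eta(T) = c.
dt = 1e-4
eta_num = c
for _ in range(int(T / dt)):
    rhs = 2 * (a + q) * eta_num + (1 - 1 / N ** 2) * eta_num ** 2 - (eps - q ** 2)
    eta_num -= dt * rhs             # stepping from t = T down to t = 0
```

At t = T the exponential terms cancel and η(T) = c exactly, matching the terminal condition of the Riccati equation.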

J EXPERIMENT CONFIGURATIONS

This Appendix elaborates the experiment configurations for section 4. For all simulations in section 4, the number of SGD iterations is fixed at N_SGD = 100, and we use Adam as the optimizer. In section 4.1, for the prediction of the initial value function, all frameworks use a 2-layer feed-forward network with 128 hidden dimensions. For the baseline framework, we follow the configuration suggested in Han et al. (2018). At each time step, V_{x,i} is approximated by a three-layer feed-forward network with 64 hidden dimensions. We add batch normalization Ioffe & Szegedy (2015) after each affine transformation and before each nonlinear activation function. For the Deep FBSDE with LSTM backbone, we use a two-layer LSTM with 128 hidden states. If the framework includes the invariant layer, the number of mapped features is chosen to be 256. The hyperparameters of the dynamics are listed as follows: a = 0.1, q = 0.1, c = 0.5, ε = 0.5, ρ = 0.2, σ = 1, T = 1. (29) In the simulation, the time horizon is discretized into 40 time steps by the Euler method. The learning rate is set to 1E-3, the default learning rate of the Adam optimizer. The initial state of each agent is sampled from the uniform distribution [δ_0, δ_1], where δ_0 is the constant standard deviation of the state X(t) during the process. For evaluation, we use 256 newly sampled trajectories, different from the training trajectories, to measure performance in RSE and total cost error. The number of stages is set to 100, which is enough for all frameworks to converge. In section 4.2, the hyperparameters are listed as follows: c_drag = 0.01, L = 0.1, c = 0.5, T = 10.0 (30) The observation noise is sampled from the Gaussian m ∼ N(0, 0.01I). The time horizon is discretized into 100 time steps by the Euler method. In this experiment, the initial value V_i is approximated by a single trainable scalar and V_{x,i}(t) is approximated by a two-layer LSTM with 32 hidden dimensions.
The number of stages is set to 10.

8:  for l ← 1 to N_gd do
9:      for t ← 1 to T − 1 do
10:         for j ← 1 to B in parallel do
11:             Compute network prediction for the ith player: V^i_{x_i,j,t} = f^m_{FC_i}(f^m_{LSTM_i}(X^j_t; θ^{l−1}_i))
12:             Compute the ith optimal control: u^{j,*}_{i,t} = −R^{−1}_i (G^T_i V^i_{x_i,j,t} + Q^T_i x^j_i)
13:             Infer the −ith players' network prediction: V^i_{x_{−i},j,t} = f^{m−1}_{FC_{−i}}(f^{m−1}_{LSTM_{−i}}(X_t; θ_{−i}))
14:             Compute the −ith optimal control: u^{j,*}_{−i,t} = −R^{−1}_{−i}(G^T_{−i} V^j_{x_{−i},t} + Q^T_{−i} x^j_{−i})
15:             Sample noise ∆W_j ∼ N(0, ∆t)
16:             Propagate FSDE: X^j_{t+1} = f_FSDE(X^j_t, u^{j,*}_{i,t}, u^{j,*}_{−i,t}, ∆W_j, t)
17:             Propagate BSDE: V^i_{j,t+1} = f_BSDE(V^i_{j,t}, X^j_t, u^{j,*}_{i,t}, ∆W_j, t)
            end for
25: end while
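The propagation steps of the algorithm (lines 15-17) can be sketched for the inter-bank dynamics as a single Euler-Maruyama update. This is an illustrative simplification under stated assumptions: a single trajectory instead of a batch, independent noise (ρ = 0), and the plain running cost as the BSDE drift, standing in for the paper's batched, network-driven implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
a, q, eps, sigma, dt = 0.1, 0.1, 0.5, 1.0, 0.025   # coefficients from eq. (29)

def fbsde_step(X, V, Vx, u):
    """One Euler-Maruyama step of the coupled FSDE/BSDE pair (algorithm lines
    15-17) for the inter-bank dynamics. X, u, V, Vx are [N] arrays holding each
    agent's state, control, value estimate, and own-state value gradient."""
    N = X.shape[0]
    dW = rng.normal(0.0, np.sqrt(dt), size=N)       # Delta W ~ N(0, dt)
    Xbar = X.mean()
    # FSDE: dX_i = [a (Xbar - X_i) + u_i] dt + sigma dW_i
    X_next = X + (a * (Xbar - X) + u) * dt + sigma * dW
    # running cost h_i = u_i^2/2 - q u_i (Xbar - X_i) + eps/2 (Xbar - X_i)^2
    h = 0.5 * u ** 2 - q * u * (Xbar - X) + 0.5 * eps * (Xbar - X) ** 2
    # BSDE: dV_i = -h_i dt + Vx_i sigma dW_i
    V_next = V - h * dt + Vx * sigma * dW
    return X_next, V_next
```

In the full algorithm, Vx would come from the LSTM/FC networks and the step would be vectorized over the batch dimension B.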



Agent and player are used interchangeably in this paper. The experiments are conducted on an Nvidia TITAN RTX.



Figure 1: SDFP framework for N Players. Each NN block has the architecture in Fig. 11.

Figure 2: Sample efficiency between FBSDE framework w/ and w/o invariant layer.

Scalable Deep Fictitious Play FBSDE for symmetric simplification
1: Hyper-parameters: N: number of players, T: number of timesteps, M: number of stages in fictitious play, N_gd: number of gradient descent steps per stage, U_0: the initial strategies for players in set I, B: batch size, ε: training threshold, ∆t: time discretization
2: Parameters: V(x_0; φ): network weights for initial value prediction, θ: weights and biases of the fully connected and LSTM layers
3: Initialize trainable parameters: θ_0, φ_0
4: while LOSS is above certain threshold do
5:     for m ← 1 to M do
6:

Figure 3: Comparison of SDFP and analytic solution for the inter-bank problem. Both the state (left) and control (right) trajectories are aligned with the analytic solution (represented by dots).

Figure 4: Time and memory complexity comparison between batch, iterate and invariant layer+batch implementations. Time complexity is measured by per-iteration time.

Figure 6: RSE and total loss trajectory comparison between our FBSDE framework and that of Han & Hu (2019) w/ and w/o invariant layer for 500 agents.

Figure 7: RSE and training loss trajectory comparison between our FBSDE framework and the extension of Pereira et al. (2019) w/ and w/o invariant layer for 500 agents.

Figure 8: Terminal-time state X and control U distributions of the ith agent for linear and superlinear dynamics.

Figure 9: One belief space racing trajectory. The solid line represents the mean and the circles represent the variance.

Figure 10: Invariant layer architecture.

Figure 11: FBSDE Network for a Single Agent. Note that the same FC is shared across all timesteps.

Figure 14: The gradient path of the FBSDE model w/ and w/o importance sampling. The left figure shows the FBSDE with importance sampling and the right figure the FBSDE without. One can see that importance sampling leads to a long gradient chain.

Figure 16: Total loss and RSE for the 1,000-agent simulation.

Scalable Deep Fictitious Play FBSDE
1: Hyper-parameters: N: number of players, T: number of timesteps, M: number of stages in fictitious play, N_gd: number of gradient descent steps per stage, U_0: the initial strategies for players in set I, B: batch size, ε: training threshold, ∆t: time discretization
2: Parameters: V(x_0; φ): network weights for initial value prediction, θ: weights and biases of the fully connected and LSTM layers
3: Initialize trainable parameters: θ_0, φ_0
4: while LOSS is above certain threshold do
5:     for m ← 1 to M do
6:         for all i ∈ I in parallel do
7:             Collect opponent agents' policies f^{m−1}_{LSTM_{−i}}(·), f^{m−1}_{FC_{−i}}(·)
8:

Comparison with previous work on the 10-agent inter-bank problem. We plot the results of 3 repeated runs with different seeds, with the line and shaded region showing the mean and mean ± standard deviation, respectively. The hyperparameters and dynamics coefficients used in the inter-bank experiments are the same as Han & Hu (2019) unless otherwise noted.

