PROBE INTO MULTI-AGENT ADVERSARIAL REINFORCEMENT LEARNING THROUGH MEAN-FIELD OPTIMAL CONTROL

Anonymous

Abstract

Multi-agent adversarial reinforcement learning (MaARL) has shown promise in solving adversarial games. However, the theoretical tools for analyzing MaARL are still elusive. In this paper, we take the first step toward theoretically understanding MaARL through mean-field optimal control. Specifically, we model MaARL as a mean-field quantitative differential game between two dynamical systems with implicit terminal constraints. Based on this game, we study both its optimal solution and its generalization. We first establish a two-sided extremism principle (TSEP) as a necessary condition for the optimal solution of the game. We then show that the TSEP is also sufficient when the terminal time is sufficiently small. Based on the TSEP, a generalization bound for MaARL is further proposed. This bound does not explicitly rely on the dimensions, norms, or other capacity measures of the model, which are usually prohibitively large in deep learning. To the best of our knowledge, this is the first work on the theory of MaARL.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton, 1988; Sutton and Barto, 1998; Barto et al., 1991), aimed at single-agent environments, has been successfully deployed in many application areas, including ethology (Dayan and Daw, 2008), economics (Jasmin et al., 2011), psychology (Leibo et al., 2018), and system control (Arel et al., 2010). It studies how artificial systems learn to predict the optimal action given the current state. The agent determines the best course of action to maximize its rewards, and each action also moves it to the next state. However, real-world RL agents inhabit natural environments that are also populated by other agents, which can interact with one another and modify each other's rewards via their actions. Motivated by this observation, multi-agent adversarial reinforcement learning (MaARL) was proposed by Uther and Veloso (2003), in which adversarial neural networks are employed to solve games in adversarial environments (Mandlekar et al., 2017). MaARL is well suited to multi-party game problems such as autonomous driving (Behzadan and Munir, 2019; Pan et al., 2019), AI gaming (Mandlekar et al., 2017; Pinto et al., 2017; Zhang et al., 2020), and auction games (Bichler et al., 2021). Moreover, adversarial neural networks can also improve feature robustness and sample efficiency (Ma et al., 2018).

Despite the empirical popularity of MaARL, its theoretical understanding remains largely blank. We attribute this gap between theory and practice to the lack of analysis tools: the optimization objective is defined by a dynamical system, which is too complex to analyze directly. In this paper, we aim to provide new theoretical tools for the analysis of MaARL, probing into it from the viewpoint of mean-field optimal control. Our contributions can be summarized as follows:

1. We propose to model MaARL as a mean-field quantitative differential game (Pontryagin, 1985); its training process is then regarded as achieving the optimal control of this game. The mean-field two-sided extremism principle (TSEP) (Guo et al., 2005) is presented, which relies on the loss function and the terminal constraints. This mean-field TSEP serves as a necessary condition for the convergence (or equivalently, the optimality) of the mean-field quantitative differential game; when the terminal time is small enough, the mean-field TSEP admits a unique solution and thus also serves as a sufficient condition for convergence.

2. The optimal objective function value is characterized by the viscosity solution (E et al., 2019) of a mean-field Hamilton-Jacobi-Isaacs (HJI) equation (Guo et al., 2005). We then prove that this viscosity solution is unique. The HJI equation gives a global characterization of adversarial reinforcement learning, of which the mean-field TSEP is a local special case.

3. Based on the TSEP, a generalization error bound for MaARL is proved. The bound is of order O(1/√N), where N is the number of samples. It does not explicitly rely on the dimensions, norms, or other capacity measures of the network parameters, which are usually prohibitively large in deep learning.

To the best of our knowledge, this is the first work on developing theoretical foundations for adversarial reinforcement learning. Our work may inspire novel designs of optimization methods for adversarial reinforcement learning. Moreover, the techniques may be of independent interest for modeling other adversarial learning algorithms, including generative adversarial networks (Goodfellow et al., 2020; Liu and Tuzel, 2016; Mao et al., 2017), and for solving partial differential equations (Zang et al., 2020).

2. RELATED WORK

Mean-field optimal control. Since the work of Fornasier and Solombrino (2014), which introduced the concept of mean-field optimal control and described it as a rigorous limiting process, various applications of mean-field optimal control in different scenarios have been proposed. Fornasier et al. (2019) model the role of a government over a large population of interacting agents as a mean-field optimal control problem derived from deterministic finite-agent dynamics, Burger et al. (2021) derive a framework to compute optimal controls for problems with states in the space of probability measures, and Albi et al. (2022) study mean-field selective optimal control for multi-population dynamics based on transient leadership. In terms of mathematical tools for mean-field optimal control, Bonnet and Frankowska (2022) investigate fine properties of the value function associated with an optimal control problem in the Wasserstein space of probability measures, and Bonnet and Rossi (2021) provide sufficient conditions under which the controlled vector fields solving optimal control problems formulated on continuity equations are Lipschitz regular in space. See (Bonnet et al., 2022; Zhou and Xu, 2020; Carrillo et al., 2020) for further references.

Deep learning theory based on dynamics. Since E (2017), previous works have been devoted to establishing theoretical foundations of deep learning from the dynamical-systems viewpoint. Based on the PMP and the method of successive approximation (Kantorovitch, 1939), new optimization methods are developed by Li et al. (2018); Li and Hao (2018). Sonoda and Murata (2017) study the continuum limit of training neural networks, and Chang et al. (2018b;a); Haber and Ruthotto (2017) contribute to the design of network architectures based on dynamical systems and differential equations. E et al. (2019) propose a mean-field optimal control formulation for explaining deep learning. They prove mean-field optimality conditions of both the Hamilton-Jacobi-Bellman type and the Pontryagin type (Pontryagin, 1987). Similar results are given by Persio and Garbelli (2021) by associating deep learning with stochastic optimal control (Guo et al., 2005) from the perspective of mean-field games (Lasry and Lions, 2007). These mean-field results reflect the probabilistic nature of deep learning. In contrast to the above works, this paper models an MaARL algorithm as a mean-field quantitative differential game between two dynamical systems, rather than as a single dynamical system.

3. PRELIMINARIES

One-agent reinforcement learning. One-agent reinforcement learning aims to solve a $K$-step decision problem. Specifically, the agent, named $D_z$, starts from state $x_0 = x \in \mathbb{R}^{n_1}$. At step $k$, the agent takes action $a_k$ based on its current state $x_k$ and moves to state $x_{k+1} = x_k + f_k(x_k, a_k)$, where $f_k$ is the transition function. Action $a_k$ is penalized by $L_k(x_k, a_k)$ from the environment. A final penalty $\Phi(x_K, y)$ applies at the last step $K$, where $y$ represents some known prior information, e.g., the engine power of a vehicle in autonomous driving. In summary, the overall penalty over the agent's course is
$$\mathbb{E}_{(x,y)\sim\mu}\Big[\Phi(x_K, y) + \sum_{k=0}^{K-1} L_k(x_k, a_k)\Big].$$
In deep reinforcement learning, the action $a_k$ is represented by a deep neural network parameterized by $\theta_k$, i.e., $a_k = a_k(x_k, \theta_k)$, and we can view the agent $D_z$ as a mapping $D_z(x;\, \theta_z = \{\theta_k\}_{k=0}^{K-1})$ from the initial state $x_0 = x$ to the last-step state $x_K$. In practice, the terminal state is usually also subject to a constraint represented by a function $g(x_K)$; for example, a vehicle (the agent) is controlled to reach a certain area (the constraint). In this way, one-agent reinforcement learning can be modelled as a mean-field optimal control problem with trainable parameters $\{\theta_k\}_{k=0}^{K-1}$:
$$\inf_{\theta}\ \mathbb{E}_{(x,y)\sim\mu}\Big[\Phi(x_K, y) + \sum_{k=0}^{K-1} L_k(x_k, \theta_k)\Big] \quad \text{s.t.}\quad x_{k+1} = x_k + f_k(x_k, \theta_k),\ x_0 = x \in \mathbb{R}^{n_1},\ g(x_K) = 0. \tag{1}$$

Multi-agent adversarial reinforcement learning. In this setting, besides the original deep-learning-based agent $D_z(x_z;\, \theta_z = \{\theta_z^k\}_{k=0}^{K-1}): \mathbb{R}^{n_1} \to \mathbb{R}^{n_1}$ (Eq. 1), there exists an adversarial deep-learning-based agent $D_d(x_d;\, \theta_d = \{\theta_d^k\}_{k=0}^{K-1}): \mathbb{R}^{n_2} \to \mathbb{R}^{n_2}$. We use $a_z^k(x_z^k, \theta_z^k)$ and $a_d^k(x_d^k, \theta_d^k)$ to denote the actions of the original and the adversarial agents at step $k$, respectively.
The penalty at step $k$ now depends on the states and actions of both the original agent $D_z$ and the adversarial agent $D_d$, written $L_k(x_z^k, x_d^k, \theta_z^k, \theta_d^k)$. Similarly, the terminal cost can be written $\Phi(D_z(x_z; \theta_z), D_d(x_d; \theta_d), y)$. In MaARL, $\theta_z$ is trained to minimize the loss, while $\theta_d$ is trained to maximize it. We can then formulate the adversarial reinforcement learning problem as
$$\inf_{\theta_z \in \Theta_z}\sup_{\theta_d \in \Theta_d}\ \mathbb{E}_{(x_z, x_d, y)\sim\mu}\Big[\Phi(x_z^K, x_d^K, y) + \sum_{k=0}^{K-1} L(x_z^k, x_d^k, \theta_z^k, \theta_d^k)\Big] \tag{2}$$
$$\text{s.t.}\quad x_z^{k+1} = x_z^k + f_z(x_z^k, \theta_z^k),\quad x_z^0 = x_z,\quad g_z(x_z^K) = 0,$$
$$x_d^{k+1} = x_d^k + f_d(x_d^k, \theta_d^k),\quad k = 0, \dots, K-1,\quad x_d^0 = x_d,\quad g_d(x_d^K) = 0,$$
where $\theta_z \in \Theta_z$ and $\theta_d \in \Theta_d$ are the parameters of $D_z$ and $D_d$, respectively, and $x_z$ and $x_d$ are their initial states. The goal of MaARL is to find optimal parameters $\theta_z$ and $\theta_d$ satisfying Eq. (2), such that the two agents reach a Nash equilibrium (Maskin, 1999). In this paper, we consider the currently popular offline setting (Agarwal et al., 2020) of MaARL, where the model is learned on $N$ data points $\{(x_{zi}, x_{di}, y_i)\}_{i=1,\dots,N}$ sampled from the distribution $\mu$.
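As a purely illustrative reading of Eq. (2), the objective for a single sample can be computed by unrolling both agents' recursions. Below is a minimal sketch with hypothetical helper names (`fz`, `fd`, `step_cost`, `terminal_cost` stand in for the transition functions and penalties; none of these names come from the paper):

```python
def maarl_rollout_cost(xz0, xd0, y, thetas_z, thetas_d, fz, fd, step_cost, terminal_cost):
    """One sample's contribution to the MaARL objective of Eq. (2).

    Unrolls both agents' K-step dynamics
        x_z^{k+1} = x_z^k + f_z(x_z^k, theta_z^k),
        x_d^{k+1} = x_d^k + f_d(x_d^k, theta_d^k),
    accumulating the coupled per-step penalties and the terminal penalty.
    """
    xz, xd = float(xz0), float(xd0)
    total = 0.0
    for tz, td in zip(thetas_z, thetas_d):
        total += step_cost(xz, xd, tz, td)       # L(x_z^k, x_d^k, theta_z^k, theta_d^k)
        xz, xd = xz + fz(xz, tz), xd + fd(xd, td)  # residual-style state updates
    return total + terminal_cost(xz, xd, y)      # Phi(x_z^K, x_d^K, y)
```

Averaging this quantity over the $N$ offline samples yields the empirical objective that the inf-sup training of Eq. (2) operates on.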

4. MAARL AS A MEAN-FIELD DIFFERENTIAL GAME

Given the current formulation of MaARL (Eq. 2), a theoretical analysis is not easy to provide due to its discrete iterations. To facilitate analysis, we adopt the dynamical-systems viewpoint and translate problem (2) into the following continuous form:
$$\inf_{\theta_z}\sup_{\theta_d} J(\theta_z, \theta_d) = \inf_{\theta_z}\sup_{\theta_d} \mathbb{E}_{(x_z, x_d, y)\sim\mu}\Big[\Phi(x(t_f), y) + \int_0^{t_f} L(x(t), \theta_z(t), \theta_d(t))\,dt\Big], \tag{3}$$
$$\text{s.t.}\quad \frac{dx(t)}{dt} = f(x(t), \theta_z, \theta_d),\quad x(0) = (x_z^\top, x_d^\top)^\top,\quad x(t_f) \in S := \{x \mid g(x) = 0\},$$
where $f(x, \theta_z, \theta_d) = (f_z^\top(x, \theta_z), f_d^\top(x, \theta_d))^\top$, $g(x(t_f)) = (g_z^\top(x_z(t_f)), g_d^\top(x_d(t_f)))^\top$, $x_z(\cdot): [0, t_f] \to \mathbb{R}^{n_1}$, $x_d(\cdot): [0, t_f] \to \mathbb{R}^{n_2}$, $\theta_z(\cdot): [0, t_f] \to \mathbb{R}^{r_1}$, $\theta_d(\cdot): [0, t_f] \to \mathbb{R}^{r_2}$, $g_z(\cdot): \mathbb{R}^{n_1} \to \mathbb{R}^{p_1}$, $g_d(\cdot): \mathbb{R}^{n_2} \to \mathbb{R}^{p_2}$, and $\Phi$, $L$, $f$ are functions of appropriate input and output dimensions. Thus $x: [0, t_f] \to \mathbb{R}^n$ and $g: \mathbb{R}^n \to \mathbb{R}^p$ with $n = n_1 + n_2$ and $p = p_1 + p_2$. We define $U_z$ ($U_d$) as the set of admissible strategies $\theta_z$ ($\theta_d$) satisfying the terminal constraint $g_z(x_z(t_f)) = 0$ ($g_d(x_d(t_f)) = 0$).

We note that the above problem (Eq. (3)) is a special case of mean-field differential games, and we name it the mean-field quantitative differential game. We believe that Eq. (3) is a reasonable model of MaARL, since most dynamical systems in MaARL scenarios are naturally described in continuous time because of physical laws (e.g., the trajectory of the vehicle in autonomous driving). Furthermore, since the mean-field quantitative differential game is a special case of mean-field differential games, methodology from that area can be borrowed to offer theoretical insight into this problem. Our goal is to characterize the optimal strategy $(\theta_z^*, \theta_d^*)$ of Eq. (3) and the corresponding optimal trajectory $x^*(t)$ such that, for any $(\theta_z, \theta_d) \in U_z \times U_d$,
$$J(\theta_z^*, \theta_d) \le J(\theta_z^*, \theta_d^*) \le J(\theta_z, \theta_d^*), \tag{4}$$
where Eq. (4) is called the saddle-point condition. Furthermore, since Eq. (3) characterizes the expected penalty while empirically only the penalty on the samples is available, our other goal is to characterize the gap between the sampled penalty and the expected penalty.

The rest of the paper is organized as follows: in Section 5, we characterize the optimal solution of Eq. (3) through the mean-field two-sided extremism principle (TSEP); in Section 6, we characterize the optimal objective function value of Eq. (3) through the mean-field HJI equation; finally, in Section 7, we derive the generalization bound between the sampled penalty and the expected penalty.
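The passage from the discrete recursion of Eq. (2) to the continuous dynamics of Eq. (3) is the usual Euler-discretization limit: the residual update $x_{k+1} = x_k + h f(x_k)$ with step $h = t_f/K$ converges to the ODE flow as $K \to \infty$. A minimal numerical sketch (toy scalar dynamics of our choosing, not the paper's):

```python
import math

def euler_terminal_state(f, x0, tf=1.0, n=1000):
    # Euler discretization x_{k+1} = x_k + h * f(x_k) with h = tf / n:
    # the discrete recursion of Eq. (2) has exactly this residual form, and
    # letting h -> 0 recovers the continuous dynamics of Eq. (3).
    h = tf / n
    x = x0
    for _ in range(n):
        x = x + h * f(x)
    return x

# Example: for f(x) = -x with x(0) = 1, the exact terminal state is exp(-1),
# and the discretization error shrinks as the number of steps grows.
```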

5. MODELING OPTIMAL SOLUTION USING MEAN-FIELD TSEP

In this section, we characterize the optimality of the mean-field quantitative differential game (3) through a two-sided extremism principle (TSEP) with terminal constraints. We prove that satisfying the TSEP is a necessary condition for being the optimal solution of Eq. (3); with additional mild assumptions, we show that the TSEP is also a sufficient condition. We first introduce the Hamiltonian of Eq. (3):
$$H(x(t), \theta_z(t), \theta_d(t), \psi(t)) := -L(x(t), \theta_z(t), \theta_d(t)) + \psi^\top(t) f(x(t), \theta_z(t), \theta_d(t)).$$
Intuitively, $H: \mathbb{R}^n \times \Theta_z \times \Theta_d \times \mathbb{R}^n \to \mathbb{R}$ is the total energy of the dynamical system, and $\psi \in \mathbb{R}^n$ plays the role of the momentum. We are now ready to derive the necessary condition for being the optimal solution of Eq. (3).

Theorem 5.1 Assume i) $f$ is bounded, and $f$, $L$ are continuous w.r.t. $\theta_z$, $\theta_d$; ii) $f$, $L$ and $\Phi$ are continuously differentiable w.r.t. $x$, and the distribution $\mu$ has bounded support. Let $(\theta_z^*, \theta_d^*) \in U_z \times U_d$ be the optimal strategy of problem (3) and $x^*(t)$ the corresponding optimal trajectory. Then there exist $\psi^*: [0, t_f] \to \mathbb{R}^n$ and $\xi \in \mathbb{R}^p$ such that for $t \in [0, t_f]$:
$$1)\quad \dot{x}^*(t) = f(x^*(t), \theta_z^*, \theta_d^*),\quad x^*(0) = x_0,$$
$$\dot{\psi}^*(t) = -\nabla_x H(x^*(t), \theta_z^*(t), \theta_d^*(t), \psi^*(t)),\quad \psi^*(t_f) = -\nabla_x \Phi(x^*(t_f), y_0) - \xi^\top \nabla_x g(x^*(t_f)); \tag{5}$$
$$2)\quad \mathbb{E}_{(x_0, y_0)\sim\mu}\, H(x^*(t), \theta_z^*(t), \theta_d^*(t), \psi^*(t)) = \sup_{\theta_z \in \Theta_z}\inf_{\theta_d \in \Theta_d} \mathbb{E}_{(x_0, y_0)\sim\mu}\, H(x^*(t), \theta_z, \theta_d, \psi^*(t)) = \inf_{\theta_d \in \Theta_d}\sup_{\theta_z \in \Theta_z} \mathbb{E}_{(x_0, y_0)\sim\mu}\, H(x^*(t), \theta_z, \theta_d, \psi^*(t)),\quad \text{a.e.},$$
where $f(\cdot)$, $L(\cdot)$ and $\Phi(\cdot)$ are defined in Eq. (3) and $H(\cdot)$ is the Hamiltonian.

Theorem 5.1 gives the mean-field TSEP, which relies on the loss function and the terminal constraints, as a necessary condition for optimality. Since the TSEP provides necessary conditions for optimality, a natural question is when sufficient conditions for optimality can also be provided.
This part presents one simple case in which the TSEP is also sufficient, i.e., in which an optimal solution exists: when does the mean-field TSEP admit a unique solution?

Theorem 5.2 Suppose that i) $f$ is bounded, $g$ is continuously differentiable w.r.t. $x$ with bounded and Lipschitz partial derivatives, and $\mu$ has bounded support in $\mathbb{R}^n \times \mathbb{R}^m$; ii) $f$, $L$ and $\Phi$ are twice continuously differentiable w.r.t. $x$, $\theta_z$ and $\theta_d$ with bounded and Lipschitz partial derivatives, and $\partial^2 f/\partial\theta_z\partial\theta_d \equiv \partial^2 L/\partial\theta_z\partial\theta_d \equiv 0$; iii) $H(x, \theta_z, \theta_d, \psi)$ is strongly concave in $\theta_z$ and strongly convex in $\theta_d$, uniformly in $x \in \mathbb{R}^n$ and $\psi \in \mathbb{R}^n$. Then for sufficiently small $t_f$, if $(\theta_z^1, \theta_d^1)$ and $(\theta_z^2, \theta_d^2)$ are solutions of the mean-field TSEP derived in Theorem 5.1 and are continuous w.r.t. time $t$, then $(\theta_z^1, \theta_d^1) = (\theta_z^2, \theta_d^2)$.

Theorem 5.2 shows that small $t_f$ roughly corresponds to the regime where the reachable set of the forward dynamics is small; hence the solution is unique. Theorem 5.2 assumes the continuity of $\theta_z^1, \theta_d^1, \theta_z^2, \theta_d^2$ with respect to $t$. In fact, when $\theta_z^1, \theta_d^1, \theta_z^2, \theta_d^2$ are discontinuous on at most a set of measure zero, we can still conclude that $(\theta_z^1(t), \theta_d^1(t)) = (\theta_z^2(t), \theta_d^2(t))$ for a.e. $t \in [0, t_f]$.
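Hamilton's equations (5) pair a forward state ODE with a backward costate ODE. The sketch below (a hypothetical scalar toy problem of our choosing; the terminal-constraint multiplier $\xi$ is omitted, so the terminal condition reduces to $\psi(t_f) = -\nabla_x\Phi$) shows how such a forward/backward sweep can be integrated numerically:

```python
def costate_sweep(x0, theta, f, dfdx, dLdx, dPhidx, tf=1.0, n=1000):
    """Forward state pass and backward costate pass for a scalar toy system.

    Forward:  Euler-integrate x' = f(x, theta(t)).
    Backward: integrate the costate ODE of the TSEP,
                  psi' = -grad_x H = grad_x L - psi * df/dx,
              from psi(tf) = -grad_x Phi(x(tf))
              (unconstrained sketch: multiplier term dropped).
    """
    dt = tf / n
    xs = [float(x0)]
    for k in range(n):
        xs.append(xs[-1] + dt * f(xs[-1], theta(k * dt)))
    psi = -dPhidx(xs[-1])
    for k in reversed(range(n)):
        psi -= dt * (dLdx(xs[k]) - psi * dfdx(xs[k], theta(k * dt)))
    return xs[-1], psi
```

For example, with $f(x, \theta) = \theta x$, $L = x^2$, $\Phi = x^2/2$ and the constant control $\theta \equiv 0$, the state stays at $x_0$ and the costate integrates exactly to $\psi(0) = -x_0 - 2 x_0 t_f$.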

6. MODELING OPTIMAL OBJECTIVE FUNCTION VALUE VIA MEAN-FIELD HJI EQUATION

In this section, we study the mean-field HJI equation from another perspective. This section presents (1) the mean-field HJI equation for MaARL; and (2) the relationship between the HJI equation and the TSEP.

6.1. OPTIMAL OBJECTIVE FUNCTION VALUE OBEYS MEAN-FIELD HJI EQUATION

To simplify notation, we define $v^*(t, \mu) := J(\theta_z^*, \theta_d^*, t, \mu)$, where $J$, $\theta_z^*$, and $\theta_d^*$ are defined in Section 4. One easily observes that $v^*(t, \mu)$ is the optimal objective function value with sample distribution $\mu$ and time $t$. We then have the following theorem characterizing $v^*$.

Theorem 6.1 Assume i) $f$, $L$ and $\Phi$ are bounded, and the distribution $\mu \in \mathcal{P}_2(\mathbb{R}^{n+m})$; ii) $f$, $L$ and $\Phi$ are Lipschitz continuous w.r.t. $x$, with the Lipschitz constants of $f$ and $L$ independent of $\theta_z$, $\theta_d$. Suppose the optimal value function $v^*(t, \mu)$ of Eq. (7) exists; then it is the unique viscosity solution (see the definition in Appendix B) to the following mean-field HJI equation:
$$\partial_t v(t, \mu) + \inf_{\theta_z \in \Theta_z}\sup_{\theta_d \in \Theta_d} \int_{\mathbb{R}^{n+m}} \big[\partial_\mu v(t, \mu)(x, y)\big]^\top [f(x, \theta_z, \theta_d), 0] + L(x, \theta_z, \theta_d)\, d\mu(x, y) = 0,$$
$$v(t_f, \mu) = \int_{\mathbb{R}^{n+m}} \Phi(x, y)\, d\mu(x, y), \tag{8}$$
where $f(\cdot)$, $L(\cdot)$ and $\Phi(\cdot)$ are defined in (3).

The optimal value function $v^*(t, \mu)$ solving Eq. (8) in Theorem 6.1 reveals the dynamic programming principle: for any optimal trajectory, starting from any intermediate state of the trajectory, the remaining trajectory is also optimal. Theorem 6.1 also establishes the uniqueness of the viscosity solution of the HJI equation and identifies the value function of the mean-field optimal control problem as this unique solution.

6.2. CONNECTION BETWEEN HJI AND TSEP

In Theorem 5.1, we prove that the necessary condition for being the optimal solution of Eq. (3) is characterized by the TSEP, while Theorem 6.1 shows that the optimal objective value is the unique viscosity solution of the mean-field HJI equation. One may wonder about the connection between the TSEP and the mean-field HJI equation. In this section, we show that the TSEP can be understood as a local result compared with the global characterization of the HJI equation. To this end, we first introduce some basics of the Wasserstein space and its differentiation rules.

6.2.1. DERIVATIVE IN WASSERSTEIN SPACE

Let $D$ denote the Fréchet derivative on Banach spaces: if $F: U \to V$ is a mapping between Banach spaces $(U, \|\cdot\|_U)$ and $(V, \|\cdot\|_V)$, then $DF(x): U \to V$ is the linear operator satisfying
$$\frac{\|F(x+y) - F(x) - DF(x)(y)\|_V}{\|y\|_U} \to 0, \quad \text{as } \|y\|_U \to 0.$$
Let $X \in \mathbb{R}^{n+m}$ be a random variable. We use the shorthand $L^2(\Omega, \mathbb{R}^{n+m})$ for $L^2((\Omega, \mathcal{F}, \mathbb{P}), \mathbb{R}^{n+m})$, the set of $\mathbb{R}^{n+m}$-valued square-integrable random variables with respect to a probability measure $\mathbb{P}$, and equip this Hilbert space with the norm $\|X\|_{L^2} := (\mathbb{E}\|X\|^2)^{1/2}$. As assumed in the previous section, $x_0 \in \mathbb{R}^n$ and $y_0 \in \mathbb{R}^m$ are random variables with $(x_0, y_0) \sim \mu \in \mathcal{P}_2(\mathbb{R}^{n+m})$, where $\mathcal{P}_2(\mathbb{R}^{n+m})$ denotes the set of probability measures on $\mathbb{R}^{n+m}$ with finite second moment. The space $\mathcal{P}_2(\mathbb{R}^{n+m})$ can be metrized by the 2-Wasserstein distance
$$W_2(\mu, \nu) := \inf\big\{\|X - Y\|_{L^2} \mid X, Y \in L^2(\Omega, \mathbb{R}^{n+m}) \text{ with } \mathbb{P}_X = \mu,\ \mathbb{P}_Y = \nu\big\}.$$
For $\mu \in \mathcal{P}_2(\mathbb{R}^{n+m})$, define $\|\mu\|_{L^2} := \big(\int_{\mathbb{R}^{n+m}} \|w\|^2\, \mu(dw)\big)^{1/2}$. A variable $X \in L^2(\Omega, \mathbb{R}^{n+m})$ if and only if its law $\mathbb{P}_X \in \mathcal{P}_2(\mathbb{R}^{n+m})$. Any function $u: \mathcal{P}_2(\mathbb{R}^{n+m}) \to \mathbb{R}$ can be lifted into its "extension" $U$ on $L^2(\Omega, \mathbb{R}^{n+m})$ (Cardaliaguet, 2012) by $U(X) = u(\mathbb{P}_X)$ for all $X \in L^2(\Omega, \mathbb{R}^{n+m})$. In particular, $u$ is $C^1(\mathcal{P}_2(\mathbb{R}^{n+m}))$ if the lifted function $U$ is Fréchet differentiable with continuous derivatives. Since $L^2(\Omega, \mathbb{R}^{n+m})$ can be identified with its dual, if the Fréchet derivative $DU(X)$ exists, then by the Riesz representation theorem it can be identified with an element of $L^2(\Omega, \mathbb{R}^{n+m})$:
$$DU(X)(Y) = \mathbb{E}[DU(X) \cdot Y], \quad \forall Y \in L^2(\Omega, \mathbb{R}^{n+m}).$$
One may check that the law of $DU(X)$ depends only on the law of $X$; thus the derivative of $u$ at $\mu = \mathbb{P}_X$ is defined via $DU(X) = \partial_\mu u(\mathbb{P}_X)(X)$, for some function $\partial_\mu u(\mathbb{P}_X): \mathbb{R}^{n+m} \to \mathbb{R}^{n+m}$.
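For intuition only: in one dimension, the 2-Wasserstein distance between two empirical measures with equally many atoms reduces to matching order statistics (the monotone rearrangement is the optimal coupling). A minimal sketch (a toy helper of ours, not part of the paper's framework):

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    # In 1-D, the optimal coupling sorts both samples (monotone rearrangement),
    # so W2 between two N-atom empirical measures is the root-mean-square gap
    # between their order statistics.
    xs, ys = np.sort(np.asarray(xs, dtype=float)), np.sort(np.asarray(ys, dtype=float))
    return float(np.sqrt(np.mean((xs - ys) ** 2)))
```

For instance, shifting every atom by a constant $c$ gives $W_2 = |c|$, and permuting a sample leaves the distance at zero.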

6.3. DERIVE THE CHARACTERIZATION

In what follows, we establish the connection between the HJI equation and the TSEP, showing that the TSEP can be understood as a local result compared with the global characterization of the HJI equation. For the value function $v(t, \mu)$ in the derived HJI equation (8), consider the lifted function $V(t, X)$, where $X = (x, y) \sim \mu$. We define the Hamiltonian for the lifted HJI equation as
$$\mathcal{H}(X, D_X V(t, X)) = \inf_{\theta_z \in \Theta_z}\sup_{\theta_d \in \Theta_d} \mathbb{E}_\mu\big[D_X V(t, X)^\top [f(x, \theta_z, \theta_d), 0] + L(x, \theta_z, \theta_d)\big].$$
Suppose $\theta_z^\dagger(X, D_X V(t, X))$ and $\theta_d^\dagger(X, D_X V(t, X))$ are the corresponding optimal strategies, and write $P = D_X V(t, X)$. Then
$$\mathcal{H}(X, P) = \mathbb{E}_\mu\big[P^\top [f(x, \theta_z^\dagger(X, P), \theta_d^\dagger(X, P)), 0] + L(x, \theta_z^\dagger(X, P), \theta_d^\dagger(X, P))\big],$$
$$\mathbb{E}_\mu\big[\nabla_{\theta_z, \theta_d}[f(x, \theta_z^\dagger(X, P), \theta_d^\dagger(X, P)), 0]\, P + \nabla_{\theta_z, \theta_d} L(x, \theta_z^\dagger(X, P), \theta_d^\dagger(X, P))\big] = 0, \tag{11}$$
where the last equation follows from the first-order optimality condition. Define $X_t = (x_t, y)$ and $P_t = D_X V(t, X_t)$; we can apply the characteristic evolution equations (Subbotina, 2006)
$$\dot{X}_t = D_P \mathcal{H}(X_t, P_t), \qquad \dot{P}_t = -D_X \mathcal{H}(X_t, P_t). \tag{12}$$
Plugging equation (11) into equation (12), letting $\theta_z^*(t) = \theta_z^\dagger(X_t, P_t)$, $\theta_d^*(t) = \theta_d^\dagger(X_t, P_t)$, and writing $p_t$ for the first $n$ components of $P_t$, we obtain
$$\dot{x}_t = f(x_t, \theta_z^*(t), \theta_d^*(t)), \qquad \dot{p}_t = -\nabla_x f(x_t, \theta_z^*(t), \theta_d^*(t))^\top p_t - \nabla_x L(x_t, \theta_z^*(t), \theta_d^*(t)). \tag{13}$$
If we let $\psi = -p$, the first two equalities of equation (5) in Theorem 5.1 are converted to equation (13). The Hamilton equations in the TSEP can thus be regarded as the characteristic equations of the HJI equation originating from $\mu_0$, which justifies the claim that the TSEP constitutes a local condition compared with the HJI equation.
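The characteristic system (12) is a Hamiltonian flow, so it can be integrated with a symplectic scheme that approximately preserves $\mathcal{H}$ along trajectories. A minimal sketch for a generic scalar Hamiltonian (hypothetical helper; the gradient callables are assumed correct):

```python
def integrate_characteristics(dH_dX, dH_dP, x0, p0, dt=1e-3, steps=1000):
    # Semi-implicit (symplectic) Euler for the characteristic system
    #   X' =  dH/dP,   P' = -dH/dX,
    # i.e. the Hamiltonian flow underlying Eq. (12). The momentum is updated
    # first, then the state, which keeps the energy drift bounded.
    x, p = float(x0), float(p0)
    for _ in range(steps):
        p -= dt * dH_dX(x, p)
        x += dt * dH_dP(x, p)
    return x, p
```

For the harmonic-oscillator Hamiltonian $H = (X^2 + P^2)/2$, the trajectory starting at $(1, 0)$ stays on (approximately) the level set $H = 1/2$.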

7. GENERALIZATION BOUND

In this section, we establish generalization bounds for MaARL in the offline setting both from the perspective of the global minimum of the loss function and from the perspective of algorithmic stability.

7.1. GENERALIZATION BOUND FROM TSEP

We define the loss of each training sample $X_i := (x_{zi}, x_{di}, y_i)$, $i = 1, \dots, N$, as
$$J_0(\theta_z, \theta_d; X_i) = \Phi(x_z(t_f), x_d(t_f), y_i) + \int_0^{t_f} L(x_z(t), x_d(t), \theta_z(t), \theta_d(t))\,dt, \tag{14}$$
where $x_z(0) = x_{zi}$, $x_d(0) = x_{di}$. Now $J(\theta_z, \theta_d) = \mathbb{E}_{X_0 \sim \mu}\, J_0(\theta_z, \theta_d; X_0)$, and we define $J_N(\theta_z, \theta_d) = \frac{1}{N}\sum_{i=1}^N J_0(\theta_z, \theta_d; X_i)$.

We then estimate generalization bounds for offline MaARL based on the TSEP. The necessary Hamiltonian condition for the sampled version reads
$$\frac{1}{N}\sum_{i=1}^N H\big(x^{\theta_z^N, \theta_d^N, i}(t), \theta_z^N(t), \theta_d^N(t), \psi^{\theta_z^N, \theta_d^N, i}(t)\big) = \inf_{\theta_d \in \Theta_d}\sup_{\theta_z \in \Theta_z} \frac{1}{N}\sum_{i=1}^N H\big(x^{\theta_z^N, \theta_d^N, i}(t), \theta_z, \theta_d, \psi^{\theta_z^N, \theta_d^N, i}(t)\big), \quad \text{a.e. } t \in [0, t_f],$$
where $\theta_z^N$ and $\theta_d^N$ are the solution of the sampled TSEP. Note that if $\Theta_z$ and $\Theta_d$ are sufficiently large, e.g. $\Theta_z = \mathbb{R}^{r_1}$, $\Theta_d = \mathbb{R}^{r_2}$, the solution $(\theta_z^*, \theta_d^*)$ of the TSEP satisfies
$$F(\theta_z^*, \theta_d^*)(t) := \mathbb{E}_{\mu_0}\, \nabla_{\theta_z, \theta_d} H\big(x_t^{\theta_z^*, \theta_d^*}, \psi_t^{\theta_z^*, \theta_d^*}, \theta_z^*(t), \theta_d^*(t)\big) = 0, \quad \text{a.e. } t \in [0, t_f], \tag{17}$$
while the solution $(\theta_z^N, \theta_d^N)$ of the sampled TSEP satisfies
$$F_N(\theta_z^N, \theta_d^N)(t) := \frac{1}{N}\sum_{i=1}^N \nabla_{\theta_z, \theta_d} H\big(x_t^{\theta_z^N, \theta_d^N, i}, \psi_t^{\theta_z^N, \theta_d^N, i}, \theta_z^N(t), \theta_d^N(t)\big) = 0, \quad \text{a.e. } t \in [0, t_f].$$
Now $F_N$ is a random approximation of $F$ with $\mathbb{E} F_N(\theta_z, \theta_d)(t) = F(\theta_z, \theta_d)(t)$ for all $\theta_z, \theta_d$ and a.e. $t \in [0, t_f]$. Let $(U, \|\cdot\|_U)$, $(V, \|\cdot\|_V)$ be Banach spaces and $F: U \to V$. We first provide the definition of stability, which is the primary condition ensuring that $F_N$ approximates $F$.

Definition 7.1 For $\rho > 0$ and $x \in U$, let $S_\rho(x) := \{y \in U : \|x - y\|_U < \rho\}$. The mapping $F$ is stable on $S_\rho(x)$ if there exists a constant $K_\rho > 0$ such that
$$\|y - z\|_U \le K_\rho \|F(y) - F(z)\|_V, \quad \forall y, z \in S_\rho(x).$$
Notice that here we are only concerned with whether $\theta_z^*$ and $\theta_d^*$ satisfy the first-order optimality condition. We write $\theta = (\theta_z^\top, \theta_d^\top)^\top$ and abbreviate $F(\theta_z^*, \theta_d^*)(\cdot)$ and $F_N(\theta_z^N, \theta_d^N)(\cdot)$ as $F(\theta)(\cdot)$ and $F_N(\theta^N)(\cdot)$, respectively. We then obtain the following Theorem 7.1, which describes the convergence of the sampled solution to the mean-field solution as the number of samples increases.

Theorem 7.1 Assume that $f$, $L$ and $\Phi$ are bounded and Lipschitz continuous with respect to $x$, with the Lipschitz constants of $f$ and $L$ independent of $\theta_z$, $\theta_d$. Let $(\theta_z^*, \theta_d^*)$ be a solution of $F = 0$ (Eq. (17)) that is stable on $S_\rho(((\theta_z^*)^\top, (\theta_d^*)^\top)^\top)$ for some $\rho > 0$. Then there exist positive constants $s_0, C, K_1, K_2, \rho_1 < \rho$ and a random variable $\theta^N := ((\theta_z^N)^\top, (\theta_d^N)^\top)^\top \in S_{\rho_1}(((\theta_z^*)^\top, (\theta_d^*)^\top)^\top)$ such that for $s \in (0, s_0]$:
$$\mathbb{P}\big(\|\theta_z^* - \theta_z^N\|_{L^\infty} \ge Cs\big) \le 4\exp\Big(-\frac{N s^2}{K_1 + K_2 s}\Big), \qquad \mathbb{P}\big(\|\theta_d^* - \theta_d^N\|_{L^\infty} \ge Cs\big) \le 4\exp\Big(-\frac{N s^2}{K_1 + K_2 s}\Big),$$
$$\mathbb{P}\big(|J(\theta_z^*, \theta_d^*) - J(\theta_z^N, \theta_d^N)| \ge s\big) \le 4\exp\Big(-\frac{N s^2}{K_1 + K_2 s}\Big), \qquad \mathbb{P}\big(F_N(\theta^N) \ne 0\big) \le 4\exp\Big(-\frac{N s_0^2}{K_1 + K_2 s_0}\Big). \tag{18}$$
The loss function (14) is uniformly bounded under the given assumptions, so we can apply a Hoeffding-type inequality (Corollary 2 in Pinelis and Sakhanenko (1986)). Using Theorem 6 in E et al. (2019) and rewriting $\theta$ as $(\theta_z^\top, \theta_d^\top)^\top$, the theorem follows. Letting $s \le 1$ and setting the right-hand side of Eq. (18) to be at most $\delta$, solving for $s$ immediately yields the following bound.

Corollary 7.1 Under the assumptions and notation of Theorem 7.1, for any $0 < \delta \le \max_{s \in (0, \min\{1, s_0\}]} 4\exp\big(-\frac{N s^2}{K_1 + K_2 s}\big)$, the following inequalities hold with probability at least $1 - \delta$:
$$\|\theta_z^* - \theta_z^N\|_{L^\infty} < C\sqrt{\frac{(K_1 + K_2)\log(4/\delta)}{N}}, \qquad \|\theta_d^* - \theta_d^N\|_{L^\infty} < C\sqrt{\frac{(K_1 + K_2)\log(4/\delta)}{N}},$$
$$|J(\theta_z^*, \theta_d^*) - J(\theta_z^N, \theta_d^N)| < \sqrt{\frac{(K_1 + K_2)\log(4/\delta)}{N}}.$$
Corollary 7.1 shows that the difference between the optimizer over the whole distribution and the optimizer over finite samples is bounded and of order $O(1/\sqrt{N})$ with a total of $N$ samples.
This bound is independent of the training algorithm.
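The $O(1/\sqrt{N})$ rate in Corollary 7.1 is the familiar Monte Carlo concentration rate. A quick numerical illustration (a toy bounded "loss" of our choosing, not the paper's $J_0$): quadrupling the number of samples roughly halves the expected deviation of the empirical objective from its mean-field value.

```python
import numpy as np

def mean_abs_deviation(n_samples, n_trials=2000, seed=0):
    # Empirical E|J_N - J| for a toy bounded loss J_0 ~ Uniform(0, 1),
    # whose mean-field value is J = 0.5. The deviation shrinks at the
    # O(1/sqrt(N)) rate that Corollary 7.1 predicts.
    rng = np.random.default_rng(seed)
    draws = rng.uniform(0.0, 1.0, size=(n_trials, n_samples))
    return float(np.mean(np.abs(draws.mean(axis=1) - 0.5)))
```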

7.2. GENERALIZATION BOUND VIA ALGORITHMIC STABILITY

Finally, we estimate generalization bounds for offline MaARL from the viewpoint of algorithmic stability. In the rest of this section, we replace the integrals in $J$, $J_N$, $J_0$ by the discrete sums of (2) and redefine $r_1, r_2$ as the total dimensions of $\theta_z, \theta_d$. We define the generalization error by taking the expectation with respect to the randomized algorithm $A$:
$$\mathrm{er}(\theta_z, \theta_d) := \mathbb{E}_A\big[J(\theta_z, \theta_d) - J_N(\theta_z, \theta_d)\big].$$
We update $\theta_z$ and $\theta_d$ alternately: from the initial value $(\theta_{z,0}, \theta_{d,0})$, update $\theta_z$ for $M_{z,1}$ steps to get $(\theta_{z,M_{z,1}}, \theta_{d,0})$, then update $\theta_d$ for $M_{d,1}$ steps to get $(\theta_{z,M_{z,1}}, \theta_{d,M_{d,1}})$. Continuing until the algorithm converges yields $(\theta_{z,M_{z,2}}, \theta_{d,M_{d,2}}), (\theta_{z,M_{z,3}}, \theta_{d,M_{d,3}}), \dots, (\theta_{z,M_{z,n}}, \theta_{d,M_{d,n}})$. Consider Stochastic Gradient Langevin Dynamics (SGLD), a popular variant of stochastic gradient methods that adds isotropic Gaussian noise at each iteration, e.g.
$$\theta_{z,k+1} = \theta_{z,k} - \eta_k \nabla_{\theta_z} J_N(\theta_{z,k}, \theta_{d,0}) + \sqrt{2\eta_k/\beta}\, \mathcal{N}(0, I_{r_1}).$$
We have the following generalization bound in expectation over the random draw of training data.

Theorem 7.2 Suppose that $J_0(\theta_z, \theta_d; X)$ is uniformly bounded by $C$, and
$$\|\nabla_{\theta_z} J_0(\theta_z, \theta_d; X) - \nabla_{\theta_z} J_0(\theta_z, \theta_d; X')\| \le L_z, \quad \|\nabla_{\theta_d} J_0(\theta_z, \theta_d; X) - \nabla_{\theta_d} J_0(\theta_z, \theta_d; X')\| \le L_d, \quad \forall X, X'.$$
Then we have the generalization bound
$$\mathbb{E}[\mathrm{er}(\theta_{z,M_{z,n}}, \theta_{d,M_{d,n}})] \le \frac{2}{N}\sum_{i=1}^n \min(k_1, M_{z,i} - M_{z,i-1}) + \frac{\sqrt{\beta} L_z C}{N}\sum_{i=1}^n \Big(\sum_{j=k_1+1}^{M_{z,i} - M_{z,i-1}} \eta_j\Big)^{1/2} + \frac{2}{N}\sum_{i=1}^n \min(k_2, M_{d,i} - M_{d,i-1}) + \frac{\sqrt{\beta} L_d C}{N}\sum_{i=1}^n \Big(\sum_{j=k_2+1}^{M_{d,i} - M_{d,i-1}} \eta_j\Big)^{1/2},$$
where $M_{z,0} = M_{d,0} = 0$, and $k_1$ and $k_2$ are chosen to satisfy $\eta_{k_1} \le \ln 2 / (\beta L_z^2)$ and $\eta_{k_2} \le \ln 2 / (\beta L_d^2)$.

Theorem 7.2 yields a bound of order $O(1/N)$, matching the generalization bounds of stochastic gradient descent ascent (SGDA) for minimax problems in Lei et al. (2021).
This bound relies on the aggregated step sizes and does not explicitly depend on the dimensions, norms, or other capacity measures of the parameter, which are usually excessively large in deep learning.
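The alternating SGLD scheme above can be sketched on a toy saddle objective (our choice of objective, not the paper's; the noise scale $\sqrt{2\eta_k/\beta}$ follows the update rule above):

```python
import numpy as np

def sgld_minimax(a, b, eta=0.1, beta=1e4, rounds=5, steps=40, seed=0):
    # Alternating SGLD sketch on the toy saddle objective
    #   J(tz, td) = 0.5 * (tz - a)**2 - 0.5 * (td - b)**2.
    # tz performs noisy gradient DESCENT (the inf player), td performs noisy
    # gradient ASCENT (the sup player); each step adds Gaussian noise of
    # scale sqrt(2 * eta / beta), as in the SGLD update.
    rng = np.random.default_rng(seed)
    noise = lambda: np.sqrt(2.0 * eta / beta) * rng.standard_normal()
    tz, td = 0.0, 0.0
    for _ in range(rounds):
        for _ in range(steps):                   # M_{z,i} inner steps on theta_z
            tz += -eta * (tz - a) + noise()      # descent on dJ/dtz = tz - a
        for _ in range(steps):                   # M_{d,i} inner steps on theta_d
            td += eta * (-(td - b)) + noise()    # ascent on dJ/dtd = -(td - b)
    return tz, td
```

With a large inverse temperature $\beta$ the noise is small, and the iterates settle near the saddle point $(a, b)$ of the toy objective.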

8. CONCLUSION

Multi-agent adversarial reinforcement learning (MaARL) has shown superior performance in solving adversarial games, but the theoretical understanding of MaARL is still immature. This paper studies the convergence and generalization of MaARL under the mean-field optimal control framework. We first model MaARL as a mean-field quantitative differential game. We then prove necessary conditions for the convergence of MaARL from two perspectives: the two-sided extremism principle (TSEP) and the Hamilton-Jacobi-Isaacs (HJI) equation. The uniqueness of the solutions to the mean-field TSEP and the HJI equation is also established. Further, we present the connection between the mean-field TSEP and the mean-field HJI equation, showing that the TSEP is a local special case of the global characterization given by the HJI equation. We also prove two generalization bounds of orders O(1/√N) and O(1/N) from two aspects, the global minimum of the loss function and algorithmic stability, respectively, where N is the number of initial states used in training. Neither bound explicitly relies on the dimensions, norms, or other capacity measures of the network parameters, which are usually prohibitively large in deep learning. The bounds illustrate how algorithmic randomness facilitates the generalization of MaARL. To the best of our knowledge, this is the first theoretical work on the convergence and generalization of MaARL.

A PROOF OF RESULTS IN SECTION 5

Before proving Theorem 5.2, we write the expressions in Theorem 5.1 more compactly. For each control process $\theta_z \in L^\infty([0, t_f], \Theta_z)$ and $\theta_d \in L^\infty([0, t_f], \Theta_d)$, we denote by $x^{\theta_z, \theta_d} := \{x_t^{\theta_z, \theta_d} : 0 \le t \le t_f\}$ and $\psi^{\theta_z, \theta_d} := \{\psi_t^{\theta_z, \theta_d} : 0 \le t \le t_f\}$ the solutions of Hamilton's equations (5), i.e.
$$\dot{x}_t^{\theta_z, \theta_d} = f\big(x_t^{\theta_z, \theta_d}, \theta_z(t), \theta_d(t)\big), \qquad \dot{\psi}_t^{\theta_z, \theta_d} = -\nabla_x H\big(x_t^{\theta_z, \theta_d}, \theta_z(t), \theta_d(t), \psi_t^{\theta_z, \theta_d}\big),$$
with terminal condition $\psi_{t_f}^{\theta_z, \theta_d} = -\nabla_x \Phi\big(x_{t_f}^{\theta_z, \theta_d}, y_0\big) - \xi^\top \nabla_x g\big(x_{t_f}^{\theta_z, \theta_d}\big)$.

We have the following lemma, which estimates the difference between $\big(x^{\theta_z^1, \theta_d^1}, \psi^{\theta_z^1, \theta_d^1}\big)$ and $\big(x^{\theta_z^2, \theta_d^2}, \psi^{\theta_z^2, \theta_d^2}\big)$.

Lemma A.1 Let $\theta_z^1, \theta_z^2 \in L^\infty([0, t_f], \Theta_z)$ and $\theta_d^1, \theta_d^2 \in L^\infty([0, t_f], \Theta_d)$. Then there exists a constant $T_0$ such that for all $t_f \in [0, T_0)$,
$$\|x^{\theta_z^1, \theta_d^1} - x^{\theta_z^2, \theta_d^2}\|_{L^\infty} + \|\psi^{\theta_z^1, \theta_d^1} - \psi^{\theta_z^2, \theta_d^2}\|_{L^\infty} \le C(t_f)\big(\|\theta_z^1 - \theta_z^2\|_{L^\infty} + \|\theta_d^1 - \theta_d^2\|_{L^\infty}\big),$$
where $C(t_f) > 0$ satisfies $C(t_f) \to 0$ as $t_f \to 0$.

Proof A.1 (Proof of Lemma A.1) Denote $\delta\theta_z := \theta_z^1 - \theta_z^2$, $\delta\theta_d := \theta_d^1 - \theta_d^2$, $\delta x := x^{\theta_z^1, \theta_d^1} - x^{\theta_z^2, \theta_d^2}$, and $\delta\psi := \psi^{\theta_z^1, \theta_d^1} - \psi^{\theta_z^2, \theta_d^2}$. By the boundedness and Lipschitz assumptions, there exists $K > 0$ such that
$$\|\delta x\|_{L^\infty} \le K t_f \|\delta x\|_{L^\infty} + K t_f \|\delta\theta_z\|_{L^\infty} + K t_f \|\delta\theta_d\|_{L^\infty}.$$
If $t_f \le T_0 := 1/K$, we have
$$\|\delta x\|_{L^\infty} \le \frac{K t_f}{1 - K t_f}\big(\|\delta\theta_z\|_{L^\infty} + \|\delta\theta_d\|_{L^\infty}\big). \tag{22}$$
Similarly,
$$\|\delta\psi_t\| \le K\|\delta x_{t_f}\| + K\int_t^{t_f} \big(\|\delta x_s\| + \|\delta\psi_s\| + \|\delta\theta_z(s)\| + \|\delta\theta_d(s)\|\big)\,ds,$$
$$\|\delta\psi\|_{L^\infty} \le (K + K t_f)\|\delta x\|_{L^\infty} + K t_f\big(\|\delta\psi\|_{L^\infty} + \|\delta\theta_z\|_{L^\infty} + \|\delta\theta_d\|_{L^\infty}\big),$$
hence
$$\|\delta\psi\|_{L^\infty} \le \frac{K(1 + t_f)}{1 - K t_f}\|\delta x\|_{L^\infty} + \frac{K t_f}{1 - K t_f}\big(\|\delta\theta_z\|_{L^\infty} + \|\delta\theta_d\|_{L^\infty}\big),$$
which, combined with equation (22), proves the lemma.

We can now prove Theorem 5.2. Recall its assumptions: $g$ is continuously differentiable w.r.t. $x$ with bounded and Lipschitz partial derivatives; $\mu$ has bounded support in $\mathbb{R}^n \times \mathbb{R}^m$; $f$, $L$ and $\Phi$ are twice continuously differentiable w.r.t. $x$, $\theta_z$ and $\theta_d$ with bounded and Lipschitz partial derivatives; and $\partial^2 f/\partial\theta_z\partial\theta_d \equiv \partial^2 L/\partial\theta_z\partial\theta_d \equiv 0$.

Proof A.2 (Proof of Theorem 5.2) By uniform strong concavity/convexity and the second assumption of Theorem 5.2, there exists $\lambda_0 > 0$ such that
$$\lambda_0\big(\|\delta\theta_z\|_{L^\infty}^2 + \|\delta\theta_d\|_{L^\infty}^2\big) \le K\big(\|\delta x\|_{L^\infty} + \|\delta\psi\|_{L^\infty}\big)\big(\|\delta\theta_z\|_{L^\infty} + \|\delta\theta_d\|_{L^\infty}\big).$$
Combining the above with Lemma A.1, we have
$$\|\delta\theta_z\|_{L^\infty}^2 + \|\delta\theta_d\|_{L^\infty}^2 \le \frac{K C(t_f)}{\lambda_0}\big(\|\delta\theta_z\|_{L^\infty} + \|\delta\theta_d\|_{L^\infty}\big)^2 \le \frac{2K C(t_f)}{\lambda_0}\big(\|\delta\theta_z\|_{L^\infty}^2 + \|\delta\theta_d\|_{L^\infty}^2\big).$$
Since $C(t_f) \to 0$ as $t_f \to 0$, taking $t_f$ sufficiently small so that $2K C(t_f) < \lambda_0$ implies $\|\delta\theta_z\|_{L^\infty} = \|\delta\theta_d\|_{L^\infty} = 0$.


B PROOF OF THEOREM 6.1

We now introduce the definition of a viscosity solution. Consider a function $v(t, \mathbb{P}_X): [0, t_f] \times \mathcal{P}_2(\mathbb{R}^{n+m}) \to \mathbb{R}$ and the Hamiltonian $\mathcal{H}(X, \partial_{\mathbb{P}_X} v(t, \mathbb{P}_X)(X))$. Then the lifted function $V(t, X) = v(t, \mathbb{P}_X)$ satisfies the lifted equation (24). We say that a bounded, uniformly continuous function $u: [0, t_f] \times L^2(\Omega, \mathbb{R}^{n+m}) \to \mathbb{R}$ is a viscosity solution to the lifted equation (24). For further details we refer the interested reader to (E et al., 2019).

Proof B.1 (Proof of Theorem 6.1) Suppose $v'(t, \mu)$ is a viscosity solution to equation (8) and the corresponding optimal strategy exists. We first fix $\theta_z'$ and consider equation (25). By Theorem 1 and Theorem 2 in (E et al., 2019), $v'(t, \mu)$ is the unique viscosity solution to equation (25).

