PROBE INTO MULTI-AGENT ADVERSARIAL REINFORCEMENT LEARNING THROUGH MEAN-FIELD OPTIMAL CONTROL

Anonymous

Abstract

Multi-agent adversarial reinforcement learning (MaARL) has shown promise in solving adversarial games. However, theoretical tools for analyzing MaARL remain elusive. In this paper, we take a first step toward a theoretical understanding of MaARL through mean-field optimal control. Specifically, we model MaARL as a mean-field quantitative differential game between two dynamical systems with implicit terminal constraints. Based on this game, we study both its optimal solution and its generalization. We first establish a two-sided extremum principle (TSEP) as a necessary condition for the optimal solution of the game. We then show that the TSEP is also sufficient when the terminal time is sufficiently small. Building on the TSEP, we further derive a generalization bound for MaARL. The bound does not explicitly depend on the dimensions, norms, or other capacity measures of the model, which are usually prohibitively large in deep learning. To the best of our knowledge, this is the first theoretical study of MaARL.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton, 1988; Sutton and Barto, 1998; Barto et al., 1991), originally aimed at single-agent environments, has been successfully deployed in many application areas, including ethology (Dayan and Daw, 2008), economics (Jasmin et al., 2011), psychology (Leibo et al., 2018), and system control (Arel et al., 2010). It studies how artificial systems learn to predict the optimal action given the current state: the agent chooses the course of action that maximizes its reward, and that action in turn moves the system to the next state. However, real-world RL agents inhabit environments populated by other agents, which interact with one another and modify one another's rewards through their actions. Motivated by this observation, multi-agent adversarial reinforcement learning (MaARL) was proposed by Uther and Veloso (2003), in which adversarial neural networks are employed to solve games in adversarial environments (Mandlekar et al., 2017). MaARL is well suited to multi-party game problems such as autonomous driving (Behzadan and Munir, 2019; Pan et al., 2019), AI gaming (Mandlekar et al., 2017; Pinto et al., 2017; Zhang et al., 2020), and auction games (Bichler et al., 2021). Moreover, adversarial neural networks can also improve feature robustness and sample efficiency (Ma et al., 2018).

Despite the empirical popularity of MaARL, its theoretical understanding remains blank. We attribute this gap between theory and practice to the lack of analysis tools: the optimization objective is defined by a dynamical system, which is too complex to analyze directly. In this paper, we aim to provide new theoretical tools for the analysis of MaARL. We probe into MaARL from the viewpoint of mean-field optimal control. Specifically, our contributions can be summarized as follows:

1. We model MaARL as a mean-field quantitative differential game (Pontryagin, 1985); its training process is then regarded as achieving the optimal control of this game. We present a mean-field two-sided extremum principle (TSEP) (Guo et al., 2005), which relies on the loss function and the terminal constraints. The mean-field TSEP serves as a necessary condition for the convergence (equivalently, the optimality) of the mean-field quantitative differential game; when the terminal time is small enough, the mean-field TSEP admits a unique solution and thus also serves as a sufficient condition for convergence.

2. The optimal objective function value is characterized by the viscosity solution (E et al., 2019) of a mean-field Hamilton-Jacobi-Isaacs (HJI) equation (Guo et al., 2005). We then prove that this viscosity solution is unique. The HJI equation gives a global characterization of adversarial reinforcement learning, of which the mean-field TSEP above is a local special case.

3. Based on the TSEP, we prove a generalization error bound for MaARL of order O(1/√N), where N is the number of samples. The bound does not explicitly rely on the dimensions, norms, or other capacity measures of the network parameters, which are usually prohibitively large in deep learning.

To the best of our knowledge, this is the first work developing theoretical foundations for adversarial reinforcement learning. Our work may inspire novel designs of optimization methods for adversarial reinforcement learning. Moreover, our techniques may be of independent interest for modeling other adversarial learning algorithms, including generative adversarial networks (Goodfellow et al., 2020; Liu and Tuzel, 2016; Mao et al., 2017), and for solving partial differential equations (Zang et al., 2020).
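For orientation, contribution 2 can be compared with the classical Hamilton-Jacobi-Isaacs equation from differential game theory. The sketch below is the standard finite-dimensional textbook form, not the paper's mean-field statement; the dynamics f, running cost L, and terminal cost Φ are generic placeholders:

```latex
% Classical HJI equation for a two-player zero-sum differential game:
% value function V(t,x), dynamics \dot{x} = f(x,u,v) with controls u (minimizer)
% and v (maximizer), running cost L, terminal cost \Phi at terminal time T.
\partial_t V(t,x)
  + \min_{u}\,\max_{v}\;\Big\{ \big\langle \nabla_x V(t,x),\, f(x,u,v) \big\rangle
  + L(x,u,v) \Big\} = 0,
\qquad V(T,x) = \Phi(x).
```

In the mean-field setting studied here, the state x is replaced by a probability measure over states and derivatives are taken in the corresponding measure space; solutions are understood in the viscosity sense.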

2. RELATED WORK

Mean-field optimal control. Since the work of Fornasier and Solombrino (2014), which introduced the concept of mean-field optimal control and described it as a rigorous limiting process, various applications of mean-field optimal control have been proposed. Fornasier et al. (2019) formulate the role of a government over a large population of interacting agents as a mean-field optimal control problem derived from deterministic finite-agent dynamics; Burger et al. (2021) derive a framework to compute optimal controls for problems whose states lie in the space of probability measures; and Albi et al. (2022) study mean-field selective optimal control for multi-population dynamics based on transient leadership. In terms of mathematical tools for mean-field optimal control, Bonnet and Frankowska (2022) investigate fine properties of the value function associated with an optimal control problem in the Wasserstein space of probability measures, and Bonnet and Rossi (2021) provide sufficient conditions under which the controlled vector fields solving optimal control problems formulated on continuity equations are Lipschitz regular in space. See Bonnet et al. (2022); Zhou and Xu (2020); Carrillo et al. (2020) for further references.

Deep learning theory based on dynamics. Since E (2017), a line of work has sought to establish theoretical foundations of deep learning from the dynamical-systems viewpoint. Based on the PMP and the method of successive approximation (Kantorovitch, 1939), new optimization methods are developed by Li et al. (2018) and Li and Hao (2018). Sonoda and Murata (2017) study the continuum limit of training neural networks, and Chang et al. (2018b;a) and Haber and Ruthotto (2017) contribute to the design of network architectures based on dynamical systems and differential equations. E et al. (2019) propose a mean-field optimal control formulation for explaining deep learning. They prove mean-field optimality conditions of both the Hamilton-Jacobi-Bellman type and the Pontryagin type (Pontryagin, 1987).
Similar results are given by Persio and Garbelli (2021), who associate deep learning with stochastic optimal control (Guo et al., 2005) from the perspective of mean-field games (Lasry and Lions, 2007). These mean-field results reflect the probabilistic nature of deep learning. In contrast with the above works, this paper models an MaARL algorithm as a mean-field quantitative differential game between two dynamical systems, rather than as a single dynamical system.

3. PRELIMINARIES

One-agent reinforcement learning. One-agent reinforcement learning aims to solve a K-step decision problem. Specifically, the agent, denoted D_z, starts from state x_0 = x ∈ R^{n_1}. At step k, the agent
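As a concrete toy illustration of a K-step decision problem, the sketch below rolls out a greedy one-agent policy. The names `rollout`, `f`, and `r`, and the greedy action rule, are our own illustrative assumptions, not the paper's formal setup:

```python
def rollout(x0, K, actions, f, r):
    """Greedy K-step rollout: at each step, pick the action with the best
    one-step reward r(x, a), then move to the next state via x = f(x, a).
    Returns the final state and the accumulated reward."""
    x, total = x0, 0.0
    for k in range(K):
        a = max(actions, key=lambda a: r(x, a))  # greedy one-step choice
        total += r(x, a)                         # collect the reward
        x = f(x, a)                              # transition to next state
    return x, total

# Toy example: scalar state, actions step left or right, and the reward
# prefers landing near the origin.
f = lambda x, a: x + a
r = lambda x, a: -abs(x + a)
x_final, total = rollout(5.0, 5, actions=[-1.0, 1.0], f=f, r=r)
# Starting from x = 5 with 5 steps, the agent walks to the origin.
```

A greedy policy is of course not optimal in general; in the paper's setting the agent instead optimizes the cumulative reward over all K steps.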




