PROBE INTO MULTI-AGENT ADVERSARIAL REINFORCEMENT LEARNING THROUGH MEAN-FIELD OPTIMAL CONTROL

Anonymous

Abstract

Multi-agent adversarial reinforcement learning (MaARL) has shown promise in solving adversarial games. However, theoretical tools for analyzing MaARL remain elusive. In this paper, we take the first step toward a theoretical understanding of MaARL through mean-field optimal control. Specifically, we model MaARL as a mean-field quantitative differential game between two dynamical systems with implicit terminal constraints. Based on this game, we study both its optimal solution and its generalization. We first establish a two-sided extremism principle (TSEP) as a necessary condition for the optimal solution of the game. We then show that the TSEP is also sufficient when the terminal time is sufficiently small. Building on the TSEP, we further derive a generalization bound for MaARL. This bound does not explicitly rely on the dimensions, norms, or other capacity measures of the model, which are usually prohibitively large in deep learning. To the best of our knowledge, this is the first work on the theory of MaARL.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton, 1988; Sutton and Barto, 1998; Barto et al., 1991), aimed at single-agent environments, has been successfully deployed in many application areas, including ethology (Dayan and Daw, 2008), economics (Jasmin et al., 2011), psychology (Leibo et al., 2018), and system control (Arel et al., 2010). It studies how an artificial system learns to predict the optimal action given the current state: the agent selects the course of action that maximizes its rewards, which in turn moves the system to the next state. However, real-world RL agents inhabit natural environments that are also populated by other agents, which can interact with one another and modify each other's rewards through their actions. Motivated by this observation, multi-agent adversarial reinforcement learning (MaARL) was proposed by Uther and Veloso (2003), in which adversarial neural networks are employed to solve games in adversarial environments (Mandlekar et al., 2017). MaARL is well suited to multi-party game problems such as autonomous driving (Behzadan and Munir, 2019; Pan et al., 2019), AI gaming (Mandlekar et al., 2017; Pinto et al., 2017; Zhang et al., 2020), and auction games (Bichler et al., 2021). Moreover, adversarial neural networks can also improve feature robustness and sample efficiency (Ma et al., 2018).

Despite the empirical popularity of MaARL, its theoretical understanding remains largely unexplored. We attribute this gap between theory and practice to the lack of analysis tools: the optimization objective is defined by a dynamical system, which is too complex to analyze directly. In this paper, we aim to provide new theoretical tools for the analysis of MaARL, probing into it from the view of mean-field optimal control. Specifically, our contributions can be summarized as follows:

1. We propose to model MaARL as a mean-field quantitative differential game (Pontryagin, 1985); its training process is thus regarded as achieving the optimal control of this game (a schematic form of the game is sketched below). The mean-field two-sided extremism principle (TSEP) (Guo et al., 2005) is then presented, which relies on the loss function and the terminal constraints.
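To make the setting concrete, the following is a minimal schematic of such a game. The notation here ($x_t$, $\mu_t$, $u_\theta$, $v_\phi$, $\Phi$) is illustrative only and does not anticipate the paper's formal definitions:

\[
\min_{\theta}\,\max_{\phi}\; J(\theta,\phi)
= \mathbb{E}\big[\Phi(x_T,\mu_T)\big]
\quad \text{subject to} \quad
\dot{x}_t = f\big(x_t,\mu_t,\,u_\theta(x_t,\mu_t),\,v_\phi(x_t,\mu_t)\big),
\]

where $x_t$ is the state, $\mu_t$ is the mean-field distribution (the law of $x_t$), $u_\theta$ and $v_\phi$ are the controls of the two competing sides, $T$ is the terminal time, and the terminal loss $\Phi$ encodes the implicit terminal constraints. Informally, a two-sided extremism principle of Pontryagin type would then characterize an optimal pair $(u_\theta, v_\phi)$ as simultaneously extremizing an associated Hamiltonian, one side maximizing and the other minimizing.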

