ROBUST MULTI-AGENT REINFORCEMENT LEARNING DRIVEN BY CORRELATED EQUILIBRIUM

Abstract

In this paper we deal with robust cooperative multi-agent reinforcement learning (CMARL). While CMARL has many potential applications, only a trained policy that is sufficiently robust can be confidently deployed in the real world. Existing works on robust MARL mainly apply vanilla adversarial training within the centralized training and decentralized execution paradigm. We find, however, that if a CMARL environment contains an adversarial agent, the decentralized equilibrium learned under this paradigm can perform significantly poorly. To tackle this issue, we suggest that the non-adversarial agents should jointly make decisions at execution time to improve robustness, i.e., solve a correlated equilibrium instead. We theoretically demonstrate the superiority of the correlated equilibrium over the decentralized one in adversarial MARL settings. To achieve robust CMARL, we therefore introduce novel strategies that encourage agents to learn a correlated equilibrium while maximally preserving the convenience of decentralized execution. Specifically, global latent variables with a mutual information objective are proposed to help agents learn robust policies with MARL algorithms. Experimental results show that our method can dramatically boost performance on the SMAC environments.

1. INTRODUCTION

Recently, reinforcement learning (RL) has achieved remarkable success in many practical sequential decision problems, such as Go (Silver et al., 2017), chess (Silver et al., 2018), and real-time strategy games (Vinyals et al., 2019). In the real world, many sequential decision problems involve more than one decision maker (i.e., multiple agents), such as autonomous driving, traffic light control, and network routing. Cooperative multi-agent reinforcement learning (CMARL) is a key framework for solving these practical problems. Existing MARL methods for cooperative environments include policy-based methods, e.g. MADDPG (Lowe et al., 2017), COMA (Foerster et al., 2017), and value-based methods, e.g. VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019). However, before we deploy a CMARL policy in real-world applications, a question must be asked: are these learned policies safe and robust enough to be deployed? What happens if some agents make mistakes or behave adversarially against the other agents? Most likely, the entire team will fail to achieve its goal or perform extremely poorly. Lin et al. (2020) demonstrate this lack of robustness in CMARL environments, where a learned adversarial policy for a single agent can hugely decrease the team's performance. Therefore, in practice, we want a multi-agent team policy in a fully cooperative environment that remains robust when some agent(s) make mistakes or even behave adversarially. To the best of our knowledge, the few existing works on this issue mainly use a vanilla adversarial training strategy. Klima et al. (2018) considered a two-agent cooperative case in which, to make the policy robust, the agents become competitive with a certain probability during training. Li et al. (2019) provided a robust MADDPG approach called M3DDPG, where each agent optimizes its policy against other agents' perturbed sub-optimal actions.
Most state-of-the-art MARL algorithms follow the centralized training and decentralized execution (CTDE) routine, since this setting matches many real-world deployments. The robust MARL method M3DDPG also follows the CTDE setting. However, existing works on team mini-max normal-form and extensive-form games show that if the environment contains an adversarial agent, then the decentralized equilibrium produced by the CTDE routine can be significantly worse than the correlated equilibrium. We furthermore extend this finding to stochastic team mini-max games. Inspired by this important observation, if we can encourage agents to learn a correlated equilibrium (i.e., the non-adversarial agents jointly make decisions at execution time), then we may achieve better performance than CTDE methods in the robust MARL setting. In this work, we achieve robust MARL by solving for a correlated equilibrium via a latent variable model, where a latent variable shared across all agents helps them jointly make their decisions. Our contributions can be summarized as follows.
• We demonstrate that in stochastic team mini-max games, the decentralized equilibrium can be arbitrarily worse than the correlated one, and the gap can be significantly larger than in normal-form or extensive-form games.
• Based on this result, we point out that learning a correlated equilibrium is indeed necessary in robust MARL.
• We propose a simple strategy that encourages agents to learn a correlated equilibrium, and show that it yields significant performance improvements over vanilla adversarial training.
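The gap between decentralized and correlated team strategies can be seen in a toy team mini-max matrix game (our own illustrative sketch, not an example from this paper): two team agents each pick a bit and the team is rewarded only if the bits match and the adversary fails to guess them. A correlated team strategy guarantees a worst-case value of 1/2, while the best independent (product) strategy only guarantees 1/4:

```python
import numpy as np

# team reward: 1 if both team agents match and the adversary guesses wrong
def reward(a1, a2, b):
    return float(a1 == a2 and a1 != b)

# worst-case value of a correlated team strategy:
# mass 1/2 on joint action (0, 0) and 1/2 on (1, 1)
corr = {(0, 0): 0.5, (1, 1): 0.5}
v_corr = min(sum(p * reward(a1, a2, b) for (a1, a2), p in corr.items())
             for b in (0, 1))

# best worst-case value over independent (product) team strategies,
# found by grid search over each agent's mixing probability
grid = np.linspace(0, 1, 101)
v_ind = max(min(p1 * p2 * reward(1, 1, b)
                + p1 * (1 - p2) * reward(1, 0, b)
                + (1 - p1) * p2 * reward(0, 1, b)
                + (1 - p1) * (1 - p2) * reward(0, 0, b)
                for b in (0, 1))
            for p1 in grid for p2 in grid)

print(v_corr)  # 0.5:  correlated team strategy
print(v_ind)   # 0.25: best decentralized (product) strategy
```

Here the adversary best-responds to each team strategy, so the product strategy loses exactly the factor of two that correlation recovers; in stochastic games this gap can compound over time steps.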

2. RELATED WORKS

Robust RL The robustness in RL involves perturbations occurring in different components, such as the state or observation, the environment, the action or policy, and the opponent's policy. 1) For robustness to state or observation perturbations, most works focus on adversarial attacks on image states/observations. Pattanaik et al. (2018) used gradient-based attacks on image states, and vanilla adversarial training was adopted to obtain a robust policy; Fischer et al. (2019) first trained a normal policy and then distilled it on adversarial states to achieve robustness; Ferdowsi et al. (2018) considered autonomous driving tasks in which the environment interferes with the agent's input sensors, and conducted adversarial training. 2) For robustness to the environment, the robust Markov decision process (MDP) can be used to formulate the problem. Many works (e.g. Wiesemann et al. (2013); Lim et al. (2013)) have studied this model and provided both theoretical analysis and algorithmic designs. In the deep RL scenario, Rajeswaran et al. (2016) used a Monte Carlo approach to train the agent, while Abdullah et al. (2019) and Hou et al. (2020) adopted adversarial training to obtain an agent robust to all environments within a Wasserstein ball. Mankowitz et al. (2019) conducted adversarial training within the MPO algorithm to optimize performance in the worst-case environment. 3) For robustness to perturbations of the action or policy, Tessler et al. (2019); Gu et al. (2018); Vinitsky et al. (2020) considered the case where an agent's action may be perturbed by another action, and conducted adversarial training. 4) For robustness to the opponent, Pinto et al. (2017) and Ma et al. (2018) focused on the case where an agent's reward may be influenced by another agent, and adversarial training was implemented to solve the resulting two-agent game and obtain a robust agent.

Correlated Equilibrium Correlated equilibrium is a more general equilibrium concept in game theory than Nash equilibrium. In a cooperative task, if the team agents jointly make decisions, then the optimal team policy is a correlated equilibrium. Prior works solve correlated equilibrium in extensive-form games. In the deep RL scenario, Celli et al. (2019) applied a vanilla hidden variable model to solve correlated equilibrium in simple repeated environments, while an information loss with a hidden variable model was used in Chen et al. (2019) to solve correlated equilibrium in standard multi-agent environments.

3.1. BACKGROUND

A typical cooperative MARL problem can be formulated as a stochastic Markov game (S, {A_i}_{i=1}^n, r, P), where S denotes the state space and A_i denotes the i-th agent's action space. The en-
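To make the role of a shared latent variable concrete, consider a minimal numerical illustration (our own sketch, not this paper's method): in a two-agent coordination task where the team is rewarded only when actions match, independent decentralized policies that mix uniformly coordinate only half of the time, whereas conditioning both policies on a single shared latent draw coordinates them every time, even though each agent's marginal policy stays uniform:

```python
import numpy as np

rng = np.random.default_rng(0)

def act_independent(n_agents=2):
    # fully decentralized execution: each agent samples its action alone
    return [int(rng.integers(2)) for _ in range(n_agents)]

def act_correlated(n_agents=2):
    # one shared latent draw z per step; each policy reads only z
    # (plus, in general, its own local observation)
    z = int(rng.integers(2))
    return [z for _ in range(n_agents)]

def match_rate(act_fn, trials=10_000):
    # fraction of steps on which all agents pick the same action
    return float(np.mean([len(set(act_fn())) == 1 for _ in range(trials)]))

print(match_rate(act_independent))  # ~0.5: independent mixing miscoordinates
print(match_rate(act_correlated))   # 1.0: the shared latent correlates actions
```

This is the latent-variable intuition behind learning a correlated equilibrium: a common random signal lets otherwise decentralized policies realize joint distributions that no product of independent policies can represent.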

