HETEROGENEOUS-AGENT MIRROR LEARNING

Abstract

The necessity for cooperation among independent intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in the artificial intelligence (AI) research community. However, many research endeavours have focused on developing practical MARL algorithms whose effectiveness has been studied only empirically, thereby lacking theoretical guarantees. As recent studies have revealed, MARL methods often achieve performance that is unstable in terms of reward monotonicity, or suboptimal at convergence. To resolve these issues, in this paper we introduce a novel framework named Heterogeneous-Agent Mirror Learning (HAML) that provides a general template for MARL actor-critic algorithms. We prove that algorithms derived from the HAML template satisfy the desired properties of monotonic improvement of the joint reward and convergence to a Nash equilibrium. We verify the practicality of HAML by proving that the current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO, are in fact HAML instances. Next, as a natural outcome of our theory, we propose HAML extensions of two well-known RL algorithms, HAA2C (for A2C) and HADDPG (for DDPG), and demonstrate their effectiveness against strong baselines on StarCraft II and Multi-Agent MuJoCo tasks.

1. INTRODUCTION

While the policy gradient (PG) formula has long been known in the reinforcement learning (RL) community (Sutton et al., 2000), it was not until trust-region learning (Schulman et al., 2015a) that deep RL algorithms started to successfully solve complex tasks such as real-world robotic control. Nowadays, methods that follow the trust-region framework, including TRPO (Schulman et al., 2015a), PPO (Schulman et al., 2017) and their extensions (Schulman et al., 2015b; Hsu et al., 2020), have become effective tools for solving challenging AI problems (Berner et al., 2019). It was believed that the key to their success is the rigorously described stability and the monotonic improvement property of trust-region learning that they approximate. This reasoning, however, is of limited scope, since it fails to explain why some algorithms that follow it (e.g. PPO-KL) largely underperform others (e.g. PPO-clip) (Schulman et al., 2017). Furthermore, the trust-region interpretation of PPO has been formally rejected by recent studies, both empirically (Engstrom et al., 2020) and theoretically (Wang et al., 2020); these revealed that the algorithm violates the trust-region constraints: it neither constrains the KL-divergence between two consecutive policies, nor bounds their likelihood ratios. These findings suggest that, while the number of available RL algorithms grows, our understanding of them does not, and the algorithms often come without theoretical guarantees. Only recently, Kuba et al. (2022b) showed that well-known algorithms, such as PPO, are in fact instances of the so-called mirror learning framework, within which any induced algorithm is theoretically sound. On a high level, methods that fall into this class optimise the mirror objective, which shapes an advantage surrogate by means of a drift functional, a quasi-distance between policies.
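In its generic form, the mirror learning update described above can be sketched as follows (a simplified rendering of the formulation in Kuba et al. (2022b); here β_{π_k} denotes a state-sampling distribution and 𝔇 the drift functional):

```latex
% Schematic mirror learning update at iteration k: the new policy maximises an
% advantage surrogate penalised by a drift functional, which is non-negative
% and vanishes at the current policy, i.e. \mathfrak{D}_{\pi_k}(\pi_k \mid s) = 0.
\pi_{k+1} = \operatorname*{arg\,max}_{\pi \in \Pi} \;
  \mathbb{E}_{s \sim \beta_{\pi_k}} \Big[
    \mathbb{E}_{a \sim \pi} \big[ A_{\pi_k}(s, a) \big]
    \;-\; \mathfrak{D}_{\pi_k}(\pi \mid s) \Big]
```

Because the drift penalty is zero at π = π_k and non-negative elsewhere, any maximiser of this objective cannot do worse than the current policy, which is the intuition behind the monotonic improvement guarantee.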
Such an update provably leads to monotonic improvement of the return, as well as convergence to the optimal policy. The mirror learning result offers RL researchers strong confidence that there exists a connection between an algorithm's practicality and its theoretical properties, and assures the soundness of common RL practice. While the lack of theoretical guarantees has been a severe problem in RL, in multi-agent reinforcement learning (MARL) it has only been exacerbated. Although the PG theorem has been successfully extended to the multi-agent PG (MAPG) setting (Zhang et al., 2018), it has only recently been shown that the variance of MAPG estimators grows linearly with the number of agents (Kuba et al., 2021). Prior to this, however, a novel paradigm of centralised training for decentralised execution (CTDE) (Foerster et al., 2018; Lowe et al., 2017b) greatly alleviated the difficulty of multi-agent learning by assuming that the global state and opponents' actions and policies are accessible during the training phase; this enabled the development of practical MARL methods by merely extending single-agent algorithms' implementations to the multi-agent setting. As a result, direct extensions of TRPO (Li & He, 2020) and PPO (de Witt et al., 2020a; Yu et al., 2021) have been proposed whose performance, although impressive in some settings, varies with the version used and the environment tested against. However, these extensions assure neither the monotonic improvement property nor a convergence result of any kind (Kuba et al., 2022a). Importantly, these methods can be proved to be suboptimal at convergence in the common setting of parameter sharing (Kuba et al., 2022a), which is adopted by default in popular multi-agent algorithms (Yu et al., 2021) and popular multi-agent benchmarks such as SMAC (Samvelyan et al., 2019) due to the computational convenience it provides.
In this paper, we resolve these issues by proposing Heterogeneous-Agent Mirror Learning (HAML), a template that can induce a continuum of cooperative MARL algorithms with theoretical guarantees of monotonic improvement as well as Nash equilibrium (NE) convergence. The purpose of HAML is to endow MARL researchers with a template for rigorous algorithmic design so that, with a method's correctness granted upfront, they can focus on other aspects, such as effective implementation through deep neural networks. We demonstrate the expressive power of the HAML framework by showing that two existing state-of-the-art (SOTA) MARL algorithms, HATRPO and HAPPO (Kuba et al., 2022a), are rigorous instances of HAML. This stands in contrast to viewing them as mere approximations to provably correct multi-agent trust-region algorithms, as they were originally presented. Furthermore, although HAML is mainly a theoretical contribution, we naturally demonstrate its usefulness by using it to derive two heterogeneous-agent extensions of successful RL algorithms: HAA2C (for A2C (Mnih et al., 2016)) and HADDPG (for DDPG (Lillicrap et al., 2015)), whose strengths are demonstrated on the StarCraft II (SMAC) (de Witt et al., 2020a) and Multi-Agent MuJoCo (de Witt et al., 2020b) benchmarks against strong baselines such as MADDPG (Lowe et al., 2017b) and MAA2C (Papoudakis et al., 2021).

2. PROBLEM FORMULATION

We formulate the cooperative MARL problem as a cooperative Markov game (Littman, 1994) defined by a tuple ⟨N, S, A, r, P, γ, d⟩. Here, N = {1, . . . , n} is a set of n agents, S is the state space, and A = ×_{i=1}^{n} A^i is the product of all agents' action spaces, known as the joint action space. Although our results hold for general compact state and action spaces, in this paper we assume that they are finite, for simplicity. Further, r : S × A → ℝ is the joint reward function, P : S × A × S → [0, 1] is the transition probability kernel, γ ∈ [0, 1) is the discount factor, and d ∈ P(S) (where P(X) denotes the set of probability distributions over a set X) is the positive initial state distribution. In this work, we will also use the notation P(X) to denote the power set of a set X. At time step t ∈ ℕ, the agents are at state s_t (which may not be fully observable); they take independent actions a_t^i, ∀i ∈ N, drawn from their policies π^i(·^i | s_t) ∈ P(A^i), and equivalently, they take a joint action a_t = (a_t^1, . . . , a_t^n) drawn from their joint policy π(·|s_t) = ∏_{i=1}^{n} π^i(·^i | s_t) ∈ P(A). We write Π^i ≜ {×_{s∈S} π^i(·^i|s) | ∀s ∈ S, π^i(·^i|s) ∈ P(A^i)} to denote the policy space of agent i, and Π ≜ (Π^1, . . . , Π^n) to denote the joint policy space. It is important to note that when π^i(·^i|s) is a Dirac delta distribution, the policy is referred to as deterministic (Silver et al., 2014), and we write μ^i(s) to refer to its centre. Then, the environment emits the joint reward r(s_t, a_t) and moves to the next state s_{t+1} ∼ P(·|s_t, a_t) ∈ P(S). The initial state distribution d, the joint policy π, and the transition kernel P induce the (improper) marginal state distribution ρ_π(s) ≜ Σ_{t=0}^{∞} γ^t Pr(s_t = s | d, π). The agents aim to maximise the expected joint return, defined as J(π) = E_{s_0∼d, a_{0:∞}∼π, s_{1:∞}∼P}[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ].
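The factorised joint policy and the discounted joint return can be made concrete with a small numerical sketch. The following toy game (our own construction, purely illustrative, with arbitrary rewards and transitions) shows two agents sampling actions independently from their own policies while sharing one reward signal:

```python
import numpy as np

# Toy cooperative Markov game (illustrative only, not from the paper):
# n = 2 agents, |S| = 2 states, |A^i| = 2 actions per agent.
# Each agent draws its action independently from pi^i(.|s), so the joint
# action probability factorises as pi(a|s) = prod_i pi^i(a^i|s).
rng = np.random.default_rng(0)
n_agents, n_states, n_actions = 2, 2, 2
gamma = 0.9

# Per-agent policies pi^i(a^i|s): here uniform, shape (agent, state, action).
pi = np.full((n_agents, n_states, n_actions), 0.5)

# Transition kernel P(s'|s, a^1, a^2) (rows sum to 1) and joint reward
# r(s, a^1, a^2) in [0, 1); both arbitrary toy values.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
r = rng.random((n_states, n_actions, n_actions))

def rollout_return(T=200):
    """Monte-Carlo estimate of the discounted joint return from s_0 ~ d."""
    s = rng.integers(n_states)  # d: uniform initial state distribution
    ret = 0.0
    for t in range(T):
        # Each agent samples independently from its own policy.
        a = tuple(rng.choice(n_actions, p=pi[i, s]) for i in range(n_agents))
        ret += gamma**t * r[(s, *a)]          # shared joint reward
        s = rng.choice(n_states, p=P[(s, *a)])  # environment transition
    return ret

# Estimate of J(pi); bounded by r_max / (1 - gamma) since r < 1.
J_hat = np.mean([rollout_return() for _ in range(100)])
```

Since the rewards lie in [0, 1), the estimate is bounded by 1/(1 − γ) = 10, matching the geometric series in the definition of J(π).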
We adopt the most common solution concept for multi-agent problems, which is that of Nash equilibria (Nash, 1951; Yang & Wang, 2020). We say that a joint policy π_NE ∈ Π is a NE if none of the agents can increase the joint return by unilaterally altering its policy. More formally, π_NE is a NE if ∀i ∈ N, ∀π^i ∈ Π^i, J(π^i, π_NE^{-i}) ≤ J(π_NE).
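The unilateral-deviation condition can be checked exhaustively in a one-shot common-payoff game. The example below (our own construction, not from the paper) also illustrates why a NE need not maximise the joint return, which is the failure mode behind the suboptimality results mentioned in Section 1:

```python
import itertools
import numpy as np

# Shared payoff r(a^1, a^2) for a one-shot game with 2 agents x 2 actions.
# (0, 0) yields the optimal payoff 4; (1, 1) yields only 2.
R = np.array([[4.0, 0.0],
              [0.0, 2.0]])

def is_nash(joint_action, R):
    """True iff no single agent can strictly raise the shared payoff
    by unilaterally switching its own action."""
    base = R[joint_action]
    for i, n_actions in enumerate(R.shape):  # agent i deviates alone
        for alt in range(n_actions):
            deviation = list(joint_action)
            deviation[i] = alt
            if R[tuple(deviation)] > base:
                return False
    return True

nash_points = [a for a in itertools.product(range(2), range(2))
               if is_nash(a, R)]
# Both (0, 0) and (1, 1) are NEs: from (1, 1), either agent deviating
# alone drops the payoff from 2 to 0, even though (0, 0) is better.
```

This is why NE convergence alone is a weaker guarantee than optimality: agents can settle at (1, 1) with no incentive for any single agent to move.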




