HETEROGENEOUS-AGENT MIRROR LEARNING

Abstract

The necessity for cooperation among independent intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in the artificial intelligence (AI) research community. However, many research endeavours have focused on developing practical MARL algorithms whose effectiveness has been studied only empirically, and which therefore lack theoretical guarantees. As recent studies have revealed, MARL methods often achieve performance that is unstable in terms of reward monotonicity or suboptimal at convergence. To resolve these issues, in this paper, we introduce a novel framework named Heterogeneous-Agent Mirror Learning (HAML) that provides a general template for MARL actor-critic algorithms. We prove that algorithms derived from the HAML template satisfy the desired properties of monotonic improvement of the joint reward and convergence to a Nash equilibrium. We verify the practicality of HAML by proving that the current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO, are in fact HAML instances. Next, as a natural outcome of our theory, we propose HAML extensions of two well-known RL algorithms, HAA2C (for A2C) and HADDPG (for DDPG), and demonstrate their effectiveness against strong baselines on StarCraft II and Multi-Agent MuJoCo tasks.

1. INTRODUCTION

While the policy gradient (PG) formula has long been known in the reinforcement learning (RL) community (Sutton et al., 2000), it was not until trust region learning (Schulman et al., 2015a) that deep RL algorithms started to successfully solve complex tasks such as real-world robotic control. Nowadays, methods that follow the trust-region framework, including TRPO (Schulman et al., 2015a), PPO (Schulman et al., 2017), and their extensions (Schulman et al., 2015b; Hsu et al., 2020), have become effective tools for solving challenging AI problems (Berner et al., 2019). It was believed that the key to their success is the rigorously described stability and the monotonic improvement property of the trust-region learning that they approximate. This reasoning, however, is of limited scope, since it fails to explain why some algorithms that follow it (e.g., PPO-KL) largely underperform in contrast to the success of others (e.g., PPO-clip) (Schulman et al., 2017). Furthermore, the trust-region interpretation of PPO has been formally rejected by recent studies, both empirically (Engstrom et al., 2020) and theoretically (Wang et al., 2020); these revealed that the algorithm violates the trust-region constraints: it neither constrains the KL-divergence between two consecutive policies, nor does it bound their likelihood ratios. These findings suggest that, while the number of available RL algorithms grows, our understanding of them does not, and the algorithms often come without theoretical guarantees. Only recently, Kuba et al. (2022b) showed that well-known algorithms, such as PPO, are in fact instances of the so-called mirror learning framework, within which any induced algorithm is theoretically sound. On a high level, methods that fall into this class optimise the mirror objective, which shapes an advantage surrogate by means of a drift functional, a quasi-distance between policies.
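As a rough sketch (the notation here is an assumption for illustration, not taken verbatim from Kuba et al. (2022b); the sampling distributions and the symbol for the drift are chosen for readability), a mirror-learning update at iteration $k$ can be written as maximising an advantage surrogate penalised by a drift functional $\mathfrak{D}$:

```latex
\pi_{k+1} \in \arg\max_{\bar{\pi}}\;
  \mathbb{E}_{s,\,a \sim \bar{\pi}}\big[ A_{\pi_k}(s, a) \big]
  \;-\;
  \mathbb{E}_{s}\big[ \mathfrak{D}_{\pi_k}(\bar{\pi} \mid s) \big],
```

where the drift $\mathfrak{D}_{\pi_k}(\bar{\pi} \mid s)$ is non-negative and vanishes at $\bar{\pi} = \pi_k$, so the maximiser can never score worse than the current policy on the surrogate. Intuitively, setting the drift to zero recovers a greedy policy-improvement step, while richer choices of drift recover practical algorithms such as PPO-clip, as shown by Kuba et al. (2022b).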
Such an update provably leads to monotonic improvement of the return, as well as convergence to the optimal policy. The mirror learning result offers RL researchers strong confidence that there exists a connection between an algorithm's practicality and its theoretical properties, and it assures the soundness of common RL practice.

While the lack of theoretical guarantees has been a severe problem in RL, in multi-agent reinforcement learning (MARL) it has only been exacerbated. Although the PG theorem has been successfully extended to a multi-agent PG (MAPG) version (Zhang et al., 2018), it has only recently

