REVISITING HIGHER-ORDER GRADIENT METHODS FOR MULTI-AGENT REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

This paper revisits Higher-Order Gradient (HOG) methods for Multi-Agent Reinforcement Learning (MARL). HOG methods are algorithms in which agents use higher-order gradient information to account for other agents' anticipated learning, and they have been shown to improve coordination in games with self-interested agents. So far, however, HOG methods have only been applied to games with low-dimensional state spaces because the computation and preservation of higher-order gradient information become inefficient as dimensionality grows. In this work, we address these limitations and propose a HOG framework that can be applied to games with higher-dimensional state spaces. Moreover, we show that current HOG methods, when applied to games with common-interested agents, i.e., team games, can lead to miscoordination among the agents. To solve this, we propose Hierarchical Reasoning (HR) to improve coordination in team games, and we experimentally show that our proposed HR significantly outperforms state-of-the-art methods in standard multi-agent games. With these contributions, we greatly broaden the applicability of HOG methods for MARL. For reproducibility, the code used for our work will be shared after the reviewing process.

1. INTRODUCTION

In multi-agent systems, the paradigm of agents reasoning about other agents has been explored extensively (Goodie et al., 2012; Liu & Lakemeyer, 2021). Recently, this paradigm has also been studied in the subfield of Multi-Agent Reinforcement Learning (MARL) (Wen et al., 2019; 2020; Konan et al., 2022). Generally speaking, MARL deals with several agents simultaneously learning and interacting in an environment. In the context of MARL, reasoning can be interpreted as accounting for the anticipated learning of other agents (Zhang & Lesser, 2010). As MARL relies on gradient-based optimization, learning anticipation naturally leads to the use of higher-order gradient information (Letcher et al., 2019). The so-called Higher-Order Gradient (HOG) methods use this extra gradient information to predict and, in some cases, shape the learning of other agents (Letcher et al., 2019). The importance of prediction and shaping has been demonstrated for various games, such as the Iterated Prisoner's Dilemma (IPD), where shaping ensures cooperation among the agents (Foerster et al., 2018a). However, current HOG methods have clear limitations: they only work for specific types of games and become inefficient as the dimensionality of the game increases. In this paper, we explore these limitations and propose a framework that extends the application scope of HOG methods to a broader range of problem settings in MARL.

The vast majority of existing HOG methods focus on games with low-dimensional state spaces, e.g., matrix games (Foerster et al., 2018a;b; Willi et al., 2022). Two challenges prevent HOG methods from being applied to games with high-dimensional state spaces: inefficient computation and inefficient preservation of higher-order gradient information. Specifically, current implementations of HOG methods require multiple data sampling stages to compute higher-order gradient information (Foerster et al., 2018b).
Moreover, the higher-order gradient information is applied and, more importantly, preserved in the policy network's parameter space. As a result, existing HOG methods become very inefficient when applied to games that have high-dimensional state spaces and therefore require high-dimensional parameter spaces. To solve this, we propose a HOG framework in which higher-order gradient information is computed and preserved more efficiently. By comparing our framework to existing HOG methods in well-controlled studies, we demonstrate that its performance and efficiency remain consistent as dimensionality increases, whereas those of existing HOG methods degrade drastically.

In addition to these dimensionality limitations, the generalizability of HOG methods to various types of games is questionable. HOG methods were originally proposed to improve cooperation in games with self-interested agents (Zhang & Lesser, 2010; Foerster et al., 2018a). So far, however, it is unclear how HOG methods perform when agents are fully cooperative, i.e., for common-interested agents in team games. We demonstrate that existing HOG methods can lead to miscoordination among common-interested agents, causing a sub-optimal overall reward. To solve this, and to improve the applicability of HOG methods to team games, we propose Hierarchical Reasoning (HR), a new HOG methodology explicitly developed to improve coordination in games with common-interested agents. Below, we summarize our contributions.

• We propose HOG-MADDPG, a framework that makes existing HOG methodologies, e.g., LookAhead (LA) (Zhang & Lesser, 2010) and Learning with Opponent-Learning Awareness (LOLA) (Foerster et al., 2018a), applicable to games with higher-dimensional state spaces by solving the limitations in computation and preservation of higher-order gradient information. With this framework, we develop two novel HOG methods, LA-MADDPG and LOLA-MADDPG, which apply the principles of LA and LOLA, respectively.
• We demonstrate theoretically, in a two-agent two-action coordination game, and empirically, in a two-agent three-action coordination game, that existing HOG methodologies can suffer from miscoordination among common-interested agents. To solve this, we propose the HR methodology and show, theoretically and empirically, that it overcomes miscoordination in these coordination games.

• We apply the HR principle to our HOG-MADDPG framework and develop HR-MADDPG, a HOG method for common-interested agents. We show that HR-MADDPG outperforms existing state-of-the-art methods on standard multi-agent games.
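To make the distinction between naive learning, prediction (LA), and shaping (LOLA) concrete, the following toy sketch compares the three update rules on a two-player zero-sum bilinear game, where all gradients can be written in closed form. The game, learning rate, and function names here are illustrative assumptions for exposition, not the implementation used in this paper.

```python
# Toy zero-sum bilinear game: L1(x, y) = x*y, L2(x, y) = -x*y.
# The unique fixed point is (0, 0). We compare three update rules:
#   naive     - simultaneous gradient descent (known to diverge here)
#   lookahead - each agent differentiates its loss at the opponent's
#               predicted next parameters, holding the prediction constant
#   lola      - as lookahead, but also differentiating *through* the
#               opponent's predicted update (the shaping term)

def step(x, y, lr, rule):
    if rule == "naive":
        gx, gy = y, -x
    elif rule == "lookahead":
        # Opponent's anticipated naive steps: dy = lr*x, dx = -lr*y.
        gx = y + lr * x          # dL1/dx at (x, y + dy), dy held constant
        gy = -(x - lr * y)       # dL2/dy at (x + dx, y), dx held constant
    elif rule == "lola":
        # Also differentiate through the opponent's anticipated update.
        gx = y + 2 * lr * x      # d/dx [x * (y + lr*x)]
        gy = -x + 2 * lr * y     # d/dy [-(x - lr*y) * y]
    return x - lr * gx, y - lr * gy

def run(rule, steps=300, lr=0.1):
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = step(x, y, lr, rule)
    return (x * x + y * y) ** 0.5  # distance from the fixed point

# The naive rule spirals outward, while lookahead and LOLA both
# contract toward (0, 0), with LOLA contracting faster in this game.
```

In this game the per-step contraction factors can be checked by hand: naive updates scale the squared norm by (1 + lr²), lookahead by ((1 − lr²)² + lr²) < 1, and LOLA by ((1 − 2·lr²)² + lr²), which is smaller still for small learning rates.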

2. RELATED WORKS

When direct communication among agents is not possible, the standard tool for MARL agents to apply reasoning is Agents Modeling Agents (AMA) (Albrecht & Stone, 2018). Although agents traditionally use AMA only to predict the behavior of others (He et al., 2016; Hong et al., 2018), recent studies have extended AMA to consider multiple levels of reasoning over the predicted behaviors (Wen et al., 2019; 2020). However, these approaches do not explicitly account for the other agents' anticipated learning, which has been shown to be beneficial in games where interaction among self-interested agents naturally leads to worst-case outcomes (Foerster et al., 2018a). HOG methods, on the other hand, are a range of methods that use higher-order gradient information to predict and, in some cases, directly shape the anticipated learning of other agents. However, as we explain in Section 4, HOG methods have only been applied to simple games due to the challenges in computing and preserving higher-order gradient information. Furthermore, the impact of current HOG methods on coordination among common-interested agents has not yet been fully investigated. Current analyses are limited to showing convergence to stable fixed points and non-convergence to unstable fixed points in differential games (Letcher et al., 2019). However, we demonstrate in Section 5.1 that, in a two-agent two-action coordination game with unstable fixed points, HOG methods can converge to miscoordination points. The focus of this work is to extend current HOG methodology so that it can be used for games with higher-dimensional state spaces and common-interested agents.

3. BACKGROUND

We formulate the MARL setup as a Markov Game (MG) (Littman, 1994). An MG is a tuple (N, S, {A_i}_{i∈N}, {R_i}_{i∈N}, T, ρ, γ), where N is the set of agents (|N| = n), S is the set of states, A_i is the set of actions of agent i, R_i : S × A_1 × ... × A_n → R is the reward function of agent i, T : S × A_1 × ... × A_n → Δ(S) is the state transition function, ρ is the initial state distribution, and γ ∈ [0, 1) is the discount factor.
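As a concrete illustration of the MG tuple, the following minimal Python container instantiates a one-state, two-agent coordination game. The container and the field names are hypothetical conveniences for this sketch, not part of the paper's formalism or codebase.

```python
from typing import Callable, NamedTuple

# Minimal container mirroring the Markov Game tuple
# (N, S, {A_i}, {R_i}, T, rho, gamma); names are illustrative only.
class MarkovGame(NamedTuple):
    agents: tuple        # N, the set of agents
    states: tuple        # S, the set of states
    actions: dict        # {A_i}: agent -> that agent's action set
    rewards: Callable    # R_i: (state, joint_action, agent) -> reward
    transition: Callable # T: (state, joint_action) -> next-state distribution
    rho: dict            # initial state distribution
    gamma: float         # discount factor

# A one-state, two-agent coordination game: both agents receive
# reward 1 when they pick the same action, and 0 otherwise.
coord = MarkovGame(
    agents=(0, 1),
    states=("s",),
    actions={0: ("a", "b"), 1: ("a", "b")},
    rewards=lambda s, ja, i: 1.0 if ja[0] == ja[1] else 0.0,
    transition=lambda s, ja: {"s": 1.0},  # single state loops to itself
    rho={"s": 1.0},
    gamma=0.95,
)
```

In this common-interested (team) instance all agents share one reward function; a game with self-interested agents would instead give each agent i its own R_i.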

