REVISITING HIGHER-ORDER GRADIENT METHODS FOR MULTI-AGENT REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

This paper revisits Higher-Order Gradient (HOG) methods for Multi-Agent Reinforcement Learning (MARL). HOG methods are algorithms in which agents use higher-order gradient information to account for the anticipated learning of other agents, and they have been shown to improve coordination in games with self-interested agents. So far, however, HOG methods have only been applied to games with low-dimensional state spaces because computing and preserving higher-order gradient information is inefficient. In this work, we address these limitations and propose a HOG framework that can be applied to games with higher-dimensional state spaces. Moreover, we show that current HOG methods, when applied to games with common-interest agents, i.e., team games, can lead to miscoordination among the agents. To address this, we propose Hierarchical Reasoning (HR) to improve coordination in team games, and we experimentally show that HR significantly outperforms state-of-the-art methods in standard multi-agent games. With these contributions, we greatly broaden the applicability of HOG methods for MARL. For reproducibility, the code used in our work will be shared after the reviewing process.

1. INTRODUCTION

In multi-agent systems, the paradigm of agents reasoning about other agents has been explored extensively (Goodie et al., 2012; Liu & Lakemeyer, 2021). Recently, this paradigm has been studied in the subfield of Multi-Agent Reinforcement Learning (MARL) (Wen et al., 2019; 2020; Konan et al., 2022). Generally speaking, MARL deals with several agents simultaneously learning and interacting in an environment. In the context of MARL, reasoning can be interpreted as accounting for the anticipated learning of other agents (Zhang & Lesser, 2010). As MARL relies on gradient-based optimization, learning anticipation naturally leads to the use of higher-order gradient information (Letcher et al., 2019). The so-called Higher-Order Gradient (HOG) methods use this extra gradient information to predict and, in some cases, shape the learning of other agents (Letcher et al., 2019). The importance of prediction and shaping has been demonstrated for various games, such as the Iterated Prisoner's Dilemma (IPD), where shaping ensures cooperation among the agents (Foerster et al., 2018a). However, current HOG methods have clear limitations: they only work for specific types of games and become inefficient as the dimensionality of the game increases. In this paper, we explore these limitations and propose a framework that extends the application scope of HOG methods to a broader range of problem settings in MARL.

The vast majority of existing HOG methods focus only on games with low-dimensional state spaces, e.g., matrix games (Foerster et al., 2018a;b; Willi et al., 2022). Two challenges prevent HOG methods from being applied to games with high-dimensional state spaces: inefficient computation and inefficient preservation of higher-order gradient information. Specifically, current implementations of HOG methods require multiple data sampling stages to compute higher-order gradient information (Foerster et al., 2018b).
Moreover, the higher-order gradient information is applied and, more importantly, preserved in the policy network's parameter space. As a result, existing HOG methods become very inefficient when applied to games that have high-dimensional state spaces and therefore require high-dimensional parameter spaces. In this paper, to solve this, we propose a HOG framework in which higher-order gradient information is computed and preserved more efficiently. By comparing our proposed framework to existing HOG methods in well-controlled studies, we demonstrate that the overall performance and efficiency of our proposed framework stay
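To make the idea of learning anticipation via higher-order gradients concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of a LOLA-style update (Foerster et al., 2018a) on a two-player bilinear game with scalar parameters, where the losses are L1 = t1*t2 and L2 = -t1*t2. Each shaping agent assumes its opponent takes one naive gradient step and differentiates through that step, which adds a cross second-derivative correction term; the game, step sizes, and function names here are illustrative choices, and the correction terms are derived analytically for this toy game rather than via automatic differentiation.

```python
# Toy sketch of a LOLA-style higher-order gradient update (illustrative only).
# Game: L1 = t1 * t2, L2 = -t1 * t2 (a bilinear "matching pennies" game).
import math

def naive_grads(t1, t2):
    # Plain simultaneous-gradient terms: dL1/dt1 = t2, dL2/dt2 = -t1.
    return t2, -t1

def lola_grads(t1, t2, eta=1.0):
    # LOLA-style correction, worked out analytically for this game:
    # grad1 = dL1/dt1 - eta * d2L2/(dt1 dt2) * dL1/dt2 = t2 + eta * t1
    # grad2 = dL2/dt2 - eta * d2L1/(dt2 dt1) * dL2/dt1 = -t1 + eta * t2
    return t2 + eta * t1, -t1 + eta * t2

def run(grad_fn, steps=200, lr=0.1):
    # Simultaneous gradient descent from (1, 1); returns final parameter norm.
    t1, t2 = 1.0, 1.0
    for _ in range(steps):
        g1, g2 = grad_fn(t1, t2)
        t1, t2 = t1 - lr * g1, t2 - lr * g2
    return math.hypot(t1, t2)

print(run(naive_grads))  # naive dynamics spiral outward (norm grows)
print(run(lola_grads))   # the second-order term damps the cycle toward (0, 0)
```

In this toy setting, naive simultaneous gradient descent cycles and slowly diverges, while the higher-order correction damps the rotation and converges, which is the kind of behavior that motivates HOG methods in the first place.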

