CONTEXT AND HISTORY AWARE OTHER-SHAPING

Abstract

Cooperation failures, in which self-interested agents converge to collectively worst-case outcomes, are a common failure mode of multi-agent reinforcement learning methods. Methods such as Model-Free Opponent Shaping and The Good Shepherd address this issue by shaping their co-player's learning towards mutual cooperation. However, these methods either fail to capture important co-player learning dynamics or do not scale well to co-players parameterised by deep neural networks. To address these issues, we propose Context and History Aware Other-Shaping (CHAOS). A CHAOS agent is a meta-learner that learns to shape its co-player over multiple trials. CHAOS considers both the context (inter-episode information) and the history (intra-episode information) to shape co-players. CHAOS also scales successfully to shaping co-players parameterised by deep neural networks. In a set of experiments, we show that CHAOS achieves state-of-the-art shaping in matrix games. We provide extensive ablations motivating the importance of both context and history. CHAOS also successfully shapes co-players in a gridworld-based game, empirically demonstrating its scalability. Finally, we provide empirical evidence that, counterintuitively, the widely used Coin Game environment does not require history to learn shaping, because its states are often indicative of past actions, making it unsuitable for investigating shaping.

1. INTRODUCTION

Multi-agent learning has shown great success in strictly competitive (Silver et al., 2016) and fully cooperative settings (Foerster et al., 2019; Rashid et al., 2018). In competitive games, agents can learn Nash equilibrium strategies by iteratively best-responding to suitable mixtures of past opponents. Similarly, best-responding to rational co-players leads to the desirable equilibria in cooperative games (assuming joint training). In contrast, Nash equilibria often coincide with globally worst welfare outcomes in general-sum games, rendering the aforementioned learning paradigms ineffective. For example, in the iterated prisoner's dilemma (IPD) (Axelrod & Hamilton, 1981; Harper et al., 2017), naive best-response dynamics converge on unconditional mutual defection (Foerster et al., 2018). The above methods ignore a crucial factor: when multiple learning agents interact in a shared environment, the actions of one agent influence the environment and, often, the rewards of other agents, which in turn influence their learning dynamics. For example, a car merging into the middle lane in heavy traffic makes it unattractive for fellow collision-averse motorists to move to the middle lane at the same time.

Our paper investigates methods that allow agents to exploit this interconnection between their actions and the learning outcomes of other agents, and to leverage it to their advantage. Such "shaping" methods explicitly account for other agents' learning and have achieved promising results, e.g. discovering the prosocial tit-for-tat strategy in the IPD (Foerster et al., 2018; Letcher et al., 2019b; Willi et al., 2022; Balaguer et al., 2022; Lu et al., 2022). However, early shaping methods are myopic (shaping only the next learning step of the co-player), require white-box access to the co-player's parameters, and rely on higher-order derivatives.
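The convergence of naive learners to mutual defection can be seen in a minimal sketch (illustrative only, not the paper's method or experimental setup): two memoryless agents each maintain a cooperation probability and perform simultaneous gradient ascent on their own expected one-shot prisoner's dilemma payoff, using the common payoff convention (R, S, T, P) = (-1, -3, 0, -2). Because defection is dominant, each agent's gradient pushes its cooperation probability to zero regardless of the co-player.

```python
# Two memoryless learners doing simultaneous (naive) gradient ascent on
# their own expected prisoner's dilemma payoff. Payoffs use the common
# convention (R, S, T, P) = (-1, -3, 0, -2). Illustrative sketch only.

def expected_payoff(p, q):
    """Expected reward for a player cooperating w.p. p against a co-player
    cooperating w.p. q."""
    return (p * q * (-1)          # both cooperate
            + p * (1 - q) * (-3)  # cooperate vs. defect
            + (1 - p) * q * 0     # defect vs. cooperate
            + (1 - p) * (1 - q) * (-2))  # both defect

def grad_p(p, q, eps=1e-6):
    """Numerical gradient of a player's payoff w.r.t. its own strategy."""
    return (expected_payoff(p + eps, q) - expected_payoff(p - eps, q)) / (2 * eps)

p = q = 0.9  # both agents start mostly cooperative
for _ in range(200):
    dp, dq = grad_p(p, q), grad_p(q, p)
    # Simultaneous updates, clipped to valid probabilities.
    p = min(1.0, max(0.0, p + 0.01 * dp))
    q = min(1.0, max(0.0, q + 0.01 * dq))

print(p, q)  # both converge to 0.0: unconditional mutual defection
```

The gradient of each player's payoff with respect to its own cooperation probability is negative for every co-player strategy, so both probabilities decay to zero. This is the failure mode that shaping methods are designed to escape: a shaping agent instead accounts for how its strategy alters the co-player's gradient.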
To overcome these shortcomings, both Model-Free Opponent Shaping (Lu et al., 2022, M-FOS) and The Good Shepherd (Balaguer et al., 2022, GS) frame shaping as a meta-reinforcement-learning problem. In these approaches, the meta-agent learns to shape others by observing full training runs of the co-players in each meta-training episode before updating its policy. M-FOS and GS have shown promising empirical success. However, both methods have shortcomings: M-FOS's meta-agent outputs a policy parameterisation for the inner agent (similar to HyperNetworks (Ha et al., 2017)). This limits M-FOS to games where the

