CONTEXT AND HISTORY AWARE OTHER-SHAPING

Abstract

Cooperation failures, in which self-interested agents converge to collectively worst-case outcomes, are a common failure mode of Multi-Agent Reinforcement Learning methods. Methods such as Model-Free Opponent Shaping and The Good Shepherd address this issue by shaping their co-player's learning into mutual cooperation. However, these methods either fail to capture important co-player learning dynamics or do not scale well to co-players parameterised by deep neural networks. To address these issues, we propose Context and History Aware Other-Shaping (CHAOS). A CHAOS agent is a meta-learner that learns to shape its co-player over multiple trials. CHAOS considers both the context (inter-episode information) and the history (intra-episode information) to shape co-players, and successfully scales to co-players parameterised by deep neural networks. In a set of experiments, we show that CHAOS achieves state-of-the-art shaping in matrix games. We provide extensive ablations motivating the importance of both context and history. CHAOS also successfully shapes on a gridworld-based game, demonstrating its scalability empirically. Finally, we provide empirical evidence that, counterintuitively, the widely-used Coin Game environment does not require history to learn shaping, because states are often indicative of past actions, making it unsuitable for investigating shaping.

1. INTRODUCTION

Multi-agent learning has shown great success in strictly competitive (Silver et al., 2016) and fully cooperative settings (Foerster et al., 2019; Rashid et al., 2018). In competitive games, agents can learn Nash equilibrium strategies by iteratively best-responding to suitable mixtures of past opponents. Similarly, best-responding to rational co-players leads to desirable equilibria in cooperative games (assuming joint training). In contrast, Nash equilibria often coincide with globally worst welfare outcomes in general-sum games, rendering the aforementioned learning paradigms ineffective. For example, in the iterated prisoner's dilemma (IPD) (Axelrod & Hamilton, 1981; Harper et al., 2017), naive best-response dynamics converge on unconditional mutual defection (Foerster et al., 2018).

The above methods ignore a crucial factor: when multiple learning agents interact in a shared environment, the actions of one agent influence the environment and, often, the rewards of other agents, which in turn influence their learning dynamics. For example, a car merging into the middle lane in heavy traffic makes it unattractive for fellow collision-averse motorists to move into the middle lane at the same time. Our paper investigates methods which allow agents to exploit this interconnection between their actions and the learning outcomes of other agents and leverage it to their advantage. Such "shaping" methods explicitly account for other agents' learning and have achieved promising results, e.g. discovering the prosocial tit-for-tat strategy in the IPD (Foerster et al., 2018; Letcher et al., 2019b; Willi et al., 2022; Balaguer et al., 2022; Lu et al., 2022). However, early shaping methods are myopic (they only shape the next learning step of the co-player), require white-box access to the co-player's parameters, and rely on higher-order derivatives.
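As a concrete illustration of the failure mode described above, the following sketch runs naive best-response dynamics in a one-shot prisoner's dilemma. The payoff values are the standard textbook ones, an assumption for illustration rather than a matrix taken from the paper.

```python
# Naive best-response dynamics in the one-shot prisoner's dilemma.
# Actions: 0 = cooperate, 1 = defect.
PAYOFF = {  # (row_action, col_action) -> (row_reward, col_reward)
    (0, 0): (-1, -1),  # mutual cooperation
    (0, 1): (-3,  0),  # row player exploited
    (1, 0): ( 0, -3),  # column player exploited
    (1, 1): (-2, -2),  # mutual defection
}

def best_response(opponent_action, player):
    """Pick the action maximising own payoff against a fixed opponent action."""
    if player == 0:
        return max((0, 1), key=lambda a: PAYOFF[(a, opponent_action)][0])
    return max((0, 1), key=lambda a: PAYOFF[(opponent_action, a)][1])

a, b = 0, 0  # start from mutual cooperation
for _ in range(5):
    # Both agents simultaneously best-respond to the other's last action.
    a, b = best_response(b, 0), best_response(a, 1)
print(a, b)  # converges to mutual defection: 1 1
```

Even starting from mutual cooperation, defection strictly dominates, so best-response dynamics immediately fall into the (defect, defect) Nash equilibrium with the worst joint welfare.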
To overcome these shortcomings, both Model-Free Opponent Shaping (Lu et al., 2022, M-FOS) and The Good Shepherd (Balaguer et al., 2022, GS) frame shaping as a meta reinforcement learning problem. In these approaches, the meta-agent learns to shape others by observing full training runs of the co-players in each meta-training episode before updating its policy. M-FOS and GS showed promising empirical success. However, both methods have shortcomings. M-FOS's meta-agent outputs a policy parameterisation for the inner agent (similar to HyperNetworks (Ha et al., 2017)). This limits M-FOS to games where policies can be represented compactly, such as infinitely-iterated matrix games. While M-FOS reports results in a higher-dimensional game (in which neural networks represent the policies), it relies on a hierarchical architecture. GS does not output whole parameterisations but instead keeps its policy fixed for the entire duration of a trial. This prevents GS from using the training context to shape the co-player adaptively.

To address both issues, we propose Context and History Aware Other-Shaping¹ (CHAOS). In CHAOS, the meta-agent and the inner agent it controls are parameterised by a single recurrent neural network (RNN). A CHAOS agent meta-learns by retaining its hidden state throughout an entire meta-episode, similar to RL² (Duan et al., 2016) in single-agent RL. This hidden state enables CHAOS agents to react to two components of the co-player's learning: the context (inter-episode learning) and the history (intra-episode behaviour). In shaping problems, history captures the co-player's current policy, whilst context captures the co-player's learning rule. Together, these enable CHAOS to shape agents dynamically. Combining the meta-agent and the inner agent into one recurrent meta-learner avoids outputting policy parameterisations, unlike M-FOS.
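To make the architectural idea concrete, here is a minimal, purely illustrative sketch (not the authors' implementation) of a recurrent agent whose hidden state is reset only at meta-episode boundaries, so it can accumulate both intra-episode history and inter-episode context. All class names, dimensions, and the greedy action rule are hypothetical simplifications.

```python
import math
import random

class ChaosLikeAgent:
    """Illustrative recurrent policy: one network serves as both the
    meta-agent and the inner agent, as in an RL^2-style meta-learner."""

    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        rng = random.Random(seed)
        # Random fixed weights; a real agent would train these (e.g. via ES).
        self.W_in = [[rng.gauss(0, 0.1) for _ in range(obs_dim)]
                     for _ in range(hidden_dim)]
        self.W_h = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)]
                    for _ in range(hidden_dim)]
        self.W_out = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)]
                      for _ in range(n_actions)]

    def step(self, obs, h):
        # Simple tanh RNN cell: h' = tanh(W_in @ obs + W_h @ h).
        h_new = [math.tanh(sum(w * o for w, o in zip(row_in, obs)) +
                           sum(w * v for w, v in zip(row_h, h)))
                 for row_in, row_h in zip(self.W_in, self.W_h)]
        logits = [sum(w * v for w, v in zip(row, h_new)) for row in self.W_out]
        action = max(range(len(logits)), key=logits.__getitem__)  # greedy for brevity
        return action, h_new

agent = ChaosLikeAgent(obs_dim=2, hidden_dim=8, n_actions=2)
h = [0.0] * 8                      # hidden state reset ONCE per meta-episode
for episode in range(3):           # the co-player would learn between episodes
    for step in range(4):
        obs = [1.0, 0.0]           # placeholder observation
        action, h = agent.step(obs, h)  # h is never reset within the trial
```

The key design point is the reset schedule: because `h` persists across inner episodes, the same recurrent state can encode what the co-player currently does (history) and how it has been changing across episodes (context).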
We show that CHAOS discovers a ZD-extortion-like strategy in the finitely-iterated prisoner's dilemma (a more challenging setting than the infinitely-iterated PD: the environment is non-differentiable and policies cannot be represented compactly). Moreover, we show that CHAOS matches or outperforms GS and M-FOS in iterated matrix games. CHAOS also matches state-of-the-art shaping against memory-based agents in the Coin Game, a grid-based environment where deep neural networks represent policies. To summarise, our contributions are:
• We introduce CHAOS, a shaping method capturing both learning context and history, suitable for high-dimensional games.
• We formalise the concepts of history and context for shaping and analyse their respective roles empirically.
• We demonstrate state-of-the-art performance on a set of iterated matrix games.
• We identify a fundamental problem in the widely-used Coin Game.

2. RELATED WORK

Opponent Shaping. Many methods exist that explicitly account for their opponent's learning. Just like CHAOS, these approaches recognise that the actions of any one agent influence their co-players' policies and seek to use this mechanism to their advantage (Foerster et al., 2018; Letcher et al., 2019a; Kim et al., 2021a; Willi et al., 2022). However, in contrast to CHAOS, these approaches require privileged information to shape their opponents. These methods are also myopic, since anticipating many learning steps ahead is intractable. Balaguer et al. (2022) and Lu et al. (2022) solve the issues above by framing opponent shaping as a meta reinforcement learning problem, which CHAOS inherits and builds upon. The specific differences to M-FOS and GS are the subject of Section 4.

Opponent Modelling. Similarly to our work, opponent modelling tries to disentangle some aspects of other agents' policies from the environment (Mealing & Shapiro, 2017; Raileanu et al., 2018; Tacchetti et al., 2018). In contrast to our work, these approaches do not consider agents as learners. Furthermore, they do not observe agents at different stages of learning and thus, whilst modelling co-players as non-stationary, do not observe learning dynamics (Synnaeve & Bessière, 2011). Finally, CHAOS does not explicitly model any aspect of the opponent.

Multi-Agent Meta-Learning. Multi-agent meta-learning approaches have also shown success in mixed games with other learners (Al-Shedivat et al., 2018; Kim et al., 2021b; Wu et al., 2021). Similar to CHAOS, they take inspiration from meta reinforcement learning: their approach is to learn the optimal initial parameterisation for the shaper, akin to Model-Agnostic Meta-Learning (Finn et al., 2017). In contrast, CHAOS uses an approach similar to RL² (Duan et al., 2016), which trains an RNN-based agent to implement efficient learning for its next task.
Furthermore, CHAOS is optimised using evolution strategies (Salimans et al., 2017), allowing it to consider much longer time horizons than policy-gradient methods (Schulman et al., 2017).
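For reference, the basic evolution-strategies gradient estimator in the spirit of Salimans et al. (2017) can be sketched on a toy objective as follows. The fitness function, population size, and hyperparameters here are illustrative assumptions; the actual method would optimise the shaper's meta-return over whole trials instead.

```python
import random

def es_step(theta, fitness, rng, pop=50, sigma=0.1, lr=0.05):
    """One ES update: theta <- theta + lr/(pop*sigma) * sum_i F_i * eps_i."""
    baseline = fitness(theta)  # simple variance-reduction baseline
    grad = [0.0] * len(theta)
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        f = fitness([t + sigma * e for t, e in zip(theta, eps)]) - baseline
        for i, e in enumerate(eps):
            grad[i] += f * e / (pop * sigma)  # ES gradient estimator
    return [t + lr * g for t, g in zip(theta, grad)]

# Toy objective: maximise -||theta - 3||^2 (optimum at theta = [3, 3]).
fitness = lambda th: -sum((t - 3.0) ** 2 for t in th)
rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(200):
    theta = es_step(theta, fitness, rng)
```

Because ES only needs the scalar return of each perturbed parameter vector, the objective may span an entire (non-differentiable) trial, which is what makes long-horizon shaping tractable.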



"Other" breaks with the line of seminal work on opponent shaping, but highlights the general-sum aspect

