ORACLES & FOLLOWERS: STACKELBERG EQUILIBRIA IN DEEP MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Stackelberg equilibria arise naturally in a range of popular learning problems, such as security games and indirect mechanism design, and have received increasing attention in the reinforcement learning literature. We present a general framework for implementing Stackelberg equilibrium search as a multi-agent RL problem, allowing a wide range of algorithmic design choices. We discuss how previous approaches can be seen as specific instantiations of this framework. As a key insight, we note that the design space admits approaches not previously seen in the literature, for instance leveraging multitask and meta-RL techniques for follower convergence. We propose one such approach using contextual policies and evaluate it experimentally on standard and novel benchmark domains, showing greatly improved sample efficiency compared to previous approaches. Finally, we explore the effect of adopting designs outside the boundaries of our framework.

1. INTRODUCTION

Stackelberg equilibria are an important concept in economics, and in recent years they have received increasing attention in computer science, specifically in the multi-agent learning community. These equilibria arise in an asymmetric setting: a leader commits to a strategy, and one or more followers respond. The leader aims to maximize their reward, knowing that the followers will in turn best-respond to the leader's choice of strategy. Such equilibria appear in a wide range of settings. In security games, a defender wishes to choose an optimal strategy knowing that attackers will adapt to it (An et al., 2017; Sinha et al., 2018). In mechanism design, we aim to design a mechanism that allocates resources efficiently, knowing that participants may strategize (Nisan & Ronen, 1999; Swamy, 2007; Brero et al., 2021a). More broadly, many multi-agent system design problems can be viewed as Stackelberg equilibrium problems: we as the designer take on the role of the Stackelberg leader, wishing to design a system that is robust to agent behavior. We are particularly interested in Stackelberg equilibria in sequential decision-making settings, i.e., stochastic Markov games, and in using multi-agent reinforcement learning techniques to learn these equilibria. In this paper we:

1. Introduce a new theoretical framework for framing Stackelberg equilibrium search as a multi-agent reinforcement learning problem,

2. Discuss how a range of existing approaches fit into this paradigm, as well as where large unexplored areas remain in the design space,

3. Propose a novel approach to accelerating follower best-response convergence, borrowing ideas from multitask and meta-RL, together with an experimental evaluation, and

4. Elaborate on several important conditions for Stackelberg convergence, and demonstrate how learning can fail when these conditions are not met.
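To make the leader/follower asymmetry concrete, the following is a minimal sketch of a pure-strategy Stackelberg equilibrium in a finite bimatrix game, computed by enumeration. The payoff matrices and the helper `stackelberg_pure` are illustrative constructions, not taken from the paper; they are chosen so that commitment has value: the leader's first action dominates in simultaneous play, yet committing to the second action induces a follower response that yields the leader a higher payoff.

```python
import numpy as np

# Hypothetical 2x2 bimatrix game (illustrative values, not from the paper).
# Rows index leader actions, columns index follower actions.
L = np.array([[2.0, 4.0],
              [1.0, 3.0]])  # leader payoffs
F = np.array([[1.0, 0.0],
              [0.0, 2.0]])  # follower payoffs

def stackelberg_pure(L, F):
    """Return (leader_action, follower_action, leader_payoff) of a
    pure-strategy Stackelberg equilibrium, found by enumerating the
    leader's commitments and letting the follower best-respond."""
    best = None
    for a in range(L.shape[0]):
        b = int(np.argmax(F[a]))  # follower's best response to commitment a
        if best is None or L[a, b] > best[2]:
            best = (a, b, float(L[a, b]))
    return best

a, b, v = stackelberg_pure(L, F)
# Committing to action 1 induces follower action 1 and leader payoff 3.0,
# beating the payoff of 2.0 the leader gets from its dominant action 0.
```

In this toy game, row 0 dominates row 1 for the leader, so without commitment the leader would play action 0 and earn 2.0; by committing to action 1, the leader shifts the follower's best response and earns 3.0. The multi-agent RL setting studied in the paper generalizes exactly this enumeration-plus-best-response structure to stochastic Markov games.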
Our main theorem (Theorem 1) gives a black-box reduction from learning Stackelberg equilibria to separate leader and follower learning problems. This framing encompasses and generalizes several prior approaches from the literature, in particular Brero et al. (2022), and opens a large design space beyond what has been explored previously. Our second main technical contribution is applying contextual policies, a common tool in multitask and meta-RL (Wang et al., 2016), to the follower learning problem. In doing so, followers can generalize, and quickly adapt to leader

