ORACLES & FOLLOWERS: STACKELBERG EQUILIBRIA IN DEEP MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Stackelberg equilibria arise naturally in a range of popular learning problems, such as in security games or indirect mechanism design, and have received increasing attention in the reinforcement learning literature. We present a general framework for implementing Stackelberg equilibria search as a multi-agent RL problem, allowing a wide range of algorithmic design choices. We discuss how previous approaches can be seen as specific instantiations of this framework. As a key insight, we note that the design space allows for approaches not previously seen in the literature, for instance by leveraging multitask and meta-RL techniques for follower convergence. We propose one such approach using contextual policies and evaluate it experimentally on standard and novel benchmark domains, showing greatly improved sample efficiency compared to previous approaches. Finally, we explore the effect of adopting designs outside the borders of our framework.

1. INTRODUCTION

Stackelberg equilibria are an important concept in economics and have in recent years received increasing attention in computer science, specifically in the multi-agent learning community. These equilibria describe an asymmetric setting: a leader who commits to a strategy, and one or more followers who respond. The leader aims to maximize their reward, knowing that the followers will in turn best-respond to the leader's choice of strategy. Such equilibria appear in a wide range of settings. In security games, a defender wishes to choose an optimal strategy, knowing that attackers will adapt to it (An et al., 2017; Sinha et al., 2018). In mechanism design, we aim to design a mechanism that allocates resources efficiently, knowing that participants may strategize (Nisan & Ronen, 1999; Swamy, 2007; Brero et al., 2021a). More broadly, many multi-agent system design problems can be viewed as Stackelberg equilibrium problems: we as the designer take on the role of the Stackelberg leader, wishing to design a system that is robust to agent behavior. We are particularly interested in Stackelberg equilibria in sequential decision-making settings, i.e., stochastic Markov games, and in using multi-agent reinforcement learning techniques to learn these equilibria. In this paper we:

1. Introduce a new theoretical framework for framing Stackelberg equilibrium search as a multi-agent reinforcement learning problem,
2. Discuss how a range of existing approaches fit into this paradigm, as well as where large areas of the design space remain unexplored,
3. Present a novel approach to accelerating follower best-response convergence, borrowing ideas from multitask and meta-RL, including an experimental evaluation, and
4. Elaborate on several important conditions for Stackelberg convergence, and demonstrate how learning can fail when these conditions are not met.
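The value of leader commitment can be made concrete with a small worked example (the game, payoffs, and function names below are our illustration, not taken from the paper). In the classic 2x2 game used to motivate Stackelberg equilibria, the leader commits to a mixed strategy, the follower best-responds (with ties broken in the leader's favor, the standard convention), and commitment earns the leader strictly more than any Nash outcome of the simultaneous-move game:

```python
import numpy as np

# Rows: leader actions; columns: follower actions.
# Illustrative payoffs chosen so that mixed commitment strictly helps the leader.
L = np.array([[2.0, 4.0],
              [1.0, 3.0]])   # leader payoffs
F = np.array([[1.0, 0.0],
              [0.0, 1.0]])   # follower payoffs

def stackelberg_value(L, F, grid=10001):
    """Grid-search the leader's mixed commitment p = P(row 0);
    the follower best-responds, ties broken toward the leader."""
    best, best_p = -np.inf, None
    for p in np.linspace(0.0, 1.0, grid):
        mix = np.array([p, 1.0 - p])
        f_payoffs = mix @ F                     # follower's expected payoff per action
        br = np.flatnonzero(np.isclose(f_payoffs, f_payoffs.max()))
        value = max((mix @ L)[a] for a in br)   # leader-favorable tie-breaking
        if value > best:
            best, best_p = value, p
    return best, best_p

v, p = stackelberg_value(L, F)
print(round(v, 3), round(p, 3))   # -> 3.5 0.5
```

Here the best pure-strategy outcome in the simultaneous game gives the leader at most 2, while committing to the mixed strategy (0.5, 0.5) makes the follower indifferent, and the leader-favorable tie yields 3.5.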
Our main theorem (Theorem 1) gives a black-box reduction from learning Stackelberg equilibria to separate leader and follower learning problems. This framing encompasses and generalizes several prior approaches from the literature, in particular Brero et al. (2022), and opens a large design space beyond what has been explored previously. Our second main technical contribution is applying contextual policies, a common tool in multitask and meta-RL (Wang et al., 2016), to the follower learning problem. With contextual policies, followers can generalize and quickly adapt to leader policies. We validate this approach in experiments and show greatly reduced sample complexity compared to previous inner-loop/outer-loop approaches. We also show how violating the conditions of our theorem can lead to complete failure of the learning process, consistently across underlying algorithms.

In the remainder of the paper, we introduce Stackelberg equilibria and Markov games in Section 2. In Section 3, we motivate and define our framework for learning Stackelberg equilibria using multi-agent RL and discuss its scope and limitations. We define our novel contextual-policy oracle in Section 4 and empirically evaluate it in Section 5 on existing and novel benchmark domains.
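The core idea of a contextual follower policy can be sketched in a few lines. This is a hypothetical minimal sketch, not the authors' implementation: the follower's policy takes, alongside its own observation, a context vector summarizing the current leader policy, so a single policy can produce (approximate) best responses to many different leaders instead of being retrained from scratch in every outer-loop iteration. All class and parameter names here are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

class ContextualFollowerPolicy:
    """A linear policy over the concatenation [observation; leader context]."""
    def __init__(self, obs_dim, ctx_dim, n_actions):
        self.W = rng.normal(scale=0.1, size=(n_actions, obs_dim + ctx_dim))

    def action_probs(self, obs, leader_ctx):
        # Conditioning on leader_ctx lets one set of weights respond
        # differently to different leader policies.
        x = np.concatenate([obs, leader_ctx])
        return softmax(self.W @ x)

policy = ContextualFollowerPolicy(obs_dim=3, ctx_dim=2, n_actions=2)
probs = policy.action_probs(np.ones(3), np.array([0.2, 0.8]))
print(probs.shape, round(probs.sum(), 6))   # -> (2,) 1.0
```

In practice the context would be learned or derived from the leader's parameters or recent behavior; the linear layer stands in for whatever function approximator the follower's RL algorithm uses.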

2. PRELIMINARIES

Markov games. We consider partially observable stochastic Markov games, essentially a multi-agent generalization of a partially observable Markov decision process (POMDP).

Definition 1 (Markov Game). A Markov game M with n agents is a tuple (S, A, T, r, Ω, O, γ), consisting of a state space S, an action space A = (A_1, ..., A_n), a (stochastic) transition function T : S × A → S, a (stochastic) reward function r : S × A → R^n, an observation space Ω = (Ω_1, ..., Ω_n), a (stochastic) observation function O : S × A → Ω, and a discount factor γ.

At each step t of the game, every agent i chooses an action a_{i,t} from its action space A_i, the game state evolves according to the joint action (a_{1,t}, ..., a_{n,t}) and the transition function T, and agents receive observations and rewards according to O and r. An agent's behavior in the game is characterized by its policy π_i : o_i → a_i, which maps observations to actions. Each agent in a Markov game aims to maximize its expected discounted cumulative reward.



Footnote: To keep notation concise we discuss the memory-less case here, but all our results generalize to a stateful leader policy in a straightforward manner, as we discuss in Appendix B.
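Definition 1 can be made concrete with a minimal sketch. The specific dynamics below (a two-state, two-agent coordination game, and names like `ToyMarkovGame`) are illustrative assumptions of ours, not an environment from the paper; the sketch just instantiates the tuple (S, A, T, r, Ω, O, γ) for n = 2 in the fully observed special case Ω_i = S:

```python
class ToyMarkovGame:
    """A two-agent Markov game: S = {0, 1}, A_i = {0, 1}, fully observed."""

    def __init__(self, gamma=0.95):
        self.gamma = gamma   # discount factor γ
        self.state = 0       # initial state in S

    def step(self, a1, a2):
        # Transition T: move to state 1 iff the agents coordinate.
        self.state = 1 if a1 == a2 else 0
        # Reward r : S × A -> R^2: both agents are paid in the coordinated state.
        rewards = (1.0, 1.0) if self.state == 1 else (0.0, 0.0)
        # Observation O: each agent observes the full state (Ω_i = S).
        obs = (self.state, self.state)
        return obs, rewards

game = ToyMarkovGame()
obs, rewards = game.step(1, 1)   # joint action (1, 1): coordinated
print(obs, rewards)              # -> (1, 1) (1.0, 1.0)
```

A policy π_i in this sketch is any map from an agent's observation to an action; partial observability would replace the full-state observation with a per-agent, possibly noisy, view of the state.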



2.1 PRIOR WORK

Learning Stackelberg Equilibria. Most prior work on Stackelberg equilibria focuses on single-shot settings such as normal-form games, a significantly simpler setting than Markov games, often motivated by security games. A broad line of work focuses on computing Stackelberg equilibria, such as Paruchuri et al. (2008); Xu et al. (2014); Blum et al. (2014); Li et al. (2022). Among the first works on learning Stackelberg equilibria was Letchford et al. (2009), who focus on Bayesian games. Peng et al. (2019) give results for matrix games with sample access only. Wang et al. (2022) show an approach differentiating through the so-called KKT conditions, again for normal-form games. Bai et al. (2021) give lower and upper bounds on learning Stackelberg equilibria in general-sum games, including "bandit RL" games with one step for the leader and sequential decision-making for the followers. Few works in this area consider Markov games: Zhong et al. (2021) show algorithms that find Stackelberg equilibria in Markov games, but assume myopic followers, a significant limitation compared to the general case. Brero et al. (2021a;b; 2022) use an inner-outer-loop approach, which they call the Stackelberg POMDP, primarily aimed at indirect mechanism design.

Mechanism Design. One of the first works specifically discussing Stackelberg equilibria in a learning context is Swamy (2007), who designs interventions in traffic patterns. More recently, several strands of work have focused on using multi-agent RL techniques to learn optimal mechanism design, often framing this as a bi-level or inner-outer-loop optimization problem. Zheng et al. (2022) use a bi-level RL approach to design optimal tax policies in a simulated world, but without provable Stackelberg properties. Yang et al. (2022) use a meta-gradient approach in a specific incentive design setting. Shu & Tian (2018) and Shi et al. (2019) learn leader policies in a similar "Stackelberg Markov games" setting. Both use a form of modeling other agents coupled with rule-based followers. Balaguer et al. (2022) use an inner-loop/outer-loop gradient descent approach for mechanism design on iterated matrix games (which we also use as an experimental testbed). They mainly focus on the case where both the environment transition and the follower learning behavior are differentiable, and otherwise fall back to evolutionary strategies for the leader. Interestingly, none of these recent works explicitly mention Stackelberg equilibria. As a direct corollary of our own work, we show that Balaguer et al. (2022) and Zheng et al. (2022) may not yield Stackelberg equilibria, but Balaguer et al. (2022) could do so with minor modifications.

