EFFICIENTLY COMPUTING NASH EQUILIBRIA IN ADVERSARIAL TEAM MARKOV GAMES

Abstract

Computing Nash equilibrium policies is a central problem in multi-agent reinforcement learning that has received extensive attention both in theory and in practice. However, in light of computational intractability barriers in general-sum games, provable guarantees have been thus far either limited to fully competitive or cooperative scenarios, or impose strong assumptions that are difficult to meet in most practical applications. In this work, we depart from those prior results by investigating infinite-horizon adversarial team Markov games, a natural and well-motivated class of games in which a team of identically-interested players, in the absence of any explicit coordination or communication, is competing against an adversarial player. This setting allows for a unifying treatment of zero-sum Markov games and Markov potential games, and serves as a step to model more realistic strategic interactions that feature both competing and cooperative interests. Our main contribution is the first algorithm for computing stationary ϵ-approximate Nash equilibria in adversarial team Markov games with computational complexity that is polynomial in all the natural parameters of the game, as well as 1/ϵ. The proposed algorithm is based on performing independent policy gradient steps for each player in the team, in tandem with best responses from the side of the adversary; in turn, the policy for the adversary is then obtained by solving a carefully constructed linear program. Our analysis leverages non-standard techniques to establish the KKT optimality conditions for a nonlinear program with nonconvex constraints, thereby leading to a natural interpretation of the induced Lagrange multipliers.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) offers a principled framework for analyzing competitive interactions in dynamic and stateful environments in which agents' actions affect both the state of the world and the rewards of the other players. Strategic reasoning in such complex multi-agent settings has been guided by game-theoretic principles, leading to many recent landmark results in benchmark domains in AI (Bowling et al., 2015; Silver et al., 2017; Vinyals et al., 2019; Moravčík et al., 2017; Brown & Sandholm, 2019; 2018; Brown et al., 2020; Perolat et al., 2022). Most of these remarkable advances rely on scalable and decentralized algorithms for computing Nash equilibria (Nash, 1951), a standard game-theoretic notion of rationality, in two-player zero-sum games.

Nevertheless, while single-agent RL has enjoyed rapid theoretical progress over the last few years (e.g., see (Jin et al., 2018; Agarwal et al., 2020; Li et al., 2021; Luo et al., 2019; Sidford et al., 2018), and references therein), a comprehensive understanding of the multi-agent landscape still remains elusive. Indeed, provable guarantees for efficiently computing Nash equilibria have been thus far limited to either fully competitive settings, such as two-player zero-sum games (Daskalakis et al., 2020; Wei et al., 2021; Sayin et al., 2021; Cen et al., 2021; Sayin et al., 2020; Condon, 1993), or environments in which agents are striving to coordinate towards a common global objective (Claus & Boutilier, 1998; Wang & Sandholm, 2002; Leonardos et al., 2021; Ding et al., 2022; Zhang et al., 2021b; Chen et al., 2022; Maheshwari et al., 2022; Fox et al., 2022).

However, many real-world applications feature both shared and competing interests between the agents. Efficient algorithms for computing Nash equilibria in such settings are much more scarce, and typically impose restrictive assumptions that are difficult to meet in most applications (Hu & Wellman, 2003; Bowling, 2000). In fact, even in stateless two-player (normal-form) games, computing approximate Nash equilibria is computationally intractable (Daskalakis et al., 2009; Rubinstein, 2017; Chen et al., 2009; Etessami & Yannakakis, 2010), subject to well-believed complexity-theoretic assumptions. As a result, it is common to investigate equilibrium concepts that are more permissive than Nash equilibria, such as coarse correlated equilibria (CCE) (Aumann, 1974; Moulin & Vial, 1978). Unfortunately, recent work has established strong lower bounds for computing even approximate (stationary) CCEs in turn-based stochastic two-player games (Daskalakis et al., 2022; Jin et al., 2022). Those negative results raise a central question:

Are there natural multi-agent environments incorporating both competing and shared interests for which we can establish efficient algorithms for computing (stationary) Nash equilibria? (⋆)

Our work makes concrete progress in this fundamental direction. Specifically, we establish the first efficient algorithm leading to Nash equilibria in adversarial team Markov games, a well-motivated and natural multi-agent setting in which a team of agents with a common objective is facing a competing adversary.

1.1. OUR RESULTS

Before we state our main result, let us first briefly introduce the setting of adversarial team Markov games; a more precise description is deferred to Section 2.1. To address Question (⋆), we study an infinite-horizon Markov (stochastic) game with a finite state space S in which a team of agents N_A := [n] with a common objective function is competing against a single adversary with opposing interests. Every agent k ∈ [n] has a (finite) set of available actions A_k, while B represents the adversary's set of actions. We will also let γ ∈ [0, 1) be the discounting factor. Our goal will be to compute an (approximate) Nash equilibrium; that is, a strategy profile such that no player can improve via a unilateral deviation (see Definition 2.1). In this context, our main contribution is the first polynomial-time algorithm for computing Nash equilibria in adversarial team Markov games:

Theorem 1.1 (Informal). There is an algorithm (IPGMAX) that, for any ϵ > 0, computes an ϵ-approximate stationary Nash equilibrium in adversarial team Markov games, and runs in time poly(|S|, Σ_{k=1}^n |A_k| + |B|, 1/(1 − γ), 1/ϵ).

A few remarks are in order. First, our guarantee significantly extends and unifies prior results that only applied to either two-player zero-sum Markov games or to Markov potential games; both of those settings can be cast as special cases of adversarial team Markov games (see Section 2.3). Further, the complexity of our algorithm, specified in Theorem 1.1, scales only with Σ_{k∈N_A} |A_k| instead of Π_{k∈N_A} |A_k|, bypassing what is often referred to as the curse of multi-agents (Jin et al., 2021). Indeed, viewing the team as a single "meta-player" would induce an action space of size Π_{k∈N_A} |A_k|, which is exponential in n even if each agent in the team has only two actions. In fact, our algorithm operates without requiring any (explicit) form of coordination or communication between the members of the team (beyond the structure of the game), a feature that has been motivated in practical applications (von Stengel & Koller, 1997); namely, scenarios in which communication or coordination between the members of the team is either overly expensive or even infeasible. For an in-depth discussion regarding this point we refer to (Schulman & Vazirani, 2017).
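To make the sum-versus-product distinction above concrete, the following is a minimal Python sketch, with hypothetical names and toy parameter values that are not taken from the paper, of a container for the quantities appearing in Theorem 1.1; it contrasts the per-agent scaling Σ_k |A_k| + |B| with the exponential size of the induced meta-player action space.

from dataclasses import dataclass
from math import prod

@dataclass
class AdversarialTeamMarkovGame:
    num_states: int                # |S|
    team_action_sizes: list[int]   # |A_k| for each team member k in N_A
    adversary_actions: int         # |B|
    gamma: float                   # discount factor, in [0, 1)

    def per_agent_scale(self) -> int:
        # The action-space quantity that Theorem 1.1's running time
        # depends on polynomially.
        return sum(self.team_action_sizes) + self.adversary_actions

    def meta_player_scale(self) -> int:
        # Size of the joint team action space under the "meta-player" view.
        return prod(self.team_action_sizes)

game = AdversarialTeamMarkovGame(num_states=5,
                                 team_action_sizes=[2] * 10,
                                 adversary_actions=3,
                                 gamma=0.99)
print(game.per_agent_scale())    # 23: sum of action-set sizes, linear in n
print(game.meta_player_scale())  # 1024 = 2^10: exponential in n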

1.2. OVERVIEW OF TECHNIQUES

To establish Theorem 1.1, we propose a natural and decentralized algorithm we refer to as Independent Policy GradientMax (IPGMAX). IPGMAX works in turns. First, each player in the team performs one independent policy gradient step on their value function with an appropriately selected
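As a rough, hedged illustration of this turn structure, the sketch below instantiates it in the simplest possible setting: a stateless game in which a two-member team with a shared payoff tensor performs independent projected policy-gradient steps while the adversary best-responds at every turn. The payoff tensor, the step size eta, and the horizon T are illustrative assumptions; the paper's actual algorithm operates on discounted Markov games and extracts the adversary's equilibrium policy from a separate linear program.

import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.max(np.nonzero(u - css / idx > 0)[0])
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

rng = np.random.default_rng(0)
R = rng.standard_normal((3, 3, 3))  # shared team payoff R[a1, a2, b]; the adversary receives -R
x = np.ones(3) / 3                  # team member 1's mixed strategy
y = np.ones(3) / 3                  # team member 2's mixed strategy
eta, T = 0.1, 500                   # illustrative step size and number of turns

for _ in range(T):
    # The adversary best-responds to the current team strategies (a pure
    # best response always exists in this stateless toy setting).
    b = int(np.argmin(np.einsum('i,j,ijb->b', x, y, R)))
    # Each team member takes one independent projected gradient step on
    # the shared payoff, holding the other member's strategy fixed.
    gx = np.einsum('j,ij->i', y, R[:, :, b])
    gy = np.einsum('i,ij->j', x, R[:, :, b])
    x = project_simplex(x + eta * gx)
    y = project_simplex(y + eta * gy)

print("team strategies after T turns:", x, y)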

