EFFICIENTLY COMPUTING NASH EQUILIBRIA IN ADVERSARIAL TEAM MARKOV GAMES

Abstract

Computing Nash equilibrium policies is a central problem in multi-agent reinforcement learning that has received extensive attention both in theory and in practice. However, in light of computational intractability barriers in general-sum games, provable guarantees have been thus far either limited to fully competitive or cooperative scenarios, or impose strong assumptions that are difficult to meet in most practical applications. In this work, we depart from those prior results by investigating infinite-horizon adversarial team Markov games, a natural and well-motivated class of games in which a team of identically-interested players-in the absence of any explicit coordination or communication-is competing against an adversarial player. This setting allows for a unifying treatment of zero-sum Markov games and Markov potential games, and serves as a step to model more realistic strategic interactions that feature both competing and cooperative interests. Our main contribution is the first algorithm for computing stationary ϵ-approximate Nash equilibria in adversarial team Markov games with computational complexity that is polynomial in all the natural parameters of the game, as well as 1/ϵ. The proposed algorithm is based on performing independent policy gradient steps for each player in the team, in tandem with best responses from the side of the adversary; in turn, the policy for the adversary is then obtained by solving a carefully constructed linear program. Our analysis leverages non-standard techniques to establish the KKT optimality conditions for a nonlinear program with nonconvex constraints, thereby leading to a natural interpretation of the induced Lagrange multipliers.

1. INTRODUCTION

Multi-agent reinforcement learning (MARL) offers a principled framework for analyzing competitive interactions in dynamic and stateful environments in which agents' actions affect both the state of the world and the rewards of the other players. Strategic reasoning in such complex multi-agent settings has been guided by game-theoretic principles, leading to many recent landmark results in benchmark domains in AI (Bowling et al., 2015; Silver et al., 2017; Vinyals et al., 2019; Moravčík et al., 2017; Brown & Sandholm, 2019; 2018; Brown et al., 2020; Perolat et al., 2022). Most of these remarkable advances rely on scalable and decentralized algorithms for computing Nash equilibria (Nash, 1951)-a standard game-theoretic notion of rationality-in two-player zero-sum games. Nevertheless, while single-agent RL has enjoyed rapid theoretical progress over the last few years (e.g., see (Jin et al., 2018; Agarwal et al., 2020; Li et al., 2021; Luo et al., 2019; Sidford et al., 2018), and references therein), a comprehensive understanding of the multi-agent landscape still remains elusive. Indeed, provable guarantees for efficiently computing Nash equilibria have been thus far limited to either fully competitive settings, such as two-player zero-sum games (Daskalakis et al., 2020; Wei et al., 2021; Sayin et al., 2021; Cen et al., 2021; Sayin et al., 2020; Condon, 1993), or environments in which agents are striving to coordinate towards a common global objective (Claus & Boutilier, 1998; Wang & Sandholm, 2002; Leonardos et al., 2021; Ding et al., 2022; Zhang et al., 2021b; Chen et al., 2022; Maheshwari et al., 2022; Fox et al., 2022). However, many real-world applications feature both shared and competing interests between the agents. Efficient algorithms for computing Nash equilibria in such settings are much more scarce, and typically impose restrictive assumptions that are difficult to meet in most applications (Hu & Wellman, 2003; Bowling, 2000).
In fact, even in stateless two-player (normal-form) games, computing approximate Nash equilibria is computationally intractable (Daskalakis et al., 2009; Rubinstein, 2017; Chen et al., 2009; Etessami & Yannakakis, 2010)-subject to widely believed complexity-theoretic assumptions. As a result, it is common to investigate equilibrium concepts that are more permissive than Nash equilibria, such as coarse correlated equilibria (CCE) (Aumann, 1974; Moulin & Vial, 1978). Unfortunately, recent work has established strong lower bounds for computing even approximate (stationary) CCEs in turn-based stochastic two-player games (Daskalakis et al., 2022; Jin et al., 2022). Those negative results raise a central question:

Are there natural multi-agent environments incorporating both competing and shared interests for which we can establish efficient algorithms for computing (stationary) Nash equilibria? (⋆)

Our work makes concrete progress in this fundamental direction. Specifically, we establish the first efficient algorithm leading to Nash equilibria in adversarial team Markov games, a well-motivated and natural multi-agent setting in which a team of agents with a common objective is facing a competing adversary.

1.1. OUR RESULTS

Before we state our main result, let us first briefly introduce the setting of adversarial team Markov games; a more precise description is deferred to Section 2.1. To address Question (⋆), we study an infinite-horizon Markov (stochastic) game with a finite state space S in which a team of agents N_A := [n] with a common objective function is competing against a single adversary with opposing interests. Every agent k ∈ [n] has a (finite) set of available actions A_k, while B represents the adversary's set of actions. We will also let γ ∈ [0, 1) be the discounting factor. Our goal will be to compute an (approximate) Nash equilibrium; that is, a strategy profile so that no player can improve via a unilateral deviation (see Definition 2.1). In this context, our main contribution is the first polynomial-time algorithm for computing Nash equilibria in adversarial team Markov games:

Theorem 1.1 (Informal). There is an algorithm (IPGMAX) that, for any ϵ > 0, computes an ϵ-approximate stationary Nash equilibrium in adversarial team Markov games, and runs in time poly(|S|, Σ_{k=1}^n |A_k| + |B|, 1/(1 − γ), 1/ϵ).

A few remarks are in order. First, our guarantee significantly extends and unifies prior results that only applied to either two-player zero-sum Markov games or to Markov potential games; both of those settings can be cast as special cases of adversarial team Markov games (see Section 2.3). Further, the complexity of our algorithm, specified in Theorem 1.1, scales only with Σ_{k∈N_A} |A_k| instead of Π_{k∈N_A} |A_k|, bypassing what is often referred to as the curse of multi-agents (Jin et al., 2021). Indeed, viewing the team as a single "meta-player" would induce an action space of size Π_{k∈N_A} |A_k|, which is exponential in n even if each agent in the team has only two actions.
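To make the "curse of multi-agents" concrete, the following minimal sketch contrasts the additive dependence Σ_{k} |A_k| + |B| achieved by our algorithm with the multiplicative blow-up incurred by treating the team as one meta-player. The sizes below (10 team members with 2 actions each, 4 adversary actions) are illustrative assumptions, not parameters from the paper.

```python
# Additive vs. multiplicative scaling in the team's action spaces.
n_agents = 10
action_sizes = [2] * n_agents  # |A_k| = 2 for every team member (assumed)
adversary_actions = 4          # |B| (assumed)

# What Theorem 1.1 scales with: the SUM of the action-set sizes.
additive = sum(action_sizes) + adversary_actions

# What a "meta-player" reduction scales with: the PRODUCT of the sizes,
# i.e., the joint team action space, exponential in n.
multiplicative = 1
for a in action_sizes:
    multiplicative *= a

print(additive)        # 24
print(multiplicative)  # 1024, i.e., 2^10
```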
In fact, our algorithm operates without requiring any (explicit) form of coordination or communication between the members of the team (beyond the structure of the game), a feature that has been motivated in practical applications (von Stengel & Koller, 1997), namely, scenarios in which communication or coordination between the members of the team is either overly expensive or even infeasible; for an in-depth discussion of this point we refer to (Schulman & Vazirani, 2017).

1.2. OVERVIEW OF TECHNIQUES

To establish Theorem 1.1, we propose a natural and decentralized algorithm we refer to as Independent Policy GradientMax (IPGMAX). IPGMAX works in turns. First, each player in the team performs one independent policy gradient step on their value function with an appropriately selected learning rate η > 0. In turn, the adversary best responds to the current policy of the team. This exchange is repeated for a sufficiently large number of iterations T. Finally, IPGMAX includes an auxiliary subroutine, namely AdvNashPolicy(), which computes the Nash policy of the adversary; this will be justified by Proposition 1.1, described below.

Our analysis builds on the techniques of Lin et al. (2020)-developed for the saddle-point problem min_{x∈X} max_{y∈Y} f(x, y)-for characterizing GDMAX. Specifically, GDMAX consists of performing gradient descent steps on the function ϕ(x) := max_{y∈Y} f(x, y). Lin et al. (2020) showed that GDMAX converges to a point (x̂, y*(x̂)) such that x̂ is an approximate first-order stationary point of the Moreau envelope (see Definition 3.1) of ϕ(x), while y*(x̂) is a best response to x̂. Now if f(x, ·) is strongly concave, one can show (by Danskin's theorem) that (x̂, y*(x̂)) is an approximate first-order stationary point of f. However, our setting introduces further challenges since the value function V_ρ(π_team, π_adv) is nonconvex-nonconcave. For this reason, we take a more refined approach. We first show in Proposition 3.1 that IPGMAX is guaranteed to converge to a policy profile (π̂_team, ·) such that π̂_team is an ϵ-nearly stationary point of max_{π_adv} V_ρ(π_team, π_adv). Then, the next key step and the crux of the analysis is to show that π̂_team can be extended to an O(ϵ)-approximate Nash equilibrium policy:

Proposition 1.1 (Informal). If π̂_team is an ϵ-nearly stationary point of max_{π_adv} V_ρ(π_team, π_adv), there exists a policy for the adversary π̂_adv so that (π̂_team, π̂_adv) is an O(ϵ)-approximate Nash equilibrium.
In the special case of normal-form games, a similar extension theorem was recently obtained by Anagnostides et al. (2023). In particular, that result was derived by employing fairly standard linear programming techniques. In contrast, our more general setting introduces several new challenges, not least due to the nonconvexity-nonconcavity of the objective function. Indeed, our analysis leverages more refined techniques stemming from nonlinear programming. More precisely, while we make use of standard policy gradient properties, similar to the single-agent MDP setting (Agarwal et al., 2021; Xiao, 2022), our analysis does not rely on the so-called gradient-dominance property (Bhandari & Russo, 2019), as that property does not hold in a team-wise sense. Instead, inspired by an alternative proof of Shapley's theorem (Shapley, 1953) for two-person zero-sum Markov games (Filar & Vrieze, 2012, Chapter 3), we employ mathematical programming. One of the central challenges is that the induced nonlinear program has a set of nonconvex constraints. As such, even the existence of (nonnegative) Lagrange multipliers satisfying the KKT conditions is not guaranteed, thereby necessitating more refined analysis techniques. To this end, we employ the Arrow-Hurwicz-Uzawa constraint qualification (Theorem A.1) in order to establish that the local optima are contained in the set of KKT points (Corollary B.1). Then, we leverage the structure of adversarial team Markov games to characterize the induced Lagrange multipliers, showing that a subset of these can be used to establish Proposition 1.1; incidentally, this also leads to an efficient algorithm for computing a (near-)optimal policy of the adversary. Finally, we also remark that controlling the approximation error-an inherent barrier under policy gradient methods-in Proposition 1.1 turns out to be challenging. We bypass this issue by constructing "relaxed" programs that incorporate some imprecision in the constraints.
A more detailed overview of our algorithm and the analysis is given in Section 3.

2. PRELIMINARIES

In this section, we introduce the relevant background and our notation. Section 2.1 describes adversarial team Markov games. Section 2.2 then defines some key concepts from multi-agent MDPs, while Section 2.3 describes a generalization of adversarial team Markov games, beyond identically-interested team players, allowing for a richer structure in the utilities of the team-namely, adversarial Markov potential games.

Notation. We let [n] := {1, . . . , n}. We use superscripts to denote the (discrete) time index, and subscripts to index the players. We use boldface for vectors and matrices; scalars will be denoted by lightface variables. We denote by ∥ · ∥ := ∥ · ∥_2 the Euclidean norm. For simplicity in the exposition, we may sometimes use the O(·) notation to suppress dependencies that are polynomial in the natural parameters of the game; precise statements are given in the Appendix. For the convenience of the reader, a comprehensive overview of our notation is given in Appendix A.3.

2.1. ADVERSARIAL TEAM MARKOV GAMES

An adversarial team Markov game (or an adversarial team stochastic game) is the Markov game extension of static, normal-form adversarial team games (Von Stengel & Koller, 1997). The game is assumed to take place in an infinite-horizon discounted setting in which a team of identically-interested agents gain what the adversary loses. Formally, the game G is represented by a tuple G = (S, N, A, B, r, P, γ, ρ) whose components are defined as follows.

• S is a finite and nonempty set of states, with cardinality S := |S|;
• N is the set of players, partitioned into a set of n team agents N_A := [n] and a single adversary;
• A_k is the action space of each player in the team k ∈ [n], so that A := ×_{k∈[n]} A_k, while B is the action space of the adversary. We also let A_k := |A_k| and B := |B|;
• r : S × A × B → (0, 1) is the (deterministic) instantaneous reward function representing the (normalized) payoff of the adversary, so that for any (s, a, b) ∈ S × A × B,
r(s, a, b) + Σ_{k=1}^n r_k(s, a, b) = 0, (1)
and for any k ∈ [n],
r_k(s, a, b) = r_team(s, a, b); (2)
• P : S × A × B → ∆(S) is the transition probability function, so that P(s′|s, a, b) denotes the probability of transitioning to state s′ ∈ S when the current state is s ∈ S under the action profile (a, b) ∈ A × B;
• γ ∈ [0, 1) is the discount factor; and
• ρ ∈ ∆(S) is the initial state distribution over the state space. We will assume that ρ is full-support, meaning that ρ(s) > 0 for all s ∈ S.

In other words, an adversarial team Markov game is a subclass of general-sum infinite-horizon multi-agent discounted MDPs under the restriction that all but a single (adversarial) player have identical interests (see (2)), and the game is globally zero-sum-in the sense of (1). As we point out in Section 2.3, (2) can be relaxed in order to capture (adversarial) Markov potential games (Definition 2.2), without qualitatively altering our results.
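The tuple G above can be summarized in code. The following is a minimal, hedged container mirroring G = (S, N, A, B, r, P, γ, ρ); the field names and array shapes (joint team actions flattened into one axis) are our own illustrative choices, not code from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AdversarialTeamMarkovGame:
    team_action_sizes: list   # [A_1, ..., A_n] for the n team members
    reward: np.ndarray        # r[s, a, b] in (0, 1): the adversary's payoff
    transition: np.ndarray    # P[s, a, b, s']: next-state distribution
    gamma: float              # discount factor γ ∈ [0, 1)
    rho: np.ndarray           # initial state distribution, full support

    def validate(self) -> bool:
        # Check the structural requirements listed in Section 2.1.
        ok = np.all(self.rho > 0) and np.isclose(self.rho.sum(), 1.0)
        ok &= np.all((self.reward > 0) & (self.reward < 1))
        ok &= np.allclose(self.transition.sum(axis=-1), 1.0)
        ok &= 0.0 <= self.gamma < 1.0
        return bool(ok)

# A toy 2-state instance: two binary-action team members (joint size 4), |B| = 2.
toy = AdversarialTeamMarkovGame(
    team_action_sizes=[2, 2],
    reward=np.full((2, 4, 2), 0.5),
    transition=np.full((2, 4, 2, 2), 0.5),
    gamma=0.9,
    rho=np.array([0.5, 0.5]),
)
```

Note that only the adversary's reward r is stored: by (1) and (2), each team member's reward is determined as r_k = −r/n in the identically-interested case.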

2.2. POLICIES, VALUE FUNCTION, AND NASH EQUILIBRIA

Policies. A stationary-that is, time-invariant-policy π_k for an agent k is a function mapping a given state to a distribution over available actions, π_k : S ∋ s ↦ π_k(·|s) ∈ ∆(A_k). We will say that π_k is deterministic if for every state there is some action that is selected with probability 1 under policy π_k. For convenience, we will let Π_team : S → ∆(A) and Π_adv : S → ∆(B) denote the policy space for the team and the adversary, respectively. We may also write Π : S → ∆(A) × ∆(B) to denote the joint policy space of all agents.

Direct Parametrization. Throughout this paper we will assume that players employ direct policy parametrization. That is, for each player k ∈ [n], we let X_k := ∆(A_k)^S and π_k = x_k so that x_{k,s,a} = π_k(a|s). Similarly, for the adversary, we let Y := ∆(B)^S and π_adv = y so that y_{s,b} = π_adv(b|s). (Extending our results to other policy parameterizations, such as soft-max (Agarwal et al., 2021), is left for future work.)

Value Function. The value function V_s : Π → R is defined as the expected cumulative discounted reward received by the adversary under the joint policy (π_team, π_adv) ∈ Π and the initial state s ∈ S, where π_team := (π_1, . . . , π_n). In symbols,
V_s(π_team, π_adv) := E_{(π_team, π_adv)} [ Σ_{t=0}^∞ γ^t r(s^{(t)}, a^{(t)}, b^{(t)}) | s^{(0)} = s ],
where the expectation is taken over the trajectory distribution induced by π_team and π_adv. When the initial state is drawn from a distribution ρ, the value function takes the form V_ρ(π_team, π_adv) := E_{s∼ρ} [V_s(π_team, π_adv)].

Nash Equilibrium. Our main goal is to compute a joint policy profile that is an (approximate) Nash equilibrium, a standard equilibrium concept in game theory formalized below. Definition 2.1 (Nash equilibrium).
A joint policy profile (π⋆_team, π⋆_adv) ∈ Π is an ϵ-approximate Nash equilibrium, for ϵ ≥ 0, if
V_ρ(π⋆_team, π⋆_adv) ≤ V_ρ((π′_k, π⋆_{−k}), π⋆_adv) + ϵ, ∀k ∈ [n], ∀π′_k ∈ Π_k; and
V_ρ(π⋆_team, π⋆_adv) ≥ V_ρ(π⋆_team, π′_adv) − ϵ, ∀π′_adv ∈ Π_adv.
That is, a joint policy profile is an (approximate) Nash equilibrium if no unilateral deviation from a player can result in a non-negligible-more than additive ϵ-improvement for that player. Nash equilibria always exist in multi-agent stochastic games (Fink, 1964); our main result implies an (efficient) constructive proof of that fact for the special case of adversarial team Markov games.
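For a fixed stationary joint policy, the value function defined above is the solution of a linear Bellman system, V_π = r_π + γ P_π V_π, and can therefore be computed by a single linear solve. The following hedged sketch illustrates this on a tiny randomly generated instance (2 states, 3 joint actions, all numbers illustrative assumptions).

```python
import numpy as np

# For a FIXED stationary joint policy π, the discounted value V_s(π_team, π_adv)
# satisfies V = r_π + γ P_π V, a linear system in V.
rng = np.random.default_rng(0)
S, AB = 2, 3                                   # states; joint (team, adversary) actions
P = rng.dirichlet(np.ones(S), size=(S, AB))    # P[s, ab, s'], rows sum to 1
r = rng.uniform(0.1, 0.9, size=(S, AB))        # adversary's reward r(s, a, b)
pi = rng.dirichlet(np.ones(AB), size=S)        # joint policy π(ab | s)
gamma = 0.9

r_pi = (pi * r).sum(axis=1)                    # expected one-step reward per state
P_pi = np.einsum('sa,sat->st', pi, P)          # policy-induced transition matrix
V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

rho = np.array([0.5, 0.5])                     # full-support initial distribution
V_rho = rho @ V                                # V_ρ = E_{s~ρ} V_s
```

Since rewards lie in (0, 1), the value is bounded by 1/(1 − γ), which is why that quantity appears throughout the complexity bounds.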

2.3. ADVERSARIAL MARKOV POTENTIAL GAMES

A recent line of work has extended the fundamental class of potential normal-form games (Monderer & Shapley, 1996) to Markov potential games (Marden, 2012; Macua et al., 2018; Leonardos et al., 2021; Ding et al., 2022; Zhang et al., 2021b; Chen et al., 2022; Maheshwari et al., 2022; Fox et al., 2022). Importantly, our results readily carry over even if players in the team are not necessarily identically interested, but instead, there is some underlying potential function for the team; we will refer to such games as adversarial Markov potential games, formally introduced below.

Definition 2.2. An adversarial Markov potential game G = (S, N, A, B, {r_k}_{k∈[n]}, P, γ, ρ) is a multi-agent discounted MDP that shares all the properties of adversarial team Markov games (Section 2.1), with the exception that (2) is relaxed in that there exists a potential function Φ_s, for every s ∈ S, such that for any π_adv ∈ Π_adv,
Φ_s(π_k, π_{−k}; π_adv) − Φ_s(π′_k, π_{−k}; π_adv) = V_{k,s}(π_k, π_{−k}; π_adv) − V_{k,s}(π′_k, π_{−k}; π_adv),
for every agent k ∈ [n], every state s ∈ S, and all policies π_k, π′_k ∈ Π_k and π_{−k} ∈ Π_{−k}.

3. MAIN RESULT

In this section, we sketch the main pieces required in the proof of our main result, Theorem 1.1. We begin by describing our algorithm in Section 3.1. Next, in Section 3.2, we characterize the strategy x̂ ∈ X for the team returned by IPGMAX, while Section 3.3 completes the proof by establishing that x̂ can be efficiently extended to an approximate Nash equilibrium. The formal proof of Theorem 1.1 is deferred to the Appendix.

3.1. OUR ALGORITHM

In this subsection, we describe in detail our algorithm for computing ϵ-approximate Nash equilibria in adversarial team Markov games, IPGMAX (Algorithm 1). IPGMAX takes as input a precision parameter ϵ > 0 (Line 1) and an initial strategy for the team x^(0) = (x^(0)_1, . . . , x^(0)_n) ∈ X := ×_{k=1}^n X_k (Line 2). The algorithm then proceeds in two phases:

• In the first phase, the team players perform independent policy gradient steps (Line 7) with learning rate η, as defined in Line 3, while the adversary best responds to their joint strategy (Line 6). Both of these steps can be performed in polynomial time under oracle access to the game (see Remark 2). This process is repeated for T iterations, with T as defined in Line 4. We note that Proj(·) in Line 7 stands for the Euclidean projection, ensuring that each player selects a valid strategy. The first phase is completed in Line 9, where we set x̂ according to the iterate at time t⋆, for some 0 ≤ t⋆ ≤ T − 1. As we explain in Section 3.2, selecting t⋆ uniformly at random is a practical and theoretically sound choice.
• In the second phase, we fix the strategy of the team x̂ ∈ X, and the main goal is to determine a strategy ŷ ∈ Y so that (x̂, ŷ) is an O(ϵ)-approximate Nash equilibrium. This is accomplished in the subroutine AdvNashPolicy(x̂), which consists of solving a linear program-from the perspective of the adversary-that has polynomial size. Our analysis of the second phase of IPGMAX can be found in Section 3.3.

It is worth stressing that under gradient feedback, IPGMAX requires no communication or coordination between the players in the team.

Algorithm 1 Independent Policy GradientMax (IPGMAX)
1: Precision ϵ > 0
2: Initial strategy x^(0) ∈ X
3: Learning rate η := ϵ²(1 − γ)⁹ / (32 S⁴ D² (Σ_{k=1}^n A_k + B)³)
4: Number of iterations T := 512 S⁸ D⁴ (Σ_{k=1}^n A_k + B)⁴ / (ϵ⁴ (1 − γ)¹²)
5: for t ← 1, 2, . . . , T do
6:     y^(t) ← arg max_{y∈Y} V_ρ(x^(t−1), y)
7:     x_k^(t) ← Proj_{X_k}(x_k^(t−1) − η ∇_{x_k} V_ρ(x^(t−1), y^(t)))    ▷ for all agents k ∈ [n]
8: end for
9: x̂ ← x^(t⋆)
10: ŷ ← AdvNashPolicy(x̂)    ▷ defined in Algorithm 2
11: return (x̂, ŷ)
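The first phase of Algorithm 1 can be sketched on a stateless surrogate: two team members with a payoff tensor U[a1, a2, b] (the adversary's gain) standing in for V_ρ. The learning rate, iteration count, and payoffs below are illustrative assumptions, not the constants from Lines 3-4; the Euclidean projection of Line 7 is implemented explicitly.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (Line 7's Proj)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    k = np.max(idx[u - css / idx > 0])
    theta = css[k - 1] / k
    return np.maximum(v - theta, 0.0)

rng = np.random.default_rng(1)
U = rng.uniform(size=(2, 2, 3))   # adversary's payoff tensor (assumed)
x1 = np.ones(2) / 2               # team member 1's mixed strategy
x2 = np.ones(2) / 2               # team member 2's mixed strategy
eta, T = 0.1, 200                 # illustrative step size and horizon

for _ in range(T):
    # Adversary best responds to the current team strategy (cf. Line 6);
    # a pure best response suffices since the objective is linear in y.
    b = np.argmax(np.einsum('i,j,ijb->b', x1, x2, U))
    # Independent projected gradient steps for the team (cf. Line 7);
    # the team MINIMIZES the adversary's payoff, hence descent.
    g1 = np.einsum('j,ij->i', x2, U[:, :, b])
    g2 = np.einsum('i,ij->j', x1, U[:, :, b])
    x1 = project_simplex(x1 - eta * g1)
    x2 = project_simplex(x2 - eta * g2)
```

Each team member only needs its own gradient, matching the claim that no communication is required under gradient feedback.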

3.2. ANALYZING INDEPENDENT POLICY GRADIENTMAX

In this subsection, we establish that IPGMAX finds an ϵ-nearly stationary point x̂ of ϕ(x) := max_{y∈Y} V_ρ(x, y) in a number of iterations T that is polynomial in the natural parameters of the game, as well as 1/ϵ; this is formalized in Proposition 3.1. First, we note the by-now standard property that the value function V_ρ is L-Lipschitz continuous and ℓ-smooth, where L := √(Σ_{k=1}^n A_k + B)/(1 − γ)² and ℓ := 2(Σ_{k=1}^n A_k + B)/(1 − γ)³ (Lemma C.1). An important observation for the analysis is that IPGMAX is essentially performing gradient descent steps on ϕ(x). However, the challenge is that ϕ(x) is not necessarily differentiable; thus, our analysis relies on the Moreau envelope of ϕ, defined as follows.

Definition 3.1 (Moreau Envelope). Let ϕ(x) := max_{y∈Y} V_ρ(x, y). For any 0 < λ < 1/ℓ, the Moreau envelope ϕ_λ of ϕ is defined as
ϕ_λ(x) := min_{x′∈X} { ϕ(x′) + (1/(2λ)) ‖x − x′‖² }. (4)
We will let λ := 1/(2ℓ).

Crucially, the minimization problem in (4) is ℓ-strongly convex; this follows immediately from the fact that ϕ(x) is ℓ-weakly convex, in the sense that ϕ(x) + (ℓ/2)‖x‖² is convex (see Lemma A.1). A related notion that will be useful to measure the progress of IPGMAX is the proximal mapping of a function f, defined as prox_f : X ∋ x ↦ arg min_{x′∈X} { f(x′) + (1/2)‖x′ − x‖² }; the proximal point of ϕ/(2ℓ) is well-defined since ϕ is ℓ-weakly convex (Proposition A.1). We are now ready to state the convergence guarantee of IPGMAX.

Proposition 3.1. Consider any ϵ > 0. If η = 2ϵ²(1 − γ) and T = (1 − γ)⁴ / (8ϵ⁴(Σ_{k=1}^n A_k + B)²), there exists an iterate t⋆, with 0 ≤ t⋆ ≤ T − 1, such that ‖x^(t⋆) − x̄^(t⋆)‖ ≤ ϵ, where x̄^(t⋆) := prox_{ϕ/(2ℓ)}(x^(t⋆)).

The proof relies on the techniques of Lin et al. (2020), and it is deferred to Appendix C. The main takeaway is that O(1/ϵ⁴) iterations suffice in order to reach an ϵ-nearly stationary point of ϕ, in the sense that it is ϵ-far in ℓ₂ distance from its proximal point.
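Definition 3.1 can be illustrated numerically in one dimension. Below, ϕ(x) = |x| is a stand-in for the nonsmooth max_y V_ρ(x, y) (an illustrative assumption, not the paper's objective); its Moreau envelope is the classical Huber function, and the proximal point is soft-thresholding. We approximate the minimization in (4) on a grid.

```python
import numpy as np

lam = 0.5                              # envelope parameter λ (assumed, λ < 1/ℓ)
grid = np.linspace(-3.0, 3.0, 6001)    # discretization of the domain

def phi(x):
    return np.abs(x)                   # nonsmooth stand-in for max_y V_ρ(x, y)

def moreau_envelope(x):
    # ϕ_λ(x) = min_{x'} ϕ(x') + ||x - x'||² / (2λ), approximated on the grid.
    return float(np.min(phi(grid) + (grid - x) ** 2 / (2 * lam)))

def prox_point(x):
    # The minimizer itself; for ϕ = |·| this is soft-thresholding by λ.
    return float(grid[np.argmin(phi(grid) + (grid - x) ** 2 / (2 * lam))])
```

For instance, prox_point(2.0) ≈ 1.5 and moreau_envelope(2.0) ≈ 1.75, matching the closed forms max(|x| − λ, 0)·sign(x) and |x| − λ/2 for |x| ≥ λ. The distance ‖x − prox(x)‖, which Proposition 3.1 drives below ϵ, is exactly the stationarity measure used there.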
A delicate issue here is that Proposition 3.1 only provides a best-iterate guarantee, and identifying that iterate might introduce a substantial computational overhead. To address this, we also show in Corollary C.1 that by randomly selecting ⌈log(1/δ)⌉ iterates over the T repetitions of IPGMAX, we are guaranteed to recover an ϵ-nearly stationary point with probability at least 1 − δ, for any δ > 0.
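The boosting argument behind that corollary is a standard probability amplification: if a uniformly random iterate is ϵ-nearly stationary with probability at least 1/2 (the success rate assumed in this sketch; Corollary C.1 makes the precise rate explicit), then k = ⌈log₂(1/δ)⌉ independent draws all fail with probability at most 2^(−k) ≤ δ.

```python
import math

def num_draws(delta):
    # Number of independently sampled iterates, ⌈log2(1/δ)⌉ (base-2 assumed here).
    return math.ceil(math.log2(1.0 / delta))

def failure_prob(delta, p_success=0.5):
    # Probability that ALL sampled iterates miss an ϵ-nearly stationary point,
    # assuming each succeeds independently with probability p_success.
    return (1.0 - p_success) ** num_draws(delta)
```

For δ = 0.01 this takes 7 draws, with failure probability 2^(−7) ≈ 0.008 ≤ δ; checking which draw succeeded only requires evaluating the stationarity measure at ⌈log(1/δ)⌉ candidates rather than all T iterates.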

3.3. EFFICIENT EXTENSION TO NASH EQUILIBRIA

In this subsection, we establish that any ϵ-nearly stationary point x̂ of ϕ can be extended to an O(ϵ)-approximate Nash equilibrium (x̂, ŷ) for any adversarial team Markov game, where ŷ ∈ Y is the strategy for the adversary. Further, we show that ŷ can be computed in polynomial time through a carefully constructed linear program. This "extendibility" argument significantly extends a seminal characterization of Von Stengel & Koller (1997), and it is the crux of the analysis towards establishing our main result, Theorem 1.1. To this end, the techniques we leverage are more involved compared to (Von Stengel & Koller, 1997), and revolve around nonlinear programming. Specifically, in the spirit of (Filar & Vrieze, 2012, Chapter 3), the starting point of our argument is the following nonlinear program with variables (x, v) ∈ X × R^S:

(Q-NLP)  min Σ_{s∈S} ρ(s)v(s) + ℓ‖x − x̂‖²
s.t. r(s, x, b) + γ Σ_{s′∈S} P(s′|s, x, b)v(s′) ≤ v(s), ∀(s, b) ∈ S × B; (Q1)
x_{k,s}^⊤ 1 = 1, ∀(k, s) ∈ [n] × S; and (Q2)
x_{k,s,a} ≥ 0, ∀k ∈ [n], (s, a) ∈ S × A_k. (Q3)

Here, we have overloaded notation so that r(s, x, b) and P(s′|s, x, b) denote, respectively, the expected reward and transition probability when the team plays the mixed strategy induced by x at state s. Importantly, unlike standard MDP formulations, we have incorporated a quadratic regularizer in the objective function; this term ensures the following property.

Proposition 3.2. For any fixed x ∈ X, there is a unique optimal solution v⋆ to (Q-NLP). Further, if x̃ := prox_{ϕ/(2ℓ)}(x̂) and ṽ ∈ R^S is the corresponding optimal value vector, then (x̃, ṽ) is the global optimum of (Q-NLP).

The uniqueness of the associated value vector is a consequence of Bellman's optimality equation, while the optimality of the proximal point follows by realizing that (Q-NLP) is an equivalent formulation of the proximal mapping. These steps are formalized in Appendix B.2. Having established the optimality of (x̃, ṽ), the next step is to show the existence of nonnegative Lagrange multipliers satisfying the KKT conditions (recall Definition A.2); this is non-trivial due to the nonconvexity of the feasibility set of (Q-NLP).
To do so, we leverage the so-called Arrow-Hurwicz-Uzawa constraint qualification (Theorem A.1)-a form of "regularity condition" for a nonconvex program. Indeed, in Lemma B.3 we show that any feasible point of (Q-NLP) satisfies that constraint qualification, thereby implying the existence of nonnegative Lagrange multipliers satisfying the KKT conditions for any local optimum (Corollary B.1), and in particular for (x̃, ṽ):

Proposition 3.3. There exist nonnegative Lagrange multipliers satisfying the KKT conditions at (x̃, ṽ).

Now the upshot is that a subset of those Lagrange multipliers λ ∈ R^{S×B} can be used to establish the extendibility of x̂ to a Nash equilibrium. Indeed, our next step makes this explicit: We construct a linear program whose sole goal is to identify such multipliers, which in turn will allow us to efficiently compute an admissible strategy ŷ for the adversary. However, determining λ exactly seems too ambitious. For one, IPGMAX only granted us access to x̂, but not to x̃. On the other hand, the Lagrange multipliers λ are induced by (x̃, ṽ). To address this, the constraints of our linear program are phrased in terms of (x̂, v), instead of (x̃, ṽ), while to guarantee feasibility we appropriately relax all the constraints of the linear program; this relaxation does not introduce a large error since ‖x̂ − x̃‖ ≤ ϵ (Proposition 3.1), and the underlying constraint functions are Lipschitz continuous-with constants that depend favorably on the game G; we formalize this in Lemma B.4. This leads to our main theorem, summarized below (see Theorem B.1 for a precise statement).

Theorem 3.1. Let x̂ be an ϵ-nearly stationary point of ϕ.
There exists a linear program, (LP_adv), such that: (i) it has size that is polynomial in G, and all the coefficients depend on the (single-agent) MDP faced by the adversary when the team plays the fixed strategy x̂; and (ii) it is always feasible, and any solution induces a strategy ŷ such that (x̂, ŷ) is an O(ϵ)-approximate Nash equilibrium.

The proof of this theorem carefully leverages the structure of adversarial team Markov games, along with the KKT conditions we previously established in Proposition 3.3. The algorithm for computing the policy for the adversary is summarized in Algorithm 2 of Appendix B. A delicate issue with Theorem 3.1, and in particular with the solution of (LP_adv), is whether one can indeed efficiently simulate the environment faced by the adversary. Indeed, in the absence of any structure, determining the coefficients of the linear program could scale exponentially with the number of players; this is related to a well-known issue in computational game theory, revolving around the exponential blow-up of the input space as the number of players increases (Papadimitriou & Roughgarden, 2008). As is standard, we bypass this by assuming access to natural oracles that ensure we can efficiently simulate the environment faced by the adversary (Remark 2).
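To illustrate the "single-agent MDP faced by the adversary" once the team's strategy x̂ is fixed: the adversary's optimal value vector v⋆ is the optimum of the standard MDP linear program min_v Σ_s ρ(s)v(s) subject to v(s) ≥ r(s, b) + γ Σ_{s′} P(s′|s, b)v(s′). The sketch below is NOT the paper's (LP_adv), which additionally recovers Lagrange multipliers; purely for illustration, it computes the same v⋆ by value iteration on an assumed tiny instance, where r and P have already been averaged over x̂.

```python
import numpy as np

# The adversary's induced single-agent MDP (all numbers illustrative).
rng = np.random.default_rng(2)
S, B = 3, 2
r = rng.uniform(0.1, 0.9, size=(S, B))       # rewards after averaging over x̂
P = rng.dirichlet(np.ones(S), size=(S, B))   # P[s, b, s'], rows sum to 1
gamma = 0.9

# Value iteration converges to the unique fixed point of the Bellman
# optimality operator, which coincides with the LP optimum above.
v = np.zeros(S)
for _ in range(2000):
    v = np.max(r + gamma * np.einsum('sbt,t->sb', P, v), axis=1)

# A greedy (deterministic) policy with respect to v* is optimal.
greedy = np.argmax(r + gamma * np.einsum('sbt,t->sb', P, v), axis=1)
```

In the actual algorithm, the same MDP data feeds the coefficients of (LP_adv); the oracle assumption of Remark 2 is what makes computing these coefficients tractable as the number of team members grows.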

4. FURTHER RELATED WORK

In this section, we highlight certain key lines of work that relate to our results in the context of adversarial team Markov games. We stress that the related literature on multi-agent reinforcement learning (MARL) is too vast to even attempt to faithfully cover here. For some excellent recent overviews of the area, we refer the interested reader to (Yang & Wang, 2020; Zhang et al., 2021a) and the extensive lists of references therein.

Team Games. The study of team games has been a prolific topic of research in economic theory and group decision theory for many decades; see, e.g., (Marschak, 1955; Groves, 1973; Radner, 1962; Ho & Chu, 1972). A more modern key reference point to our work is the seminal paper of Von Stengel & Koller (1997) that introduced the notion of team-maxmin equilibrium (TME) in the context of normal-form games. A TME profile is a mixed strategy for each team member so that the minimal expected team payoff over all possible responses of the adversary-who potentially knows the play of the team-is the maximum possible. While TMEs enjoy a number of compelling properties, being the optimal equilibria for the team given the lack of coordination, they suffer from computational intractability even in 3-player team games (Hansen et al., 2008; Borgs et al., 2010). Nevertheless, practical algorithms have been recently proposed and studied for computing them in multiplayer games (Zhang & An, 2020a; b; Basilico et al., 2017). It is worth pointing out that team equilibria are also useful for extensive-form two-player zero-sum games where one of the players has imperfect recall (Piccione & Rubinstein, 1997). The intractability of TME has motivated the study of a relaxed equilibrium concept that incorporates a correlation device (Farina et al., 2018; Celli & Gatti, 2018; Basilico et al., 2017; Zhang & An, 2020b; Zhang & Sandholm, 2021; Zhang et al., 2022b; Carminati et al., 2022; Zhang et al., 2022a); namely, TMECor.
In TMECor, players are allowed to select correlated strategies. Despite the many compelling aspects of TMECor as a solution concept in team games, even ex ante coordination or correlated randomization-beyond the structure of the game itself-can be overly expensive or even infeasible in many applications (Von Stengel & Koller, 1997). Further, even TMECor is NP-hard to compute (in the worst case) for imperfect-information extensive-form games (EFGs) (Chu & Halpern, 2001), although fixed-parameter-tractable (FPT) algorithms have recently emerged for natural classes of EFGs (Zhang & Sandholm, 2021; Zhang et al., 2022b). On the other hand, the computational aspects of the standard Nash equilibrium (NE) in adversarial team games are not well understood, even in normal-form games. In fact, it is worth pointing out that Von Neumann's celebrated minimax theorem (von Neumann & Morgenstern, 2007) does not apply in team games, rendering traditional techniques employed in two-player zero-sum games of little use. Indeed, Schulman & Vazirani (2017) provided a precise characterization of the duality gap between the two teams based on the natural parameters of the problem, while Kalogiannis et al. (2021) showed that standard no-regret learning dynamics such as gradient descent and optimistic Hedge could fail to stabilize to mixed NE even in binary-action adversarial team games. Finally, we should also point out that although from a complexity-theoretic standpoint our main result (Theorem 1.1) establishes a fully polynomial time approximation scheme (FPTAS), since the dependence on the approximation error ϵ is poly(1/ϵ), an improvement to poly(log(1/ϵ)) is precluded even in normal-form games unless CLS ⊆ P (an unlikely event); this follows as adversarial team games capture potential games (Kalogiannis et al., 2021), wherein computing mixed Nash equilibria is known to be complete for the class CLS = PPAD ∩ PLS (Babichenko & Rubinstein, 2021).

Multi-agent RL.
Computing Nash equilibria has been a central endeavor in multi-agent RL. While some algorithms have been proposed, perhaps most notably the Nash-Q algorithm (Hu & Wellman, 1998; 2003) , convergence to Nash equilibria is only guaranteed under severe restrictions on the game. More broadly, the long-term behavior of independent policy gradient methods (Schulman et al., 2015) is still not well-understood. Before all else, from the impossibility result of Hart & Mas-Colell, universal convergence to Nash equilibria is precluded even for normal-form games; this is aligned with the computational intractability (PPAD-completeness) of Nash equilibria even in two-player general-sum games (Daskalakis et al., 2009; Chen et al., 2009) . Surprisingly, recent work has also established hardness results in turn-based stochastic games, rendering even the weaker notion of (stationary) CCEs intractable (Daskalakis et al., 2022; Jin et al., 2022) . As a result, the existing literature has inevitably focused on specific classes of games, such as Markov potential games (Leonardos et al., 2021; Ding et al., 2022; Zhang et al., 2021b; Chen et al., 2022; Maheshwari et al., 2022; Fox et al., 2022) or two-player zero-sum Markov games (Daskalakis et al., 2020; Wei et al., 2021; Sayin et al., 2021; Cen et al., 2021; Sayin et al., 2020) . As we pointed out earlier, adversarial Markov team games can unify and extend those settings (Section 2.3). More broadly, identifying multi-agent settings for which Nash equilibria are provably efficiently computable is recognized as an important open problem in the literature (see, e.g., (Daskalakis et al., 2020) ), boiling down to one of the main research question of this paper (Question (⋆)). We also remark that certain guarantees for convergence to Nash equilibria have been recently obtained in a class of symmetric games (Emmons et al., 2022) -including symmetric team games. 
Finally, weaker solution concepts, relaxing either the Markov property or stationarity, have also recently attracted attention (Daskalakis et al., 2022; Jin et al., 2021).

5. CONCLUSIONS

Our main contribution in this paper is the first polynomial-time algorithm for computing (stationary) ϵ-approximate Nash equilibria in adversarial team Markov games, an important class of games in which a team of uncoordinated but identically-interested players competes against an adversarial player. We argued that this setting serves as a step towards modeling more realistic multi-agent applications that feature both competing and cooperative interests. There are many interesting directions for future research. One caveat of our main algorithm (IPGMAX) is that it requires a separate subroutine for computing the optimal policy of the adversary. It is plausible that a carefully designed two-timescale policy gradient method could efficiently reach a Nash equilibrium, which would yield fully model-free algorithms for adversarial team Markov games by obviating the need to solve a linear program. Techniques from the literature on constrained MDPs (Ying et al., 2022) could also be useful for computing the policy of the adversary in a more scalable way. Furthermore, exploring solution concepts beyond Nash equilibria could also be a fruitful avenue for future work. Indeed, allowing some limited form of correlation between the players in the team could lead to more efficient algorithms; whether that form of coordination is justified depends largely on the application at hand. Finally, returning to Question (⋆), a more ambitious agenda revolves around understanding the fundamental structure of games for which computing Nash equilibria is provably computationally tractable.
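As a rough illustration of the gradient-based template discussed above, the following single-state sketch has two identically-interested team players take independent projected policy-gradient steps on their shared payoff, while the adversary best responds at every iteration. This is only a schematic toy, not the actual IPGMAX procedure (which operates on Markov games and extracts the adversary's policy via a linear program); all function names and parameters here are illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def ipg_max_sketch(U, eta=0.1, iters=2000):
    """Single-state sketch: team players 1 and 2 share the payoff
    U[a1, a2, b] (the adversary receives -U).  Each team player runs
    independent projected gradient ascent on the shared payoff, with
    the adversary best responding at every step."""
    n1, n2, _ = U.shape
    x1 = np.ones(n1) / n1
    x2 = np.ones(n2) / n2
    for _ in range(iters):
        # adversary best responds to the current team strategy
        team_vs_b = np.einsum('i,j,ijb->b', x1, x2, U)
        b = int(np.argmin(team_vs_b))
        # independent gradient steps for each team player
        g1 = np.einsum('j,ij->i', x2, U[:, :, b])
        g2 = np.einsum('i,ij->j', x1, U[:, :, b])
        x1 = project_simplex(x1 + eta * g1)
        x2 = project_simplex(x2 + eta * g2)
    return x1, x2
```

On a game in which the team is rewarded only at a single joint action, regardless of the adversary's play, both players' strategies concentrate on that action, mirroring the intuition that the team side of the dynamics performs decentralized ascent while the adversary tracks a best response.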



To ease the notation, and without any essential loss of generality, we assume throughout that the action space does not depend on the state.
Assuming that the reward is positive is also without loss of generality (see Claim D.6).
Hansen et al. (2008) and Borgs et al. (2010) establish FNP-hardness and inapproximability for general 3-player games, but their argument readily applies to 3-player team games as well.



r(s, x, b) := E_{a∼x_s}[r(s, a, b)] and P(s′ | s, x, b) := E_{a∼x_s}[P(s′ | s, a, b)]. For a fixed strategy x ∈ X for the team, this program describes the (discounted) MDP faced by the adversary. A central challenge in this formulation lies in the nonconvexity-nonconcavity of the constraint functions, witnessed by the multilinear constraint (Q1).
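For concreteness, the marginalization r(s, x, b) = E_{a∼x_s}[r(s, a, b)] can be sketched as follows for a fixed state s, with the team's policy given as a product distribution over the players' individual actions; the array encoding and function name are our own illustration (the transition marginalization P(s′ | s, x, b) is computed identically, entrywise over s′).

```python
import itertools
import numpy as np

def induced_reward(r_s, x_s, b):
    """Expected reward r(s, x, b) = E_{a ~ x_s}[r(s, a, b)] at a fixed
    state s, where the team plays the product distribution x_s.

    r_s : array of shape (n_1, ..., n_m, B); r_s[a_1, ..., a_m, b]
          is the team's reward for joint action (a_1, ..., a_m)
          against the adversary's action b
    x_s : list of m per-player action distributions (1-D arrays)
    b   : the adversary's action index
    """
    total = 0.0
    # enumerate all joint team actions and weight by the product policy
    for a in itertools.product(*(range(len(p)) for p in x_s)):
        prob = np.prod([x_s[k][a_k] for k, a_k in enumerate(a)])
        total += prob * r_s[a + (b,)]
    return total
```

This makes the multilinearity explicit: the induced reward is linear in each player's distribution x_s separately but not jointly, which is precisely the source of the nonconvexity in the constraints mentioned above.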

ACKNOWLEDGMENTS

We are grateful to the anonymous ICLR reviewers for their valuable feedback. Ioannis Anagnostides thanks Gabriele Farina and Brian H. Zhang for helpful discussions. Ioannis Panageas would like to acknowledge a start-up grant. Part of this project was done while he was a visiting research scientist at the Simons Institute for the Theory of Computing for the program "Learning and Games". Vaggos Chatziafratis was supported by a start-up grant of UC Santa Cruz, the Foundations of Data Science Institute (FODSI) fellowship at MIT and Northeastern, and part of this work was carried out at the Simons Institute for the Theory of Computing. Emmanouil V. Vlatakis-Gkaragkounis is grateful for financial support by the Google-Simons Fellowship, Pancretan Association of America and Simons Collaboration on Algorithms and Geometry. This project was completed while he was a visiting research fellow at the Simons Institute for the Theory of Computing. Additionally, he would like to acknowledge the following series of NSF-CCF grants under the numbers 1763970/2107187/1563155/1814873.

