EQUILIBRIUM FINDING VIA EXPLOITABILITY DESCENT WITH LEARNED BEST-RESPONSE FUNCTIONS

Abstract

There has been great progress in equilibrium-finding research over the last 20 years. Most of that work has focused on games with finite, discrete action spaces. However, many games involving space, time, money, etc. have continuous action spaces. We study the problem of computing approximate Nash equilibria of games with continuous strategy sets. The main measure of closeness to Nash equilibrium is exploitability, which measures how much players can benefit from unilaterally changing their strategy. We propose a new method that minimizes an approximation of exploitability with respect to the strategy profile. This approximation is computed using learned best-response functions, which take the current strategy profile as input and return learned best responses. The strategy profile and best-response functions are trained simultaneously, with the former trying to minimize exploitability while the latter try to maximize it. We evaluate our method on various continuous games, showing that it outperforms prior methods.

1. INTRODUCTION

Most work concerning equilibrium computation has focused on games with finite, discrete action spaces. However, many games involving space, time, money, etc. have continuous action spaces. Examples include continuous resource allocation games (Ganzfried, 2021), security games in continuous spaces (Kamra et al., 2017; 2018; 2019), network games (Ghosh & Kundu, 2019), military simulations and wargames (Marchesi et al., 2020), and video games (Berner et al., 2019; Vinyals et al., 2019). Moreover, even if the action space is discrete, it can be fine-grained enough to treat as continuous for the purpose of computational efficiency (Borel, 1938; Chen & Ankenman, 2006; Ganzfried & Sandholm, 2010b).

The first question that arises in the multiagent setting is what game-theoretic solution concept and performance metric should be used. For the former, we use the standard solution concept of a Nash equilibrium: that is, a strategy profile for which each strategy is a best response to the remaining players' strategies. The main measure of closeness to Nash equilibrium is exploitability, which measures how much players can benefit from unilaterally changing their strategy. Typically, we seek Nash equilibria, that is, strategy profiles for which the exploitability is zero. As some previous works in the literature have done, we can try to search for Nash equilibria by performing gradient descent on exploitability, since it is non-negative and its zero set is precisely the set of Nash equilibria. However, evaluating exploitability requires computing best responses to the current strategy profile, which is itself a nontrivial problem in complex games. We propose a new method that minimizes an approximation of the exploitability with respect to the strategy profile. This approximation is computed using learned best-response functions, which take the current strategy profile as input and return learned best responses.
The strategy profile and best-response functions are trained simultaneously, with the former trying to minimize exploitability while the latter try to maximize it. We start by introducing some background needed to formulate the problem, including the definition of strategic-form games, strategies, equilibria, and exploitability. Next, we describe some prior methods in the literature and related research. We then describe various games that we use as benchmarks and discuss our experimental results. Finally, we present our conclusion and suggest directions for future research.

Games and equilibria. A strategic-form game is a tuple (I, X, u), where I is a set of players, X_i is a strategy set for player i, and u : ∏_i X_i → R^I is a utility function. A strategy profile x ∈ ∏_i X_i maps each player to a strategy for that player. A game is zero-sum if and only if ∑_i u(x)_i = 0 for all strategy profiles x. x_{-i} denotes x excluding player i's strategy. Player i's best-response utility b(x)_i = sup_{x'_i} u(x'_i, x_{-i})_i is the highest utility they can attain given the other players' strategies. Their utility gap δ(x)_i = b(x)_i − u(x)_i is the highest utility they can gain from unilaterally changing their strategy. x is an ε-equilibrium iff sup_i δ(x)_i ≤ ε. A 0-equilibrium is called a Nash equilibrium. In a Nash equilibrium, each player's strategy is a best response to the other players' strategies, that is, u(x)_i ≥ u(x'_i, x_{-i})_i for all i ∈ I and x'_i ∈ X_i.

Infinite games. For some games, the X_i might be infinite. The following theorems apply to such games: If for all i, X_i is nonempty and compact, and u(x)_i is continuous in x, a mixed-strategy Nash equilibrium exists (Glicksberg, 1952). If for all i, X_i is nonempty, compact, and convex, and u(x)_i is continuous in x and quasi-concave in x_i, a pure-strategy Nash equilibrium exists (Fudenberg & Tirole, 1991, p. 34).
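To make the definitions of best-response utility, utility gap, and ε-equilibrium concrete, here is a small numerical sketch on a hypothetical two-player zero-sum game with strategy sets [−1, 1] (the game and the grid-search approximation of the supremum are illustrative assumptions, not part of this paper):

```python
import numpy as np

# Hypothetical zero-sum game on [-1, 1]^2 (illustrative, not from the paper):
# u(x)_1 = x1*x2 - x1**2/2 + x2**2/2, u(x)_2 = -u(x)_1.
# Its unique Nash equilibrium is (0, 0).
def u(x):
    v1 = x[0] * x[1] - x[0]**2 / 2 + x[1]**2 / 2
    return np.array([v1, -v1])

def utility_gaps(x, grid=np.linspace(-1, 1, 2001)):
    """delta(x)_i = sup_{x'_i} u(x'_i, x_-i)_i - u(x)_i, with the sup
    approximated by a fine grid over player i's strategy set."""
    gaps = []
    for i in range(2):
        best = max(u([g if j == i else x[j] for j in range(2)])[i] for g in grid)
        gaps.append(best - u(x)[i])
    return gaps

print(utility_gaps([0.5, -0.25]))    # both gaps positive: not a Nash equilibrium
print(max(utility_gaps([0.0, 0.0]))) # ~0: (0, 0) is an eps-equilibrium for tiny eps
```

The sum of these gaps is exactly the exploitability ψ(x) discussed below; grid search is only viable here because the deviation space is one-dimensional.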
Other results include the existence of a mixed-strategy Nash equilibrium for games with discontinuous utilities under some mild semicontinuity conditions on the utility functions (Dasgupta & Maskin, 1986), and the uniqueness of a pure Nash equilibrium for continuous games under diagonal strict concavity assumptions (Rosen, 1965).

Nikaido-Isoda function. Nikaido & Isoda (1955) introduced the Nikaido-Isoda (NI) function ϕ(x, y) = ∑_i (u(y_i, x_{-i})_i − u(x)_i). It is also sometimes called the Ky Fan function (Flåm & Antipin, 1996; Flåm & Ruszczyński, 2008; Hou et al., 2018). Several papers have proposed algorithms that use this function to find Nash equilibria, including Berridge & Krawczyk (1970); Uryasev & Rubinstein (1994); Krawczyk & Uryasev (2000); Krawczyk (2005); Flåm & Ruszczyński (2008); Gürkan & Pang (2009); von Heusinger & Kanzow (2009a;b); Qu & Zhao (2013); Hou et al. (2018); Raghunathan et al. (2019); Tsaknakis & Hong (2021).

Exploitability. Let ψ(x) = sup_y ϕ(x, y). Then ψ(x) ≥ 0. Furthermore, ψ(x) = 0 if and only if x is a Nash equilibrium. ψ is commonly known as the exploitability or Nash convergence metric (NashConv) in the literature (Lanctot et al., 2017; Lockhart et al., 2019; Walton & Lisy, 2021; Timbers et al., 2022). It adds up the utility gaps δ(x)_i = sup_{y_i} u(y_i, x_{-i})_i − u(x)_i of each player, and thus serves as a measure of closeness to Nash equilibrium. In a two-player zero-sum game, the exploitability reduces to the so-called "duality gap" (Grnarova et al., 2021) ψ(x) = sup_{x'_1} u(x'_1, x_2)_1 − inf_{x'_2} u(x_1, x'_2)_1.

Algorithms. There are several algorithms in the literature for continuous games. Schaefer & Anandkumar (2019) introduced competitive gradient descent (CGD), a natural generalization of gradient descent to the two-player setting in which the update is given by the Nash equilibrium of a regularized bilinear local approximation of the underlying game. It avoids the oscillatory and divergent behaviors seen in alternating gradient descent. The convergence and stability properties of their method are robust to strong interactions between the players, without adapting the step size, which is not the case with previous methods. Ma et al. (2021) generalized CGD to more than two players by using a local approximation given by a multilinear polymatrix game that can
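The NI function also suggests how the method sketched in the introduction can be implemented: learned best-response functions map the current profile to deviations and ascend ϕ, while the profile descends the resulting approximation of ψ. Below is a minimal sketch on a hypothetical quadratic zero-sum game; the game, the linear parameterization of the best-response functions, the finite-difference gradients, and the choice to hold the learned best responses fixed (a stop-gradient) when updating the profile are all illustrative assumptions, not specifics of this paper's method:

```python
import numpy as np

# Hypothetical zero-sum game: u1(x, y) = x*y - x**2/2 + y**2/2, u2 = -u1.
# Its Nash equilibrium is (0, 0), and the exact exploitability is x**2 + y**2.
def u1(x, y): return x * y - x**2 / 2 + y**2 / 2
def u2(x, y): return -u1(x, y)

def phi(profile, deviations):
    """Nikaido-Isoda function: sum of unilateral-deviation gains."""
    x, y = profile
    dx, dy = deviations
    return (u1(dx, y) - u1(x, y)) + (u2(x, dy) - u2(x, y))

def num_grad(f, v, eps=1e-5):
    """Central finite-difference gradient (simple stand-in for autodiff)."""
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v); e.flat[i] = eps
        g.flat[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
profile = np.array([1.0, -1.0])
W, b = 0.1 * rng.standard_normal((2, 2)), np.zeros(2)  # linear best-response map
lr_x, lr_f = 0.05, 0.1

for _ in range(2000):
    # Best-response functions: ascend phi with respect to their parameters.
    gW = num_grad(lambda w: phi(profile, w.reshape(2, 2) @ profile + b), W.ravel())
    gb = num_grad(lambda bb: phi(profile, W @ profile + bb), b)
    W, b = W + lr_f * gW.reshape(2, 2), b + lr_f * gb
    # Strategy profile: descend the approximate exploitability phi(x, f(x)),
    # holding the learned best responses fixed for this step.
    br = W @ profile + b
    profile = profile - lr_x * num_grad(lambda p: phi(p, br), profile)

print(profile)            # approaches the equilibrium (0, 0)
print(profile @ profile)  # exact exploitability x**2 + y**2, driven toward 0
```

In this toy game the exact best responses (y, −x) are linear in the profile, so the linear parameterization can represent them; the paper's setting uses richer function approximators.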

Mescheder et al. (2017) introduced consensus optimization (CO) to improve the convergence properties of GANs. Unfortunately, CO can sometimes converge to undesirable points even in simple potential games. Balduzzi et al. (2018) introduced symplectic gradient adjustment (SGA) to address some of the shortcomings of CO. Foerster et al. (2018) introduced learning with opponent-learning awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes a term that accounts for the impact of one agent's policy on the anticipated parameter update of the other agents. They show that LOLA leads to cooperation with high social welfare, while independent policy gradients, a standard multiagent RL approach, does not. The policy gradient finding is consistent with prior work, such as Sandholm & Crites (1996). Letcher et al. (2018) introduced stable opponent shaping (SOS), which interpolates between LOLA and a different but similar learning method called LookAhead (Zhang & Lesser, 2010).
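The effect of LOLA's opponent-shaping term can be illustrated on a bilinear game where naive simultaneous gradient ascent cycles and diverges. The game and closed-form shaping terms below are a hypothetical illustration (not an example from the cited papers): each agent differentiates its value through the opponent's anticipated naive gradient step, which for a bilinear game can be written out by hand.

```python
# LOLA vs. naive gradient ascent on a bilinear game (hypothetical illustration):
# V1(t1, t2) = t1*t2, V2(t1, t2) = -t1*t2; each player ascends its own value.
# The joint dynamics have a fixed point at (0, 0).
def run(steps=500, alpha=0.1, eta=0.5, lola=True):
    t1, t2 = 1.0, 1.0
    for _ in range(steps):
        if lola:
            # Differentiate through the opponent's anticipated step of size eta;
            # for this bilinear game the shaping term is closed-form:
            g1 = t2 - 2 * eta * t1   # d/dt1 of V1(t1, t2 + eta * dV2/dt2)
            g2 = -t1 - 2 * eta * t2  # d/dt2 of V2(t1 + eta * dV1/dt1, t2)
        else:
            g1, g2 = t2, -t1         # independent (naive) gradients
        t1, t2 = t1 + alpha * g1, t2 + alpha * g2
    return t1, t2

print(run(lola=True))   # shaping term damps the rotation: contracts toward (0, 0)
print(run(lola=False))  # naive learners spiral outward, away from (0, 0)
```

This shows the stabilizing role of the shaping term in the learning rule; the cooperation results Foerster et al. report concern richer games such as the iterated prisoner's dilemma, not this toy example.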

