EQUILIBRIUM FINDING VIA EXPLOITABILITY DESCENT WITH LEARNED BEST-RESPONSE FUNCTIONS

Abstract

There has been great progress in equilibrium-finding research over the last 20 years. Most of that work has focused on games with finite, discrete action spaces. However, many games involving space, time, money, etc. have continuous action spaces. We study the problem of computing approximate Nash equilibria of games with continuous strategy sets. The main measure of closeness to a Nash equilibrium is exploitability, which measures how much players can benefit from unilaterally changing their strategies. We propose a new method that minimizes an approximation of exploitability with respect to the strategy profile. This approximation is computed using learned best-response functions, which take the current strategy profile as input and return learned best responses. The strategy profile and best-response functions are trained simultaneously, with the former trying to minimize exploitability while the latter try to maximize it. We evaluate our method on various continuous games, showing that it outperforms prior methods.

1. INTRODUCTION

Most work concerning equilibrium computation has focused on games with finite, discrete action spaces. However, many games involving space, time, money, etc. have continuous action spaces. Examples include continuous resource allocation games (Ganzfried, 2021), security games in continuous spaces (Kamra et al., 2017; 2018; 2019), network games (Ghosh & Kundu, 2019), military simulations and wargames (Marchesi et al., 2020), and video games (Berner et al., 2019; Vinyals et al., 2019). Moreover, even if the action space is discrete, it can be fine-grained enough that treating it as continuous is advantageous for computational efficiency (Borel, 1938; Chen & Ankenman, 2006; Ganzfried & Sandholm, 2010b).

The first question that arises in the multiagent setting is which game-theoretic solution concept and performance metric to use. For the former, we use the standard solution concept of a Nash equilibrium: a strategy profile in which each player's strategy is a best response to the other players' strategies. For the latter, the main measure of closeness to a Nash equilibrium is exploitability, which measures how much players can benefit from unilaterally changing their strategies. Exploitability is non-negative, and its zero set is precisely the set of Nash equilibria. As some previous works in the literature have done, we can therefore search for Nash equilibria by performing gradient descent on exploitability. However, evaluating exploitability requires computing best responses to the current strategy profile, which is itself a nontrivial problem in complex games.

We propose a new method that minimizes an approximation of exploitability with respect to the strategy profile. This approximation is computed using learned best-response functions, which take the current strategy profile as input and return learned best responses.
The strategy profile and best-response functions are trained simultaneously, with the former trying to minimize exploitability while the latter try to maximize it. We begin by introducing the background needed to formulate the problem, including definitions of strategic-form games, strategies, equilibria, and exploitability. Next, we describe prior methods and related research. We then describe the games we use as benchmarks and discuss our experimental results. Finally, we present our conclusions and suggest directions for future research.

