OFFLINE EQUILIBRIUM FINDING

Abstract

Offline reinforcement learning (offline RL) is an emerging field that has recently gained attention across various application domains due to its ability to learn policies from previously collected datasets. Offline RL has proved very successful, paving a path to solving previously intractable real-world problems, and we aim to generalize this paradigm to a multi-agent or multiplayer-game setting. To this end, we formally introduce the problem of offline equilibrium finding (OEF) and construct multiple datasets across a wide range of games using several established methods. To solve the OEF problem, we design a model-based framework that can directly adapt any online equilibrium finding algorithm to the OEF setting with minimal changes. We focus on the three most prominent contemporary online equilibrium finding algorithms and adapt them to the OEF setting, creating three model-based variants: OEF-PSRO and OEF-CFR, which generalize the widely used algorithms PSRO and Deep CFR to compute Nash equilibria (NEs), and OEF-JPSRO, which generalizes JPSRO to compute (coarse) correlated equilibria ((C)CEs). We further improve their performance by combining the behavior cloning policy with the model-based policy. Extensive experimental results demonstrate the superiority of our approach over multiple model-based and model-free offline RL algorithms, as well as the necessity of model-based methods for solving OEF problems. We hope that our efforts may help accelerate research in large-scale equilibrium finding.

1. INTRODUCTION

Game theory provides a universal framework for modeling interactions among cooperative and competitive players (Shoham & Leyton-Brown, 2008). The canonical solution concept is the Nash equilibrium (NE), describing a situation in which no player can increase their utility by unilaterally deviating. However, computing an NE in two-player or multi-player general-sum games is PPAD-complete (Daskalakis et al., 2006; Chen & Deng, 2006), which makes solving games both exactly and approximately difficult. The situation remains non-trivial even in two-player zero-sum games, regardless of whether the players perceive the state of the game perfectly (e.g., in Go (Silver et al., 2016)) or imperfectly (e.g., in poker (Brown & Sandholm, 2018) or StarCraft II (Vinyals et al., 2019)). In recent years, learning algorithms have demonstrated their superiority over traditional optimization methods, such as linear or nonlinear programs, in solving large-scale imperfect-information extensive-form games. The most successful learning algorithms belong either to the line of research on counterfactual regret minimization (CFR) (Brown & Sandholm, 2018) or to that on policy space response oracles (PSRO) (Lanctot et al., 2017). CFR is an iterative algorithm approximating NEs through repeated self-play. Several sampling-based CFR variants (Lanctot et al., 2009; Gibson et al., 2012) were proposed to solve large games efficiently. To scale to even larger games, CFR can be combined with neural-network function approximation (Brown et al., 2019; Steinberger, 2019; Li et al., 2019; Agarwal et al., 2020). The other algorithm, PSRO, generalizes the double oracle method (McMahan et al., 2003; Bošanský et al., 2014) by incorporating (deep) reinforcement learning (RL) methods as a best-response oracle (Lanctot et al., 2017; Muller et al., 2019). Neural fictitious self-play (NFSP) can be seen as a special case of PSRO (Heinrich et al., 2015).
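To make the CFR family more concrete, the regret-matching rule at its core can be sketched as follows. This is a generic, tabular illustration of how cumulative counterfactual regrets are mapped to a strategy at one information set; it is not the paper's Deep CFR implementation, and the function name is ours.

```python
import numpy as np

def regret_matching(cum_regrets):
    """Map cumulative counterfactual regrets to a strategy.

    Actions with positive cumulative regret receive probability
    proportional to that regret; if no regret is positive, fall
    back to the uniform strategy. Averaging such strategies over
    CFR iterations converges to an NE in two-player zero-sum games.
    """
    positive = np.maximum(cum_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cum_regrets), 1.0 / len(cum_regrets))

# Example: three actions, only the last two have positive regret,
# so the first action receives zero probability mass.
strategy = regret_matching(np.array([-1.0, 3.0, 1.0]))
```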
Both CFR and PSRO have achieved great performance in solving large-scale imperfect-information games, e.g., poker (Brown & Sandholm, 2018; McAleer et al., 2020). In this paper, we focus exclusively on imperfect-information extensive-form games. One of the critical components of these successes is the existence of efficient and accurate simulators. A simulator serves as an environment that allows an agent to collect millions to billions of trajectories for the training process. The simulator may be encoded using rules, as in various poker variants (Lanctot et al., 2019), or provided by a video-game suite such as StarCraft II (Vinyals et al., 2017). However, in many real-world games, such as football (Kurach et al., 2020; Tuyls et al., 2021) or table tennis (Ji et al., 2021), constructing a sufficiently accurate simulator may not be feasible because of a plethora of complex factors affecting the gameplay. These factors include the relevant laws of physics, environmental circumstances (e.g., wind speed), and the physiological limits of (human) bodies rendering certain actions unattainable. Therefore, football teams or table tennis players may resort to watching previous matches to alter their strategies, which semantically corresponds to offline equilibrium finding (OEF). Recent years have witnessed several (often domain-specific) attempts to formalize offline learning in the context of games. For example, StarCraft II Unplugged (Mathieu et al., 2021) offers a dataset of human gameplay in this two-player zero-sum symmetric game. A concurrent work (Cui & Du, 2022) investigates the properties an offline dataset of a two-player zero-sum game must satisfy for its NEs to be inferred. However, neither work considers the significantly more challenging setting of multi-player games.
To this end, we propose a general problem, offline equilibrium finding (OEF), which aims to find an equilibrium strategy of the underlying game given a fixed dataset collected by an unknown behavior strategy. This is challenging because it requires establishing a connection between an equilibrium strategy and an offline dataset. To solve this problem, we introduce an environment model as the intermediary between the two. More specifically, our main contributions include: i) proposing a novel problem, OEF, and constructing OEF datasets from widely accepted game domains using different behavior strategies; ii) proposing a model-based method that can generalize any online equilibrium finding algorithm to the OEF setting by introducing an environment model; iii) adapting several existing online equilibrium finding algorithms to the OEF setting for computing different equilibrium solutions; iv) applying the behavior cloning technique to further improve performance; and v) conducting extensive experiments to verify the effectiveness of our algorithms. The experimental results substantiate the superiority of our method over model-based and model-free offline RL algorithms, as well as the effectiveness and necessity of the model-based method for solving the OEF problem.
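One simple reading of contribution iv), combining the behavior cloning policy with the model-based policy, is a convex mixture of the two policies' action distributions at each decision point. The function name, the mixing rule, and the weight `alpha` below are illustrative assumptions, not the paper's exact combination method.

```python
import numpy as np

def mix_policies(bc_probs, mb_probs, alpha=0.5):
    """Hypothetical combination of a behavior-cloning policy and a
    model-based equilibrium-finding policy.

    bc_probs: action distribution of the behavior-cloning policy.
    mb_probs: action distribution of the model-based policy.
    alpha:    weight on the model-based policy (a tunable choice).
    """
    mixed = alpha * np.asarray(mb_probs) + (1.0 - alpha) * np.asarray(bc_probs)
    return mixed / mixed.sum()  # renormalize to guard against drift

# Example: equally weight a BC policy and a model-based policy.
combined = mix_policies([0.8, 0.2, 0.0], [0.2, 0.4, 0.4], alpha=0.5)
```

The intuition for such a mixture is that the behavior-cloning component stays close to the data distribution (where the learned environment model is reliable), while the model-based component pulls the strategy toward the computed equilibrium.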

2. RATIONALE BEHIND OFFLINE EQUILIBRIUM FINDING

We begin by providing a motivating scenario for OEF and comparing it to three related lines of research (opponent modeling, empirical game-theoretic analysis, and offline RL) in order to highlight the rationale behind the introduction of OEF. We then justify the choice of model-based methods for OEF. A more detailed overview of related work is provided in Appendix A.

Motivating scenario. Assume that a table tennis player A will play against player B, whom they have never faced before. What may A do to prepare for the match? In this case, even though A knows the rules of table tennis, they remain unaware of the specific circumstances of playing against B, such as which moves or actions B prefers, or B's subjective payoff function. Without this detailed game information, self-play and other online equilibrium finding algorithms cannot work. If player A simply plays the best response to player B's previous strategy, that best response may be exploited once player B changes their strategy. Therefore, player A has to watch the matches that player B played against other players to learn their style and compute the equilibrium strategy, which minimizes exploitability. This process corresponds to the proposed OEF methodology.

Why offline equilibrium finding? In games with complex dynamics, such as table tennis or football (Kurach et al., 2020), it is difficult to build a realistic simulator or to learn a policy while playing the game. A remedy is to learn the policy from historical game data. Therefore, we propose the offline equilibrium finding (OEF) problem, which can be defined as follows: given a fixed dataset D collected by an unknown behavior strategy σ, find an equilibrium strategy profile σ* of the underlying game.
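To make the problem statement concrete, a record in such an offline dataset D might be a per-step transition generated by the unknown behavior strategy σ, from which an environment model can later be fit. The record layout and field names below are illustrative assumptions, not the paper's dataset schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Transition:
    """One illustrative record of an OEF dataset.

    Each record captures what an observer of past matches could log:
    the information state, the acting player's action, the per-player
    rewards (games may be general-sum), and the successor state.
    """
    state: Tuple[int, ...]       # encoding of the information state
    player: int                  # which player acted
    action: int                  # action taken under sigma
    rewards: Tuple[float, ...]   # one reward per player
    next_state: Tuple[int, ...]  # resulting information state
    terminal: bool               # whether the game ended here

# Example: a single logged step of a two-player game.
record = Transition(
    state=(0, 1), player=0, action=2,
    rewards=(0.0, 0.0), next_state=(1, 1), terminal=False,
)
```

A model-based OEF method would fit an environment model (transition and reward functions) to a collection of such records and then run an online equilibrium finding algorithm against that learned model.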



Figure 1: The game of table tennis.

