ESCHER: ESCHEWING IMPORTANCE SAMPLING IN GAMES BY COMPUTING A HISTORY VALUE FUNCTION TO ESTIMATE REGRET

Abstract

Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model-free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that requires no importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability. We show that the variance of ESCHER's estimated regret is orders of magnitude lower than that of DREAM and other baselines. We then show that ESCHER outperforms the prior state of the art, DREAM and neural fictitious self-play (NFSP), on a number of games, and that the difference becomes dramatic as game size increases. In the very large game of dark chess, ESCHER beats DREAM and NFSP in head-to-head competition over 90% of the time.

1. INTRODUCTION

A core challenge in computational game theory is learning strategies that approximate Nash equilibrium in very large imperfect-information games such as Starcraft (Vinyals et al., 2019), dark chess (Zhang & Sandholm, 2021), and Stratego (McAleer et al., 2020; Perolat et al., 2022). Due to the size of these games, tabular game-solving algorithms such as counterfactual regret minimization (CFR) are unable to produce equilibrium strategies. To sidestep this issue, stochastic methods such as Monte Carlo CFR (MCCFR) have been proposed. These methods use computationally inexpensive unbiased estimators of each player's regret (i.e., utility gradient), trading off speed for convergence guarantees that hold with high probability rather than in the worst case. Several unbiased estimators of utility gradients are known. Some, such as external sampling, produce low-variance gradient estimates that are dense, and are therefore prohibitive in the settings mentioned above. Others, such as outcome sampling, produce high-variance estimates that are sparse and can be computed from only the realized play, and are therefore more appropriate for massive games.

However, even outcome-sampling MCCFR is inapplicable in practice, for two reasons. First, since it is a tabular method, it can only update regret at information sets visited during training; in very large games, only a small fraction of all information sets will ever be visited, so generalization (via neural networks) is necessary. Second, to keep its utility-gradient estimates unbiased, outcome-sampling MCCFR uses importance sampling: it divides the utility of each terminal state by the probability of reaching it, which is often tiny. The resulting estimates can have extremely large magnitude and high variance. This drawback is especially problematic when MCCFR is implemented with function approximation, as the high variance of the updates can destabilize neural network training.

Deep CFR (Brown et al., 2019) addresses the first shortcoming by training a neural network to estimate the regrets accumulated by outcome-sampling MCCFR, but it remains vulnerable to the second, which makes the neural network training procedure unstable. DREAM (Steinberger et al., 2020) improves on Deep CFR by partially addressing the second shortcoming: it uses a history-based value function as a baseline (Schmid et al., 2019), which greatly reduces the variance of the updates and performs better than simply regressing on the MCCFR targets. However, DREAM still uses importance sampling to remain unbiased. So, while DREAM was shown to work in small artificial poker variants, it remains vulnerable to the high variance of the estimated counterfactual regret, and indeed we demonstrate that in games with long horizons and/or large action spaces this importance sampling term causes DREAM to fail.

In this paper, we introduce Eschewing importance Sampling by Computing a History value function to Estimate Regret (ESCHER), a method that is unbiased, low variance, and free of importance sampling. ESCHER differs from DREAM in two important ways, both of which we show are critical to achieving good performance. First, instead of using a history-dependent value function as a baseline, ESCHER uses one directly as an estimator of the counterfactual value. Second, ESCHER does not multiply estimated counterfactual values by an importance-weighted reach term.
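To make the contrast concrete, the following Python sketch (our illustration, not the paper's code; `u_z`, `q_z`, and `value_net` are hypothetical placeholders) compares the outcome-sampling MCCFR counterfactual-value estimate, which divides a terminal utility by the trajectory's sample probability, with an ESCHER-style estimate that queries a learned history value function directly:

```python
# Sketch: two ways to estimate a counterfactual value from a single
# sampled trajectory that passed through history h and ended at z.

def mccfr_outcome_sampling_estimate(u_z: float, q_z: float) -> float:
    """Outcome-sampling MCCFR: unbiased because the terminal utility
    u(z) is divided by the probability q(z) of having sampled z.
    Since q(z) is often tiny, the estimate's magnitude (and hence
    its variance) can blow up."""
    return u_z / q_z

def escher_style_estimate(value_net, history_features) -> float:
    """ESCHER-style: read a learned history value function directly.
    No importance-sampling ratio appears, so the estimate's magnitude
    stays bounded by the range of the value network's outputs."""
    return value_net(history_features)
```

The division by `q_z` is exactly what produces the large-magnitude regression targets that destabilize neural training; the ESCHER-style estimate never performs that division.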
To remove the need to weight by the reach probability of the current information state, ESCHER samples actions from a fixed sampling policy that does not change from one iteration to the next. Because this distribution is static, its only effect is to weight certain information sets more than others. When the fixed sampling policy is close to the balanced policy (i.e., one under which each leaf is reached with equal probability), these weighting terms minimally affect the high-probability convergence of ESCHER. We find that ESCHER has orders-of-magnitude lower variance in its estimated regret. In experiments with a deep learning version of ESCHER on the large games of phantom tic-tac-toe, dark hex, and dark chess, we find that ESCHER outperforms NFSP and DREAM, and that the performance gap becomes dramatic as the size of the game increases. Finally, we show through ablations that both differences between ESCHER and DREAM (removing the bootstrapped baseline and removing importance sampling) are necessary to obtain low variance and good performance on large games.
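For intuition about the balanced policy: in a small game it can be computed exactly by making each action's probability proportional to the number of leaves in the subtree that action leads to. The toy sketch below (our own encoding of a game tree as nested tuples, not from the paper) illustrates this:

```python
from functools import lru_cache

# A toy game tree as nested tuples: a node is a tuple of children,
# and an empty tuple is a terminal history (leaf).
TREE = (((), ()), ((), (), ()))  # left subtree: 2 leaves, right: 3

@lru_cache(maxsize=None)
def num_leaves(node) -> int:
    """Count terminal histories in the subtree rooted at node."""
    return 1 if not node else sum(num_leaves(c) for c in node)

def balanced_policy(node):
    """Action probabilities proportional to leaf counts, so every
    terminal history is reached with equal probability."""
    counts = [num_leaves(c) for c in node]
    total = sum(counts)
    return [c / total for c in counts]

print(balanced_policy(TREE))  # [0.4, 0.6]: 2/5 of leaves left, 3/5 right
```

In games too large to enumerate, the balanced policy can only be approximated, which is why ESCHER requires the fixed sampling policy to be close to, rather than exactly equal to, balanced.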

2. BACKGROUND

We consider extensive-form games with perfect recall (Osborne & Rubinstein, 1994; Hansen et al., 2004; Kovařík et al., 2022). An extensive-form game progresses through a sequence of player actions, and has a world state $w \in \mathcal{W}$ at each step. In an $N$-player game, $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ is the space of joint actions for the players. $\mathcal{A}_i(w) \subseteq \mathcal{A}_i$ denotes the set of legal actions for player $i \in \mathcal{N} = \{1, \ldots, N\}$ at world state $w$, and $a = (a_1, \ldots, a_N) \in \mathcal{A}$ denotes a joint action. At each world state, after the players choose a joint action, a transition function $\mathcal{T}(w, a) \in \Delta(\mathcal{W})$ determines the probability distribution of the next world state $w'$. Upon transition from world state $w$ to $w'$ via joint action $a$, player $i$ makes an observation $o_i = \mathcal{O}_i(w')$. In each world state $w$, player $i$ receives a utility $u_i(w)$. The game ends when the players reach a terminal world state. In this paper, we consider games that are guaranteed to end in a finite number of actions and that have zero utility at non-terminal world states.

A history is a sequence of actions and world states, denoted $h = (w^0, a^0, w^1, a^1, \ldots, w^t)$, where $w^0$ is the known initial world state of the game. $U_i(h)$ and $\mathcal{A}_i(h)$ are, respectively, the utility and the set of legal actions for player $i$ in the last world state of a history $h$. An information set for player $i$, denoted $s_i$, is the sequence of that player's observations and actions up to that time: $s_i(h) = (a_i^0, o_i^1, a_i^1, \ldots, o_i^t)$. Define the set of all information sets for player $i$ to be $\mathcal{I}_i$. The set of histories that correspond to an information set $s_i$ is denoted $H(s_i) = \{h : s_i(h) = s_i\}$, and it is


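To make the notation concrete, here is a minimal Python sketch (our illustrative encoding; `Step`, `info_set`, and `partition_by_info_set` are hypothetical names, not from the paper) of how an information set $s_i(h)$ keeps only player $i$'s own actions and observations, and how $H(s_i)$ groups the histories that player $i$ cannot distinguish:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    """One transition of the game: the joint action taken and the
    per-player observations emitted by the resulting world state."""
    joint_action: tuple   # (a_1, ..., a_N)
    observations: tuple   # (o_1, ..., o_N)

def info_set(history, i):
    """s_i(h): the subsequence of h visible to player i, i.e. player
    i's own actions interleaved with player i's observations."""
    return tuple((step.joint_action[i], step.observations[i])
                 for step in history)

def partition_by_info_set(histories, i):
    """H(s_i) = {h : s_i(h) = s_i}: group together the histories
    that player i cannot tell apart."""
    groups = defaultdict(list)
    for h in histories:
        groups[info_set(h, i)].append(h)
    return dict(groups)
```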