ABSTRACTING IMPERFECT INFORMATION AWAY FROM TWO-PLAYER ZERO-SUM GAMES

Abstract

In their seminal work, Nayyar et al. (2013) showed that imperfect information can be abstracted away from common-payoff games by having players publicly announce their policies as they play. This insight underpins sound solvers and decision-time planning algorithms for common-payoff games. Unfortunately, a naive application of the same insight to two-player zero-sum games fails, because Nash equilibria of the game with public policy announcements may not correspond to Nash equilibria of the original game. As a consequence, existing sound decision-time planning algorithms require complicated additional mechanisms with unappealing properties. The main contribution of this work is showing that certain regularized equilibria do not possess the aforementioned non-correspondence problem; thus, computing them can be treated as a perfect-information problem. This result yields a simplified framework for decision-time planning in two-player zero-sum games, free of the unappealing properties that plague existing decision-time planning algorithms.

1. INTRODUCTION

In single-agent settings, dynamic programming (Bertsekas, 2000) serves as a bedrock for reinforcement learning (Sutton & Barto, 2018), justifying the approximation of optimal policies by backward induction and facilitating a simple framework for decision-time planning. One might hope that dynamic programming could provide similar grounding in multi-agent settings for well-defined notions of optimality, such as optimal joint policies in common-payoff games, Nash equilibria in two-player zero-sum (2p0s) games, and team correlated equilibria in two-team zero-sum (2t0s) games. Unfortunately, this is not straightforwardly the case when there is imperfect information, i.e., when one player has knowledge that another does not or when two players act simultaneously. This difficulty arises from two causes, which we call the backward dependence problem and the non-correspondence problem. The backward dependence problem is that computing the expected return starting from a decision point generally requires knowledge of the policies that were played up to that point, in addition to the policies that will be played going forward. This is in stark contrast to perfect-information settings, in which the expected return starting from a decision point is independent of the policy played before arriving at that decision point. As a result of this bidirectional temporal dependence, backward induction arguments that work in perfect-information settings fail in imperfect-information settings. In their seminal work, Nayyar et al. (2013) showed that the backward dependence problem can be resolved by having players publicly announce their policies as they play. Using this insight, a common-payoff game can be transformed into a public belief Markov decision process (PuB-MDP). Importantly, deterministic optimal policies in the PuB-MDP can be mapped back to optimal joint policies of the original common-payoff game.
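To make the policy-announcement idea concrete, the following is a minimal sketch (with illustrative names and data layout, not the paper's formalism) of the Bayesian update that maintains a common belief over hidden histories once a decision rule has been publicly announced and an action observed:

```python
# Toy sketch of the public-belief update underlying constructions like
# the PuB-MDP: once a player's decision rule is publicly announced,
# every observer can update a shared belief over hidden histories by
# Bayes' rule. All names here are hypothetical.

def update_public_belief(belief, announced_policy, action):
    """belief: dict mapping history -> probability.
    announced_policy: dict mapping history -> dict action -> probability.
    action: the publicly observed action.
    Returns the posterior belief over histories given the action."""
    posterior = {
        h: p * announced_policy[h].get(action, 0.0)
        for h, p in belief.items()
    }
    total = sum(posterior.values())
    if total == 0.0:
        raise ValueError("observed action has zero probability under the announced policy")
    return {h: p / total for h, p in posterior.items()}

# Example: two equally likely hidden histories; the announced policy
# plays "raise" with different probabilities in each.
belief = {"h1": 0.5, "h2": 0.5}
policy = {"h1": {"raise": 0.8, "fold": 0.2},
          "h2": {"raise": 0.2, "fold": 0.8}}
print(update_public_belief(belief, policy, "raise"))  # h1 becomes 4x as likely as h2
```

Because the update depends only on the announced policy and the public action, it is common knowledge, which is what allows the belief itself to serve as a fully observable state.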
Having players publicly announce their policies can also be used to transform 2p0s games into alternating Markov games (AMGs) with public belief states, which we call public belief alternating Markov games (PuB-AMGs). AMGs are fully observable turn-based games (like Go and chess) and are therefore amenable to dynamic programming-based approaches (Littman, 1996). Unfortunately, computing Nash equilibria of PuB-AMGs carries little value because these Nash equilibria may not correspond with Nash equilibria of the original game (Ganzfried & Sandholm, 2015; Brown et al., 2020). Indeed, as we will show, they may correspond with arbitrarily exploitable policies. We call this problem the non-correspondence problem. The main contribution of this work is showing that regularized minimax objectives that guarantee unique equilibria in subgames do not suffer from the non-correspondence problem. In other words, computing these uniqueness-guaranteeing equilibria can be reduced to computing the associated equilibria in the PuB-AMG. Because guaranteeing uniqueness is as simple as adding entropy regularization (Perolat et al., 2021), our result is straightforward to apply in practice. We view the primary importance of this reduction from imperfect-information 2p0s games to perfect-information games as threefold.

1. It is the first known reduction of its kind; specifically, it is the first regularized-equilibrium-preserving transformation from imperfect-information 2p0s games to perfect-information 2p0s games that is meaningful in the sense that regularized equilibria can be made arbitrarily close to Nash equilibria.

2. It yields a simple framework for decision-time planning in 2p0s games. The simplicity of this framework stands in stark contrast to existing approaches (Brown & Sandholm, 2017a;b; Moravčík et al., 2017; Brown et al., 2020; Schmid et al., 2021), which are hampered by complications involving undesirable facets, such as discontinuous functions.

3. It can be applied across the whole class of 2t0s games (as depicted in Figure 1).

Figure 1: Our contribution in the context of related work, at an abstract level.

We use the terminology finite-horizon sequential game to describe settings in which players act one-at-a-time and in which the game terminates after a fixed number of steps. This setting can express any (simultaneous-move) timeable perfect-recall finite-horizon game. See, for example, Kovarík et al. (2019) for more details. Symbolically, we say a setting is a finite-horizon sequential game if it can be described by a tuple ⟨A, [O_i], O_pub, [H_i], H_pub, H, µ, [R_i], T⟩, where:

• i ranges from 0 to N − 1 and ι ∈ {0, . . . , N − 1} denotes the acting player (our usage of ι is informal but unambiguous in context).
• A is the set of actions.
• O_i is the set of private observations for player i.
• O_pub is the set of public observations (common knowledge to all players).
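The tuple components listed above can be transcribed into a container type; the following is an illustrative sketch in which the field names mirror the symbols in the text, while the concrete Python types (sequences for sets, callables for µ, the R_i, and T) are assumptions made for the sketch, not part of the paper's formalism:

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Illustrative transcription of the finite-horizon sequential game
# tuple defined above. Types are assumptions for this sketch only.

@dataclass(frozen=True)
class FiniteHorizonSequentialGame:
    actions: Sequence[Any]                          # A
    private_observations: Sequence[Sequence[Any]]   # [O_i], one set per player
    public_observations: Sequence[Any]              # O_pub
    private_histories: Sequence[Sequence[Any]]      # [H_i], one set per player
    public_histories: Sequence[Any]                 # H_pub
    histories: Sequence[Any]                        # H (ground-truth histories)
    initial_distribution: Callable[[Any], float]    # µ: probability of an initial history
    rewards: Sequence[Callable[[Any], float]]       # [R_i], one reward function per player
    transition: Callable[[Any, Any], Any]           # T: (history, action) -> next history

    @property
    def num_players(self) -> int:
        return len(self.rewards)

# A degenerate one-step example, just to show the container in use.
game = FiniteHorizonSequentialGame(
    actions=["a", "b"],
    private_observations=[["o0"], ["o1"]],
    public_observations=["pub"],
    private_histories=[[()], [()]],
    public_histories=[()],
    histories=[()],
    initial_distribution=lambda h: 1.0,
    rewards=[lambda h: 0.0, lambda h: 0.0],
    transition=lambda h, a: h + (a,),
)
print(game.num_players)  # → 2
```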


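To illustrate the uniqueness-guaranteeing role of entropy regularization discussed above, here is a minimal sketch for the matrix-game case (a generic quantal-response computation under smoothed fictitious play, not the paper's algorithm): entropy regularization gives the zero-sum game a unique equilibrium, which approaches a Nash equilibrium as the temperature shrinks.

```python
import math

def softmax(vals, tau):
    """Entropy-regularized (logit) best response at temperature tau."""
    m = max(vals)
    exps = [math.exp((v - m) / tau) for v in vals]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_regularized_equilibrium(A, tau, steps=20000):
    """Smoothed fictitious play for the unique entropy-regularized
    equilibrium of the zero-sum matrix game with row-payoff matrix A."""
    n, m = len(A), len(A[0])
    x = [1.0 / n] * n  # row player's mixed strategy
    y = [1.0 / m] * m  # column player's mixed strategy
    for t in range(steps):
        lr = 1.0 / (t + 2)  # decaying step size (fictitious-play averaging)
        row_vals = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        col_vals = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        bx, by = softmax(row_vals, tau), softmax(col_vals, tau)
        x = [(1 - lr) * xi + lr * bi for xi, bi in zip(x, bx)]
        y = [(1 - lr) * yi + lr * bi for yi, bi in zip(y, by)]
    return x, y

# A biased matching-pennies-style game whose unique Nash equilibrium
# has both players mixing (0.4, 0.6).
A = [[2, -1], [-1, 1]]
for tau in (1.0, 0.1):
    x, y = entropy_regularized_equilibrium(A, tau)
    print(f"tau={tau}: x={[round(p, 3) for p in x]}, y={[round(p, 3) for p in y]}")
```

At high temperature the regularized equilibrium is pulled toward the uniform strategy; at low temperature it sits near the Nash equilibrium, matching the sense in which regularized equilibria "can be made arbitrarily close to Nash equilibria" in the text.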