ABSTRACTING IMPERFECT INFORMATION AWAY FROM TWO-PLAYER ZERO-SUM GAMES

Abstract

In their seminal work, Nayyar et al. (2013) showed that imperfect information can be abstracted away from common-payoff games by having players publicly announce their policies as they play. This insight underpins sound solvers and decision-time planning algorithms for common-payoff games. Unfortunately, a naive application of the same insight to two-player zero-sum games fails because Nash equilibria of the game with public policy announcements may not correspond to Nash equilibria of the original game. As a consequence, existing sound decision-time planning algorithms require complicated additional mechanisms that have unappealing properties. The main contribution of this work is showing that certain regularized equilibria do not suffer from this non-correspondence problem; thus, computing them can be treated as a perfect information problem. This result yields a simplified framework for decision-time planning in two-player zero-sum games, free of the unappealing properties that plague existing decision-time planning algorithms.

1. INTRODUCTION

In single-agent settings, dynamic programming (Bertsekas, 2000) serves as a bedrock for reinforcement learning (Sutton & Barto, 2018), justifying approximating optimal policies by backward induction and facilitating a simple framework for decision-time planning. One might hope that dynamic programming could also provide similar grounding in multi-agent settings for well-defined notions of optimality, like optimal joint policies in common-payoff games, Nash equilibria in two-player zero-sum (2p0s) games, and team correlated equilibria in two-team zero-sum (2t0s) games. Unfortunately, this is not straightforwardly the case when there is imperfect information, i.e., when one player has knowledge that another does not or two players act simultaneously. This difficulty arises from two causes, which we call the backward dependence problem and the non-correspondence problem.

The backward dependence problem is that, with imperfect information, the expected return starting from a decision point generally depends on the policies played before arriving at the decision point, in addition to the policies that will be played going forward. This is in stark contrast to perfect information settings, in which the expected return starting from a decision point is independent of the policy played before arriving at the decision point. As a result of this bidirectional temporal dependence, backward induction arguments that work in perfect information settings fail in imperfect information settings. In their seminal work, Nayyar et al. (2013) showed that the backward dependence problem can be resolved by having players publicly announce their policies as they play. Using this insight, a common-payoff game can be transformed into a public belief Markov decision process (PuB-MDP). Importantly, deterministic optimal policies in the PuB-MDP can be mapped back to optimal joint policies of the original common-payoff game. Having players publicly announce their policies can also be used to transform 2p0s games into alternating Markov games (AMGs) with public belief states, which we call public belief alternating Markov games (PuB-AMGs).
AMGs are fully observable turn-based games (like Go and chess) and, therefore, are amenable to dynamic programming-based approaches (Littman, 1996). Unfortunately, computing Nash equilibria of PuB-AMGs carries little value because these Nash equilibria may not correspond with Nash equilibria in the original game (Ganzfried & Sandholm, 2015; Brown et al., 2020). Indeed, as we will show, they may correspond with arbitrarily exploitable policies. We call this problem the non-correspondence problem.

The main contribution of this work is showing that regularized minimax objectives that guarantee unique equilibria in subgames do not suffer from the non-correspondence problem. In other words, computing these uniqueness-guaranteeing equilibria can be reduced to computing the associated equilibria in the PuB-AMG. Because guaranteeing uniqueness is as simple as adding entropy regularization (Perolat et al., 2021), our result is straightforward to apply in practice.

We view the primary importance of this reduction from imperfect information 2p0s games to perfect information games as threefold.

1. It is the first known reduction of its kind; specifically, it is the first regularized-equilibrium-preserving transformation from imperfect information 2p0s games to perfect information 2p0s games that is meaningful in the sense that regularized equilibria can be made arbitrarily close to Nash equilibria.
2. It yields a simple framework for decision-time planning in 2p0s games. The simplicity of this framework is in stark contrast to existing approaches (Brown & Sandholm, 2017a; b; Moravčík et al., 2017; Brown et al., 2020; Schmid et al., 2021), which are hampered by complications that involve undesirable facets, such as discontinuous functions.
3. It can be applied across the whole class of 2t0s games (as depicted in Figure 1) because of the recent results of Carminati et al. (2022a), Zhang et al. (2022), and Carminati et al. (2022b), who showed that 2t0s games can be cast as 2p0s games.

2.1. FINITE-HORIZON SEQUENTIAL GAMES

We use the terminology finite-horizon sequential game to describe settings in which players act one-at-a-time and in which the game terminates after a fixed number of steps. This setting can express any (simultaneous-move) timeable perfect-recall finite-horizon game. See, for example, Kovarík et al. (2019) for more details. Symbolically, we say a setting is a finite-horizon sequential game if it can be described by a tuple $\langle \mathbb{A}, [\mathbb{O}_i], \mathbb{O}_{\text{pub}}, [\mathbb{H}_i], \mathbb{H}_{\text{pub}}, \mathbb{H}, \mu, [\mathcal{O}_i], \mathcal{O}_{\text{pub}}, [\mathcal{R}_i], \mathcal{T}, T \rangle$,

where

• $i$ ranges from $0$ to $N-1$ and $\iota \in \{0, \dots, N-1\}$ denotes the acting player.
• $\mathbb{A}$ is the set of actions.
• $\mathbb{O}_i$ is the set of private observations for player $i$.
• $\mathbb{O}_{\text{pub}}$ is the set of public observations (common knowledge to all players).
• $\mathbb{H}_i = \cup_{t=0}^{T-1} (\mathbb{O}_{\text{pub}} \times \mathbb{O}_i)^t \times \mathbb{O}_{\text{pub}} \times \mathbb{O}_i$ is the set of player $i$'s action-observation histories (AOHs).
• $\mathbb{H}_{\text{pub}} = \cup_{t=0}^{T-1} \mathbb{O}_{\text{pub}}^t$ is the set of public histories.
• $\mathbb{H} \subset \mathbb{H}_0 \times \dots \times \mathbb{H}_{N-1}$ is the set of histories.
• $\mu \in \Delta(\mathbb{H}_0(h^0_{\text{pub}}) \times \dots \times \mathbb{H}_{N-1}(h^0_{\text{pub}}))$ is the initial history distribution.
• $\mathcal{O}_i : \mathbb{H} \to \mathbb{O}_i$ is player $i$'s observation function.
• $\mathcal{O}_{\text{pub}} : \mathbb{H} \to \mathbb{O}_{\text{pub}}$ is the public observation function.
• $\mathcal{R}_i : \mathbb{H} \times \mathbb{A} \to \mathbb{R}$ is player $i$'s reward function.
• $\mathcal{T} : \mathbb{H} \times \mathbb{A} \to \Delta(\mathbb{H})$ is the transition function.
• $T$ is the time horizon at which the game terminates.

For a given $h_{\text{pub}} \in \mathbb{H}_{\text{pub}}$, we use $\mathbb{H}_i(h_{\text{pub}})$ to denote the set of AOHs for player $i$ that are consistent with $h_{\text{pub}}$ and $\mathbb{H}(h_{\text{pub}})$ to denote the set of histories that are consistent with $h_{\text{pub}}$. Also, for a history $h$, we use $h_\iota$ to denote the AOH for the acting player at history $h$. We use capitals of the same letters to denote random variables of the same types.

Special Cases This work will make use of the following special cases.

• Two team zero sum: Games in which $\{0, \dots, N-1\}$ is a disjoint union of two blocks, where, for all $i, j$, $\mathcal{R}_i = \mathcal{R}_j$ if $i, j$ belong to the same block and $\mathcal{R}_i = -\mathcal{R}_j$ if $i, j$ belong to opposite blocks.
• Common payoff: Games in which $\mathcal{R}_i = \mathcal{R}_j$ for all $i, j$.
• Two player zero sum: Games in which $N = 2$ and $\mathcal{R}_0 = -\mathcal{R}_1$.

In these special cases, the reward of all players is uniquely determined by the reward of any individual player. Thus, we will drop the player index $i$ on the reward function and use $\mathcal{R} = \mathcal{R}_0$.

Subgames For a given finite-horizon sequential game, we use the term subgame to refer to a game that begins with initial history distribution $\mu \in \Delta(\mathbb{H}_0(h_{\text{pub}}) \times \dots \times \mathbb{H}_{N-1}(h_{\text{pub}}))$ for some particular $h_{\text{pub}}$ and is otherwise the same as the original game.

2.2. FINITE-HORIZON FULLY OBSERVABLE SEQUENTIAL GAMES

We also introduce specialized notation for fully observable settings. We use the terminology finite-horizon fully observable sequential game to describe tuples $\langle \mathbb{A}, \mathbb{S}, s^0, [\mathcal{R}_i], \mathcal{T}, T \rangle$, where

• $\mathbb{S}$ is the set of states.
• $s^0 \in \mathbb{S}$ is the initial state.
• $\mathcal{R}_i : \mathbb{S} \times \mathbb{A} \to \mathbb{R}$ is player $i$'s reward function.
• $\mathcal{T} : \mathbb{S} \times \mathbb{A} \to \Delta(\mathbb{S})$ is the transition function.
• $i$, $\iota$, $\mathbb{A}$, and $T$ are defined as they were in the finite-horizon sequential game formalism.

Special Cases

In the fully observable context, we are interested in the following settings.

• Markov decision processes (MDPs): Games in which $N = 1$.
• Alternating Markov games (AMGs): Games that are two player zero sum.

As before, because the reward of one player uniquely determines that of the other in these settings, we will use $\mathcal{R} = \mathcal{R}_0$.

Subgames For a given finite-horizon fully observable sequential game, we use the term subgame to refer to a game that begins with initial state $s^0 = s$ for some particular $s$ and otherwise proceeds by the same rules as the original game.

3. BACKGROUND

The Backward Dependence Problem To illustrate the presence of the backward dependence problem in even very simple settings, we show a cooperative matching pennies game in Figure 2. The goal of the game is for the blue player and the red player to select the same side of a coin. The blue player moves first; then, without observing the blue player's choice, the red player moves second. Because the red player does not observe the blue player's choice (as denoted by the dotted line between the two nodes), it must make the same decision at both nodes. Now, let us consider the value for the red player. In perfect-information settings, because the red player is at a penultimate decision point, such a value would be equal to the expected return for the best action, independent of any prior events. However, here, with imperfect information, the expected return for the best action is equal to $\max(p, 1-p)$, where $p$ is the probability that the blue player selects heads. If the blue player's policy is unknown, there is no way to compute this value, illustrating that the backward induction approach to learning in perfect information settings fails in imperfect information settings.

As discussed in the introduction, a common-payoff game can be transformed into a PuB-MDP. Given a finite-horizon common-payoff sequential game, we define the associated PuB-MDP as the MDP $\langle \tilde{\mathbb{A}}, \tilde{\mathbb{S}}, \tilde{s}^0, \tilde{\mathcal{R}}, \tilde{\mathcal{T}}, \tilde{T} \rangle$, where

• $i = \iota = 0$.
• $\tilde{\mathbb{A}} = \{\tilde{a} \mid \tilde{a} : \mathbb{H}_\iota(h_{\text{pub}}) \to \mathbb{A},\ h_{\text{pub}} \in \mathbb{H}_{\text{pub}}\}$ is the set of prescriptions.
• $\tilde{\mathbb{S}} = \cup_{h_{\text{pub}}} \Delta(\mathbb{H}(h_{\text{pub}}))$ is the set of public belief states (PBSs).
• $\tilde{s}^0 = \mu$ is the initial PBS.
• $\tilde{\mathcal{R}} : \tilde{s}, \tilde{a} \mapsto r$, where $r = \mathbb{E}_{H \sim \tilde{s}} \mathcal{R}(H, \tilde{a}(H_\iota))$.
• $\tilde{T} = T$.

Nayyar et al. (2013) showed that optimal deterministic policies in the PuB-MDP correspond with optimal joint policies for the common-payoff game. Indeed, for the matching pennies game described in Figure 2, we can see that the PuB-MDP perspective resolves the backward dependence problem because the red player observes the blue player's decision rule. If the blue player plays heads with probability $p$, the red player can achieve a value of $\max(p, 1-p)$ by best responding. By backward induction, the blue player can then determine that it is optimal to use $p = 0$ or $p = 1$. Thus, the players can arrive at an optimal joint policy of the original game. For a more detailed discussion on the PuB-MDP, see, e.g., Sokota (2020) and Sokota et al. (2021).
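The backward induction argument for cooperative matching pennies can be sketched in a few lines. This is only an illustrative sketch: the function names and the grid over $p$ are assumptions introduced here, and the team reward is assumed to be 1 for a match and 0 otherwise, which is consistent with the value $\max(p, 1-p)$ stated above.

```python
def red_value(p_heads: float) -> float:
    """Red's best-response value once blue's decision rule p is public.

    Assumes (for illustration) a team reward of 1 on a match and 0 otherwise.
    """
    return max(p_heads, 1.0 - p_heads)


def best_blue_rules(grid_size: int = 101):
    """Backward induction for blue over a grid of announced decision rules."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    best = max(red_value(p) for p in grid)
    # Keep every announced rule that attains the optimal team value.
    return [p for p in grid if red_value(p) == best], best


if __name__ == "__main__":
    print(best_blue_rules())
```

On this grid, only the deterministic rules $p = 0$ and $p = 1$ attain the optimal value, matching the conclusion in the text.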

4. THE PUBLIC BELIEF ALTERNATING MARKOV GAME

Here, we introduce a construction analogous to the PuB-MDP that we call the PuB-AMG. Note that, while the name PuB-AMG is novel to this work, the idea already exists in the literature (Brown et al., 2020; Buffet et al., 2020; Delage et al., 2021). Let $\langle \mathbb{A}, [\mathbb{O}_i], \mathbb{O}_{\text{pub}}, [\mathbb{H}_i], \mathbb{H}_{\text{pub}}, \mathbb{H}, \mu, [\mathcal{O}_i], \mathcal{O}_{\text{pub}}, [\mathcal{R}_i], \mathcal{T}, T \rangle$ be a finite-horizon 2p0s sequential game. Then we define the associated PuB-AMG as the following finite-horizon fully observable sequential game $\langle \tilde{\mathbb{A}}, \tilde{\mathbb{S}}, \tilde{s}^0, \tilde{\mathcal{R}}, \tilde{\mathcal{T}}, \tilde{T} \rangle$, where

• $i$ ranges from $0$ to $1$ and $\iota \in \{0, 1\}$ is the acting player.
• $\tilde{\mathbb{A}} = \{\tilde{a} \mid \tilde{a} : \mathbb{H}_\iota(h_{\text{pub}}) \to \Delta(\mathbb{A}),\ h_{\text{pub}} \in \mathbb{H}_{\text{pub}}\}$ is the set of public decision rules (or decision rules, for short). Note that public decision rules differ from prescriptions in that they map to a distribution over actions, rather than an action.
• $\tilde{\mathbb{S}}$, $\tilde{s}^0$, $\tilde{\mathcal{R}}$, $\tilde{\mathcal{T}}$, and $\tilde{T}$ are defined in the same fashion as in the PuB-MDP.

The Correspondence Mapping As alluded to earlier, a Nash equilibrium in the PuB-AMG may be undesirable because it does not necessarily correspond to a Nash equilibrium in the original game. Here, we make precise what we mean by correspond by defining a correspondence function $\Pi_{\downarrow}$ that maps public belief (joint) policies to joint policies of the original game. Given a PuB-AMG policy $\tilde{\pi}$, $\Pi_{\downarrow}(\tilde{\pi})$ is the joint policy that, for each AOH $h_\iota$, plays actions with the probability that $\tilde{\pi}$ would at $h_\iota$, assuming that $h_\iota$ was reached using $\tilde{\pi}$. (See Section A for a more rigorous definition.) Importantly, $\Pi_{\downarrow}(\tilde{\pi})_i$ can be implemented in practice by running $\tilde{\pi}_i$ under the assumption that the opponent is playing according to $\tilde{\pi}_{-i}$.

Figure 3: A perturbed variant of rock-paper-scissors.

The Non-Correspondence Problem We can now make the non-correspondence problem precise. To illustrate, we show the perturbed variant of rock-paper-scissors described by Brown et al. (2020) in Figure 3. In the game, the payouts are doubled if either player plays scissors. The unique Nash equilibrium of the game is $(R, P, S) \mapsto (0.4, 0.4, 0.2)$. As before, the red player can compute the associated value for each of the blue player's decision rules. Thus, the blue player can determine that the Nash equilibrium policy maximizes its value. It is at this point that the non-correspondence problem becomes apparent. Because the red player is conditioning on the blue player's decision rule, it achieves the optimal value by playing any best response to the blue player. In the perturbed rock-paper-scissors game, all policies are best responses to the Nash equilibrium. Thus, there is nothing constraining the red player to the Nash equilibrium policy of the original game. A similar argument, detailed in Section B.1, leads to the following disappointing result.

Proposition 1. A PuB-AMG Nash equilibrium $\tilde{\pi}$ may correspond with a joint policy $\Pi_{\downarrow}(\tilde{\pi})$ that is maximally exploitable.

At an intuitive level, the non-correspondence problem arises because there is an important distinction between the public belief game and the original game. Specifically, in the public belief game, players acting earlier are forced to reveal their decision rules to players acting later. As a result, later acting players are able to "slack off" without losing any value because the earlier acting players cannot deviate to exploit them. In common-payoff games, this is a non-issue because the interests of every player are aligned. However, in 2p0s games, where there are adversarial interests, this distinction changes the strategic nature of the game in a more fundamental sense.
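The indifference that drives the non-correspondence problem can be checked numerically. The sketch below is illustrative only: the payoff matrix is reconstructed from the verbal description of the game (standard rock-paper-scissors with payouts doubled whenever either player plays scissors) and is written from the blue player's perspective, and the helper names are assumptions introduced here.

```python
# Blue (row player) payoffs, actions ordered (R, P, S).
# Reconstructed from the description: payouts double if either player plays S.
A = [
    [0, -1, 2],
    [1, 0, -2],
    [-2, 2, 0],
]


def column_values(x):
    """Blue's expected payoff against each pure red action when blue plays x."""
    return [sum(x[i] * A[i][j] for i in range(3)) for j in range(3)]


def row_values(y):
    """Blue's expected payoff for each pure blue action when red plays y."""
    return [sum(A[i][j] * y[j] for j in range(3)) for i in range(3)]


def exploitability(x, y):
    """Average best-response gain of the two players, blue being the maximizer."""
    return (-min(column_values(x)) + max(row_values(y))) / 2


nash = [0.4, 0.4, 0.2]
pure_rock = [1.0, 0.0, 0.0]

if __name__ == "__main__":
    # Every red action earns the same value against the announced Nash rule.
    print(column_values(nash))
    # Pure rock is therefore optimal in the PuB-AMG, yet exploitable in the original game.
    print(exploitability(nash, pure_rock))
```

Against the announced Nash decision rule, every red action yields the same value, so red loses nothing in the PuB-AMG by playing pure rock. In the original game, however, blue can deviate to paper, which is why the joint policy has positive exploitability.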

5. CORRESPONDENCE OF UNIQUENESS-GUARANTEEING EQUILIBRIA

We will now show that the non-correspondence problem does not exist for minimax objectives that guarantee unique equilibria in subgames of the original game. The class of minimax objectives we consider generalizes the expected return minimax objective in that it includes objectives that may have dependence on policies beyond the actions they select.

Definition 5.1. We use the term regularized minimax objective (or objective for short) to refer to mappings of the form
$$J : \pi_0, \pi_1 \mapsto \mathbb{E}\left[\sum_{t=0}^{T-1} \mathcal{R}(H^t, A^t, \pi(H^t_\iota)) \,\middle|\, \pi_0, \pi_1\right],$$
where $\mathcal{R}$ is a real-valued function.

Definition 5.2. Given an objective $J$ induced by $\mathcal{R}$ for some game, we can construct an objective $\tilde{J}$ for the associated public belief state game that is equivalent to $J$ in the sense that $\tilde{J}(\tilde{\pi}) = J(\Pi_{\downarrow}(\tilde{\pi}))$ by defining
$$\tilde{J} : \tilde{\pi}_0, \tilde{\pi}_1 \mapsto \mathbb{E}\left[\sum_{t=0}^{T-1} \tilde{\mathcal{R}}(\tilde{S}^t, \tilde{A}^t) \,\middle|\, \tilde{\pi}_0, \tilde{\pi}_1\right],$$
where $\tilde{\mathcal{R}} : (\tilde{s}, \tilde{a}) \mapsto \mathbb{E}_{H \sim \tilde{s}}\, \mathbb{E}_{A \sim \tilde{a}(H_\iota)}\, \mathcal{R}(H, A, \tilde{a}(H_\iota))$.

Next we formalize what we mean by uniqueness guaranteeing.

Definition 5.3. For a particular game, we say a minimax objective $J$ is uniqueness guaranteeing if, for any subgame, $\max_{\pi_0} \min_{\pi_1} J(\pi_0, \pi_1)$ is guaranteed to have a unique solution $(\pi^*_0, \pi^*_1)$.

We can now state our main result.

Theorem 1. If $\tilde{\pi}$ is an equilibrium of a PuB-AMG under a uniqueness-guaranteeing objective, then its corresponding joint policy $\Pi_{\downarrow}(\tilde{\pi})$ is the equilibrium in the original game under the same uniqueness-guaranteeing objective.

Proof. (Sketch) The first decision rule of any PuB-AMG equilibrium must correspond to the first decision rule of an equilibrium of the original game. Furthermore, if the objective is uniqueness guaranteeing, subgame equilibria must be restrictions of the equilibrium of the whole game. Thus, by forward induction, PuB-AMG equilibria must correspond to the equilibrium of the original game.

We detail the proof for Theorem 1 in Section B.2. Due to recent work, Theorem 1 can be generalized to the entire class of 2t0s games.

Corollary 1. Computing team-correlated equilibria of 2t0s games under a uniqueness-guaranteeing objective can be reduced to computing an equilibrium of a PuB-AMG under the same uniqueness-guaranteeing objective.

Proof. This follows immediately from chaining the results of Carminati et al. (2022a), Zhang et al. (2022), and Carminati et al. (2022b), who provide a reduction from 2t0s games to 2p0s games via intra-team public policy announcements, with Theorem 1.
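The lifted reward of Definition 5.2 is a nested expectation over a public belief state and a decision rule, which can be sketched as follows. The dictionary-based representation, the function names, and the entropy-bonus reward used in the example are all assumptions made for illustration, not part of the formal development.

```python
import math


def entropy(dist):
    """Shannon entropy of a distribution given as {action: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def lifted_reward(belief, decision_rule, aoh_of, reward):
    """Expected reward over histories H ~ belief and actions A ~ decision_rule(H_iota).

    belief:        {history: probability}, a public belief state
    decision_rule: {aoh: {action: probability}}, a public decision rule
    aoh_of:        maps a history to the acting player's AOH
    reward:        reward(h, a, delta), which may depend on the action distribution
    """
    total = 0.0
    for h, p_h in belief.items():
        delta = decision_rule[aoh_of(h)]
        for a, p_a in delta.items():
            total += p_h * p_a * reward(h, a, delta)
    return total


if __name__ == "__main__":
    # Hypothetical two-history belief in which each history is its own AOH.
    belief = {"h_a": 0.5, "h_b": 0.5}
    rule = {"h_a": {"x": 1.0}, "h_b": {"x": 0.5, "y": 0.5}}

    def reward(h, a, delta, alpha=1.0):
        # Hypothetical base reward plus an entropy bonus on the action distribution.
        return (1.0 if a == "x" else 0.0) + alpha * entropy(delta)

    print(lifted_reward(belief, rule, lambda h: h, reward))
```

Because the reward receives the whole action distribution as its third argument, the same routine covers both the unregularized case and policy-dependent regularizers.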

6. SUFFICIENT CONDITIONS FOR UNIQUENESS GUARANTEEING

Theorem 1 tells us that uniqueness guaranteeing is a sufficient condition to ameliorate the non-correspondence problem. We now discuss objectives that are uniqueness guaranteeing and general sufficient conditions for guaranteeing uniqueness.

Definition 6.1. We call the objective $J$ induced by
$$\mathcal{R} : (h, a, \delta) \mapsto \begin{cases} R(h, a) - \alpha\, \mathrm{KL}(\delta, \rho(h_\iota)) & \iota = 0 \\ R(h, a) + \alpha\, \mathrm{KL}(\delta, \rho(h_\iota)) & \iota = 1, \end{cases}$$
for some reference policy $\rho$, a MiniMaxKL objective.

Definition 6.2. We call the objective $J$ induced by
$$\mathcal{R} : (h, a, \delta) \mapsto \begin{cases} R(h, a) + \alpha\, \mathrm{H}(\delta) & \iota = 0 \\ R(h, a) - \alpha\, \mathrm{H}(\delta) & \iota = 1 \end{cases}$$
a MiniMaxEnt objective.

Remark 1. MiniMaxEnt is the special case of MiniMaxKL in which $\rho$ is uniform.

To our knowledge, MiniMaxKL objectives were first introduced by Perolat et al. (2021), who showed the following result using Lyapunov techniques.

Theorem 2. (Perolat et al., 2021) MiniMaxKL objectives are uniqueness guaranteeing for interior $\rho$.

While the MiniMaxKL equilibrium is likely the most useful uniqueness-guaranteeing equilibrium concept, it is conceivable that other uniqueness-guaranteeing concepts may be useful. Thus, we provide a generalization below.

Theorem 3. Consider an objective $J$ induced by
$$\mathcal{R} : (h, a, \delta) \mapsto \begin{cases} R(h, a) - \psi(\delta, h_\iota) & \iota = 0 \\ R(h, a) + \psi(\delta, h_\iota) & \iota = 1 \end{cases}$$
and define a policy greedification function $g : [-M, M]^{|\mathbb{A}|} \times \mathbb{H}_\iota \to \Delta(\mathbb{A})$, where $M \in \mathbb{R}$ is the maximum of the absolute values of the expected returns of $J$ and where
$$g : q, h_\iota \mapsto \arg\max_{\delta \in \Delta(\mathbb{A})} \langle \delta, q \rangle - \psi(\delta, h_\iota).$$
For each AOH $h_\iota$ at which a player acts, $g$ maps possible regularized action values $q$ to the policy that is greedy with respect to the regularized objective under those regularized action values. If, for all $h_\iota$, $\psi(\cdot, h_\iota)$ is continuous and $g(\cdot, h_\iota)$ is i) well defined, ii) continuous, and iii) has an interior image, then the objective $J$ is uniqueness guaranteeing.

Proof. (Sketch) Consider the following facts:

• Because of our assumptions on $\psi$ and $g$, an equilibrium must exist.
• Because $g$ has an interior image, the best response to an equilibrium must be unique.
• Because the game is two player zero sum, there cannot exist incompatible equilibria.

These facts can only be simultaneously satisfied if there is a unique equilibrium.

We detail the proof for Theorem 3 in Section B.3.

Remark 2. The premises of Theorem 3 are satisfied if $\psi(\cdot, h_\iota)$ is bounded and is strictly convex and differentiable on its interior with $\lim_{\delta \to \delta'} \|\nabla_\delta \psi(\delta, h_\iota)\| = +\infty$ for $\delta'$ on the boundary of $\Delta(\mathbb{A})$.

Remark 3. One example of an objective covered by Theorem 3, but not by Theorem 2, is that which is induced by setting $\psi(\cdot, h_\iota)$ to a sum of a reverse KL divergence to an interior point and a bounded differentiable convex function.
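For the KL regularizer $\psi(\delta, h_\iota) = \alpha\, \mathrm{KL}(\delta, \rho(h_\iota))$, the greedification function of Theorem 3 has the standard closed form $g(q)_a \propto \rho_a \exp(q_a / \alpha)$. The sketch below, whose function names are assumptions introduced here, evaluates this closed form at a single AOH and lets one check that the output is interior whenever $\rho$ is interior.

```python
import math


def kl_greedy(q, rho, alpha):
    """Maximizer of <delta, q> - alpha * KL(delta, rho) over the simplex.

    Uses the closed form delta_a proportional to rho_a * exp(q_a / alpha),
    computed in log space for numerical stability. Assumes rho is interior.
    """
    logits = [math.log(r) + qa / alpha for qa, r in zip(q, rho)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]


def regularized_value(delta, q, rho, alpha):
    """The objective <delta, q> - alpha * KL(delta, rho) that kl_greedy maximizes."""
    kl = sum(d * math.log(d / r) for d, r in zip(delta, rho) if d > 0)
    return sum(d * qa for d, qa in zip(delta, q)) - alpha * kl


if __name__ == "__main__":
    rho = [0.5, 0.3, 0.2]
    q = [1.0, 0.0, -1.0]
    delta = kl_greedy(q, rho, alpha=0.5)
    print(delta, regularized_value(delta, q, rho, alpha=0.5))
```

Every coordinate of the output is strictly positive for finite $q$, which is the interior-image property that the proof sketch relies on; with constant $q$, the greedy policy reduces to $\rho$ itself.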

7. DISCUSSION

In the previous section, we showed that computing certain regularized equilibria in 2p0s games can be reduced to perfect information problems. This may give the impression that it is now possible to apply perfect information algorithms out-of-the-box to solve imperfect information problems. While this is technically true, it is not a good way to view the value of our reduction. Indeed, because the PuB-AMG has an action space of dimensionality on the order of $|\mathbb{A}|\,|\mathbb{H}_\iota(h_{\text{pub}})|$, naively pursuing such an approach would quickly become intractable as the size of the game increases.

Instead, we see two main ways to approach solving regularized PuB-AMGs. The first is to adapt heuristic search value iteration (Smith & Simmons, 2004) into a tabular regularized PuB-AMG solver. Encouragingly, this has already been done for PuB-MDPs (Dibangoye et al., 2013b) and for unregularized PuB-AMGs (Horák & Bošanský, 2019; Buffet et al., 2020; Delage et al., 2021). The second is to use the regularized PuB-AMG as a building block for expert iteration with function approximation (Anthony et al., 2017). This approach would look almost identical to ReBeL (Brown et al., 2020) but have a few key differences: i) it would use a regularized objective, rather than an unregularized one as ReBeL does; ii) it would use the beliefs induced by its own policy at test time, rather than the fictitious beliefs that ReBeL uses; iii) it would (optionally) be able to perform re-planning (e.g., wherein a multi-ply search is only used to make the immediate decision), whereas ReBeL must play its search policy until the end of the subgame that was searched over.

8. EXPERIMENTS

To offer further evidence for our results, we perform two experiments in which we naively solve small PuB-AMGs in tabular form using magnetic mirror descent (Sokota et al., 2022a). We show the results for perturbed rock-paper-scissors in Figure 4 and include results for Kuhn poker, as well as the details of our solving procedures, in Section C.

Figure 4: Results for perturbed rock-paper-scissors. Panels, from left to right: PuB-AMG exploitability, PuB-AMG regularized exploitability, exploitability, and regularized exploitability, each plotted against iterations. On the far left, we show exploitability in the PuB-AMG. The iterates from the unregularized objective (blue) and the objective with annealed regularization (orange) trend toward zero. The iterates from the objectives with constant regularization converge to a constant positive exploitability. On the middle left, we show the regularized exploitability (i.e., exploitability under the regularized objective) in the PuB-AMG under the objective associated with each iterate. We observe that all objectives induce iterates that converge to zero. On the middle right, we show the exploitability in the original game. Because the non-correspondence problem exists for the second moving player, the exploitabilities of the iterates from the unregularized objective (blue) remain high, even though exploitability is going to zero in the PuB-AMG. The objectives with fixed regularization (green, red) induce iterates that converge to lower, but non-zero, exploitability values. The objective with annealed regularization (orange) induces iterates that converge to zero in exploitability. On the far right, we show the regularized exploitability in the original game. We observe that, in accordance with Theorem 1, the approaches with non-zero regularization that converge to zero regularized exploitability in the PuB-AMG also converge to zero regularized exploitability in the original game. In contrast, the unregularized approach does not converge, despite converging in the PuB-AMG.
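To give a feel for what a regularized equilibrium looks like, the sketch below runs a magnetic-mirror-descent-style update with a uniform magnet on the normal form of perturbed rock-paper-scissors. It is not the experimental procedure of this section, which solves PuB-AMGs and is detailed in Section C. The payoff matrix is reconstructed from the verbal description of the game, the closed-form step is an assumption based on our understanding of Sokota et al. (2022a), and the step size, regularization strength, and initial policies are arbitrary illustrative choices.

```python
import math

# Blue (row player) payoffs, actions ordered (R, P, S); reconstructed from the text.
A = [[0, -1, 2], [1, 0, -2], [-2, 2, 0]]


def softmax(logits):
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]


def blue_values(y):
    """Blue's action values against red's policy y."""
    return [sum(A[i][j] * y[j] for j in range(3)) for i in range(3)]


def red_values(x):
    """Red's action values against blue's policy x (red receives the negated payoff)."""
    return [-sum(x[i] * A[i][j] for i in range(3)) for j in range(3)]


def mmd_step(policy, q, alpha, eta):
    """Assumed closed form with a uniform magnet:
    new policy proportional to (policy * exp(eta * q)) ** (1 / (1 + alpha * eta))."""
    logits = [(math.log(p) + eta * qa) / (1.0 + alpha * eta) for p, qa in zip(policy, q)]
    return softmax(logits)


def solve(alpha=0.5, eta=0.05, iters=2000):
    x, y = [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]  # arbitrary interior starting points
    for _ in range(iters):
        qx, qy = blue_values(y), red_values(x)  # simultaneous update
        x, y = mmd_step(x, qx, alpha, eta), mmd_step(y, qy, alpha, eta)
    return x, y


def fixed_point_gap(x, y, alpha):
    """Distance from the entropy-regularized best-response fixed point."""
    bx = softmax([q / alpha for q in blue_values(y)])
    by = softmax([q / alpha for q in red_values(x)])
    return max(abs(a - b) for a, b in zip(x + y, bx + by))


if __name__ == "__main__":
    x, y = solve()
    print(x, y, fixed_point_gap(x, y, 0.5))
```

At a fixed point of this update, each player's policy is the softmax of its own action values divided by $\alpha$, i.e., each player is playing its unique regularized best response. This uniqueness is what removes the slack available to the second-moving player in the unregularized PuB-AMG.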

9. RELATED WORK

Public Belief States in Common-Payoff Games In the sense of providing reductions for multiagent problems using PBSs, our work is similar to those of Nayyar et al. (2013), Dibangoye et al. (2013b), and Oliehoek (2013). As discussed in the background, Nayyar et al. (2013) provided a reduction from solving common-payoff games to solving belief MDPs; independently, Dibangoye et al. (2013b) and Oliehoek (2013) discovered similar reductions. These ideas have been leveraged in a large body of work in decentralized control literature (Lessard & Nayyar, 2013; Nayyar et al., 2014; Arabneydi & Mahajan, 2014; Ouyang et al., 2015; Vasconcelos & Martins, 2016; Tavafoghi et al., 2016; Afshari & Mahajan, 2018; Gagrani & Nayyar, 2018; Tavafoghi et al., 2018; Zhang et al., 2019; Gupta, 2021) and machine learning literature (Dibangoye et al., 2013a; MacDermed & Isbell, 2013; Dibangoye et al., 2014a; b; Dibangoye & Buffet, 2018; Foerster et al., 2019; Lerer et al., 2020; Sokota et al., 2021; Fickinger et al., 2021; Sokota et al., 2022b; Kao et al., 2022). Use cases include game solving (Dibangoye et al., 2013b), expert iteration (Sokota et al., 2021), and decision-time planning (Lerer et al., 2020; Fickinger et al., 2021; Sokota et al., 2022b).

Public Belief States in Two-Player Zero-Sum Games PBSs have also been used in many works in the context of 2p0s games. For our purposes, we taxonomize these into those concerned with solving the PuB-AMG (Wiggers et al., 2016; Horák & Bošanský, 2019; Buffet et al., 2020; Delage et al., 2021) and those concerned with sound decision-time planning and expert iteration (Burch et al., 2014; Moravcik et al., 2016; Brown & Sandholm, 2017a; b; Moravčík et al., 2017; Brown et al., 2018; Serrino et al., 2019; Brown et al., 2020; Schmid et al., 2021). The former group is concerned with analyzing the structure of the PuB-AMG and using HSVI (Smith & Simmons, 2004) to solve the PuB-AMG.
Our work is complementary in the sense that it shows that using an approach such as HSVI to solve a regularized PuB-AMG would yield a regularized equilibrium in the original game. The latter group can be broken down into two subgroups: those that use opt-out values to circumvent the non-correspondence problem (Brown & Sandholm, 2017a; b; Moravčík et al., 2017; Schmid et al., 2021) and those that use no-regret learning to circumvent the non-correspondence problem (Brown et al., 2020). Both possess substantial downsides. For the opt-out value approach: i) the policy and value are discontinuous functions of the opt-out values, and ii) the opt-out values must be approximated separately from self play. For the no-regret learning approach: i) the search policy must be played for the entire subgame that was searched over (i.e., re-planning is not allowed), ii) the search algorithm must be no regret, and iii) the policy is a discontinuous function of the PBS. In contrast, decision-time planning using a regularized objective in the PuB-AMG involves no opt-out values, involves no value discontinuities (as is shown in Section D), allows for re-planning, and is search algorithm agnostic.

MiniMaxKL Objectives in Two-Player Zero-Sum Games As discussed, prior to this work, Perolat et al. (2021) made use of MiniMaxEnt and MiniMaxKL objectives; they have also been used in more recent works (Cen et al., 2021; Zeng et al., 2022; Sokota et al., 2022a; Perolat et al., 2022). Our use case for MiniMaxKL objectives (eliminating the non-correspondence problem) differs substantially from that of these works (inducing the last iterate convergence of model-free algorithms).
Public Belief States in Two-Team Zero-Sum Games In the sense of providing reductions for multi-agent problems using PBSs, our work is similar to a recent body of literature (Carminati et al., 2022a; Zhang et al., 2022; Carminati et al., 2022b) showing that solving 2t0s games can be reduced to solving a 2p0s game by using intra-team policy announcements. As articulated by Corollary 1, combining our results with these yields a reduction from computing regularized team correlated equilibria in 2t0s games to computing regularized equilibria in PuB-AMGs.

Stackelberg Games Public belief games with two time steps are closely related to Stackelberg games (von Stackelberg, 1934; Schelling, 1960; Gibbons, 1992). A Stackelberg game is one in which a distinguished leader publicly commits to a strategy and a follower best responds to it, resulting in a bilevel optimization problem. As with a public belief game, when a Stackelberg game is two player zero sum, Stackelberg equilibrium coincides with Nash equilibrium for the leader, but the follower's best response is generally highly exploitable in the game without public commitments. While there exist tie-breaking procedures in Stackelberg literature (e.g., strong or weak Stackelberg equilibrium), they do not resolve the non-correspondence issue.

10. CONCLUSION AND FUTURE WORK

In this work, we provided a reduction from computing regularized equilibria of 2p0s games to computing regularized equilibria of PuB-AMGs. We see this contribution as resolving an important gap in the literature between common-payoff games and 2p0s games. We see three impactful directions for future work. The first involves comparing a high performance implementation of a regularized-objective-in-the-PuB-AMG approach to expert iteration (Anthony et al., 2017) to those of existing approaches (Brown et al., 2020; Schmid et al., 2021); while we have shown here that a regularized-objective-in-the-PuB-AMG approach possesses favorable properties in comparison to ReBeL (Brown et al., 2020) and Player of Games (Schmid et al., 2021), verifying that these advantages manifest in practice would be a valuable contribution. Second, by providing a simpler approach to working with PBSs in 2p0s games, our work provides further motivation for developing new approaches for approximating PBSs at scale; while Sokota et al. (2022b) recently made progress in this direction by showing that fine-tuning can effectively approximate PBSs, amortized approximation of PBSs remains an open problem. Third, we believe it may be possible to extend some weaker form of the results from our work to general sum settings.

A DEFINITIONS

First, we formalize our definition of the correspondence mapping $\Pi_{\downarrow}$ in Algorithm 1 below.

Algorithm 1 Correspondence Mapping
procedure $\Pi_{\downarrow}(\tilde{\pi})$
    queue ← $[\tilde{s}^0]$
    $\pi$ ← {}
    while len(queue) > 0 do
        $\tilde{s}$ ← queue.pop()
        $\tilde{a}$ ← $\tilde{\pi}(\tilde{s})$    ▷ Assume $\tilde{\pi}$ is deterministic for reachable $\tilde{s}$.
        for $h \in \mathrm{supp}(\tilde{s})$ do
            $\pi(h_\iota) = \tilde{a}(h_\iota)$
        end for
        for $\tilde{s}' \in \mathrm{supp}(\tilde{\mathcal{T}}(\tilde{s}, \tilde{a}))$ do    ▷ Continue along
        [...]

Algorithm 2 Canonical Choice Mapping
        [...]
        for $h \in \mathrm{supp}(\tilde{s})$ do
            $\tilde{a}(h_\iota) = \pi(h_\iota)$    ▷ Do what $\pi$ does.
        [...]

In short, Algorithm 2 yields a PuB-AMG policy in which the agents play according to $\pi$ irrespective of the public belief. Therefore, we have that $R(\pi) = R(\Pi_{\uparrow}(\pi))$. Note that, in contrast to the correspondence mapping, the canonical choice mapping is invariant to opponent policy. Thus, we also allow $\Pi_{\uparrow}$ to be applied directly to individual player policies. We also introduce some additional definitions.

Definition A.1. For a minimax objective $J$, the value of the game under $J$ is
$$\max_{\pi'_0} \min_{\pi'_1} J(\pi'_0, \pi'_1) = \min_{\pi'_1} \max_{\pi'_0} J(\pi'_0, \pi'_1).$$

Remark 4. Uniqueness-guaranteeing objectives guarantee a well-defined value. This follows immediately from the fact that both players can guarantee the unique equilibrium value.

Definition A.2. For a minimax objective $J$, the best response value to $\pi_0$ under $J$ is $\min_{\pi'_1} J(\pi_0, \pi'_1)$; analogously, the best response value to $\pi_1$ under $J$ is $\max_{\pi'_0} J(\pi'_0, \pi_1)$. We denote the best response to $\pi_i$ as $\mathrm{BR}(\pi_i)$. A policy is part of a Nash equilibrium if the best response value to it is equal to the value of the game.

Definition A.3. For a minimax objective $J$, the exploitability of $\pi$ under $J$ is
$$\frac{-\min_{\pi'_1} J(\pi_0, \pi'_1) + \max_{\pi'_0} J(\pi'_0, \pi_1)}{2}.$$
A joint policy is a Nash equilibrium if it has exploitability zero.

Definition A.4. For a minimax objective $J$ induced by
$$\mathcal{R} : (h, a, \delta) \mapsto \begin{cases} R(h, a) - \psi(\delta, h_\iota) & \iota = 0 \\ R(h, a) + \psi(\delta, h_\iota) & \iota = 1, \end{cases}$$
the action value for action $a$ at AOH $h^t_\iota$ under joint policy $\pi$ is
$$Q(h^t_\iota, a) = (-1)^{\mathbb{I}[\iota = 1]}\, \mathbb{E}\left[\mathcal{R}(H^t, A^t, a' \mapsto \mathbb{I}[a' = a]) + \sum_{t' > t}^{T} \mathcal{R}(H^{t'}, A^{t'}, \pi(H^{t'}_\iota)) \,\middle|\, \pi, h^t_\iota, a^t = a\right].$$
In words, it is the expected future value to the acting player for taking $a$ at $h^t_\iota$, assuming that both players have played according to $\pi$ up until now and will continue to play according to $\pi$ hereinafter.

B THEORY

We now detail the proofs of our theoretical results.

Proposition 1. A PuB-AMG Nash equilibrium $\tilde\pi$ may correspond with a joint policy $\Pi^{\downarrow}(\tilde\pi)$ that is maximally exploitable.

Proof. We show that this worst case can be realized in the 2p0s rigged adversarial matching pennies game depicted in Figure 5. The game starts with two options for the red player: it can either decide to make the game fair or to rig the game. Then, without having observed the red player's decision, the blue player decides whether to opt out of the game altogether, in which case both players receive a payout of 0, or to play adversarial matching pennies. If the blue player opts in and the game is rigged, the red player receives a payout of 1 independent of the blue player's selection. If the blue player opts in and the game is not rigged, the blue player receives a payout of 1 if the players select the same side of the coin; otherwise, if the players select opposite sides of the coin, the red player receives a payout of 1.

In this game, the blue player's only Nash equilibrium strategy is to opt out with probability one. The red player's Nash equilibrium strategies require at least one of i) rigging the game with probability one and ii) mixing 50-50 between tails and heads.

Now, consider the following PuB-AMG policy, where the superscript denotes the time step within the game:

• $\tilde\pi^0(\emptyset) = (\text{Fair}, \text{Unfair}) \mapsto (1, 0)$.
• $\tilde\pi^1(\tilde\pi^0) = (\text{Tails}, \text{Heads}, \text{Out}) \mapsto (1/2, 1/2, 0)$ if $\tilde\pi^0(\text{Fair}) = 1$, and $(\text{Tails}, \text{Heads}, \text{Out}) \mapsto (0, 0, 1)$ otherwise.
• $\tilde\pi^2(\tilde\pi^0, \tilde\pi^1) = (\text{Tails}, \text{Heads}) \mapsto (0, 1)$ if $\tilde\pi^1(\text{Tails}) \geq 1/2$, and $(\text{Tails}, \text{Heads}) \mapsto (1, 0)$ otherwise.

We claim that $\tilde\pi$ is a PuB-AMG Nash equilibrium. To see this, first consider that the expected return is 0: the red player always makes the game fair, the blue player opts in and mixes evenly between heads and tails, and the red player always selects heads. Next, consider that the red player has no incentive to deviate to an unfair game, because the blue player will then opt out, yielding an expected return of zero. Also consider that the blue player has no incentive to place additional mass on opting out, as doing so yields an expected return of zero. Furthermore, the blue player has no incentive to select a different mixture of heads and tails, as doing so will decrease its expected return, since the red player best responds at the final time step. Lastly, consider that the red player is best responding at the final time step and, therefore, has no incentive to deviate.

Then, consider that the corresponding policy $\pi = \Pi^{\downarrow}(\tilde\pi)$ is as follows:

• $\pi^0 : (\text{Fair}, \text{Unfair}) \mapsto (1, 0)$.
• $\pi^1 : (\text{Tails}, \text{Heads}, \text{Out}) \mapsto (1/2, 1/2, 0)$.
• $\pi^2 : (\text{Tails}, \text{Heads}) \mapsto (0, 1)$.

We claim that this policy is maximally exploitable. To see this, consider that a red player that always rigs the game achieves an expected return of one against the blue player's policy, and consider that a blue player that always selects heads achieves an expected return of one against the red player's policy.
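The exploitability claim for the projected policy can be checked numerically. The following is a minimal sketch, assuming payoffs are expressed from the red player's perspective (a blue win counts as -1 for red) and that pure strategies suffice for best responses; the function and variable names are illustrative and do not come from the paper.

```python
from itertools import product

RED_FIRST = ("Fair", "Unfair")
BLUE = ("Tails", "Heads", "Out")
RED_COIN = ("Tails", "Heads")


def red_payoff(first, blue, coin):
    """Payoff to red (assumed convention); blue receives the negative."""
    if blue == "Out":
        return 0.0          # blue opts out: both players receive 0
    if first == "Unfair":
        return 1.0          # rigged game: red wins regardless of the coins
    return -1.0 if blue == coin else 1.0  # fair game: blue wins on a match


def expected_red_return(p_first, p_blue, p_coin):
    """Expected return to red under independent mixed decision rules."""
    return sum(
        p_first[f] * p_blue[b] * p_coin[c] * red_payoff(f, b, c)
        for f, b, c in product(RED_FIRST, BLUE, RED_COIN)
    )


def point_mass(actions, chosen):
    """Deterministic decision rule placing all mass on one action."""
    return {a: 1.0 if a == chosen else 0.0 for a in actions}


# Projected policy from the proof of Proposition 1.
red_first = {"Fair": 1.0, "Unfair": 0.0}
blue = {"Tails": 0.5, "Heads": 0.5, "Out": 0.0}
red_coin = {"Tails": 0.0, "Heads": 1.0}

on_policy_value = expected_red_return(red_first, blue, red_coin)

# Best response value for red against blue's projected policy.
red_br_value = max(
    expected_red_return(point_mass(RED_FIRST, f), blue, point_mass(RED_COIN, c))
    for f, c in product(RED_FIRST, RED_COIN)
)

# Best response value for blue (negated red return) against red's policy.
blue_br_value = max(
    -expected_red_return(red_first, point_mass(BLUE, b), red_coin)
    for b in BLUE
)
```

Under these assumptions, each player can secure a return of one against the other's projected policy, even though the on-policy return is zero.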

B.2 THEOREM 1: CORRESPONDENCE OF UNIQUENESS-GUARANTEEING EQUILIBRIA

To prove Theorem 1, we first require some lemmas.

Lemma 1. The best response value to $\pi_i$ in the original game is equal to the best response value of $\Pi^{\uparrow}(\pi_i)$ in the PuB-AMG.

Proof. This follows because player $-i$ has no mechanism to exploit $\Pi^{\uparrow}(\pi_i)$ beyond that of the original game, since $\Pi^{\uparrow}(\pi_i)$ ignores belief information. More formally, consider

$$\begin{aligned}
\min_{\tilde\pi'_1} J(\Pi^{\uparrow}(\pi_0), \tilde\pi'_1)
&= J(\Pi^{\uparrow}(\pi_0), \mathrm{BR}(\Pi^{\uparrow}(\pi_0))) \\
&= J(\Pi^{\downarrow}(\Pi^{\uparrow}(\pi_0), \mathrm{BR}(\Pi^{\uparrow}(\pi_0)))) \\
&= J(\Pi^{\downarrow}(\Pi^{\uparrow}(\pi_0), \mathrm{BR}(\Pi^{\uparrow}(\pi_0)))_0, \; \Pi^{\downarrow}(\Pi^{\uparrow}(\pi_0), \mathrm{BR}(\Pi^{\uparrow}(\pi_0)))_1) \\
&= J(\pi_0, \; \Pi^{\downarrow}(\Pi^{\uparrow}(\pi_0), \mathrm{BR}(\Pi^{\uparrow}(\pi_0)))_1) \\
&\geq J(\pi_0, \mathrm{BR}(\pi_0)) \\
&= \min_{\pi'_1} J(\pi_0, \pi'_1).
\end{aligned}$$

The first equality follows by definition of the best response function $\mathrm{BR}$. The second equality follows because $\Pi^{\downarrow}$ preserves expected return. The third equality is notational expansion. The fourth equality follows because $\pi_0$ and $\Pi^{\downarrow}(\Pi^{\uparrow}(\pi_0), \mathrm{BR}(\Pi^{\uparrow}(\pi_0)))_0$ can only differ at AOHs that are not reached when playing against $\Pi^{\downarrow}(\Pi^{\uparrow}(\pi_0), \mathrm{BR}(\Pi^{\uparrow}(\pi_0)))_1$. The inequality and final equality follow by definition of best response.

Also, consider

$$\begin{aligned}
\min_{\pi'_1} J(\pi_0, \pi'_1)
&= J(\pi_0, \mathrm{BR}(\pi_0)) \\
&= J(\Pi^{\uparrow}(\pi_0), \Pi^{\uparrow}(\mathrm{BR}(\pi_0))) \\
&\geq J(\Pi^{\uparrow}(\pi_0), \mathrm{BR}(\Pi^{\uparrow}(\pi_0))) \\
&= \min_{\tilde\pi'_1} J(\Pi^{\uparrow}(\pi_0), \tilde\pi'_1).
\end{aligned}$$

The first equality follows by definition of the best response function $\mathrm{BR}$. The second equality follows because $\Pi^{\uparrow}$ preserves expected return. The inequality and final equality follow by definition of best response.

These two inequalities can only both be true if $\min_{\tilde\pi'_1} J(\Pi^{\uparrow}(\pi_0), \tilde\pi'_1) = \min_{\pi'_1} J(\pi_0, \pi'_1)$. An analogous argument shows the same result for $\pi_1$.

Corollary 2. The exploitability of $\pi$ in the original game is equal to the exploitability of $\Pi^{\uparrow}(\pi)$ in the PuB-AMG.

Proof. This follows immediately from Lemma 1 and the fact that exploitability is defined in terms of best response values.

Corollary 3. The value of the PuB-AMG under a uniqueness-guaranteeing objective is well-defined and equal to that of the original game.

Proof.
Note that it suffices to show that PuB-AMGs are guaranteed to have an equilibrium with the same expected return as the equilibrium of the original game. Consider $\tilde\pi = \Pi^{\uparrow}(\pi)$, where $\pi$ is an equilibrium of the original game. Since $\pi$ is an equilibrium and, per Corollary 2, $\Pi^{\uparrow}$ preserves exploitability, $\tilde\pi$ is an equilibrium of the PuB-AMG. Additionally, since $\Pi^{\uparrow}$ preserves expected return, the original game and the PuB-AMG possess equilibria $\pi$ and $\tilde\pi$, respectively, that yield the same expected return.

We are now ready to prove our two main lemmas.

Lemma 2. Let $\pi$ be the equilibrium of a uniqueness-guaranteeing objective. Let $s$ define a subgame of the original game induced by playing $\pi$ for some number of steps. Then the unique equilibrium $\pi^{s}$ of the subgame, considered as an independent game, is the restriction $\pi^{|s}$ of $\pi$ to the subgame.

Proof. If $\pi^{|s}$ is not an equilibrium of the subgame, then it must be exploitable in the subgame. This means that either

$$\min_{\pi'_1} J^{s}(\pi^{|s}_0, \pi'_1) < J^{s}(\pi^{|s}_0, \pi^{|s}_1) \quad \text{or} \quad J^{s}(\pi^{|s}_0, \pi^{|s}_1) < \max_{\pi'_0} J^{s}(\pi'_0, \pi^{|s}_1).$$

Without loss of generality, assume the former. Let $\pi^{\mathrm{br}}_1 = \arg\min_{\pi'_1} J^{s}(\pi^{|s}_0, \pi'_1)$; let $P^{\pi}(s)$ represent the probability of reaching $s$ using policy $\pi$; let $t$ be the time step corresponding to $s$; and let $J^{<t}$ denote the expected return prior to time $t$. Further, let $s' \neq s$ range over the possible subgames entered at time $t$ if $s$ is not entered. Then

$$\begin{aligned}
J(\pi_0, \pi_1)
&= J^{<t}(\pi_0, \pi_1) + P^{\pi}(s) J^{s}(\pi^{|s}_0, \pi^{|s}_1) + \sum_{s' \neq s} P^{\pi}(s') J^{s'}(\pi^{|s'}_0, \pi^{|s'}_1) \\
&> J^{<t}(\pi_0, \pi_1) + P^{\pi}(s) J^{s}(\pi^{|s}_0, \pi^{\mathrm{br}}_1) + \sum_{s' \neq s} P^{\pi}(s') J^{s'}(\pi^{|s'}_0, \pi^{|s'}_1) \\
&= J(\pi_0, [\pi^{\mathrm{br}}_1, \pi^{|-s}_1]).
\end{aligned}$$

Here, the first line decomposes the expected return into 1) that which is accrued prior to time $t$, 2) that which is accrued in subgame $s$, and 3) that which is accrued after time $t$ outside of subgame $s$. The second line invokes our assumption that $\pi^{|s}_1$ does not achieve the best response value against $\pi^{|s}_0$ and that $P^{\pi}(s) > 0$.
The third line re-assembles the expected return, where we use $[\pi^{\mathrm{br}}_1, \pi^{|-s}_1]$ to denote a policy that plays $\pi_1$ outside $s$ and $\pi^{\mathrm{br}}_1$ inside $s$.

and where $M$ is the maximum of the absolute values of the returns of the objective. Then, if, for all $h_\iota$, $\psi(\cdot, h_\iota)$ is continuous and $g(\cdot, h_\iota)$ is i) well defined, ii) continuous, and iii) has an interior image, the objective $J$ is uniqueness guaranteeing.

Proof. First, we show that such an equilibrium is guaranteed to exist. Let $F : [-M, M]^{|H_\iota||A|} \to [-M, M]^{|H_\iota||A|}$ be a function that maps each vector $[q_{h_\iota}]_{h_\iota}$ to the action-value vector for the joint policy dictated by the application of $g$ to $[(q_{h_\iota}, h_\iota)]_{h_\iota}$. Note that $F$ is well defined (i.e., the ensuing action values are always well defined) because $g$ maps to the interior, so every history is reached with positive probability. Also note that this function is continuous, by the continuity of $g$ and $\psi$, and single valued because $g$ is single valued. Thus, because $[-M, M]^{|H_\iota||A|}$ is compact and convex, by Brouwer's fixed point theorem, a fixed point must exist. The policy corresponding to these fixed-point action values is an equilibrium. This follows because, by backward induction, each player is optimally responding to the other, holding the other fixed.

Now we show that there is a unique equilibrium. Note that, for any fixed opponent, the optimal policy at any decision point reached with positive probability must be full support because $g$'s image is within the interior. By forward induction, this means that every equilibrium must be full support at every decision point. Now, note that, by backward induction, the best responses to full support policies are unique because $g$ is single valued with an interior image. In aggregate, these two facts show that any equilibrium is strict, i.e., the only best response to one part of the equilibrium is the other part of the equilibrium. Now, assume there exist two distinct equilibria $\pi$ and $\pi'$. Without loss of generality, assume that $\pi_0$ performs at least as well as $\pi'_0$ against $\pi_1$. If $\pi_0$ performs equally well, there is a contradiction because $\pi'_0$ is not the unique best response. If $\pi_0$ outperforms $\pi'$, there is a contradiction because $\pi'$ is not at equilibrium. Thus, the equilibrium must be unique. The result follows because this proof also holds for every subgame of the original game.

C EXPERIMENTS

C.1 MAGNETIC MIRROR DESCENT

In our experiments, we use magnetic mirror descent (MMD) (Sokota et al., 2022a) as our game solver. In the instance of MMD we use, updates are of the form

$$\pi_{t+1} = \arg\max_{\pi} \; \mathbb{E}_{A \sim \pi} \, q_{\pi_t}(A) + \alpha H(\pi) - \frac{1}{\eta} \mathrm{KL}(\pi, \pi_t), \tag{8}$$

where $\pi_t$ is the current policy and $q_{\pi_t}$ is the MiniMaxEnt Q-value vector for time $t$. This update possesses the closed form

$$\pi_{t+1} \propto \left[ \pi_t \, e^{\eta q_{\pi_t}} \right]^{\frac{1}{1 + \alpha \eta}}.$$

The fixed point of equation (8) is a policy satisfying

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{A \sim \pi} \, q(A) + \alpha H(\pi) \propto e^{q / \alpha}. \tag{9}$$

C.2 TABULAR PUB-AMG POLICIES

In PuB-AMGs, the state space is continuous. Thus, it may not be possible to express a fully specified PuB-AMG policy in tabular form. We describe how we handle this issue for perturbed rock-paper-scissors and Kuhn poker, respectively, in subsequent subsections. In both settings, we solve the games using full feedback, meaning that we compute exact Q-values and update the policy for every AOH.
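As an illustration of the closed-form MMD update from C.1, the following minimal sketch iterates the update at a single decision point. It assumes, purely for illustration, that the Q-value vector is held fixed across iterations (in the actual solver it depends on the current policy); the function names and the numerical values of `q`, `alpha`, and `eta` are hypothetical.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def mmd_step(policy, q, alpha, eta):
    """Closed-form update: pi_{t+1} proportional to [pi_t * exp(eta * q)]^(1 / (1 + alpha * eta))."""
    power = 1.0 / (1.0 + alpha * eta)
    return normalize(
        [(p * math.exp(eta * qa)) ** power for p, qa in zip(policy, q)]
    )


def softmax_policy(q, alpha):
    """Entropy-regularized optimum: pi* proportional to exp(q / alpha)."""
    q_max = max(q)  # subtract the max for numerical stability
    return normalize([math.exp((qa - q_max) / alpha) for qa in q])


def run_mmd(q, alpha, eta, num_iterations):
    """Iterate the update from the uniform policy with q held fixed (assumption)."""
    policy = [1.0 / len(q)] * len(q)
    for _ in range(num_iterations):
        policy = mmd_step(policy, q, alpha, eta)
    return policy


# Hypothetical values chosen only for demonstration.
q_example = [1.0, 0.0, -0.5]
final_policy = run_mmd(q_example, alpha=0.5, eta=0.5, num_iterations=500)
target_policy = softmax_policy(q_example, alpha=0.5)
```

With `q` fixed, the log-policy contracts toward its fixed point by a factor of 1 / (1 + alpha * eta) per step, so the iterates approach the softmax policy geometrically.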

C.2.1 PERTURBED ROCK-PAPER-SCISSORS

In perturbed rock-paper-scissors, the first moving player's state space is trivial; thus, its policy can be expressed exactly. Also, the (regularized) best response of the second player can be computed in closed form using equation (9). Thus, we set the second moving player's (regularized) PuB-AMG Nash equilibrium policy to that induced by equation (9). We update the first moving player's policy using equation (8), where $q_t$ is the feedback induced by the second moving player's (regularized) PuB-AMG Nash equilibrium policy.
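The closed-form regularized best response of the second moving player can be sketched as follows. This is a minimal illustration, not the experimental code: the payoff matrix below is a hypothetical perturbation of rock-paper-scissors (the actual perturbation is not given here), payoffs are assumed to be from the first player's perspective, and the regularizer is assumed to be an entropy bonus with temperature `alpha`.

```python
import math
import random

# Hypothetical payoff matrix (first player's return); rows index the first
# player's action and columns the second player's action.
PAYOFF = [
    [0.0, -1.0, 2.0],
    [1.0, 0.0, -1.0],
    [-2.0, 1.0, 0.0],
]


def second_player_q(first_policy):
    """Q-values of the second (minimizing) player against an announced policy."""
    return [
        -sum(first_policy[i] * PAYOFF[i][j] for i in range(3))
        for j in range(3)
    ]


def softmax(q, alpha):
    """Policy proportional to exp(q / alpha)."""
    q_max = max(q)
    weights = [math.exp((v - q_max) / alpha) for v in q]
    total = sum(weights)
    return [w / total for w in weights]


def regularized_value(policy, q, alpha):
    """Expected Q-value plus alpha times the Shannon entropy of the policy."""
    entropy = -sum(p * math.log(p) for p in policy if p > 0.0)
    return sum(p * v for p, v in zip(policy, q)) + alpha * entropy


def regularized_best_response(first_policy, alpha):
    """Closed-form regularized best response to the announced policy."""
    return softmax(second_player_q(first_policy), alpha)


def random_policy(rng):
    """Sample an arbitrary full-support policy over three actions."""
    weights = [rng.random() + 1e-6 for _ in range(3)]
    total = sum(weights)
    return [w / total for w in weights]


announced = [0.5, 0.3, 0.2]
alpha = 0.2
response = regularized_best_response(announced, alpha)
response_value = regularized_value(response, second_player_q(announced), alpha)
```

Because the response depends only on the announced first-mover policy, it can be recomputed exactly at every iteration rather than stored in a table over the continuous PuB-AMG state space.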

C.2.2 KUHN POKER

We also investigate MiniMaxEnt objectives in an extensive-form game, Kuhn poker. In Kuhn poker, there are up to three time steps. For the third time step, we use the (regularized) PuB-AMG Nash equilibrium policy induced by equation (9). For the first time step, at each iteration, we update the policy at each information state using MMD on the feedback from the previous time step. For the second time step, at iteration $t$, holding fixed the iteration-$t$ decision rule for the first time step, we use the policy induced by performing $\sqrt{t}$ iterations of MMD against the (regularized) PuB-AMG Nash equilibrium policy of the third time step. As $\sqrt{t}$ grows large, we expect the decision rules for the second time step to approximate a PuB-AMG best response.

We show the results for the original game in Figure 6. Qualitatively, they are analogous to those from the perturbed rock-paper-scissors game. The unregularized objective induces high-exploitability iterates (blue) that do not converge in the original game; the objectives with fixed regularization (purple, red, green) converge to constant exploitability and zero regularized exploitability in the original game; the objective with annealed regularization converges to zero exploitability and zero regularized exploitability.

Proof. Note that, by assumption, for all $(x, y)$ within the domains of $f_1, f_2$, it holds that

$$f_1(x, y) \leq f_2(x, y) + \epsilon, \tag{10}$$

$$f_2(x, y) \leq f_1(x, y) + \epsilon. \tag{11}$$

The first equality follows by definition of $\tilde v^{*}$. The second equality follows from Corollary 3. The third equality follows from Lemma 4.

D CONTINUITY IN THE PUB-AMG

Corollary 4. The minimax PuB-AMG value function $\tilde v^{*}$ is continuous for the standard expected return objective.

Corollary 5. The minimax PuB-AMG value function $\tilde v^{*}$ is continuous for any objective satisfying the premises of Theorem 3.



Our usage of $\iota$ is informal but unambiguous in context.

We assume that, if $i$ acts at time $t$, $i$'s action is included in its private observation at time $t + 1$.

Although an attempt has been made to scale up such approaches for the PuB-MDP (Foerster et al., 2019).

Though Schmid et al. (2021) show that certain approximate value functions can be made continuous.

This assumption is not required, but makes for cleaner presentation.

We omit PuB-AMG (regularized) exploitability, as it is difficult to compute exactly in this case.



Figure 4: Results for perturbed rock-paper-scissors.

Figure 5: A rigged and adversarial variant of matching pennies.

Figure 6: Results for Kuhn poker.

Lemma 4. Let $f_1, f_2$ be real-valued continuous functions with shared compact domain $X \times Y$. Furthermore, assume their max-min values are attained and the following inequality holds for any $(x, y) \in X \times Y$:

$$|f_1(x, y) - f_2(x, y)| < \epsilon.$$

Then it follows that

$$\left| \left[ \max_{x \in X} \min_{y \in Y} f_1(x, y) \right] - \left[ \max_{x \in X} \min_{y \in Y} f_2(x, y) \right] \right| < \epsilon.$$
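The max-min stability statement above can be sanity-checked on finite grids, which are compact and on which every function is continuous. The sketch below is only an illustrative check under those assumptions, not a proof; the grid sizes, seeds, and helper names are hypothetical.

```python
import random


def max_min(table):
    """Max over rows (x) of the min over columns (y) of a payoff table."""
    return max(min(row) for row in table)


def random_table(num_rows, num_cols, rng):
    """Random real-valued function on a finite grid X x Y."""
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(num_cols)]
        for _ in range(num_rows)
    ]


def perturb(table, eps, rng):
    """Pointwise perturbation with magnitude strictly below eps."""
    return [
        [v + rng.uniform(-0.999 * eps, 0.999 * eps) for v in row]
        for row in table
    ]


def max_min_gap(num_rows, num_cols, eps, seed):
    """Return |maxmin f1 - maxmin f2| for a random f1 and an eps-close f2."""
    rng = random.Random(seed)
    f1 = random_table(num_rows, num_cols, rng)
    f2 = perturb(f1, eps, rng)
    return abs(max_min(f1) - max_min(f2))


# Gaps for several random instances; each should be strictly below eps.
EPS = 0.05
gaps = [max_min_gap(6, 7, EPS, seed) for seed in range(100)]
```

The check reflects the idea that a uniform perturbation of size less than `eps` can move each inner minimum, and hence the outer maximum, by less than `eps`.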

[...] x, y) + ϵ,

where the first inequality is due to taking the min of both sides of equation (10). Following the same steps starting with $f_2$ and using equation (11) gives

$$\max_{x \in X} \min_{y \in Y} f_2(x, y) \leq \max_{x \in X} \min_{y \in Y} f_1(x, y) + \epsilon.$$

These two inequalities together yield the result.

Theorem 4. The minimax PuB-AMG value function $\tilde v^{*}$ is continuous for any objective $J$ for which policy evaluation is continuous.

Proof. Fix $\epsilon > 0$. Let $b$ and $b'$ differ in total variation distance by less than $\delta = \frac{\epsilon}{2M}$. Then observe that, for a joint policy $\pi$, we have that

$$|v_\pi(b) - v_\pi(b')| = \Big| \sum_h b(h) v_\pi(h) - b'(h) v_\pi(h) \Big| \leq \sum_h |b(h) v_\pi(h) - b'(h) v_\pi(h) \dots$$

where $v_\pi(b)$ is the expected return under $J$ to playing $\pi$ starting from the subgame defined by $b$. Then $|\tilde v^{*}(b) - \tilde v^{*}(b')| = |[\max \dots$

$$\mathbb{E}_{H^t \sim \tilde s^t} \, \mathbb{E}_{A^t \sim \tilde a(H_\iota)} \, \mathcal{P}(h^{t+1} \mid H^t, A^t, o^{t+1}_{\text{pub}})$$

Next, we define a canonical choice function Π ↑ , which maps each joint policy to a corresponding PuB-AMG policy.


In total, we have shown that if $\pi^{|s}$ is not an equilibrium in the subgame induced by $s$, then $\pi$ is not an equilibrium because $\pi_1$ is not a best response. Thus, $\pi^{|s}$ must be an equilibrium of the subgame. Therefore, because $J$ is uniqueness guaranteeing, we must have $\pi^{s} = \pi^{|s}$.

Lemma 3. If $\tilde\pi$ is an equilibrium of the PuB-AMG, then the decision rule for the first time step $\Pi^{\downarrow}(\tilde\pi)^0$ must be part of an equilibrium policy in the original game.

Proof. Without loss of generality, assume that $\iota = 0$ at the first time step. Also, use $\pi^{-0}_0$ to denote the part of $\pi_0$ that is relevant after the first time step. Also, let $[\pi'^{0}, \pi'^{-0}_0]$ denote a policy for $i = 0$ that plays according to $\pi'^{0}$ at the first time step and $\pi'^{-0}_0$ otherwise. Let $\tilde\pi$ be an equilibrium of the PuB-AMG. Then observe

[...]

Here, the first equality follows by Corollary 3; the second equality follows because $\tilde\pi^0$ is part of an equilibrium; the third equality follows because $J(\tilde\pi') = J(\Pi^{\downarrow}(\tilde\pi'))$; the fourth equality follows because the image of the correspondence mapping for the first time step is invariant to the PuB-AMG policy at later time steps; the fifth equality follows because each player can express any policy in the original game through $\Pi^{\downarrow}$, up to reachability, and because changes over unreachable AOHs do not change the expected return; the sixth equality follows because the evaluation of an argmax is equal to the max.

This chain of equalities shows that the best response value to [...] is equal to the value of the game. Thus, $\Pi^{\downarrow}(\tilde\pi)^0$ is part of an equilibrium.

Theorem 1. If $\tilde\pi$ is an equilibrium of the PuB-AMG induced by a uniqueness-guaranteeing objective, then its projection $\Pi^{\downarrow}(\tilde\pi)$ is an equilibrium in the original game.

Proof. Lemma 3 shows this to be true for the first time step. Now assume this is true up to time step $t$ and consider time step $t + 1$. Then, for a particular reachable $\tilde s^{t+1}$, the PuB-AMG subgame starting at this point is the PuB-AMG of the subgame of the original game starting from $\tilde s^{t+1}$. Thus, the PuB-AMG strategy for $\tilde s^{t+1}$ must correspond to an equilibrium of the subgame of the original game, as per Lemma 3. Furthermore, because the minimax objective is uniqueness guaranteeing, the equilibrium strategy of the subgame of the original game must be the unique restriction of the equilibrium of the original game to that subgame, as per Lemma 2.

