ABSTRACTING IMPERFECT INFORMATION AWAY FROM TWO-PLAYER ZERO-SUM GAMES

Abstract

In their seminal work, Nayyar et al. (2013) showed that imperfect information can be abstracted away from common-payoff games by having players publicly announce their policies as they play. This insight underpins sound solvers and decision-time planning algorithms for common-payoff games. Unfortunately, a naive application of the same insight to two-player zero-sum games fails because Nash equilibria of the game with public policy announcements may not correspond to Nash equilibria of the original game. As a consequence, existing sound decision-time planning algorithms require complicated additional mechanisms that have unappealing properties. The main contribution of this work is showing that certain regularized equilibria do not possess the aforementioned non-correspondence problem; thus, computing them can be treated as a perfect-information problem. This result yields a simplified framework for decision-time planning in two-player zero-sum games, free of the unappealing properties that plague existing decision-time planning algorithms.

1. INTRODUCTION

In single-agent settings, dynamic programming (Bertsekas, 2000) serves as a bedrock for reinforcement learning (Sutton & Barto, 2018), justifying approximating optimal policies by backward induction and facilitating a simple framework for decision-time planning. One might hope that dynamic programming could also provide similar grounding in multi-agent settings for well-defined notions of optimality, like optimal joint policies in common-payoff games, Nash equilibria in two-player zero-sum (2p0s) games, and team correlated equilibria in two-team zero-sum (2t0s) games. Unfortunately, this is not straightforwardly the case when there is imperfect information-i.e., one player has knowledge that another does not or two players act simultaneously. This difficulty arises from two causes, which we call the backward dependence problem and the non-correspondence problem. The backward dependence problem is that computing the expected return starting from a decision point generally requires knowledge about policies that were played up until now, in addition to the



Figure 1: Our contribution in the context of related work, at an abstract level.
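To make the backward-induction view concrete, the following is a minimal sketch of single-agent dynamic programming on a toy finite-horizon MDP; the states, actions, transitions, and rewards are illustrative assumptions, not taken from this paper. Because the setting is single-agent and fully observed, the value of each state with $t$ steps remaining depends only on values with $t-1$ steps remaining, which is exactly the property that imperfect information breaks.

```python
# Backward induction on a toy deterministic finite-horizon MDP.
# All numbers below are illustrative assumptions for this sketch.

HORIZON = 3
STATES = [0, 1]
ACTIONS = [0, 1]

# transition[s][a] = next state; reward[s][a] = immediate reward.
transition = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
reward = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}

def backward_induction():
    # V[t][s] = optimal value with t steps remaining in state s.
    V = {0: {s: 0.0 for s in STATES}}
    policy = {}
    for t in range(1, HORIZON + 1):
        V[t], policy[t] = {}, {}
        for s in STATES:
            # One-step lookahead: immediate reward plus optimal value-to-go.
            q = {a: reward[s][a] + V[t - 1][transition[s][a]] for a in ACTIONS}
            best = max(q, key=q.get)
            V[t][s] = q[best]
            policy[t][s] = best
    return V, policy

V, policy = backward_induction()
print(V[HORIZON])  # optimal start-of-episode values: {0: 5.0, 1: 6.0}
```

The key feature of this recursion is its locality: no quantity computed at step $t$ depends on which policy was followed earlier in the episode. The backward dependence problem described above is precisely the failure of this locality in imperfect-information games.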

