A UNIFIED APPROACH TO REINFORCEMENT LEARNING, QUANTAL RESPONSE EQUILIBRIA, AND TWO-PLAYER ZERO-SUM GAMES

Abstract

This work studies an algorithm, which we call magnetic mirror descent, that is inspired by mirror descent and the non-Euclidean proximal gradient algorithm. Our contribution is demonstrating the virtues of magnetic mirror descent as both an equilibrium solver and an approach to reinforcement learning in two-player zero-sum games. These virtues include: 1) being the first quantal response equilibrium solver to achieve linear convergence for extensive-form games with first-order feedback; 2) being the first standard reinforcement learning algorithm to achieve empirically competitive results with CFR in tabular settings; 3) achieving favorable performance in 3x3 Dark Hex and Phantom Tic-Tac-Toe as a self-play deep reinforcement learning algorithm.

1. INTRODUCTION

This work studies an algorithm that we call magnetic mirror descent (MMD) in the context of two-player zero-sum (2p0s) games. MMD is an extension of mirror descent (Beck & Teboulle, 2003; Nemirovsky & Yudin, 1983) with proximal regularization and a special case of a non-Euclidean proximal gradient method (Tseng, 2010; Beck, 2017), both of which have been studied extensively in convex optimization. To facilitate our analysis of MMD, we extend the non-Euclidean proximal gradient method from convex optimization to 2p0s games and, more generally, variational inequality problems (Facchinei & Pang, 2003). We then prove a new linear convergence result for the non-Euclidean proximal gradient method in variational inequality problems with composite structure. As a consequence of our general analysis, we attain formal guarantees for MMD by showing that solving for quantal response equilibria (McKelvey & Palfrey, 1995), i.e., entropy-regularized Nash equilibria, in extensive-form games (EFGs) can be modeled as a variational inequality problem via the sequence form (Romanovskii, 1962; Von Stengel, 1996; Koller et al., 1996). These guarantees provide the first linear convergence results to quantal response equilibria (QREs) in EFGs for a first-order method.

Our empirical contribution investigates MMD as a last-iterate (regularized) equilibrium approximation algorithm across a variety of 2p0s benchmarks. We begin by confirming our theory, showing that MMD converges exponentially fast to QREs in both normal-form games (NFGs) and EFGs. We also find that, empirically, MMD converges to agent QREs (AQREs) (McKelvey & Palfrey, 1998), an alternative formulation of QREs for extensive-form games, when applied with action-value feedback. These results lead us to examine MMD as an RL algorithm for approximating Nash equilibria. On this front, we show competitive performance with counterfactual regret minimization (CFR) (Zinkevich et al., 2007).
This is the first instance of a standard RL algorithm[1] yielding empirically competitive performance with CFR in tabular benchmarks when applied in self-play. Motivated by our tabular results, we examine MMD as a multi-agent deep RL algorithm for 3x3 Abrupt Dark Hex and Phantom Tic-Tac-Toe; encouragingly, we find that MMD is able to successfully minimize an approximation of exploitability. In addition to those listed above, we provide numerous other experiments in the appendix. In aggregate, we believe that our results suggest that MMD is a unifying approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games.

2. BACKGROUND

Sections 2.1 and 3.3 provide an informal treatment of our problem settings and solution concepts, along with a summary of our algorithm and some of our theoretical results. Sections 2.2 through 3.2 give a more formal and detailed treatment of the same material; these sections are self-contained and safe to skip for readers less interested in our theoretical results.

2.1. PROBLEM SETTINGS AND SOLUTION CONCEPTS

This work is concerned with 2p0s games, i.e., settings with two players in which the reward for one player is the negation of the reward for the other player.[2] Two-player zero-sum games are often formalized as NFGs, partially observable stochastic games (Hansen et al., 2004), or perfect-recall EFGs (von Neumann & Morgenstern, 1947). An important idea is that it is possible to convert any EFG into an equivalent NFG: the actions of the equivalent NFG correspond to the deterministic policies of the EFG, and the payoffs for a joint action are dictated by the expected returns of the corresponding joint policy in the EFG.

We introduce the solution concepts studied in this work as generalizations of single-agent solution concepts. In single-agent settings, we call these concepts optimal policies and soft-optimal policies. We say a policy is optimal if there does not exist another policy achieving a greater expected return (Sutton & Barto, 2018). In problems with a single decision point, we say a policy is $\alpha$-soft optimal in the normal sense if it maximizes a weighted combination of its expected action value and its entropy:

$$\pi = \underset{\pi' \in \Delta(\mathcal{A})}{\arg\max}\ \mathbb{E}_{A \sim \pi'}\, q(A) + \alpha H(\pi'), \qquad (1)$$

where $\pi$ is a policy, $\Delta(\mathcal{A})$ is the action simplex, $q$ is the action-value function, $\alpha$ is the regularization temperature, and $H$ is Shannon entropy. More generally, we say a policy is $\alpha$-soft optimal in the behavioral sense if it satisfies equation (1) at every decision point.

In 2p0s settings, we refer to the solution concepts used in this work as Nash equilibria and QREs. We say a joint policy is a Nash equilibrium if each player's policy is optimal, conditioned on the other player not changing its policy. In games with a single decision point, we say a joint policy is a QRE[3] (McKelvey & Palfrey, 1995) if each player's policy is soft optimal in the normal sense, conditioned on the other player not changing its policy.
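The maximizer of expected action value plus $\alpha$-scaled entropy, as in equation (1), has a well-known closed form: the softmax of the action values divided by $\alpha$. A minimal sketch (the function name and example values are illustrative, not from the paper):

```python
import numpy as np

def soft_optimal_policy(q, alpha):
    """Maximizer of E_{A~pi} q(A) + alpha * H(pi): the softmax of q / alpha."""
    z = q / alpha
    z = z - z.max()  # shift for numerical stability; leaves the softmax unchanged
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 0.0])
print(soft_optimal_policy(q, alpha=1.0))    # roughly [0.731, 0.269]
print(soft_optimal_policy(q, alpha=100.0))  # high temperature: nearly uniform
print(soft_optimal_policy(q, alpha=0.01))   # low temperature: nearly greedy
```

As the temperature $\alpha$ grows, the policy approaches uniform; as $\alpha \to 0$, it approaches the (greedy) optimal policy.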
More generally, we say a joint policy is an agent QRE (AQRE) (McKelvey & Palfrey, 1998) if each player's policy is soft optimal in the behavioral sense, conditioned on the other player not changing its policy. Note that AQREs of EFGs do not generally correspond to the QREs of their normal-form equivalents. Beyond (A)QREs, our results also apply to other regularized solution concepts, such as those with KL regularization toward a non-uniform policy.
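For a game with a single decision point, a QRE is a pair of policies that are simultaneously soft optimal against each other, i.e., a fixed point of softmax responses. A hedged sketch for matching pennies (the temperature, starting points, and damped fixed-point iteration are illustrative; this is not the paper's MMD algorithm):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Matching pennies: row player's payoff matrix; the column player receives the negation.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
alpha = 1.0  # regularization temperature

x = np.array([0.9, 0.1])  # row player's policy
y = np.array([0.2, 0.8])  # column player's policy
for _ in range(500):
    # Each player moves partway toward its soft best response (damping aids stability).
    x = 0.5 * x + 0.5 * softmax(A @ y / alpha)
    y = 0.5 * y + 0.5 * softmax(-A.T @ x / alpha)

print(x, y)  # both approach the uniform policy
```

By symmetry, the unique QRE of matching pennies is uniform for both players (here it coincides with the Nash equilibrium, which is uniform at every temperature).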

2.2. NOTATION

We use superscripts to denote a particular coordinate of $x = (x^1, \dots, x^n) \in \mathbb{R}^n$ and subscripts to denote time, as in $x_t$. We use the standard inner product, denoted $\langle x, y \rangle = \sum_{i=1}^n x^i y^i$. For a given



[1] We use "standard RL algorithm" to mean algorithms that would look ordinary to single-agent RL practitioners, excluding, e.g., algorithms that converge in the average iterate or operate over the sequence form.
[2] Note that 2p0s games generalize single-agent settings, such as Markov decision processes (Puterman, 2014) and partially observable Markov decision processes (Kaelbling et al., 1998).
[3] Specifically, it is a logit QRE; we omit "logit" as a prefix for brevity.

