ITERATIVE EMPIRICAL GAME SOLVING VIA SINGLE POLICY BEST RESPONSE

Abstract

Policy-Space Response Oracles (PSRO) is a general algorithmic framework for learning policies in multiagent systems by interleaving empirical game analysis with deep reinforcement learning (Deep RL). At each iteration, Deep RL is invoked to train a best response to a mixture of opponent policies. The repeated application of Deep RL poses an expensive computational burden as we look to apply this algorithm to more complex domains. We introduce two variations of PSRO designed to reduce the amount of simulation required during Deep RL training. Both algorithms modify how PSRO adds new policies to the empirical game, based on learned responses to a single opponent policy. The first, Mixed-Oracles, transfers knowledge from previous iterations of Deep RL, requiring training only against the opponent's newest policy. The second, Mixed-Opponents, constructs a pure-strategy opponent by mixing existing strategies' action-value estimates, instead of their policies. Learning against a single policy mitigates variance in state outcomes that is induced by an unobserved distribution of opponents. We empirically demonstrate that these algorithms substantially reduce the amount of training simulation required by PSRO, while producing equivalent or better solutions to the game.

1. INTRODUCTION

In (single-agent) reinforcement learning (RL), an agent repeatedly interacts with an environment until it achieves mastery, that is, it can find no further way to improve performance (measured by reward). In a multiagent system, the outcome of any learned policy may depend pivotally on the behavior of other agents. Direct search over a complex joint policy space for game solutions is daunting, so approaches combining RL and game theory more commonly interleave learning and game analysis iteratively. In empirical game-theoretic analysis (EGTA), game reasoning is applied to approximate game models defined over restricted strategy sets. The strategies are selected instances drawn from a large space of possible strategies, for example, defined in terms of a fundamental policy space: mappings from observations to actions. For a given set of strategies (policies),* we estimate an empirical game via simulation over combinations of these strategies. In iterative approaches to EGTA, analysis of the empirical game informs decisions about the sampling of strategy profiles or additions of strategies to the restricted sets. Schvartzman & Wellman (2009b) first combined RL with EGTA, adding new strategies by learning a policy for one agent (using tabular Q-learning), fixing other agent policies at a Nash equilibrium of the current empirical game. Lanctot et al. (2017) introduced a general framework, Policy-Space Response Oracles (PSRO), interleaving empirical game analysis with deep RL. PSRO generalizes the identification of learning targets, employing an abstract method termed meta-strategy solver (MSS) that extracts a strategy profile from an empirical game. At each epoch of PSRO, a player uses RL to derive a new policy by training against the opponent strategies in the MSS-derived profile. A particular challenge for the RL step in PSRO is that the learner must derive a response under uncertainty about opponent policies.
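The PSRO epoch structure just described (solve the empirical game with an MSS, train a best response per player via RL, then simulate to extend the game) can be sketched schematically. Every name and callable below is a placeholder of our own, not part of any actual PSRO implementation:

```python
def psro_epoch(strategy_sets, payoffs, meta_strategy_solver, best_response,
               simulate_missing_profiles):
    """One epoch of PSRO (schematic).

    strategy_sets: list of per-player policy lists (the empirical game's
        restricted strategy sets).
    payoffs: current payoff estimates for the empirical game.
    The three callables stand in for the MSS, the RL best-response
    trainer, and the simulator, respectively.
    """
    # Solution step: extract a (generally mixed) profile from the
    # empirical game.
    sigma_star = meta_strategy_solver(payoffs)
    for i in range(len(strategy_sets)):
        # Best-response step: train player i's new policy against the
        # opponents' part of sigma_star.
        pi_new = best_response(i, sigma_star)
        strategy_sets[i].append(pi_new)
    # Simulate profiles involving the new policies to expand the game.
    payoffs = simulate_missing_profiles(strategy_sets, payoffs)
    return strategy_sets, payoffs
```

In unmodified PSRO, `best_response` trains directly against the mixture; the variants proposed in this paper replace that step with training against single pure strategies.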
The profile returned by the MSS is generally a mixed-strategy profile, as in strategically complex environments randomization is often a necessary ingredient for equilibrium. The opponent's draws from this mixture are unobserved, adding uncertainty to the multiagent environment. We address this challenge through variants of PSRO in which all RL is applied to environments where opponents play pure strategies. We propose and evaluate two such methods, which work in qualitatively different ways. Mixed-Oracles learns separate best responses (BRs) to each pure strategy in a mixture and combines the results from learning to approximate a BR to the mixture. Mixed-Opponents constructs a single pure opponent policy that represents an aggregate of the mixed strategy and learns a BR to this policy. Both of our methods employ the machinery of Q-Mixing (Smith et al., 2020), which constructs policies based on an aggregation of Q-functions corresponding to component policies of a mixture. Our methods promise advantages beyond those of learning in a less stochastic environment. Mixed-Oracles transfers learning across epochs, exploiting the Q-functions learned against a particular opponent policy in constructing policies for any other epoch where that opponent policy is encountered. Mixed-Opponents applies directly over the joint opponent space, and so has the potential to scale beyond two-player games. We evaluate our methods in a series of experimental games, and find that both Mixed-Oracles and Mixed-Opponents find solutions at least as good as those of unmodified PSRO, and often better, while substantially reducing the required amount of training simulation.
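To illustrate the aggregation that Q-Mixing performs, the following sketch combines per-opponent action-value vectors, weighted by the opponent's mixture probabilities, and acts greedily with respect to the combination. This is a simplified, prior-weighted form (the full method of Smith et al. (2020) can also weight by a posterior over opponents inferred from observations), and all names here are illustrative:

```python
import numpy as np

def q_mixing_action(q_values_per_opponent, mixture):
    """Choose a greedy action from a mixture-weighted combination of
    per-opponent Q-value vectors (prior-weighted Q-Mixing sketch).

    q_values_per_opponent: one length-|A| array per opponent pure
        strategy, giving Q_j(o, .) at the current observation o.
    mixture: probability of facing each opponent pure strategy.
    """
    q_mix = sum(p * q for p, q in zip(mixture, q_values_per_opponent))
    return int(np.argmax(q_mix))

# Toy example: two opponent policies, three actions.
qs = [np.array([1.0, 0.0, 0.5]), np.array([0.0, 2.0, 0.5])]
sigma = [0.5, 0.5]
action = q_mixing_action(qs, sigma)  # mixed values [0.5, 1.0, 0.5] -> action 1
```

Note that the greedy action under the mixed values (action 1 above) can differ from the greedy action against either opponent individually, which is the point of aggregating value estimates rather than policies.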

2. PRELIMINARIES

An RL agent at time t ∈ T receives the state of the environment s_t ∈ S, or a partial state called an observation o_t ∈ O. The agent then chooses an action a_t according to its policy π : O → ∆(A), affecting the environment and producing a reward signal r_t ∈ R. An experience is a tuple (s_t, a_t, r_{t+1}, s_{t+1}), and a sequence of experiences ending in a terminal state is an episode τ. How the environment changes as a result of the action is dictated by the environment's transition dynamics p : S × A → S. The agent is said to act optimally when it maximizes the return G_t = Σ_{l=0}^∞ γ^l r_{t+l}, where γ is a discount factor. In value-based RL, agents use the return to estimate the quality of a state, V(o_t) = E_π[Σ_{l=0}^∞ γ^l r(o_{t+l}, a_{t+l})], and/or of taking an action in a state, Q(o_t, a_t) = r(o_t, a_t) + γ E_{o_{t+1} ∈ O}[V(o_{t+1})].

A normal-form game (NFG) Γ = (Π, U, n) describes a one-shot strategic interaction among n players. When there is more than one agent, we denote agent-specific components with subscripts (e.g., π_i), and negated subscripts represent the joint elements of all other agents (e.g., π_{-i}). Each player has a set of policies Π_i = {π^0_i, ..., π^k_i} from which it may choose. Player i may select a pure strategy π_i ∈ Π_i, or may randomize play by sampling from a mixed strategy σ_i ∈ ∆(Π_i). At the end of the interaction, each player receives a payoff given by U : Π → R^n. An empirical NFG (ENFG) Γ̃ = (Π̃, Ũ, n) is an NFG induced by simulation of strategy profiles: it approximates an underlying game Γ, estimating the payoffs Ũ through simulation. ENFGs are often employed when the strategy set is too large for exhaustive representation, for example when the strategies are instances from a complex space of policies. The quality of a strategy profile in an ENFG can be evaluated using regret: how much a player could gain by deviating from its assigned policy, measured with respect to a set of available deviation policies.

The regret of solution σ to player i, when they are able to deviate to policies in Π̃_i ⊆ Π_i, is:

Regret_i(σ, Π̃_i) = max_{π_i ∈ Π̃_i} U_i(π_i, σ_{-i}) − U_i(σ_i, σ_{-i}).

Example deviation sets Π̃ employed in this paper are Π̃_PSRO, the strategies accumulated through a run of PSRO, and Π̃_EVAL, a static set of held-out evaluation policies. The sum of regrets across all players,

SumRegret(σ, Π̃) = Σ_{i ∈ n} Regret_i(σ, Π̃_i),

sometimes called the Nash convergence, measures how stable a solution is.
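For concreteness, regret and SumRegret can be computed directly from an estimated payoff tensor in the two-player case. The sketch below is illustrative (the names and the payoff-tensor layout are our own assumptions, not the paper's code), with the deviation set taken to be the ENFG's own strategy set, as with Π̃_PSRO:

```python
import numpy as np

def regret(payoffs, sigma, player):
    """Regret of mixed profile sigma for one player of a two-player ENFG.

    payoffs: array of shape (|Pi_0|, |Pi_1|, 2) holding the estimated
        payoffs U~ for every pure-strategy profile.
    sigma: one mixed-strategy probability vector per player.
    """
    if player == 0:
        # Expected payoff of each pure deviation against sigma_1.
        deviations = payoffs[:, :, 0] @ sigma[1]
        current = sigma[0] @ deviations
    else:
        deviations = sigma[0] @ payoffs[:, :, 1]
        current = deviations @ sigma[1]
    return deviations.max() - current

def sum_regret(payoffs, sigma):
    """SumRegret: stability measure summing regrets over both players."""
    return sum(regret(payoffs, sigma, i) for i in range(2))

# Matching pennies: the uniform profile is the equilibrium, so its
# SumRegret is zero; any other profile has positive SumRegret.
mp = np.array([[[1, -1], [-1, 1]],
               [[-1, 1], [1, -1]]], dtype=float)
uniform = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
```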

3. METHOD

At each epoch e of the PSRO algorithm, a new policy is constructed for each player by best-responding to an opponent profile σ^{*,e−1}_{−i} from the currently constructed ENFG: π^e_i ∈ BR(σ^{*,e−1}_{−i}). These policies are then added to each player's strategy set, Π̃^e_i ← Π̃^{e−1}_i ∪ {π^e_i}, and the new profiles are simulated to expand the ENFG. Algorithm 1 presents the full PSRO algorithm as defined by Lanctot et al. (2017). One of the key design choices in iterative empirical game solving is choosing which policies to add to the ENFG. This was first studied by Schvartzman & Wellman (2009a), who termed it the strategy exploration problem. We want to add policies that both bring the solution of the ENFG closer to the solution of the full game and can be computed efficiently. In PSRO, the strategy exploration problem is decomposed into two steps: solution and BR via RL. In the solution step, PSRO derives a profile σ^{*,e} from the current



* The term policy as employed in the RL context corresponds exactly to the game-theoretic notion of strategy in our setting, and we use the terms interchangeably.

