The Advantage Regret-Matching Actor-Critic

Abstract

Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior: Advantage Regret-Matching Actor-Critic (ARMAC). Rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially-observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.

1. Introduction

The notion of regret is a key concept in the design of many decision-making algorithms. Regret minimization drives most bandit algorithms, is often used as a performance metric for reinforcement learning (RL) algorithms, and underpins learning in games (3). When used in algorithm design, the common approach is to accumulate values and/or regrets and derive new policies from these accumulated quantities. One particular approach, counterfactual regret (CFR) minimization (35), has been the core algorithm behind super-human play in Computer Poker research (4; 25; 6; 8). CFR computes an approximate Nash equilibrium by having players minimize regret in self-play, producing an average strategy that is guaranteed to converge to an optimal solution in two-player zero-sum games and single-agent games. We investigate the problem of generalizing these regret minimization algorithms over large state spaces in the sequential setting using end-to-end function approximators, such as deep networks. Several approaches predict the regret, or otherwise simulate the regret minimization: Regression CFR (RCFR) (34), advantage regret minimization (17), regret-based policy gradients (30), Deep Counterfactual Regret Minimization (5), and Double Neural CFR (22). All of these approaches have focused exclusively on either the multiagent or the single-agent problem; some have used expert features, while others have relied on tree search to scale. Another common approach is based on fictitious play (15; 16; 21; 24), a simple iterative self-play algorithm based on best response. A common technique is to use reservoir sampling to maintain a buffer representing a uniform sample over past data, which is used to train a classifier representing the average policy. In Neural Fictitious Self-Play (NFSP), this produced competitive policies in limit Texas Hold'em (16), and in Deep CFR this method was shown to approach an approximate equilibrium in a large subgame of Hold'em poker.
A generalization of fictitious play, policy-space response oracles (PSRO) (21), stores past policies and a meta-distribution over them, replaying policies against other policies and incrementally adding new best responses to the set. This can be seen as a population-based learning approach where the individuals are the policies and the distribution is modified based on fitness. PSRO only requires simulating the policies and aggregating data; as a result, it was able to scale to a very large real-time strategy game (33). In this paper, we describe an approximate form of CFR in a training regime that we call retrospective policy improvement. Like PSRO, our method stores past policies. However, it does not store meta-distributions or reward tables, nor do the policies have to be approximate best responses, which can be costly to compute or learn. Instead, the policies are snapshots of those used in the past, which are retrospectively replayed to predict a conditional advantage which, when used in a regret-matching algorithm, produces the same policy that CFR would. In the single-agent setting, ARMAC is related to Politex (1), except that it is based on regret matching (14) and predicts average quantities rather than explicitly summing over all the experts to obtain the policy. In the multiagent setting, it is a sample-based, model-free variant of RCFR with one important property: it uses trajectory samples to estimate quantities without requiring importance sampling as in standard Monte Carlo CFR (20); hence, it does not suffer from excessive variance in large environments. This is achieved by using critics (value estimates) of past policies that are trained off-policy using standard policy evaluation techniques.
In particular, we introduce a novel training regime that estimates a conditional advantage W_i(s, a), which is the cumulative counterfactual regret R_i(s, a) scaled by a factor B(s) that depends only on the information state s; hence, applying regret matching to this quantity yields the same policy that CFR would compute by applying regret matching to the unscaled regret values. Because this is done entirely from sampled trajectories, the algorithm is model-free and can be run with any black-box simulator of the environment; ARMAC therefore inherits the scaling potential of PSRO without requiring a best-response training regime, being driven instead by regret minimization.

Problem Statement. CFR is a tabular algorithm that enumerates the entire state space and has scaled to large games only through domain-specific (hand-crafted) state-space reductions. The problem is to define a model-free variant of CFR that uses only sampled trajectories and general (domain-independent) generalization via function approximation, without the importance sampling commonly used in Monte Carlo CFR, which can cause excessive variance in large domains.
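The invariance claimed above follows because regret matching normalizes positive regrets into a distribution, so scaling all regrets at a state by the same positive factor B(s) leaves the resulting policy unchanged. The sketch below illustrates this; the function name and array encoding are our own for illustration, not ARMAC's actual implementation:

```python
import numpy as np

def regret_matching_policy(advantages, legal_mask):
    """Derive a state policy from estimated cumulative regrets/advantages
    via regret matching: play each action in proportion to its positive regret."""
    pos = np.maximum(advantages, 0.0) * legal_mask  # zero out illegal actions
    total = pos.sum()
    if total > 0:
        return pos / total
    # No positive regret anywhere: fall back to uniform over legal actions.
    return legal_mask / legal_mask.sum()

legal = np.array([1.0, 1.0, 1.0])
r = np.array([2.0, -1.0, 1.0])          # unscaled regrets R_i(s, .)
w = 3.5 * r                             # scaled advantages W_i(s, .) = B(s) R_i(s, .)
assert np.allclose(regret_matching_policy(r, legal),
                   regret_matching_policy(w, legal))
```

The assertion makes the point concrete: regret matching over W_i(s, a) and over R_i(s, a) yields the identical policy, which is why predicting the scaled quantity suffices.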

2. Background

In this section, we describe the necessary terminology. Since we want to include the (partially-observable) multiagent case and we build on algorithms from regret minimization, we use extensive-form game notation (29). A single-player game represents the single-agent case, where histories are aggregated appropriately based on the Markov property. A game is a tuple (N, A, S, H, Z, u, τ), where N = {1, 2, . . . , n} is the set of players. By convention we use i ∈ N to refer to a player, and −i for the other players (N \ {i}). There is a special player c called chance (or nature) that plays with a fixed stochastic strategy (chance's fixed strategy determines the transition function). A is a finite set of actions. Every game starts in an initial state, and players sequentially take actions leading to histories of actions h ∈ H. Terminal histories, z ∈ Z ⊂ H, are those which end the episode. The utility function u_i(z) denotes player i's return over episode z. The set of states S is a partition of H where histories are grouped into information states s = {h, h′, . . .} such that the player to play at s, τ(s), cannot distinguish among the possible histories (world states) due to private information known only to other players¹. Let ∆(X) denote the set of all distributions over X: each player's (agent's) goal is to learn a policy π_i : S_i → ∆(A), where S_i = {s | s ∈ S, τ(s) = i}. For a state s, we denote by A(s) ⊆ A the legal actions at s; every valid state policy π(s) assigns probability 0 to illegal actions a ∉ A(s). As an illustrative example, Kuhn poker, shown in Figure 1, is a poker game with a three-card deck: Jack (J), Queen (Q), and King (K). Each player antes a single chip and has one more chip to bet with; each player is then dealt a single private card at random, one card is left face down, and players proceed to bet (b) or pass (p).
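The partition of histories into information states can be made concrete on Kuhn poker. The snippet below is purely illustrative (the `info_state` helper and the tuple encoding of histories are our own, not from any particular implementation): after the betting sequence pass-bet ("pb"), player 1 sees only her own card and the public bets, so each of her information states groups the two histories that differ only in the opponent's hidden card.

```python
from itertools import permutations

def info_state(history, player):
    """Information state for `player`: own card plus the public betting
    sequence; the opponent's card stays hidden."""
    cards, bets = history
    return (cards[player], bets)

# Deal two of {J, Q, K} to the players; a history is ((card0, card1), bets).
deals = list(permutations("JQK", 2))          # 6 possible deals
histories = [(deal, "pb") for deal in deals]  # player 0 passed, player 1 bet... wait,
                                              # here "pb" is simply a fixed public sequence.

# Group histories into information states for player 1.
states = {}
for h in histories:
    states.setdefault(info_state(h, 1), []).append(h)
```

Running this yields three information states for player 1 (one per card she can hold), each containing exactly two indistinguishable histories, matching the definition of S as a partition of H.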
¹ An information state is the belief about the world that a given player can infer from her limited observations; it may correspond to many possible histories (world states).

Initially the game

