The Advantage Regret-Matching Actor-Critic

Abstract

Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior: Advantage Regret-Matching Actor-Critic (ARMAC). Rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the application of importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially-observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.

1. Introduction

The notion of regret is a key concept in the design of many decision-making algorithms. Regret minimization drives most bandit algorithms, is often used as a metric for the performance of reinforcement learning (RL) algorithms, and is used for learning in games (3). When used in algorithm design, the common application is to accumulate values and/or regrets and derive new policies based on these accumulated values. One particular approach, counterfactual regret (CFR) minimization (35), has been the core algorithm behind super-human play in Computer Poker research (4; 25; 6; 8). CFR computes an approximate Nash equilibrium by having players minimize regret in self-play, producing an average strategy that is guaranteed to converge to an optimal solution in two-player zero-sum games and single-agent games.

We investigate the problem of generalizing these regret minimization algorithms over large state spaces in the sequential setting using end-to-end function approximators, such as deep networks. There have been several approaches that try to predict the regret, or otherwise simulate the regret minimization: Regression CFR (RCFR) (34), advantage regret minimization (17), regret-based policy gradients (30), Deep Counterfactual Regret minimization (5), and Double Neural CFR (22). All of these approaches have focused either on the multiagent or single-agent problem exclusively; some have used expert features, while others have relied on tree search to scale. Another common approach is based on fictitious play (15; 16; 21; 24), a simple iterative self-play algorithm based on best response. A common technique is to use reservoir sampling to maintain a buffer that represents a uniform sample over past data, which is used to train a classifier representing the average policy. In Neural Fictitious Self-Play (NFSP), this produced competitive policies in limit Texas Hold'em (16), and in Deep CFR this method was shown to approach an approximate equilibrium in a large subgame of Hold'em poker.
A generalization of fictitious play, policy-space response oracles (PSRO) (21), stores past policies and a meta-distribution over them, replaying policies against other policies and incrementally adding new best responses to the set. This can be seen as a population-based learning approach where the individuals are the policies and the distribution is modified based on fitness. This approach only requires simulating the policies and aggregating data; as a result, it was able to scale to a very large real-time strategy game (33).

In this paper, we describe an approximate form of CFR in a training regime that we call retrospective policy improvement. Similar to PSRO, our method stores past policies. However, it does not store meta-distributions or reward tables, nor do the policies have to be approximate best responses, which can be costly to compute or learn. Instead, the policies are snapshots of those used in the past, which are retrospectively replayed to predict a conditional advantage which, when used in a regret-matching algorithm, produces the same policy as CFR would. In the single-agent setting, ARMAC is related to Politex (1), except that it is based on regret-matching (14) and it predicts average quantities rather than explicitly summing over all the experts to obtain the policy. In the multiagent setting, it is a sample-based, model-free variant of RCFR with one important property: it uses trajectory samples to estimate quantities without requiring importance sampling as in standard Monte Carlo CFR (20); hence it does not suffer from excessive variance in large environments. This is achieved by using critics (value estimates) of past policies that are trained off-policy using standard policy evaluation techniques.
In particular, we introduce a novel training regime that estimates a conditional advantage $W_i(s, a)$, which is the cumulative counterfactual regret $R_i(s, a)$ scaled by a factor $B(s)$ that depends on the information state $s$ only; hence, using regret-matching over this quantity yields the policy that CFR would compute when applying regret-matching to the same (unscaled) regret values. By doing this entirely from sampled trajectories, the algorithm is model-free and can be run using any black-box simulator of the environment; hence, ARMAC inherits the scaling potential of PSRO without requiring a best-response training regime, driven instead by regret minimization.

Problem Statement. CFR is a tabular algorithm that enumerates the entire state space, and has scaled to large games through domain-specific (hand-crafted) state space reductions. The problem is to define a model-free variant of CFR using only sampled trajectories and general (domain-independent) generalization via function approximation, without the use of importance sampling commonly used in Monte Carlo CFR, as it can cause excessive variance in large domains.

2. Background

In this section, we describe the necessary terminology. Since we want to include the (partially-observable) multiagent case and we build on algorithms from regret minimization, we use extensive-form game notation (29). A single-player game represents the single-agent case where histories are aggregated appropriately based on the Markov property.

A game is a tuple $(N, A, S, H, Z, u, \tau)$, where $N = \{1, 2, \dots, n\}$ is the set of players. By convention we use $i \in N$ to refer to a player, and $-i$ for the other players ($N \setminus \{i\}$). There is a special player $c$ called chance (or nature) that plays with a fixed stochastic strategy (chance's fixed strategy determines the transition function). $A$ is a finite set of actions. Every game starts in an initial state, and players sequentially take actions leading to histories of actions $h \in H$. Terminal histories, $z \in Z \subset H$, are those which end the episode. The utility function $u_i(z)$ denotes player $i$'s return over episode $z$. The set of states $S$ is a partition of $H$ where histories are grouped into information states $s = \{h, h', \dots\}$ such that the player to play at $s$, $\tau(s)$, cannot distinguish among the possible histories (world states) due to private information only known by other players. Let $\Delta(X)$ represent all distributions over $X$: each player's (agent's) goal is to learn a policy $\pi_i : S_i \to \Delta(A)$, where $S_i = \{s \mid s \in S, \tau(s) = i\}$. For some state $s$, we denote by $A(s) \subseteq A$ the legal actions at state $s$, and all valid state policies $\pi(s)$ assign probability 0 to illegal actions $a \notin A(s)$.

We now show a diagram to illustrate the key ideas. Kuhn poker, shown in Figure 1, is a poker game with a 3-card deck: Jack (J), Queen (Q), and King (K). Each player antes a single chip and has one more chip to bet with, then gets a single private card at random while one card is left face down, and players proceed to bet (b) or pass (p).
Initially the game starts in the empty history $h_0 = \emptyset$ where no actions have been taken, and it is chance's turn to play. Suppose chance samples, according to a fixed distribution, one of its six actions, each corresponding to one of the size-2 permutations of deals (one card to each player). For example, suppose outcome 1Q2J is sampled, corresponding to the first player getting the queen and the second player getting the jack. This would correspond to a new history $h = (1Q2J)$. Label the information state corresponding to this history as $s$, depicted by the grey joined circles; it also contains $h' = (1Q2K)$. At this information state $s = \{h, h'\}$, it is the first player's turn ($\tau(s) = 1$) and it includes every history consistent with their information (namely, that they were dealt the queen). The legal actions are now $A(s) = \{p, b\}$. Suppose the first player chooses p and the second player chooses b; then the history is part of $s'$, the second information state shown in the figure. Finally, suppose the first player chooses to bet (call); then the first player would win, gaining 2 chips, since they have the higher-ranking card.

Each player $i$'s goal is to compute $\pi_i$ that achieves maximal reward in expectation, where the expectation is taken over all players' policies, even though player $i$ controls only their own policy. Hence, ideally, the player would learn a safe policy that guarantees the best worst-case scenario. Let $\pi$ denote a joint policy. Define the state-value $v_{\pi,i}(s)$ as the expected (undiscounted) return for player $i$ given that state $s$ is reached and all players follow $\pi$. Let $q_{\pi,i}(s, a)$ be defined similarly except also conditioned on player $\tau(s)$ taking action $a$ at $s$. Formally,

$$v_{\pi,i}(s) = \sum_{(h,z) \in Z(s)} \eta^{\pi}(h \mid s)\, \eta^{\pi}(h, z)\, u_i(z),$$

where $Z(s)$ are all terminal histories paired with their prefixes that pass through $s$, $\eta^{\pi}(h \mid s) = \frac{\eta^{\pi}(h)}{\eta^{\pi}(s)}$, where $\eta^{\pi}(s) = \sum_{h' \in s} \eta^{\pi}(h')$, and $\eta^{\pi}(h, z)$ is the product of probabilities of each action taken by the players' policies along $h$ to $z$.
The state-action values $q_{\pi,i}(s, a)$ are defined analogously. Standard value-based RL algorithms estimate these quantities for policy evaluation. Regret minimization in zero-sum games uses a different notion of value, the counterfactual value:

$$v^c_{\pi,i}(s) = \sum_{(h,z) \in Z(s)} \eta^{\pi}_{-i}(h)\, \eta^{\pi}(h, z)\, u_i(z),$$

where $\eta^{\pi}_{-i}(h)$ is the product of opponents' policy probabilities along $h$. We also write $\eta^{\pi}_{i}(h)$ for the product of player $i$'s own probabilities along $h$. Under the standard assumption of perfect recall, we have that for any $h, h' \in s$, $\eta^{\pi}_{i}(h) = \eta^{\pi}_{i}(h')$. Thus counterfactual values are formally related to the standard values (30): $v_{\pi,i}(s) = \frac{v^c_{\pi,i}(s)}{\beta_{-i}(\pi, s)}$, where $\beta_{-i}(\pi, s) = \sum_{h \in s} \eta^{\pi}_{-i}(h)$. Also, $q^c_{\pi,i}(s, a)$ is defined similarly except over histories $(ha, z) \in Z(s)$, where $ha$ is history $h$ concatenated with action $a$.

Counterfactual regret minimization (CFR) is a tabular policy iteration algorithm that has facilitated many advances in Poker AI (35). On each iteration $t$, CFR computes counterfactual values $q^c_{\pi^t,i}(s, a)$ and $v^c_{\pi^t,i}(s)$ for each state $s$ and action $a \in A(s)$, and the regret of not choosing action $a$ (or equivalently the advantage of choosing action $a$) at state $s$, $r^t(s, a) = q^c_{\pi^t,i}(s, a) - v^c_{\pi^t,i}(s)$. CFR tracks the cumulative regrets for each state and action, $R^T(s, a) = \sum_{t=1}^{T} r^t(s, a)$. Define $(x)^+ = \max(0, x)$; regret-matching then updates the policy of each action $a \in A(s)$ as follows (14):

$$\pi^{T+1}(s, a) = \text{NormalizedReLU}(R^T, s, a) = \begin{cases} \dfrac{R^{T,+}(s, a)}{\sum_{b \in A(s)} R^{T,+}(s, b)} & \text{if } \sum_{b \in A(s)} R^{T,+}(s, b) > 0, \\[2mm] \dfrac{1}{|A(s)|} & \text{otherwise.} \end{cases} \tag{1}$$

In two-player zero-sum games, the mixture policy $\bar{\pi}^T$ converges to the set of Nash equilibria as $T \to \infty$. Traditional (off-policy) Monte Carlo CFR (MCCFR) is a generic family of sampling variants (20).
In outcome sampling MCCFR, a behavior policy $\mu_i$ is used by player $i$, while players $-i$ use $\pi_{-i}$, a trajectory $\rho \sim (\mu_i, \pi_{-i})$ is sampled, and the sampled counterfactual value is computed:

$$\tilde{q}^c_{\pi,i}(s, a \mid \rho) = \frac{1}{\eta^{(\mu_i, \pi_{-i})}_{i}(z)}\, \eta^{(\mu_i, \pi_{-i})}_{i}(ha, z)\, u_i(z) \tag{2}$$

if $(s, a) \in \rho$, or 0 otherwise. $\tilde{q}^c_{\pi,i}(s, a \mid \rho)$ is an unbiased estimator of $q^c_{\pi,i}(s, a)$ (20, Lemma 1). However, since these quantities are divided by $\eta^{(\mu, \pi_{-i})}_{i}(z)$, the product of player $i$'s probabilities, (i) there can be significant variance introduced by sampling, especially in problems involving long sequences of decisions, and (ii) the ranges of the $\tilde{v}^c_i$ can vary wildly (and unboundedly if the exploration policy is insufficiently mixed) over iterations and states, which could make approximating the values in a general way particularly challenging (34). Deep CFR and Double Neural CFR are successful large-scale implementations of CFR with function approximation, and they get around this variance issue by using external sampling or a robust sampling technique, both of which require a perfect game model and enumeration of the tree. This is infeasible in very large environments or in the RL setting, where full trajectories are generated from beginning to end without access to a generative model that could be used to generate transitions from any state.
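The regret-matching update of Eq. 1 defined earlier in this section can be illustrated with a minimal tabular sketch. The function name `regret_matching` and the use of a boolean mask for the legal actions $A(s)$ are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
import numpy as np

def regret_matching(cum_regret, legal_mask):
    """Policy proportional to the positive part of the cumulative regrets.

    cum_regret: 1-D array holding R^T(s, a) for every action in A.
    legal_mask: boolean array, True where the action belongs to A(s).
    (Both argument conventions are illustrative assumptions.)
    """
    cum_regret = np.asarray(cum_regret, dtype=float)
    legal = np.asarray(legal_mask, dtype=bool)
    # (x)^+ = max(0, x), with illegal actions forced to zero.
    positive = np.where(legal, np.maximum(cum_regret, 0.0), 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # No positive regret: uniform over the legal actions, as in Eq. 1.
    return legal.astype(float) / legal.sum()
```

Called on the regrets `[2.0, -1.0, 2.0]` with all actions legal, the sketch puts equal mass on the first and last action; when no legal action has positive regret it falls back to the uniform policy over $A(s)$.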

2.1. Equilibria, Exploitability, and NashConv

In two-player zero-sum games (and, trivially, single-agent games) a Nash equilibrium policy is optimal because it maximizes a player's worst-case payoff (29). Success in Poker AI, leading to super-human ability, has largely been driven by computing approximate equilibria and playing the strategies against humans. A Nash equilibrium is a joint policy $\pi^* = (\pi^*_1, \pi^*_2)$ such that no player has incentive to deviate from their respective policy because there is no policy that can achieve higher utility against the opponent's policy. A best response for player $i$ is $b_i(\pi_{-i}) = \operatorname{argmax}_{\pi_i} u_i(\pi_i, \pi_{-i})$. Finally, define player $i$'s incentive to deviate (to a best response) as $\delta_i(\pi) = u_i(b_i(\pi_{-i}), \pi_{-i}) - u_i(\pi)$. Then, $\pi$ is a Nash equilibrium if and only if deviating to a best response does not raise a player's utility: $\forall i, \delta_i(\pi) = 0$. Here, the zero on the right-hand side represents not having any incentive to deviate. What if there is a small amount of incentive? The definition naturally extends to the approximate case where the right-hand side is non-zero. An empirical metric for how far an arbitrary policy is from a Nash equilibrium is then the sum over players: $\text{NashConv}(\pi) = \sum_i \delta_i(\pi) \ge 0$. Note that the maximal value for NashConv is twice the utility range (this would occur if each player uses a policy achieving the minimum utility, and there exists a best response which gets the maximum utility). In the poker literature there is a commonly used metric called exploitability, which computes the average rather than the sum: $\text{Exploitability}(\pi) = \frac{\sum_i \delta_i(\pi)}{2}$. These metrics measure the empirical distance to equilibrium over time, leading to an assessment of an algorithm's convergence rate in practice.

3. The Advantage Regret-Matching Actor-Critic

Algorithm 1: Advantage Regret-Matching Actor-Critic

    input: initial set of parameters $\theta^0$, num. players $n$
    Set initial learning player $i \leftarrow 1$
    for epoch $t \in \{0, 1, 2, \dots\}$ do
        reset $D \leftarrow \emptyset$
        Let $\pi^t(s) = \text{NormalizedReLU}(\bar{W}_{\theta^t}(s))$
        Let $v_{\theta^t}(h) = \sum_{a \in A(h)} \pi^t(h, a)\, q_{\theta^t}(h, a)$
        Let $\mu^t_i$ be a behavior policy for player $i$
        for episode $k \in \{1, \dots, K_{act}\}$ do
            $i \leftarrow (i + 1) \bmod n$
            Sample $j \sim \text{Unif}(\{0, 1, \dots, t - 1\})$
            Sample trajectory $\rho \sim (\mu_i, \pi^j_{-i})$
            let $d \leftarrow (i, j, \{u_i(\rho)\}_{i \in N})$
            for history $h \in \rho$ where player $i$ acts do
                let $s$ be the state containing $h$
                let $r = \{q_{\theta^j}(h, a') - v_{\theta^j}(h)\}_{a' \in A(s)}$
                let $a$ be the action that was taken in $\rho$
                append $(h, s, a, r, \pi^j$ […]
        […] $q_{\theta^t}(h, a)$
                If $\tau(s) = i$: train $\bar{W}_{\theta^t}$ to predict $A(h, a)$
                If $\tau(s) \in -i$: train $\bar{\pi}_{\theta^t}$ to predict $\pi^t(s)$
            end
        end
        Save $\theta^t$ for future retrospective replays; $\theta^{t+1} \leftarrow \theta^t$
    end

ARMAC is a model-free RL algorithm motivated by CFR. Like algorithms in the CFR framework, ARMAC uses a centralized training setup and operates in epochs that correspond to CFR iterations. Like RCFR, ARMAC uses function approximation to generate policies. ARMAC was designed so that as the number of samples per epoch increases and the expressiveness of the function approximator approaches a lookup table, the generated sequence of policies approaches that of CFR. Instead of accumulating cumulative regrets (which is problematic for a neural network), the algorithm learns a conditional advantage estimate $W(s, a)$ by regression toward a history-dependent advantage $A(h, a)$, for $h \in s$, and uses it to derive the next set of joint policies that CFR would produce. Indeed we show that $W(s, a)$ is an estimate of the cumulative regret $R(s, a)$ up to a multiplicative factor which is a function of the information state $s$ only, and thus cancels out during the regret-matching step. ARMAC is a Monte Carlo algorithm in the same sense as MCCFR: value estimates are trained from full episodes. It uses off-policy learning for training the value estimates (i.e. critics), which we show is sufficient to derive $W$.
However, contrary to MCCFR, it does not use importance sampling. ARMAC is summarized in Algorithm 1.

ARMAC runs over multiple epochs $t$ and produces a joint policy $\pi^{t+1}$ at the end of each epoch. Each epoch starts with an empty data set $D$ and simulates a variety of joint policies, executing multiple training iterations of the relevant function approximators. ARMAC trains several estimators, which can be either heads on the same neural network or separate neural networks. The first one estimates the history-action values $q_{\pi^t,i}(h, a) = \sum_{z \in Z(h,a)} \eta^{\pi^t}(h, z)\, u_i(z)$. This estimator (see footnote 2) can be trained on all previous data by using any off-policy policy evaluation algorithm from experiences stored in replay memory (we use Tree-Backup($\lambda$) (26)). If trained until zero error, this quantity would produce the same history value estimates as recursive CFR computes in its tree pass. Secondly, the algorithm also trains a state-action network $W^t_i(s, a)$ that estimates the expected advantage $A_{\mu^t,i}(h, a) = q_{\mu^t,i}(h, a) - v_{\mu^t,i}(h)$ conditioned on $h \in s$ when following some mixture policy $\mu^t$ (which will be precisely defined in Appendix B). It happens that $W^t_i(s, a)$ is an estimate of the cumulative regret $R^t(s, a)$ multiplied by a (non-negative) function which depends on the information state $s$ only, and thus does not impact the policy improvement step by regret-matching (see Lemma 1). Once $W^t_i(s, a)$ is trained, the next joint policy $\pi^{t+1}(s, a)$ can be produced by normalizing the positive part as in Eq. 1. After each training epoch the joint policy $\pi^t$ is saved into a past policy reservoir, as it will have to be loaded and played during future epochs. Lastly, an average policy head $\bar{\pi}^t$ is also trained via a classification loss to predict the policy $\pi^{t'}$ over all time steps $t' \le t$. We explain its use in Section 4.
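As a rough illustration of the regression target described above, the history-dependent advantage can be computed from a critic's action values and a policy at a single history. This is a minimal tabular sketch, not the paper's network code: the function name `advantage_targets` and the plain-array representation of $q(h, \cdot)$ and $\pi(h, \cdot)$ are assumptions.

```python
import numpy as np

def advantage_targets(q_values, policy):
    """Return A(h, a) = q(h, a) - v(h) for every action at one history h.

    q_values: critic estimates q(h, a), one entry per action.
    policy:   probabilities pi(h, a) of the evaluated policy at h.
    The value is v(h) = sum_a pi(h, a) * q(h, a), as in Algorithm 1.
    """
    q_values = np.asarray(q_values, dtype=float)
    policy = np.asarray(policy, dtype=float)
    v = float(np.dot(policy, q_values))
    return q_values - v
```

By construction the policy-weighted sum of these advantages is zero, which gives a quick sanity check on a critic head before the advantages are used as regression targets for the state-action network.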
Using a history-based critic allows ARMAC to avoid the importance weight (IW) based off-policy correction used in MCCFR, but at the cost of higher bias due to inaccuracies in the critic. Using IW may be especially problematic for long games. For large games the critic will inevitably rely on generalization to produce history-value estimates. To save memory, reservoir sampling with a buffer of size 1024 was used to prune past policies. The algorithm also works in the single-agent case by treating all opponent reach probabilities as 1. More details and results are given in Appendices C.1 and D. Our main theoretical result is that ARMAC learns a function $W^T$ which is a stand-in replacement for the cumulative regrets of CFR, $R^T$. See Appendix B for an analysis of ARMAC's theoretical properties. A worked-out example is given in Appendix A.

Footnote 2 (fragment): … of its effects by any of the players in the game. Thus, the critics represent an expectation over those hidden outcomes. Since this does not affect the theoretical results, we choose this notation for simplicity. Importantly, ARMAC remains model-free: we never enumerate chance moves explicitly nor evaluate their probabilities, which may be complex for many practical applications.

ARMAC dynamically switches between policies based on estimated returns. For every $t$ there is a pool of candidate policies, all based on the following four policies: (i) a random uniform policy; (ii) several policies defined by applying Eq. 1 over the current epoch's regret only ($q_{\theta^t}(h, a) - v_{\theta^t}(h)$), with different levels of random uniform exploration, $\epsilon \in \{0.0, 0.01, 0.05\}$; (iii) several policies defined by the mean regret, $\pi^t$ as stated in Algorithm 1, also with the same levels of exploration; (iv) the average policy $\bar{\pi}^t$ trained via classification. ARMAC generates experiences using those policies to facilitate exploration and to help produce meaningful data at the initial stages of learning, before average regrets are learnt. Each epoch, the candidate policies are ranked by cumulative return against an opponent playing $\bar{\pi}_{\theta^t}$. The one producing the highest rewards is used half of the time. When sub-optimal policies are run for players $-i$, they are not used to train mean regrets for player $i$, but are used to train the critic. Typically, (ii) produces the best policy initially and allows the learning process to be bootstrapped with the best data (Fig. 2). In later stages of learning, (iii) with the smallest $\epsilon$ yields better policies and gets consistently picked over other policies. The more complex the game is, the longer it takes for (iii) to take over from (ii).
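The candidate-pool mechanism described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the text does not say how the remaining half of the episodes is allocated, so the sketch assumes a uniform draw over all candidates.

```python
import numpy as np

def mix_with_uniform(policy, legal_mask, epsilon):
    """Blend a policy with uniform exploration over the legal actions."""
    legal = np.asarray(legal_mask, dtype=float)
    uniform = legal / legal.sum()
    return (1.0 - epsilon) * np.asarray(policy, dtype=float) + epsilon * uniform

def choose_candidate(estimated_returns, rng, p_best=0.5):
    """Index of the candidate policy to run for the next episode.

    The best-ranked candidate is used with probability p_best (one half in
    the text). ASSUMPTION: otherwise a candidate is drawn uniformly from
    the whole pool; the paper does not specify this part.
    """
    best = int(np.argmax(estimated_returns))
    if rng.random() < p_best:
        return best
    return int(rng.integers(len(estimated_returns)))
```

For example, `mix_with_uniform([1.0, 0.0, 0.0], [True, True, False], 0.05)` moves 2.5% of the probability mass onto the second legal action and leaves the illegal action at zero.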

3.1. Adaptive Policy Selection

The exploratory policy $\mu^T_i$ is constructed by taking the most recent neural network with 50% probability, or otherwise sampling one of the past neural networks uniformly, and then modulating it by the method described above.
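A minimal sketch of how past snapshots might be stored and sampled, combining the reservoir pruning of past policies mentioned in Section 3 with the 50% rule above. The class and function names are hypothetical, snapshots are treated as opaque objects, and the reservoir update shown is the textbook Algorithm R, which the paper does not confirm as its exact variant.

```python
import random

class PolicyReservoir:
    """Keeps a uniform sample of at most `capacity` past policy snapshots."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, snapshot):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(snapshot)
            return
        # Algorithm R: the new snapshot replaces a stored one with
        # probability capacity / seen, keeping the sample uniform.
        slot = self.rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = snapshot

    def sample(self):
        return self.rng.choice(self.items)

def pick_behavior_snapshot(latest, reservoir, rng, p_latest=0.5):
    """Most recent network with probability p_latest, else a past one."""
    if not reservoir.items or rng.random() < p_latest:
        return latest
    return reservoir.sample()
```

With `capacity=1024` this matches the buffer size quoted in the text; the exploration modulation would then be applied to whichever snapshot is returned.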

3.2. Network architecture

ARMAC can be used with both feed-forward (FF) and recurrent neural networks (RNN) (Fig. 6(a)). For small games where information states can be easily represented, FF networks were used. For larger games, where consuming observations rather than information states is more natural, RNNs were used. More details can be found in Appendix F.

4. Empirical Evaluation

For partially-observable multiagent environments, we investigate Imperfect Information (II-) Goofspiel, Liar's Dice, Leduc Poker, and betting-abstracted no-limit Texas Hold'em poker (in Section 4.1). Goofspiel is a bidding card game where players spend bid cards to collect points from a deck of point cards. Liar's Dice is a 1-die versus 1-die variant of the popular game where players alternate bidding on the dice values. Leduc poker is a two-round poker game with a 6-card deck, fixed bet amounts, and a limit on betting. Longer descriptions of each game can be found in (24). We use OpenSpiel (19) implementations with default parameters for Liar's Dice and Leduc poker, and a 5-card deck and descending points order for II-Goofspiel. To show empirical convergence, we use NashConv, the sum over each player's incentive to deviate to their best response unilaterally (21), which can be interpreted as an empirical distance from Nash equilibrium (reaching Nash at 0).

We compare empirical convergence to approximate Nash equilibria using a model-free sampled form of regression CFR (34) (MC-RCFR). Trajectories are obtained using outcome sampling MCCFR (20), which uses off-policy importance sampling to obtain unbiased estimates of immediate regrets $r$ and average strategy updates $\hat{s}$, and individual (learned) state-action baselines (27) to reduce variance. A regressor then predicts $R$ and a policy is obtained via Eq. 1, and similarly for the average strategy. Each episode, the learning player $i$ plays with an $\epsilon$-on-policy behavior policy (while opponent(s) play on-policy) and adds every datum $(s, r, \pi(R))$ to a data set, $D$, with a retention rule based on reservoir sampling so it approximates a uniform sample of all the data ever seen. MC-RCFR is related, but not equivalent to, a variant of DeepCFR (5) based on outcome sampling (OS-DeepCFR) (31).
Our results differ significantly from the OS-DeepCFR results reported in (31), and we discuss differences in assumptions and experimental setup from previous work in Appendix C. As with ARMAC, the input is raw bits with no expert features. We use networks with roughly the same number of parameters as the ARMAC experiments: feed-forward with 4 hidden layers of 128 units with concatenated ReLU (28) activations, trained using the Adam optimizer. We provide details of the sweep over hyper-parameters in Appendix C.

Next we compare ARMAC to NFSP (16), which combines fictitious play with deep neural network function approximators. Two data sets, $D^{RL}$ and $D^{SL}$, store transitions of sampled experience for reinforcement learning and supervised learning, respectively. $D^{RL}$ is a sliding window used to train a best response policy to $\bar{\pi}_{-i}$ via DQN. $D^{SL}$ uses reservoir sampling to train $\bar{\pi}_i$, an average over all past best response policies. During play, each agent mixes between its best response policy and average policy. This stabilizes learning and enables the average policies to converge to an approximate Nash equilibrium. Like ARMAC and MC-RCFR, NFSP does not use any expert features.

Convergence plots for MC-RCFR and NFSP are shown in Figure 3, and for ARMAC in Figure 4. NashConv values of ARMAC are lower (Liar's Dice) and higher (Goofspiel) than NFSP, but significantly lower than MC-RCFR in all cases. MC-RCFR results are consistent with the outcome sampling results in DNCFR (22). Both DNCFR and Deep CFR compensate for this problem by instead using external and robust sampling, which require a forward model. So, next we investigate the performance of ARMAC in a much larger game.

We ran ARMAC on no-limit Texas Hold'em poker, using the common {Fold, Call, Pot, All-in} (FCPA) action/betting abstraction. This game is orders of magnitude larger than the games used above ($\approx 4.42 \cdot 10^{13}$ information states).
Action abstraction techniques were used by all of the state-of-the-art Poker AI bots up to 2017. Modern search-based techniques of DeepStack (25) and Libratus (6) still include action abstraction in the search tree.

4.1. No-Limit Texas Hold'em

Computing NashConv requires traversing the whole game and querying the network at each information state. This becomes infeasible for large games. Instead, we use local best-response (LBR) (23). LBR is an exploiter agent that produces a lower bound on the exploitability: given some policy $\pi_{-i}$ it does a shallow search using the policy at opponent nodes, and a poker-specific heuristic evaluation at the frontier of the search. LBR found that previous competition-winning abstraction-based Poker bots were far more exploitable than first expected. In our experiments, LBR was limited to the betting abstractions FCPA and FC. We used three versions of LBR: LBR-FCPA, which uses all 4 actions within the abstraction; LBR-FC, which uses a more limited action set of {Fold, Call}; and LBR-FC12-FCPA34, which uses {Fold, Call} for the first two rounds and FCPA for the rest. We first computed the average return that an ARMAC-trained policy achieves against uniform random. Over 200000 episodes, the mean value was 516 (chips) ± 25 (95% c.i.). Similarly, we evaluated the policy against LBR-FCPA; it won 519 ± 81 (95% c.i.) per episode. Hence, LBR-FCPA was unable to exploit the policy. ARMAC also beat LBR-FC12-FCPA34 by 867 ± 87 (95% c.i.). To the best of our knowledge, this is the first time LBR has been used to approximate exploitability in any form of no-limit Texas Hold'em among this class of algorithms.
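For readers who want to reproduce the style of reporting used above (mean return ± a 95% confidence half-width), the following sketch shows one common way to compute it from per-episode returns. It assumes a normal approximation with the sample standard deviation; the paper does not state which interval construction was used, and the function name is hypothetical.

```python
import math

def mean_with_ci95(returns):
    """Sample mean and 95% half-width under a normal approximation.

    ASSUMPTION: half-width = 1.96 * s / sqrt(n), with s the unbiased
    sample standard deviation; this is not confirmed by the source.
    """
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((x - mean) ** 2 for x in returns) / (n - 1)
    half_width = 1.96 * math.sqrt(variance / n)
    return mean, half_width
```

The half-width shrinks with the square root of the number of episodes, which is why a large number of evaluation episodes is needed to obtain tight intervals when per-episode poker returns are highly variable.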

5. Conclusion and Future Work

ARMAC was demonstrated to work on both single-agent and multi-agent benchmarks. It brings back ideas from computational game theory to address exploration issues while at the same time being able to handle learning in non-stationary environments. As future work, we intend to apply it to more general classes of multiagent games; ARMAC has the appealing property that it already stores the joint policies and history-based critics, which may be sufficient for convergence to one of the classes of extensive-form correlated equilibria (10; 12; 11).

A Worked-out Example

We now show an example of how ARMAC works on the simple game of Kuhn poker, shown in Figure 1. Suppose ARMAC has already run for $t = 50$ epochs, so 50 networks have been saved, and the exploring player is the first player, $i = 1$. The first player uses an exploratory behavior policy $\mu^t_i$ as described above. The second player uses network $j = 17$ sampled from $\text{Unif}(\{0, 1, \dots, 49\})$. For this episode, chance samples 1Q2K. This happens with probability one sixth, so $\eta_{-i}(h) = \frac{1}{6}$ (chance is always seen as an opponent with a fixed policy), whereas player 1 has not taken any actions, so $\eta_i(h) = 1$. Along this episode $\rho$, the first player samples actions according to $\mu_i$ and the second player according to $\pi^{17}_{-i}$. Suppose then that player 1 samples bet and player 2 samples bet (call), leading to $u_1(\rho) = -2$ for player 1 and $u_2(\rho) = 2$ for player 2. There are two histories traversed; call them $h$ and $h'$ respectively. For each one, the regret vectors $r$ are determined by the critics, $q_{\theta^j}(h, a') - v_{\theta^j}(h)$, where $a'$ is one of the two legal actions. Trajectory $\rho$ is added to the buffer $D$ and many similar episodes take place.

Finally, in the learning phase, ARMAC uses all the data collected to train the critics using standard $\ell_2$ regression losses on the TD error defined by TB($\lambda$); all the data can be used because TB is off-policy, allowing the exploratory behavior $\mu^{50}_i$. Suppose examples from the first trajectory $\rho$ are sampled: only data from the first player (history $h$) are used to train $\bar{W}_{\theta^t}$ on the advantage $A(h, a)$ using a standard regression loss; this is because, to asymptotically approach CFR, only the exploring player can train regrets, leading to a scaling constant that is a function only of the information state (for more detail, see Appendix B).
Finally, only the second player's actions are used to train the average network $\bar{\pi}$ using a classification loss, as the second player in $\rho$ was playing according to CFR's average policy across 50 epochs (due to sampling $j$ uniformly and then playing $\pi^j_{-i}$ without exploration).

B Theoretical Properties

Each epoch $t$ estimates $q_{\pi^t,i}(h, a) = \sum_{z \in Z(h,a)} \eta^{\pi^t}(h, z)\, u_i(z)$ and the value $v_{\pi^t,i}(h) = \sum_a \pi^t(h, a)\, q_{\pi^t,i}(h, a)$ for the current policies $\pi^t$. Let us write the advantages $A_{\pi^t,i}(h, a) = q_{\pi^t,i}(h, a) - v_{\pi^t,i}(h)$. Notice that we learn functions of the history $h$ and not of the state $s$.

At epoch $T$, in order to deduce the next policy, $\pi^{T+1}$, CFR applies regret-matching using the cumulative counterfactual regret $R^T_i(s, a)$. As already discussed, directly estimating $R^T_i$ using sampling suffers from high variance due to the inverse probability $\eta^{(\mu, \pi_{-i})}_{i}(z)$ in (2). Instead, ARMAC trains a network $\bar{W}^T_i(s, a)$ that estimates a conditional advantage along trajectories generated in the following way. For player $i$ we select a behavior policy $\mu^T_i$ providing good state-space coverage, e.g. a mixture of past policies $(\pi^t_i)_{t \le T}$, with some added exploration (Section 3.1 provides more details). For the other players $-i$, for every trajectory, we choose one of the previous opponent policies $\pi^j_{-i}$ played at some epoch $j$ chosen uniformly at random from $\{1, 2, \dots, T\}$. Thus at epoch $T$, several trajectories $\rho^j$ are generated by following policy $(\mu^T_i, \pi^j_{-i})$, where $j \sim U(\{1, 2, \dots, T\})$. Then at each step $(h, a)$ along these trajectories $\rho^j$, the neural network estimate $\bar{W}^T_i(s, a)$ (where $s \ni h$) is trained to predict the advantage $A_{\pi^j,i}(h, a)$ using the empirical $\ell_2$ loss: $\hat{L} = \big(\bar{W}^T_i(s, a) - A_{\pi^j,i}(h, a)\big)^2$. Thus the corresponding average loss is

$$L = \frac{1}{T} \sum_{j=1}^{T} E_{\rho^j \sim (\mu^T_i, \pi^j_{-i})}\big[\hat{L}\big] = \frac{1}{T} \sum_{j=1}^{T} \sum_{s \in S_i} \sum_{h \in s} \eta^{(\mu^T_i, \pi^j_{-i})}(h)\, \mu^T_i(s, a)\, \big(\bar{W}^T_i(s, a) - A_{\pi^j,i}(h, a)\big)^2.$$
If the network has sufficient capacity, it will minimize this average loss, and W T i (s, a) will converge (when the number of trajectories goes to ∞) in each state-action pair (s, a), such that the reach probability 1 T t η (µ T i ,π t -i ) (s)µ T i (s, a) > 0, to the conditional expectation W T i (s, a) = h∈s 1 T T j=1 η (µ T i ,π j -i ) (h)A π k ,i (h, a) 1 T T j=1 η (µ T i ,π t -i ) (s) = perfect recall h∈s 1 T T j=1 η π j -i (h)A π k ,i (h, a) 1 T T j=1 η π t -i (s) Notice that W T i does not depend on the exploratory policy µ T i for player i chosen in round T . After several trajectories ρ j our network W T i provides us with a good approximation of the W T i values and we use it in a regret matching update to define the next policy, π T +1 i (s) = NormalizedReLU( W T i ), i.e. Equation 1. Lemma 1 shows that if W T i (s, a) is sufficiently close to the W T i (s, a) values, then this is equivalent to CFR, i.e., doing regret-matching using the cumulative counterfactual regret R T . Lemma 1. The policy defined by NormalizedReLU(W T i ) is the same as the one produced by CFR when regret matching is employed as the information-state learner: π T +1 i (s, a) = R T,+ i (s, a) b R T,+ i (s, b) = W T,+ i (s, a) b W T,+ i (s, b) . Proof. First, let us notice that W T i (s, a) = h∈s T t=1 η π t (h) T t=1 η π t (s) A π t ,i (h, a), = h∈s T t=1 η π t -i (h) T t=1 η π t -i (s) A π t ,i (h, a) = 1 w T (s) T t=1 h∈s η π t -i (h)A π t ,i (h, a), where we used the perfect recall assumption in the first derivation, and we define w T (s) = t η π t -i (s). Notice that w T (s) depends on the state only (and not on h). Now the cumulative regret is: R T i (s, a) = K t=1 q c π t ,i (s, a) -v c π t ,i (s) = T t=1 η π t -i (s) q π t ,i (s, a) -v π k ,i (s) = T t=1 η π t -i (s) h∈s η π t -i (h) η π t -i (s) q π t ,i (h, a) -v π t ,i (h) = T t=1 h∈s η π t -i (h)A π t ,i (h, a) = w T (s)W T i (s, a). 
Finally, noticing that regret matching is not impacted by multiplying the cumulative regret by a positive function of the state, we deduce R T,+ i (s, a) b R T,+ i (s, b) = w T (s)W T i (s, a) + b w T (s)W T i (s, b) + = W T,+ i (s, a) b W T,+ i (s, b) . The W T (s, a) estimates the expected advantages 1 T T j=1 A π j (h, a) conditioned on h ∈ s. Thus ARMAC does not suffer from the variance of estimating the cumulative regret R T (s, a), and in the case of infinite capacity, from any (s, a), the estimate W T (s, a) is unbiased as soon as the (s, a) has been sampled at least once: Proof. The empirical loss being quadratic, under the event {N (s, a) > 0}, its minimum is well defined and reached for Ŵ T i (s, a) = 1 N (s, a) N (s,a) n=1 A π jn ,i (h n , a), where h n ∈ s is the history of the n-th trajectory traversing s. Let us use simplified notations and write A n = A π jn ,i (h, a)I{(h, a) ∈ ρ jn and h ∈ s} and b n = I{(h, a) ∈ ρ jn and h ∈ s}. Thus E Ŵ T i (s, a)I N m=1 b m > 0 = E N n=1 A n I N m=1 b m > 0 N m=1 b m = N n=1 E E A n I N m=1 b m > 0 N m=1 b m N m=1 b m = N n=1 E E A n N m=1 b m I N m=1 b m > 0 N m=1 b m . Now, E A n N m=1 b m = E A n |b n E b n | N m=1 b m since given b n , A n is independent of N m=1 b m . Thus E Ŵ T i (s, a)I N m=1 b m > 0 = N n=1 E A n |b n E E E b n N m=1 b m I N m=1 b m > 0 N m=1 b m = N n=1 E A n |b n E b n I N m=1 b m > 0 N m=1 b m Since N n=1 E bnI N m=1 bm>0 N m=1 bm = E N n=1 bn N m=1 bm I N m=1 b m > 0 = P N m=1 b m > 0 , by a symmetry argument we deduce E bnI N m=1 bm>0 N m=1 bm = 1 N P N m=1 b m > 0 for each n. Thus E Ŵ T i (s, a) N (s, a) > 0 = E Ŵ T i (s, a) N m=1 b m > 0 = E Ŵ T i (s, a)I N m=1 b m > 0 P N m=1 b m > 0 = 1 N N n=1 E[A n |b n ] = E[A 1 |b 1 ] which is the expectation of the advantage A π j ,i (h, a) conditioned on the trajectory ρ j going through h ∈ s, i.e. W T i (s, a) as defined in (3). 
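The identity behind Lemma 1, $R^T_i(s,a) = w^T(s)\, W^T_i(s,a)$, and the resulting equality of the two regret-matching policies can be checked numerically on a small tabular example. The sketch below assumes randomly generated opponent reach probabilities and advantages for a single information state, and assumes a uniform fallback when no regret is positive (a common convention; Equation 1 is not reproduced in this appendix). The function names are illustrative.

```python
import numpy as np


def normalized_relu(x):
    """Regret matching: positive parts normalized to sum to one.

    Assumption: fall back to the uniform policy when no entry is positive.
    """
    pos = np.maximum(x, 0.0)
    total = pos.sum()
    if total <= 0.0:
        return np.full(pos.shape, 1.0 / pos.size)
    return pos / total


def cumulative_regret_and_w(opp_reach, advantages):
    """opp_reach[t, h]: opponent reach of history h in s at epoch t.
    advantages[t, h, a]: advantage of action a at history h under pi^t.
    Returns R(s, .), w(s) and W(s, .) = R(s, .) / w(s).
    """
    regret = np.einsum("th,tha->a", opp_reach, advantages)
    w = opp_reach.sum()  # sum over t of the opponent reach of s
    return regret, w, regret / w


rng = np.random.default_rng(0)
opp_reach = rng.uniform(0.1, 1.0, size=(3, 2))  # T = 3 epochs, 2 histories in s
advantages = rng.normal(size=(3, 2, 3))         # 3 actions
R, w, W = cumulative_regret_and_w(opp_reach, advantages)
policy_from_R = normalized_relu(R)
policy_from_W = normalized_relu(W)
```

Because `w` is a positive scalar that depends only on the state, `policy_from_R` and `policy_from_W` coincide up to floating-point error.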

C Baseline Details and Hyperparameters

For MC-RCFR, we sweep over all combinations of the exploration parameter, the use of a (learned) state-action baseline (27), and the learning rate, $(\epsilon, b, \alpha) \in \{0.25, 0.5, 0.6, 0.7, 0.9, 0.95, 1.0\} \times \{\text{True}, \text{False}\} \times \{0.0001, 0.00005, 0.00001\}$, where each combination is averaged over five seeds. We found that higher exploration values worked consistently better, which matches the motivation of the robust sampling technique (corresponding to $\epsilon = 1$) presented in (22), as it leads to reduced variance since part of the correction term is constant for all histories in an information state. The baseline helped significantly in the larger game with more variable-length episodes. For NFSP, we keep a set of hyperparameters fixed, in line with (21) and (16), including the anticipatory parameter $\eta = 0.1$.

3. MC-RCFR uses standard outcome sampling rather than Linear CFR (7).
4. MC-RCFR's strategy is approximated by predicting the OS's average strategy increment rather than sampling from a buffer of previous models. Our NFSP also does not use any extra enhancements.
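For concreteness, the MC-RCFR sweep described above can be enumerated as a Cartesian product. The snippet is a sketch only; the dictionary keys and the function name are illustrative choices rather than names from the original experiments.

```python
from itertools import product

EXPLORATION = (0.25, 0.5, 0.6, 0.7, 0.9, 0.95, 1.0)  # epsilon
USE_BASELINE = (True, False)                          # b
LEARNING_RATES = (0.0001, 0.00005, 0.00001)           # alpha
NUM_SEEDS = 5


def mc_rcfr_sweep():
    """All (epsilon, baseline, learning rate) combinations of the sweep."""
    return [
        {"epsilon": eps, "baseline": b, "learning_rate": lr}
        for eps, b, lr in product(EXPLORATION, USE_BASELINE, LEARNING_RATES)
    ]


configs = mc_rcfr_sweep()
total_runs = len(configs) * NUM_SEEDS
```

The grid has 7 × 2 × 3 = 42 combinations, so averaging each over five seeds corresponds to 210 runs per game.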

C.1 Single-Agent Environments

Despite ARMAC being based on commonly-used multiagent algorithms, it has properties that may be desirable in the single-agent setting. First, similar to policy gradient algorithms in the common "short corridor example" (32, Example 13.1), stochastic policies are representable by definition, since they are normalized positive mean regrets over the actions. This could have the practical effect that entropy bonuses typically have in policy gradient methods, but rather than simply adding arbitrary entropy, the relative regret over the set of past policies is taken into account. Second, a retrospective agent uses a form of directed exploration of different exploration policies (2). Here, this is achieved by the simulation $(\mu^T_i, \pi^t_{-i})$, which could be desirable whenever there is overlapping structure in successive tasks. Here $\mu^T_i$ is an exploratory policy consisting of a mixture of all past policies (plus random uniform), further modulated with different amounts of random uniform exploration (more details are given in Section 3.1).

Consider the gridworld illustrated in Fig. 6 (b). Green squares mark positions where agent $i$ receives a reward and the game terminates. Most RL algorithms would find the reward of +1 first, as it is the closest to the origin S. Once this reward is found, a policy would quickly learn to approach it, and finding the reward of +2 would be problematic. ARMAC, in the meantime, would keep re-running old policies, some of which pre-date finding the reward of +1, and would thus have a reasonable chance of finding +2 by random exploration. This behaviour may also be useful if, instead of terminating the game, reaching one of those two rewards started a next level, both of which would have to be explored. These properties are not necessarily specific to ARMAC. For example, Politex (another retrospective policy improvement algorithm (1)) has similar properties because it keeps its past approximators intact.

Like Politex, we show an initial investigation of ARMAC in Atari in Appendix D. Average strategy sampling MCCFR (13) also uses exploration policies that are a mixture of previous policies and uniform random to improve performance over the external and outcome sampling variants. However, this exact sampling method cannot be used directly in ARMAC, as it requires a model of the game.
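One possible reading of the exploratory behavior policy described in this section is sketched below. It assumes that a mixture component (one past policy, or uniform random) is drawn once per episode and that every action is then additionally mixed with uniform noise of strength `epsilon`. The exact scheme is defined in Section 3.1, so the structure, the function names and the representation of policies as callables returning probability lists are all assumptions made for illustration.

```python
import random


def uniform_policy(num_actions):
    return [1.0 / num_actions] * num_actions


def sample_episode_policy(past_policies, num_actions, rng):
    """Pick one mixture component: a past policy or the uniform random policy."""
    components = list(past_policies) + [lambda state: uniform_policy(num_actions)]
    return rng.choice(components)


def modulate(probs, epsilon):
    """Mix action probabilities with uniform exploration of strength epsilon."""
    n = len(probs)
    return [(1.0 - epsilon) * p + epsilon / n for p in probs]


def act(policy, state, epsilon, rng):
    """Sample an action index from the epsilon-modulated policy at `state`."""
    probs = modulate(policy(state), epsilon)
    return rng.choices(range(len(probs)), weights=probs)[0]
```

Because old policies remain in the set of components, behavior that pre-dates the discovery of the nearby reward keeps being replayed, which is the effect the gridworld example relies on.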

D Initial Investigation of ARMAC in the Atari Learning Environment

While performance on Atari is not the main contribution, it should be treated as a health check of the algorithm. Unlike the previously tested multiplayer games, many Atari games have a long-term credit assignment problem. Some of them, like Montezuma's Revenge, are well-known hard exploration problems. It is interesting to see that ARMAC was able to consistently score 2500 points on Montezuma's Revenge despite not using any auxiliary rewards, demonstrations, or distributional RL as critic. We hypothesize that regret matching may be advantageous for exploration, as it provides naturally stochastic policies which stay stochastic until the regrets for other actions become negative. We also tested the algorithm on Breakout, as it is a fine control problem. We are not claiming that our results on Atari are state of the art; they should be interpreted as a basic sanity check showing that ARMAC could in principle work in this domain.

E Training

Training is done by processing a batch of 64 trajectories of length 32 at a time. To implement full recall, all unfinished episodes are continued on the next training iteration by propagating the recurrent network states forward. Whenever an episode finishes at a particular batch entry, a new one is sampled and unrolled from the beginning.
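The chunked unrolling with carried-over recurrent state can be sketched for a single batch entry as follows. This is a simplified illustration: `step_fn` stands in for the recurrent network's state update, and the handling of the final, shorter chunk of an episode is an assumption, since the text does not specify it.

```python
def slot_chunks(episode, unroll_length, step_fn, initial_state):
    """Yield (chunk, state_at_chunk_start) pairs for one batch entry.

    The recurrent state reached at the end of a chunk is carried into the
    next chunk of the same episode, so no context is lost between chunks.
    """
    state = initial_state
    for start in range(0, len(episode), unroll_length):
        chunk = episode[start:start + unroll_length]
        yield chunk, state
        for observation in chunk:
            state = step_fn(state, observation)


# Toy usage: the "recurrent state" is a running sum of integer observations.
chunks = list(slot_chunks(list(range(70)), 32, lambda s, o: s + o, 0))
```

A full batch would run 64 such entries in parallel, starting a fresh episode (with `initial_state`) in an entry whenever its current episode ends.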



An information state is the belief about the world that a given player can infer based on her limited observations, and it may correspond to many possible histories (world states).

In practice, rather than using h as input to our approximators, we use a concatenation of all players' observations, i.e. an encoding of the augmented information states or action-observation histories (9; 18). In some games this is sufficient to recover a full history. In others, where there is state hidden from all players, we can consider any chance event to be delayed until the first observation



Figure 1: A part of Kuhn poker. Terminal utilities shown for the first player.

(s)) to d
    end
    add d to D
end
for learning step k ∈ {1, . . . , K_learn} do
    Sample a random episode/batch d ∼ Unif(D):
    for history and corresponding state (h, s) ∈ d do
        Use TB(λ) to train critic

Figure 2: Average reward per modulation scored against opponent $\pi^t$ as a function of time (measured in acting steps). The brown curve is a random uniform policy (i). Cyan, orange and blue are (ii) with $\epsilon \in \{0.0, 0.01, 0.05\}$ respectively. Pink, green and yellow are (iii) with $\epsilon \in \{0.0, 0.01, 0.05\}$.

Figure 3: NFSP and MC-RCFR on Leduc Poker, II-Goofspiel with 5 cards, and Liar's Dice.

Figure 4: ARMAC results on Leduc, II-Goofspiel, and Liar's Dice. The y-axis is the NashConv of the average strategy $\pi^t$. The x-axis is the number of epochs. One epoch consists of 100 learning steps. Each learning step processes 64 trajectories of length 32 sampled from replay memory. The final values reached by the best runs are 0.18 (Leduc), 0.5 (II-Goofspiel), and 0.095 (Liar's Dice).

Figure 5: ARMAC results in No-Limit Texas Hold'em trained with the FCPA action abstraction and evaluated using the LBR-FC metric. The y-axis represents the amount LBR-FC wins against the ARMAC-trained policy. The x-axis indicates days of training. The left graph shows the learning curve on a linear scale, while the right one shows the same curve on a log-log scale.

Consider the case of a tabular representation and define the estimate $\hat{W}^T_i(s,a)$ as the minimizer (over $W$) of the empirical loss defined over $N$ trajectories,

$$\hat{L}_{(s,a)}(W) = \frac{1}{N} \sum_{n=1}^{N} \left( W - A_{\pi^{j_n},i}(h,a) \right)^2 \mathbb{I}\{(h,a) \in \rho_{j_n} \text{ and } h \in s\},$$

where $\rho_{j_n}$ is the $n$-th trajectory generated by the policy $(\mu^T_i, \pi^{j_n}_{-i})$, with $j_n \sim U(\{1, \dots, T\})$. Define $N(s,a) = \sum_{n=1}^{N} \mathbb{I}\{(h,a) \in \rho_{j_n} \text{ and } h \in s\}$ to be the number of trajectories going through $(s,a)$. Then $\hat{W}^T_i(s,a)$ is an unbiased estimate of $W^T_i(s,a)$ conditioned on $(s,a)$ being traversed at least once:

$$\mathbb{E}\big[ \hat{W}^T_i(s,a) \mid N(s,a) > 0 \big] = W^T_i(s,a).$$
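The unbiasedness statement above can be verified exactly on a toy model by enumerating all outcomes with rational arithmetic. The model is an assumption made purely for illustration: each trajectory draws an epoch $j$ uniformly, passes through $(s,a)$ with a given probability `reach[j]`, and, when it does, observes a fixed advantage `adv[j]`.

```python
from fractions import Fraction
from itertools import product


def conditional_advantage(reach, adv):
    """W = E[A | trajectory goes through (s, a)] with j drawn uniformly."""
    return sum(r * a for r, a in zip(reach, adv)) / sum(reach)


def expected_tabular_estimate(reach, adv, n_traj):
    """Exact E[W_hat | N(s, a) > 0], enumerating all outcomes of n_traj trajectories."""
    num_epochs = len(reach)
    outcomes = []  # per-trajectory outcomes: (probability, b, A)
    for r, a in zip(reach, adv):
        outcomes.append((Fraction(1, num_epochs) * r, 1, a))
        outcomes.append((Fraction(1, num_epochs) * (1 - r), 0, Fraction(0)))
    weighted_sum = Fraction(0)  # E[W_hat * 1{N > 0}]
    prob_hit = Fraction(0)      # P(N > 0)
    for combo in product(outcomes, repeat=n_traj):
        prob, count, total = Fraction(1), 0, Fraction(0)
        for p, b, a in combo:
            prob *= p
            count += b
            total += a
        if count > 0:
            weighted_sum += prob * total / count  # W_hat is the mean over hits
            prob_hit += prob
    return weighted_sum / prob_hit


reach = [Fraction(1, 2), Fraction(1, 5), Fraction(3, 4)]
adv = [Fraction(2), Fraction(-1), Fraction(1, 2)]
exact_w = conditional_advantage(reach, adv)
expected_w_hat = expected_tabular_estimate(reach, adv, n_traj=3)
```

In this toy model `expected_w_hat` equals `exact_w` exactly, regardless of the number of trajectories, which is what the symmetry argument in the proof predicts.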

Figure 6: The (a) Multi-headed network architecture, and (b) Exploration example.

Figure 7: Performance on Breakout (left) and Montezuma's Revenge (right). Results are shown for two seeds.

Interestingly, ARMAC learned to beat those two versions of LBR surprisingly quickly. A randomly initialized ARMAC network lost against LBR-FCPA by -704 ± 191 (95% c.i.) and against LBR-FC12-FCPA34 by -230 ± 222 (95% c.i.), but was beating both after a mere 1 hour of training, by 561 ± 163 (95% c.i.) and 427 ± 140 (95% c.i.) respectively (3 million acting steps, 11 thousand learning steps). Counter-intuitively, ARMAC was exploited by LBR-FC, which uses a more limited action set. ARMAC scored -46 ± 26 (95% c.i.) per episode after 18 days of training on a single GPU, 1.3 billion acting steps (rounds), 5 million learning steps, 50000 CFR epochs (Figure

The fixed NFSP hyperparameters are: anticipatory parameter $\eta = 0.1$, $\epsilon$-greedy decay duration 20M steps, reservoir buffer capacity 2M entries, and replay buffer capacity 200k entries. We sweep over combinations of the following hyperparameters: $\epsilon$-greedy starting value {0.06, 0.24}, RL learning rate {0.1, 0.01, 0.001}, SL learning rate {0.01, 0.001, 0.005}, and DQN target network update period of {1000, 19200} steps (the latter is equivalent to 300 network-parameter updates). Each combination was averaged over three seeds. Agents were trained with the ADAM optimizer, using MSE loss for DQN and one gradient update step with mini-batch size 128 every 64 steps in the game.
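The reservoir buffer mentioned above keeps a uniform sample over everything it has seen so far. A minimal sketch of the standard reservoir sampling scheme (Algorithm R) is shown below; the class and attribute names are illustrative and are not taken from the NFSP implementation used in the experiments.

```python
import random


class ReservoirBuffer:
    """Fixed-capacity buffer holding a uniform sample of all items added so far."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = []
        self.num_seen = 0
        self.rng = rng or random.Random()

    def add(self, item):
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / num_seen,
        # replacing a uniformly chosen stored item.
        index = self.rng.randrange(self.num_seen)
        if index < self.capacity:
            self.items[index] = item

    def sample(self, batch_size):
        """Draw a batch of distinct stored items uniformly at random."""
        return self.rng.sample(self.items, batch_size)


buffer = ReservoirBuffer(capacity=10, rng=random.Random(0))
for step in range(1000):
    buffer.add(step)
```

In the experiments the capacity is 2M entries; the tiny capacity here only keeps the example quick to run.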


The Adam optimizer with $\beta_1 = 0.0$ and $\beta_2 = 0.999$ was used for optimization. Hyperparameter selection was done by trying only two learning rates: $5 \cdot 10^{-5}$ and $2 \cdot 10^{-4}$. The results reported use $5 \cdot 10^{-5}$ in all games, including Atari.
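To make the effect of $\beta_1 = 0.0$ concrete, the following sketch writes out one textbook Adam update in NumPy. With no first-moment averaging, the update direction is the current gradient scaled by the running second-moment estimate. The value `eps=1e-8` is a common default assumed here, not a setting reported in the text, and `adam_step` is an illustrative name.

```python
import numpy as np


def adam_step(params, grads, m, v, step,
              lr=5e-5, beta1=0.0, beta2=0.999, eps=1e-8):
    """One textbook Adam update; `step` counts from 1."""
    m = beta1 * m + (1.0 - beta1) * grads       # equals grads when beta1 = 0
    v = beta2 * v + (1.0 - beta2) * grads ** 2
    m_hat = m / (1.0 - beta1 ** step)           # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    new_params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return new_params, m, v


params = np.zeros(3)
grads = np.array([0.5, -2.0, 1.0])
new_params, m, v = adam_step(params, grads, np.zeros(3), np.zeros(3), step=1)
```

On the very first step the bias-corrected second moment equals the squared gradient, so each parameter moves by roughly the learning rate in the direction opposite to the sign of its gradient.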

F Neural Network Architecture

The following recurrent neural network was used for the no-limit Texas Hold'em experiments. Two separate recurrent networks with shared parameters were used, each consuming the observations of one player. Each of those networks consisted of a single linear layer mapping the input representation to a vector of size 256. This was followed by a double rectified linear unit, producing a representation of size 512, and then by an LSTM with 256 hidden units. This produced an information state representation for each player, $s_0$ and $s_1$.

Define architecture $B(x)$, which will be reused several times. It consumes one of the information state representations produced by the previously mentioned RNN.

The immediate regret head is formed by applying $B(s)$ to the information state representation, followed by a single linear layer whose size is the number of actions in the game. The same is done for the average regret head and the mean policy head. These $B(s)$ do not share weights among themselves, but do share weights with the respective heads for the other player.

The global critic $q(h)$ is defined in the following way: $n_A = \text{Linear}(128)$, $n_B = \text{Linear}(128)$, $a_0 = n_A(s_0) + n_B(s_1)$, $a_1 = n_B(s_0) + n_A(s_1)$, $h_1 = \text{Concat}(a_0, a_1)$, $h_2 = B(h_1)$, and finally $q_0(s_0, s_1)$ and $q_1(s_0, s_1)$ are evaluated by two linear layers on top of $h_2$. This $B(x)$ shares its architecture but not its parameters with the ones used previously.
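The input stage of the global critic can be sketched with NumPy to show the role-swapping structure of $n_A$ and $n_B$. Several points are assumptions: the second combination is written as `a1` (so that the concatenation of two distinct vectors is well defined), the "double rectified linear unit" is read as concatenating `relu(x)` and `relu(-x)`, weights are random placeholders, and the block $B(x)$ is omitted because its definition is not given here.

```python
import numpy as np


def make_linear(rng, in_dim, out_dim):
    """A randomly initialized linear layer (placeholder weights)."""
    weight = rng.normal(scale=0.1, size=(in_dim, out_dim))
    bias = np.zeros(out_dim)
    return lambda x: x @ weight + bias


def double_relu(x):
    """Assumed reading of 'double rectified linear unit': [relu(x), relu(-x)]."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)


def critic_input(n_a, n_b, s0, s1):
    """h_1 = Concat(a_0, a_1) built from both players' information states."""
    a0 = n_a(s0) + n_b(s1)
    a1 = n_b(s0) + n_a(s1)
    return np.concatenate([a0, a1], axis=-1)


rng = np.random.default_rng(0)
n_a = make_linear(rng, 256, 128)
n_b = make_linear(rng, 256, 128)
s0, s1 = rng.normal(size=256), rng.normal(size=256)
h1 = critic_input(n_a, n_b, s0, s1)
h1_swapped = critic_input(n_a, n_b, s1, s0)
```

Swapping the two players' information states simply swaps the two halves of `h1`, so the same parameters serve both seats.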

