EFFICIENT COMPETITIVE SELF-PLAY POLICY OPTIMIZATION

Abstract

Reinforcement learning from self-play has recently achieved many successes. Self-play, where the agents compete with themselves, is often used to generate training data for iterative policy improvement. In previous work, heuristic rules are designed to choose an opponent for the current learner. Typical rules include choosing the latest agent, the best agent, or a random historical agent. However, these rules may be inefficient in practice and sometimes do not guarantee convergence even in the simplest matrix games. This paper proposes a new algorithmic framework for competitive self-play reinforcement learning in two-player zero-sum games. We recognize the fact that the Nash equilibrium coincides with the saddle point of the stochastic payoff function, which motivates us to borrow ideas from the classical saddle point optimization literature. Our method simultaneously trains several agents and intelligently takes each other as opponents based on a simple adversarial rule derived from a principled perturbation-based saddle optimization method. We prove theoretically that our algorithm converges to an approximate equilibrium with high probability in convex-concave games under standard assumptions. Beyond the theory, we further show the empirical superiority of our method over baseline methods relying on the aforementioned opponent-selection heuristics in matrix games, grid-world soccer, Gomoku, and simulated robot sumo, with neural net policy function approximators.

1. INTRODUCTION

Reinforcement learning (RL) from self-play has drawn tremendous attention over the past few years. Empirical successes have been observed in several challenging tasks, including Go (Silver et al., 2016; 2017; 2018), simulated hide-and-seek (Baker et al., 2020), simulated sumo wrestling (Bansal et al., 2017), Capture the Flag (Jaderberg et al., 2019), Dota 2 (Berner et al., 2019), StarCraft II (Vinyals et al., 2019), and poker (Brown & Sandholm, 2019), to name a few. During RL from self-play, the learner collects training data by competing with an opponent selected from its past self or an agent population. Self-play presumably creates an auto-curriculum for the agents to learn at their own pace: at each iteration, the learner faces an opponent comparable in strength to itself, allowing continuous improvement. The way the opponents are selected often follows human-designed heuristic rules in prior work. For example, AlphaGo (Silver et al., 2016) always competes with the latest agent, while the later-generation AlphaGo Zero (Silver et al., 2017) and AlphaZero (Silver et al., 2018) generate self-play data with the best historical agent maintained so far. In specific tasks, such as OpenAI's sumo wrestling, competing against a randomly chosen historical agent leads to the emergence of more diverse behaviors (Bansal et al., 2017) and more stable training than against the latest agent (Al-Shedivat et al., 2018). In population-based training (Jaderberg et al., 2019; Liu et al., 2019) and AlphaStar (Vinyals et al., 2019), an elite or random agent is picked from the agent population as the opponent. Unfortunately, these rules may be inefficient and sometimes ineffective in practice, since they do not necessarily enjoy last-iterate convergence to the "average-case optimal" solution even in tabular matrix games. In fact, in the simple Matching Pennies game, self-play with the latest agent fails to converge and falls into oscillating behavior, as shown in Sec.
5. In this paper, we develop an algorithm that adopts a principle-derived opponent-selection rule to alleviate some of the issues mentioned above. This requires first clarifying what the solution of self-play RL should be. From the game-theoretic perspective, Nash equilibrium is a fundamental solution concept that characterizes the desired "average-case optimal" strategies (policies): when each player assumes the other players also play their equilibrium strategies, no one in the game can gain by unilaterally deviating to another strategy. Nash, in his seminal work (Nash, 1951), established the existence of a mixed-strategy Nash equilibrium in any finite game. Solving for a mixed-strategy Nash equilibrium is thus a reasonable goal of self-play RL. We consider the particular case of two-player zero-sum games as the model for competitive self-play RL environments. In this case, the Nash equilibrium is the same as the (global) saddle point and as the solution of the minimax program min_{x∈X} max_{y∈Y} f(x, y). We denote by x, y the strategy profiles (in RL terminology, policies) and by f the loss for x, or equivalently the utility/reward for y. A saddle point (x*, y*) ∈ X × Y, where X, Y are the sets of all possible mixed strategies (stochastic policies) of the two players, satisfies the following key property:

f(x*, y) ≤ f(x*, y*) ≤ f(x, y*), ∀x ∈ X, ∀y ∈ Y.  (1)

The connections to the saddle point problem and to game theory inspire us to borrow ideas from the abundant literature on finding saddle points in optimization (Arrow et al., 1958; Korpelevich, 1976; Kallio & Ruszczynski, 1994; Nedić & Ozdaglar, 2009) and on finding equilibria in game theory (Zinkevich et al., 2008; Brown, 1951; Singh et al., 2000). One particular class of methods, the perturbation-based subgradient methods for finding saddles (Korpelevich, 1976; Kallio & Ruszczynski, 1994), is especially appealing.
This class of methods builds directly upon the inequality properties in Eq. 1 and has several advantages: (1) Unlike algorithms that require knowledge of the game dynamics (Silver et al., 2016; 2017; Nowé et al., 2012), it requires only subgradients; it is thus easy to adapt to policy optimization with estimated policy gradients. (2) For convex-concave functions, it is guaranteed to converge in its last iterate instead of an average iterate, alleviating the need to compute historical averages as in Brown (1951); Singh et al. (2000); Zinkevich et al. (2008), which can get complicated when neural nets are involved (Heinrich & Silver, 2016). (3) Most importantly, it prescribes a simple, principled way to adversarially choose self-play opponents, which can be naturally instantiated with a concurrently-trained agent population. To summarize, we apply ideas from the perturbation-based methods of classical saddle point optimization to the model-free self-play RL regime. This results in a novel population-based policy gradient method with a principled adversarial opponent-selection rule. Analogous to the standard model-free RL setting, we assume only "naive" players (Jafari et al., 2001), for whom the game dynamics are hidden and only the rewards of their own actions are revealed. This enables broader applicability to problems with mismatched or unknown game dynamics than many existing algorithms (Silver et al., 2016; 2017; Nowé et al., 2012). In Sec. 4, we provide an approximate convergence theorem for convex-concave games as a sanity check. Sec. 5 presents extensive experimental results favoring our algorithm's effectiveness in several games, including matrix games, grid-world soccer, a board game, and a challenging simulated robot sumo game. Our method demonstrates higher per-agent sample efficiency than baseline methods with alternative opponent-selection rules, and our trained agents also outperform the baseline agents on average in competitions.

2. RELATED WORK

Reinforcement learning trains a single agent to maximize the expected return in an environment (Sutton & Barto, 2018). Multiagent reinforcement learning (MARL), of which the two-agent case is a special case, concerns multiple agents taking actions in the same environment (Littman, 1994). Self-play is a training paradigm that generates data for MARL and has led to great successes, achieving superhuman performance in several domains (Tesauro, 1995; Silver et al., 2016; Brown & Sandholm, 2019). Applying RL algorithms naively as independent learners in MARL sometimes produces strong agents (Tesauro, 1995; Silver et al., 2016); other approaches combine RL with tree search, fictitious play, or regret minimization. However, tree search requires learners to know (or at least learn) the game dynamics, while the other approaches typically require maintaining historical quantities. In Fictitious play, the learner best-responds to a historical average opponent, and the average strategy converges. Similarly, the total historical regrets in all (information) states are maintained in (counterfactual) regret minimization (Zinkevich et al., 2008). Furthermore, most of these algorithms are designed only for games with discrete states and actions, and special care has to be taken with neural net function approximators (Heinrich & Silver, 2016). On the contrary, our method does not require the complicated computation of averaging neural nets and is readily applicable to continuous environments. In two-player zero-sum games, the Nash equilibrium coincides with the saddle point, which enables the techniques developed for finding saddle points. While some saddle-point methods also rely on time averages (Nedić & Ozdaglar, 2009), a class of perturbation-based gradient methods is known to converge under mild convex-concave assumptions for deterministic functions (Korpelevich, 1976; Kallio & Ruszczynski, 1994).

3. METHOD

Classical game theory defines a two-player zero-sum game as a tuple (X, Y, f) where X, Y are the sets of possible strategies of Players 1 and 2 respectively, and f : X × Y → R maps a pair of strategies to a real-valued utility/reward for Player 2. The game is zero-sum (fully competitive), so Player 1's reward is −f. This is a special case of the Stochastic Game formulation for multiagent RL (Shapley, 1953), which is itself an extension of Markov Decision Processes (MDPs). We consider mixed strategies induced by stochastic policies π_x and π_y. The policies can be parameterized functions, in which case X, Y are the sets of all possible policy parameters. Denote by a_t the action of Player 1 and by b_t the action of Player 2 at time t, and let T be the time limit of the game. The stochastic payoff f then writes as

f(x, y) = E_{a_t∼π_x, b_t∼π_y, s_{t+1}∼P(·|s_t,a_t,b_t)} [ Σ_{t=0}^{T} γ^t r(s_t, a_t, b_t) ].  (2)

The state sequence {s_t}_{t=0}^{T} follows the transition dynamics P(s_{t+1}|s_t, a_t, b_t). Actions are sampled according to the action distributions π_x(·|s_t) and π_y(·|s_t), and r(s_t, a_t, b_t) is the reward (payoff) for Player 2 at time t, determined jointly by the state and actions. We use the terms 'agent' and 'player' interchangeably. While we consider an agent pair (x, y) in this paper, in some cases (Silver et al., 2016), x = y can be enforced by sharing parameters if the game is impartial. The discount factor γ weights between short- and long-term rewards and is optional. Note that when one agent is fixed, taking y as an example, the problem x faces reduces to an MDP if we define a new state transition dynamic P_new(s_{t+1}|s_t, a_t) = Σ_{b_t} P(s_{t+1}|s_t, a_t, b_t) π_y(b_t|s_t) and a new reward r_new(s_t, a_t) = Σ_{b_t} r(s_t, a_t, b_t) π_y(b_t|s_t).
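To make the fixed-opponent reduction concrete, here is a minimal sketch (our illustration, not the paper's code; the tabular tensor shapes and the names `P`, `r`, `pi_y` are assumptions) of marginalizing a fixed opponent policy out of the joint dynamics:

```python
import numpy as np

def reduce_to_mdp(P, r, pi_y):
    """Marginalize a fixed opponent policy out of a two-player game.

    P:    [S, A, B, S'] joint transition tensor P(s'|s, a, b)
    r:    [S, A, B] payoff r(s, a, b) for Player 2
    pi_y: [S, B] opponent action probabilities pi_y(b|s)

    Returns the induced single-agent MDP for the other player:
      P_new[s, a, s'] = sum_b P(s'|s, a, b) * pi_y(b|s)
      r_new[s, a]     = sum_b r(s, a, b)    * pi_y(b|s)
    """
    P_new = np.einsum('sabt,sb->sat', P, pi_y)
    r_new = np.einsum('sab,sb->sa', r, pi_y)
    return P_new, r_new
```

With the opponent thus folded into the environment, any standard single-agent policy optimization method applies unchanged.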
This leads to the naive gradient descent-ascent algorithm, which provably works in strictly convex-concave games (where f is strictly convex in x and strictly concave in y) under some assumptions (Arrow et al., 1958). In general, however, it does not enjoy last-iterate convergence to the Nash equilibrium. Even in simple games such as Matching Pennies and Rock Paper Scissors, as we shall see in our experiments, the naive algorithm generates cyclic sequences of x^k, y^k that orbit around the equilibrium. This motivates us to study the perturbation-based methods, which converge under weaker assumptions. Recall that the Nash equilibrium has to satisfy the saddle constraints of Eq. 1: f(x*, y) ≤ f(x*, y*) ≤ f(x, y*). The perturbation-based methods build upon this property (Nedić & Ozdaglar, 2009; Kallio & Ruszczynski, 1994; Korpelevich, 1976) and directly optimize for a solution that meets the constraints. They find perturbed points u for Player 1 and v for Player 2, and use gradients at (x, v) and (u, y) to optimize x and y, respectively. Under some regularity assumptions, the gradient direction from a single perturbed point is adequate for proving convergence for (not necessarily strictly) convex-concave functions (Nedić & Ozdaglar, 2009). These methods can be easily extended to accommodate gradient-based policy optimization and the stochastic RL objective in Eq. 4. We propose to find the perturbations from an agent population, resulting in the algorithm outlined in Alg. 1. The algorithm trains n pairs of agents simultaneously. At each round of training, we first run n² pairwise competitions as the evaluation step (Alg. 1, L3), costing n² m_k trajectories; to save sample complexity, we can reuse these rollouts to do one policy update as well. Then a simple adversarial rule (Eq. 3) is adopted in Alg. 1, L6 to choose the opponents adaptively. The intuition is that v^k_i and u^k_i are the most challenging opponents in the population for the current x_i and y_i:
v^k_i = argmax_{y ∈ C^k_{y_i}} f(x^k_i, y),  u^k_i = argmin_{x ∈ C^k_{x_i}} f(x, y^k_i).  (3)

The perturbations v^k_i and u^k_i always satisfy f(x^k_i, v^k_i) ≥ f(u^k_i, y^k_i), since max_{y∈C^k_{y_i}} f(x^k_i, y) ≥ f(x^k_i, y^k_i) ≥ min_{x∈C^k_{x_i}} f(x, y^k_i). We then run gradient descent on x^k_i against the perturbed opponent v^k_i to minimize f(x^k_i, v^k_i), and gradient ascent on y^k_i to maximize f(u^k_i, y^k_i). Intuitively, the duality gap between min_x max_y f(x, y) and max_y min_x f(x, y), approximated by f(x^k_i, v^k_i) − f(u^k_i, y^k_i), is reduced, driving (x^k_i, y^k_i) toward the saddle point (equilibrium). We build the candidate opponent sets in L5 of Alg. 1 simply as the concurrently-trained n-agent population: C^k_{y_i} = {y^k_1, ..., y^k_n} and C^k_{x_i} = {x^k_1, ..., x^k_n}. This choice is due to the following considerations. One alternative source of candidates is fixed known agents, such as a rule-based agent, which may not be available in practice. Another source is the extragradient methods (Korpelevich, 1976; Mertikopoulos et al., 2019), where extra gradient steps are taken on y before optimizing x. The extragradient method can be viewed as a local approximation of Eq. 3 with a neighborhood opponent set, and is thus related to our method. However, it can be less efficient because the trajectory samples used in the extragradient steps are wasted: they do not contribute to actually optimizing y. Yet another source is past agents. That choice is motivated by Fictitious play and ensures that the current learner always defeats a past self. However, as we shall see in the experiments, self-play against a random past agent may learn more slowly than our method. We expect all agents in the population of our algorithm to be strong and thus to provide stronger learning signals. Finally, we use Monte Carlo estimation to compute the values and gradients of f.
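The adversarial rule of Eq. 3 can be sketched in a few lines (illustrative code, not the paper's implementation; `payoff` stands for the n×n matrix of estimated f(x_j, y_l) values produced by the pairwise evaluation step):

```python
import numpy as np

def select_opponents(payoff, i):
    """Adversarial opponent selection (Eq. 3) for agent pair i.

    payoff[j, l] estimates f(x_j, y_l) from the n^2 pairwise evaluations.
    v = index of the hardest opponent for x_i (maximizes x_i's loss f(x_i, y)),
    u = index of the hardest opponent for y_i (minimizes y_i's reward f(x, y_i)).
    """
    v = int(np.argmax(payoff[i, :]))  # arg max_{y in population} f(x_i, y)
    u = int(np.argmin(payoff[:, i]))  # arg min_{x in population} f(x, y_i)
    return u, v
```

By construction the selected pair satisfies f(x_i, y_v) ≥ f(x_i, y_i) ≥ f(x_u, y_i), so the estimated gap that the subsequent descent/ascent steps shrink is always non-negative.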
In the classical game theory setting where the game dynamics and payoffs are known, it is possible to compute the exact values and gradients of f. But in the model-free MARL setting, we have to collect rollout trajectories to estimate both the function values (through policy evaluation) and the gradients (through the policy gradient theorem (Sutton & Barto, 2018)). After collecting m independent trajectories {(s^i_t, a^i_t, r^i_t)}_{t=0}^{T}, i = 1, ..., m, we can estimate f(x, y) by

f̂(x, y) = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t r^i_t.  (4)

And given estimates Q̂_x(s, a; y) of the state-action value Q_x(s, a; y) (assuming an MDP with y as a fixed opponent of x), we construct an estimator of ∇_x f(x, y) (and similarly of ∇_y f given Q̂_y) by

∇̂_x f(x, y) ∝ (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} ∇_x log π_x(a^i_t|s^i_t) Q̂_x(s^i_t, a^i_t; y).  (5)
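A minimal numpy sketch of these Monte Carlo estimators (our illustration under simplifying assumptions: the discounted return-to-go stands in for the value estimate Q̂, and trajectories are passed in as plain reward/score sequences):

```python
import numpy as np

def mc_estimates(rewards, grad_logps, gamma=1.0):
    """Monte Carlo estimates of f(x, y) and the policy gradient for x.

    rewards:    list of m reward sequences [r_0, ..., r_T] (one per trajectory)
    grad_logps: matching sequences of score vectors grad_x log pi_x(a_t|s_t)
    Returns (f_hat, g_hat): the value estimate (1/m) sum_i sum_t gamma^t r_t
    and a REINFORCE-style gradient using the return-to-go in place of Q_hat.
    """
    m = len(rewards)
    f_hat, g_hat = 0.0, 0.0
    for rs, gs in zip(rewards, grad_logps):
        G = 0.0
        grads = np.zeros_like(np.asarray(gs[0], dtype=float))
        # walk backwards so G accumulates the discounted return-to-go
        for r, g in zip(reversed(rs), reversed(gs)):
            G = r + gamma * G
            grads = grads + np.asarray(g, dtype=float) * G
        f_hat += G          # after the loop, G equals the full discounted return
        g_hat = g_hat + grads
    return f_hat / m, g_hat / m
```

In practice the return-to-go would be replaced by a learned critic, as in the A2C setup used in the experiments.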

4. CONVERGENCE ANALYSIS

We establish an asymptotic convergence result in the Monte Carlo policy gradient setting in Thm. 2 for a variant of Alg. 1 under regularity assumptions. This variant sets l = 1 and uses vanilla SGD as the policy optimizer. We add a stop criterion f̂(x^k_i, v^k) − f̂(u^k, y^k_i) ≤ ε after Line 6, with an accuracy parameter ε. The full proof can be found in the appendix. Since the algorithm is symmetric between the pairs of agents in the population, we drop the subscript i for clarity.

Assumption 1 (A1). X, Y ⊆ R^d are compact sets. As a consequence, there exists D such that ‖x_1 − x_2‖_1 ≤ D for all x_1, x_2 ∈ X and ‖y_1 − y_2‖_1 ≤ D for all y_1, y_2 ∈ Y. Assume C^k_x ⊆ X and C^k_y ⊆ Y are compact. Further, assume f : X × Y → R is a bounded convex-concave function.

Theorem 1 (Convergence with exact gradients (Kallio & Ruszczynski, 1994)). Under A1, if (x^k, y^k) → (x̂, ŷ) together with f(x^k, v^k) − f(u^k, y^k) → 0 implies that (x̂, ŷ) is a saddle point, then Alg. 1 (replacing estimates with true values) produces a sequence {(x^k, y^k)}_{k=0}^∞ convergent to a saddle point.

The above case with exact subgradients is easy since both f and ∇f are deterministic. In the RL setting, we construct estimates of f(x, y) and ∇_x f, ∇_y f from samples. Intuitively, when the sample sizes are large enough, we can bound the deviation between the true values and the estimates by concentration inequalities, and a proof outline similar to Kallio & Ruszczynski (1994) then goes through. Thm. 2 requires an extra assumption on the boundedness of the Q values and gradients. By showing that the policy gradient estimates are approximate sub-/super-gradients of f, we are able to prove that the output (x^N_i, y^N_i) of Alg. 1 is an approximate Nash equilibrium with high probability.

Assumption 2 (A2). The Q value estimate Q̂ is unbiased and bounded by R, and the policy has bounded gradient ‖∇ log π_θ(a|s)‖_∞ ≤ B.

Theorem 2 (Convergence with policy gradients).
Under A1 and A2, let the sample size at step k be m_k ≥ Ω((R²B²D²/ε²) log(d/(δ2^{−k}))) and the learning rate be η_k = α(Ê_k − 2ε) / (‖ĝ^k_x‖² + ‖ĝ^k_y‖²) with 0 < α < 2. Then, with probability at least 1 − O(δ), the Monte Carlo version of Alg. 1 generates a sequence of points {(x^k, y^k)}_{k=0}^∞ convergent to an O(ε)-approximate equilibrium (x̄, ȳ), that is, ∀x ∈ X, ∀y ∈ Y, f(x̄, y) − O(ε) ≤ f(x̄, ȳ) ≤ f(x, ȳ) + O(ε).

Discussion. The theorems require f to be convex in x and concave in y, but not strictly, which is a weaker assumption than that of Arrow et al. (1958). The purpose of this simple analysis is mainly a sanity check for correctness. It applies to the setting in Sec. 5.1 but not beyond, as the assumptions do not necessarily hold for neural networks. The sample size is chosen loosely, as we are not aiming at a sharp finite-sample complexity analysis. In practice, we can find suitable m_k (sample sizes) and η_k (learning rates) by experimentation, and adopt a modern RL algorithm with an advanced optimizer (e.g., PPO (Schulman et al., 2017) with RMSProp (Hinton et al.)) in place of the SGD updates.
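For intuition, the theorem's step-size rule can be written as a small helper (a sketch under the theorem's assumptions; `E_hat` denotes the estimated gap Ê_k = f̂(x^k, v^k) − f̂(u^k, y^k) and `eps` the accuracy parameter ε):

```python
import numpy as np

def theorem2_step_size(E_hat, g_x, g_y, alpha=1.0, eps=0.0):
    """eta_k = alpha * (E_hat - 2*eps) / (||g_x||^2 + ||g_y||^2).

    Returns 0 when the estimated duality gap falls to 2*eps or below,
    mirroring the stop criterion: near an approximate equilibrium, the
    steps vanish instead of overshooting.
    """
    g_x = np.asarray(g_x, dtype=float)
    g_y = np.asarray(g_y, dtype=float)
    denom = float(g_x @ g_x + g_y @ g_y)
    if denom == 0.0 or E_hat <= 2 * eps:
        return 0.0
    return alpha * (E_hat - 2 * eps) / denom
```

The gap-proportional numerator is what yields the strict decrease of the distance-to-saddle W_k in the proof (Appendix B).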

5. EXPERIMENTS

We empirically evaluate our algorithm in several games with distinct characteristics.

Compared methods. In matrix games, we compare against naive mirror descent, which is essentially self-play with the latest agent, to verify convergence. In the rest of the environments, we compare the following methods: 1. Self-play with the latest agent. 2. Self-play with the best past agent. 3. Self-play with a random past agent; this baseline is related to Fictitious play (Brown, 1951), since uniformly random sampling is equivalent to playing the historical average. However, Fictitious play only guarantees convergence of the average-iterate, not the last-iterate agent. 4. OURS(n = 2, 4, 6, ...): our algorithm with a population of n pairs of agents trained simultaneously, with each other as candidate opponents; the implementation can be distributed.

Evaluation protocols. We mainly measure the strength of agents by Elo scores (Elo, 1978). Pairwise competition results are gathered from a large tournament among all the checkpoint agents of all methods after training. Each pairing is played for multiple matches to account for randomness. The Elo scores are computed by logistic regression, as Elo assumes the logistic relationship P(A wins) + 0.5 P(draw) = 1/(1 + 10^{(R_B − R_A)/400}). A 100-point Elo difference corresponds to roughly a 64% win-rate. The initial agent's Elo is calibrated to 0. Another way to measure strength is to compute the average rewards (win-rates) against other agents; we also report average rewards in the appendix.
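The logistic Elo relationship above can be sketched as a one-line helper (illustrative, not the paper's evaluation code):

```python
def elo_expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo model:
    P(A wins) + 0.5 * P(draw) = 1 / (1 + 10 ** ((R_B - R_A) / 400))."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
```

For example, a 100-point rating gap yields an expected score of about 0.64, matching the rule of thumb stated above.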

5.1. MATRIX GAMES

We verified the last-iterate convergence to Nash equilibrium in several classical two-player zero-sum matrix games. In comparison, vanilla mirror descent/ascent is known to produce oscillating behaviors (Mertikopoulos et al., 2019). Payoff matrices (for both players, separated by commas), phase portraits, error curves, and our observations are shown in Tabs. 1-4 and the accompanying figures. We studied two settings: (1) OURS(Exact Gradient), the full-information setting, where the players know the payoff matrix and compute exact gradients on the action probabilities; (2) OURS(Policy Gradient), the reinforcement learning or bandit setting, where each player only receives the reward of its own action. The action probabilities were modeled by a probability vector p ∈ Δ². We estimated the gradient w.r.t. p with the REINFORCE estimator (Williams, 1992) with sample size m_k = 1024, and applied SGD with constant learning rate η_k = 0.03 and proximal projection onto Δ². We trained n = 4 agents jointly for Alg. 1, and separately for naive mirror descent under the same initialization.
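The exact-gradient setting can be reproduced with a short self-contained sketch (our illustration, not the paper's code; the population size, step size, and step count are arbitrary choices) that runs the population-based perturbation updates on Matching Pennies, using a Euclidean projection onto the simplex as the proximal step:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # Matching Pennies payoff for Player 2
f = lambda x, y: float(x @ A @ y)

rng = np.random.default_rng(0)
n, eta = 4, 0.05
xs = [project_simplex(rng.random(2)) for _ in range(n)]
ys = [project_simplex(rng.random(2)) for _ in range(n)]
for k in range(500):
    # Eq. 3: pick the hardest opponent in the population for each agent
    vs = [max(ys, key=lambda y: f(xs[i], y)) for i in range(n)]
    us = [min(xs, key=lambda x: f(x, ys[i])) for i in range(n)]
    # descent on x against v_i, ascent on y against u_i, then project
    xs = [project_simplex(xs[i] - eta * (A @ vs[i])) for i in range(n)]
    ys = [project_simplex(ys[i] + eta * (A.T @ us[i])) for i in range(n)]
```

Note that with n = 1 the rule degenerates to naive descent-ascent (the only candidate opponent is the agent's own partner); plotting the Heads probabilities of the iterates contrasts the two behaviors.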

5.2. GRID-WORLD SOCCER GAME

We conducted experiments in a grid-world soccer game. Similar games were adopted in Littman (1994) and He et al. (2016). Two players compete in a 6 × 9 grid world, starting from random positions. The action space is {up, down, left, right, noop}. Once a player scores a goal, it receives reward 1.0 and the game ends. Up to T = 100 timesteps are allowed; the game ends in a draw if time runs out. The game has imperfect information, as the two players move simultaneously. The policy and value functions were parameterized by simple one-layer networks, consisting of a one-hot encoding layer and a linear layer that outputs the action logits and values; the logits are transformed into probabilities via softmax. We used Advantage Actor-Critic (A2C) for policy optimization. Other hyper-parameters are listed in the appendix. All methods were run multiple times to calculate the confidence intervals. In Fig. 5, OURS(n = 2, 4, 6) all perform better than the alternatives, achieving higher Elo scores after experiencing the same number of per-agent episodes. The other methods fail to beat the rule-based agent after 32000 episodes. Competing with a random past agent learns the slowest, suggesting that, though it may stabilize training and diversify behaviors (Bansal et al., 2017), the learning efficiency is not high because a large portion of the samples is devoted to weak opponents. Within our method, performance increases with larger n, suggesting that a larger population may help find better perturbations.

5.3. GOMOKU BOARD GAME

We investigated the effectiveness of our method in the Gomoku game, also known as Renju or Five-in-a-Row. In our variant, two players alternately place black and white stones on a 9-by-9 board. The player who gets an unbroken row of five stones, horizontally, vertically, or diagonally, wins (reward 1). The game is a draw (reward 0) when no valid move remains. The game is sequential and has perfect information.

Game payoff matrices, phase portraits, and error curves:

Tab. 1: Matching Pennies, a classical game where two players simultaneously turn their pennies to heads or tails. If the pennies match, Player 2 (Row) wins one penny from Player 1 (Column); otherwise, Player 1 wins.

            Heads     Tails
  Heads     1, -1     -1, 1
  Tails     -1, 1     1, -1

(P_x(Heads), P_y(Heads)) = (1/2, 1/2) is the unique Nash equilibrium, with game value 0. Observation: In the leftmost column of Figs. 1 and 2, naive mirror descent does not converge pointwise; instead, it is trapped in a cyclic behavior. The trajectories of the probability of playing Heads orbit around the Nash equilibrium, showing as circles in the phase portrait. In contrast, our method enjoys approximate last-iterate convergence with both exact and policy gradients. Observation: Similar observations hold in the Rock Paper Scissors game (Fig. 3). The naive method circles around the corresponding equilibrium points ((1/3, 1/3, 1/3) in the case of Rock Paper Scissors), while our method converges with diminishing error. Observation: Our method has the benefit of producing diverse solutions when there exist multiple Nash equilibria. The solution for the row player is x = (1/2, 1/2), while any interpolation between (1/2, 1/2, 0) and (0, 1/3, 2/3) is an equilibrium column strategy. Depending on the initialization, agents in our method converge to different equilibria.

This experiment involved much more complex neural networks than before. We adopted a 4-layer convolutional ReLU network (kernels (5, 5, 3, 1), channels (16, 32, 64, 1), all strides 1) for both the policy and value networks. Gomoku is hard to train from scratch with pure model-free RL without explicit tree search. Hence, we pre-trained the policy nets on expert data collected from renjuoffline.com. We downloaded roughly 130 thousand games and applied behavior cloning.
The pre-trained networks were able to predict expert moves with ≈ 41% accuracy and achieved an average score of 0.93 (96% win and 4% lose) against a random-action player. We adopted the same A2C training setup as before. In Fig. 6, all methods improve significantly upon the behavior cloning policies. OURS(n = 2, 4, 6) demonstrate higher sample efficiency by achieving higher Elo ratings than the alternatives given the same amount of per-agent experience. This again suggests that the opponents are chosen more wisely, resulting in better policy improvements. Lastly, the more complex policy and value functions (multi-layer CNNs) do not seem to undermine the advantage of our approach.

5.4. ROBOSUMO ANTS

We further evaluated our method on the simulated RoboSumo Ants game. In Al-Shedivat et al. (2018), a random past opponent is sampled in self-play, corresponding to the "Self-play w/ random past" baseline here. The agents are initialized by imitating the pre-trained agents of Al-Shedivat et al. (2018). We considered n = 4 and n = 8 for our method. From Fig. 7, we observe again that OURS(n = 4, 8) outperform the baseline methods by a statistical margin, and that our method benefits from a larger population size.

6. CONCLUSION

We propose a new algorithmic framework for competitive self-play policy optimization inspired by a perturbation-based subgradient method for finding saddle points. Our algorithm provably converges in convex-concave games and achieves better per-agent sample efficiency in several experiments. In the future, we hope to study larger population sizes (given sufficient computing power) and the possibilities of model-based and off-policy self-play RL under our framework.

Fig. 9: Illustration of the Gomoku game (also known as Renju or Five-in-a-Row). We study the 9x9 board variant. Two players sequentially place black and white stones on the board; black goes first. A player wins upon getting five stones in a row. In this illustration, black wins because there are five consecutive black stones in the 5th row. The numbers in the stones indicate the order in which they were placed.

Observation space: tensor of shape [9, 9, 3]; last dim: 0 = vacant, 1 = black, 2 = white.
Action space: any valid location on the 9x9 board.
Time limit: 41 moves per player.
Terminal reward: +1 for the winning player, -1 for the losing player, 0 on timeout.

Fig. 10: Illustration of the RoboSumo Ants game. Two ants fight in the arena. The goal is to push the opponent out of the arena or down to the floor. Agent positions are initialized randomly at the start of the game. The game ends in a draw if the time limit is reached. In addition to the terminal ±1 reward, the environment comes with shaping rewards (motion bonus, closeness to opponent, etc.). To make the game zero-sum, we take the difference between the original rewards of the two ants.

Observation space: R^120.
Action space: R^8.
Time limit: 100 moves.
Reward: r_t = r^orig_{y,t} - r^orig_{x,t}; terminal ±1 or 0.
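The zero-sum reward construction amounts to a per-step difference (a trivial sketch; the reward sequences are illustrative):

```python
def zero_sum_rewards(r_orig_x, r_orig_y):
    """Player 2's per-step payoff r_t = r_y^orig(t) - r_x^orig(t).

    Player 1 receives -r_t, so the transformed game is exactly zero-sum
    even though the original shaping rewards are not.
    """
    return [ry - rx for rx, ry in zip(r_orig_x, r_orig_y)]
```

This keeps the shaping signal (both ants are rewarded for progress) while guaranteeing that the two players' transformed returns always sum to zero.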

A.2 HYPER-PARAMETERS

The hyper-parameters in different games are listed in Tab. 5.

Win-rates (or average rewards). Here we report additional results in terms of average win-rates, or equivalently average rewards through the linear transform win-rate = 0.5 + 0.5 × reward, in Tabs. 6 and 7. Since we treat each (x_i, y_i) pair as one agent, the values in the first table are the average of f(x_i, ·) and f(·, y_i); the one-sided f(·, y_i) win-rates are in the second table. Means and 95% confidence intervals are estimated from multiple runs; the exact numbers of runs are in the captions of Figs. 5, 6, and 7 of the main paper. The message is the same as that suggested by the Elo scores: our method consistently produces stronger agents. We hope the win-rates give better intuition about the relative performance of the different methods.

Tab. 5: Hyper-parameters.

Tab. 6: Average win-rates (∈ [0, 1]) between the last-iterate (final) agents trained by different algorithms. The last two rows further show the average over the other last-iterate agents and over all other agents (historical checkpoints) included in the tournament, respectively. Since an agent consists of an (x, y) pair, the win-rate is averaged over x and y, i.e., win(col vs row) = (f(x_row, y_col) - f(x_col, y_row))/2 × 0.5 + 0.5. Lower is better within each column; higher is better within each row.

(tail of an earlier subtable; six methods, leading rows lost)
...                0.424 ± 0.017    0.440 ± 0.020    0.362 ± 0.020    0.455 ± 0.017    0.488 ± 0.013    -
Last-iter average  0.455 ± 0.010    0.479 ± 0.011    0.407 ± 0.012    0.509 ± 0.010    0.537 ± 0.008    0.560 ± 0.008
Overall average    0.541 ± 0.004    0.561 ± 0.004    0.499 ± 0.005    0.583 ± 0.004    0.599 ± 0.003    0.615 ± 0.003

(c) RoboSumo
RoboSumo           Self-play latest  Self-play best   Self-play rand   Ours (n=4)       Ours (n=8)
Self-play latest   -                 0.502 ± 0.012    0.493 ± 0.013    0.511 ± 0.011    0.510 ± 0.010
Self-play best     0.498 ± 0.012     -                0.506 ± 0.014    0.514 ± 0.008    0.512 ± 0.010
Self-play rand     0.507 ± 0.013     0.494 ± 0.014    -                0.508 ± 0.011    0.515 ± 0.011
Ours (n=4)         0.489 ± 0.011     0.486 ± 0.008    0.492 ± 0.011    -                0.516 ± 0.008
Ours (n=8)         0.490 ± 0.010     0.488 ± 0.010    0.485 ± 0.011    0.484 ± 0.008    -
Last-iter average  0.494 ± 0.006     0.491 ± 0.005    0.492 ± 0.006    0.500 ± 0.005    0.514 ± 0.005
Overall average    0.531 ± 0.004     0.527 ± 0.004    0.530 ± 0.004    0.539 ± 0.003    0.545 ± 0.003

Training time. Thanks to the ease of parallelization, the proposed algorithm enjoys good scalability. We can either distribute the n agents into n processes to run concurrently, or parallelize the rollouts; our implementation takes the latter approach. In the most time-consuming RoboSumo Ants experiment, with 30 Intel Xeon CPUs, the baseline methods took approximately 2.4 h to train, while Ours (n=4) took 10.83 h (×4.5) and Ours (n=8) took 20.75 h (×8.6). Note that Ours (n) trains n agents simultaneously.
If we trained n agents with the baseline methods by repeating the experiment n times, the time would be 2.4n hours, which is comparable to Ours (n).

Chance of selecting the agent itself as opponent. One big difference between our method and the compared baselines is the ability to select opponents adversarially from the population. Consider the agent pair (x_i, y_i). When training x_i, our method finds the strongest opponent (the one that incurs the largest loss on x_i) from the population, whereas the baselines always choose (possibly past versions of) y_i. Since the candidate set contains y_i, the "fall-back" case is to use y_i as the opponent in our method. We report the frequency with which y_i is chosen as the opponent for x_i (and x_i for y_i, likewise). This gives a sense of how often our method falls back to the baseline behavior. From Tab. 8, we observe that as n grows larger, the chance of falling back decreases. This is understandable, since a larger population means larger candidate sets and a larger chance of finding good perturbations.

B PROOFS

We adopt the following variant of Alg. 1, denoted Alg. 2, in our asymptotic convergence analysis. For clarity, we investigate the learning process of one agent in the population and drop the index i. For the sake of the proof, C^k_x and C^k_y are not set simply to the population; instead, we pose assumptions on them. Setting them to the population as in the main text may approximately satisfy these assumptions. We restate the assumptions and theorems here for reference.

Tab. 7: Average one-sided win-rates (∈ [0, 1]) between the last-iterate (final) agents trained by different algorithms. The win-rate is one-sided, i.e., win(y_col vs x_row) = f(x_row, y_col) × 0.5 + 0.5. Lower is better within each column; higher is better within each row.

Tab. 8: Frequency of self (Soccer): 0.4983 ± 0.0085, 0.2533 ± 0.0072, 0.1650 ± 0.0082. Frequency of self (Gomoku): 0.5063 ± 0.0153, 0.2312 ± 0.0111, 0.1549 ± 0.0103.

Assumption B.2. If a sequence satisfies (x^k, y^k) → (x̂, ŷ) and f(x^k, v^k) − f(u^k, y^k) → 0 for some v^k ∈ C^k_y and u^k ∈ C^k_x, then (x̂, ŷ) is a saddle point; in other words, f(x^k, v^k) − f(u^k, y^k) = 0 with (u^k, v^k) ∈ C^k_x × C^k_y only at a saddle point. A trivial example would be C^k_x = X, C^k_y = Y. Another example would be proximal regions around x^k, y^k. In practice, Alg. 1 constructs the candidate sets from the population, which needs to be adequately large and diverse to satisfy Assumption B.2 approximately.

The proof of Theorem 1 is due to Kallio & Ruszczynski (1994), which we paraphrase here.

Proof. We shall prove that one iteration of Alg. 2 decreases the distance between the current (x^k, y^k) and the optimum (x*, y*). Expanding the squared distance for the update x^{k+1} = x^k − η_k g^k_x (projection onto X only decreases the distance to x* ∈ X),

‖x^{k+1} − x*‖² ≤ ‖x^k − η_k g^k_x − x*‖² = ‖x^k − x*‖² − 2η_k ⟨g^k_x, x^k − x*⟩ + η_k² ‖g^k_x‖².

From Assumption B.1, convexity of f(x, y) in x gives ⟨g^k_x, x^k − x*⟩ ≥ f(x^k, v^k) − f(x*, v^k), which yields

‖x^{k+1} − x*‖² ≤ ‖x^k − x*‖² − 2η_k (f(x^k, v^k) − f(x*, v^k)) + η_k² ‖g^k_x‖².
Similarly for y^k, concavity of f(x, y) in y gives

‖y^{k+1} − y^*‖² ≤ ‖y^k − y^*‖² + 2η_k (f(u^k, y^k) − f(u^k, y^*)) + η_k² ‖g^k_y‖².

Summing the two, and noticing that the saddle point condition implies f(x^*, v^k) ≤ f(x^*, y^*) ≤ f(u^k, y^*), we have

W_{k+1} := ‖x^{k+1} − x^*‖² + ‖y^{k+1} − y^*‖²
≤ ‖x^k − x^*‖² + ‖y^k − y^*‖² − 2η_k (f(x^k, v^k) − f(x^*, v^k) − f(u^k, y^k) + f(u^k, y^*)) + η_k² (‖g^k_x‖² + ‖g^k_y‖²)
≤ W_k − 2η_k E_k + η_k² (‖g^k_x‖² + ‖g^k_y‖²).

If the learning rate satisfies η_k < E_k / (‖g^k_x‖² + ‖g^k_y‖²), the sequence {W_k}_{k=0}^∞ is strictly decreasing unless E_k = 0. Since W_k is bounded below by 0, E_k → 0.

Assump. B.4 states that, for some (x̂, ŷ) ∈ X × Y, if f(x̂, v) − ε ≤ f(x̂, ŷ) ≤ f(u, ŷ) + ε holds for all (u, v) ∈ C^k_x × C^k_y, then it holds for all (u, v) ∈ X × Y; namely, (x̂, ŷ) is an ε-approximate saddle point. Theorem 2 further sets the learning rate to

η_k = α (Ê_k − 2ε) / (‖ĝ^k_x‖² + ‖ĝ^k_y‖²).

Then, with probability at least 1 − O(δ), the Monte Carlo version of Alg. 2 generates a sequence of points {(x^k, y^k)}_{k=0}^∞ convergent to an O(ε)-approximate equilibrium (x̄, ȳ); that is, ∀x ∈ X, ∀y ∈ Y, f(x̄, y) − O(ε) ≤ f(x̄, ȳ) ≤ f(x, ȳ) + O(ε).

In the stochastic game (or reinforcement learning) setting, we construct estimates of f(x, y) (Eq. 4) and of the policy gradients ∇_x f, ∇_y f (Eq. 5) from samples. Intuitively, when the sample sizes are large enough, we can bound the deviation between the true values and the estimates by concentration inequalities, and a similar proof outline goes through. Let us first define the concepts of ε-subgradient for convex functions and ε-supergradient for concave functions. We then calculate how many samples are needed for accurate gradient estimation in Lemma 3, with high probability. With Lemma 3, we show that the Monte Carlo policy gradient estimates are good enough to be ε-subgradients when the sample size is large (Lemma 4).

Definition 1. An ε-subgradient of a convex function h : R^d → R at x is a g ∈ R^d that satisfies ∀x', h(x') − h(x) ≥ ⟨g, x' − x⟩ − ε.
Similarly, an ε-supergradient of a concave function h : R^d → R at x is a g ∈ R^d that satisfies ∀x', h(x') − h(x) ≤ ⟨g, x' − x⟩ + ε.

Lemma 4 (Policy gradients are sub-/super-gradients). Under Assump. B.1, the policy gradient estimate ∇̂_x f in Lemma 3 is an εD-subgradient of f at x, i.e., for all x' ∈ X, f(x', y) − f(x, y) ≥ ⟨∇̂_x f, x' − x⟩ − εD with probability ≥ 1 − δ. (Likewise, ∇̂_y f is an εD-supergradient in y.)

Proof. Apply the telescoping trick: for any x' ∈ X,

f(x', y) − f(x, y) ≥ ⟨∇_x f, x' − x⟩ = ⟨∇̂_x f, x' − x⟩ + ⟨∇_x f − ∇̂_x f, x' − x⟩,

where the first inequality is convexity of f in x. With the sample size in Lemma 3, max_i |∇̂_x f − ∇_x f|_i ≤ ε holds with probability ≥ 1 − δ. Hence, by Hölder's inequality,

⟨∇_x f − ∇̂_x f, x' − x⟩ ≥ −‖∇̂_x f − ∇_x f‖_∞ ‖x' − x‖_1 ≥ −εD,

and therefore f(x', y) − f(x, y) ≥ ⟨∇̂_x f, x' − x⟩ − εD. The proof that ∇̂_y f is an εD-supergradient in y is similar, hence omitted.

Similarly, for accurate function value evaluation, we have the following lemma on the sample size, which directly follows from Hoeffding's inequality.

Lemma 5 (Evaluation sample size). Suppose Assump. B.3 holds. Then with m ≥ (2R²/ε²) log(2/δ) independently collected trajectories {(s^i_t, a^i_t, r^i_t)}_{t=0}^T, i = 1, ..., m, the value estimate f̂ = (1/m) Σ_{i,t} γ^t r^i_t is ε-close to the true value f with high probability, namely, Pr(|f̂ − f| ≤ ε) ≥ 1 − δ.

Now we prove our main theorem, which guarantees that the output of Alg. 2 is an approximate Nash equilibrium with high probability. This is done by using Lemma 4 in place of the exact convexity condition to analyze the relationship between W_k and W_{k+1}, using Lemma 5 to bound the error of policy evaluation, and analyzing the stop condition carefully.

Proof (Theorem 2). Suppose (x^*, y^*) is a saddle point of f. We shall prove that one iteration of Alg. 2 sufficiently decreases the squared distance between the current (x^k, y^k) and (x^*, y^*), defined as W_k := ‖x^k − x^*‖² + ‖y^k − y^*‖².
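Lemma 5's sample size is straightforward to evaluate numerically. The sketch below (the function name is ours, not from the paper) computes m ≥ (2R²/ε²) log(2/δ) for concrete constants.

```python
import math

def evaluation_sample_size(R, eps, delta):
    """Trajectories needed so the Monte Carlo value estimate is within
    eps of the true value with probability >= 1 - delta, per the
    Hoeffding-style bound m >= (2 R^2 / eps^2) * log(2 / delta)."""
    return math.ceil(2.0 * R ** 2 / eps ** 2 * math.log(2.0 / delta))

# Return bounded by R = 1, accuracy 0.1, failure probability 0.05.
m = evaluation_sample_size(R=1.0, eps=0.1, delta=0.05)  # 738 trajectories
```

Note the quadratic growth in 1/ε: halving the evaluation error roughly quadruples the number of trajectories, which is why Remark 1 points to off-policy evaluation as a way to reduce sample complexity.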
Relation between W_k and W_{k+1}: Note that

‖x^{k+1} − x^*‖² ≤ ‖x^k − η_k ĝ^k_x − x^*‖² = ‖x^k − x^*‖² − 2η_k ⟨ĝ^k_x, x^k − x^*⟩ + η_k² ‖ĝ^k_x‖². (14)

By Lemma 4, the gradient estimate ĝ^k_x with sample size m_k is an ε-subgradient with probability at least 1 − δ/2^k, i.e., ⟨ĝ^k_x, x^k − x^*⟩ ≥ f(x^k, v^k) − f(x^*, v^k) − ε. Plugging this back into Eq. 14, we get

‖x^{k+1} − x^*‖² ≤ ‖x^k − x^*‖² − 2η_k (f(x^k, v^k) − f(x^*, v^k) − ε) + η_k² ‖ĝ^k_x‖². (15)

Similarly for y^k, since ĝ^k_y is an ε-supergradient (Eq. 17), summing the two inequalities as in the proof of Theorem 1 yields Eq. 18: W_{k+1} ≤ W_k − 2η_k (E_k − 2ε) + η_k² (‖ĝ^k_x‖² + ‖ĝ^k_y‖²).

Accurate estimation of E_k: In Eq. 18, the second term involves E_k, which is unknown to the algorithm. Recall that E_k(u^k, v^k) = f(x^k, v^k) − f(u^k, y^k), while Alg. 2 Line 5 computes the empirical estimate Ê_k = f̂(x^k, v^k) − f̂(u^k, y^k). By Lemma 5, when the sample size m_k is chosen as in Theorem 2, with probability 1 − 2δ/(d 2^k), |f̂(x^k, v^k) − f(x^k, v^k)| ≤ ε and |f̂(u^k, y^k) − f(u^k, y^k)| ≤ ε. Thus Ê_k is 2ε-accurate, because

Ê_k − 2ε = (f̂(x^k, v^k) − ε) − (f̂(u^k, y^k) + ε) ≤ E_k ≤ (f̂(x^k, v^k) + ε) − (f̂(u^k, y^k) − ε) = Ê_k + 2ε. (19)

Case (1). Stop condition in Alg. 2 Line 6: Suppose there does not exist (u, v) ∈ C^k_x × C^k_y such that Ê_k(u, v) > 3ε, i.e., ∀(u, v) ∈ C^k_x × C^k_y, Ê_k(u, v) ≤ 3ε. Then we can conclude

E_k(u, v) = f(x^k, v) − f(u, y^k) ≤ Ê_k + 2ε ≤ 5ε.

Following from Assump. B.4, this implies ∀(u, v) ∈ X × Y, f(x^k, v) − 5ε ≤ f(x^k, y^k) ≤ f(u, y^k) + 5ε, which means (x^k, y^k) is an approximate saddle point (equilibrium). On the other hand, we want to bound the failure probability. Define the events F(g) := {|ĝ − g| ≤ ε} for each estimated quantity g ∈ {g^0_x, g^0_y, f(x^0, v^0), f(u^0, y^0), ..., g^k_x, g^k_y, f(x^k, v^k), f(u^k, y^k)}. The union bound below shows that inaccurate MC estimation (failure) occurs with probability at most O(δ). The purpose of increasing m_k with k is to handle the union bound and the resulting geometric series. So, when the algorithm stops, it returns (x̄, ȳ) = (x^k, y^k) as a 5ε-approximate solution to the saddle point (equilibrium) with high probability.



Tab. 3: Rock Paper Scissors.

         Rock   Paper  Scissors
Rock     0, 0   -1, 1   1, -1
Paper    1, -1   0, 0  -1, 1
Scissors -1, 1   1, -1  0, 0
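As a sanity check on Tab. 3, a short script can verify that the uniform mixture is the equilibrium of this payoff matrix: the game value is 0, and no pure-strategy deviation improves on it. The helper names below are ours.

```python
# Payoff of the row player in Rock Paper Scissors (Tab. 3).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def payoff(x, y):
    """Bilinear payoff f(x, y) = x^T A y of mixed strategies x and y."""
    return sum(x[i] * A[i][j] * y[j] for i in range(3) for j in range(3))

uniform = [1 / 3, 1 / 3, 1 / 3]
value = payoff(uniform, uniform)  # game value at the uniform profile
# Best payoff any pure-strategy deviation achieves against the uniform mix.
best_dev = max(payoff([1.0 if i == k else 0.0 for i in range(3)], uniform)
               for k in range(3))
```

Both `value` and `best_dev` come out to 0 (up to floating point), confirming that the uniform strategy leaves the opponent indifferent among all three actions.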



Alg. 1: Perturbation-based self-play policy optimization of an n-agent population.

We used A2C (Mnih et al., 2016) with Generalized Advantage Estimation (Schulman et al., 2016) and RMSProp (Hinton et al.) as the base RL algorithm. The hyper-parameters were N = 50, l = 10, m_k = 32 for Alg. 1. We kept track of the per-agent number of trajectories (episodes) each algorithm used, for fair comparison.
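For readers unfamiliar with Generalized Advantage Estimation, a minimal reference implementation of the advantage recursion is sketched below. This is our illustration, not the paper's code; `values` is assumed to carry one extra bootstrap entry for the state after the last step.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016).

    Runs the backward recursion A_t = delta_t + gamma*lam*A_{t+1},
    where delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    `values` has len(rewards) + 1 entries (final one is the bootstrap).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Tiny hand-checkable case with gamma = lam = 1 (plain TD residual sums).
adv = gae_advantages([1.0, 0.0], [0.5, 0.5, 0.0], gamma=1.0, lam=1.0)
```

The `lam` parameter interpolates between low-variance one-step TD residuals (lam = 0) and high-variance Monte Carlo returns (lam = 1).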

Fig. 1: Matching Pennies. (Top) The phase portraits. (Bottom) The squared L2 distance to the equilibrium. Four colors correspond to the 4 agents in the population with 4 initial points.

Fig. 2: Skewed Matching Pennies. The unique Nash equilibrium is (Px(heads), Py(heads)) = (0.75, 0.25) with value 0.8.

Fig. 3: Rock Paper Scissors. (Top) Visualization of Player 1's strategies (y_0) of one of the agents in the population. (Bottom) The squared distance to the equilibrium.

Fig. 4: Visualization of the row player's strategies. (Left) Exact gradient; (Right) Policy gradient. The dashed line represents possible equilibrium strategies. The four agents (in different colors) in the population trained by our algorithm (n = 4) converge differently.

Fig. 5: Soccer Elo curves averaged over 3 runs (random seeds). For OURS(n), the average is over 3n agents. Horizontal lines show the scores of the rule-based and the random-action agents.

We used A2C (Mnih et al., 2016) with GAE (Schulman et al., 2016) and RMSProp (Hinton et al.) with learning rate η_k = 0.001. Up to N = 40 iterations of Alg. 1 were run. The other hyper-parameters were the same as those in the soccer game.

Fig. 8: Illustration of the 6x9 grid-world soccer game. Red and blue represent the two teams A and B. At the start, the players are initialized to random positions on their respective sides, and the ball is randomly assigned to one team. Players move up, down, left, or right. Once a player scores a goal, the corresponding team wins and the game ends. A player can intercept the opponent's ball by crossing the opposing player.

Alg. 2: Simplified perturbation-based self-play policy optimization of one agent.
Input: η_k: learning rates; m_k: sample sizes;
Result: pair of policies (x, y);
Initialize x^0, y^0;
for k = 0, 1, 2, ..., ∞ do
    Construct candidate opponent sets C^k_y and C^k_x;
    Find perturbations v^k = argmax_{y ∈ C^k_y} f̂(x^k, y) and u^k = argmin_{x ∈ C^k_x} f̂(x, y^k), where the evaluation is done with Eq. 4 and sample size m_k;
    Compute the estimated duality gap Ê_k = f̂(x^k, v^k) − f̂(u^k, y^k);
    if Ê_k ≤ 3ε then return (x^k, y^k);
    Estimate policy gradients ĝ^k_x = ∇̂_x f(x^k, v^k) and ĝ^k_y = ∇̂_y f(u^k, y^k) with Eq. 5 and sample size m_k;
    Update policy parameters with x^{k+1} ← x^k − η_k ĝ^k_x and y^{k+1} ← y^k + η_k ĝ^k_y;
end

B.1 PROOF OF THEOREM 1
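To make Alg. 2 concrete, here is a minimal numerical sketch on Matching Pennies, with exact payoffs and gradients in place of the Monte Carlo estimates of Eqs. 4-5, pure strategies as the candidate sets, and a step size mimicking the theorem's rule (with α = 0.5). The simplex-projection helper is a standard construction, not from the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

# Matching Pennies: f(x, y) = x^T A y; x minimizes, y maximizes.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
f = lambda x, y: float(x @ A @ y)

x = np.array([0.9, 0.1])
y = np.array([0.2, 0.8])
corners = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # candidate sets
gap = float("inf")
for _ in range(2000):
    v = max(corners, key=lambda c: f(x, c))  # perturbation v^k (worst case for x)
    u = min(corners, key=lambda c: f(c, y))  # perturbation u^k (worst case for y)
    gap = f(x, v) - f(u, y)                  # duality-gap surrogate E_k
    if gap <= 1e-3:                          # stop condition (exact values, eps = 0)
        break
    gx, gy = A @ v, A.T @ u                  # exact gradients in place of Eq. 5
    eta = 0.5 * gap / float(gx @ gx + gy @ gy)  # step akin to the theorem's rule
    x = project_simplex(x - eta * gx)        # descent for the min player
    y = project_simplex(y + eta * gy)        # ascent for the max player
```

The duality gap shrinks geometrically here, and both players end close to the mixed equilibrium (0.5, 0.5), whereas naive self-play against the latest agent oscillates on this game (Fig. 1).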

Since ĝ^k_y is an ε-supergradient by Lemma 4,

‖y^{k+1} − y^*‖² ≤ ‖y^k − y^*‖² + 2η_k (f(u^k, y^k) − f(u^k, y^*) + ε) + η_k² ‖ĝ^k_y‖². (17)

Summing the two inequalities above, and noticing that the saddle point condition implies f(x^*, v^k) ≤ f(x^*, y^*) ≤ f(u^k, y^*), the following holds with probability 1 − 2δ/2^k:

W_{k+1} = ‖x^{k+1} − x^*‖² + ‖y^{k+1} − y^*‖²
≤ ‖x^k − x^*‖² + ‖y^k − y^*‖² − 2η_k (f(x^k, v^k) − f(x^*, v^k) − f(u^k, y^k) + f(u^k, y^*) − 2ε) + η_k² (‖ĝ^k_x‖² + ‖ĝ^k_y‖²)
≤ W_k − 2η_k (E_k − 2ε) + η_k² (‖ĝ^k_x‖² + ‖ĝ^k_y‖²). (18)

(20) with probability at least 1 − 2δ/(d 2^k) ≥ 1 − 2δ/2^k. Setting v = y^k and u = x^k respectively in the above inequality, we obtain

∀(u, v) ∈ C^k_x × C^k_y, f(x^k, v) − 5ε ≤ f(x^k, y^k) ≤ f(u, y^k) + 5ε.

By De Morgan's law and the union bound,

Pr[all MC estimates up to step k are accurate] = Pr[⋂_{l=0}^{k} F(g^l_x) ∩ F(g^l_y) ∩ F(f(x^l, v^l)) ∩ F(f(u^l, y^l))] ≥ 1 − Σ_{l=0}^{k} (Pr[¬F(g^l_x)] + Pr[¬F(g^l_y)] + Pr[¬F(f(x^l, v^l))] + Pr[¬F(f(u^l, y^l))]) ≥ 1 − O(δ),

since the per-step failure probabilities scale as δ/2^l and sum to a convergent geometric series.

Regret minimization (Jafari et al., 2001; Zinkevich et al., 2008) and mirror descent (Mertikopoulos et al., 2019; Rakhlin & Sridharan, 2013) are other well-studied approaches. Tree search, such as minimax with alpha-beta pruning, is particularly effective in small-state games. Monte Carlo Tree Search (MCTS) is also effective in Go.

Input: N: number of iterations; η_k: learning rates; m_k: sample sizes; n: population size; l: number of inner updates;

1. Self-play with the latest agent (naive mirror descent). The learner always competes with the most recent agent. This is essentially the gradient descent ascent method of Arrow-Hurwicz-Uzawa (Arrow et al., 1958), i.e., naive mirror/alternating descent.
2. Self-play with the best past agent. The learner competes with the best historical agent maintained. A new agent replaces the maintained agent if it beats the existing one. This is the scheme in AlphaGo Zero and AlphaZero (Silver et al., 2017; 2018).
3. Self-play with a random past agent (fictitious play). The learner competes against a randomly sampled historical opponent. This is the scheme in OpenAI sumo (Bansal et al., 2017; Al-Shedivat et al., 2018), and is similar to fictitious play.
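The three heuristics above can be summarized as opponent-selection functions. This sketch is only meant to contrast the rules; the function name, snapshot labels, and the toy strength order are ours, not from any of the cited systems.

```python
import random

def pick_opponent(rule, snapshots, beats):
    """Baseline opponent-selection heuristics.

    `snapshots` lists past agents oldest-to-newest; `beats(a, b)` tells
    whether agent a beats agent b (used only by the 'best' rule, where
    the maintained champion is replaced once a newcomer beats it).
    """
    if rule == "latest":                 # 1. naive mirror descent / GDA
        return snapshots[-1]
    if rule == "best":                   # 2. AlphaGo Zero / AlphaZero style
        champion = snapshots[0]
        for agent in snapshots[1:]:
            if beats(agent, champion):
                champion = agent
        return champion
    if rule == "random":                 # 3. fictitious-play style
        return random.choice(snapshots)
    raise ValueError(rule)

snaps = ["v0", "v1", "v2"]
stronger = lambda a, b: a > b            # toy strength order by version name
```

None of these rules looks at the current learner when choosing the opponent, which is precisely what the adversarial perturbation rule of Alg. 1 adds.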

Last-iter average: 0.489 ± 0.008 | 0.485 ± 0.008 | 0.495 ± 0.009 | 0.500 ± 0.007 | 0.514 ± 0.007
Overall average: 0.528 ± 0.004 | 0.521 ± 0.004 | 0.530 ± 0.005 | 0.534 ± 0.003 | 0.544 ± 0.003

Tab. 8: Average frequency of using the agent itself as opponent, in the Soccer and Gomoku experiments. The frequency is calculated by counting over all agents and iterations. The ± shows the standard deviations estimated from 3 runs with different random seeds.

Assumption B.1. X, Y ⊆ R^d (d ≥ 1) are compact sets. As a consequence, there exists D ≥ 1 such that ∀x_1, x_2 ∈ X, ‖x_1 − x_2‖_1 ≤ D and ∀y_1, y_2 ∈ Y, ‖y_1 − y_2‖_1 ≤ D. Further, assume f : X × Y → R is a bounded convex-concave function.

Assumption B.2. C^k_x and C^k_y are compact subsets of X and Y, respectively. Assume that (x^k, y^k) → (x̂, ŷ) together with f(x^k, v^k) − f(u^k, y^k) → 0 for some v^k ∈ C^k_y and u^k ∈ C^k_x implies that (x̂, ŷ) is a saddle point.

Theorem 1 states that, under Assump. B.1 and B.2, Alg. 2 (with all estimates replaced by their true values) produces a sequence of points (x^k, y^k) convergent to a saddle point. Assump. B.1 is standard; it holds, for instance, if f is based on a payoff table and X, Y are probability simplexes as in matrix games, or if f is quadratic and X, Y are sets of unit-norm vectors. Assump. B.2 concerns the regularity of the candidate opponent sets: it holds if C^k_y, C^k_x are compact and f(x^k, v^k) − f(u^k, y^k) = 0 only at a saddle point.

Following from Assump. B.2, the convergent point lim_{k→∞} (x^k, y^k) = (x^*, y^*) is a saddle point.

B.2 PROOF OF THEOREM 2

We restate the additional Assump. B.3 and the theorem here for reference. Assump. B.2 is replaced by the following approximate version, Assump. B.4.

Assumption B.3. The total return is bounded by R, i.e., |Σ_t γ^t r_t| ≤ R. The Q-value estimator Q̂ is unbiased and bounded by R (|Q̂| ≤ R), and the policy has bounded gradient max{‖∇ log π_θ(a|s)‖_∞, 1} ≤ B in terms of the L_∞ norm.

Assumption B.4. C^k_x and C^k_y are compact subsets of X and Y, respectively. Assume that at iteration k, for some (x̂, ŷ) ∈ X × Y, whenever f(x̂, v) − ε ≤ f(x̂, ŷ) ≤ f(u, ŷ) + ε holds for all (u, v) ∈ C^k_x × C^k_y, it also holds for all (u, v) ∈ X × Y; namely,

(x̂, ŷ) is an ε-approximate saddle point.

Theorem 2 (Convergence with policy gradients). Under Assump. B.1, B.3, and B.4, let the sample size at step k be large enough (increasing with k, as in Lemmas 3 and 5 with failure probability δ/2^k) that all Monte Carlo estimates at iteration k are ε-accurate,

Lemma 3 (Policy gradient sample size). Consider x or y alone and treat the problem as an MDP.


Case (2). Sufficient decrease of W_k: Otherwise, if the stop condition is not triggered, we have picked u^k, v^k such that Ê_k > 3ε. With probability 1 − 2δ, E_k > Ê_k − 2ε ≥ ε. With the learning rate η_k chosen in the theorem statement, W_k then strictly decreases by at least a positive quantity depending on ε. Since W_k is bounded below by 0, by the monotone convergence theorem, there exists a finite k at which no (u, v) ∈ C^k_x × C^k_y can be found with Ê_k > 3ε; in this case, the stop condition of Case (1) is triggered. This means the algorithm eventually stops, and the proof is complete.

Remark 1. The sample size is chosen very loosely. More efficient ways to find perturbations (e.g., best-arm identification), to better characterize or cover the policy class, and to better utilize trajectories (especially off-policy evaluation with importance sampling) can potentially reduce the sample complexity. In practice, we found that on-policy methods which do not reuse past experience, such as A2C and PPO, work well enough.

Remark 2. Assump. B.4 is a rather strong assumption on the candidate opponent sets. In theory, we can construct an ε-covering of f to satisfy it. In practice, as in the population-based training of Alg. 1, this assumption can be roughly met if the population is large or diverse enough. We found that a relatively small population with randomly initialized agents already brought a noticeable benefit.

Remark 3. The proof requires a variable learning rate η_k. However, the intuition is simply that the learning rate needs to be small, as in our experiments.

