EFFICIENT COMPETITIVE SELF-PLAY POLICY OPTIMIZATION

Abstract

Reinforcement learning from self-play has recently reported many successes. In self-play, agents compete with copies of themselves to generate training data for iterative policy improvement. Previous work designs heuristic rules to choose an opponent for the current learner; typical rules include choosing the latest agent, the best agent, or a random historical agent. However, these rules may be inefficient in practice and sometimes do not guarantee convergence even in the simplest matrix games. This paper proposes a new algorithmic framework for competitive self-play reinforcement learning in two-player zero-sum games. We exploit the fact that the Nash equilibrium coincides with the saddle point of the stochastic payoff function, which motivates us to borrow ideas from the classical saddle point optimization literature. Our method simultaneously trains several agents, which take one another as opponents according to a simple adversarial rule derived from a principled perturbation-based saddle optimization method. We prove theoretically that, under standard assumptions, our algorithm converges to an approximate equilibrium with high probability in convex-concave games. Beyond the theory, we further show the empirical superiority of our method, with neural net policy function approximators, over baseline methods relying on the aforementioned opponent-selection heuristics in matrix games, grid-world soccer, Gomoku, and simulated robot sumo.

1. INTRODUCTION

Reinforcement learning (RL) from self-play has drawn tremendous attention over the past few years. Empirical successes have been observed in several challenging tasks, including Go (Silver et al., 2016; 2017; 2018), simulated hide-and-seek (Baker et al., 2020), simulated sumo wrestling (Bansal et al., 2017), Capture the Flag (Jaderberg et al., 2019), Dota 2 (Berner et al., 2019), StarCraft II (Vinyals et al., 2019), and poker (Brown & Sandholm, 2019), to name a few. During RL from self-play, the learner collects training data by competing with an opponent selected from its past selves or an agent population. Self-play presumably creates an auto-curriculum for the agents to learn at their own pace: at each iteration, the learner faces an opponent comparable in strength to itself, allowing continuous improvement. In prior work, the opponents are often selected by human-designed heuristic rules. For example, AlphaGo (Silver et al., 2016) always competes with the latest agent, while the later generations AlphaGo Zero (Silver et al., 2017) and AlphaZero (Silver et al., 2018) generate self-play data with the best maintained historical agent. In specific tasks, such as OpenAI's sumo wrestling, competing against a randomly chosen historical agent leads to the emergence of more diverse behaviors (Bansal et al., 2017) and more stable training than competing against the latest agent (Al-Shedivat et al., 2018). In population-based training (Jaderberg et al., 2019; Liu et al., 2019) and AlphaStar (Vinyals et al., 2019), an elite or random agent is picked from the agent population as the opponent. Unfortunately, these rules may be inefficient and sometimes ineffective in practice, since they do not necessarily enjoy last-iterate convergence to the "average-case optimal" solution even in tabular matrix games.
In fact, in the simple Matching Pennies game, self-play against the latest agent fails to converge and falls into oscillating behavior, as shown in Sec. 5. In this paper, we develop an algorithm that adopts a principle-derived opponent-selection rule to alleviate some of the issues mentioned above. This requires first clarifying what the solution of self-play RL should be. From the game-theoretic perspective, the Nash equilibrium is a fundamental solution concept that characterizes the desired "average-case optimal" strategies (policies): when each player assumes the other players also play their equilibrium strategies, no one in the game can gain by unilaterally deviating to another strategy. Nash, in his seminal work (Nash, 1951), established the existence of a mixed-strategy Nash equilibrium in any finite game. Solving for a mixed-strategy Nash equilibrium is therefore a reasonable goal of self-play RL.

We consider the particular case of two-player zero-sum games as the model for competitive self-play RL environments. In this case, the Nash equilibrium coincides with the (global) saddle point and with the solution of the minimax program min_{x∈X} max_{y∈Y} f(x, y), where x, y denote the strategy profiles (in RL terminology, policies) of the two players and f is the loss for x and the utility/reward for y. A saddle point (x*, y*) ∈ X × Y, where X and Y are the sets of all possible mixed strategies (stochastic policies) of the two players, satisfies the key property

    f(x*, y) ≤ f(x*, y*) ≤ f(x, y*),  ∀x ∈ X, ∀y ∈ Y.   (1)

These connections to the saddle point problem and game theory inspire us to borrow ideas from the abundant literature on finding saddle points in optimization (Arrow et al., 1958; Korpelevich, 1976; Kallio & Ruszczynski, 1994; Nedić & Ozdaglar, 2009) and on finding equilibria in game theory (Zinkevich et al., 2008; Brown, 1951; Singh et al., 2000).
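The saddle-point property and the oscillation of naive self-play can both be checked concretely on Matching Pennies. The following sketch (our own illustration, not code from the paper) parametrizes each mixed strategy by the probability of playing Heads and omits projection onto the simplex for brevity:

```python
import numpy as np

# Matching Pennies: the row player's loss is f(x, y) = x^T A y,
# where x, y are mixed strategies over {Heads, Tails}.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def f(x, y):
    return x @ A @ y

x_star = np.array([0.5, 0.5])  # equilibrium (uniform) strategies
y_star = np.array([0.5, 0.5])

# Saddle-point property (Eq. 1): f(x*, y) <= f(x*, y*) <= f(x, y*).
rng = np.random.default_rng(0)
for _ in range(1000):
    p, q = rng.random(), rng.random()
    x = np.array([p, 1 - p])  # random mixed strategies
    y = np.array([q, 1 - q])
    assert f(x_star, y) <= f(x_star, y_star) + 1e-12
    assert f(x_star, y_star) <= f(x, y_star) + 1e-12

# Naive simultaneous gradient play (each side trains against the latest
# opponent) drifts away from the equilibrium instead of converging.
# With x = (p, 1-p), y = (q, 1-q):  f = (2p - 1)(2q - 1).
p, q, eta = 0.9, 0.2, 0.1
d0 = abs(p - 0.5) + abs(q - 0.5)
for _ in range(100):
    gp = 2 * (2 * q - 1)  # d f / d p
    gq = 2 * (2 * p - 1)  # d f / d q
    p, q = p - eta * gp, q + eta * gq  # projection onto [0, 1] omitted
d1 = abs(p - 0.5) + abs(q - 0.5)
assert d1 > d0  # the iterates spiral outward (oscillation/divergence)
```

The first loop confirms that the uniform strategy pair satisfies both inequalities of Eq. 1 (here with equality, since f(x*, y) = f(x, y*) = 0 for all x, y), while the final assertion shows the distance to the equilibrium growing under naive gradient self-play.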
One particular class of methods, the perturbation-based subgradient methods for finding saddle points (Korpelevich, 1976; Kallio & Ruszczynski, 1994), is especially appealing. This class of methods builds directly on the inequalities in Eq. 1 and has several advantages: (1) Unlike algorithms that require knowledge of the game dynamics (Silver et al., 2016; 2017; Nowé et al., 2012), it requires only subgradients, so it is easily adapted to policy optimization with estimated policy gradients. (2) For convex-concave functions, it is guaranteed to converge in its last iterate rather than an average iterate, which alleviates the need to compute any historical averages as in Brown (1951); Singh et al. (2000); Zinkevich et al. (2008), a computation that can become complicated when neural nets are involved (Heinrich & Silver, 2016). (3) Most importantly, it prescribes a simple, principled way to adversarially choose self-play opponents, which can be naturally instantiated with a concurrently trained agent population.

To summarize, we apply ideas from the perturbation-based methods of classical saddle point optimization to the model-free self-play RL regime. This results in a novel population-based policy gradient method with a principled adversarial opponent-selection rule. Analogous to the standard model-free RL setting, we assume only "naive" players (Jafari et al., 2001) for whom the game dynamics are hidden and only the rewards of their own actions are revealed. This enables broader applicability than many existing algorithms (Silver et al., 2016; 2017; Nowé et al., 2012) to problems with mismatched or unknown game dynamics. In Sec. 4, we provide an approximate convergence theorem for convex-concave games as a sanity check. Sec. 5 presents extensive experimental results favoring our algorithm's effectiveness in several games, including matrix games, grid-world soccer, a board game, and a challenging simulated robot sumo game.
Our method demonstrates higher per-agent sample efficiency than baseline methods with alternative opponent-selection rules. Our trained agents also outperform the baseline agents on average in competitions.
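The perturbation idea behind this class of methods can be illustrated in its simplest single-pair form by Korpelevich's extragradient update on Matching Pennies: gradients are evaluated at a perturbed (look-ahead) point rather than at the current iterate. This is only a sketch of the underlying optimization principle, not the paper's population-based algorithm; the step size and parametrization are illustrative:

```python
# Extragradient (Korpelevich, 1976) on Matching Pennies with strategies
# parametrized by the probability of playing Heads:
#   f(p, q) = (2p - 1)(2q - 1);  the p-player minimizes, the q-player maximizes.
# Evaluating gradients at a perturbed point yields last-iterate convergence
# where naive simultaneous gradient play oscillates.  Projection onto [0, 1]
# is omitted since the iterates contract toward 0.5.
p, q, eta = 0.9, 0.2, 0.1

def grad_p(q):  # d f / d p
    return 2 * (2 * q - 1)

def grad_q(p):  # d f / d q
    return 2 * (2 * p - 1)

for _ in range(2000):
    # perturbation (look-ahead) step
    p_h = p - eta * grad_p(q)
    q_h = q + eta * grad_q(p)
    # actual step uses gradients at the perturbed point
    p, q = p - eta * grad_p(q_h), q + eta * grad_q(p_h)

# Both players approach the mixed-strategy equilibrium (0.5, 0.5).
assert abs(p - 0.5) < 1e-3 and abs(q - 0.5) < 1e-3
```

Intuitively, the look-ahead point plays the role of the adversarially perturbed opponent; in the population setting, that perturbation is supplied by the concurrently trained agents.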

2. RELATED WORK

Reinforcement learning trains a single agent to maximize the expected return in an environment (Sutton & Barto, 2018). Multi-agent reinforcement learning (MARL), of which the two-agent setting is a special case, concerns multiple agents taking actions in the same environment (Littman, 1994). Self-play is a training paradigm that generates data for MARL and has led to great successes, achieving superhuman performance in several domains (Tesauro, 1995; Silver et al., 2016; Brown & Sandholm, 2019). Naively applying single-agent RL algorithms as independent learners in MARL sometimes produces strong agents (Tesauro, 1995). There are also algorithms developed from the game-theoretic and online-learning perspectives (Lanctot et al., 2017; Nowé et al., 2012; Cardoso et al., 2019), notably tree search and fictitious self-play (Brown, 1951).

However, most of these methods are designed for tabular RL only and are therefore not readily applicable to continuous state-action spaces or complex policy functions, where gradient-based policy optimization methods are preferred. Recently, Bai & Jin (2020), Lee et al. (2020), and Zhang et al. (2020) provide theoretical regret or convergence analyses under tabular or other restricted self-play settings, which complement our empirical effort.

