EFFICIENT COMPETITIVE SELF-PLAY POLICY OPTIMIZATION

Abstract

Reinforcement learning from self-play has recently produced many notable successes. In self-play, agents compete against themselves to generate training data for iterative policy improvement. Prior work relies on hand-designed heuristic rules to choose an opponent for the current learner, such as the latest agent, the best agent, or a random historical agent. However, these rules can be inefficient in practice and sometimes fail to guarantee convergence even in the simplest matrix games. This paper proposes a new algorithmic framework for competitive self-play reinforcement learning in two-player zero-sum games. We exploit the fact that the Nash equilibrium coincides with the saddle point of the stochastic payoff function, which motivates us to borrow ideas from the classical saddle point optimization literature. Our method simultaneously trains several agents and intelligently pairs them as opponents according to a simple adversarial rule derived from a principled perturbation-based saddle optimization method. We prove that our algorithm converges to an approximate equilibrium with high probability in convex-concave games under standard assumptions. Beyond the theory, we demonstrate the empirical superiority of our method, with neural network policy function approximators, over baselines that rely on the aforementioned opponent-selection heuristics in matrix games, grid-world soccer, Gomoku, and simulated robot sumo.

1. INTRODUCTION

Reinforcement learning (RL) from self-play has drawn tremendous attention over the past few years. Empirical successes have been observed in several challenging tasks, including Go (Silver et al., 2016; 2017; 2018), simulated hide-and-seek (Baker et al., 2020), simulated sumo wrestling (Bansal et al., 2017), Capture the Flag (Jaderberg et al., 2019), Dota 2 (Berner et al., 2019), StarCraft II (Vinyals et al., 2019), and poker (Brown & Sandholm, 2019), to name a few. During RL from self-play, the learner collects training data by competing with an opponent selected from its past selves or an agent population. Self-play presumably creates an auto-curriculum that lets agents learn at their own pace: at each iteration, the learner faces an opponent comparable in strength to itself, allowing continuous improvement. In prior work, opponents are often selected by human-designed heuristic rules. For example, AlphaGo (Silver et al., 2016) always competes with the latest agent, while the later generations AlphaGo Zero (Silver et al., 2017) and AlphaZero (Silver et al., 2018) generate self-play data with the best historical agent maintained so far. In specific tasks, such as OpenAI's sumo wrestling, competing against a randomly chosen historical agent leads to the emergence of more diverse behaviors (Bansal et al., 2017) and more stable training than competing against the latest agent (Al-Shedivat et al., 2018). In population-based training (Jaderberg et al., 2019; Liu et al., 2019) and AlphaStar (Vinyals et al., 2019), an elite or random agent is picked from the agent population as the opponent. Unfortunately, these rules may be inefficient and sometimes ineffective in practice, since they do not necessarily enjoy last-iterate convergence to the "average-case optimal" solution even in tabular matrix games.
In fact, in the simple Matching Pennies game, self-play against the latest agent fails to converge and falls into oscillating behavior, as shown in Sec. 5. In this paper, we develop an algorithm that adopts a principle-derived opponent-selection rule to alleviate some of the issues mentioned above. This requires clarifying first what the solution of
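The oscillation of latest-agent self-play in Matching Pennies can be reproduced with a minimal simulation. The sketch below is our own illustration, not the paper's algorithm: each player takes a gradient step against the opponent's latest mixed strategy, where the expected payoff to player 1 is (2p-1)(2q-1) with p and q the players' probabilities of playing Heads. The function name, step size, and initial strategies are illustrative choices.

```python
# Illustrative sketch (hypothetical names/parameters): "latest-agent"
# self-play as simultaneous gradient ascent/descent in Matching Pennies.
# Player 1 maximizes the payoff u(p, q) = (2p - 1)(2q - 1);
# player 2 minimizes it. The unique Nash equilibrium is p = q = 0.5.

def matching_pennies_selfplay(p=0.9, q=0.1, lr=0.05, steps=500):
    """Each player takes a gradient step against the other's latest strategy."""
    for _ in range(steps):
        grad_p = 2 * (2 * q - 1)  # d/dp of (2p-1)(2q-1)
        grad_q = 2 * (2 * p - 1)  # d/dq of (2p-1)(2q-1)
        # Simultaneous update: ascent for player 1, descent for player 2.
        p, q = p + lr * grad_p, q - lr * grad_q
        p = min(max(p, 0.0), 1.0)  # keep valid probabilities
        q = min(max(q, 0.0), 1.0)
    return p, q

final_p, final_q = matching_pennies_selfplay()
# Instead of converging to (0.5, 0.5), the iterates spiral around the
# equilibrium, so the final strategies remain far from it.
dist = ((final_p - 0.5) ** 2 + (final_q - 0.5) ** 2) ** 0.5
print(dist)
```

The dynamics are a rotation with a slight outward expansion around the equilibrium, which is why the iterates cycle rather than settle; this is the qualitative failure mode that motivates the saddle-point view taken in this paper.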

