AUTOMATIC CURRICULUM GENERATION FOR REINFORCEMENT LEARNING IN ZERO-SUM GAMES

Anonymous

Abstract

Curriculum learning (CL), whose core idea is to train from easy to hard, is a popular technique for accelerating reinforcement learning (RL) training. It has also become a trend to automate the curriculum generation process. Automatic CL works primarily focus on goal-conditioned RL problems, where an explicit indicator of training progress, e.g., reward or success rate, can be used to prioritize the training tasks. However, such a requirement is no longer valid in zero-sum games: there are no goals for the agents, and the cumulative reward of the learning policy can fluctuate constantly throughout training. In this work, we present the first theoretical framework for automatic curriculum learning in the setting of zero-sum games and derive a surprisingly simple indicator of training progress, the Q-value variance, which can be directly approximated by computing the variance over an ensemble of value networks. With this progression metric, we further adopt a particle-based task sampler to generate initial environment configurations for training, which is particularly lightweight, computation-efficient, and naturally multi-modal. Combining these techniques with multi-agent PPO training, we obtain our final algorithm, Zero-sum Automatic Curriculum Learning (ZACL). We first evaluate ZACL in a 2D particle-world environment, where ZACL produces much stronger policies than popular RL methods for zero-sum games using the same amount of samples. We then show in the challenging hide-and-seek environment that ZACL can lead to all four emergent phases using a single desktop computer, which is reported for the first time in the literature. The project website is at https://sites.google.com/view/zacl.

1. INTRODUCTION

Curriculum learning (CL) (Bengio et al., 2009), whose core idea is to generate training samples from easy to hard, is a popular paradigm for accelerating the training of reinforcement learning (RL) agents (Lazaric et al., 2008; Taylor et al., 2008; Narvekar et al., 2020). Starting from simple tasks, an RL agent can progressively adapt to tasks of increasing difficulty according to a properly designed curriculum, and finally solve hard tasks with fewer samples than naive RL training from scratch on uniformly sampled training tasks. Automating the design of curricula for RL training has attracted much research interest. A common formulation is the teacher-student framework (Matiisen et al., 2019; Portelas et al., 2020), where a "teacher" proposes task configurations that are neither too easy nor too hard for the "student" RL agent to solve. A key ingredient for generating suitable task configurations is measuring the progress of the learning student. In goal-oriented RL problems or cooperative multi-agent games, the training progress is straightforward to measure, since the success rate for reaching a goal or the accumulated reward explicitly reflects the current performance of the student on a specific task (Wang et al., 2019; Florensa et al., 2018; Chen et al., 2021).

However, an explicit progression metric does not exist in the setting of zero-sum games, where the ultimate goal of the RL agent is no longer reaching a goal or accumulating high rewards. Instead, the convergence criterion is to find a Nash equilibrium (NE) efficiently. The accumulated reward of a policy for one player oscillates as it exploits, or is exploited by, its opponent throughout RL training. Therefore, it becomes non-trivial to measure how "close" the current policy is to an NE.
Existing curricula in zero-sum games are typically based on heuristics, e.g., training on an increasing number of agents (Long et al., 2020; Wang et al., 2020b) or adapting other environment parameters according to domain knowledge (Berner et al., 2019; Tang et al., 2021). It remains unclear how to automate the generation of these task parameters.

In this work, we propose a novel automatic CL framework for multi-agent RL training in zero-sum games. We theoretically derive a surprisingly simple progress metric, the Q-value variance, as an implicit signal of the learning progress towards an NE. By prioritizing learning on game configurations with high Q-value variance, we implicitly tighten a lower bound on the true distance between the learning policy and a desired NE. This simple metric is also straightforward to incorporate into a configuration sampler that automatically generates the training curriculum for the RL agents, accelerating convergence towards an NE. We approximate the Q-value variance by the value variance and implement it as the empirical uncertainty over an ensemble of independently learned value networks. We then develop a curriculum generator that samples game configurations according to an empirical probability distribution whose density is defined by the value uncertainty. To keep track of the constantly evolving and multi-modal density landscape induced by the value uncertainty throughout RL training, we adopt a non-parametric sampler that directly samples configurations from a diversified data buffer. Combining the progression metric and the non-parametric curriculum generator with the multi-agent RL backbone MAPPO (Yu et al., 2022), we derive our overall automatic curriculum learning algorithm for zero-sum games, Zero-sum Automatic Curriculum Learning (ZACL).
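The two components above, the ensemble-based uncertainty metric and the non-parametric uncertainty-weighted sampler, can be sketched in a small toy example. This is an illustrative sketch, not ZACL's actual implementation: the function names (`value_uncertainty`, `sample_curriculum`) are hypothetical, and random linear predictors stand in for independently learned value networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# An "ensemble" of K independently initialized value predictors, each a
# stand-in for a learned value network. Their disagreement (prediction
# variance) on a candidate game configuration acts as the curriculum
# priority, approximating the Q-value-variance progression metric.
K, CONFIG_DIM = 5, 4
ensemble = [rng.normal(size=(CONFIG_DIM,)) for _ in range(K)]

def value_uncertainty(config):
    """Variance of the ensemble's value predictions for one configuration."""
    preds = np.array([w @ config for w in ensemble])
    return preds.var()

def sample_curriculum(buffer, n):
    """Non-parametric sampler: draw n configurations from a diversified
    buffer with probability proportional to their value uncertainty."""
    scores = np.array([value_uncertainty(c) for c in buffer])
    probs = scores / scores.sum()  # empirical density induced by uncertainty
    idx = rng.choice(len(buffer), size=n, p=probs, replace=True)
    return [buffer[i] for i in idx]

# Usage: maintain a buffer of candidate configurations and sample a
# training batch biased towards high-uncertainty (high-priority) tasks.
buffer = [rng.normal(size=CONFIG_DIM) for _ in range(100)]
batch = sample_curriculum(buffer, n=8)
```

Because the sampler is just a reweighted draw from a buffer, it tracks an arbitrarily multi-modal and shifting uncertainty landscape with no density model to fit, which is why a particle-based scheme is so lightweight.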
We first evaluate ZACL in a 2D particle-world benchmark, where ZACL learns stronger policies with lower exploitability than existing multi-agent RL algorithms for zero-sum games, given the same amount of environment interactions. We then stress-test the efficiency of ZACL in the challenging hide-and-seek environment. ZACL produces the emergence of all four phases of strategies using only a single desktop machine, which is reported for the first time. Moreover, ZACL consumes substantially fewer environment samples than large-scale distributed PPO training.

2. RELATED WORK

Curriculum learning has a long history of accelerating RL training (Asada et al., 1996; Soni & Singh, 2006; Lazaric et al., 2008; Taylor et al., 2008; Narvekar et al., 2020). In the recent literature, automatic curriculum learning (ACL) is often applied to the goal-oriented RL setting, where the RL agent needs to reach a specific goal in each episode. ACL methods design or learn a smart sampler that generates the task configurations or goals most suitable for advancing training w.r.t. some progression metric (Florensa et al., 2017; 2018; Racaniere et al., 2019; Matiisen et al., 2019; Portelas et al., 2020; Dendorfer et al., 2020; Chen et al., 2021). Such a metric typically relies on an explicit signal, such as the goal-reaching reward or reward changes (Wang et al., 2019; Florensa et al., 2018; Portelas et al., 2020; Matiisen et al., 2019), success rates (Chen et al., 2021), or the expected value on the testing tasks. However, in the setting of zero-sum games, these explicit progression metrics are no longer valid, since the reward associated with a Nash equilibrium can be arbitrary. There are also ACL works utilizing reward-agnostic metrics. One representative category of methods is asymmetric self-play (Sukhbaatar et al., 2018; Liu et al., 2019; OpenAI et al., 2021), where two separate RL agents are trained via self-play, with one agent setting up goals to exploit the other. Such a self-play training framework implicitly constructs a competitive game, leading to an emergent curriculum for the training agents. Some recent works have also shown that, with proper modifications, this training process can be provably related to regret minimization (Dennis et al., 2020; Gur et al., 2021). However, these methods still assume a goal-oriented problem, and it remains non-trivial to adapt them to zero-sum games.
Zhang et al. (2020) adopt a similar value-disagreement ACL metric to tackle a collection of single-agent goal-reaching tasks, which is perhaps the work most technically similar to our method. Beyond the difference in problem domains, we remark that Zhang et al. (2020) develop their method in a purely empirical fashion, while we follow a complete theoretical derivation. It is a beautiful coincidence that empirical observations and theoretical analysis, starting from parallel motivations, converge to the same implementation.

Applying RL to solve zero-sum games can be traced back to the TD-Gammon project (Tesauro, 1995) and has led to great achievements in defeating professional humans in complex competitive games (Jaderberg et al., 2018; Berner et al., 2019; Vinyals et al., 2019). Most RL methods for zero-sum games can be proved to converge to an (approximate) Nash equilibrium (NE) or correlated

