SCALING LAWS FOR A MULTI-AGENT REINFORCEMENT LEARNING MODEL

Abstract

The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. As a consequence, substantial attention has been dedicated to the description of scaling laws, mostly for supervised learning and only to a reduced extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. Building on a relationship between Elo rating, playing strength, and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws, we obtain a power law relating optimal model size to compute, similar to the laws observed for language models. The predicted scaling of optimal neural network size fits our data for both games. We also show that large AlphaZero models are more sample efficient: they perform better than smaller models given the same amount of training data.

1. INTRODUCTION

In recent years, power-law scaling of performance indicators has been observed in a range of machine-learning architectures (Hestness et al., 2017; Kaplan et al., 2020; Henighan et al., 2020; Gordon et al., 2021; Hernandez et al., 2021; Zhai et al., 2022), such as Transformers, LSTMs, Routing Networks (Clark et al., 2022) and ResNets (Bello et al., 2021). The fields investigated include natural language processing and computer vision (Rosenfeld et al., 2019). Most of these scaling laws concern the dependency of test loss on dataset size, number of neural network parameters, or training compute. The robustness of the observed scaling laws across many orders of magnitude motivated the creation of large models, with parameter counts in the tens and hundreds of billions (Brown et al., 2020; Hoffmann et al., 2022; Alayrac et al., 2022). Until now, evidence for power-law scaling has come for the most part from supervised learning methods. Considerably less effort has been dedicated to the scaling of reinforcement learning algorithms, such as performance scaling with model size (Reed et al., 2022; Lee et al., 2022). At times, scaling laws went unnoticed, because they show up not as power laws but as log-linear relations when Elo scores are taken as the performance measure in multi-agent reinforcement learning (MARL) (Jones, 2021; Liu et al., 2021) (see Section 3.2). Of particular interest in this context is the AlphaZero family of models, AlphaGo Zero (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), and MuZero (Schrittwieser et al., 2020), which achieved state-of-the-art performance on several board games, without access to human gameplay datasets, by applying a tree search guided by a neural network. Here we present an extensive study of power-law scaling in the context of two-player open-information games. To our knowledge, our study constitutes the first investigation of power-law scaling phenomena for a MARL algorithm.
Measuring the performance of the AlphaZero algorithm using Elo rating, we follow a similar path as Kaplan et al. (2020) by providing evidence of power-law scaling.
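The connection between power-law scaling and log-linear Elo curves mentioned above can be sketched numerically. Under the Bradley-Terry model (the standard model underlying Elo ratings), a 400-point Elo gap corresponds to 10:1 winning odds, so an Elo difference is proportional to the logarithm of the playing-strength ratio. If strength grows as a power law in parameter count, the Elo gain per tenfold increase in parameters is therefore constant. The exponent below is purely illustrative, not a value measured in this paper:

```python
import math

def elo_gap(strength_a: float, strength_b: float) -> float:
    """Elo difference implied by a playing-strength ratio.

    Bradley-Terry: P(A beats B) = s_A / (s_A + s_B), and Elo is
    calibrated so that a 400-point gap corresponds to 10:1 odds.
    """
    return 400.0 * math.log10(strength_a / strength_b)

# Hypothetical power-law scaling of playing strength with
# parameter count N: s(N) = N ** alpha (alpha is illustrative).
ALPHA = 0.5

def strength(n_params: float, alpha: float = ALPHA) -> float:
    return n_params ** alpha

for n in (1e5, 1e6, 1e7):
    gap = elo_gap(strength(10 * n), strength(n))
    # Every 10x in parameters yields the same Elo gain,
    # 400 * alpha, so Elo is log-linear in N.
    print(f"N={n:.0e}: Elo gain for 10x params = {gap:.1f}")
```

Because the Elo gain per decade of parameters is constant (here 400 × 0.5 = 200 points), a power law in strength appears as a straight line when Elo is plotted against log N, which is why such scaling can pass unnoticed when Elo is the reported metric.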

