SCALING LAWS FOR A MULTI-AGENT REINFORCEMENT LEARNING MODEL

Abstract

The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. As a consequence, substantial attention has been dedicated to the description of scaling laws, although mostly for supervised learning and only to a reduced extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the basis of a relationship between Elo rating, playing strength, and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws, we obtain a power law relating optimal model size to compute, similar to the ones observed for language models. We find that the predicted scaling of optimal neural network size fits our data for both games. We also show that large AlphaZero models are more sample efficient, performing better than smaller models trained on the same amount of data.

1. INTRODUCTION

In recent years, power-law scaling of performance indicators has been observed in a range of machine-learning architectures (Hestness et al., 2017; Kaplan et al., 2020; Henighan et al., 2020; Gordon et al., 2021; Hernandez et al., 2021; Zhai et al., 2022), such as Transformers, LSTMs, Routing Networks (Clark et al., 2022) and ResNets (Bello et al., 2021). The range of fields investigated includes natural language processing and computer vision (Rosenfeld et al., 2019). Most of these scaling laws concern the dependency of test loss on either dataset size, number of neural network parameters, or training compute. The robustness of the observed scaling laws across many orders of magnitude led to the creation of large models, with parameters numbering in the tens and hundreds of billions (Brown et al., 2020; Hoffmann et al., 2022; Alayrac et al., 2022). Until now, evidence for power-law scaling has come for the most part from supervised learning methods. Considerably less effort has been dedicated to the scaling of reinforcement learning algorithms, such as performance scaling with model size (Reed et al., 2022; Lee et al., 2022). At times, scaling laws remained unnoticed, given that they show up not as power laws, but as log-linear relations when Elo scores are taken as the performance measure in multi-agent reinforcement learning (MARL) (Jones, 2021; Liu et al., 2021) (see Section 3.2). Of particular interest in this context is the AlphaZero family of models, AlphaGo Zero (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), and MuZero (Schrittwieser et al., 2020), which achieved state-of-the-art performance on several board games, without access to human gameplay datasets, by applying a tree search guided by a neural network. Here we present an extensive study of power-law scaling in the context of two-player open-information games. Our study constitutes, to our knowledge, the first investigation of power-law scaling phenomena for a MARL algorithm.
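The connection between log-linear Elo curves and power-law playing strength can be made concrete with a small sketch. Under the Bradley-Terry model, a player with latent strength γ_i beats a player with strength γ_j with probability γ_i/(γ_i + γ_j), and an Elo difference is a log-transform of the same strength ratio. The code below is an illustration under these standard definitions; the function names and the value of the exponent `alpha_C` are ours, chosen for demonstration:

```python
import math

def win_probability(gamma_i: float, gamma_j: float) -> float:
    """Bradley-Terry win probability of player i over player j."""
    return gamma_i / (gamma_i + gamma_j)

def elo_difference(gamma_i: float, gamma_j: float) -> float:
    """Elo rating difference implied by a Bradley-Terry strength ratio,
    using the standard convention gamma = 10**(rating / 400)."""
    return 400.0 * math.log10(gamma_i / gamma_j)

# If playing strength scales as a power law in compute, gamma ∝ C**alpha_C,
# then the Elo gap between two agents is *linear* in log-compute:
#   elo = 400 * alpha_C * log10(C_i / C_j),
# which is why a power law appears as a straight line on Elo-vs-log(C) plots.
alpha_C = 1.0  # illustrative exponent
gamma = lambda c: c ** alpha_C

# Jones (2021): an agent with twice its opponent's compute wins roughly
# 2/3 of games, which under Bradley-Terry corresponds to alpha_C = 1:
p = win_probability(gamma(2.0), gamma(1.0))
print(round(p, 3))  # 2 / (2 + 1) = 2/3 ≈ 0.667
```

Doubling compute with a general exponent gives a win probability of 2^α_C / (2^α_C + 1), so the observed win rate of a 2x-compute agent directly pins down the scaling exponent.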
Figure 1: Left: Optimal number of neural network parameters for different amounts of available compute. The optimal agent size scales for both Connect Four and Pentago as a single power law with available compute. The predicted slope α_C^opt = α_C/α_N of Eq. (7) matches the observed data, where α_C and α_N are the compute and model-size scaling exponents, respectively. See Table 3 for the numerical values. Right: The same graph zoomed out to include the resources used to create AlphaZero (Silver et al., 2017a) and AlphaGo Zero (Silver et al., 2017b). These models stand well below the optimal trend for Connect Four and Pentago.

Measuring the performance of the AlphaZero algorithm using Elo rating, we follow a similar path as Kaplan et al. (2020) by providing evidence of power-law scaling of playing strength with model size and compute, as well as a power law of optimal model size with respect to available compute. Focusing on AlphaZero agents that are guided by neural nets with fully connected layers, we test our hypothesis on two popular board games: Connect Four and Pentago. These games are selected for being different from each other with respect to branching factors and game lengths. Using the Bradley-Terry model definition of playing strength (Bradley & Terry, 1952), we start by showing that playing strength scales as a power law with neural network size when models are trained until convergence in the limit of abundant compute. We find that agents trained on Connect Four and Pentago scale with similar exponents. In a second step we investigate the trade-off between model size and compute. Similar to scaling observed in the game Hex (Jones, 2021), we observe power-law scaling when compute is limited, again with similar exponents for Connect Four and Pentago. Finally, we utilize these two scaling laws to find a scaling law for the optimal model size given the amount of compute available, as shown in Fig. 1.
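The exponent relation α_C^opt = α_C/α_N quoted in the Fig. 1 caption can be made plausible by a short heuristic argument (a sketch in our own notation, not a reproduction of the paper's Eq. (7)). Suppose Elo-like playing strength grows as α_N log N for models trained to convergence, and as α_C log C when compute-limited training uses optimally sized agents. The optimal model size for a budget C then sits at the transition between the two regimes:

```latex
% Size-limited regime (training to convergence):  elo(N) \sim \alpha_N \log N
% Compute-limited regime (optimal model sizes):   elo(C) \sim \alpha_C \log C
% Equating the two at the transition point yields the optimal size:
\alpha_N \log N_{\mathrm{opt}} \simeq \alpha_C \log C
\quad \Longrightarrow \quad
N_{\mathrm{opt}} \propto C^{\alpha_C / \alpha_N} \equiv C^{\alpha_C^{\mathrm{opt}}}
```

In other words, a straight line with slope α_C/α_N on a log-log plot of optimal size versus compute, which is the prediction tested in Fig. 1.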
We find that the optimal neural network size scales as a power law with compute, with an exponent that can be derived from the individual size-scaling and compute-scaling exponents. All code and data used in our experiments are available online.¹

2. RELATED WORK

Little work on power-law scaling has been published for MARL algorithms. Schrittwieser et al. (2021) report reward scaling as a power law with data frames when training a data-efficient variant of MuZero. Jones (2021), the work closest to our own, shows evidence of power-law scaling of performance with compute by measuring the performance of AlphaZero agents on small-board variants of the game of Hex. For board sizes 3-9, log-scaling of Elo rating with compute is found when plotting the maximal scores reached among training runs. Without making an explicit connection to power-law scaling, the results reported by Jones (2021) can be characterized by a compute exponent of α_C ≈ 1.3, as can be shown using Eq. (3). In that paper, the author suggests a phenomenological explanation for the observation that an agent with twice the compute of its opponent seems to win with a probability of roughly 2/3, which in fact corresponds to a compute exponent of α_C = 1. Similarly, Liu et al. (2021) report Elo scores that appear to scale with the log of environment frames for humanoid agents playing football, which would correspond to a power-law exponent of roughly 0.5 for playing strength scaling with data. Lee et al. (2022) apply the Transformer architecture to Atari games and plot performance scaling with the number of model parameters. Due to the substantially increased cost of calculating model-size scaling compared to compute or dataset-size scaling, they obtain only a limited number of data points, each generated by a single training seed. On this ba-



¹ https://github.com/OrenNeumann/AlphaZero-scaling-laws

