SCALING LAWS VS MODEL ARCHITECTURES: HOW DOES INDUCTIVE BIAS INFLUENCE SCALING?

Abstract

There has been significant interest in the scaling properties of Transformer models (Kaplan et al., 2020). However, little work has investigated the scaling behaviour of different inductive biases and model architectures. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures, including Transformers, Switch Transformers, Universal Transformers, Dynamic Convolutions, Performers, and the recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when scaling and (2) the best-performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.

1. INTRODUCTION

There has been considerable recent interest in the scaling properties of Transformer models (Kaplan et al., 2020; Hernandez et al., 2021; Bahri et al., 2021; Henighan et al., 2020; Tay et al., 2021b; Abnar et al., 2021). However, not much is understood about the scaling properties of the different inductive biases imposed by model architectures. Improvements at a specific scale (compute, size, etc.) are often assumed to transfer to different scales and compute regions (So et al., 2019; Choromanski et al., 2020; Lan et al., 2019; Dehghani et al., 2018), and new research is often presented in a point-wise fashion with respect to scale. In short, it is not uncommon for new methods to be presented with data points at very specific or limited compute regions (e.g., base size). We believe that understanding the interaction between architecture and scaling laws is crucial, as designing models that perform well at diverse scales will likely have significant impact. This paper is an attempt to understand the effect of inductive bias (architecture) on the scaling laws of language models. To this end, we pre-train and finetune over ten diverse model architectures across multiple compute regions and scales (e.g., from 15M to 40 billion parameters). In total, we pre-train and finetune over 100 models of different architectures and sizes, and present insights and challenges in scaling these ten diverse architectures. We consider a broad spectrum of models in our extensive experiments. Concretely, we consider several well-established Transformer variants (Vaswani et al., 2017) such as Evolved Transformer (So et al., 2019), Universal Transformers (Dehghani et al., 2018) and Switch Transformers (Fedus et al., 2021). We also consider lightweight models such as ALBERT (Lan et al., 2019) and efficient Transformers (Tay et al., 2020) such as Performer (Choromanski et al., 2020) and Funnel Transformers (Dai et al., 2020).
In our comparison, we are also interested in finding out whether general improvements to the Transformer architecture such as Mixture-of-Softmaxes (Yang et al., 2017) and Gated Linear Units (Dauphin et al., 2017; Shazeer, 2020) influence the scaling behaviour of models. Finally, we also evaluate models outside the Transformer family, including Lightweight Convolutions (Wu et al., 2019), Dynamic Convolutions (Wu et al., 2019) and the recently proposed MLP-Mixers (Tolstikhin et al., 2021). Figure 1 illustrates an overview of the experiments we run. We also note that scaling these models is not as straightforward as it seems: there are intricate details of scale that are intertwined with architectural choices, which we study in detail in this paper. For example, a distinct feature of Universal Transformers (and ALBERT) is parameter sharing. Hence, compared with standard Transformers, this architectural choice significantly warps the scaling behaviour, not only with respect to performance but also among compute metrics such as FLOPs, speed and number of parameters (Dehghani et al., 2021a). Conversely, models such as Switch Transformers are on the other end of the spectrum, with an uncommon relationship between FLOPs and number of parameters, i.e., a high parameter-to-FLOPs ratio. This makes navigating the landscape challenging.
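To make the FLOPs-versus-parameters decoupling concrete, the toy sketch below (not from the paper; the per-layer parameter formulas are rough back-of-the-envelope approximations, and all sizes are made-up) contrasts a standard Transformer with a parameter-shared model (ALBERT/Universal Transformer-style) and a sparse mixture-of-experts model (Switch-style):

```python
# Toy sketch: how parameter sharing and sparse expert routing decouple
# parameter count from per-token FLOPs. All formulas are rough
# approximations for illustration only.

def transformer_stats(d_model, n_layers, shared=False, n_experts=1):
    """Return (stored parameters, approx. FLOPs per token)."""
    attn_params = 4 * d_model ** 2          # Q, K, V, output projections
    ffn_params = 8 * d_model ** 2           # two FFN matrices, 4x expansion
    # Weights a single token actually passes through in one layer:
    active_per_layer = attn_params + ffn_params
    # Weights stored for one layer (an MoE layer replicates the FFN
    # per expert, but each token is routed to only one expert):
    stored_per_layer = attn_params + n_experts * ffn_params
    # Parameter sharing reuses one layer's weights n_layers times:
    params = stored_per_layer if shared else n_layers * stored_per_layer
    # ~2 FLOPs per active parameter, paid once per layer traversal:
    flops_per_token = n_layers * 2 * active_per_layer
    return params, flops_per_token

dense_p, dense_f = transformer_stats(512, 12)
shared_p, shared_f = transformer_stats(512, 12, shared=True)
moe_p, moe_f = transformer_stats(512, 12, n_experts=8)

# Sharing: 12x fewer stored params, identical per-token FLOPs.
# MoE: far more stored params, identical per-token FLOPs.
```

Under this rough accounting, all three models spend the same compute per token, while their parameter counts differ by more than an order of magnitude in opposite directions, which is why comparing architectures on "size" alone is ambiguous.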

Our Contributions and Insights

The key contributions of this paper are as follows:

• For the first time, we derive scaling laws for different inductive biases and model architectures. We find that the scaling coefficient differs greatly from model to model. We believe this is an important consideration in model development. It turns out that among all ten architectures we consider, the vanilla Transformer has the best scaling behaviour, even if its absolute performance at each compute region is not the greatest.

• We observe that models that operate well in one compute-scale region are not necessarily the best in another. Moreover, we find that certain models have difficulty scaling despite performing decently (comparably) at lower-compute regions. This has implications, since it is difficult to get the full picture of a model's scalability from point-wise comparisons at a single compute region.

• We find that, when scaling different model architectures, upstream pre-training perplexity might not correlate well with downstream transfer. Hence, the underlying architecture and inductive bias are also crucial for downstream transfer.

• We highlight the difficulties of scaling certain architectures and show that some models do not scale (or scale with a negative trend). We also find concerning trends where linear-time attention models such as Performer struggle to scale up.
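A per-architecture scaling coefficient of the kind described above can be estimated by a linear fit in log-log space. The sketch below uses synthetic loss values (not the paper's data; the two "architectures" and their losses are invented for illustration):

```python
# Toy sketch with synthetic data: estimating a per-architecture scaling
# exponent b by fitting log(loss) ~ a + b * log(FLOPs).
# A more negative slope b means the model improves faster with scale.
import math

def fit_scaling_exponent(flops, losses):
    """Ordinary least-squares slope in log-log space."""
    xs = [math.log(f) for f in flops]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

flops = [1e12, 4e12, 1.6e13, 6.4e13]
# Hypothetical losses: "architecture A" scales faster than "architecture B".
loss_a = [4.0, 3.2, 2.56, 2.05]   # roughly proportional to FLOPs^-0.16
loss_b = [3.8, 3.3, 2.9, 2.55]    # roughly proportional to FLOPs^-0.10

b_a = fit_scaling_exponent(flops, loss_a)
b_b = fit_scaling_exponent(flops, loss_b)
```

Comparing fitted exponents rather than single points is what distinguishes a scaling study from the point-wise comparisons criticized in the introduction: B may beat A at the smallest budget here, yet its shallower slope means A overtakes it as compute grows.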

2. RELATED WORK

Kaplan et al. (2020) studied empirical scaling laws of decoder-only Transformer language models. They focused on the standard left-to-right language modeling objective with cross-entropy loss as the performance metric. One of their main findings is that the loss scales as a power-law with three major factors of model training: model size, dataset size and the amount of training compute.
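Concretely, the power-law relationships reported by Kaplan et al. (2020) take the form

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```

where $L$ is the cross-entropy loss, $N$ is the number of non-embedding parameters, $D$ is the dataset size in tokens, $C$ is the training compute, and $N_c$, $D_c$, $C_c$, $\alpha_N$, $\alpha_D$, $\alpha_C$ are empirically fitted constants (each law holds when the other two factors are not bottlenecks).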



Figure 1: An overview compute-performance (FLOPs vs. performance) plot of all the diverse models and architectures we pretrained and finetuned in this study. Colors represent different model architectures, and the size of each circle represents the size of the model (parameters).

