SCALING LAWS VS MODEL ARCHITECTURES: HOW DOES INDUCTIVE BIAS INFLUENCE SCALING?

Abstract

There has been a lot of interest in the scaling properties of Transformer models (Kaplan et al., 2020). However, not much has been done to investigate the effect of scaling on different inductive biases and model architectures. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures such as Transformers, Switch Transformers, Universal Transformers, Dynamic Convolutions, Performers, and the recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling and (2) the best performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community. We also note that scaling these models is not as straightforward as it seems, i.e., there are intricate details of scale that are intertwined with architectural choices, which we study in detail in this paper.

Under review as a conference paper at ICLR 2023

1. INTRODUCTION

There has been a lot of recent interest in the scaling properties of Transformer models (Kaplan et al., 2020; Hernandez et al., 2021; Bahri et al., 2021; Henighan et al., 2020; Tay et al., 2021b; Abnar et al., 2021). However, not much is understood about the scaling properties of the different inductive biases imposed by model architectures. Improvements at a specific scale (compute, size, etc.) are often assumed to transfer to different scales and compute regions (So et al., 2019; Choromanski et al., 2020; Lan et al., 2019; Dehghani et al., 2018), and new research is often presented in a point-wise fashion with respect to scale. In short, it is not uncommon for new methods to be presented with data points at very specific or limited compute regions (e.g., base size). We believe that understanding the interaction between architecture and scaling laws is crucial, as designing models that perform well at diverse scales will likely have significant impact. This paper is an attempt to understand the effect of inductive bias (architecture) on the scaling laws of language models. To this end, we pre-train and finetune over ten diverse model architectures across multiple compute regions and scales (e.g., from 15M to 40 billion parameters). In total, we pre-train and finetune over 100 models of different architectures and sizes and present insights and challenges in scaling these ten diverse architectures. We consider a broad spectrum of models in our extensive experiments. Concretely, we consider several well-established Transformer variants (Vaswani et al., 2017) such as the Evolved Transformer (So et al., 2019), Universal Transformers (Dehghani et al., 2018) and Switch Transformers (Fedus et al., 2021). We also consider lightweight models such as ALBERT (Lan et al., 2019) and efficient Transformers (Tay et al., 2020) such as the Performer (Choromanski et al., 2020) and Funnel Transformers (Dai et al., 2020).
In our comparison, we are also interested in finding out whether general improvements to the Transformer architecture such as Mixture-of-Softmaxes (Yang et al., 2017) and Gated Linear Units (Dauphin et al., 2017; Shazeer, 2020) influence the scaling behaviour of models. Finally, we also evaluate models outside the Transformer family, including Lightweight Convolutions (Wu et al., 2019), Dynamic Convolutions (Wu et al., 2019) and the recently proposed MLP-Mixers (Tolstikhin et al., 2021). Figure 1 illustrates an overview of the experiments we run. Scaling these models is not always straightforward. For example, a distinct feature of Universal Transformers (and ALBERT) is parameter sharing. Hence, compared with standard Transformers, this architectural choice significantly warps the scaling behaviour, not only with respect to performance but also across compute metrics such as FLOPs, speed and number of parameters (Dehghani et al., 2021a). Conversely, models such as Switch Transformers are on the other end of the spectrum, with an uncommon relationship between FLOPs and number of parameters, i.e., they have a high parameter-to-FLOPs ratio. These differences make navigating this landscape challenging.
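The warped parameter-to-FLOPs relationship of sparse models can be made concrete with a back-of-the-envelope sketch. The sizes below are hypothetical (not our experimental configurations); the routing assumption (top-1, as in Switch Transformers) is the only structural fact carried over from the text:

```python
# Back-of-the-envelope sketch (hypothetical sizes): a Switch-style MoE
# feed-forward layer multiplies parameters by the number of experts, but
# with top-1 routing each token passes through exactly one expert, so
# forward FLOPs per token stay roughly constant.

def ffn_params(d_model, d_ff):
    # Two dense projections of a position-wise feed-forward block
    # (biases omitted for simplicity).
    return 2 * d_model * d_ff

def ffn_flops_per_token(d_model, d_ff):
    # ~2 FLOPs (one multiply, one add) per weight in a matmul.
    return 2 * ffn_params(d_model, d_ff)

d_model, d_ff, n_experts = 1024, 4096, 64

dense_params = ffn_params(d_model, d_ff)
moe_params = n_experts * ffn_params(d_model, d_ff)

dense_flops = ffn_flops_per_token(d_model, d_ff)
moe_flops = ffn_flops_per_token(d_model, d_ff)  # top-1: one expert per token

print(moe_params / dense_params)  # 64x the parameters...
print(moe_flops / dense_flops)    # ...at 1x the FLOPs
```

This is the "high parameter-to-FLOPs ratio" mentioned above: any metric plotted against parameters will look very different from the same metric plotted against FLOPs for such models.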

Our Contributions and Insights

The key contributions of this paper are as follows:
• For the first time, we derive scaling laws for different inductive biases and model architectures. We find that the scaling coefficient differs greatly from model to model. We believe this is an important consideration in model development. It turns out that amongst all ten architectures we consider, the vanilla Transformer has the best scaling behaviour, even if its absolute performance at each compute region is not the greatest.
• We observe that a model that operates well in one compute-scale region is not necessarily the best in another compute region. Moreover, we find that certain models have difficulty scaling despite performing decently (comparably) at lower-compute regions. This has implications, since it is difficult to get the full picture of a model's scalability from point-wise comparisons at a single compute region.
• We find that when it comes to scaling different model architectures, upstream pre-training perplexity might not correlate well with downstream transfer. Hence, the underlying architecture and inductive bias are also crucial for downstream transfer.
• We highlight the difficulties of scaling certain architectures and show that some models do not scale (or scale with a negative trend). We also find concerning trends where linear-time attention models such as the Performer struggle with scaling up.

2. RELATED WORK

Kaplan et al. (2020) studied empirical scaling laws of decoder-only Transformer language models. They focused on the standard left-to-right language modeling objective with the cross-entropy loss as the performance metric. One of the main findings is that the loss scales as a power law with three major characteristics of model training: model size, dataset size and training compute.
Another somewhat surprising finding is that model shapes such as the width or depth of the Transformer network have minimal effects on the cross-entropy loss over a wide range of scales. Subsequent works (Henighan et al., 2020; Hernandez et al., 2021) reached similar conclusions for autoregressive generative modeling and for transfer learning, respectively. This finding is also generally supported by Tay et al. (2021b), but discrepancies were found in the gap between pretraining and finetuning, highlighting the fact that observing the downstream performance of large language models is indeed important. In Tay et al. (2021b), the effect of depth was unusually pronounced for downstream performance. While previous studies have repeatedly shown the benefits of scale on language understanding tasks, for both dense and sparse Transformers, and on cross-lingual abilities, all of them used the same Transformer implementation within each study. With a plethora of improved Transformer architectures proposed in the literature, it is timely to investigate which of these improved architectures has the best scaling properties. The main goal of this paper is to systematically study how the inductive biases imposed by these Transformer variants affect scaling behavior in a shared software and hardware setting. This is in a similar spirit to Narang et al. (2021), which studies the impact of architectures on performance. Our analysis extends that of Narang et al. (2021) to the model-scale axis.
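For reference, the power laws of Kaplan et al. (2020) take the following form, where $N$ is model size (non-embedding parameters), $D$ is dataset size and $C$ is training compute, and the constants $N_c, D_c, C_c$ and exponents $\alpha_N, \alpha_D, \alpha_C$ are fit empirically (each law holds when the other two factors are not bottlenecks):

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```

On a log-log plot, each law is a straight line whose slope is the corresponding exponent, which is why scaling behaviour is typically summarized by a single fitted slope.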

3. METHODS

This section outlines our experimental setup.

3.1. MODELS

This section describes the models we evaluate in our experiments. Our models are largely implemented in a sequence-to-sequence framework (Sutskever et al., 2014) following the convention of T5 (Raffel et al., 2019). Encoder-decoder models are a natural choice for this experimentation because they can universally express both encoding and decoding tasks.

Transformer Variants We consider several standard Transformer variants.
• Transformers (Vaswani et al., 2017) - The basic vanilla Transformer architecture. Our basic setup considers the T5 style of Transformer (Raffel et al., 2019), which largely follows the vanilla Transformer except that it uses relative attention instead of sinusoidal position embeddings and pre-layer normalization, i.e., layer normalization is applied before each sublayer.
• Evolved Transformer (So et al., 2019) - A Transformer architecture learned via AutoML. The architecture comprises convolutions and attention. We scale the Evolved Transformer following the same pattern as the vanilla Transformer.
• Universal Transformers (UT) (Dehghani et al., 2018) - A Transformer that shares parameters across layers, applying the same block recurrently over depth.

Efficient Transformer Variants This class of models is mainly concerned with reducing the computational cost, memory usage, or parameter count of models.
• Performer (Choromanski et al., 2020) - A linear-time attention model using generalizable kernel attention. For simplicity, we adopt the relu kernel variant for our experiments. We scale the Performer in a similar fashion (i.e., uniform scaling) to the vanilla Transformer.
• Funnel Transformer (FT) (Dai et al., 2020) - A Transformer architecture that downsamples the input sequence across the layer stack. Our implementation uses FT only in the encoder and reverts to the vanilla Transformer in the decoder, following Narang et al. (2021).
• ALBERT (Lan et al., 2019) - A lightweight Transformer architecture that shares parameters across all layers and factorizes the embedding and output softmax layers.
For our seq2seq ALBERT, we also share the weights of the encoder and decoder.

General Improvements We consider general improvements that are not necessarily tied to Transformers. We select candidates that have been shown to do well in Narang et al. (2021).
• Mixture of Softmaxes (Yang et al., 2017) - A Transformer architecture adopting the MoS method at the softmax layer.
• Gated Linear Units with GeLU (GLU-Transformer) - Replaces the position-wise feed-forward networks in Transformers with Gated Linear Units (Dauphin et al., 2017).
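The embedding factorization that ALBERT applies can be sketched in a few lines. The vocabulary and dimension values below are illustrative, not the exact ALBERT or T5 configurations:

```python
# ALBERT factorizes the V x H embedding matrix into V x E and E x H,
# which shrinks embedding parameters whenever E << H.

def embedding_params(vocab, hidden, factored_dim=None):
    if factored_dim is None:
        return vocab * hidden  # standard: V * H
    # factorized: V * E + E * H
    return vocab * factored_dim + factored_dim * hidden

V, H, E = 32_000, 4096, 128  # illustrative sizes

standard = embedding_params(V, H)
factored = embedding_params(V, H, factored_dim=E)
print(standard, factored)  # 131072000 4620288, i.e. ~28x fewer parameters
```

Together with cross-layer (and, in our seq2seq variant, encoder-decoder) weight sharing, this is why ALBERT's parameter count grows far more slowly than its FLOPs as it is scaled up.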

Non-Transformer Architectures

We are interested in the scaling behaviour of non-Transformer architectures such as convolutions and mixer architectures.

3.2. EXPERIMENT SETUP

Our setup, along with all models, is implemented in Mesh TensorFlow (Shazeer et al., 2018), a library with a similar interface to TensorFlow that enables distributed model parallelism across multiple workers. For fair comparison, all models are pretrained for 2^19 steps on the English C4 corpus, optimized with an inverse square root learning rate schedule using Adafactor (Shazeer & Stern, 2018). All models use the same SentencePiece tokenizer (Kudo & Richardson, 2018) containing 32K subwords. This closely follows the setup in the T5 paper (Raffel et al., 2019). Finetuning is performed for 100K steps on a mixture of GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016). We evaluate both upstream (pre-training) validation perplexity and downstream transfer on NLU tasks (GLUE + SuperGLUE + SQuAD) after fine-tuning. We pretrain and finetune our models on 16 TPU-v3 chips with data parallelism. All large models use a model parallelism of 2 and XL models use a model parallelism of 8.
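A minimal sketch of the inverse square root schedule, in the constant-then-decay form used in T5-style setups. The warmup length is an assumption (T5 uses 10^4 steps; the value used here is not stated in this section):

```python
def inverse_sqrt_lr(step, warmup_steps=10_000):
    # lr = 1 / sqrt(max(step, warmup_steps)): the learning rate is held
    # constant for the first `warmup_steps` steps, then decays as step^-0.5.
    return 1.0 / max(step, warmup_steps) ** 0.5

print(inverse_sqrt_lr(100))        # 0.01 (flat during warmup)
print(inverse_sqrt_lr(1_000_000))  # 0.001 (decayed)
```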

Model Sizes

We consider several different model sizes for each architecture. For models that are straightforward to scale, we simply follow the standard convention in Raffel et al. (2019), moving from small to base, to large and XL. We include a tiny version of each model to observe how different models behave at lower compute regions. For models that were not straightforward to scale (e.g., Universal Transformers, ALBERT), we tried to scale them in a similar fashion but faced obvious limitations, such as getting ALBERT to the same number of parameters as T5 XL without incurring a huge cost in terms of FLOPs. For convolutional models, we consider d_model to be the hidden size (i.e., channel depth) of the one-dimensional convolution layers; values such as d_kv and N_H then become redundant. Details on scaling each architecture can be found in the supplementary material.
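As a sketch of how these hyperparameters determine model size, the per-layer parameter count of a standard Transformer block can be estimated as follows. This is a simplified estimate (biases, layer norms and embeddings omitted), not the exact T5 accounting:

```python
def attention_params(d_model, d_kv, n_heads):
    # Q, K, V projections: 3 * d_model * (n_heads * d_kv)
    # output projection:       (n_heads * d_kv) * d_model
    return 4 * d_model * n_heads * d_kv

def ffn_params(d_model, d_ff):
    # Two dense projections of the position-wise feed-forward block.
    return 2 * d_model * d_ff

def layer_params(d_model, d_kv, n_heads, d_ff):
    return attention_params(d_model, d_kv, n_heads) + ffn_params(d_model, d_ff)

# A T5-base-like shape: d_model=768, d_kv=64, N_H=12 heads, d_ff=3072.
print(layer_params(768, 64, 12, 3072))  # 7077888, i.e. ~7.1M per layer
```

For the convolutional models described above, the attention term is absent, which is why d_kv and N_H become redundant and only the channel depth (d_model) and d_ff drive size.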

3.3. MAIN RESULTS

We report the main results of this paper in Table 1. We report the number of trainable parameters, FLOPs (of a single forward pass) and speed (steps per second). We also report validation perplexity (on upstream pre-training) and results on 17 downstream tasks. The downstream results are reported as aggregates of GLUE, SuperGLUE and SQuAD. Since we use the same Mesh TensorFlow-based codebase as Raffel et al. (2019), we expect our experimental results to match theirs; we verify that our T5 base does achieve similar results to those reported in Raffel et al. (2019).

3.4. DO ALL MODELS SCALE THE SAME WAY?

This section investigates whether all model architectures scale in the same way.

Upstream Perplexity Figure 2 reports the scaling behaviour of all models as we increase the number of FLOPs. We observe that the scaling behaviour of each model is quite unique and distinct, i.e., most are quite different from standard Transformers. Perhaps the biggest finding here is that most models (e.g., LConv, Evolved Transformer) seem to be on par with or better than standard Transformers at lower compute but fail to scale with a higher compute budget. Another interesting trend is that "linear" Transformers such as the Performer fail to scale, as shown in Figure 2i: the pre-training perplexity metric only improves by 2.7% going from base to large scale, compared to 8.4% for the vanilla Transformer. The overall finding that most models have distinct scaling curves compared to Transformers is also evident in downstream tasks. It is also noteworthy that most models have different upstream and downstream scaling curves. Some models, such as the Funnel Transformer and LConvs, seem to hold up well upstream but suffer substantially downstream. As for the Performer, the disparity seems to be even greater downstream than upstream. Notably, the SuperGLUE downstream tasks generally require pseudo cross-attention on the encoder, which models such as convolutions are not equipped to handle (Tay et al., 2021a). To this end, we find that certain models may have difficulty learning the downstream tasks despite good upstream performance.

3.5. ARE THE BEST MODELS AT EACH SCALE DIFFERENT?

Figure 1 shows the Pareto-frontier when plotting compute against upstream and downstream performance. Since the colors of the plot represent different models, we can observe that the best model for each scale and compute region can differ. This is also observable in Figure 3. For example, the Evolved Transformer seems to do well against the standard Transformer in the tiny to small region (downstream), but this quickly changes when scaling the model up. We also observe this with the MoS-Transformer, which clearly outperforms the vanilla Transformer in some regions but not in others.

3.6. SCALING COEFFICIENTS OF DIFFERENT MODELS

We fit a linear line to each model's performance against each compute metric (FLOPs, speed, and number of parameters) and report its slope α. In general, the values of α depict how well a model scales; for example, α_F,U is the slope obtained when plotting FLOPs against upstream performance. The only exception is α_U,D, which is a measure of upstream vs. downstream performance: a high α_U,D value means that transfer to the downstream tasks improves as the model scales. Overall, the α value is a metric that represents how well a model performs relatively across all scales.

Analysis of Slope for each Model In general, we find that the vanilla Transformer has the highest values of α. Models such as the Evolved Transformer, GLU-Transformer, MoS-Transformer and Funnel Transformer tend to have scaling properties similar to the vanilla Transformer. The GLU-Transformer has similar, and slightly worse, scaling properties than the vanilla Transformer, even though it was observed to do better in an absolute sense in some compute regions. On the other hand, we also observe models that are difficult to scale, such as LConv, UT, MLP-Mixer and the Performer. This is even more evident on downstream tasks. We also note that ALBERT trends negatively (gets worse) as we scale the model up. Meanwhile, the metric α_U,D measures how downstream performance scales with upstream performance. Overall, the Switch Transformer does best on this metric: its downstream performance scales well with its upstream performance.
Generally, models that make fewer changes to the original Transformer architecture (GLU-Transformer, MoS-Transformer) tend to retain similar scaling behaviours, while changing the inductive bias significantly alters the scaling properties of the model.
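The slopes α discussed above can be estimated with an ordinary least-squares fit in log-log space. The sketch below uses synthetic (FLOPs, performance) points that follow an exact power law, not our actual measurements, and is not necessarily the exact fitting procedure used for the reported tables:

```python
import math

def fit_alpha(xs, ys):
    """Slope of a least-squares linear fit of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var

# Synthetic data following performance = 0.5 * FLOPs^0.3 exactly:
flops = [1e12, 4e12, 1.6e13, 6.4e13]
perf = [0.5 * f ** 0.3 for f in flops]
print(round(fit_alpha(flops, perf), 3))  # 0.3
```

A larger fitted slope means performance improves faster as compute grows, which is the sense in which the vanilla Transformer "scales best" even where its absolute numbers are not the highest.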

3.7. DO SCALING PROTOCOLS INFLUENCE MODEL ARCHITECTURES IN THE SAME WAY?

We are interested in how different scaling protocols influence different model architectures. Figure 4 shows the effect of scaling depth for four model architectures (MoS-Transformer, Transformer, Evolved Transformer and LConv). Figure 5 shows the effect of scaling width for the same four architectures. Firstly, on the upstream (negative log-perplexity) curves, we note that while different architectures show distinct differences in absolute performance, the scaling trend remains quite similar. On downstream tasks, depth scaling (Figure 4) seems to act equally on most architectures, with the exception of LConv. Meanwhile, under width scaling, the Evolved Transformer seems to scale slightly better than the others. It is also interesting to note that depth scaling has a much more substantial impact on downstream performance than width scaling.

3.8. EPILOGUE AND CONCLUSION

In this paper, we conducted extensive experiments, pretraining and finetuning up to 100 models spanning ten well-established Transformer and non-Transformer architectures. We showed that different model architectures can have different scaling behaviours, and that models performing well in one compute region (or at one model size) may not do equally well in another compute region. We also showed that model architectures may do well on upstream perplexity but fail to transfer to downstream tasks. Hence, practitioners should seek architectures that not only scale well with respect to upstream perplexity but also with respect to downstream performance. While we certainly do not expect researchers to always report model performance across all scales (especially large scale), we believe it is good to keep in mind that architectures can perform quite differently at different compute regions. Hence, this might be a good dimension to consider when designing new inductive biases, and performing evaluation at a single compute region may be insufficient to capture the full picture. It is also worth considering whether different inductive biases will result in different extents of emergent capabilities (Wei et al., 2022; Abnar et al., 2020). We also showed that different model architectures may react differently to different scaling protocols, which further expands on the narrative that comparing and benchmarking these models can be very challenging (Dehghani et al., 2021b). When it comes to scaling large models, this paper shows that novel inductive biases can indeed be quite risky, which might explain why most state-of-the-art large language models (Rae et al., 2021; Chowdhery et al., 2022; Tay et al., 2022) are based on relatively vanilla architectures.
Our advice is to be cautious when staking an expensive run on a Transformer architecture that drastically modifies the attention mechanism (e.g., Mixers and Performers are generally high-risk options, as seen in our experimental results). Finally, we acknowledge that not every practitioner or researcher requires models that scale to billions of parameters. In that case, inductive biases tailored to small models or low-compute regions will be sufficient.

[Figure: SQuAD accuracy across model sizes (Tiny to XL) for ALBERT, DConv, Evolved Transformer and Funnel Transformer.]



The largest Switch Transformer was scaled in a fairly sub-optimal way, so we do not think it is representative of the full potential of the Switch family; take the last Switch data point with a pinch of salt. This version of ALBERT shares parameters across the encoder and decoder, which may partially explain why we had a hard time scaling it up.



Figure 1: An overview compute-performance (FLOPs vs. performance) plot of all the diverse models and architectures we pretrained and finetuned in this study. Colors represent different model architectures and the size of each circle represents the size of the model (parameters).

Raffel et al. (2019) studied the effect of pre-training objectives, model structures (e.g., encoder-decoder, decoder-only), pre-training dataset size and training strategy on transfer learning. They showed that downstream performance monotonically increases with model scale (from 60M to 11B parameters). While they studied several model structures, the Transformer implementation is mostly the same as the original Transformer by Vaswani et al. (2017). Conneau et al. (2020) and Goyal et al. (2021) scaled up multilingual encoder-only architectures to 11B parameters while maintaining the original Transformer implementation; they found that scaling the model improves its cross-lingual ability. Fedus et al. (2021) scaled a sparse model based on Mixture-of-Experts (MoE) up to a trillion parameters.

• Lightweight Convolutions (Wu et al., 2019) - Lightweight depthwise convolutions that have shown promise over Transformer architectures.
• Dynamic Convolutions (Wu et al., 2019) - An extension of Lightweight Convolutions that creates time-dependent kernels.
• MLP-Mixers (Tolstikhin et al., 2021) - Mixers are recently proposed architectures that learn a lightweight mixing of tokens. Since Mixers have not been used in autoregressive decoding, we only use token mixers on the input encoder.

Figure 2: Upstream Negative Log-Perplexity of vanilla Transformer compared to other models.

Figure 3: Downstream accuracy of vanilla Transformer compared to other models.

Figure 4: Scaling depth

Figure 5: Scaling width of FFN

Figure 9: Quality-cost trade-off for the downstream SQuAD accuracy of the vanilla Transformer compared to other models, with respect to FLOPs, number of parameters, and throughput.



Table 1: Results on pre-training and finetuning ten different model architectures. Full results (further varying the hyperparameters of these models) can be found in the Appendix.

Slope of a fitted linear line for each model, when we compare FLOPs vs. upstream performance (F, U), FLOPs vs. downstream performance (F, D), parameter size vs. upstream performance (P, U), parameter size vs. downstream performance (P, D), and finally upstream performance vs. downstream performance (U, D).

4.1. SCALING DETAILS FOR INDIVIDUAL MODELS

For most models, it was reasonable to follow the uniform scaling method of the main T5 sizes. At each size, the hyperparameters are as follows:

Scaling for Universal Transformers Scaling UTs is generally difficult, as described in the main text. There were two main considerations for scaling UTs. Initially, we tried scaling the number of recurrent steps. However, we found that even with an increase in FLOPs, this does not lead to improved performance. Moreover, the UT model is fairly slow, and a model with the same hyperparameters as the vanilla XL might be infeasible to run. Hence, we explored increasing the width of the MLPs to 32K to see if UTs would scale in this manner.

Quality-cost trade-off for the downstream GLUE accuracy of the vanilla Transformer compared to other models, with respect to FLOPs, number of parameters, and throughput.
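The first consideration above can be illustrated with a sketch: because a UT reuses a single shared block, adding recurrent steps grows FLOPs without adding any parameters. The block size and step counts below are hypothetical, not the configurations used in our runs:

```python
def ut_params(params_per_block):
    # Universal Transformer: one shared block regardless of the number
    # of recurrent steps, so parameter count does not grow with "depth".
    return params_per_block

def ut_flops_per_token(params_per_block, recurrent_steps):
    # ~2 FLOPs per weight per token, repeated on each recurrent step.
    return 2 * params_per_block * recurrent_steps

block = 7_000_000  # hypothetical parameters in one shared block
for steps in (6, 12, 24):
    print(steps, ut_params(block), ut_flops_per_token(block, steps))
# Parameters stay fixed while FLOPs grow linearly with the step count,
# so scaling recurrent steps buys compute without buying capacity.
```

This is one concrete way parameter sharing "warps" the relationship between FLOPs, speed and parameter count discussed earlier, and part of why uniform T5-style scaling does not transfer cleanly to UTs.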

