CRAMMING: TRAINING A LANGUAGE MODEL ON A SINGLE GPU IN ONE DAY

Anonymous

Abstract

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.

1. SCALING UP AND SCALING DOWN

Large-scale training of machine learning models with transformer architectures has led to groundbreaking improvements in many sub-fields of natural language processing, including language understanding and natural language generation (Vaswani et al., 2017; Dosovitskiy et al., 2021; Radford et al., 2019). The now widely accepted (but historically surprising) key behavior of these systems is that they reliably scale: they continuously improve in performance as the number of model parameters and the amount of data grow. These increases in performance are well-described by various power laws, as studied by Kaplan et al. (2020). This sets up a dominant paradigm in which scaling is the key to performance improvement (Sutton, 2019).

The power of scale has set off a race to produce extremely large models, which in turn has created an environment where few researchers or practitioners feel that they are capable of training a language model. The original BERT model of Devlin et al. (2019), which became a cornerstone transformer for many practical applications in natural language understanding, already required a significant amount of computation to train. Yet, the reproduction and improvements of Liu et al. (2019) further increased its performance by cranking up the level of computation by orders of magnitude. As these pretrained checkpoints became popular for a range of downstream applications (Wolf et al., 2020), the competition for the largest language model became a focal point for industrial labs. This led to training runs that improved the performance of pretrained language models at the expense of computation at the zettaFLOP scale (Raffel et al., 2020; Yang et al., 2020; Zaheer et al., 2021) and later at the extremely large yottaFLOP scale (Brown et al., 2020; Black et al., 2022; Chowdhery et al., 2022; Rae et al., 2022).

Our goal is to turn this trend on its head and investigate how to best scale down language model training, and what trade-offs emerge when doing so: What downstream performance can be achieved by a modest researcher when training from scratch with a single GPU for a single day?

The ability to train a language model to the performance level of BERT with such modest resources has several interesting implications. For one, if scaled-down model pretraining is a viable analogue of large-compute pretraining, then this opens up a host of further academic investigations that

