EFFICIENT LARGE-SCALE TRANSFORMER TRAINING VIA RANDOM AND LAYERWISE TOKEN DROPPING

Abstract

Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., computer vision (CV) and natural language processing (NLP). However, these large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (random-LTD), which skips the computation for a subset of the input tokens at all middle layers. In particular, random-LTD achieves considerable speedups with accuracy comparable to the standard training baseline. Compared to other token dropping methods, random-LTD requires neither (1) any importance-score-based metric, nor (2) any special token treatment (e.g., [CLS]), nor (3) full-sequence-length training for many layers beyond the first and the last. In addition, we propose a new LayerToken learning rate schedule for pretraining problems, which resolves the heavy tuning requirement of our proposed training mechanism. Finally, we demonstrate that random-LTD can be applied to broader applications, including GPT and BERT pretraining as well as ViT and GPT finetuning tasks. Our results show that random-LTD can save about 33.3% of the theoretical compute cost and 25.6% of the wall-clock training time while achieving zero-shot evaluation results on GPT-3 1.3B similar to the baseline.

1. INTRODUCTION

Large-scale transformers have demonstrated superior performance on natural language processing (Tenney et al., 2019; Radford et al., 2019; Raffel et al., 2019), computer vision (Dosovitskiy et al., 2020), and other applications (Gong et al., 2021; Guo et al., 2021). However, both the pretraining procedure and some downstream finetuning tasks (e.g., long-document summarization) are time-consuming and resource-hungry. Thus, there is a need to speed up training and reduce the compute cost of large-scale transformer pretraining and finetuning. Recently, Hou et al. (2022) adapted the token pruning/dropping/bypassing technique (Kim et al., 2021; Goyal et al., 2020; Kim & Cho, 2020) from BERT inference to BERT pretraining by skipping the computation for a subset of the input tokens at some middle layers. The results of Hou et al. (2022) (referred to as TokenBypass) show that it can theoretically reduce the pretraining cost by 25% for both BERT-base and BERT-large without losing accuracy on finetuning tasks. Although it achieves a great speedup, TokenBypass (1) needs an importance-score metric to determine the dropped tokens and special token treatment to keep important tokens (e.g., [CLS]), both of which require manual design; (2) has to keep the first half of the layers and the last layer (in total, half of the depth) in full-sequence-length training, which limits its layer-bypassing ability; and (3) solely focuses on the BERT masked-LM pretraining task and has not been applied to other tasks, e.g., causal-LM. In this work, we address those challenges and introduce our random and layerwise token-dropping method (random-LTD). In summary, our contributions are as follows:

• All tokens are treated equally without any special token treatment or importance-score measurement, i.e., no manual design, and tokens are dropped in a purely random manner. Meanwhile, instead of fully bypassing the dropped tokens for all middle layers (Hou et al., 2022), each layer in random-LTD drops tokens independently of the other layers. This helps the multi-head attention in the middle layers capture the dependency relations across different tokens, as suggested in (Vig & Belinkov, 2019).

• random-LTD applies token dropping at all middle layers except the very first and last layers, which further reduces manual design and increases training efficiency.

• We also propose a new monotonic sequence length growth method as training evolves to (1) reduce the gradient noise introduced
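To make the mechanism concrete, the following is a minimal, framework-agnostic sketch of the forward pass described above: the first and last layers process the full sequence, while each middle layer independently draws its own random subset of tokens, and the dropped tokens bypass that layer unchanged. The function name `random_ltd_forward`, the `keep_ratio` parameter, and the representation of layers as plain callables are illustrative assumptions, not the authors' implementation.

```python
import random

def random_ltd_forward(hidden_states, layers, keep_ratio=0.5):
    """Illustrative sketch of random-LTD (not the authors' code).

    hidden_states: list of per-token representations.
    layers: callables mapping a list of token vectors to a new list
            of the same length; layers[0] and layers[-1] always see
            the full sequence.
    keep_ratio: fraction of tokens each middle layer processes
                (a hypothetical knob for this sketch).
    """
    seq_len = len(hidden_states)
    # First layer: full sequence length.
    h = layers[0](hidden_states)
    # Middle layers: each layer draws its random subset independently,
    # with no importance score and no special tokens.
    for layer in layers[1:-1]:
        k = max(1, int(seq_len * keep_ratio))
        kept = sorted(random.sample(range(seq_len), k))
        subset = [h[i] for i in kept]
        out = layer(subset)
        # Kept tokens are updated; dropped tokens bypass this layer.
        for j, i in enumerate(kept):
            h[i] = out[j]
    # Last layer: full sequence length again.
    return layers[-1](h)
```

With `keep_ratio=0.5`, each middle layer does roughly half the per-layer token computation, which is the source of the compute savings; because the subset is resampled per layer, every token still participates in attention at some middle layers on average.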

