EFFICIENT SEQUENCE PACKING WITHOUT CROSS-CONTAMINATION: ACCELERATING LARGE LANGUAGE MODELS WITHOUT IMPACTING PERFORMANCE

Abstract

Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding tokens, so that all sequences in a batch have the same length. We show in this paper that the variation in sequence lengths in common NLP datasets is such that up to 50% of all tokens can be padding. In less common, but not extreme, cases (e.g. GLUE-cola with sequence length 128), the ratio is up to 89%. Existing methods to address the resulting inefficiency are complicated by the need to avoid 'cross-contamination' in self-attention, by a reduction in accuracy when sequence ordering information is lost, or by customized kernel implementations only valid for specific accelerators. This paper introduces a new formalization of sequence packing in the context of the well-studied bin packing problem, and presents new algorithms based on this formulation which, for example, confer a 2x speedup for phase 2 pre-training in BERT. We show how existing models can be adapted to ensure mathematical equivalence between the original and packed models, meaning that packed models can be trained with existing pre-training and fine-tuning practices.

1. INTRODUCTION

Many language datasets, including Wikipedia (the de-facto pre-training dataset for BERT), have a skewed distribution of sequence lengths (see Figure 1). However, typical machine learning accelerators, and their corresponding libraries, exhibit poor performance when processing variable-length workloads. A simple mitigation is to set a maximum sequence length, and to pad shorter sequences with padding tokens. This naive batching is widely used and provided in the vanilla BERT implementation as well as the Hugging Face framework (32). Its effect is enhanced by the offline dataset generation process which, in BERT, attempts to "pack" together sentences so as to fill the sequence length as completely as possible (8). We improve this process at a whole-dataset level. We show that, even after this pre-processing, padding tokens represent 50% of all tokens of the Wikipedia pre-training dataset at sequence length 512. Thus, by avoiding the processing of padding tokens, one can obtain a 2x speed-up for phase 2. Overall, the lengths range from 5 tokens up to 512, and samples of length 512 represent only 23.5% of the dataset.

Beyond simple batching, other solutions have been proposed in the literature and in open-source software implementations. When processing sequences, most libraries and algorithms use "packing" to refer to concatenating sentences from the same document (BERT) or from different documents (BERT, T5 (24), GPT-3 (4), and RoBERTa (16)) as they arrive (GREEDY) from the source dataset to generate the training dataset. None of the respective papers addresses the packing efficiency, i.e., the remaining fraction of padding. To "separate" sequences from different documents, a separator token is introduced. However, this is not sufficient and can have a significant impact on performance. This is discussed only in the RoBERTa paper, which shows that downstream F1 scores are consistently reduced, on average by 0.35%.
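The padding overhead quoted above can be computed directly from a sample of sequence lengths: under naive batching, every sequence is padded to the maximum length, so the padding fraction is one minus the ratio of real tokens to allocated tokens. A minimal sketch (the length distribution here is an illustrative toy, not the actual Wikipedia data):

```python
def padding_fraction(lengths, max_len):
    """Fraction of tokens that are padding under naive fixed-length batching."""
    clipped = [min(n, max_len) for n in lengths]  # sequences are truncated at max_len
    total_tokens = max_len * len(clipped)         # tokens allocated by the batch
    real_tokens = sum(clipped)                    # tokens carrying actual content
    return (total_tokens - real_tokens) / total_tokens

# Toy skewed distribution: many short sequences, some full-length ones.
lengths = [5] * 30 + [80] * 40 + [512] * 30
print(round(padding_fraction(lengths, 512), 3))  # → 0.635
```

With a skewed distribution like this, well over half of the compute in a naive batch is spent on padding, which is the inefficiency that packing removes.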
Alternative common approaches to overcoming the large amount of padding in many datasets are "un-padding", as in Effective Transformer (5), and sorted batching (SORT), as in Faster Transformer (21), lingvo (28), fairseq (22), and RoBERTa. However, to run efficiently on arbitrary accelerators, these approaches require substantial hardware-specific low-level code optimizations that are only available on GPUs. Further details are in Sections C (1) and 4.4.

Beyond language models, packing is also present in other areas of machine learning, albeit with little to no exploration in the literature and mostly hidden in libraries without further discussion. For example, PyG (PyTorch Geometric) combines multiple small graphs into a batch to account for the large variation in graph size and to optimize hardware usage when training a Graph Neural Network (GNN). Another example is the RNN implementation in PyTorch, which introduces a "PackedSequence" object and states that "All RNN modules accept packed sequences as inputs", but does not address how sequences are packed efficiently, nor how the processing of packed sequences is implemented efficiently while avoiding interaction between sequences. Even though we focus on BERT (6) and other transformers in this paper, the general principles can be transferred to many more machine learning algorithms with differently sized data samples.

In this paper, we formally frame the packing problem in transformer-based models and provide solutions, showing that sequences can be packed efficiently, that separator tokens are not required, and that cross-contamination can be avoided with little overhead.

In summary, the contributions of the paper are as follows. In Section 2, we produce histograms of a variety of datasets showing the high percentage of padding tokens. In Section 3.1, we present two new deterministic and efficient packing algorithms based on established solvers, which pack datasets with millions of sequences in a matter of seconds (or less).
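To make the bin-packing framing concrete, consider the classic first-fit-decreasing heuristic: sort sequences by length, then place each into the first "pack" with enough remaining room. This is a generic sketch of the underlying problem only, not the specific algorithms presented in Section 3.1:

```python
def pack_sequences(lengths, max_len):
    """First-fit-decreasing bin packing of sequence lengths.

    Returns a list of packs; each pack is a list of original sequence
    indices whose lengths sum to at most max_len.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs = []  # list of index lists, one per pack
    space = []  # remaining free tokens in each pack
    for i in order:
        for p, free in enumerate(space):
            if lengths[i] <= free:      # first pack with room wins
                packs[p].append(i)
                space[p] -= lengths[i]
                break
        else:                           # no existing pack fits: open a new one
            packs.append([i])
            space.append(max_len - lengths[i])
    return packs

# Five sequences packed into bins of size 512.
packs = pack_sequences([512, 300, 200, 100, 12], 512)
```

Here the five sequences fit into three packs instead of five fixed-length rows, illustrating how packing directly reduces the number of processed rows and hence the padding.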
In Section 3.2 and Section 3.3, we describe 'cross-contamination' (the cause of the accuracy reduction which separator tokens do not mitigate) and show how the BERT model can be adjusted to exhibit the same convergence behavior on packed and unpacked sequences. We empirically show that the proposed packing algorithms produce a nearly optimal packing scheme for the Wikipedia pre-training dataset (Section 4.1), with further results in the Appendix. In Section 4.2, we demonstrate that the convergence of the BERT large model on the packed dataset is equivalent to that on the un-packed dataset, with a 2x throughput increase on the Wikipedia sequence length 512 pre-training dataset. Further experiments underline the necessity and efficiency of our changes.

BERT is pre-trained using masked-language modelling and next-sentence prediction on a large corpus of Wikipedia articles. Each sequence is composed of one <CLS> token, followed by the first "segment" of sentences, followed by a <SEP> token, and then finally the second "segment" of sentences. Because these "segments" are created in sentence-level increments, there is no token-level control of sequence length. Furthermore, 10% (default value, (7)) of sequences are intentionally
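A standard way to prevent cross-contamination between packed sequences is a block-diagonal attention mask, so that each token attends only to tokens from its own original sequence. The following NumPy sketch illustrates the idea; the concrete model adjustments used in this paper are those described in Section 3.2:

```python
import numpy as np

def packed_attention_mask(seq_lengths, max_len):
    """Block-diagonal self-attention mask for one packed row.

    seq_lengths: lengths of the sequences packed into this row.
    Returns a (max_len, max_len) 0/1 matrix where entry (i, j) is 1
    iff tokens i and j belong to the same packed sequence.
    """
    ids = np.zeros(max_len, dtype=np.int64)  # per-position sequence id; 0 = padding
    pos = 0
    for sid, n in enumerate(seq_lengths, start=1):
        ids[pos:pos + n] = sid
        pos += n
    # Same-id pairs may attend to each other; padding (id 0) attends to nothing.
    mask = (ids[:, None] == ids[None, :]) & (ids[:, None] != 0)
    return mask.astype(np.int64)

# Two sequences of lengths 3 and 2 packed into a row of length 6.
m = packed_attention_mask([3, 2], 6)
```

Adding this mask (as large negative values on the zero entries) before the softmax keeps the self-attention of a packed row mathematically equivalent to running the constituent sequences separately, without any separator tokens.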

