EFFICIENT SEQUENCE PACKING WITHOUT CROSS-CONTAMINATION: ACCELERATING LARGE LANGUAGE MODELS WITHOUT IMPACTING PERFORMANCE

Abstract

Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding tokens, so that all sequences in a batch have the same length. We show in this paper that the variation in sequence lengths in common NLP datasets is such that up to 50% of all tokens can be padding. In less common, but not extreme, cases (e.g. GLUE-CoLA with sequence length 128), the ratio is up to 89%. Existing methods to address the resulting inefficiency are complicated by the need to avoid 'cross-contamination' in self-attention, by a reduction in accuracy when sequence ordering information is lost, or by customized kernel implementations only valid for specific accelerators. This paper introduces a new formalization of sequence packing in the context of the well-studied bin packing problem, and presents new algorithms based on this formulation which, for example, confer a 2x speedup for phase 2 pre-training in BERT. We show how existing models can be adapted to ensure mathematical equivalence between the original and packed models, meaning that packed models can be trained with existing pre-training and fine-tuning practices.

Beyond language models, packing has also been present in other areas of machine learning, however with little to no exploration in the literature, mostly hidden in libraries without further discussion. For example, PyG (PyTorch Geometric) combines multiple small graphs in a batch to account for the large variation in size and to optimize hardware usage when training a Graph Neural Network (GNN).
Another example is the RNN implementation in PyTorch, which introduces a "PackedSequence" object and states that "All RNN modules accept packed sequences as inputs", but does not address how sequences are packed efficiently or how the processing of packed sequences is implemented efficiently while avoiding interaction between sequences. Even though we focus on BERT (6) and other transformers in this paper, the general principles can be transferred to many more machine learning algorithms with differently sized data samples. In this paper, we formally frame the packing problem in transformer-based models and provide solutions, showing that sequences can be packed efficiently, separator tokens are not required, and cross-contamination can be avoided with little overhead. In summary, the contributions of the paper are as follows. In Section 2, we produce histograms of a variety of datasets showing the high percentage of padding tokens. In Section 3.1, we present two new deterministic and efficient packing algorithms based on established solvers which efficiently pack datasets with millions of sequences in a matter of seconds (or less). In Section 3.2 and Section 3.3, we describe 'cross-contamination', the cause of the accuracy reduction which separator tokens do not mitigate, and show how the BERT model can be adjusted to show the same convergence behavior on packed and unpacked sequences. We empirically show that the proposed packing algorithms produce a nearly-optimal packing scheme for the Wikipedia pre-training dataset (Section 4.1) and more in the Appendix. In Section 4.2, we demonstrate that the convergence of the BERT large model on the packed dataset is equivalent to that on the un-packed dataset, with a 2x throughput increase on the Wikipedia sequence length 512 pre-training dataset. Further experiments underline the necessity and efficiency of our changes.

1. INTRODUCTION

Many language datasets, including the de-facto pre-training dataset for BERT, Wikipedia, have a skewed distribution of sequence lengths (see Figure 1). However, typical machine learning accelerators, and their corresponding libraries, exhibit poor performance when processing variable-length workloads. A simple mitigation is to set a maximum sequence length, and to pad shorter sequences with padding tokens. This naive batching is widely used and provided in the vanilla BERT implementation as well as the Hugging Face framework (32). Its effect is enhanced by the offline dataset generation process which, in BERT, attempts to "pack" together sentences so as to fill the sequence length as completely as possible (8). We improve this process at a whole-dataset level. We show that, even after this pre-processing, padding tokens represent 50% of all tokens of the Wikipedia pre-training dataset at sequence length 512. Thus, by avoiding the processing of the padding tokens one can get a 2x speed-up for phase 2. Overall, the lengths range from 5 tokens up to 512, and samples of length 512 represent only 23.5% of the dataset. Beyond simple batching, other solutions have been addressed in the literature and in open-source software implementations. When processing sequences, most libraries and algorithms mention packing as a reference to concatenating sentences from the same document (BERT) or from different documents (BERT, T5 (24), GPT-3 (4), and RoBERTa (16)) as they arrive (GREEDY) from the source dataset to generate the training dataset. None of the respective papers addresses the packing efficiency, i.e., the remaining fraction of padding. To "separate" sequences from different documents, a separator token is introduced. However, this is not sufficient and can have a significant impact on performance. This is discussed only in the RoBERTa paper, which shows that downstream F1 scores are consistently reduced on average by 0.35%.
Alternative common approaches to overcome the large amount of padding in many datasets are "un-padding" as in Effective Transformer (5) and sorted batching (SORT) as in Faster Transformer (21), lingvo (28), fairseq (22), and RoBERTa. However, for running efficiently on arbitrary accelerators, these approaches require substantial hardware-specific low-level code optimizations that are only available on GPUs (see Sections C(1) and 4.4).

[Figure 1: Sequence length distributions for different datasets. The three graphics at the top left show Wikipedia BERT pre-training dataset sequence length histograms (token count excluding padding) for different maximum sequence lengths, based on the Wikipedia article dump from October 1st, 2020. The theoretical speed-up relates to not using any padding tokens and not having any overhead from processing the different lengths. Top right: GLUE datasets. Bottom, from left to right: SQuAD 1.1, LibriSpeech text labels, LibriSpeech audio token sequences, and QM9 molecules as graphs in a sequence.]

BERT is pre-trained using masked-language modelling and next-sentence prediction on a large corpus of Wikipedia articles. Each sequence is composed of one <CLS> token followed by the first "segment" of sentences, followed by a <SEP> token, and then finally the second "segment" of sentences. Because these "segments" are created in sentence-level increments, there is no token-level control of sequence length. Furthermore, 10% (the default value (7)) of sequences are intentionally cut short. This leads to significant levels of padding, especially for longer maximum sequence lengths (see Figure 1 and Section J(1)). At sequence length 128 (commonly used in phase 1 of pre-training) the theoretical speed-up is around 1.2, at sequence length 384 this increases to 1.7, and finally at sequence length 512 (commonly used for phase 2 of pre-training) it is 2.0. Despite the widespread use of the Wikipedia dataset for pre-training BERT, such histograms have, to the best of our knowledge, not been published previously. This has perhaps led to an underestimation of the speed-up opportunity available.
To put things into perspective, the sequence length 512 dataset contains 8.33 billion tokens, of which 4.17 billion are padding tokens. Note that the skewed sequence length distributions are not limited to Wikipedia, as shown with GLUE (30; 31) in Section L(1) and SQuAD 1.1 (25) in Section K(1) (2.2x speed-up); nor to BERT training, as shown with the LibriSpeech text distributions (23) in Section M(1); nor to text itself, given the LibriSpeech audio data distributions and the QM9 molecular data (27; 26) (1.6x speed-up, Section Q(1)). All distributions can be found in Figure 1. Since LibriSpeech audio data is skewed towards longer sequences, only a 1.3x speed-up could be achieved despite the theoretical maximum of 1.6x. For all other cases, the algorithms presented in Section 3.1 lead to close-to-optimal packing.
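The padding statistics quoted above follow directly from a dataset's length histogram. A minimal sketch (the function name and toy lengths are our own, not the paper's code):

```python
import numpy as np

def padding_stats(lengths, max_len):
    """Padding fraction and theoretical speed-up for naive batching,
    which pads every sequence to max_len. The theoretical speed-up
    assumes padding tokens could be skipped entirely."""
    lengths = np.asarray(lengths)
    real_tokens = lengths.sum()
    padded_tokens = lengths.size * max_len
    padding_fraction = 1.0 - real_tokens / padded_tokens
    theoretical_speedup = padded_tokens / real_tokens
    return padding_fraction, theoretical_speedup

# Toy example: half the sequences are full length, half are short.
frac, speedup = padding_stats([512, 512, 100, 100], max_len=512)
```

With a length distribution like Wikipedia's at sequence length 512, where about half of all tokens are padding, the same computation yields the 2x theoretical speed-up cited above.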

3. METHODS

Our approach consists of three distinct components. Firstly, we pack the n data samples efficiently during pre-processing to make full use of the maximum sequence length, s_m (Sections 3.1 and F). Secondly, we introduce a series of model changes in Section 3.2 that preserve the equivalence with the original BERT implementation. The changes include a self-attention mask to prevent the model from attending between different sequences in the same pack (Section 3.2.2), and an adjustment of the positional embeddings (Section 3.2.1) to handle packs of sequences. Other components of the model, such as the feed-forward layer (29), operate on a per-token basis and do not require modification for pre-training. In Section 3.2.3, we also demonstrate how to compute a per-sequence loss and accuracy for NSP and downstream fine-tuning tasks. Thirdly, we provide suggestions for hyperparameter adjustment (Section 3.3) that lead to analogous convergence behavior between the packed and un-packed BERT implementations. Additional videos and animations are provided as supplemental material.

3.1. PACKING ALGORITHMS

The widely studied and well-established bin packing problem deals with the assignment of items into bins of a fixed capacity such that the number of utilized bins is minimized; it has been studied for decades. Since finding an exact solution is strongly NP-complete (14), numerous approximate solutions have been proposed (12; 15; 13; 35). Since most existing approximations have a high complexity of at least O(n log n), we propose two new heuristic offline algorithms that are tailored to the NLP setting and applied to the whole dataset. For a detailed introduction to packing, see Section F.

3.1.1. SHORTEST-PACK-FIRST HISTOGRAM-PACKING (SPFHP)

Shortest-pack-first histogram-packing (SPFHP) works on the bins of the sequence length histogram (with bin size 1) rather than on the individual samples. The histogram is traversed in sorted order from longest to shortest sequences. Then, to pack the data during the traversal, we apply the worst-fit algorithm (12; 35) such that the histogram bin being processed goes to the "pack" that has the most space remaining ("shortest-pack-first"). If the histogram bin does not fit completely, a new pack is created. We also limit the packing depth, in other words, the maximum number of sequences that are allowed in a pack. Therefore, an existing pack is only extended if it is not already at maximum packing depth. The detailed code for the algorithm is provided in Listing 3. The time and space complexity of the algorithm are O(n + s_m^2) and O(s_m^2), respectively (Section G.2(1)).
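The worst-fit idea can be condensed into a few lines. The sketch below (our own simplification, not the paper's Listing 3) packs individual sequences rather than whole histogram bins, using a heap keyed on remaining space:

```python
import heapq

def spfhp(lengths, max_len, max_depth=3):
    """Worst-fit packing: each sequence goes into the open pack with the
    most remaining space ("shortest-pack-first"); if it does not fit, a
    new pack is opened. Packs that reach max_depth are closed."""
    heap = []    # entries: (-space_remaining, tiebreak, pack)
    packs = []
    counter = 0  # tiebreak so pack lists are never compared directly
    for length in sorted(lengths, reverse=True):
        if heap and -heap[0][0] >= length:
            neg_space, _, pack = heapq.heappop(heap)
            pack.append(length)
            if len(pack) < max_depth:  # keep the pack open
                counter += 1
                heapq.heappush(heap, (neg_space + length, counter, pack))
        else:
            pack = [length]
            packs.append(pack)
            if max_depth > 1:
                counter += 1
                heapq.heappush(heap, (-(max_len - length), counter, pack))
    return packs

packs = spfhp([512, 256, 256, 100], max_len=512)
```

Operating on histogram counts per length instead of individual samples, as the paper's version does, is what makes the runtime essentially independent of the number of samples.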

3.1.2. NON-NEGATIVE LEAST SQUARES HISTOGRAM-PACKING (NNLSHP)

The proposed NNLSHP algorithm is based on re-stating the packing problem as a (weighted) non-negative least squares (NNLS) problem (3) of the form wAx = wb with x ≥ 0. The vector b is the histogram containing the counts of all the sequence lengths in the dataset. Next, we define the A matrix (the "packing matrix") by first generating a list of all possible sequence length combinations ("strategies") that add up exactly to the maximum sequence length. We focus specifically on strategies that consist of at most 3 sequences per pack (independent of b) and encode each strategy as a column of the sparse matrix A. For example, a strategy consisting of the sequence lengths 128, 128, and 256 is represented as a column vector that has the value 2 at the 128th row, the value 1 at the 256th row, and zero at all other rows. The variable x describes the non-negative repetition count for each strategy. So a 24 in the ith row of x means that the strategy represented by the ith column of A should repeat 24 times. Moreover, in the un-weighted setting, Ax = b states that we would like to "mix" the pre-defined strategies (columns of A) such that the number of samples matches the histogram b, where each strategy is used x ≥ 0 times. We use the residual weight w to control the penalization of the Ax - b residual on different sequence lengths (different rows of b). Heuristically, we set the weight to 0.09 for all sequences of length 8 or smaller, because they are considered acceptable padding sequences, while all other sequence lengths get weight 1. We discuss this heuristic choice of parameters in Sections F.4.5 and F.5(1). The overall efficiency of the packing is not greatly influenced by the weighting (less than 1% extra speed-up). After solving wAx = wb for x ≥ 0 using an off-the-shelf solver, we obtain a floating point solution, which means that the repetition counts are not necessarily integers.
Since we cannot use a non-natural number of strategies, we round the solution x to the nearest integer. The error introduced by this rounding is found to be negligible (a few hundred sequences in the worst case) compared to the size of the dataset (millions of sequences). The time and space complexity of the algorithm are O(n + s_m^5) and O(s_m^3), respectively. Further details are provided in Section F.4.
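The construction of the packing matrix A can be illustrated in a few lines. The sketch below (our own toy code, with an assumed small maximum length of 8) only builds A and the strategy list, leaving the weighted NNLS solve to an off-the-shelf routine such as scipy.optimize.nnls:

```python
import itertools
import numpy as np

def packing_matrix(max_len, max_depth=3):
    """Enumerate all "strategies" (multisets of sequence lengths summing
    exactly to max_len, with at most max_depth sequences) and encode
    each strategy as a column of the packing matrix A."""
    strategies = []
    for depth in range(1, max_depth + 1):
        for combo in itertools.combinations_with_replacement(
                range(1, max_len + 1), depth):
            if sum(combo) == max_len:
                strategies.append(combo)
    A = np.zeros((max_len, len(strategies)), dtype=int)
    for j, combo in enumerate(strategies):
        for length in combo:
            A[length - 1, j] += 1  # row i counts sequences of length i+1
    return A, strategies

A, strategies = packing_matrix(8, max_depth=3)
```

Solving wAx = wb with x ≥ 0 then yields the (fractional) repetition count of each strategy, which is rounded as described above.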

3.2. PACKEDBERT: MODEL CHANGES

This section describes how any vanilla BERT implementation should be modified for packed sequence processing, such that the behavior of the model is the same as when processing unpacked sequences. Preserving the mathematical equivalence is necessary to ensure existing BERT pre-training and fine-tuning practices remain valid, as well as being required by benchmarks such as MLPerf™ (17) . The presented approaches and principles apply to a variety of other models.

3.2.1. ADJUST POSITIONAL EMBEDDINGS

The BERT model uses three types of embeddings: token, segment, and positional embeddings. The latter is canonically implemented as a bias add operation, rather than a full embedding look-up. This is possible because the positional indices increase linearly for every sequence. However, when using the packed data format, the position index needs to be reset with each new packed sequence. For instance, when packing two sequences, one of length 2 and one of length 3, the positional embedding indices that need to be looked up are [0, 1, 0, 1, 2]. To achieve this, the bias add needs to be replaced by an embedding look-up to extract the correct positional embedding for each token in the pack. This also requires keeping an extra input which specifies the position of each token in its sequence. This required adjustment has only a minor impact on absolute accuracy/loss (see Sections 4.2 and 4.2.1).
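A minimal numpy sketch of this replacement (the function names and toy embedding table are our own, not the paper's code):

```python
import numpy as np

def packed_position_ids(seq_lengths):
    """Positions restart at 0 for every sequence in the pack,
    e.g. lengths (2, 3) -> [0, 1, 0, 1, 2]."""
    return np.concatenate([np.arange(n) for n in seq_lengths])

def positional_embedding(seq_lengths, table):
    """Replace the canonical bias add by a gather from the positional
    embedding table, indexed by the per-token position input."""
    return table[packed_position_ids(seq_lengths)]

# Toy table: 512 positions, hidden size 4.
table = np.arange(512 * 4, dtype=np.float32).reshape(512, 4)
emb = positional_embedding((2, 3), table)
```

In practice, the position ids would be produced once during pre-processing and fed to the model as the extra input described above.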

3.2.2. ADJUST ATTENTION MASKING

To maintain an implementation that is consistent with the un-packed version, tokens from different sequences within a pack should not be able to attend to each other. This is typically achieved in other implementations by unpacking the sequences using custom attention kernels and then doing the attention per-sequence (5). Instead, we propose directly masking the attention matrix with a block-diagonal mask before the attention softmax. This is straightforward to implement in modern frameworks (see Figure 2):

    # input
    mask = np.array([[1, 1, 1, 2, 2]])
    # 0, 1 mask
    zero_one_mask = tf.equal(mask, mask.T)
    # for use with softmax:
    softmax_mask = tf.where(zero_one_mask, 0, -1000)

For this pack of two sequences (lengths 3 and 2), the resulting block-diagonal mask is:

    1 1 1 0 0
    1 1 1 0 0
    1 1 1 0 0
    0 0 0 1 1
    0 0 0 1 1

Naturally, there is a cost to both the mask construction and applying it to the attention matrix. However, it is required to keep the accuracy (see Table 1, Section 4.1, Section 4.2). See also the code of the deprecated tensor2tensor library and our own provided code.
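A self-contained numpy equivalent of the Figure 2 snippet, built from the per-token sequence indices (the helper name is ours):

```python
import numpy as np

def packed_attention_mask(seq_ids, neg=-1000.0):
    """Block-diagonal mask: a token may only attend to tokens carrying
    the same sequence index, e.g. [1, 1, 1, 2, 2] for a pack of one
    length-3 and one length-2 sequence. The result is added to the
    attention scores before the softmax."""
    ids = np.asarray(seq_ids)
    zero_one_mask = ids[:, None] == ids[None, :]  # (tokens, tokens)
    return np.where(zero_one_mask, 0.0, neg)

mask = packed_attention_mask([1, 1, 1, 2, 2])
```

The large negative value drives the softmax output for cross-sequence positions to (effectively) zero, so no unpacking kernel is needed.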

3.2.3. ADJUST PER-SEQUENCE LOSS AND ACCURACY

Canonical implementations of BERT compute the cross-entropy loss for the masked language model on a per-token basis. However, other NLP tasks, such as SQuAD, compute the loss and accuracy on a per-sequence basis. This section discusses how to handle such tasks when training with packed sequences. Simply feeding packs of sequences to the same implementation of cross-entropy would result in a per-pack weighted loss. In other words, the overall loss on the micro-batch would sum up the losses on the individual packs, rather than the individual sequences. As a result, the model would converge to a different optimum than when running with the un-packed implementation. For instance, a pack of a single sequence would contribute to the loss with the same weight as a pack of three sequences. To recover the per-sequence averaging behavior of the canonical un-packed BERT implementation, we effectively "unpack" the incoming logits and labels. Once the sequences have been unpacked, we can compute the loss on each sequence separately as usual and then add up the losses. However, rather than looping over the sequence index, we compute on all indices in parallel (see Figure 2). This minimizes the latency overhead of un-packing the loss calculation. As an example, we show how the per-sequence loss can be implemented for the pre-training task. We use the "masked lm weight" (7) input tensor to represent which sequence a given masked token belongs to (0, 1, 2, and so on). This is consistent with the canonical BERT implementation, where this input takes a value of either 1 (belonging to the sequence) or 0 (belonging to padding). The full methodology is detailed in Listing 5 and can be applied to other classification or pre-training tasks.
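A numpy sketch of the per-sequence reduction (our own simplification of the idea behind Listing 5; the loop runs over the small, constant pack depth, not over tokens):

```python
import numpy as np

def per_sequence_loss(token_losses, seq_index, max_seqs_per_pack=3):
    """Average the token losses within each sequence of the pack, then
    sum over sequences, so that every sequence contributes equally to
    the loss regardless of its length or which pack it landed in."""
    token_losses = np.asarray(token_losses, dtype=np.float64)
    seq_index = np.asarray(seq_index)
    total = 0.0
    for s in range(max_seqs_per_pack):
        selected = seq_index == s
        if selected.any():
            total += token_losses[selected].mean()
    return total

# Pack of two sequences: tokens 0-1 belong to sequence 0,
# tokens 2-4 to sequence 1.
loss = per_sequence_loss([1.0, 3.0, 2.0, 2.0, 2.0], [0, 0, 1, 1, 1])
```

In a real framework implementation, the per-slot masks would be batched into one vectorized segment reduction rather than a Python loop.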

3.3. ADJUST HYPERPARAMETERS

In terms of convergence behavior, the primary consequence of packing is an increase in the effective batch size (with respect to the number of sequences and real tokens) with some added variation over different iterations. At the sentence level, the number of sentences in one batch increases by the packing factor; similarly, the number of tokens in one batch increases. Hence, hyperparameters that are sensitive to these numbers need to be adjusted. A direct solution is to reduce the computational batch size by the packing factor (the average number of sequences per pack) and keep all other hyperparameters the same. For example, if the packing factor is 2, cutting the gradient accumulation count in half is sufficient. The advantage of this strategy is that no fine-tuning of hyperparameters is required and performance curves are comparable. However, this approach might not be desirable, as it can imply under-utilizing the memory/compute, especially if the micro batch size needs to be reduced. Hence, to preserve batch size and optimize hardware utilization, we additionally propose an approximate heuristic for updating the decay parameters of the LAMB optimizer (34). For a packed dataset with a packing factor p, we update the decay parameters as β_1 := β_1^p and β_2 := β_2^p. For p = 2, this corresponds to the exact parameters for calculating momentum and velocity when updating with the same gradient twice (Section D). A common approach is to scale the learning rate with the batch size; however, our experiments in Section 4.2 show that this reduces convergence speed. Since these adjustments are only heuristics, the convergence of the model will be comparable but not identical. In particular, it is unlikely that simply adjusting the hyperparameters will fully undo the impact of the increased batch size. However, with these adjustments, researchers should be able to continue to use existing configurations.
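The decay-parameter heuristic can be checked numerically: for p = 2, one update with β^2 equals two consecutive updates with β and the same gradient (a small verification sketch, not the paper's code):

```python
def ema_update(m, g, beta):
    """One exponential-moving-average step, as used for the momentum
    and velocity accumulators in LAMB."""
    return beta * m + (1.0 - beta) * g

m0, g, beta, p = 0.5, 2.0, 0.9, 2
# Applying the same gradient twice with beta ...
twice = ema_update(ema_update(m0, g, beta), g, beta)
# ... equals applying it once with beta ** p:
# beta*(beta*m + (1-beta)*g) + (1-beta)*g = beta**2 * m + (1-beta**2)*g
once = ema_update(m0, g, beta ** p)
```

For gradients that vary between the two steps the identity only holds approximately, which is why the paper calls it a heuristic.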

4.1. BIN PACKING ALGORITHM COMPARISON

We evaluate our algorithms using the following metrics: number of packs, number of all tokens, number of padding tokens, solution time of the packing algorithm (after histogram and strategy creation), number of strategies used, packing efficiency (the fraction of non-padding tokens in the packed dataset), the speed-up achieved compared to not packing (depth 1), and the average number of sequences per sample (packing factor). For SPFHP, we analyse different (maximum) packing depths, since packing is less efficient at smaller depths and we want to get a general understanding of how the packing depth influences the processing time. For NNLSHP, we focus on packing depth 3 because it packs the data sufficiently well. For the speed-up analysis, we focus on the intelligence processing unit (IPU) (11) (IPU-M2000, 16 accelerator chips) and the BERT phase 2 pre-training setup as in Section 4.2. A GPU dynamically loads the code into the accelerator; in contrast, the IPU works with a static pre-compiled engine that gets loaded onto the chip at the start of the run. While other approaches result in excessive padding or continuous changes of the code, our approach can work with the same code for the whole dataset. So in this setting, the IPU architecture especially benefits from our approach, since it avoids code changes. Nevertheless, the approach can be applied to any implementation on GPU or TPU. For determining the speed-up, we take advantage of the pre-compiled kernel. Since time measurements are quite noisy, we instead profile the kernel and count how many cycles it takes to process a batch. That way, we can determine the overhead (in cycles) from processing the additional attention masking and from unpacking the loss. Combining overhead and packing factor, we get the speed-up estimate. No experiment repetitions are required, since the algorithms and measurements are deterministic. Packing depth describes the maximum number of packed sequences.
NONE is the baseline BERT implementation, whereas SORT corresponds to sorted batching, and GREEDY concatenates sequences as they arrive until they would exceed 512 tokens. Setting no limit resulted in a maximum packing depth of 16. Efficiency is the percentage of real tokens in the packed dataset. The packing factor describes the resulting potential speed-up compared to packing depth 1. With overhead (OH), we denote the percentage decrease in throughput due to changes to the model to enable packing (such as the masking scheme introduced in Section 3.2.2). The realized speed-up is the combination of the speed-up due to packing (the packing factor) and the decrease in throughput due to the overhead on the IPU. It is used to measure the relative speed-up in throughput as well as the overhead from masking and loss adjustment. SORT can only be efficient on GPUs (see Section 4.4). The main results for the performance metric evaluation are displayed in Table 1. The processing time for SPFHP on an Intel(R) Xeon(R) Gold 6138 CPU at 2.00GHz with 80 nodes and 472GB RAM was around 0.03s and independent of the packing depth. Classical first-fit-decreasing requires 87-120s, a lot of memory, and scales almost linearly with the number of samples. We see that the overhead slightly increases with packing depth, but the benefits of packing outweigh the cost. The best speed-up is obtained with NNLSHP at depth 3, which required 28.4s on the CPU for processing and ran out of memory for larger depths. With a value of 1.913, it is close to the theoretical upper bound of 2.001. The results show that efficiency, packing factor, and speed-up can be viewed interchangeably. The amount of time needed to process a sample (a pack of sequences) is barely changed relative to the un-packed implementation. The packing factor, or the improvement in efficiency, therefore provides an accurate estimate of the speed-up.
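The realized speed-up as described combines the packing factor with the throughput overhead; a one-line sketch (the 4.4% overhead below is an illustrative value chosen so that a packing factor of 2.001 lands near the reported 1.913, not a number taken from Table 1):

```python
def realized_speedup(packing_factor, overhead_pct):
    """Speed-up from packing, discounted by the throughput overhead of
    the masking and loss-unpacking changes (overhead in percent)."""
    return packing_factor * (1.0 - overhead_pct / 100.0)

# Illustrative: a packing factor of 2.001 with ~4.4% overhead.
s = realized_speedup(2.001, 4.4)
```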
GREEDY packing as used in T5 proves to be quite inefficient, and sorted batching (SORT) is highly efficient at avoiding padding, but the resulting different computational graphs cause a major overhead on the IPU that exceeds the benefits of avoiding the padding. Since we made our algorithm and code publicly available, results have been reproduced with a different framework on the Habana Gaudi accelerator (10), confirming that our approach is hardware- and software-independent, a significant advantage over existing approaches. With packing, we effectively increase the average batch size by the packing factor (≈ 2). However, with a different batch size, different hyperparameters are required (see Section 3.3), and there is no mapping that generates exactly matching results, only heuristics. In a first comparison, we use the same hyperparameters when comparing packed and unpacked training, except for cutting the accumulation count in half. This way, we make sure that the batch size is constant on average and we have the same number of training steps. In the second comparison, we evaluate our heuristics and how they compensate for the difference in batch size. This setup is more desirable because it is beneficial to use the hardware to its full potential, and cutting the batch size in half usually reduces throughput. In the third comparison, we compare two optimized setups. In these two cases, packing takes half the number of training steps. The learning curves are displayed in Figure 3. In the first setup, we see the curves matching almost perfectly when normalizing by the number of samples processed. Differences can be explained by the variation in the number of sequences in the packed batch and by general noise in the training process. Especially after the initial phase, the curves show a near-identical match. The second setup shows bigger differences, since changing the batch size and hyperparameters changes the training dynamics.
We observe slower convergence early on in training due to the increased batch size; this is expected. The adjustment of the learning rate actually decreases performance, probably because we already correct for the increased number of sequences in the modified loss. With the adjustment of the decay parameters of LAMB, we see matching performance at the later training stages. However, it is not feasible to completely recover the early convergence behavior of the smaller batch size by adjusting the hyperparameters. For instance, doubling the batch size of unpacked BERT to 3000 and adjusting the LAMB decay parameters leads to more of a slowdown in convergence than running packed BERT with a batch size of 1500 and a packing factor of 2. In practice, our implementation exceeds the estimated 1.913 maximum speed-up. This estimate is based on the reduction in the computational work needed to process the dataset. However, packing the data also reduces the latency of transferring the data to the device. Figure 3 shows that the realized total speed-up from packing exceeds 2x.

4.2.1. ABLATION STUDY

So far, we have shown that with the introduced adjustments, we can match the accuracy of unpacked BERT. In the following, we analyze to what extent the individual adjustments are required. In Figure 4, we can see that without our adjustments, training loss and accuracy worsen drastically, and a longer training time does not lead to a recovery. When not adjusting the positional embedding, the loss and accuracy almost match. However, the accuracy stalls at 71.8% and does not reach the target accuracy of 72.1%. So overall, both adjustments are crucial to avoid a reduction in performance. When running packed BERT without the NSP loss but keeping everything else the same in a full training setup, we observed that downstream performance on SQuAD reduced the F1 measure by 1.31% and EM by 1.15%. Hence, we do not consider removing NSP, as done in approaches like RoBERTa and T5, as discussed in Section I.

4.3. FULL PRETRAINING AND SQUAD FINETUNING

Packing slightly violates the i.i.d. assumption on the data. Thus, we have to check that downstream performance is not impacted by packing. This is especially relevant in a full training setup without a starting checkpoint. To this end, we show that the packed and unpacked SQuAD 1.1 scores are comparable after a full pre-training of BERT base and large plus fine-tuning. During pre-training, in order to avoid giving an advantage to packing by further hyperparameter tuning, we reduce the gradient accumulation count for the packed BERT training in phase 1 and phase 2 to match, on average, the total number of sequences that get processed before each weight update. With this approach, we can use the same hyperparameters and number of training steps but process each batch faster by avoiding the processing of padding. This gives a slight disadvantage to the packed run in terms of machine utilization, as explained in Section 3.3, and is different from the speed-up analysis in Section 4.2. For phase 2, we use sequence length 384, since longer-range attention is not relevant for SQuAD 1.1. The respective speed-ups from packing for BERT base and large are shown in Table 2: the realized speed-up, measured as the quotient of the throughputs of the packed and unpacked runs, is slightly lower than the theoretical speed-up (i.e. the packing factor) due to the packing overhead. Further learning curves with the loss function and accuracy are provided in Section P. For the fine-tuning training on SQuAD 1.1, we do not use packing. The scores, computed as the median over 10 different seeds, are displayed in Table 3. They are comparable to the reference ones in (6): for BERT base (resp. large), the F1 score is reduced by 0.2% (resp. 0.3%) and the EM score increases by 0.3% (resp. 0.02%). A further advantage of packing over competing un-padding approaches is the inherent load balancing provided by packing.
So-called un-padding approaches rely on dynamically launching custom kernels that ignore padding. A stated advantage of such implementations is the ability to avoid computing the complete (512 x 512) attention matrix. This provides additional computational savings compared to packing, where the attention matrix is computed in its entirety and then masked. Because of these additional savings, un-padding can exceed the theoretical upper bound for the speed-up from packing (2.013 on Wikipedia). As a result of the dynamic nature of the approach, the processing time with un-padding is different for each sequence in the batch, and the amount of time required to process a batch of sequences is determined by the processing time of the longest sequence in the batch (with the sequences being processed in parallel). Furthermore, in the multi-accelerator setting, the processing time on each device will vary depending on the sequences in the batch that it receives. Devices which finish early have to wait for the slowest device to finish before exchanging gradients. This load imbalance between the devices (and inside the batch) leads to a considerable decrease in the speed-up from un-padding as the number of accelerators is increased (see Figure 5 and Section E(1)). In contrast, packing (our approach) is inherently load-balanced. The processing time on each accelerator is independent of the content of the batch received by the device. Any number of accelerators can therefore operate in unison without having to wait for the slowest batch to process (all per-device batches are equally fast).

5. CONCLUSION

Whereas packing is a well-known concept, this paper sheds new light on it in multiple respects. First, we visualize the sequence length distributions of multiple datasets, not just from language domains but also audio and molecular domains, to emphasize that packing is beneficial for varied datasets, leading to more than 2x acceleration by removing 50% or more padding. Second, we provide two new highly efficient packing approaches based on established solvers that leave almost no padding and can tackle arbitrarily large datasets in a matter of seconds, in contrast to existing approaches that are slow and suboptimal. Third, we demonstrate that without adjusting the sequence processing algorithm (e.g., BERT) to the packed sequences, predictive performance is reduced. Thus, we propose several model adjustments that are all necessary to maintain predictive performance. Last but not least, we show that, thanks to such adjustments, predictive performance is preserved as if no packing were used, while speed increases significantly, especially since the adjustments come with an overhead of less than 5%. We show in our experiments that downstream performance is not impacted by packing and that the anticipated 2x acceleration can be achieved.

Supplemental Material for "Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance"

A BROADER IMPACT

We showed that when pre-training BERT on Wikipedia, the computational overhead of processing padding tokens is roughly 50%. By eliminating this wasted computational time, the approach presented in this paper paves the way to halving the carbon footprint of training BERT-based models. Furthermore, our approach circumvents the need for custom kernels, making the benefits of packing readily accessible to a broader audience of NLP practitioners. As such, we are hopeful the research will have a positive impact on the NLP community, and we do not see any disadvantage of using this approach. The benefit of our algorithm rests on two assumptions: a skewed length distribution in the training dataset and a hardware setup that trains efficiently on a fixed batch size. If efficient training is possible with a variable batch size, approaches like FasterTransformer and the sorted-batch approach in fairseq will result in the same or even larger benefits (due to smaller self-attention matrices). If the dataset is generated differently, as in GPT models (4) and RoBERTa (FULL-SENTENCES) (16), all sequences will be at full length, sequences cannot be concatenated, and there is indeed no benefit in packing. However, strategies that reach full sequence length usually combine segments from different, unrelated document sources, which can result in reduced performance. Even in the normal BERT model, such contamination between segments from different documents may occur. Our paper introduced an approach to avoid the contamination between sequences. The same approach could also be applied to avoid contamination between segments, and it remains future work to explore its benefits beyond BERT pre-training. Future work would also need to investigate the applicability of packing to text produced by different cultures and in different languages.
We have already shown that the speed-up resulting from our methods occurs not only when pre-training BERT on Wikipedia but also on other datasets such as SQuAD and GLUE. Furthermore, the sentence length distribution of original English language text shows similar characteristics. Our research leads us to believe that compressible distributions arise naturally in language tasks and beyond, for instance in DNA sequence lengths (39), protein lengths (38), and speech (Section M). Many such sequence modelling workloads are based on variations of the BERT/transformer architecture and would therefore easily benefit from our acceleration. Failures in NLP can have a big impact on society; many technologies, such as Alexa, Siri, and Google Home, rely on NLP models. Whilst errors arising from our approach itself can be avoided, one potential source of error comes from the implementation. Both the attention mask and the per-sequence loss need to be modified to support packing. These changes are significantly smaller than those required by custom kernels; however, they may still be time-consuming to implement and debug. To help mitigate the risk of any implementation errors, we share our reference implementations of the required changes in the appendix.

B REPRODUCIBILITY STATEMENT

All code for the packing algorithms is available in the appendix (Section U) and is directly linked from our GitHub page to simplify download and usage. We also provide code for different variants, as well as the histograms of sequence lengths for different datasets tokenized for BERT pre-training or fine-tuning. To generate the learning curves, our public submission to MLPerf™ can be used, and we are preparing further code releases in other frameworks. To encourage the adjustment of models for packed sequences, we additionally provide detailed explanations and code snippets in TensorFlow. Detailed mathematical formulas (Sections E and F), a theorem proof (Section D), and complexity calculations (Section G) are provided in this appendix to support the claims in the paper in full detail.

C RELATED WORK

The most obvious way to reduce the extent of padding in the dataset is to group samples by size before batching (SORT), i.e., to process the shorter samples together and the longer samples together. BERT is pre-trained in two phases, where the first phase uses sequence length 128 for 900K steps and the second phase uses sequence length 512 for 100K steps. However, even with the training split in this way, the wasted compute due to padding is approximately 20% (see Figure 1). Other examples of this "sorted batching" approach can be found in Faster Transformer (21), lingvo (28), fairseq (22), and RoBERTa (16), which group samples of similar size together in one batch and fill up with padding only to the maximum length in this batch. This approach can be highly efficient in cases where the dataset size is multiple orders of magnitude larger than the batch size and the number of different sequence lengths. Despite its high computational efficiency, this approach has multiple drawbacks. We outline these below and propose an alternative which maintains the high efficiency while also circumventing the downsides. Firstly, sorting the data can reduce the overall convergence speed when the batch size is large because it violates the i.i.d. assumption on the data distribution (2; 18). Secondly, processing batches with shorter sequence lengths under-utilizes the compute compared to running the same batch size with a longer sequence length. For GPUs, a common heuristic to mitigate this effect is to adjust the batch size to keep the number of processed tokens near constant (22; 16). In general, however, the relationship between the sequence length and the optimal batch size is more complex, and maximizing compute utilization can require the model to be sharded differently across multiple accelerators. Avoiding this often manual process is important for ease of use and the portability of methods across different hardware architectures.
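For concreteness, the sorted-batching (SORT) idea can be sketched in a few lines. This is a minimal illustration with hypothetical helper names, not the implementation used by any of the cited libraries:

```python
def sorted_batches(lengths, batch_size):
    """Group samples of similar length and pad each batch only to its
    own maximum, instead of the global maximum sequence length.

    `lengths` is a list of per-sample sequence lengths; returns a list
    of batches, each a list of (sample_index, padded_length) pairs.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        pad_to = max(lengths[i] for i in idx)  # per-batch maximum
        batches.append([(i, pad_to) for i in idx])
    return batches

def padding_tokens(batches, lengths):
    """Total number of padding tokens incurred by a batching."""
    return sum(pad_to - lengths[i] for b in batches for i, pad_to in b)
```

Comparing `padding_tokens` for sorted batches against padding everything to the global maximum illustrates the efficiency gain; the drawback, as discussed above, is that the batches are no longer i.i.d. samples from the dataset.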
Thirdly, modern NLP applications are optimized and compiled for fixed tensor sizes using tools such as XLA (33; 9), which provides a ≈7x acceleration for BERT in MLPerf™ (17) compared to the non-XLA baseline (33). Changing the sequence length or batch size requires re-optimization of the computational graph and recompilation of the program for the new tensor shapes. For complex models such as BERT, optimization and recompilation take a non-negligible amount of time. Even if one pre-compiled and cached all combinations of batch size and sequence length, the kernels would still need to be re-uploaded to the device every time the shapes change. Depending on how frequently the tensor shapes change, the overhead from switching kernels adds up. To avoid these issues, it is preferable (and common) to work with fixed tensor shapes for the entire duration of the training run.

More advanced approaches for reducing the padding overhead rely on custom computational kernels; loosely, these are referred to as "un-padding" approaches. In Effective Transformer (5), the input batch is provided as a padded matrix, but padding values are dynamically removed and restored during different calculation stages. While un-padding implementations are highly sophisticated and able to completely circumvent the processing of padding tokens, they introduce a significant overhead due to the multiple GPU kernel launches (i.e., one kernel per sequence rather than one kernel per batch). Additionally, the time to process each batch fluctuates depending on the sequence lengths in each batch, i.e., batches with only shorter sequences will typically be processed faster. When working with more than one accelerator, this variability in throughput results in all devices in the cluster waiting for the device with the most compute-intensive batch to finish processing. As such, un-padding approaches are not appropriate for deployment on large clusters.
The "packing" based approach introduced in this paper offers significant advantages over un-padding approaches. Firstly, packing is implemented directly at the framework level and requires no additional custom kernel implementations. Secondly, the processing time for each batch is independent of the content of the batch, allowing the packing-based approach to maintain the same speed-up whether running on a single device or thousands. While we demonstrate the effectiveness of packing specifically on the Wikipedia dataset, packing SQuAD (25) or GLUE datasets (31; 30) for BERT also leads to significant speed-ups (some in excess of 9x) (Sections K and L). The effectiveness of packing is a result of both the length distribution of the documents in the source datasets and the different text pre-processing steps for BERT (8). The use of bi-directional self-attention in BERT implies that the input sequences should contain complete sentences. If a sentence is abruptly cut short, the hidden states of the other (preceding) tokens in the sequence will be affected. Language models with causal attention (only attending to previous tokens in the input) do not have this issue to the same degree: if a sequence is cut short at an arbitrary token, the tokens that occur earlier in the sequence are not affected. This ability to cut sequences arbitrarily completely trivializes the packing problem for models based on causal attention. For instance, GPT-3 (4) is trained with a maximum sequence length of 2048, where a single sequence may contain multiple segments of sentences separated by a special end-of-segment token. The last segment in each sequence is simply cut to meet the sequence length requirement, making the packing problem trivial and avoiding any padding. In the interest of computational efficiency, GPT-3 does not mask the attention between different segments in a sequence.
In contrast, the packing approach presented in this paper introduces a mask in the attention layer (see Section 3.2.2) to prevent cross-contamination between examples in a pack. Note that we mask the interaction between different sequences and not between different sentences or segments in the same sequence. This ensures that the characteristics of the original dataset and model are matched as closely as possible. RoBERTa and many other models in production, like T5 (24), use a packing approach similar to GPT-3, combining full sentences/sequences with GREEDY packing (first come, first concatenated) together with separator tokens or additional padding. The RoBERTa ablation study shows that mixing sentences from different documents reduces accuracy, but it is used nonetheless for load balancing reasons, which indicates that sorted batching is not sufficient. There may be hidden code snippets, for example in the deprecated tensor2tensor library, that implement the same attention masking mechanism as we propose. However, these lack the documentation, testing, evaluation, ablation, and communication to the research community needed to be considered state of the art in NLP research. More generally, to the best of our knowledge and the knowledge of the many other engineers and researchers we have been in contact with, there is no other research work that focuses on packing in NLP.

D THEOREM ON LAMB HYPERPARAMETER CORRECTION HEURISTIC

With packing, the effective batch size changes and hence the hyperparameters of the LAMB optimizer (34) need to be adjusted. For a packed dataset with a packing factor $p$, we update the decay parameters as $\beta_1 := \beta_1^p$ and $\beta_2 := \beta_2^p$. For instance, if $\beta_1 = 0.81$ for the un-packed dataset, then for a packed dataset with an average of 2 sequences per sample one should use a value of $0.81^2 \approx 0.66$ instead. Assuming no or only minor changes in the gradients, and $p$ being a natural number, we can prove that this heuristic is the exact solution to ensure that momentum and velocity in LAMB are unaffected by packing. The proof proceeds by mathematical induction. Note that $p \geq 1$ by definition.

Theorem D.1. For any $p \in \mathbb{N}$, and assuming that the respective gradients on a batch of $b$ random samples are (approximately) the same, choosing
$$\beta_1 := \beta_1^p, \quad (1)$$
$$\beta_2 := \beta_2^p \quad (2)$$
as hyperparameters in the LAMB optimizer ensures that the momentum and velocity after $p$ separate update steps are the same as with one packed update step with $p \times b$ samples.

Proof.

• Base Case:

For $p = 1$, the left- and right-hand sides of the equations coincide, which matches exactly the un-packed case. Hence, the theorem holds for $p = 1$.

• Inductive hypothesis: Suppose the theorem holds for all values of $p$ up to some $k$, $k \geq 1$.

• Inductive proposition: The theorem holds for $p = k + 1$.

• Proof of the inductive step: Let $l$ be the loss function, $w_t$ the weight vector after $t$ updates, and $x_1^t, \ldots, x_b^t$ the respective underlying data used to calculate the gradient $g_t$. For a single update step in LAMB with a batch of $b$ samples, we compute the gradient
$$g_t = \frac{1}{b} \sum_{i=1}^{b} \frac{\partial l}{\partial w}(x_i^t, w_t). \quad (3)$$
Since $g_1 \approx g_2 \approx \ldots \approx g_{k+1}$, we have, with the inductive hypothesis and the definitions in LAMB:
$$m_k = \beta_1^k m_0 + (1 - \beta_1^k) g_1, \quad (4)$$
$$v_k = \beta_2^k v_0 + (1 - \beta_2^k) g_1^2. \quad (5)$$
Now we can calculate (with $g_1 \approx g_{k+1}$)
$$m_{k+1} = \beta_1 m_k + (1 - \beta_1) g_{k+1} \quad (6)$$
$$\approx \beta_1 \left( \beta_1^k m_0 + (1 - \beta_1^k) g_1 \right) + (1 - \beta_1) g_1 \quad (7)$$
$$= \beta_1^{k+1} m_0 + (1 - \beta_1^{k+1}) g_1. \quad (8)$$
The calculation for $v_{k+1}$ is analogous. As reference, for a packed update with $p = k + 1$ and hyperparameters $\hat{\beta}_1$ and $\hat{\beta}_2$, we would get the gradient
$$g = \frac{1}{pb} \sum_{j=1}^{p} \sum_{i=1}^{b} \frac{\partial l}{\partial w}(x_i^j, w_1) = \frac{1}{p} \sum_{j=1}^{p} \frac{1}{b} \sum_{i=1}^{b} \frac{\partial l}{\partial w}(x_i^j, w_1) \approx \frac{1}{p} \sum_{j=1}^{p} g_1 = g_1, \quad (9)$$
since we are calculating gradients over $b$ samples which are assumed to be approximately the same. Consequently, the packed updates for momentum and velocity would be
$$m = \hat{\beta}_1 m_0 + (1 - \hat{\beta}_1) g_1, \quad (10)$$
$$v = \hat{\beta}_2 v_0 + (1 - \hat{\beta}_2) g_1^2. \quad (11)$$
Hence, $\hat{\beta}_1 = \beta_1^{k+1}$ and $\hat{\beta}_2 = \beta_2^{k+1}$ are required to match the formula for the $k + 1$ consecutive updates (on the same amount of data).

• Conclusion: The theorem holds for any $p \in \mathbb{N}$.

Since we proved that the formulas $\beta_1 := \beta_1^p$, $\beta_2 := \beta_2^p$ hold for all $p \in \mathbb{N}$, $p \geq 1$, it is safe to assume that the heuristic is appropriate for all $p \in \mathbb{R}$, $p \geq 1$.
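The core of Theorem D.1 can be checked numerically for the momentum accumulator: with a constant gradient, $p$ moving-average updates with decay $\beta$ coincide with one update using decay $\beta^p$. A small illustrative sketch (variable names are ours, not from the paper's code):

```python
def ema_steps(beta, grad, m0, steps):
    """Apply `steps` moving-average updates m <- beta*m + (1-beta)*grad,
    as in the LAMB momentum/velocity accumulators, with a constant gradient."""
    m = m0
    for _ in range(steps):
        m = beta * m + (1.0 - beta) * grad
    return m

beta, grad, m0, p = 0.81, 0.5, 0.25, 3
# p un-packed update steps with decay beta ...
m_unpacked = ema_steps(beta, grad, m0, p)
# ... match one packed update step with decay beta**p (Theorem D.1)
m_packed = ema_steps(beta ** p, grad, m0, 1)
```

The same identity holds for the velocity accumulator with $\beta_2$, since it has the same update form with the squared gradient.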

E UN-PADDING SCALING ESTIMATE

To demonstrate the severity of the load-imbalance issue in Section 4.4, we consider the scaling of an un-padding approach with a per-device batch size of 32 running on eight devices (20). From there, we readily extrapolate the performance to both larger and smaller cluster sizes by fitting a Gumbel distribution to the observed processing times, as described in this section. On a single device with batch size 32, un-padding outperforms packing and exceeds the theoretical upper bound for packing. As the number of devices increases to two or more, the proposed packing approach outperforms the dynamic un-padding approach. On a cluster with 32 accelerators the speed-up from un-padding drops to 50%, and with 2048 devices the speed-up is only 30%. In contrast, the speed-up due to packing is independent of the number of accelerators and stays at 1.913. Switching to a smaller batch size would reduce the load-imbalance issue to some extent, but would also result in under-utilization of the available memory and compute. Firstly, we retrieve the per-batch processing times of an un-padding implementation running pre-training on the Wikipedia dataset from (20). These processing times were obtained using 8 GPUs, each with a per-device batch size of 32. We also retrieve the throughput numbers for the same system running with padding from (43) and use them as the baseline to compare the un-padded throughput against. The throughput on the 8 GPU system is effectively limited by the slowest of the eight batches being processed in parallel. The Gumbel distribution is particularly suited to modelling the maximum or minimum value of a fixed-size collection of i.i.d. samples (in this case batches). We observe that on 8 GPUs the throughput (i.e., speed-up) distribution indeed closely resembles a Gumbel distribution with $\alpha_8 = 1.6$ and $\beta_8 = 0.13$, as shown in Figure 6.
We can extrapolate the performance of the 8 GPU system to larger clusters by recognizing that the processing time for each cluster is effectively determined by the slowest batch being processed. Specifically, we could randomly sample (without replacement) two processing times from the 8 GPU system and record the maximum of the two as the processing time for a system of 16 GPUs. However, this simple approach is too sensitive to outliers in the data and would under-estimate the performance of un-padding on large systems. We mitigate the effect of outliers by avoiding directly sampling the processing times. Instead, we fit a Gumbel distribution to the processing times of a single batch of size 32 running on one GPU. To perform the fit, we observe that the cdf on one GPU ($P_1$) is related to the cdf on 8 GPUs ($P_8$) through (40, Section 1.3):
$$1 - P_8(s) = (1 - P_1(s))^8.$$
In other words, if the speed-up on the cluster is larger than $s$, this implies that the speed-up on every GPU in the cluster was at least $s$. Assuming $P_1$ is Gumbel, and given the 8 GPU Gumbel parameters $\alpha_8$ and $\beta_8$, we need to fit two parameters, $\alpha_1$ and $\beta_1$. Consequently, for the median ($s = \alpha_8 - \beta_8 \ln(\ln 2)$, $P_8(s) = 0.5$), we have:
$$0.5 = \left(1 - P_1(\alpha_8 - \beta_8 \ln(\ln 2))\right)^8.$$
And since $P_8$ is Gumbel, we also have an equation for the mode ($s = \alpha_8$, $P_8(s) = e^{-1}$):
$$1 - e^{-1} = (1 - P_1(\alpha_8))^8.$$
We solve these two non-linear equations simultaneously using the standard SciPy optimization package.

Listing 1: Infer Gumbel distribution parameters.

The resulting estimated speed-up Gumbel distribution for a single device has $\alpha_1 = 1.94$, $\beta_1 = 0.108$ and is shown in Figure 6 [right], next to the statistical estimate of the speed-up distribution on a 1 GPU system running un-padding. To simulate the performance of a cluster of size $n$ with a batch size of 32 per device, we take the minimum over $n$ samples from this distribution.
Repeating this process to generate many samples allows us to estimate the expected speed-up for any given cluster size. Unfortunately, we cannot make any statistical inference about the processing times of individual sequences since the data is only provided at the granularity of 32 sequences per batch, and it is not clear how much of the computation is done in parallel and how much in serial.
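Because the two fitting conditions pin $P_1$ down at exactly two points, the Gumbel assumption also admits a closed-form inversion via the quantile relation $s = \alpha - \beta \ln(-\ln P)$. The following sketch is our own (the paper's Listing 1 instead uses a SciPy root finder); the parameters $\alpha_8 = 1.6$, $\beta_8 = 0.13$ are the values quoted above:

```python
import math

def gumbel_cdf(s, alpha, beta):
    """CDF of a (max-type) Gumbel distribution."""
    return math.exp(-math.exp(-(s - alpha) / beta))

def fit_single_device_gumbel(alpha_n, beta_n, n=8):
    """Infer single-device Gumbel parameters (alpha_1, beta_1) from the
    n-device parameters, using (1 - P_n(s)) = (1 - P_1(s))**n evaluated
    at the median and mode of P_n, and the Gumbel quantile
    s = alpha - beta * ln(-ln(P))."""
    s_median = alpha_n - beta_n * math.log(math.log(2.0))  # where P_n = 0.5
    s_mode = alpha_n                                       # where P_n = exp(-1)
    p1_at_median = 1.0 - 0.5 ** (1.0 / n)
    p1_at_mode = 1.0 - (1.0 - math.exp(-1.0)) ** (1.0 / n)
    z_median = math.log(-math.log(p1_at_median))
    z_mode = math.log(-math.log(p1_at_mode))
    beta_1 = (s_mode - s_median) / (z_median - z_mode)
    alpha_1 = s_median + beta_1 * z_median
    return alpha_1, beta_1
```

With the 8 GPU values quoted above, this closed form yields a location parameter close to the $\alpha_1 \approx 1.94$ reported in this section; the scale parameter depends on the details of the fitting procedure.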

F TECHNICAL BACKGROUND ON PACKING F.1 CANONICAL PACKING PROBLEM

The bin packing problem deals with the assignment of items into bins of a fixed capacity such that the number of utilized bins is minimized. In the canonical formulation of the packing problem, a vector $s$ of length $n$ is used to represent the items being packed, where $s(i)$ denotes the length of the $i$-th sequence/item. The allocation of items into bins is tracked through the assignment matrix $B$, where $B_{ij} \in \{0, 1\}$ states whether the $i$-th sequence should be placed into the $j$-th bin. In the worst-case scenario, every item is assigned to its own bin, thus $B \in \{0, 1\}^{n \times n}$. Notably, $s$ grows linearly in the number of sequences/items being packed and $B$ grows with the square. To mask out unused bins, $y_j \in \{0, 1\}$ denotes whether the $j$-th bin is being used. The optimization objective is to minimize the sum of the $y_j$ while making sure to assign each item to exactly one bin and not to exceed the maximum bin capacity $s_m$ for any bin. This problem formulation is well known as bin packing (14). Bin packing is a strongly NP-complete problem (14). Producing an exact and optimal solution is possible with a variety of existing algorithms, for example with the branch-and-cut-and-price algorithm (36). However, given that we want to apply it for very large $n$ (16M for the Wikipedia dataset), an approximate approach is required.
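Written out explicitly, the canonical formulation described above is the following integer program (our transcription of the standard bin packing formulation):

```latex
\begin{aligned}
\min_{B,\,y} \quad & \sum_{j=1}^{n} y_j \\
\text{s.t.} \quad & \sum_{j=1}^{n} B_{ij} = 1 && \forall i \in \{1,\dots,n\}, \\
& \sum_{i=1}^{n} s(i)\, B_{ij} \le s_m\, y_j && \forall j \in \{1,\dots,n\}, \\
& B_{ij},\, y_j \in \{0, 1\}.
\end{aligned}
```

The first constraint assigns every item to exactly one bin, and the second enforces the capacity $s_m$ of every used bin.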

F.2 APPROXIMATE BIN PACKING PROBLEM

Approximate packing approaches are divided into online and offline algorithms (12). Online algorithms process incoming sequences one-by-one in a streaming fashion, whereas offline algorithms have a holistic view of all samples to be packed but typically still operate on a per-sample basis. This results in best-case time and memory complexities of at least $O(n \log(n))$ and solutions that can sometimes be far from optimal, especially for online algorithms, which do not have access to a holistic view of the dataset. The simplest online approach (next-fit) keeps a single open bin at any given time. An incoming sequence is added to this open bin if it fits; otherwise, the bin is closed (it can never be appended to again) and a new one is opened to accommodate the new sequence (12). In the case of the Wikipedia pre-training dataset, almost 25% of the sequences are of length 512, which makes this approach very inefficient, since bins would frequently be closed because the incoming sequence did not fit. More specifically, this approach is not able to efficiently combine one long sequence with one shorter sequence when the number of long sequences is large. The algorithms that come closest to the approaches proposed in this paper are the online harmonic-k algorithm (15), which creates harmonic-sized bins for the assignment decision, and the offline Modified First Fit Decreasing method (13; 35), which sorts the data, groups it into 4 size categories, and defines a strategy adjusted to these sizes. In our approaches, we make three major simplifications. First, we make the problem of bin packing less dependent on $n$ by operating on the histogram of sequence lengths with bin size 1. Hence, we replace $s$ by its histogram $b$, and the bin assignments $y$, $B$ by a mixture of strategies $x$, where the set of all available packing strategies is modeled as the matrix $A$ (see also Section F.4.2).
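The next-fit baseline described above can be sketched as follows (a minimal illustration, not code from any of the cited works):

```python
def next_fit(lengths, capacity=512):
    """Online next-fit bin packing: keep one open bin; if the incoming
    sequence does not fit, close the bin forever and open a new one."""
    bins, current, remaining = [], [], capacity
    for length in lengths:
        if length > remaining:       # close the bin, never revisit it
            bins.append(current)
            current, remaining = [], capacity
        current.append(length)
        remaining -= length
    if current:
        bins.append(current)
    return bins
```

The weakness discussed above is easy to reproduce: a stream alternating between lengths 512 and 100 closes the open bin at almost every step, so every second bin wastes 412 tokens even though the data could be combined far more efficiently offline.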
Then, we do not solve the full packing problem but focus on a fixed packing depth (in other words, the well-known 3-partition problem). Last but not least, we solve this limited-depth packing problem only approximately, either with non-negativity-constrained linear least squares (3) (NNLS) followed by rounding to the nearest integer solution, or by applying Worst-Fit (13; 35) to the histogram sorted from largest to smallest (in contrast to using an unsorted dataset). An exact solution would not be appropriate, since the 3-partition problem is strongly NP-complete as well (37).

F.3 DEFINITIONS

In this section, we standardize the terms used throughout our methods. Firstly, the terms pack and bin may be used interchangeably. Secondly, the presented packing schemes impose a limit on how many sequences can be packed into any given bin. This limit is referred to as the maximum packing depth. For simplicity, we require the different sequence lengths in a pack to always add up exactly to the bin capacity $s_m$ (we can always generate a padding sequence of just the right length to fill up the bin). A packing strategy is a sorted list of sequence lengths, for example [5, 7, 500], such that the total sequence length is no more than $s_m$ and the number of sequences in the pack does not exceed the maximum packing depth. The output of a packing scheme is typically a set of packing strategies and the corresponding repeat count for each strategy, stating how many times each strategy should be repeated in order to cover the entire dataset. The strategy repeat counts are also referred to as the mixture of strategies. The objective of the packing algorithm is to jointly design a set of packing strategies and their repeat counts such that the amount of padding is (approximately) minimized. The presence of padding in the packs can be either implicit or explicit. For instance, for $s_m = 512$ the strategy [2, 508] has an implicit padding of 2 (needed to fill the pack up to $s_m$). Alternatively, the strategy repeat count may over-subscribe a particular sequence length, leading to explicit padding. For instance, constructing a pack of [4, 508] may require a new padding sequence of length 4 to be constructed if there are not enough sequences of that length in the dataset. The packing algorithms we present use both representations.

F.4 NON-NEGATIVE LEAST SQUARES HISTOGRAM-PACKING

The first algorithm proposed in this paper is suitable for settings where it is desirable to achieve a high packing efficiency with a limited packing depth. The algorithm is deterministic and has three major components described in Sections F.4.1, F.4.2 and F.4.3.

F.4.1 ENUMERATING PACKING STRATEGIES OF FIXED PACKING DEPTH

Listing all unique ways of packing up to a maximum packing depth can be achieved through dynamic programming. We only consider packing at most 3 sequences per pack. This is the smallest packing depth that can eliminate the need for most padding on the Wikipedia dataset. Increasing the depth to 4 increases the size of the packing problem drastically and yields no throughput benefit. With only two sequences, packing would not be as efficient, since the distribution of sequence lengths is not symmetric. We use dynamic programming to enumerate all feasible ways/strategies in which up to M sequences of length 1 to 512 can be packed into a bin of length 512. For example, a packing strategy may be [512] or [6, 506] or [95, 184, 233]. To avoid listing the same strategy multiple times, we enforce the sequence lengths within a pack to occur in sorted order; for example, [95, 184, 233] is equivalent to [184, 95, 233] and should only be listed once. This reduces the search space as well as the space of potential solutions by a factor of approximately 6 and thus significantly accelerates the optimization process. If the same strategy were listed six times, then instead of one instance of that strategy with weight X, a solver could return six instances with weight X/6 each (for example, or any other distribution of the weight). This would conflict with integer rounding of the solutions and with the convergence of optimization algorithms.

F.4.2 CONSTRUCTING THE PACKING MATRIX

The number of rows in the packing matrix is equal to the number of different sequence length categories. For instance, if we are using a granularity of 1 token to distinguish between different sequence lengths, then there are "maximum sequence length" rows. Each column of the matrix corresponds to a valid packing strategy (given the depth of packing). An example packing matrix for fitting up to 3 sequences into sequence length 8 is given in Table 4. Each column of the matrix represents a packing strategy; for instance, the first column represents the strategy [1, 1, 6]. Rows state how many sequences of a certain length occur in these packs. The last column represents a pack with only a single sequence of length eight.

Table 4: Example packing matrix for sequence length 8 and packing depth 3. Columns correspond to the strategies [1,1,6], [1,2,5], [1,3,4], [1,7], [2,2,4], [2,3,3], [2,6], [3,5], [4,4], [8]; row $l$ counts the sequences of length $l$ in each strategy.

length 1:  2 1 1 1 0 0 0 0 0 0
length 2:  0 1 0 0 2 1 1 0 0 0
length 3:  0 0 1 0 0 2 0 1 0 0
length 4:  0 0 1 0 1 0 0 0 2 0
length 5:  0 1 0 0 0 0 0 1 0 0
length 6:  1 0 0 0 0 0 1 0 0 0
length 7:  0 0 0 1 0 0 0 0 0 0
length 8:  0 0 0 0 0 0 0 0 0 1
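Given the enumerated strategies, the packing matrix can be assembled by counting the sequence lengths in each strategy. A sketch reproducing the Table 4 example (helper names are ours):

```python
def packing_matrix(strategies, max_len):
    """Build A with one row per sequence length (1..max_len) and one
    column per strategy; entry (l-1, j) counts how many sequences of
    length l occur in strategy j."""
    A = [[0] * len(strategies) for _ in range(max_len)]
    for j, strategy in enumerate(strategies):
        for length in strategy:
            A[length - 1][j] += 1
    return A

# The ten depth-3 strategies for bin capacity 8 (the columns of Table 4).
strategies_8 = [[1, 1, 6], [1, 2, 5], [1, 3, 4], [1, 7], [2, 2, 4],
                [2, 3, 3], [2, 6], [3, 5], [4, 4], [8]]
A = packing_matrix(strategies_8, 8)
```

Every column of the resulting matrix packs exactly 8 tokens, which is a useful consistency check when constructing the matrix for larger sequence lengths.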

F.4.3 SOLUTION OF THE NNLS APPROXIMATE PACKING PROBLEM

A solution of the packing problem is the mixture of packing strategies $x$ that minimizes the amount of padding in the packed dataset. We solve directly for the mixture (positive real numbers) and recover the padding as the negative portion of the residual (see Section F.4.4):
$$\min_{x \in \mathbb{R}^m} \|A \cdot x - b\|_2 \quad \text{s.t.} \quad x \geq 0.$$
The solution vector $x$ represents the mixture of the columns of $A$, in other words the mixture of valid packing strategies, such that $A \cdot x$ is as close as possible (in the least squares sense) to the histogram of sequence lengths $b$. We obtain a solution with a non-negative least squares implementation (41; 45). Interestingly, in the case of sequence length 512 only 634 out of the 22102 available packing strategies of depth up to 3 are used (3%).
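A minimal end-to-end sketch of this step using SciPy's NNLS solver (the implementation family cited above). Only the matrix corresponds to the Table 4 example; the histogram below is made up for illustration:

```python
import numpy as np
from scipy.optimize import nnls

# Packing matrix for bin capacity 8 and depth 3; columns are the strategies
# [1,1,6], [1,2,5], [1,3,4], [1,7], [2,2,4], [2,3,3], [2,6], [3,5], [4,4], [8].
A = np.array([
    [2, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 2, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 2, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 0, 2, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
], dtype=float)

# Hypothetical histogram: b[l-1] is the number of sequences of length l.
b = np.array([4.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0, 3.0])

x, _ = nnls(A, b)                    # non-negative mixture of strategies
repeats = np.round(x).astype(int)    # integer strategy repeat counts
residual = b - A @ repeats           # negative: padding to add; positive: left over
```

Residual weighting (Section F.4.5) fits the same mold: scaling row $l$ of both $A$ and $b$ by the square root of the weight for length $l$ before calling the solver minimizes the weighted least squares objective.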

F.4.4 PADDING AS THE RESIDUALS OF THE PACKING PROBLEM

We compute the residuals of the least squares solution (after rounding the mixture to integer) as
$$r = b - A \cdot \mathrm{round}(x).$$
The negative portion of the residuals represents sequences that we are "short". That is, there is a deficit of those sequences and we are over-subscribing to them. The positive portion of the residuals represents sequences which have failed to be packed. Typically, there is a deficit of short sequences and a surplus of long sequences, as demonstrated by the residual plot in Figure 8 [left]. In total, there are n = 16'279'552 sequences in the Wikipedia pre-training dataset. After the non-negative least squares packing (and rounding to an integer solution), 56'799 sequences are left un-packed (about 0.352%). The residuals on sequence lengths 1 to 8 are [-4620, -4553, -4612, -4614, -3723, -3936, -3628, -3970]. These negative residuals imply that we need to add this many padding sequences of the corresponding sequence length to realize the mixture of packing strategies. In total, the first iteration introduces $7.94 \times 10^6$ tokens of padding. In contrast, large sequence lengths have a positive residual (a surplus of unused sequences). For sequence lengths 504 to 512 the values are [3628, 3936, 3724, 4613, 4612, 4553, 4619, 0]. Note that sequence length 512 has a residual of 0, since such sequences do not need packing. Intermediate sequence lengths typically have non-zero (but much smaller) residuals. The detailed code for the algorithm is provided in Listing 2.

F.4.5 RESIDUAL WEIGHTING

A natural extension of the non-negative least squares problem introduced in Section F.4.3 is to weight the residuals of different sequence lengths differently. We should not significantly penalize a deficit in short sequence lengths (smaller than 8 tokens), as adding up to 8 tokens of padding is not much overhead. Similarly, a surplus in long sequences is not worrisome, because the amount of padding needed to achieve a sequence length of 512 is small. Reducing the weight of the residual on the first 8 tokens to 0.09 leads to the residual plot shown on the right in Figure 8. In this case, the residual is almost entirely shifted to the shorter sequences, and the positive residual on the longer sequences has virtually disappeared. This section discusses the choice and effect of the weighting parameters in the NNLSP packing algorithm. To simplify the problem of selecting reasonable defaults for the residual weights, we use just two parameters to completely describe the weights: an "offset" parameter and a "weight" parameter. Originally, all sequence length residuals are given the same weight of 1. This results in a packing with leftover long sequences, because there are not enough short sequences to pack them with. To reduce the residual on long sequences, we could either increase the residual weight on long sequences or reduce the weight on short sequences. We chose to reduce the weight on short sequences. Specifically, sequence lengths up to the "offset" length have a reduced "weight"; the other residual weights stay at 1. To start, we chose an offset of 8 tokens, which is the smallest power of 2 for which there are examples in the Wikipedia dataset. We decreased the weight on sequences shorter than the "offset" from 1 to 0.9 to 0.09 to see which order of magnitude is the most appropriate. On visual inspection (looking at residual plots as in Figure 8), we found that 0.9 still left too many long sequences unpacked.
So, we reduced the weight by a further order of magnitude to 0.09. This proved sufficient to encourage nearly all long sequences to pack. While visual inspection helps in understanding how many long/short sequences are left over, we are also interested in the impact the weights have on the overall efficiency of the packing. Without any weighting, we get 99.746359% efficiency, whereas the weighted approach results in 99.746274% efficiency. Hence, we conclude that the impact of the weights on the packing efficiency is very limited. Additionally, using an "offset" length of 4 resulted in similar numbers for the full range of weights from 0 to 1. Using a weight of 0 with an "offset" length of 8 resulted in an insignificantly higher efficiency of 99.7519%, whereas using an "offset" length of 16 reduced performance to 99.38964%. A weight of 0 implies that the residual on those lengths can be safely ignored, i.e., the packing algorithm can add as many short sequences as it chooses without any penalty. It is very interesting that this does not significantly impact the packing efficiency, and can even have a slightly positive impact. However, increasing the "offset" length further significantly decreases the performance with weight 0. Keeping the weight at 0.09 and increasing the length reduces performance slightly, for example to 99.53% at length 256 and 99.728% at length 16. For our SQuAD analysis, weighting improved the efficiency slightly from 96.94% to 97.38%. Further fine-tuning with a directed grid search delivered a local optimum of 98.767% efficiency with length 64 and weight 0.002. Overall, the influence of different residual weights on the packing efficiency (and the acceleration factor) is less than 1%. This might differ from application to application, but it shows that we are able to use the residual weights to achieve secondary targets (such as not having leftover long sequences) without significantly compromising the packing efficiency.

G COMPLEXITY ANALYSIS OF THE PROPOSED PACKING APPROACHES

Since approximate packing algorithms have a complexity of at least O(n log(n)) and we would like to be able to tackle datasets with many millions of samples, we discuss the complexity of our packing algorithms in this section. The complexity depends on the maximum sequence length s_m, the number of samples n, and the packing depth d. To create the histogram, we have to iterate over the data once (O(n)). Our histograms use a bin size of 1, i.e., one bin for each sequence length. Hence, a dictionary can be generated (O(s_m)) and used for sorting lookups in O(1). The respective histogram vector has dimension s_m.
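The O(n) histogram pass can be sketched as follows (a minimal version of the preprocessing step, with hypothetical input data):

```python
import numpy as np

def build_histogram(seq_lengths, max_seq_len):
    """One O(n) pass over the data; bin size 1, one bin per length 1..max_seq_len.
    Direct indexing into the bin replaces any explicit sorting (O(1) per sample)."""
    hist = np.zeros(max_seq_len, dtype=np.int64)
    for length in seq_lengths:
        hist[length - 1] += 1
    return hist

hist = build_histogram([3, 5, 5, 8, 2], max_seq_len=8)
# hist[4] counts the two sequences of length 5
```

The resulting vector of dimension s_m is all later stages of NNLSHP and SPFHP ever touch; the raw dataset is not revisited.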

G.1 COMPLEXITY ANALYSIS OF NON-NEGATIVE LEAST-SQUARES HISTOGRAM-PACKING

For a packing depth of one, there is only the strategy $[s_m]$. For a packing depth of two, we add the strategies $[1, s_m - 1], [2, s_m - 2], \ldots, [\lfloor s_m/2 \rfloor, \lceil s_m/2 \rceil]$, which results in $\lfloor s_m/2 \rfloor$ additional potential strategies. Following the dynamic programming approach, the number of additional potential strategies of depth three can be calculated as

$$\#\text{ potential strategies} = \sum_{j=1}^{\lfloor s_m/3 \rfloor} \sum_{i=j}^{\lfloor (s_m - j)/2 \rfloor} 1 = \sum_{j=1}^{\lfloor s_m/3 \rfloor} \left( \left\lfloor \frac{s_m - j}{2} \right\rfloor - (j-1) \right) \approx \sum_{j=1}^{s_m/3} \left( \frac{s_m}{2} - \frac{3}{2}j \right) \approx \frac{s_m^2}{12},$$

so there are roughly $s_m^2/12$ free variables in the strategy vector $x$. Also note that $A$ can be precomputed when $s_m$ is known and is independent of the number of samples. Given a problem matrix with dimension $i \times j$, Luo et al. (42) indicate that the asymptotic complexity of most solution approaches is $O(ij^2)$, whereas they propose an $O(ij)$ solution. Since we use the standard SciPy implementation (41), our estimated total time complexity for NNLSHP is $O(n + s_m^5)$. For $s_m = 2048$, the estimate would be 350,540 potential strategies, which is still far fewer than the number of samples. For packing depth 4, we calculate (47):

$$\sum_{k=1}^{\lfloor s_m/4 \rfloor} \sum_{j=k}^{\lfloor (s_m - k)/3 \rfloor} \sum_{i=j}^{\lfloor (s_m - j - k)/2 \rfloor} 1 \approx \sum_{k=1}^{s_m/4} \sum_{j=k}^{(s_m - k)/3} \frac{s_m - k + 2 - 3j}{2} \approx \sum_{k=1}^{s_m/4} \frac{1}{12}(s_m + 4 - 4k)(s_m + 3 - 4k) \approx \frac{1}{288} s_m(2s_m^2 + 9s_m + 4) = \frac{1}{288} s_m(s_m + 4)(2s_m + 1).$$

So with $s_m = 512$, there would be around 940K strategies. In our implementation, this number of strategies would be too large to create the problem matrix. One alternative to simplify would be to not use the exact sequence lengths but to consider only even numbers and round up. That way, arbitrary sequence lengths could also be handled, and the limiting factor would be the complexity of the attention layer in BERT, which does not scale well with the sequence length.
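The strategy counts above can be checked empirically with a small brute-force enumeration (a sketch for small $s_m$ only; the paper's dynamic-programming construction is more efficient):

```python
def count_strategies(s_m, depth):
    """Count non-increasing tuples of up to `depth` positive lengths summing
    exactly to s_m, i.e., the number of packing strategies. Recursive
    brute force; only suitable for small s_m."""
    def rec(remaining, max_part, parts_left):
        if remaining == 0:
            return 1
        if parts_left == 0:
            return 0
        total = 0
        for part in range(min(max_part, remaining), 0, -1):
            total += rec(remaining - part, part, parts_left - 1)
        return total
    return rec(s_m, s_m, depth)

count_strategies(8, 3)  # -> 10 strategies for s_m = 8, depth <= 3
```

For depth 2 this reproduces the $1 + \lfloor s_m/2 \rfloor$ count from the text, and the depth-3 total grows like $s_m^2/12$.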

G.2 COMPLEXITY ANALYSIS OF SHORTEST-PACK-FIRST HISTOGRAM-PACKING

The complexity calculation of SPFHP is straightforward. We go over the whole data once for the histogram sorting. Next, we iterate over each of the $s_m$ bins in the histogram. Lastly, we iterate over all strategies encountered so far. It can be proven that, at each iteration, the number of strategies increases by at most one: in each step, we potentially add a sequence to existing strategies, and a new strategy is opened only in the final step, when we either create a new strategy or split one of the existing strategies into two. Hence, the number of strategies is bounded by $s_m$ and the overall time complexity is bounded by $O(n + s_m^2)$. The space complexity is $O(s_m^2)$, since we need to store up to $s_m$ strategies with at most $s_m$ counts for different sequence lengths.

H PERFORMANCE COMPARISON TO GREEDY PACKING IN T5

T5 (24) is normally trained on the C4 dataset. However, to give an idea of the difference in packing efficiency and acceleration compared to our newly introduced algorithms, we can analyse the performance of greedy aggregation of samples on our Wikipedia dataset. We take the histogram and cast it back to a list of different sequence lengths, since this is all that matters for analysing packing behaviour. Next, we randomly shuffle the dataset and run the greedy aggregation algorithm multiple times to account for randomness. We iterate sequence by sequence and combine sequences provided the maximum sequence length of 512 is not yet reached; if it would be exceeded, the packed sequence is considered finished and a new one is started. The greedy packing algorithm itself takes a bit more than 10 seconds, since it operates on single sequences and not on histogram counts. The efficiency of this approach is 78.24% (standard deviation of 0.005), compared to our 99.75% for NNLSHP; the respective acceleration would be around 1.566x compared to our 2x. With separator tokens, the efficiency decreases by around 0.13% for one separator token and 0.27% when two separator tokens are required between two sequences. Following the brief documentation at tensor2tensor [link], two separator tokens would be expected in the T5 processing. In addition to the packing preprocessing, our paper proposes, rather than using separator tokens, to modify the masking of the attention matrix during training. The RoBERTa paper shows that avoiding contamination between sequences from different documents can consistently improve downstream F1 scores by 0.35%.
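The greedy aggregation baseline described above can be sketched in a few lines (a simplified re-implementation for illustration; sequence lengths are assumed to be at most `max_len`):

```python
import numpy as np

def greedy_pack_efficiency(lengths, max_len=512, rng_seed=0):
    """T5-style greedy aggregation: shuffle, then append each sequence to the
    current pack unless that would exceed max_len, in which case the pack is
    closed and a new one is started. Returns the packing efficiency, i.e.,
    the fraction of non-padding tokens in the packed output."""
    rng = np.random.default_rng(rng_seed)
    lengths = rng.permutation(lengths)
    pack_sums, current = [], 0
    for length in lengths:
        if current + length > max_len:
            pack_sums.append(current)
            current = length
        else:
            current += length
    if current:
        pack_sums.append(current)
    return sum(lengths) / (len(pack_sums) * max_len)

eff = greedy_pack_efficiency([512, 100, 400, 60, 512, 30], max_len=512)
```

Averaging `eff` over several seeds reproduces the kind of estimate quoted above; unlike SPFHP/NNLSHP, no sequence is ever moved once placed, which is why efficiency saturates well below the histogram-based approaches.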

I IMPACT OF NSP LOSS

When running packed BERT base without the NSP loss but keeping everything else the same, we observed that downstream performance on SQuAD reduced the F1 measure by 1.31% and EM by 1.15%. For the packing in approaches like RoBERTa or T5, it is crucial that there is no NSP loss, because the NSP loss would prevent putting arbitrary sequences together; in contrast, our approach can handle multiple sequences from different documents without cross-contamination. Liu et al. (16) argue that NSP can be omitted because "removing the NSP loss matches or slightly improves downstream task performance". In their experiments, they compare the normal BERT setup with NSP ("SEGMENT-PAIR") to the "DOC-SENTENCES" approach, where there is no NSP and the data in one sequence comes from only one document. For the "SEGMENT-PAIR" approach, the paper does not address how many padding tokens are still present. Assuming it is around 40%, their correction of the batch size for each step would result in a significant increase in training steps for the "DOC-SENTENCES" approach, and it is well known that BERT performance increases with longer pretraining. Our results indicate that the NSP loss might still be relevant, depending on the dataset generation process. With our approach, we can get the acceleration benefits of T5 and RoBERTa while keeping the predictive performance by avoiding cross-contamination. Due to the length distribution, it is no longer sufficient to concatenate only 3 sequences to obtain near-perfect packing for maximum sequence lengths of 1024 or 2048; instead, around 6 and 12 sequences, respectively, are required. This can no longer be solved by NNLSHP due to the search-space complexity, but requires an online heuristic like SPFHP or the slightly better LPFHP, introduced in Section R, which is based on best-fit and on splitting counts in the histogram, in contrast to the rather simple first-fit descending. Figure 10 shows the speed-ups achieved with LPFHP depending on the maximum number of allowed sequences.
K PACKING SQUAD 1.1

We tokenized SQuAD (25) for BERT (6) with maximum sequence length 384 and visualized the histogram of the sequence lengths (Figure 11). The distribution looks similar to the Wikipedia dataset but is slightly less skewed. However, the maximum sequence length occurs in only 1.2% of samples, compared to 23.5% for Wikipedia. Hence, the theoretical un-padding speed-up is 2.232x. In Table 5, we can see that SPFHP does not concatenate more than 3 samples and obtains 97.54% efficiency, in contrast to a maximally used depth of 16 with 99.60% efficiency on Wikipedia, because of the less skewed distribution. Note that we have fewer than 90,000 samples; hence, NNLSHP is less efficient, because the rounding of the residuals has a much larger impact than with the more than 16 million sequences in the Wikipedia dataset.

L PACKING GLUE

To explore a variety of datasets and emphasize that skewed distributions are common, we explored all datasets in the GLUE benchmark (31; 30) that come with training data. We loaded the datasets using the HuggingFace dataset loading API (46). For preprocessing, we followed the implementation in the HuggingFace transformers repository (32)³ and extracted the respective data-processing snippets to obtain tokenized data with a maximum sequence length of 128. The histogram of the sequence lengths for each of the included datasets is displayed in Figure 12 and the packing results are given in Table 6. Each dataset benefits from packing: the lower the mean sequence length, the higher the achievable packing factor, albeit at a higher packing depth.

M PACKING AUDIO DATA (LIBRISPEECH)

In this section, we show that packing can benefit domains other than NLP, such as ASR. We use the LibriSpeech dataset (23) and preprocess it as described in a reference implementation.⁴ The resulting histograms of the subsampled audio sample lengths and respective text labels are provided in Figure 13. It can be seen that the audio sequence lengths are dominated by long sequences, with 38% padding required to meet the maximum sequence length of 330. Thus, the theoretical optimal speed-up of 1.6x cannot be reached. However, 80% efficiency is possible with any of the proposed packing algorithms, which achieves a 1.3x speed-up and requires combining only up to 2 sequences. To achieve almost perfect packing efficiency, a sequence length of around 457 and concatenating up to 8 sequences would be required. Due to the quadratically increasing computational load that usually comes with longer sequence lengths, increasing the sequence length is not practical. If the text data were processed and packed independently of the audio, 99.99% efficiency could be achieved with a speed-up of 2.24x.

N PACKING PAPER ABSTRACTS (PUBMED)

This section analyses the length of abstracts to give an intuition of how much documents can differ in length. Figure 14 depicts the length in characters of abstracts extracted from PubMed.⁵ If these abstracts were directly used as sequences, a character length of 1000 could result in a 1.9x speed-up from packing; the potential speed-ups for lengths 2000, 3000, and 4000 would be 2x, 3x, and 4x, respectively. Note that document clean-up procedures would usually eliminate documents that are too short or too long for data-sanitizing purposes. Note also that for the processing in BlueBERT (44), paper titles and abstracts are separated into sequences, tokenized, and then combined with the BERT sequence-combination approach for a maximum sequence length of 128 tokens; thus, it results in a different distribution.

O MLPERF™ PHASE 2 LEARNING CURVES

This section provides further learning curves related to Section 4.2. 

P FULL PRETRAINING OF BERT BASE AND LARGE LEARNING CURVES

This section provides further learning curves related to Section 4.3. 

Q NOTE ON CHANGING THE SEQUENCE LENGTH FOR OPTIMAL PACKING

An interesting aspect of packing is that the maximum sequence length for packing can be larger than the maximum sequence length in the underlying dataset being packed. For the QM9 dataset, this means that by setting the maximum sequence length to 36 instead of 27, an optimal 1.6x speed-up can be easily achieved. Note that the choice of maximum sequence length depends on the underlying machine learning algorithm. Due to the squared computational and memory complexity of self-attention in BERT and other transformers, the maximum sequence length is usually kept as small as possible for these models, so an increase for packing alone is not practical. For algorithms with linear complexity, for example Graph Neural Networks as implemented in PyG, a larger maximum sequence length can be chosen to ensure that optimal packing is always possible.

R FINE-TUNED LONGEST-PACK-FIRST HISTOGRAM-PACKING

In the main paper, we focused on SPFHP due to its simplicity. In this section, we analyse the effect of applying the "best-fit" algorithm (12), where the longest pack that still fits the sequence is chosen instead of the shortest one. We can see that longest-pack-first histogram-packing (LPFHP) uses a much higher packing depth when no limit is set (29 instead of 16). Splitting the histogram counts results in slightly higher numbers of used strategies compared to SPFHP, where the number of used strategies is limited by the maximum sequence length. The best efficiency of LPFHP is 99.949% with a packing factor of 2, which is slightly higher than the 99.75% (1.996 packing factor) for NNLSHP and 99.6% (1.993 packing factor) for SPFHP; all algorithms are very close to the upper limit. Note that for NNLSHP, we only fill up the unpacked samples with padding; applying best-fit to the remainder, similar results can be expected. Although the benefits of the improved algorithm are negligible, we share the concept and code below in case packing is applied to other data with a different distribution that would benefit more from it, or for applications where only perfectly packed sequences without padding are of interest.
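For intuition, the best-fit placement rule can be sketched on individual sequences (a deliberately simplified version: the actual LPFHP of Listing 8 operates on histogram bins and splits their counts, which this sketch omits):

```python
def longest_pack_first(lengths, max_len):
    """Toy best-fit sketch: in first-fit-decreasing order, place each sequence
    into the *fullest* open pack that still has room; otherwise open a new pack."""
    packs = []  # each pack is a list of sequence lengths
    for length in sorted(lengths, reverse=True):
        candidates = [p for p in packs if sum(p) + length <= max_len]
        if candidates:
            max(candidates, key=sum).append(length)  # longest pack first
        else:
            packs.append([length])
    return packs

packs = longest_pack_first([510, 256, 256, 128, 2], max_len=512)
# -> [[510, 2], [256, 256], [128]]
```

Choosing the fullest feasible pack closes packs as early as possible, which is what drives the slightly higher efficiency compared to shortest-pack-first.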

S EXTENDED NNLS WITH PADDING TOKEN WEIGHTING

In Section F.4.4, we defined the residual as $r = b - A \cdot \mathrm{round}(x)$ and discovered that a positive residual corresponds to sequences that we did not pack at all and should be avoided, whereas negative residuals correspond to padding and should be minimized. Due to this discrepancy, we decided to set small weights for very short sequences (which do not occur in the data). However, it was not possible to directly optimize the amount of padding: a negative residual component for length $i$, $r_i$, results in $|r_i| \cdot i$ padding tokens, whereas a positive residual results in $r_i \cdot (512 - i)$ padding tokens. This cannot be addressed by our weighting approach

$$\min_{x \in \mathbb{R}^m} \|(wA) \cdot x - wb\|^2 \quad \text{s.t.} \quad x \ge 0.$$

Working within the NNLS approach, we can strictly enforce a non-positive residual $r$ (before rounding to integer). To that end, we define a new auxiliary variable $\bar r \approx -(b - Ax)$, which is the negative of the residual $r$. This allows us to reformulate the objective $r \le 0$ as the non-negative constraint $\bar r \ge 0$:

$$\min_{x \in \mathbb{R}^m,\ \bar r \in \mathbb{R}^m} \|(wA) \cdot x - wb\|^2 + \|\bar w A x - \bar w b - \bar w \bar r\|^2 \quad \text{s.t.} \quad x \ge 0,\ \bar r \ge 0.$$

The second term enforces $\bar r = Ax - b \ge 0$ due to the large weight, $\bar w := 10^6$, and the absence of upper limits on $\bar r$. Now, we can set $w_i := i$ to optimize for the number of padding tokens. Due to the use of the squared error, we would, however, optimize the squared sum of padding tokens instead of the preferred plain sum. To accomplish the latter, we would have to replace the L2-norm problem by an L1-norm problem, which would be too complex to solve. Note that due to rounding, the unwanted positive residuals $r$ (i.e., $\bar r < 0$) might still occur; this could be avoided by rounding $x$ up instead of rounding to the nearest integer. To put the new formulation into a solver, we replace $A$ by the block matrix $\begin{pmatrix} wA & 0_m \\ \bar w A & -\bar w D_m \end{pmatrix}$, where $0_m$ is an $m \times m$ zero matrix with $m$ being the maximum sequence length, 512, and $D_m$ is a unit matrix of the same dimensions as $0_m$. Since we are already close to the optimum, especially on the Wikipedia dataset, the results are only a little better.
The processing time, however, increases from 30 to 415 seconds, not counting the increased time for constructing the problem matrix. Since the slightly improved algorithm might nevertheless be relevant for other applications, we share it in Listing 9.
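The block construction can be sketched with the standard SciPy NNLS solver (a toy sketch: `A`, `b`, and `w` below are small stand-ins, and in the paper's setting `w` would be set to the sequence lengths to count padding tokens):

```python
import numpy as np
from scipy.optimize import nnls

def extended_nnls(A, b, w, w_bar=1e6):
    """Extended NNLS with auxiliary variable r_bar ~ Ax - b >= 0.

    Stacks the weighted system so one NNLS solve enforces x >= 0 and r_bar >= 0:
        top rows:    w     * (A x         - b)  -> padding-weighted residual
        bottom rows: w_bar * (A x - r_bar - b)  -> pushes A x - b = r_bar >= 0
    """
    m, n = A.shape
    top = np.hstack([w[:, None] * A, np.zeros((m, m))])
    bottom = np.hstack([w_bar * A, -w_bar * np.eye(m)])
    z, _ = nnls(np.vstack([top, bottom]), np.concatenate([w * b, w_bar * b]))
    return z[:n], z[n:]  # strategy counts x, negated residual r_bar

# Toy system: 3 sequence lengths, 2 strategies.
A = np.array([[1., 0.], [0., 1.], [1., 1.]])
b = np.array([2., 3., 1.])
x, r_bar = extended_nnls(A, b, w=np.ones(3))
```

Because the second block carries the large weight, the solution satisfies $Ax \ge b$ up to numerical tolerance, i.e., no sequence is left unpacked before rounding.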

T IMPLEMENTATION CHALLENGES AND TRICKS

Whereas the model changes are described in Section 3.2, implementing them in the most efficient way can require a bit more effort. This section points out a few tricks that we used in our code.

T.1 PACKING ALGORITHMS

Whereas the packing algorithm implementations might look trivial, they can become quite intricate. For example, when splitting and distributing bins, such as combining 2 sequences of length 256 into a sequence of length 512, the number of categories, and thus the search space, can increase drastically. Hence, it is valuable to test each adjustment when changing the packing algorithms. If a solution is not delivered almost immediately, the algorithm has probably switched to a far less efficient complexity class.

T.2 POSITIONAL ENCODING

This approach was implemented as described in Section 3.2.1 by providing the index of each item with the data. Note that for any other part of BERT, the exact position does not matter. This allows us to rearrange the data to our advantage: we can start with the up to 72 mask tokens and have an additional mask that tells us which tokens are the mask tokens, a list that provides their true labels, and, via the positional encoding, we can determine their position in the sequence. The NSP tokens get moved from the beginnings of their sequences to the end.
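Concretely, the positional indices restart at 0 for every sequence inside a pack, so each sequence sees exactly the positions it would see unpacked (a minimal sketch):

```python
import numpy as np

def packed_positions(seq_lengths_in_pack):
    """Positional indices supplied with the data: they restart at 0 for each
    sequence in the pack, keeping the positional embedding lookup identical
    to the unpacked case."""
    return np.concatenate([np.arange(length) for length in seq_lengths_in_pack])

pos = packed_positions([3, 2, 4])
# -> [0, 1, 2, 0, 1, 0, 1, 2, 3]
```

Feeding these indices into the embedding lookup is the only positional-encoding change packing requires.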

T.3 ATTENTION

For the attention mask, we realised that creating it on the host can incur a major data-transfer cost due to its size. Instead, one can create the mask on the accelerator. To this end, we implemented a custom operation using C++ and PopArt: https://github.com/graphcore/examples/blob/master/nlp/bert/popart/custom_ops/attention_mask.cpp. Note that in most cases, the attention mask is not multiplied but added, for efficiency. Hence, our implementation uses the additive "softmax mask" instead of the multiplicative zero-one mask from Figure 2.
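The mask itself is block-diagonal in the per-pack sequence ids; a numpy sketch (the production version runs on the accelerator, and padding rows would need extra care before a softmax):

```python
import numpy as np

def attention_mask(seq_lengths, max_len):
    """Zero-one block-diagonal mask: position i may attend to position j only
    if both belong to the same sequence within the pack (cf. Figure 2).
    Padding positions get id -1 and attend to nothing."""
    seq_ids = np.concatenate(
        [np.full(length, k) for k, length in enumerate(seq_lengths)]
        + [np.full(max_len - sum(seq_lengths), -1)]
    )
    mask = (seq_ids[:, None] == seq_ids[None, :]) & (seq_ids[:, None] >= 0)
    return mask.astype(np.float32)

def additive_mask(seq_lengths, max_len, neg=-1e4):
    """Additive variant: 0 where attention is allowed, a large negative
    value (added before the softmax) where it is masked out."""
    return (1.0 - attention_mask(seq_lengths, max_len)) * neg

m = attention_mask([2, 3], max_len=6)
```

Only the vector of per-pack sequence lengths needs to be transferred to the device; the quadratic mask is materialized there.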

T.4 AVOIDING LOSS UNPACKING

Note that the MLM loss is applied at the token level and does not need any loss unpacking. For NSP, however, the NSP tokens would theoretically be distributed within a sequence. During dataset creation, we therefore rearranged the tokens and moved all NSP tokens to the end. Due to our packing strategy, we also know that these tokens are limited to a maximum of 3 per pack. Thus, we can apply the NSP head to the 3 potential positions and simply provide a mask to filter out the relevant NSP tokens. This way, we need much less memory and compute for the unpacking of the NSP loss.
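The idea can be sketched as follows (a toy sketch with a hypothetical zero-initialized linear NSP head; `n_sequences` holds the number of packed sequences per pack):

```python
import numpy as np

def nsp_logits_and_mask(hidden, n_sequences, max_sequences_per_pack=3):
    """hidden: [batch, seq_len, h] final-layer states, with the per-pack NSP
    (CLS) tokens moved to the *last* max_sequences_per_pack positions.
    Applies the NSP head only to those positions and returns a validity mask."""
    batch, _, h = hidden.shape
    candidates = hidden[:, -max_sequences_per_pack:, :]       # [batch, 3, h]
    W, bias = np.zeros((h, 2)), np.zeros(2)                   # hypothetical head
    logits = candidates @ W + bias                            # [batch, 3, 2]
    slots = np.arange(max_sequences_per_pack)
    mask = slots[None, :] < np.asarray(n_sequences)[:, None]  # valid NSP slots
    return logits, mask

hidden = np.zeros((2, 8, 4))
logits, mask = nsp_logits_and_mask(hidden, n_sequences=[1, 3])
```

Gathering 3 fixed positions instead of scattering over the whole pack is what saves the memory and compute mentioned above.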

T.5 TESTING

The ultimate approach to test the correctness of the implementation is to check whether packed and unpacked sequences produce the same values and gradients. Due to large numeric variations, we implemented this test in FP32 for our PyTorch HuggingFace implementation. This way, we could prove that, with the correct adjustments, unpacked sequences processed with vanilla BERT result in exactly the same losses and weight updates as the packed sequences processed with the modified packed BERT version.

T.6 LOSS BALANCING

This section addresses a challenge, called loss imbalance, that typically occurs with small batch sizes and manifests differently when running packed compared to vanilla BERT. It also translates to other scenarios where losses are averaged over data with a large amount and variance of underlying padding, or with variance in the number of underlying sequences/segments/components in a batch. This is highly relevant since model sizes keep increasing: already now, the micro-batch size when running BERT large on the IPU is 3, and for large-scale GPU training a batch size of 3 is used on a single GPU to limit the total batch size to 12960 aggregated over 4320 GPUs.⁶ The main question is: how much influence/weight in a gradient update does a single MLM token and a single NSP token get, and how does this change with batch size, packing, or other factors that would be expected to be invariant? Let us look at two extreme cases: batch size 1, and a batch consisting of the full dataset. Note that in the BERT model, we first take the mean over all MLM tokens and over all NSP tokens, and then add the two losses. For a batch size of 1, there are two extreme cases in the vanilla BERT setting. In case 1, we have 1 MLM token and 1 NSP token, so each token gets a weight of 1 in the final sum. In case 2, we have 76 MLM tokens and 1 NSP token, so each MLM token gets a weight of 1/76 in the overall loss/gradient/weight update, while the NSP token again gets a weight of 1. This means that the MLM tokens of short sequences get a weight of 1, reducing linearly down to 1/76 for the maximum sequence length. Thus, short sequences get more influence in the weight update, and the ratio of weights relative to NSP changes too, even though it is unclear how this ratio influences the final result. Let us assume perfect packing efficiency for packed BERT. Hence, we have 76 MLM tokens and a weight of 1/76 for the MLM tokens in every case, independent of the batch size.
However, with a maximum packing depth of 3, the number of NSP tokens can range between 1 and 3, and thus the weights can be 1, 1/2, or 1/3. This means that the NSP loss for a single sequence of length 512 gets 3 times more weight than the NSP loss of each sequence when packing 3 sequences, for example of length 170, together. Again, the ratio between NSP and MLM changes, too. Now let us look at the other extreme case of a batch being the full dataset of size L (which behaves similarly to the common case of a large batch size between 12k and 1000k). Again, for vanilla BERT, the NSP weight is 1/L in any case. Assuming 50% padding, which as previously shown can be common, and again a maximum of 76 MLM tokens per sequence, we get a total of 76 · 0.5 · L MLM tokens, with the respective reciprocal value as the weight. There is no variation: 76 · 0.5 is the average number of MLM tokens per sample. Assuming a packing factor of 2, the respective maximum batch size can only be L/2. This fits with our scheme of reducing the batch size to avoid further adjustments of hyperparameters. For packed BERT, the number of MLM tokens is doubled compared to the average case in vanilla BERT, and thus, assuming a packing efficiency of 100%, the weight is 1/(76 · 1.0 · (L/2)). The number of NSP tokens is 2 · (L/2), with the respective weight 1/L. Again there is no variation, and the weights between packed and vanilla BERT are identical. This seems closer to an ideal case that depends less on how samples are put together; it also ensures equivalence between the packed and vanilla setup. Getting the weights calculated correctly in a distributed setup (data-parallel processing as well as pipelining), where each replica has a small batch size down to 1, is challenging: each replica would need separate gradients for the NSP and MLM losses, then aggregate a weighted sum of those separate gradients, and only afterwards add up the gradients before the optimiser update.
This is infeasible because of challenges in framework implementations, a large increase in memory requirements, a rough doubling of the computational workload for backpropagation, and more than double the communication overhead for weights. We propose a simplified approach that generalizes the weights we observed for large batches to the weights in tiny batches: instead of averaging over the actual number of tokens, we average over the expected number of tokens. Technically, this means the mean aggregation gets replaced by a sum aggregation multiplied by a constant weight. Let $b$ be our batch size, $e$ the token efficiency, $p$ the packing factor, and $m$ the maximum number of MLM tokens in a sample. For vanilla BERT with sequence length 512, we have roughly $e = 0.5$, $p = 1$, $m = 76$, and for packed BERT, $e = 1$, $p = 2$, $m = 76$. Let $l_M^{i,k}$, $i \in I(k)$, $k \in \{1, \ldots, b\}$ be the active MLM losses and $l_N^{j,k}$, $j \in J(k)$, $k \in \{1, \ldots, b\}$ the active NSP losses in a sequence. Then we balance the MLM loss calculation as

$$\mathrm{balanced}(l_M) = \frac{\sum_{k=1}^{b} \sum_{i \in I(k)} l_M^{i,k}}{b \cdot e \cdot m},$$

and analogously for the NSP loss with denominator $b \cdot p$. Note that when logging the loss, it should be averaged over multiple batches to get a representative result comparable to previously obtained values. This approach is straightforward to implement in any framework, even though some fine-tuning might be required when working in low precision. In our experiments, loss balancing only reduced the noise in the NSP loss; other than that, it had no influence on the loss curves.
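The balancing rule above amounts to replacing each mean by a sum times a constant (a minimal sketch; the loss arrays are toy inputs):

```python
import numpy as np

def balanced_loss(mlm_losses, nsp_losses, b, e, p, m):
    """Sum aggregation with constant weights based on *expected* token counts:
    b*e*m MLM tokens and b*p NSP tokens per batch.
    b: batch size, e: token efficiency, p: packing factor,
    m: max MLM tokens per sample."""
    balanced_mlm = np.sum(mlm_losses) / (b * e * m)
    balanced_nsp = np.sum(nsp_losses) / (b * p)
    return balanced_mlm + balanced_nsp

# Vanilla-BERT-style setting (e=0.5, p=1, m=76), one sample with exactly
# the expected 38 active MLM tokens: both terms reduce to plain means.
loss = balanced_loss(np.ones(38), np.ones(1), b=1, e=0.5, p=1, m=76)
```

When the actual token counts match their expectations, this reproduces the usual mean losses exactly; otherwise it keeps the per-token weights constant across batches, which is the point of the method.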



Footnotes:
1. We avoid the ambiguous terms "bin" and "sample/sequence" and use "pack" instead to refer to the multiple sequences concatenated during packing.
2. For data distributions that are more skewed than Wikipedia, this might look different.
3. https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py
4. https://github.com/mlcommons/training/tree/master/rnn_speech_recognition/pytorch
5. https://huggingface.co/datasets/pubmed
6. https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/config_DGXA100_540x8x3x1_new.sh#L2



Figure 2: Attention mask code [left], respective zero-one mask [middle], and vectorized unpacking of the sequence loss [right]. White rectangles correspond to padding.

Figure 3: Comparison of learning curves for packed and unpacked processing, where all experiments converged to the target accuracy within the same number of training samples (3 million). [left] Same effective batch size (ebs is batch size times packing factor), [middle] different heuristic adjustments of the hyperparameters (batch size 1500 for all runs, such that the ebs for packed runs is 1500 * 2), and [right] realized speed-up from packing (in excess of the desired 2x). Further learning curves are provided in Section O.

(Curve legend for Figure 4: packed BERT; baseline; no pos. emb. adjustment.)

Figure 4: Comparison of learning curves with and without mask or positional embedding adjustment in our packed BERT approach. The grey accuracy baseline to reach is 72.1%.

Figure 5: Comparison of the theoretical speed-up as the number of accelerators is increased.

Contents of the appendix:
Canonical packing problem
Approximate bin packing problem
Definitions
Non-negative least squares histogram-packing
Discussion of residual weight choice
G Complexity analysis of the proposed packing approaches
G.1 Complexity analysis of non-negative least-squares histogram-packing
G.2 Complexity analysis of shortest-pack-first histogram-packing
H Performance comparison to greedy packing in T5
P Full pretraining of BERT base and large learning curves
Q Note on changing the sequence length for optimal packing
R Fine-tuned longest-pack-first histogram-packing
S Extended NNLS with padding token weighting
T Implementation challenges and tricks
T.1 Packing algorithms
T.2 Positional encoding
T.3 Attention
T.4 Avoiding loss unpacking
T.5 Testing
T.6 Loss balancing

Figure 6: Left: Speed-up from un-padding on 8 GPUs closely resembles a Gumbel distribution. Right: statistical estimate of speed-up distribution on a 1 GPU system running un-padding

import numpy as np
from scipy import stats, optimize

# Gumbel parameters measured on the 8-GPU system.
alpha_8 = 1.6038
beta_8 = 0.1288
n_gpu = 8

def g(x):
    alpha_1, beta_1 = x
    dist = stats.gumbel_r(loc=alpha_1, scale=beta_1)
    # Equations for median and mode
    median = alpha_8 - beta_8 * np.log(np.log(2))
    equation1 = 0.5 - dist.sf(median) ** n_gpu
    mode = alpha_8
    equation2 = (1 - np.exp(-1)) - dist.sf(mode) ** n_gpu
    return equation1 ** 2 + equation2 ** 2

res = optimize.minimize(g, [alpha_8, beta_8], method="Nelder-Mead")
alpha_1, beta_1 = res.x

$$\min_{y \in \{0,1\}^n,\ B \in \{0,1\}^{n \times n}} \sum_{j=1}^{n} y_j \qquad \text{Minimize the number of bins.}$$
$$\text{s.t.} \quad \sum_{j=1}^{n} b_{ij} = 1 \ \ \forall i \qquad \text{Assign each length/sequence to exactly one bin.}$$
$$\sum_{i=1}^{n} s(i)\, b_{ij} \le s_m y_j \ \ \forall j \qquad \text{Cumulative length cannot exceed capacity.}$$

Figure 7: Visualization of the residual of the NNLS packing problem

Figure 8: Visualization of the weighted residual of the NNLS packing problem

Figure 9: Sequence length distributions of the Wikipedia BERT pre-training dataset for different maximum sequence lengths, and the corresponding theoretical speed-up.

Figure 10: Speed-ups achieved by LPFHP for different maximum sequence length and maximum number of packed sequences.

Figure 11: SQuAD 1.1 BERT pre-training dataset sequence length histogram for maximum sequence length of 384.

Figure 12: GLUE dataset sequence length histograms for maximum sequence length of 128.

Figure 13: LibriSpeech sequence length histograms of preprocessed audio data [top] as well as target text data [bottom].

Figure 14: Abstract length distribution in PubMed.

Figure 15: Comparison of learning curves for packed and unpacked processing with reduced batch size for the packed approach.

Figure 16: Comparison of learning curves for packed and unpacked processing with heuristics applied.

Figure 17: Comparison of learning curves for packed and unpacked processing in the optimized setup.

Figure 18: Comparison of learning curves for BERT base phase 1 (sequence length 128) with packed and unpacked processing.

Figure 19: Comparison of learning curves for BERT base phase 2 (sequence length 384) with packed and unpacked processing.

Figure 20: Comparison of learning curves for BERT large phase 1 (sequence length 128) with packed and unpacked processing.

Figure 21: Comparison of learning curves for BERT large phase 2 (sequence length 384) with packed and unpacked processing.

$$\mathrm{balanced}(l_N) = \frac{\sum_{k=1}^{b} \sum_{j \in J(k)} l_N^{j,k}}{b \cdot p}.$$

Key performance results of proposed packing algorithms (SPFHP and NNLSHP) on IPU.

Measured speed-ups in BERT pretraining with packing.

SQuAD 1.1 scores after BERT pretraining with packing.



Example packing matrix for sequence length 8. Columns represent different kinds of packs; for instance, one column encodes packing two length-1 sequences and one length-6 sequence together to form a pack of length 8. The number of strategies (and columns in the matrix) is discussed in Section G.

Performance results of proposed packing algorithms for SQuAD 1.1 BERT pre-training.

Performance results of proposed packing algorithms for the GLUE dataset. Only the baseline and the SPFHP packing results without limiting the packing depth are displayed.

In contrast to SPFHP, we additionally consider splitting the histogram count if it fits multiple times. A simple example is sequence length 256: we divide the respective histogram count by 2 to create the optimal pack with strategy [256, 256] instead of the strategy [256]. The latter strategy would be complemented by other sequences but would probably not result in an optimal packing. The implementation of this approach is much more complex than the SPFHP implementation. The code is provided in Listing 8 and the results in Table 7.

Table 7: Performance results of longest-pack-first histogram-packing for Wikipedia BERT pre-training with maximum sequence length 512.


Listing 3: Shortest-pack-first histogram-packing

from collections import defaultdict
import numpy as np

def add_pack(pack, count, tmp, final, limit, offset):
    """Filter out packs that reached maximum length or number of sequences."""
    if len(pack) == limit or offset == 0:
        final[offset].append((count, pack))
    else:
        tmp[offset].append((count, pack))

def pack_using_spfhp(histogram, max_sequence_length, max_sequences_per_pack):
    """Shortest-pack-first histogram-packing algorithm."""
    reversed_histogram = np.flip(histogram)
    # Initialize main strategy data dictionary.
    # The key indicates how many tokens are left for full length.
    # The value is a list of tuples, consisting of counts and respective packs.
    # A pack is a (sorted) list of sequence length values that get concatenated.

J WIKIPEDIA WITH LONGER SEQUENCE LENGTH

The histogram raw data for Wikipedia with different maximum sequence lengths is provided in Listing 6 and visualized in Figure 9. We can see that with increasing maximum sequence length, long sequences become more and more rare, and the resulting benefits from packing increase drastically. Keeping in mind that the BERT dataset generation process reduces the length of up to 50% of the sequences, we can infer that a different dataset generator that truncated all short sequences would result in a significant loss of data (> 25% for length 512).

