MQSP: MICRO-QUERY SEQUENCE PARALLELISM FOR LINEARLY SCALING LONG SEQUENCE TRANSFORMER

Abstract

Long-sequence modeling with Transformers is gaining prevalence in fields involving long texts and high-resolution images and videos, but it suffers from quadratic memory complexity. Existing work investigates low-complexity variants or parallel methods to handle it. The former attempts to approximate full attention and is limited by a single device's capacity. The latter struggles to manage the quadratic memory of attention maps, leading to insufficient sequence scalability. In this work, we propose a novel parallel method named Micro-Query Sequence Parallelism (MQSP). MQSP slices sequences across devices and projects local queries, keys, and values in self-attention. For communication and memory efficiency, MQSP all-gathers the queries while keys and values remain local to acquire the local attention map, on which a distributed softmax is conducted to amortize memory by column. Meanwhile, the queries are further partitioned into Micro-Q to divide the computation and recycle the attention map by row, jointly decomposing the quadratic memory to achieve linear scalability. The evaluation results show that MQSP scales up sequence length linearly and achieves 4.5× the sequence length of ColossalAI's sequence parallelism and 4.3× that of Megatron-LM3, enabling training BERT-large with a sequence length of 78848 on 32 A100 GPUs. MQSP can reduce memory occupation by up to 78.6% and achieve up to 3.3× throughput when training with a sequence length of 17408. The convergence quality experiment shows that MQSP provides the means for long sequences with guaranteed convergence, bringing the potential for the Transformer to explore longer sequences.

1. INTRODUCTION

The Transformer (Vaswani et al., 2017), an attention-based model initially proposed for natural language processing (NLP), shows promising potential in computer vision (CV), multi-modality, and more (Carion et al., 2020; Dosovitskiy et al., 2021; Liu et al., 2022; Arnab et al., 2021; Neimark et al., 2021; Radford et al., 2021; Wang et al., 2022). The self-attention associations between arbitrary pairs of tokens enable the Transformer to learn global context-aware sequence representations for many modalities. Furthermore, there is an emerging trend toward long-range modeling, which scales up the sequence length of the Transformer. Long-range modeling is essential for the long texts in question answering, document classification, and other NLP tasks, as well as for high-resolution pictures and series of video frames in the image modality. As the sequence gets extended, memory consumption increases rapidly due to the quadratic complexity of self-attention, inevitably exceeding the limit of a single device (e.g., GPU). This problem obstructs the exploration of the Transformer for modeling longer sequences.

Recently, researchers have focused on sparse mechanisms in self-attention (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020) or low-complexity substitutes (Wang et al., 2020; Xiong et al., 2021; Choromanski et al., 2021; Qin et al., 2022) to approximate full attention, boosting sequence length. However, besides concerns about the impact on model performance, the memory upper bound of a single device still limits those methods from scaling up further. Much existing work investigates parallel methods to distribute the Transformer model across devices, such as tensor parallelism (Shoeybi et al., 2019), pipeline parallelism (Harlap et al., 2018; Huang et al., 2019; Narayanan et al., 2021; Yang et al., 2022), and the zero redundancy optimizer (ZeRO) (Rajbhandari et al., 2020; Ren et al., 2021). However, these methods separate the model parameters across different dimensions and do not alleviate the enormous intermediate activations introduced by long-sequence self-attention.

Therefore, several more recent methods are devoted to partitioning the Transformer along the sequence dimension. ColossalAI's sequence parallelism (Li et al., 2021) transfers keys and values in ring style to compute the partial attention map. Megatron-LM3 (Korthikanti et al., 2022) modifies the conjugate operators to parallelize layer normalization and dropout along the sequence dimension. Despite amortizing some intermediate activations, these approaches still consume local memory proportional to the global sequence length, making it difficult to scale up to longer sequences.

In this paper, we propose MQSP, a novel sequence parallel method, to diminish the quadratic memory overhead and efficiently scale up the long-sequence Transformer. MQSP splits the input sequence across n devices for parallel computation and projects local queries, keys, and values in self-attention. We design a distributed self-attention to handle the global context-awareness in self-attention. Specifically, MQSP synchronizes the queries across the devices through an all-gather operation while the keys and values remain local, acquiring local attention maps amortized along the column dimension. Since the rows for softmax are incomplete locally, we conduct a distributed softmax with hierarchical reduction across the devices, which introduces negligible communication. More importantly, we divide local queries into m finer-grained queries, called Micro-Q, and process them step by step to get the corresponding attention outputs. The memory space for the attention map is shared by all micro-steps along the row dimension, jointly decomposing the quadratic memory so that the sequence length scales up linearly.

The proposed MQSP shows significant sequence scaling ability compared with the existing leading approaches. Our evaluation indicates that for the Transformer BERT-large (Kenton & Toutanova, 2019), MQSP could scale up to a sequence length of 78848 with 32 A100 GPUs, 4.5× that of ColossalAI's sequence parallelism, and to a sequence length of 38912 with 16 A100 GPUs, 4.3× that of Megatron-LM3. In the memory usage and throughput comparison, MQSP saves 78.6% memory and achieves a 3.3× speedup, demonstrating the superiority of its distributed attention and communication method. Furthermore, the convergence quality experiments on the WikiText, SQuAD, QQP, and MRPC datasets show that MQSP maintains the convergence quality when scaling up sequence length, bringing the prospect of exploring longer sequences to Transformers.

2. RELATED WORK

This section briefly introduces the self-attention complexity and the parallel methods for Transformer.

Self-Attention Complexity. In the Transformer proposed by Vaswani et al. (2017), self-attention is the vital module for modeling global dependencies. We omit the batch and multi-head dimensions for brevity. For the input hidden states $x \in \mathbb{R}^{L \times d_m}$, where $L$ is the sequence length and $d_m$ is the model hidden size, the attention mechanism can be formulated as:

$$Q, K, V = L_{qkv}(x), \quad C = \mathrm{softmax}(S)V = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \quad (1)$$

where $L_{qkv}$ is the linear layer projecting the tokens to $Q, K \in \mathbb{R}^{L \times d_k}$ and $V \in \mathbb{R}^{L \times d_v}$ in the query, key, and value embedding spaces. The scaled dot product of $Q$ and $K$ produces the attention scores map $S \in \mathbb{R}^{L \times L}$, which incurs the quadratic complexity $O(L^2)$. Subsequently, the softmax operation along the rows converts $S$ to attention probabilities $P$, which reweights $V$ to the context output $C$.

Many recent approaches introduce sparse mechanisms, such as Sparse Transformer (Child et al., 2019), Longformer (Beltagy et al., 2020), and BigBird (Zaheer et al., 2020), or low-complexity substitutes, such as Linformer (Wang et al., 2020), Nyströmformer (Xiong et al., 2021), Performer (Choromanski et al., 2021), and Cosformer (Qin et al., 2022). These methods algorithmically approximate the full attention with sparsity or low-rank assumptions, reducing complexity to $O(L \log L)$ or $O(L)$. However, when these assumptions do not hold, performance can degrade across a broad spectrum of tasks, and accommodating the entire Transformer within one device still limits further expansion of the sequence length. Thus we set our sights on scaling up the standard Transformer through more devices.

Parallel Methods for Transformer. Parallelism approaches have been innovative techniques for training large Transformers.
Pipeline parallelisms (Harlap et al., 2018; Huang et al., 2019; Narayanan et al., 2021; Yang et al., 2022) split the model layerwise without handling self-attention within layers. ZeRO (Rajbhandari et al., 2020) decomposes the continuous linear layers and divides self-attention along multi-heads dimension (Fig. 1a). The methods above deal with model parameters but not the tremendous intermediate activations in self-attention, which are quadratic in the sequence length, so they are insufficient for scaling up the sequence.

Therefore, slicing along the sequence dimension comes to mind for scaling the long-sequence Transformer. Intuitively, the input sequence $x$ could be split into $n$ chunks, $x \to \{x_0, x_1, \dots, x_{n-1}\}$, where $x_i \in \mathbb{R}^{\frac{L}{n} \times d_m}$, and fed into $n$ devices to compute in parallel. With similar insight, ColossalAI (Bian et al., 2021) proposed sequence parallelism (Li et al., 2021), or ColAISP for short. Considering the global associations of local query/key/value embeddings, which are projected in self-attention as $Q_i, K_i, V_i = L_{qkv}(x_i)$, ColAISP designs Ring Self-Attention as:

$$C_i = \mathrm{softmax}(S_{i,:})V = \mathrm{softmax}\left(\frac{Q_i \overbrace{[K_0^\top, K_1^\top, \dots, K_{n-1}^\top]}^{\mathrm{RingQK}}}{\sqrt{d_k}}\right) \overbrace{[V_0^\top, V_1^\top, \dots, V_{n-1}^\top]^\top}^{\mathrm{RingAV}} \quad (2)$$

By ring-style transferring $K_i$ and $V_i$, ColAISP circularly computes $Q_i K_j^\top$ to collect the partial attention scores map $S_{i,:} \in \mathbb{R}^{\frac{L}{n} \times L}$ (Fig. 1b). Because this per-device map still grows as $L^2/n$, ColAISP requires the number of devices $n$ to grow quadratically in order to scale up $L$. Additionally, the efficiency of ring communication suffers from the weakest link, e.g., inter-node bandwidth. Megatron-LM3 (Korthikanti et al., 2022) modifies its conjugate operators to all-gather/reduce-scatter to introduce sequence parallelism in layer-norms and dropouts, yet leaves the self-attention in tensor parallel mode, resulting in the same quadratic attention maps.

3. METHOD

This section introduces the proposed Micro-Query Sequence Parallelism. We analyze the communication patterns in self-attention and propose Micro-Q for reused memory. Then the distributed softmax is described. Furthermore, we compare memory usage with other methods and analyze scalability.

3.1. MICRO-QUERY SEQUENCE PARALLELISM

Instead of dividing model parameters along different dimensions as in previous parallel methods, we focus on the sequence dimension that affects intermediate activations. When the input sequences are partitioned and fed to devices in parallel, most modules, such as the multi-layer perceptron (MLP), dropout, and layer normalization, are computed independently of the sequence dimension. However, as illustrated in Fig. 2, self-attention across devices does not natively support sequence parallelism. In MQSP, we dive into the self-attention mechanism to attain distributed attention with efficient communication and memory usage.

Attention Analysis. Recalling the general form of self-attention in Eq. 1, the self-attention for each query can be analyzed independently. Assuming the $k$-th query in $Q$ is $q_k \in \mathbb{R}^{1 \times d_k}$, the corresponding context $c_k$ is calculated as:

$$c_k = \mathrm{softmax}(S_{\mathrm{row}_k,:})V = \mathrm{softmax}\left(\frac{q_k K^\top}{\sqrt{d_k}}\right)V \quad (3)$$

where $S_{\mathrm{row}_k,:} \in \mathbb{R}^{1 \times L}$ represents the $k$-th row of the attention scores map. ColAISP transfers $K_i$ as a ring and multiplies it with $q_k$ to collect the complete $S_{\mathrm{row}_k,:}$, which is inefficient in a bandwidth-imbalanced environment and incurs a memory overhead with a factor of $L$. In contrast, MQSP chooses to share $q_k$ across the devices, keeping $K_i$ and $V_i$ local. In this way, each rank only needs to compute $S_{\mathrm{row}_k,i} \in \mathbb{R}^{1 \times \frac{L}{n}}$, its corresponding columns of attention scores:

$$c_k = \sum_{i=0}^{n-1} \text{d-softmax}(S_{\mathrm{row}_k,i})V_i = \bar{f}\left(\text{d-softmax}\left(\frac{f(q_k)K_i^\top}{\sqrt{d_k}}\right)V_i\right) \quad (4)$$

where d-softmax means the distributed softmax operation to acquire the corresponding columns of attention probabilities, further described in section 3.2. For the communications required here, we define $f$ as the synchronization of $q_k$ across devices and $\bar{f}$ as the reduce-summation of the locally reweighted $V_i$, e.g., broadcast/reduce or all-gather/reduce-scatter, as shown in Fig. 4. $f$ and $\bar{f}$ are conjugate, which means the forward pass of one equals the backward pass of the other. Considering the symmetric workload across devices and the efficiency of duplex collective communication, we adopt all-gather/reduce-scatter as the conjugate operators to demonstrate our method (Fig. 3).

Micro-Q. The complete attention includes the whole sequence $q_{k \in [0,L)}$ in Eq. 4, leading to a still sizeable attention scores map $S_{:,i} \in \mathbb{R}^{L \times \frac{L}{n}}$. To this end, we propose Micro-Q, the finer-grained query, to reduce memory consumption orthogonally. For a number $m$ of Micro-Q, the local query $Q_i$ gets chunked as $\{q_i^0, \dots, q_i^{m-1}\}$, as depicted in the red box of Fig. 3, where $q_i^j \in \mathbb{R}^{\frac{L}{mn} \times d_k}$ is the $j$-th micro-step's input on the $i$-th device. The following $f$ all-gathers $q^j_{i \in [0,n)}$ to form the shared and concatenated $Q^j \in \mathbb{R}^{\frac{L}{m} \times d_k}$. Hence the distributed attention in a minor range could be conducted as:

$$C_i^j = \text{d-softmax}(S_i^j)V_i = \text{d-softmax}\left(\frac{Q^j K_i^\top}{\sqrt{d_k}}\right)V_i \quad (5)$$

where $C_i^j$ is the sub-context matching $Q^j$ on the $i$-th device. Each micro-step's attention scores map is $S_i^j \in \mathbb{R}^{\frac{L}{m} \times \frac{L}{n}}$, as illustrated in Fig. 1c. The memory space could be reused across the micro-steps, implemented through the checkpointing technique (Chen et al., 2016). The existing methods typically employ layerwise checkpointing, not addressing the excess intermediate activations within a single layer, while MQSP divides them into finer-grained partitions. At last, $\bar{f}$ conducts reduce-scatter to sum up the sub-contexts $C_i^j$ back to their ranks, as the reduced micro-context $c_i^j$. Each rank concatenates micro-contexts to produce its local context $C_i$.

Comparison. Compared with $\mathbb{R}^{\frac{L}{n} \times L}$ in ColAISP, MQSP requires only $\mathbb{R}^{\frac{L}{m} \times \frac{L}{n}}$ memory space for the attention map. The coefficient $n$ along the column axis and the coefficient $m$ along the row axis jointly break down the quadratic memory of the Transformer, enhancing the scalability of sequence length.
Regarding communication, ColAISP transfers keys and values in rings while MQSP collectively transfers queries and contexts, incurring comparable communication volumes. However, in a heterogeneous network environment, e.g., a cluster of 4 nodes × 8 GPUs with NVLink, the ring-style ColAISP suffers from the bottleneck of the low inter-node bandwidth. In contrast, the duplex collective communication of MQSP benefits from the hierarchical NCCL (Nvidia). Additionally, if ColAISP adopted the Micro-Q method to reduce memory overhead, the ring communication would be repeated m times, rather than being amortized as in MQSP. Since the inputs and outputs are uncorrelated among the micro-steps, we could alleviate communication overhead by overlapping it with the other steps' computation, further improving the efficiency of distributed self-attention (Appx. A.3).
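The dataflow of Eq. 5 can be checked numerically without any GPUs. The following is a minimal single-process sketch, not the authors' implementation: the "devices" are plain Python lists, the helper names (`mqsp_attention`, `full_attention`, `rand_matrix`) are hypothetical, and the collectives are replaced by ordinary loops. It assumes L is divisible by m·n. It illustrates that splitting columns over n devices and rows over m micro-steps yields the same context as full attention.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def full_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d_k)) V computed on one device."""
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [dot(q, k) / scale for k in K]
        top = max(s)
        e = [math.exp(x - top) for x in s]
        z = sum(e)
        out.append([sum(e[t] / z * V[t][c] for t in range(len(V)))
                    for c in range(len(V[0]))])
    return out


def mqsp_attention(Q, K, V, n, m):
    """Single-process simulation of n devices and m micro-steps."""
    L, d_v = len(Q), len(V[0])
    scale = math.sqrt(len(Q[0]))
    chunk = L // n       # tokens held by each device
    micro = chunk // m   # queries each device contributes per micro-step
    Q_loc = [Q[i * chunk:(i + 1) * chunk] for i in range(n)]
    K_loc = [K[i * chunk:(i + 1) * chunk] for i in range(n)]
    V_loc = [V[i * chunk:(i + 1) * chunk] for i in range(n)]
    C_loc = [[] for _ in range(n)]
    for j in range(m):
        # f: all-gather the j-th Micro-Q of every device into Q^j (L/m rows).
        Qj = [q for i in range(n) for q in Q_loc[i][j * micro:(j + 1) * micro]]
        rows = range(len(Qj))
        # Local score blocks S_i^j of shape (L/m) x (L/n); K_i never moves.
        S = [[[dot(q, k) / scale for k in K_loc[i]] for q in Qj]
             for i in range(n)]
        # d-softmax: reduce the row maximum, then the row sum, across devices.
        top = [max(max(S[i][r]) for i in range(n)) for r in rows]
        E = [[[math.exp(x - top[r]) for x in S[i][r]] for r in rows]
             for i in range(n)]
        z = [sum(sum(E[i][r]) for i in range(n)) for r in rows]
        # Partial contexts C_i^j = P_i^j V_i, one full-height copy per device.
        part = [[[sum(E[i][r][t] / z[r] * V_loc[i][t][c] for t in range(chunk))
                  for c in range(d_v)] for r in rows] for i in range(n)]
        # f-bar: reduce-scatter sums over devices and returns rows to owners.
        for i in range(n):
            for r in range(i * micro, (i + 1) * micro):
                C_loc[i].append([sum(part[d][r][c] for d in range(n))
                                 for c in range(d_v)])
    return [row for i in range(n) for row in C_loc[i]]


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
Q, K, V = rand_matrix(8, 4), rand_matrix(8, 4), rand_matrix(8, 4)
ref = full_attention(Q, K, V)
out = mqsp_attention(Q, K, V, n=2, m=2)
max_err = max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
```

In this sketch only one block `S` of shape (L/m) × (L/n) per device exists at any time, which is the memory-reuse property the Micro-Q mechanism relies on. A real implementation would additionally need recomputation (checkpointing) in the backward pass.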

3.2. DISTRIBUTED SOFTMAX

This section introduces the low-cost distributed softmax in MQSP. A similar technique could be found in ArcFace (Deng et al., 2019) to recognize large-scale faces. ArcFace sums up the local denominator in the forward pass, while the backward formula is simplified with a cross-entropy loss. Here we analyze the general softmax. Defining the whole sequence of scores as $[s_0, s_1, \dots, s_{L-1}]$, with $\sigma = \max(s_{i \in [0,L)})$, and the probabilities as $[p_0, p_1, \dots, p_{L-1}]$, the original form is formulated as:

$$p_i = \frac{\exp(s_i - \sigma)}{\sum_{j=0}^{L-1} \exp(s_j - \sigma)}, \quad \nabla s_i = \nabla p_i \times p_i(1 - p_i) + \sum_{j \neq i}^{L-1} \nabla p_j \times (-p_i p_j) \quad (6)$$

In MQSP, the scores are distributed across devices, which incurs $O(\frac{n-1}{n}L)$ communication cost and $O(L)$ memory overhead in the form of Eq. 6. To this end, we convert the arithmetic form and hierarchically reduce the maximum or summation, to attain an efficient distributed softmax:

$$p_i = \frac{\theta_i}{r_{sum}(\Theta_i)}, \quad \nabla s_i = \lambda_i - p_i \times r_{sum}(\Lambda_i) \quad (7)$$

where $r_{sum}$ means the reduce-sum operation, $\theta_i = \exp(s_i - \sigma)$ and its local summation is $\Theta_i$, and $\lambda_i = \nabla p_i \times p_i$ and its local summation is $\Lambda_i$.

Table 1:
Method | Model parameters | Intermediate activations
Vanilla Transformer | $(8 + \frac{4}{H})D^2$ | $10BDL + 2BHL^2$
Megatron-LM3 | $(8 + \frac{4}{H})\frac{D^2}{n}$ | $(2 + \frac{8}{n})BDL + 2BH\frac{L^2}{n}$
ColAISP | $(8 + \frac{4}{H})D^2$ | $11BD\frac{L}{n} + 2BH\frac{L^2}{n}$
MQSP | $(8 + \frac{4}{H})D^2$ | $10BD\frac{L}{n} + BD\frac{L}{m} + 2BH\frac{L^2}{mn}$

3.3. MEMORY USAGE ANALYSIS

This section analyzes the memory usage of the model parameters and intermediate activations in a single Transformer layer. We omit the gradients and optimizer states proportional to the model parameters and the statistical buffers or masks in dropout and layer normalization. $B$, $H$, $L$, $n$, and $m$ represent the batch size, multi-head size, sequence length, device number, and Micro-Q number, respectively. To be concise, we assume $D = d_m = Hd_v = Hd_k$, consistent with most implementations. As for the model parameters, the linear layers in MLP take $8D^2$, and the $L_{qkv}$ and output layer in self-attention take $\frac{4D^2}{H}$. It is the same for the vanilla Transformer and the sequence parallel methods, while Megatron-LM3 divides it by $n$. For the intermediate activations, the MLP takes $5BDL$, and the self-attention takes $5BLD + 2BHL^2$. MQSP divides the MLP part by $n$ and introduces the $\frac{1}{mn}$ factor in the attention map and an all-gathered Micro-Q buffer $BD\frac{L}{m}$. We apply the same memory analysis to Megatron-LM3 and ColAISP (Appx. A.4), as shown in Tbl. 1.

According to the analysis, tensor parallelism has memory superiority in model parameters, yet the intermediate activations consume most of the memory in the long-sequence Transformer. The activations of previous methods include $L$ or $\frac{L^2}{n}$ factors, which limit their sequence scalability. In contrast, MQSP adjusts the Micro-Q granularity $m$, attaining efficient memory usage. Denoting the remaining memory upper bound for intermediate activations as $M$ and setting $m$ equal to $n$:

$$m = n, \quad 10BD\frac{L}{n} + BD\frac{L}{m} + 2BH\frac{L^2}{mn} \leq M \;\Rightarrow\; L \leq \left(\sqrt{\left(\frac{11D}{4H}\right)^2 + \frac{M}{2BH}} - \frac{11D}{4H}\right)n = Cn \quad (8)$$

The maximum $L$ grows proportionally to $n$ with a constant $C$, demonstrating the linear scalability of MQSP. Each Micro-Q must include at least one query, i.e., $\frac{L}{mn} \geq 1$, so Eq. 8 holds under the condition $n \leq C$, $L \leq C^2$. Moreover, owing to the flexibility of the query granularity, we can set a larger $m$ to obtain finer-grained Micro-Q and save memory, attaining further sequence length scaling.
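The linear bound can be made concrete by solving the activation inequality directly. With m = n and x = L/n, the MQSP activation expression becomes the quadratic 2BH·x² + 11BD·x ≤ M, whose positive root is the per-device constant C. The sketch below is illustrative only: the function names are hypothetical, D = 1024 and H = 16 are the BERT-large sizes, B = 16 follows the evaluation setup, and the budget M (in elements, not bytes) is an arbitrary assumed value.

```python
import math


def activation_elems(B, D, H, L, n, m):
    """MQSP intermediate activations of one layer, in elements."""
    return 10 * B * D * L / n + B * D * L / m + 2 * B * H * L * L / (m * n)


def per_device_constant(B, D, H, M):
    """Positive root x = L/n of 2BH x^2 + 11BD x = M (the m = n case)."""
    a = 11 * D / (4 * H)
    return math.sqrt(a * a + M / (2 * B * H)) - a


B, D, H = 16, 1024, 16      # batch size, hidden size, number of heads
M = 4e9                     # assumed activation budget in elements (hypothetical)
C = per_device_constant(B, D, H, M)

# The longest sequence that fits grows linearly with the device count n.
max_len = {n: C * n for n in (8, 16, 32)}
```

Plugging `max_len[n]` back into `activation_elems` with m = n returns the budget M for every n, which is the sense in which the maximum sequence length is proportional to the number of devices.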

4. EVALUATIONS

This section evaluates the proposed MQSP, verifying its convergence quality and comparing it with other parallel methods in sequence length scalability, memory footprint, and throughput. We further investigate the influence of the Micro-Q setting. Specifically, we implement MQSP with PyTorch 1.9 (Paszke et al., 2019), referencing the BERT (Kenton & Toutanova, 2019) implementation in HuggingFace Transformers (Wolf et al., 2020). The experimental hardware is a private cluster in which each node contains an Intel(R) Xeon(R) Platinum 8369B CPU, 760 GB of RAM, and eight 80-GB A100 GPUs with NVLink, giving an intra-node bandwidth 96 times the inter-node bandwidth (300 GB/s vs. 25 Gb/s).
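The 96× bandwidth gap follows from a unit conversion, assuming "GBps" denotes gigabytes per second and "Gbps" denotes gigabits per second:

```python
BITS_PER_BYTE = 8

intra_node_gbit_per_s = 300 * BITS_PER_BYTE   # 300 GB/s over NVLink, in Gb/s
inter_node_gbit_per_s = 25                    # 25 Gb/s between nodes

bandwidth_ratio = intra_node_gbit_per_s / inter_node_gbit_per_s
print(bandwidth_ratio)  # 96.0
```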

4.1. QUALITY OF CONVERGENCE

MQSP distributes the Transformer along the sequence dimension to scale up long sequences with the full attention of the vanilla Transformer. To verify the convergence of MQSP, we experiment with our MQSP and distributed data parallel (DDP) on datasets including WikiText-103 (Stephen et al., 2017), SQuAD (Rajpurkar et al., 2016), QQP, and MRPC (Wang et al., 2018).

This subsection demonstrates the superior sequence length scalability of the proposed MQSP. We compare the maximum sequence lengths achieved by the Transformer using different distributed approaches as the number of devices increases. We evaluate the methods by training the BERT-large model (Kenton & Toutanova, 2019). We use a batch size of 16 and the Adam optimizer following Kenton & Toutanova. Notably, because the checkpointing technique in the Micro-Q mechanism erases layers' stacked self-attention activations, we apply layerwise checkpointing to eliminate the accumulated activations across the Transformer layers for a fair comparison. Furthermore, for each particular number of devices n, we configure MQSP in two ways: a) MQSP-eq, where m = n for linear scalability as analyzed in section 3.3, and b) MQSP-mem, where m = L/n as the finest-grained Micro-Q to investigate the largest scaling potential. MQSP-mem serves as an upper bound reference; for more about the Micro-Q setting, please refer to section 4.5.

As shown in Fig. 6, the maximum sequence lengths of these methods are measured on $2^n$ GPUs, ranging from 1 to 32. The data for Megatron-LM3 is limited to $2^4$ GPUs because the tensor parallel size must divide the multi-head dimension, which is 16 in BERT-large. In comparison, the scalability of the sequence parallel methods is less constrained. We observe that MQSP-eq achieves 3.2× and 4.3× longer sequences than ColAISP and Megatron-LM3 (38912 vs. 12160/8960) on $2^4$ GPUs, and 4.5× longer than ColAISP (78848 vs. 17408) on $2^5$ GPUs. MQSP-eq requires only a quarter of the GPUs to achieve the maximum sequence of ColAISP and Megatron-LM3, denoted by the dotted arrow lines. As the number of GPUs increases, Megatron-LM3 and ColAISP climb at a consistently decreasing rate, while MQSP-eq scales up almost linearly with a slope of C = 2464. As analyzed in section 3.3, MQSP maintains linear scalability under the condition of $n \leq 2464$, $L \leq 2464^2$, an upper bound that is hardly attainable in practice. This demonstrates that MQSP resolves the quadratic memory overhead in the long-sequence Transformer and achieves superior sequence length scalability. Moreover, MQSP-mem achieves further scalability, about 2× compared with MQSP-eq, even on a single device (6144 vs. 2688). This shows that the flexibility of Micro-Q brings the potential for more extended sequences, orthogonal to sequence parallelism.
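The reported speedup factors and the slope can be re-derived from the maximum sequence lengths quoted above. The short check below uses only those numbers; the dictionary keys are illustrative labels, not identifiers from the paper.

```python
# Maximum sequence lengths quoted in the text, keyed by (method, GPU count).
max_seq = {
    ("MQSP-eq", 16): 38912,
    ("ColAISP", 16): 12160,
    ("Megatron-LM3", 16): 8960,
    ("MQSP-eq", 32): 78848,
    ("ColAISP", 32): 17408,
    ("MQSP-mem", 1): 6144,
    ("MQSP-eq", 1): 2688,
}


def gain(a, b):
    """Ratio of two reported maximum sequence lengths, rounded to one decimal."""
    return round(max_seq[a] / max_seq[b], 1)


vs_colaisp_16 = gain(("MQSP-eq", 16), ("ColAISP", 16))        # 3.2
vs_megatron_16 = gain(("MQSP-eq", 16), ("Megatron-LM3", 16))  # 4.3
vs_colaisp_32 = gain(("MQSP-eq", 32), ("ColAISP", 32))        # 4.5
mem_vs_eq_1 = gain(("MQSP-mem", 1), ("MQSP-eq", 1))           # 2.3, "about 2x"

# Tokens per GPU for MQSP-eq: close to constant, i.e. near-linear scaling.
slope_32 = max_seq[("MQSP-eq", 32)] / 32   # 2464.0
slope_16 = max_seq[("MQSP-eq", 16)] / 16   # 2432.0
```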

4.3. MEMORY FOOTPRINT

In the sequence length experiment, MQSP achieves longer sequences with superior scalability, reflecting the advantage of MQSP's low memory footprint. Here, we conduct a more specific memory footprint comparison. Inheriting the model and training settings from the previous section, we arrange the experimental environments as 1n8g, 2n16g, and 4n32g, where XnYg represents Y GPUs across X nodes in our cluster. MQSP below represents MQSP-eq, the linear scalability setting. In addition, since the checkpointing in Micro-Q erases most stacked activations, we also evaluate MQSP without layer checkpointing, w/o l ckpt for short, to compare the memory and throughput. With consistent hardware resources, we scale up the sequence length and compare the maximum allocated memory during training, as shown in the top row of Fig. 7.

Consider an arbitrary row of the attention map of the scores and probabilities as $S_{\mathrm{row}_k,:}, P_{\mathrm{row}_k,:} \in \mathbb{R}^{1 \times L}$. To be specific, they include $[s_0, s_1, \dots, s_{L-1}]$ and $[p_0, p_1, \dots, p_{L-1}]$. For a numerically stable softmax, the maximum value $\sigma = \max(s_{i \in [0,L)})$ is subtracted from each $s_i$ before exponentiating:

5. CONCLUSION

$$p_i = \frac{\exp(s_i - \sigma)}{\sum_{j=0}^{L-1} \exp(s_j - \sigma)} \quad (9)$$

Mathematically, according to the Jacobian matrix of softmax:

$$\frac{\partial p_i}{\partial s_i} = p_i(1 - p_i), \quad \frac{\partial p_i}{\partial s_j} = -p_i p_j \;\; (i \neq j) \quad (10)$$

For the training error $e$, the gradients for $s_i$ could be backpropagated as:



Figure 1: The attention maps for (a) Megatron-LM3, (b) ColossalAI sequence parallelism, and (c) the proposed MQSP, with memory complexity of $O(\frac{H}{n}L^2)$, $O(H\frac{L^2}{n})$, and $O(H\frac{L^2}{mn})$.

Figure 2: Illustration of sequence parallelism. Besides self-attention with global associations and softmax, other modules natively support sequence parallelism.

Figure 3: Overview of the MQSP self-attention. $f$ and $\bar{f}$ are the conjugate all-gather/reduce-scatter communication operations. The dotted box, repeated m times, reuses memory space. The red dotted arrow lines represent the aligned row for distributed softmax.

Figure 4: Conjugate pair comparisons. Broadcast/reduce: asymmetric simplex one-to-all communication for n times. All-gather/reduce-scatter: symmetric duplex collective communication for once.

Here $\Lambda_i$ denotes the local summation of $\lambda_i = \nabla p_i \times p_i$. Please refer to Appx. A.2 for details of the mathematical derivation. The distributed softmax introduces only $O(n-1)$ complexity to all-reduce the scalar local summations, negligible compared with Eq. 6. The pseudo-code of the distributed softmax is given in Algo. 1.

Algorithm 1: The forward and backward pass of the distributed softmax.

```python
import torch
import torch.distributed as dist

# Both functions assume the default process group is already initialized.


def d_softmax_forward(s):
    # s: local attention scores, [..., qDim, LocalSeqDim]
    max_global = s.max(-1).values                      # local max, [..., qDim]
    dist.all_reduce(max_global, op=dist.ReduceOp.MAX)  # in place: global max
    s_exp = (s - max_global[..., None]).exp()
    sum_global = s_exp.sum(-1)                         # local sum, [..., qDim]
    dist.all_reduce(sum_global, op=dist.ReduceOp.SUM)  # in place: global sum
    p = s_exp / sum_global[..., None]
    return p


def d_softmax_backward(p, p_grad):
    # p: local attention probabilities; p_grad: gradient of p
    # both in [..., qDim, LocalSeqDim]
    P = p_grad * p
    sum_global = P.sum(-1)                             # local sum, [..., qDim]
    dist.all_reduce(sum_global, op=dist.ReduceOp.SUM)  # in place: global sum
    s_grad = P - p * sum_global[..., None]
    return s_grad
```
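Algo. 1 can be sanity-checked without a process group. The sketch below is a single-process stand-in, assuming each "device" is simply one sub-list of a score row and each all-reduce is an ordinary `max` or `sum` over those sub-lists; the reference functions `softmax` and `softmax_backward` implement the dense form of Eq. 6 for comparison. All function names are illustrative.

```python
import math


def d_softmax_forward(score_shards):
    """score_shards[i] holds device i's columns of one attention-score row."""
    sigma = max(max(s) for s in score_shards)          # all-reduce MAX
    theta = [[math.exp(x - sigma) for x in s] for s in score_shards]
    denom = sum(sum(t) for t in theta)                 # all-reduce SUM of Theta_i
    return [[x / denom for x in t] for t in theta]


def d_softmax_backward(p_shards, grad_shards):
    """Gradient w.r.t. the scores, computed shard by shard."""
    lam = [[g * p for g, p in zip(gs, ps)]
           for gs, ps in zip(grad_shards, p_shards)]
    total = sum(sum(row) for row in lam)               # all-reduce SUM of Lambda_i
    return [[l - p * total for l, p in zip(ls, ps)]
            for ls, ps in zip(lam, p_shards)]


def softmax(row):
    top = max(row)
    e = [math.exp(x - top) for x in row]
    z = sum(e)
    return [x / z for x in e]


def softmax_backward(p, grad):
    """Dense backward pass written with the softmax Jacobian."""
    size = len(p)
    return [grad[i] * p[i] * (1 - p[i])
            + sum(grad[j] * (-p[i] * p[j]) for j in range(size) if j != i)
            for i in range(size)]


def flatten(shards):
    return [x for shard in shards for x in shard]


scores = [[0.5, -1.0], [2.0, 0.0], [3.0, 1.5]]     # 3 devices, 2 columns each
grads = [[0.1, -0.2], [0.3, 0.0], [-0.4, 0.25]]    # upstream gradient of p

p_dist = d_softmax_forward(scores)
g_dist = d_softmax_backward(p_dist, grads)
p_ref = softmax(flatten(scores))
g_ref = softmax_backward(p_ref, flatten(grads))
```

Only two scalars per row (the maximum and the sum) cross device boundaries in the forward pass and one in the backward pass, regardless of how many columns each device holds.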

Figure 5: Part of the training loss.

Figure 6: Comparison of maximum sequence lengths for training the Transformer as the GPU scale increases.

Figure 7: Comparison of the memory footprint and the token throughput with sequence length scaling up. w/o l ckpt means without layer checkpointing.

Figure 9: Memory (bar) and throughput (line) for different Micro-Q settings.

Micro-Q serves as the core part of the proposed MQSP, introducing the number m of Micro-Q as a new hyperparameter. Thus this subsection explores the effect of how we set m. Training BERT-large with a sequence length of 8192 on 2n16g, we vary m to measure the maximum allocated memory and the throughput. As shown in Fig. 9, the memory gain saturates as m reaches a certain level, for Micro-Q's attention computation is no longer where the maximum memory allocation occurs. Moreover, the throughput drops slowly as m increases and then drops off rapidly after the memory gain saturates. The reason could be that excessive segmentation results in inefficient tiny computations and communications. The result indicates that we could make the trade-off between speed and memory, benefiting from the flexibility of the Micro-Q mechanism.
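The saturation is also visible in the analytic activation terms of section 3.3. The sketch below is only a rough model, not a reproduction of Fig. 9: it counts elements for a single layer, ignores parameters and framework overhead, and assumes D = 1024 for BERT-large alongside the stated B = 16, H = 16, L = 8192, and n = 16. The function name is hypothetical.

```python
def activation_terms(B, D, H, L, n, m):
    """Split the MQSP activation expression into its three terms (elements)."""
    fixed = 10 * B * D * L / n               # independent of m
    micro_q_buffer = B * D * L / m           # all-gathered Micro-Q
    attention_map = 2 * B * H * L * L / (m * n)
    return fixed, micro_q_buffer, attention_map


B, D, H, L, n = 16, 1024, 16, 8192, 16

sweep = {}
m = 1
while m <= L // n:                           # keep at least one query per Micro-Q
    sweep[m] = activation_terms(B, D, H, L, n, m)
    m *= 2

totals = {k: sum(terms) for k, terms in sweep.items()}
```

For small m the attention map dominates the total; once m is large, the m-independent term dominates and further increases of m save almost nothing, which matches the qualitative shape of the memory bars.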

The pseudo-code of the implementation of Micro-Q in self-attention.

class DistributedSelfAttention(nn.Module):

$$\nabla_e s_i = \sum_{j=0}^{L-1} \nabla_e p_j \times \frac{\partial p_j}{\partial s_i} = \nabla_e p_i \times p_i(1 - p_i) + \sum_{j \neq i}^{L-1} \nabla_e p_j \times (-p_i p_j) \quad (11)$$
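Expanding the right-hand side of Eq. 11 shows why a single scalar reduction per row suffices in the backward pass. The steps below are a worked rearrangement using only the quantities already defined, with $\lambda_j = \nabla_e p_j \times p_j$:

```latex
\begin{aligned}
\nabla_e s_i
  &= \nabla_e p_i \, p_i (1 - p_i) - p_i \sum_{j \neq i} \nabla_e p_j \, p_j \\
  &= \nabla_e p_i \, p_i - p_i \sum_{j=0}^{L-1} \nabla_e p_j \, p_j \\
  &= \lambda_i - p_i \sum_{j=0}^{L-1} \lambda_j .
\end{aligned}
```

The remaining sum runs over the whole row, so it can be formed by adding each device's local summation of $\lambda_j$ and combining the per-device scalars with one reduce-sum.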

Table 1: Memory consumption comparison of model parameters and intermediate activations for the vanilla Transformer and the different parallel methods for the Transformer.

Table 2: The convergence results.

We set the same parallel size as data parallel, e.g., n = 8 in one node, and identical training hyperparameters. The convergence results are shown in Tbl. 2, and the training loss curves are depicted in Fig. 5. The similar convergence quality demonstrates that MQSP provides a means for long sequences under the guarantee of convergence, which brings exploration space for Transformer toward longer sequences.

Except for Megatron-LM3's advantage in short sequences with fewer model parameters, MQSP has a smaller memory footprint than the other methods on most configurations, saving up to 78.6% memory at a sequence length of 17408 on 4n32g, as denoted by the dotted arrow line. Even with stacked activations, MQSP w/o l ckpt also requires less memory for long sequences. This indicates that MQSP requires less memory to support the training of long-sequence Transformers. Furthermore, the memory advantage of MQSP grows with longer sequences, benefiting from its memory efficiency in self-attention.

In addition to the capability for training longer-sequence Transformers, training efficiency also deserves attention. We also conduct a token throughput comparison, as shown in the bottom row of Fig. 7. The throughputs generally decline as the sequence length scales up due to the quadratic computation complexity in self-attention. For the maximum sequence in different environments, MQSP achieves similar token throughput, demonstrating that MQSP could scale up N× sequence length with N× devices and N× time consumption. In 1n8g, Megatron-LM3 has throughput advantages, and MQSP w/o l ckpt is comparable with ColAISP, while MQSP scales up to a sequence length of 18432 without a significant drop in throughput. When training longer sequences, which are 8192 for Megatron-LM3 and 17408 for ColAISP, their insufficient scalability requires an inter-node parallel group, marked as A and B in Fig. 7. Our MQSP supports the same lengths within one node, achieving 2.1× and 3.3× the throughput per device of Megatron-LM3 and ColAISP, respectively. MQSP also has throughput advantages in multi-node environments, benefiting from efficient collective communications. In addition, MQSP w/o l ckpt gains further throughput with less recomputation while maintaining better sequence scalability than other methods.

Time Ratios. To analyze the throughput advantage of MQSP, we measure the time consumption ratios in a Transformer layer for MQSP and ColAISP. Configuring two environments as 1n4g and 2n16g, we train these two sequence parallel methods on their maximum sequence and profile the execution timeline of the forward passes. Here we adopt no overlapping to directly show the intra-node and inter-node communication costs. Fig. 8 exhibits the ratios of each part. The parts, in order, are the projection of $L_{qkv}$, the dot product of q and k, softmax or d-softmax, v reweighting, d-softmax communications, ring or conjugate communications, and the MLP. The result demonstrates that d-softmax introduces negligible communication costs. The communication costs occupy acceptably low ratios in the intra-node scope for both MQSP and ColAISP, 11% and 12%, respectively. In an inter-node environment, the inadequate bandwidth between nodes increases the communication ratios to different degrees, 19% for MQSP but 65% for ColAISP, a much sharper performance drop for the latter. This shows that the duplex collective communication adopted by MQSP brings an advantage in a heterogeneous network environment, compared with ColAISP's ring-style communication, which is restricted by the slowest link.

Figure 8: Pie charts of time consumption with the communication part emphasized. The numeric suffix represents the number of nodes and GPUs. It indicates MQSP's efficient inter-node communication.

This paper presents Micro-Query Sequence Parallelism, an efficient distributed method for linearly scaling the long-sequence Transformer. MQSP achieves distributed self-attention by all-gathering queries, maintaining only partial columns of the attention map with a low-cost distributed softmax. Furthermore, MQSP introduces the finer-grained query, Micro-Q, to reuse memory among the rows of the attention map, jointly decomposing the quadratic memory. MQSP attains 4.5× the sequence length of ColAISP and 4.3× that of Megatron-LM3, achieving a sequence length of up to 78848 on 32 A100 GPUs. The flexibility of Micro-Q boosts further scalability orthogonally, even on a single device. MQSP saves 78.6% memory and achieves a 3.3× speedup in the memory and throughput evaluations. With guaranteed convergence, MQSP facilitates scaling the Transformer to longer sequences.

PengCheng Yang, Xiaoming Zhang, Wenpeng Zhang, Ming Yang, and Hong Wei. Group-based interleaved pipeline parallelism for large-scale dnn training. In International Conference on Learning Representations, 2022.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283-17297, 2020.

The pseudo-code of the conjugate communication operators.
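The conjugate relationship between all-gather and reduce-scatter (the forward pass of one equals the backward pass of the other) can be illustrated with a small single-process model. This is only a sketch under stated assumptions, not the listing the caption above refers to: devices are modeled as Python lists, and the function names are hypothetical. It checks the adjoint identity ⟨all_gather(x), y⟩ = ⟨x, reduce_scatter(y)⟩, which is what makes reduce-scatter the gradient rule of all-gather.

```python
def all_gather(shards):
    """Every device receives the concatenation of all devices' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(buffers):
    """Sum full-length buffers over devices; device i keeps the i-th chunk."""
    n = len(buffers)
    chunk = len(buffers[0]) // n
    summed = [sum(col) for col in zip(*buffers)]
    return [summed[i * chunk:(i + 1) * chunk] for i in range(n)]


def inner(a, b):
    """Inner product accumulated over all devices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


x = [[1, 2], [3, 4], [5, 6]]                             # 3 devices, 2 values each
y = [[d * 10 + k for k in range(6)] for d in range(3)]   # upstream gradients

lhs = inner(all_gather(x), y)       # gradient paired with the gathered tensor
rhs = inner(x, reduce_scatter(y))   # same pairing after reduce-scatter
```

In a framework implementation, such a pair would typically be wrapped as two custom autograd operators whose backward passes call each other's collective; that wiring is an assumption here and is not shown.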

annex

Specifically, we get each device's local maximum score $\sigma_i = \max(s_{j \in [il, (i+1)l)})$ before communicating to collect $\sigma = r_{max}(\sigma_i)$, where $r_{max}$ means the allreduce-max operation. Defining the exponent of the normalized scores as $\theta_i = \exp(s_i - \sigma)$, we sum them up in stages:

where $r_{sum}$ means the reduce-sum operation. Similarly, for the backward pass, we define $\lambda_i = \nabla_e p_i \times p_i$ and its local summation. The form of backpropagation could be changed as:

A.3 OVERLAPPING

As shown in Fig. 10, the computation of softmax on Micro-Q and values reweighting can overlap with other micro-steps' communication of all-gather and reduce-scatter, offsetting the time overhead.

[Figure 10: All-Gather, Reduce-Scatter]

• Summation:

Intermediate activations:
• MLP: input $BLd_m$ and intermediate $4BLd_m$.
• Self-attention: input $BLd_m$; $L_{qkv}$ produces $3 \times BHLd_k$; attention scores and probabilities $2 \times BHL^2$; and reweighted value $BHLd_v$.
• Summation:

Megatron-LM3.
Model parameters:

Model parameters: Same as the Vanilla Transformer.
Intermediate activations:

Model parameters: Same as the Vanilla Transformer.
Intermediate activations:

