MQSP: MICRO-QUERY SEQUENCE PARALLELISM FOR LINEARLY SCALING LONG SEQUENCE TRANSFORMER

Abstract

Long-sequence modeling with Transformers is gaining prevalence in fields involving long texts and high-resolution images and videos, but suffers from quadratic memory complexity. Existing work investigates low-complexity variants or parallel methods to handle it. The former attempts to approximate full attention and is limited by a single device's capacity; the latter struggles to manage the quadratic memory of attention maps, leading to insufficient sequence scalability. In this work, we propose a novel parallel method named Micro-Query Sequence Parallelism (MQSP). MQSP slices sequences across devices and projects local queries, keys, and values in self-attention. For communication and memory efficiency, MQSP all-gathers the queries while the keys and values remain local, yielding a local attention map on which a distributed softmax is conducted to amortize memory by column. Meanwhile, the queries are further partitioned into Micro-Q to divide the computation and recycle the attention map by row, jointly decomposing the quadratic memory to achieve linear scalability. Evaluation shows that MQSP scales sequence length linearly, reaching 4.5× the sequence length of ColossalAI's sequence parallelism and 4.3× that of Megatron-LM3, and enabling training of BERT-large with a sequence length of 78848 on 32 A100 GPUs. MQSP reduces memory occupation by up to 78.6% and achieves up to 3.3× throughput when training at a sequence length of 17408. Convergence-quality experiments show that MQSP supports long sequences with guaranteed convergence, bringing the Transformer the potential to explore longer sequences.

1. INTRODUCTION

Transformer (Vaswani et al., 2017), an attention-based model initially proposed for natural language processing (NLP), shows promising potential in computer vision (CV), multi-modality, and beyond (Carion et al., 2020; Dosovitskiy et al., 2021; Liu et al., 2022; Arnab et al., 2021; Neimark et al., 2021; Radford et al., 2021; Wang et al., 2022). The self-attention associations between arbitrary pairs of tokens enable the Transformer to learn global, context-aware sequence representations for many modalities. Furthermore, there is an emerging trend toward long-range modeling, which scales up the sequence length of the Transformer. Long-range modeling is essential for long texts in question answering, document classification, and other NLP tasks, as well as for high-resolution pictures and series of video frames in the image modality. As the sequence is extended, memory consumption increases rapidly due to the quadratic complexity of self-attention, inevitably exceeding the limit of a single device (e.g., a GPU). This problem obstructs the exploration of Transformers for modeling longer sequences. Recently, researchers have focused on sparse mechanisms in self-attention (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020) or low-complexity substitutes (Wang et al., 2020; Xiong et al., 2021; Choromanski et al., 2021; Qin et al., 2022) to approximate full attention, boosting sequence length. However, besides concerns about their influence on performance, the memory upper bound of a single device still prevents these methods from scaling further. Much existing work investigates parallel methods to distribute the Transformer model across devices, such as tensor parallelism (Shoeybi et al., 2019), pipeline parallelism (Harlap et al., 2018; Huang et al., 2019; Narayanan et al., 2021; Yang et al., 2022), and the zero redundancy optimizer (ZeRO) (Rajbhandari et al., 2020; Ren et al., 2021).
However, these methods partition the model parameters along different dimensions and do not alleviate the enormous intermediate activations introduced by long-sequence self-attention. Therefore, several more recent methods are devoted to partitioning the Transformer along the sequence dimension. ColossalAI's sequence parallelism (Li et al., 2021) transfers keys and values in a ring style to compute partial attention maps. Megatron-LM3 (Korthikanti et al., 2022) modifies the conjugate operators to parallelize layer norms and dropouts along the sequence dimension. Despite amortizing some intermediate activations, these approaches still consume local memory proportional to the global sequence length, making it difficult to scale to longer sequences. In this paper, we propose MQSP, a novel sequence-parallel method, to diminish the quadratic memory overhead and efficiently scale up the long-sequence Transformer. MQSP splits the input sequence across n devices for parallel computation and projects local queries, keys, and values in self-attention. We design a distributed self-attention to preserve the global context-awareness of self-attention. Specifically, MQSP synchronizes the queries across devices through an all-gather operation while the keys and values remain local, acquiring local attention maps amortized along the column dimension. Since the rows for softmax are incomplete locally, we conduct a distributed softmax with hierarchical reduction across devices, which introduces negligible communication. More importantly, we divide the local queries into m finer-grained queries, called Micro-Q, and process them step by step to obtain the corresponding attention outputs. The memory for the attention map is reused across micro steps along the row dimension, jointly decomposing the quadratic memory to linearly scale up the sequence length. The proposed MQSP shows significant sequence-scaling ability compared with existing leading approaches.
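The scheme above can be illustrated with a single-process NumPy simulation. The function `mqsp_attention`, the per-device shard lists, and the `n_dev`/`n_micro` parameters are illustrative assumptions of this sketch, not the paper's implementation; a real deployment would replace the concatenation and the cross-list reductions with collective operations (all-gather for queries, all-reduce for the softmax statistics) across GPUs.

```python
import numpy as np

def mqsp_attention(Q, K, V, n_dev, n_micro):
    """Single-process sketch of MQSP's distributed attention.

    The sequence is sliced across n_dev simulated devices; queries are
    all-gathered while keys/values stay local, so each device holds only a
    column block of the attention map. Queries are further split into
    Micro-Q chunks so only a small row block of scores is live at a time.
    """
    L, d_k = Q.shape
    Ls = L // n_dev                                   # local slice length
    Ks = [K[i * Ls:(i + 1) * Ls] for i in range(n_dev)]
    Vs = [V[i * Ls:(i + 1) * Ls] for i in range(n_dev)]
    Q_all = np.concatenate([Q[i * Ls:(i + 1) * Ls] for i in range(n_dev)])

    out = np.zeros((L, V.shape[1]))
    # Micro-Q: process query rows in small steps, reusing score memory.
    for rows in np.array_split(np.arange(L), n_dev * n_micro):
        # Each simulated device computes its local column block of scores.
        blocks = [Q_all[rows] @ Ks[d].T / np.sqrt(d_k) for d in range(n_dev)]
        # Distributed softmax: reduce row-wise max and sum across devices.
        row_max = np.max([b.max(axis=1) for b in blocks], axis=0)
        exps = [np.exp(b - row_max[:, None]) for b in blocks]
        row_sum = np.sum([e.sum(axis=1) for e in exps], axis=0)
        # Reweight local values, then reduce the partial context outputs.
        out[rows] = sum(e @ Vs[d] for d, e in enumerate(exps)) / row_sum[:, None]
    return out
```

Because the softmax is invariant to the shared row-wise max shift, this block-wise reduction reproduces dense attention exactly while never materializing the full L × L map on one "device".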
Our evaluation indicates that for BERT-large (Kenton & Toutanova, 2019), MQSP scales up to a sequence length of 78848 on 32 A100 GPUs, 4.5× that of ColossalAI's sequence parallelism, and 38912 on 16 A100 GPUs, 4.3× that of Megatron-LM3. In memory-usage and throughput comparisons, MQSP saves 78.6% of memory and achieves a 3.3× speedup, demonstrating the superiority of its distributed attention and communication method. Furthermore, convergence-quality experiments on the WikiText, SQuAD, QQP, and MRPC datasets show that MQSP maintains convergence quality while scaling up sequence length, bringing Transformers the prospect of exploring longer sequences.

2. RELATED WORK

This section briefly introduces the self-attention complexity and parallel methods for the Transformer.

Self-Attention Complexity. In the Transformer proposed by Vaswani et al. (2017), self-attention is the vital module for modeling global dependencies. We omit the batch and multi-head dimensions for brevity. For input hidden states x ∈ R^{L×d_m}, where L is the sequence length and d_m is the model hidden size, the attention mechanism can be formulated as:

Q, K, V = L_qkv(x),   C = softmax(S)V = softmax(QK^T / √d_k)V

where L_qkv is the linear layer projecting the tokens to Q, K ∈ R^{L×d_k} and V ∈ R^{L×d_v} in the query, key, and value embedding spaces. The scaled dot product of Q and K produces the attention score map S ∈ R^{L×L}, which incurs the quadratic complexity O(L^2). Subsequently, the softmax operation along the rows converts S into attention probabilities P, which reweight V into the context output C. Many recent approaches introduce sparse mechanisms, such as Sparse Transformer (Child et al., 2019), Longformer (Beltagy et al., 2020), and BigBird (Zaheer et al., 2020), or low-complexity substitutes, such as Linformer (Wang et al., 2020), Nyströmformer (Xiong et al., 2021), Performer (Choromanski et al., 2021), and Cosformer (Qin et al., 2022). These methods algorithmically approximate full attention under sparsity or low-rank assumptions, reducing the complexity to O(L log L) or O(L). However, when those assumptions are not satisfied, performance can suffer across a broad task spectrum, and accommodating the entire Transformer within one device still limits further expansion of the sequence length. Thus, we set our sights on scaling up the standard Transformer across more devices.

Parallel Methods for Transformer. Parallelism approaches have been innovative techniques for training large Transformers.
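As a concrete reference for the formulation above, the following minimal single-head sketch materializes the L × L score map S explicitly, making the O(L^2) memory term visible. The function name and the tensor shapes are illustrative choices of this sketch.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Standard scaled dot-product self-attention (single head, no batch).

    The L x L score matrix S is the source of the quadratic memory
    complexity discussed in the text.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv              # projections L_qkv(x)
    d_k = K.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                    # attention scores, L x L
    P = np.exp(S - S.max(axis=-1, keepdims=True)) # numerically stable softmax
    P = P / P.sum(axis=-1, keepdims=True)         # attention probabilities
    return P @ V                                  # context output C

rng = np.random.default_rng(0)
L, d_m, d_k = 8, 16, 4
x = rng.standard_normal((L, d_m))
Wq, Wk, Wv = (rng.standard_normal((d_m, d_k)) for _ in range(3))
C = self_attention(x, Wq, Wk, Wv)
assert C.shape == (L, d_k)
```

Doubling L here quadruples the size of S, which is exactly the scaling behavior that motivates the sequence-parallel methods discussed next.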
Pipeline parallelisms (Harlap et al., 2018; Huang et al., 2019; Narayanan et al., 2021; Yang et al., 2022) split the model layer-wise without handling self-attention within layers. ZeRO (Rajbhandari et al., 2020) spreads the model's parameters and optimizer states across devices while each device conducts the same computation. Tensor parallelism in Megatron-LM (Shoeybi et al., 2019)

