MQSP: MICRO-QUERY SEQUENCE PARALLELISM FOR LINEARLY SCALING LONG SEQUENCE TRANSFORMER

Abstract

Long-sequence modeling with Transformers is gaining prevalence in fields involving long texts and high-resolution images and videos, but it suffers from quadratic memory complexity. Existing work addresses this with low-complexity variants or parallel methods. The former attempt to approximate full attention and remain limited by a single device's capacity. The latter struggle to manage the quadratic memory of attention maps, leading to insufficient sequence scalability. In this work, we propose a novel parallel method named Micro-Query Sequence Parallelism (MQSP). MQSP slices sequences across devices and projects local queries, keys, and values in self-attention. For communication and memory efficiency, MQSP all-gathers the queries while keys and values remain local, yielding a local attention map on which a distributed softmax is conducted to amortize memory by column. Meanwhile, the queries are further partitioned as Micro-Q to divide the computation and recycle the attention map by row, jointly decomposing the quadratic memory to achieve linear scalability. Evaluation shows that MQSP scales sequence length linearly, achieving 4.5× the sequence length of ColossalAI's sequence parallelism and 4.3× that of Megatron-LM3, enabling training of BERT-large with a sequence length of 78848 on 32 A100 GPUs. MQSP reduces memory occupation by up to 78.6% and achieves up to 3.3× throughput when training with a sequence length of 17408. The convergence-quality experiment demonstrates that MQSP handles long sequences with guaranteed convergence, giving the Transformer the potential to explore longer sequences.
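The column-wise (distributed softmax over local K/V shards) and row-wise (Micro-Q) decomposition described above can be illustrated with a minimal single-process NumPy sketch. This is an assumption-laden simulation, not the authors' implementation: the `P` "devices" are plain Python lists, and the all-reduce of row maxima and row sums is emulated with ordinary NumPy reductions.

```python
import numpy as np

# Hypothetical toy sizes: sequence length S, head dim d, P simulated
# devices holding K/V column shards, and `micro` Micro-Q row chunks.
np.random.seed(0)
S, d, P, micro = 8, 4, 2, 2
Q = np.random.randn(S, d)
K = np.random.randn(S, d)
V = np.random.randn(S, d)

# Reference: ordinary full softmax attention on a single device.
logits = Q @ K.T / np.sqrt(d)
w = np.exp(logits - logits.max(-1, keepdims=True))
ref = (w / w.sum(-1, keepdims=True)) @ V

# Sharded version: device p keeps K_p, V_p (columns of the attention map);
# the all-gathered Q is visible to every device.
K_sh = np.split(K, P)
V_sh = np.split(V, P)
out = np.zeros_like(Q)

for mq in np.split(np.arange(S), micro):            # Micro-Q: one row chunk at a time
    # Each device computes only its local columns of the attention map.
    part = [Q[mq] @ K_sh[p].T / np.sqrt(d) for p in range(P)]
    # Distributed softmax: all-reduce(max) then all-reduce(sum) across devices.
    row_max = np.max([l.max(-1) for l in part], axis=0)
    exps = [np.exp(l - row_max[:, None]) for l in part]
    row_sum = np.sum([e.sum(-1) for e in exps], axis=0)
    # Each device applies its V shard; partial outputs are sum-reduced.
    out[mq] = np.sum([e @ V_sh[p] for p, e in enumerate(exps)],
                     axis=0) / row_sum[:, None]

assert np.allclose(out, ref)  # matches full attention
```

Because each device only ever materializes a `|mq| × S/P` slice of the attention map, peak attention-map memory shrinks by a factor of `P × micro`, which is the source of the linear scalability claimed above.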

1. INTRODUCTION

Transformer (Vaswani et al., 2017), an attention-based model initially proposed for natural language processing (NLP), has shown promising potential in computer vision (CV), multi-modality, and beyond (Carion et al., 2020; Dosovitskiy et al., 2021; Liu et al., 2022; Arnab et al., 2021; Neimark et al., 2021; Radford et al., 2021; Wang et al., 2022). The self-attention associations between arbitrary pairs of tokens enable the Transformer to learn global context-aware sequence representations for many modalities. Furthermore, there is an emerging trend toward long-range modeling, which scales up the sequence length of the Transformer. Long-range modeling is essential for long texts in question answering, document classification, and other NLP tasks, as well as for high-resolution pictures and series of video frames in the image modality. As the sequence is extended, memory consumption increases rapidly due to the quadratic complexity of self-attention, inevitably exceeding the limit of a single device (e.g., a GPU). This problem obstructs the exploration of the Transformer for modeling longer sequences.

Recently, researchers have focused on sparse mechanisms in self-attention (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020) or low-complexity substitutes (Wang et al., 2020; Xiong et al., 2021; Choromanski et al., 2021; Qin et al., 2022) that approximate full attention to boost sequence length. However, beyond concerns about their influence on model performance, the memory upper bound of a single device still limits how far those methods can scale. Much existing work investigates parallel methods to distribute the Transformer model across devices, such as tensor parallelism (Shoeybi et al., 2019), pipeline parallelism (Harlap et al., 2018; Huang et al., 2019; Narayanan et al., 2021; Yang et al., 2022), and the zero redundancy optimizer (ZeRO) (Rajbhandari et al., 2020; Ren et al., 2021). However, these methods separate the model parameters

