COMPOSITE SLICE TRANSFORMER: AN EFFICIENT TRANSFORMER WITH COMPOSITION OF MULTI-SCALE MULTI-RANGE ATTENTIONS

Abstract

Since the introduction of Transformers, researchers have tackled the expensive quadratic complexity of the attention mechanism. While significant complexity improvements have been achieved, they often come at the cost of reduced accuracy. In this paper, we propose Composite Slice Transformer (CST), a Transformer-based network equipped with a composition of multi-scale multi-range attentions, boosting both efficiency and modeling capability. After stacking fixed-length slices of the input sequence, each layer in CST performs a pair of fine- and coarse-grained attentions with short and long ranges in a sequential manner, coupled with a volatile instant positional embedding. In addition to a significantly reduced O(NL + N^2/L^2) complexity for sequence length N and slice length L, CST achieves superior performance on a variety of tasks. We show that CST surpasses recently published efficient Transformers on the Long Range Arena benchmark, demonstrating the bidirectional long-range dependency modeling capability of our model at comparable complexity. It also outperforms the standard Transformer by a margin of 6.9% in average accuracy across the five classification tasks. On word-level WikiText-103 autoregressive language modeling with various sequence lengths, and on masked language modeling followed by the GLUE benchmark, CST outperforms most other efficient Transformers while remaining competitive with the standard Transformer.

1. INTRODUCTION

Transformers (Vaswani et al., 2017) are one of the most important recent advances in artificial intelligence. Since they can be combined in a straightforward fashion with advanced training methods and auxiliary modules, Transformers have proven extremely effective as a versatile backbone architecture for achieving state-of-the-art performance in many domains such as natural language processing (Devlin et al., 2018; Yang et al., 2019; Brown et al., 2020; Raffel et al., 2020; Sanh et al., 2022), vision processing (Dosovitskiy et al., 2020; Liu et al., 2021b; Radford et al., 2021), visual language modeling (Alayrac et al., 2022), speech recognition (Dong et al., 2018; Gulati et al., 2020; Shi et al., 2021), and reinforcement learning (Chen et al., 2021b; Janner et al., 2021). Despite this versatility, Transformers possess an expensive memory and computational complexity of O(N^2) with respect to input length N in the multi-head self-attention computation. As a result, Transformers are often not applied to long sequence data. Since the introduction of Transformers in (Vaswani et al., 2017), recent work has focused on improving Transformer complexity through various techniques, achieving efficiency gains in complexity and memory requirements (Tay et al., 2020c), with several models attaining O(N) complexity (Wang et al., 2020; Katharopoulos et al., 2020; Ma et al., 2021). Unfortunately, these efficiencies come at the cost of reduced accuracy and often lack the ability to model fine-grained token dependencies, limiting the application of these recent improvements to a wider range of practical problems.
Recent studies (Zhu et al., 2021; Ren et al., 2021; Nguyen et al., 2021; Zhu & Soricut, 2021; Han et al., 2021) show that combining efficient techniques for global sequence modeling with fine-grained limited-range attention can improve accuracy, maintain low complexity, and enable longer context windows, allowing such models to outperform the standard, full-resolution Transformer on certain tasks. However, optimally combining sequence modeling at different levels of granularity remains an open problem. To this end, we propose the Composite Slice Transformer (CST), an efficient Transformer-based network architecture consisting of a composition of attentions applied at different scales to a stacked slice representation of the input sequence, coupled with a multi-scale volatile instant positional embedding. For a fixed slice length L, CST has a complexity of O(NL + N^2/L^2), which is comparable to or more efficient than linear complexity in many practical settings (Section A.4). Since slicing restricts fine-grained token interaction across slice boundaries, CST also leverages an extended local attention, preventing context fragmentation (Dai et al., 2019b) and enabling seamless sequence modeling. These improvements allow CST to outperform the standard Transformer on several benchmark tasks. Similar to (Dosovitskiy et al., 2020), CST abstracts the input sequence into another sequence with fewer tokens to compute a low-resolution attention with longer range, using a fixed-size segment-wise embedding. With a segment length L, the reduced sequence has length N/L; evaluating attention on this reduced sequence has complexity O(N^2/L^2). With an appropriately chosen L, CST can achieve a significant complexity reduction without a loss of performance.
Along with the segment-level attention, CST also leverages a full-resolution local attention to form a multi-scale multi-range attention (MSMRA) through composition with the global attention, improving model expressiveness (Section A.5), unlike (Zhu & Soricut, 2021; Zhu et al., 2021; Nguyen et al., 2021; Ren et al., 2021; Han et al., 2021). CST extends the ideas of position-infused attention (Press et al., 2021) and applies them to MSMRA, which we refer to as multi-scale volatile instant positional embedding (MS-VIPE). In addition to its effectiveness as a positional embedding, MS-VIPE provides further parameter efficiency by only requiring storage at the reduced lengths of the local (L) and global (N/L) attentions instead of the full sequence length N. We evaluate our model on bidirectional long-range dependency modeling tasks, an autoregressive language modeling task, and natural language understanding tasks. In our experiments (Section 5), CST achieves state-of-the-art performance among all Transformer-based models (including the standard Transformer and recently proposed efficient Transformers) and demonstrates strong performance on a wide variety of tasks. The paper is organized as follows. In Section 2 and Section 3, we discuss recent efficient Transformer developments, their strengths and weaknesses, and the inspiration for CST. In Section 4, we present our model, Composite Slice Transformer, and describe its features. In Section 5, we provide our experimental results on the Long Range Arena, WikiText-103 autoregressive language modeling, and GLUE benchmarks, demonstrating the efficiency and versatility of CST. In Section 6, we conclude with a summary of our work.

2. RELATED WORK

Tay et al. (2020c) provides a comprehensive study of proposed approaches to accelerate the attention mechanism of Vaswani et al. (2017). The key differentiating factor among the approaches presented is the modeling assumptions used to approximate the original attention map.
For example, common assumptions include sparsity (Child et al., 2019; Ho et al., 2019; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020; Tay et al., 2020a; Kitaev et al., 2019) and low-rankness (Wang et al., 2020; Choromanski et al., 2020; Katharopoulos et al., 2020). Chen et al. (2021a) combines these two assumptions. Other architectures leverage additional memory for compressing the global context (Lee et al., 2019; Ma et al., 2021; Jaegle et al., 2021). In order to capture fine-grained token interactions that might be missing in abstractive attention, (Zhu & Soricut, 2021; Zhu et al., 2021; Nguyen et al., 2021; Ren et al., 2021) combine a full-resolution local attention with a global abstractive attention, forming a multi-scale multi-range attention (MSMRA). These approaches, however, do not compose the local full-resolution attention and the global reduced-resolution attention in series. Absent from this literature, serial composition and an appropriate positional embedding can improve accuracy while preserving efficiency; the following sections describe how we address this. More recently, sequence models with structured formulations or strong inductive biases have achieved significant improvements on benchmarks (Gu et al., 2021; Mehta et al., 2022; Ma et al., 2022; Li et al., 2022). (Press et al., 2021) propose position-infused attention (PIA), an attention module with a layer-wise positional embedding applied only to queries and keys, to address the token representation reusability issue in increasing-window attention caused by positional information drift, while avoiding the expensive computation of relative position encoding (Dai et al., 2019b). A study of adding positional embedding to values is conducted in (Tsai et al., 2019), concluding that positional embedding on values does not lead to performance improvement. We extend these ideas in CST and apply them in MSMRA.

3. BACKGROUND

3.1. TRANSFORMERS AND MULTI-HEAD SELF-ATTENTION

A Transformer layer consists of a multi-head self-attention sublayer followed by a feed-forward network (Vaswani et al., 2017), with an optional cross-attention sublayer when used as a decoder. The multi-head self-attention is defined as the concatenation of the self-attention outputs of all attention heads: Y = [Y_0, Y_1, ..., Y_{H-1}]_2, where [·]_r denotes concatenation in the r-th dimension, and each output Y_h ∈ R^{N×d_h} is a scaled dot-product attention computed from the input X ∈ R^{N×D} as

Y_h = softmax(Q_h K_h^⊤ / √d_h) V_h = A V_h,

where Q_h = X W_{q,h}, K_h = X W_{k,h}, and V_h = X W_{v,h} are queries, keys, and values, respectively, expressed as linear transformations of the input X by W_{·,h} ∈ R^{D×d_h}. We assume the queries, keys, and values have the same hidden dimension, d_h = D/H. For the rest of the paper, we omit the head index h and the scaling factor 1/√d for simplicity. We denote the query, key, and value at position index i by q_i, k_i, v_i ∈ R^{1×d}, respectively. In this context, the attention output at the i-th token position, y_i ∈ R^{1×d}, corresponds to

y_i = softmax(q_i K^⊤) V.

Due to the nonlinearity and normalization of the softmax function, QK^⊤ must be computed explicitly to obtain the attention weights A, followed by value aggregation through AV, resulting in O(N^2) complexity with respect to the sequence length N for the self-attention.
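As a concrete illustration, the multi-head self-attention above can be sketched in a few lines of NumPy. This is a minimal reference sketch, not the paper's implementation; random matrices stand in for learned weights, and names mirror the notation in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v):
    """X: (N, D); W_*: (H, D, d_h). Returns the concatenated head outputs (N, H*d_h)."""
    H, D, d_h = W_q.shape
    heads = []
    for h in range(H):
        Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]   # (N, d_h) each
        A = softmax(Q @ K.T / np.sqrt(d_h))            # (N, N) attention map: the O(N^2) cost
        heads.append(A @ V)                            # value aggregation A V_h
    return np.concatenate(heads, axis=-1)              # [Y_0, ..., Y_{H-1}] along features

rng = np.random.default_rng(0)
N, D, H = 8, 16, 4
d_h = D // H
X = rng.standard_normal((N, D))
W_q, W_k, W_v = (rng.standard_normal((H, D, d_h)) * 0.1 for _ in range(3))
Y = multi_head_self_attention(X, W_q, W_k, W_v)
```

The explicit N × N map A in each head is exactly the term CST avoids materializing at full resolution.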

3.2. ABSTRACTIVE ATTENTIONS AND MULTI-SCALE MULTI-RANGE ATTENTION (MSMRA)

We refer to the family of efficient attention approaches in which the lengths of the attention operands are reduced to M < N by applying an abstraction function ϕ(·) as abstractive attentions. This approach reduces the attention complexity while retaining the form of the basic single-scale attention computation in Section 3.1. Many recent approaches follow this template (Wang et al., 2020; Ma et al., 2021; Dosovitskiy et al., 2020). We focus on cases where all operands of attention, i.e., query, key, and value, are abstracted, noting that there are other possible choices, e.g., abstracting only query and key. In such cases, the attention mechanism reduces to

\bar{y}_{i′} = softmax(\bar{q}_{i′} \bar{K}^⊤) \bar{V},

where \bar{Q}, \bar{K}, and \bar{V} are abstracted queries, keys, and values, respectively, \bar{q}_{i′} is the i′-th row of \bar{Q}, obtained by \bar{q}_{i′} = ϕ({q_i : i ∈ Ω_{i′}}), and \bar{y}_{i′} is the attention output token for the abstracted query token \bar{q}_{i′}. In order to restore the resolution of the output, since the query is abstracted, we define a one-to-many mapping function ψ(·) such that {y_i : i ∈ Ω_{i′}} = {ψ(\bar{y}_{i′})}_i. A more detailed description of abstractive attention is presented in Section A.1. Although many previous abstractive attention approaches have achieved sub-quadratic or linear complexities, they typically result in degraded benchmark performance. However, Transformer-based models that leverage multi-scale attention by combining local attention and global attention perform competitively against the standard Transformer. In fact, these models can outperform the standard Transformer on certain tasks while still maintaining efficiency (Zhu & Soricut, 2021; Zhu et al., 2021; Nguyen et al., 2021; Ren et al., 2021).
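A minimal sketch of this abstract-attend-restore pattern, assuming mean pooling over non-overlapping windows as a concrete choice for ϕ and token-wise broadcasting for ψ (both are illustrative choices, not the only ones discussed in the text):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def abstractive_attention(Q, K, V, L):
    """Abstract Q, K, V from length N to M = N/L, attend at reduced length,
    then restore the output resolution by broadcasting (psi)."""
    N, d = Q.shape
    phi = lambda Z: Z.reshape(N // L, L, d).mean(axis=1)   # ϕ: many-to-one pooling
    Qa, Ka, Va = phi(Q), phi(K), phi(V)                    # (N/L, d) each
    Ya = softmax(Qa @ Ka.T / np.sqrt(d)) @ Va              # attention at O((N/L)^2)
    return np.repeat(Ya, L, axis=0)                        # ψ: one-to-many restore to (N, d)

rng = np.random.default_rng(1)
N, d, L = 16, 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
Y = abstractive_attention(Q, K, V, L)
```

Note that all tokens inside one abstraction range Ω_{i′} receive the same restored output, which is precisely the loss of fine-grained resolution that MSMRA compensates for.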
While other types of MSMRA are described in Section A.2, our proposed attention mechanism is essentially an MSMRA of the form y_i = y_{l,i} + ψ(\bar{y}_{g,i′}), where y_{l,i} is the local attention output and \bar{y}_{g,i′} is the global attention output. It leverages a one-dimensional version of the token abstraction used in (Dosovitskiy et al., 2020) for the global attention, but with an additional composition of local attention.

4. COMPOSITE SLICE TRANSFORMER

We describe the key ideas of Composite Slice Attention (CSA) and the CST network, a Transformer-based model with CSA replacing the full softmax dot-product attention in the standard Transformer. CSA leverages both full-resolution attention over limited ranges and abstracted attention to capture long-range interactions. Unlike previous approaches, the multi-scale, multi-range attentions are combined through function composition in a serial connection, which allows information passing across representations at different scales and improves the expressiveness of the model. See Figure 1 for an illustration of the CSA module and Appendix A.3 for full architecture diagrams.

4.1. SEQUENCE REPRESENTATION AS STACKED SLICES AND CSA

In a high-level categorization, the multi-scale multi-range attention of CST corresponds to the combination of block-wise local window attention (Beltagy et al., 2020) and patch-based global attention (Dosovitskiy et al., 2020) in a one-dimensional form. A CSA layer starts by converting the input sequence X ∈ R^{N×D} into a stack of slices S ∈ R^{N/L×L×D} by slicing it with a fixed length L, implemented simply as a Reshape operation. Two attentions with different granularity are then performed sequentially. First, the batch size B and the number of slices N/L are combined into a new batch size BN/L, so that the local attention is parallelized over all slices. The local attention is then performed across the tokens within each slice:

Y_l = softmax(Q_l K_l^⊤) V_l,

where Q_l, K_l, V_l ∈ R^{BN/L×L×d} are the queries, keys, and values for each local attention head, computed as S W_{l,q}, S W_{l,k}, and S W_{l,v}, respectively, with W_{l,·} ∈ R^{D×d}. Next, the length-L dimension of the local attention output is collapsed using an abstraction function ϕ to get the slice embedding \bar{S} ∈ R^{BN/L×D}. Specifically, we use simple mean pooling,

\bar{S}_s = ϕ(Y_{l,s}) = Σ_{l=0}^{L-1} m_{s,l} Y_{s,l} / Σ_{l=0}^{L-1} m_{s,l},

where s and l denote the slice index and the token index, respectively, and m ∈ R^{N/L×L} is the stack of slices of the binary attention mask that is optionally given when the input sequence is padded, e.g., when a data batch contains samples of different lengths. Then the second, global attention is performed along the slice dimension to model full-range slice-wise interaction at reduced resolution:

\bar{Y}_g = softmax(\bar{Q}_g \bar{K}_g^⊤) \bar{V}_g,

where \bar{Q}_g, \bar{K}_g, and \bar{V}_g are the abstracted queries, keys, and values for the global attention, obtained by applying W_{g,q}, W_{g,k}, and W_{g,v} to \bar{S}.
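The full local-pool-global-merge pipeline can be sketched as follows. This is a simplified single-head NumPy sketch under stated assumptions: identity projections in place of the learned W matrices, no attention mask, and no slice extension; the final broadcast-add that merges the two outputs follows the description in this section.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def csa_core(X, L):
    """Composite Slice Attention sketch: slice-local attention, mean-pool slice
    embeddings, global attention across slices, then broadcast-add."""
    N, D = X.shape
    S = X.reshape(N // L, L, D)                             # stack of slices (Reshape)
    A_l = softmax(S @ S.transpose(0, 2, 1) / np.sqrt(D))    # (N/L, L, L) local maps
    Y_l = A_l @ S                                           # full-resolution local output
    S_bar = Y_l.mean(axis=1)                                # ϕ: mean-pooled slice embedding
    A_g = softmax(S_bar @ S_bar.T / np.sqrt(D))             # (N/L, N/L) global map
    Y_g = A_g @ S_bar                                       # coarse global output
    Y = Y_l + Y_g[:, None, :]                               # ψ: broadcast-add back to slices
    return Y.reshape(N, D)

rng = np.random.default_rng(2)
N, D, L = 32, 8, 4
Y = csa_core(rng.standard_normal((N, D)), L)
```

Because the global attention consumes the pooled *output* of the local attention, the two scales are composed in series rather than run as independent parallel branches.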
Finally, the output of the global attention is broadcast to the original resolution and added to the local output:

Y_i = Y_{l,i} + Y_{g,i} = Y_{l,i} + ψ(\bar{Y}_{g,i′}),

where ψ is a broadcasting function that restores only the sequence length, i.e., Y_{g,i} = \bar{Y}_{g,i′} for i ∈ Ω_{i′}, since the output of the local attention already holds the full resolution. This process can be implemented as a simple broadcast-add operation.

Slice Extension: Addressing Fine-Grained Context Fragmentation. The local attention of the stacked slice representation S is strictly bounded, resulting in potential context fragmentation (Dai et al., 2019b). Although the global attention models the slice-wise interactions, it may not be sufficient to capture important fine-grained token dependencies across slice boundaries. To allow token-level interaction between slices, we slightly extend each slice into its neighbors, allowing slices to share tokens. This extended local attention is computed with keys K_{l,ext} and values V_{l,ext} transformed from the extended stacked slices S_ext, where α denotes the extension ratio. The extension can be implemented by concatenating shifted slices in the keys and values:

S_ext = [[0_{(α-1)L/2}, S_{:-1, (3-α)L/2:L}]_1, S, [S_{1:, 0:(α-1)L/2}, 0_{(α-1)L/2}]_1]_2 ∈ R^{N/L×αL×D}, for α ≤ 3,

where Python index-selection notation is used and 0_{(α-1)L/2} denotes zero padding of length (α-1)L/2.

Slice Extension and Multi-Scale Causal Mask for Autoregressive Sequence Modeling. In order to apply CST to autoregressive sequence processing tasks, we propose custom causal masking and slice extension schemes. For the local token-level attention, we apply an L × L causal mask with -∞ above the main diagonal and zero elsewhere to the score tensor Q_l K_l^⊤. However, since the leftmost tokens in each slice have few (and possibly zero) tokens to attend to, we extend the keys and values only on the left-hand side to encourage better fine-grained dependency modeling, i.e.,

S_{AR,ext} = [[0_{(α-1)L/2}, S_{:-1, (3-α)L/2:L}]_1, S]_2.

For the global attention, extra care must be taken to prevent leftward information leakage while computing the slice embedding via mean pooling. The diagonal elements of the N/L × N/L causal mask are set to -∞, in contrast to the local counterpart. In addition, at slice index t, the shifted query \bar{Q}_{g,t-1} is used instead of \bar{Q}_{g,t} (note that \bar{K}_{g,t} is handled by the global causal mask).

Increased Expressiveness of CST. We mathematically motivate the improved performance of CST over competing efficient Transformer models. We show that a given function that CST is able to represent is ϵ away from a rational function with Euclidean degree 4d(ϵ), while competing approaches such as H-Transformer-1D (Zhu & Soricut, 2021) can only approach a rational function with Euclidean degree d(ϵ). This implies that CST is more expressive than other approaches that do not involve compositions of multi-scale attentions. We state the main result below and present the details in Appendix A.5.

Proposition 1. For any fixed ϵ > 0, there exists some Euclidean degree d(ϵ) = O(log(ϵ)) such that Y_g^H ⊆ S_{M,d(ϵ)}^ϵ and Y_g^CS ⊆ S_{M,4d(ϵ)}^ϵ, where M corresponds to the total number of weights in CST, and S_{M,d}^ϵ is the space of ratios of real analytic functions that are ϵ away from a rational function with Euclidean degree d with input in R^M.
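The left-only slice extension for the autoregressive case can be sketched as below, assuming an extension width of (α-1)L/2 tokens taken from the left neighbor slice, zero-padded for the first slice. The helper name and toy shapes are ours, for illustration only.

```python
import numpy as np

def extend_slices_left(S, ext):
    """Extend each slice's keys/values with the last `ext` tokens of its left
    neighbor (zeros for the first slice), mirroring S_{AR,ext} in the text."""
    n_slices, L, D = S.shape
    # Shift slices right by one along the slice dimension, keeping only the
    # trailing `ext` tokens of each left neighbor.
    left = np.concatenate([np.zeros((1, ext, D)), S[:-1, L - ext:]], axis=0)
    return np.concatenate([left, S], axis=1)               # (N/L, ext + L, D)

S = np.arange(24, dtype=float).reshape(3, 4, 2)            # toy input: N/L=3, L=4, D=2
S_ext = extend_slices_left(S, ext=2)
```

The symmetric (bidirectional) extension adds an analogous right-shifted block on the other side, concatenated along the same token dimension.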

4.2. MULTI-SCALE VOLATILE INSTANT POSITIONAL EMBEDDING (MS-VIPE)

Since we reduce the input lengths of both the global and local attentions, a full positional embedding of the maximum input sequence length is no longer necessary. For the local attention, we can limit the positional embedding length to the attention range (i.e., the slice length L), sharing the embedding across slices. In addition, as we aggregate the tokens in each slice for the global attention, it is more natural to have a separate positional embedding of length N/L at the scale of the slice embedding, instead of aggregating the full-resolution positional embedding with the same process as the token embedding. The total number of parameters is (L + N/L)D, less than that of a conventional positional embedding, ND. To this end, CST uses two positional embeddings, P_l ∈ R^{1×L×D} and P_g ∈ R^{N/L×D}, applied at different scales, in a fashion similar to that used by (Han et al., 2021) but with a few crucial differences. First, instead of adding the positional embedding to the stacked slices of token embeddings at the embedding layer and letting positional information aggregate as the layers stack up (Press et al., 2021), CST applies them instantly at the corresponding attentions in each layer, before the linear transformations. Second, the positional embeddings are applied to the queries and keys, not to the values, to prevent accumulation of positional information in the sequence representations: if positional embeddings were added to values in every layer, they would accumulate over the layers and could undesirably dominate the contextual information at the top layers, potentially degrading performance. Our experiments in Section A.6.2 show that the multi-scale volatile instant positional embedding (MS-VIPE) is more effective than a conventional absolute full-length positional embedding.
The local and global attentions of Section 4.1 are rewritten as:

Y_l = softmax({(S + P_l) W_{l,q}} {(S + P_l) W_{l,k}}^⊤) S W_{l,v},

\bar{Y}_g = softmax({(\bar{S} + P_g) W_{g,q}} {(\bar{S} + P_g) W_{g,k}}^⊤) \bar{S} W_{g,v}.

For the extended local attention, we modify the corresponding positional embedding P_l to have the extended length αL, similarly to K_{l,ext}.
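The "volatile" aspect — positions added only when forming queries and keys, never entering the values — can be sketched for the local branch as follows (single head, random placeholder weights; the same pattern applies to the global branch with P_g):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention_with_vipe(S, P_l, W_q, W_k, W_v):
    """P_l (1, L, D) is shared across slices and injected only into Q and K,
    so no positional signal accumulates in the values / residual stream."""
    Q = (S + P_l) @ W_q                                    # positions enter queries
    K = (S + P_l) @ W_k                                    # ... and keys
    V = S @ W_v                                            # values stay position-free
    d = Q.shape[-1]
    return softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d)) @ V

rng = np.random.default_rng(3)
n_slices, L, D = 4, 8, 16
S = rng.standard_normal((n_slices, L, D))
P_l = rng.standard_normal((1, L, D))                       # length L, not length N
W_q, W_k, W_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
Y = local_attention_with_vipe(S, P_l, W_q, W_k, W_v)
```

Sharing one length-L table across all slices (plus a length-N/L table for the global branch) is what yields the (L + N/L)D parameter count noted above.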

4.3. COMPLEXITY AND PARAMETER REDUCTION IN CST

CST has linear plus decimated quadratic complexity O(NL + N^2/L^2), compared to the O(N^2) complexity of the standard Transformer. Moreover, over the practical range of sequence lengths, e.g., from a few hundred to a few tens of thousands, CST is comparably or more efficient than other efficient Transformers with linear complexity O(NM): typical choices of the abstraction length M range from 64 to 256, or higher for even longer sequences, while the slice length L for CST is typically smaller than M. This holds even with the additional cost of 3(N/L)d^2 for the query, key, and value transformations of the global attention, which is almost negligible. Furthermore, unlike most efficient Transformers, which can have similar or even higher complexity than the standard Transformer on short input sequences, CST retains its efficiency advantage in that regime as well. See Section A.4 for a more detailed practical complexity analysis.
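A back-of-the-envelope count makes the comparison concrete. The sketch below drops constants and the feed-forward cost, and the default L and M are illustrative values consistent with the ranges quoted above:

```python
def attention_costs(N, L=32, M=128):
    """Rough per-layer attention op counts (constants dropped)."""
    return {
        "transformer": N * N,                   # O(N^2) full attention
        "linear_M": N * M,                      # O(NM) abstractive/linear baselines
        "cst": N * L + (N // L) ** 2,           # O(NL + N^2/L^2) composite slice attention
    }

costs = attention_costs(4096)
```

At N = 4096 with L = 32 and M = 128, CST's count (4096·32 + 128² = 147,456) is well below both the linear baseline (524,288) and full attention (16.8M), and the ordering also holds at short lengths such as N = 256.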

5. EXPERIMENTS

To demonstrate the computational efficiency and sequence modeling capability of CST, we evaluate our model in three different contexts: (1) bidirectional long-range dependency modeling on classification tasks, (2) word-level autoregressive language modeling, and (3) masked language modeling and transfer learning to natural language understanding on short sequences. Throughout this section, unless stated otherwise, we train each model from random initialization and report the test performance of the model with the best validation result in each case.

5.1. BIDIRECTIONAL LONG-RANGE DEPENDENCY MODELING

Datasets and Baselines The Long Range Arena (LRA) benchmark (Tay et al., 2020b) is a suite of classification tasks that evaluate long-range dependency modeling capabilities on datasets from several different modalities, with lengths ranging from 1k to 16k. We evaluate CST on the five tasks broadly used in the literature: ListOps, Text, Retrieval, Image, and Pathfinder. We compare against cosFormer (Qin et al., 2021); a group of abstractive attention methods: Reformer (Kitaev et al., 2019), Linformer (Wang et al., 2020), Luna (Ma et al., 2021), and Nyströmformer (Xiong et al., 2021); another group of multi-scale multi-range attention methods: H-Transformer-1D (Zhu & Soricut, 2021), Long Short Transformer (Zhu et al., 2021), and Scatterbrain (Chen et al., 2021a); as well as the standard Transformer (Vaswani et al., 2017). We outline the experimental setup for the Long Range Arena in Appendix A.6.1. Our setup is favorable to the baseline Transformer in order to provide a fair condition for the challengers, which explains the difference in Transformer performance relative to previous work.

Results

Our experimental results on the LRA benchmark are given in Table 1. We observe that CST surpasses the state-of-the-art efficient Transformers by large margins. The best performance is achieved using a slice length of 8, outperforming the Transformer baseline by 4.23 points in average score across all tasks. The largest improvements are observed on the Text, Retrieval, and Image tasks, while minimal degradation is observed on the ListOps task. In addition, as discussed in Section 4.1, CST has the added benefit of requiring fewer parameters for a given model size thanks to its positional embedding, although this benefit becomes marginal as the network size increases. We also measure relative speed and relative GPU memory usage for each model under comparison on an NVIDIA Tesla V100 GPU with 32GB of memory on the Retrieval task with a sequence length of 4k, and report the results in Figure 2. Here, relative speed is computed as the number of training steps per second divided by that of the Transformer, and GPU memory usage is measured using the PyTorch call torch.cuda.max_memory_allocated(). It is clearly seen that CST outperforms all models while having computation and memory efficiency competitive with the other efficient Transformer models. We present an ablation study regarding the choice of positional embedding and aggregation method on the LRA benchmark in Section A.6.2. In summary, variants of CST consistently outperform all other efficient models (Table 7), except that switching the MSMRA composition order, i.e., attending in global-local order, performs slightly worse than Long Short Transformer (Zhu et al., 2021). We demonstrate that the composition of MSMRA in the local-global order and the proposed positional embedding are the components contributing most to CST's accuracy. While the best overall performance is achieved with L = 8, we find that the slice length L has less effect than other hyperparameters.
We also find that attention outperforms an MLP-based local token mixing scheme, and that mean pooling aggregation outperforms max pooling and linear projection (Table 8).

5.2. AUTOREGRESSIVE LANGUAGE MODELING

In this section, to evaluate its applicability as a language model, we conduct an autoregressive language modeling experiment on the word-level WikiText-103 dataset (Merity et al., 2016).

Dataset and Experimental Setup

The WikiText-103 (Merity et al., 2016) dataset is a collection of featured articles on Wikipedia. It has 103M/218K/246K words in training/validation/test sets, respectively. The task in this experiment is to predict the next word given the history of past words. We follow the experimental setup in (Nguyen et al., 2021) and that in (Chen et al., 2021a) for the context window length 256 and 1024, respectively, and train CSTs to compare the results with their reported results. While the dataset allows for a larger context, we used the 256 and 1024 context window lengths to match the baselines. CST also uses the causal mask and slice embedding described in Section 4.1. More discussion of the experimental setup can be found in Appendix A.7.1.

Results

We report the perplexities of the best-performing variant of CST (L = 32, α = 3) alongside other state-of-the-art efficient Transformer models in Table 2 and Table 4. CST outperforms the other efficient Transformers, including the kernel-based linear-complexity models Linear Transformer (Katharopoulos et al., 2020) and Performer (Choromanski et al., 2020), as well as FMMformer, while being comparable to Scatterbrain (Chen et al., 2021a); note that the latter two are MSMRA-based models. In Table 3, we provide an ablation study demonstrating the impact of L and α on validation and test perplexity. We observe that a longer local attention length leads to better perplexities while remaining much shorter than the context window length; e.g., CST with L = 32 and α = 3 has a local window of length 64, which is 4 and 16 times smaller than the context window lengths 256 and 1024, respectively. We believe that addressing the missing global context, as discussed in A.9, together with a more thorough hyperparameter search, would further improve the perplexity. We also observe consistent 1.5x speed-ups across configurations compared to the standard Transformer in Table 3.

Additional Experiment on PG-19

We additionally conduct an experiment on the PG-19 dataset (Rae et al., 2019). We use various combinations of sequence length N, slice length L, and extension ratio α to match the attention complexity of CST to those of Transformers with N = 256 and 512. The validation and test perplexities, with discussion, can be found in A.7.2; we observe that CST consistently outperforms the Transformer counterparts with the same attention complexity.

5.3. BIDIRECTIONAL LANGUAGE MODELING AND TRANSFER LEARNING

We further evaluate CST on bidirectional language modeling and transfer learning on the GLUE benchmark with relatively short input sequences, i.e., N = 128. Masked language modeling (MLM) was proposed in (Devlin et al., 2018) as a pretraining method for Transformer encoder models, and it greatly improves downstream natural language understanding (NLU) performance when the pretrained models are fine-tuned. We follow the experimental setup of (Devlin et al., 2018) for both pre-training and fine-tuning, including datasets, masking rate, batch size, optimizer settings, and evaluation metrics, with a few exceptions. We report the MLM validation perplexities and GLUE scores, with accuracy on each task, for a range of L and α in Table 5. Further experimental details, additional results, and discussion can be found in A.8. CST consistently outperforms other efficient Transformers and closes the gap to the baseline BERT model by 0.46 validation perplexity and 2.45 points in GLUE score.

6. CONCLUSION

In this paper, we present Composite Slice Transformer (CST), an efficient Transformer-based network equipped with a composition of multi-scale multi-range attentions. Using a stacked slice representation of input sequences, a CST layer performs a pair of local fine-grained attention and global coarse-grained attention in a sequential manner at a low complexity cost of O(NL + N^2/L^2). In addition to the reduced complexity, we also show that CST improves performance on various sequence modeling tasks. On the Long Range Arena, word-level autoregressive language modeling on WikiText-103, masked language modeling, and natural language understanding benchmarks, CST significantly surpasses strong baselines including recently proposed efficient Transformers, and in several cases the standard Transformer. We highlight limitations and potential directions for future work in A.9.

A APPENDIX

A.1 ABSTRACTIVE ATTENTIONS

Abstractive attentions are the family of efficient attention approaches in which the lengths of the attention operands are reduced to M (< N) by applying an abstraction function, reducing the complexity of the attention while retaining the form of the basic attention computation in Section 3.1. Abstractive attentions can be further categorized as either resolution preserving or non-preserving, according to which operands are chosen to be abstracted: abstracting the queries produces an abstracted (reduced-length) output, making the attention resolution non-preserving. The appropriate category is determined by the requirements of the given task. For example, tasks such as language modeling and machine translation require the full resolution to be retained at the output. In such cases, it is common to abstract only the keys and values while retaining the query resolution; the abstractive attention can then be expressed as

y_i = softmax(q_i \bar{K}^⊤) \bar{V}, \bar{K} = [\bar{k}_0, ..., \bar{k}_{j′}, ..., \bar{k}_{M_k}]_1, \bar{k}_{j′} = ϕ({k_j : j ∈ Ω_{j′}}),

where Ω_{j′} denotes the abstraction range with cardinality |Ω_{j′}| = M_k for the j′-th key abstraction \bar{k}_{j′}, and ϕ(·) : K_{Ω_{j′}} ∈ R^{|Ω_{j′}|×d_h} → \bar{k}_{j′} ∈ R^{1×d_h} is a many-to-one abstraction function. The abstracted values \bar{V} can be expressed similarly. The queries Q can likewise be abstracted to \bar{Q} as \bar{q}_{i′} = ϕ_q({q_i : i ∈ Ω_{i′}}), where \bar{q}_{i′} is the i′-th row of \bar{Q}. The attention mechanism then reduces to

\bar{y}_{i′} = softmax(\bar{q}_{i′} \bar{K}^⊤) \bar{V},

where an attention output vector \bar{y}_{i′} is obtained at each abstract position i′. In order to restore the resolution of the output, we define a one-to-many mapping function ψ_y such that {y_i : i ∈ Ω_{i′}} = {ψ_y(\bar{y}_{i′})}_i. Resolution non-preserving abstraction is often used for tasks where a full-resolution output is not needed, such as sequence-level classification.
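The resolution-preserving variant — abstracted keys and values, full-length queries — can be sketched as follows, assuming mean pooling over equal non-overlapping ranges Ω_{j′} as the abstraction function (one of several choices mentioned in the text):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resolution_preserving_attention(Q, K, V, M_k):
    """Abstract only K and V down to M_k positions; Q keeps full length N,
    so the output stays at full resolution with O(N * M_k) cost."""
    N, d = K.shape
    w = N // M_k                                           # tokens per abstraction range
    K_bar = K.reshape(M_k, w, d).mean(axis=1)              # ϕ over each Ω_{j'}
    V_bar = V.reshape(M_k, w, d).mean(axis=1)
    return softmax(Q @ K_bar.T / np.sqrt(d)) @ V_bar       # (N, d): no psi needed

rng = np.random.default_rng(4)
N, d = 32, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
Y = resolution_preserving_attention(Q, K, V, M_k=8)
```

Contrast this with the resolution non-preserving case, where Q is also abstracted and a ψ mapping must restore the output length.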
However, in some cases, with additional processing leveraging representations at a lower layer, e.g., cross attention with input tokens, it is possible to restore the resolution (Dai et al., 2020; Jaegle et al., 2021) .

A.2 MULTI-SCALE MULTI-RANGE ATTENTIONS (MSMRA)

Although many previous abstractive attention approaches achieve sub-quadratic or linear complexity, they typically come at the cost of degraded benchmark performance. However, Transformer-based models that leverage multi-scale attention by combining local and global attention perform competitively against the standard Transformer; in fact, these models can outperform the standard Transformer on some benchmarks while still maintaining efficiency (Zhu & Soricut, 2021; Zhu et al., 2021; Nguyen et al., 2021; Ren et al., 2021). The local attention, also known as sliding window attention, limits the attention range to the vicinity of the query location. That is, with the whole sequence as the abstraction range and a query-location-dependent abstraction function, the key abstraction becomes

K_{l,i} = \phi^{sliding}_{k,i}(K) = K \odot (H(i - j + w/2) - H(i - j - w/2))

for the i-th query token, where H is the Heaviside step function, w is the window length, and \odot is the element-wise product. This results in the following attention:

y_{l,i} = softmax(q_i K_{l,i}^\top) V_{l,i}.

For better computational efficiency, the block-wise key abstraction in Eq. 22 can be adopted as

K_{l,i} = \phi^{block}_{k,i}(K) = K \odot (H(t_i - j + w/2) - H(t_i - j - w/2))

for a block-wise attention, where t_i = (b - 1/2)w for the block index b such that (b - 1)w \le i < bw. For the global attention, abstractive attention with either positional abstractions (Zhu & Soricut, 2021; Yang et al., 2021; Ren et al., 2021) or contextual abstractions (Ma et al., 2021; Zhu et al., 2021) can be employed. The former can be loosely seen as analogous to the patch embedding in ViT (Dosovitskiy et al., 2020). MSMRAs can also be categorized according to how the two attentions are combined.
One approach concatenates the abstractions of multi-scale keys and values for a single attention operation (Zhu & Soricut, 2021; Zhu et al., 2021; Yang et al., 2021):

y_i = softmax(q_i [K_{l,i}; K_g]^\top) [V_{l,i}; V_g],

while another separates the attentions at different scales and combines their outputs (Han et al., 2021), possibly with some weighting coefficients:

y_i = y_{l,i} + \psi_y(y_{g,i}).

In the latter case, other non-attentive methods, such as kernel methods (Nguyen et al., 2021), can also be used for the global attention. CST belongs to the latter of the two approaches, where the local and global attentions are performed separately and their outputs are combined. This is closely related to Transformer-in-Transformer (TNT) (Han et al., 2021); however, since TNT has a path for the local attention that is independent of the global attention, information exchange between patches is not allowed for the features in the local attention path. Unlike TNT, the composition of multi-granular attentions in CST enables two-way information passing. This is more suitable for modeling highly non-stationary data, such as natural language text, where the locality assumption does not hold.
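The composition-style combination can be sketched schematically in pure Python: fine-grained attention within each length-L slice, a coarse attention over the mean-pooled local outputs (not the raw input, which is what distinguishes composition from the concatenation approach), and a one-to-many broadcast psi back to full resolution. This is an illustration under simplifying assumptions (single head, no projections, no positional embedding), not the paper's exact CSA.

```python
import math

def attend(Q, K, V):
    # Dense softmax attention: one output row per query row.
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        es = [math.exp(s - m) for s in scores]
        z = sum(es)
        w = [e / z for e in es]
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

def composed_msmra(X, L):
    # Composition-style MSMRA (schematic): local attention within each
    # length-L slice, then global attention over the mean-pooled *local
    # outputs*, broadcast back and added: y = y_l + psi(y_g).
    slices = [X[s:s + L] for s in range(0, len(X), L)]
    y_local = [row for sl in slices for row in attend(sl, sl, sl)]
    pooled = []
    for s in range(0, len(y_local), L):
        block = y_local[s:s + L]
        pooled.append([sum(r[j] for r in block) / L for j in range(len(X[0]))])
    y_global = attend(pooled, pooled, pooled)
    # psi: broadcast each global row to its L positions, then sum.
    return [[a + b for a, b in zip(y_local[i], y_global[i // L])]
            for i in range(len(X))]
```

Because the global attention consumes the local attention output, information can flow from tokens into slice summaries and back, which a TNT-style independent local path does not allow.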

A.3 ARCHITECTURE OF COMPOSITE SLICE TRANSFORMER NETWORK

We present the overall architecture of CST alongside the detailed CSA architecture in Figure 3. CST consists of a fine-grained local attention with computational complexity O(NL) and a coarse-grained global attention with complexity O(N^2/L^2). A shared multi-scale positional embedding, MS-VIPE, is applied to all layers. CSA replaces the self-attention sublayer in each Transformer block to form a Composite Slice Transformer (CST) network. The two sets of W_q, W_k, W_v transformations in CST could also affect the parameter count. While TNT (Han et al., 2021) reduces the hidden dimension for local attention to limit the growth of the network size, we keep the same dimension for representations at different scales and share the weights between the local and global attentions, resulting in no increase in parameter count.
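The parameter-count argument can be made concrete with simple arithmetic. The sketch below counts only the q/k/v projection weights of one attention sub-layer (biases and the output projection are omitted for simplicity); with weight sharing between the local and global attentions, CST keeps the same count as a single-scale attention, whereas unshared two-scale weights would double it.

```python
def attn_param_count(d_model, shared=True):
    # Parameters of the W_q, W_k, W_v projections in one attention
    # sub-layer (biases and output projection omitted for simplicity).
    per_scale = 3 * d_model * d_model
    # Sharing one set of weights across the local and global attentions
    # keeps the count unchanged; separate sets would double it.
    return per_scale if shared else 2 * per_scale
```

For example, at d_model = 256 the shared configuration has 196,608 projection weights, identical to a single-scale attention, while an unshared two-scale design would need 393,216.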

A.4 CONSIDERATION ON PRACTICAL EFFICIENCY AND EFFECTIVENESS OF CST

In this section, we provide an intuition for how and why a multi-scale multi-range attention with O(NL + N^2/L^2) complexity can be a better alternative to the vanilla self-attention than abstractive attentions with linear complexity, in terms of both complexity and modeling capability. The improved expressiveness arising from their composition is further discussed in Section A.5. A linear complexity is easily assumed to be more efficient than a quadratic one. However, a linear complexity often carries another variable: the abstraction length M in the case of abstractive attention, as in (Wang et al., 2020; Ma et al., 2021), resulting in O(NM) complexity. The quadratic complexity O(N^2) is obviously higher than the linear complexity whenever the abstraction length M is smaller than the sequence length, which is mostly true by the definition of abstraction. However, when a notion of sequence decimation comes into play, we reach a somewhat different conclusion. Consider an attention over an abstracted sequence of patch embeddings as in (Dosovitskiy et al., 2020) or slice embeddings as in CST. The complexity of the abstractive attention then becomes a decimated quadratic O(N^2/L^2). If, in addition, a full-resolution local attention is used as a multi-scale multi-range attention, as in CST or (Han et al., 2021), the complexity becomes O(N^2/L^2 + N/L \cdot L^2) = O(N^2/L^2 + NL). We plot the comparison of the linear complexity O(NM) and O(N^2/L^2 + NL) in Figure 4a with several practical choices of the number of abstractions M and the decimation ratio (e.g., slice length) L. One can see that a decimated-quadratic-complexity attention can be more efficient than a linear-complexity attention when the sequence length is less than a few tens of thousands, which covers the practical range of sequence lengths in many tasks and data types.
Figure 4b shows the effective number of tokens per abstraction for both cases with the same choices of M and L. While decimation (e.g., slicing) keeps the number of tokens in each abstraction constant, linear-complexity attention methods that use a fixed number of abstractions have an effective number of tokens per abstraction that grows linearly with the sequence length. Given a fixed hidden dimension, a larger effective number of tokens per abstraction requires the abstraction process to compress more information, resulting in loss of potentially important information and negatively affecting the modeling capability. Since the complexity of CST has a decimated quadratic term N^2/L^2, its efficiency benefit over linear-complexity methods shrinks as the sequence length grows, and the efficiency order is eventually reversed, as can be seen in Figure 4a. However, we argue that our model can still be beneficial in two respects. First, it is still more efficient than the standard Transformer. Second, given the discussion in this section on the effective number of tokens per abstraction, it is possible to find a set of hyperparameters, such as the slice length L and the extension ratio α, that makes CST comparably efficient to, and more effective in modeling capability than, its linear-complexity counterparts.
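The crossover described above can be checked numerically. The sketch below uses M = 128 and L = 16 as example choices (hypothetical values in the practical range discussed in the text, not the paper's exact plot settings) and compares the two cost models for a few sequence lengths.

```python
def linear_attn_cost(n, m=128):
    # O(NM): abstractive attention with a fixed number of abstractions M.
    return n * m

def decimated_cost(n, l=16):
    # O(N^2/L^2 + NL): coarse global attention over N/L slice
    # abstractions plus fine-grained local attention within slices.
    return n * n // (l * l) + n * l

for n in (1_024, 4_096, 16_384, 65_536):
    print(n, linear_attn_cost(n), decimated_cost(n))
```

With these constants, the decimated quadratic is cheaper up to roughly N ≈ 2.9 × 10^4 (solving N/L^2 + L = M gives N = (M - L)L^2), after which the linear method wins, consistent with the "few tens of thousands" crossover stated above.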

A.5 IMPROVED EXPRESSIVENESS OF CST SLICE ATTENTION

Recall from Eq. 10 that the attention mechanism under consideration takes the form Y = Y_l + \psi(Y_g), where Y_l represents local attention, Y_g represents global attention, and \psi(\cdot) is a (linear) one-to-many map. In this section, we compare CST with the H-transformer described in Zhu & Soricut (2021). In particular, we argue that, for a given number of weights (degrees of freedom), our proposed attention mechanism, which involves the composition of attention mechanisms, is more expressive than the H-matrix-based approach of Zhu & Soricut (2021), which shares similar characteristics with CST but does not involve composition; this explains our superior prediction performance (Section 5). To explain why this is so, we first define

\gamma(X; W_q, W_k, W_v) = softmax((X W_q)(X W_k)^\top)(X W_v),   (26)

which is the standard attention mechanism found in Eq. 2. Here, X is generally a matrix or a 3-tensor; in the latter case, we apply attention independently to each matrix slice along the third dimension. Expressing our local and global attention using this notation leads to

Y^{CS}_l = \gamma(S; W_{l,q}, W_{l,k}, W_{l,v}) \in R^{N/L \times L \times D},   (27)
Y^{CS}_g = \gamma(Y^{CS}_l \times_2 \mu; W_{g,q}, W_{g,k}, W_{g,v}) \in R^{N/L \times D},   (28)

where \mu = [1/L, 1/L, \ldots, 1/L]^\top is a fixed vector of mask values, S corresponds to the stack of L-length slices of X described in Section 4.1, and \times_2 indicates a tensor product along the second dimension, i.e.,

[Y^{CS}_l \times_2 \mu]_{ij} = \sum_{k=1}^{L} [Y^{CS}_l]_{ijk} \mu_k,   (29)

which corresponds to mean pooling within each slice. For the purpose of comparison, we focus on a 2-level H-transformer where each element at level 1 has L children at level 0. In this case, it can be shown that the H-transformer approximation takes the form of Eq. 25 with
Y^H_l = \gamma(S; W_{l,q}, W_{l,k}, W_{l,v}) = Y^{CS}_l \in R^{N/L \times L \times D},   (30)
Y^H_g = \gamma(S \times_2 \mu; W_{g,q}, W_{g,k}, W_{g,v}) \in R^{N/L \times D}.   (31)

Note that the main difference between the H-transformer and CST lies in Eq. 31 versus Eq. 28: in the former, the argument to the global attention is S \times_2 \mu, whereas in the latter it is Y^{CS}_l \times_2 \mu.

Footnote 2. In practice, it is common to have equal weights at the local and global scales, i.e., W_{l,\cdot} = W_{g,\cdot}. This is a special case of the analysis presented here and changes the conclusion in no significant way.
Footnote 3. More levels have little impact on our conclusion, since H-transformers ultimately exhibit the same level of nonlinearity in the weights whatever the number of levels.
Footnote 4. Refer to Eq. (29) in Zhu & Soricut (2021) for two levels, i.e., Y = Y^{(0)} + P^{(0)} \tilde{Y}^{(1)}. In this case, P^{(0)} corresponds to the one-to-many map \psi(\cdot), whereas \tilde{Y}^{(1)} = \tilde{A}^{(1)}(R^{(1)\top} V) corresponds to applying the level-1 matrix \tilde{A}^{(1)} to the vector V after averaging over slices (i.e., applying R^{(1)\top}). In this context, this is analogous to applying a global attention to the averaged slice vector, namely S \times_2 \mu.

To demonstrate our claim, let us first introduce some quantities of interest, starting with the families of parameterized functions that the aforementioned attention mechanisms encompass. First, assume without loss of generality that all the weights are limited to the interval [-1, 1]. Then, any attention function learned during the training process must belong to the following sets:

\mathcal{Y}^{CS}_l := \{ Y \in R^{N/L \times L \times D} : Y_{ijk} = \gamma_{ijk}(S; W_{l,q}, W_{l,k}, W_{l,v}) \ \forall i, j, k \},   (32)
\mathcal{Y}^{CS}_g := \{ Y \in R^{N/L \times D} : Y_{ij} = \gamma_{ij}(\sigma(S) \times_2 \mu; W_{g,q}, W_{g,k}, W_{g,v}), \ \sigma(\cdot) \in \mathcal{Y}^{CS}_l \},   (33)

in the case of CST, and

\mathcal{Y}^H_l := \{ Y \in R^{N/L \times L \times D} : Y_{ijk} = \gamma_{ijk}(S; W_{l,q}, W_{l,k}, W_{l,v}) \ \forall i, j, k \} = \mathcal{Y}^{CS}_l,   (34)
\mathcal{Y}^H_g := \{ Y \in R^{N/L \times D} : Y_{ij} = \gamma_{ij}(S \times_2 \mu; W_{g,q}, W_{g,k}, W_{g,v}) \},   (35)

in the case of the H-transformer.
We want to show that \mathcal{Y}^{CS}_g is a more expressive set than \mathcal{Y}^H_g, which explains why CST exhibits better empirical prediction performance than its competitors. To this end, let us introduce the following quantities.

Definition 1. Let p(x), q(x) be polynomials of Euclidean degree d from R^M to R. We denote by S_{M,d} the set of rational functions with numerator and denominator of Euclidean degree at most d, i.e.,

S_{M,d} := \{ p(x)/q(x) : deg(p), deg(q) \le d, \ x \in R^M \},   (36)

imbued with the metric

d( p(x)/q(x), a(x)/b(x) ) := ||p(x) - a(x)||_\infty + ||q(x) - b(x)||_\infty,   (37)

where ||f(x)||_\infty = max_{x \in [-1,1]^M} |f(x)|. We also introduce S_{M,\infty} to indicate the space of ratios of real analytic functions, which is a vector space. Further, given any fixed positive \epsilon \in R, we define

S^\epsilon_{M,d} := \{ p(x)/q(x) \in S_{M,\infty} : d( p(x)/q(x), S_{M,d} ) \le \epsilon \}.   (38)

To proceed, we need the following result, adapted from Trefethen (2017).

Theorem 1. Let f(x) be an analytic function from R^M to R. Then for every fixed \epsilon > 0, there exists a polynomial p(x) of Euclidean degree d(\epsilon) = O(log(\epsilon)) such that

||f(x) - p(x)||_\infty = max_{x \in [-1,1]^M} |f(x) - p(x)| \le \epsilon.   (39)

We are now ready to demonstrate our main result.

Proposition 2. For any fixed \epsilon > 0, there exists some Euclidean degree d(\epsilon) = O(log(\epsilon)) such that

\mathcal{Y}^H_g \subseteq S^\epsilon_{M, d(\epsilon)}  and  \mathcal{Y}^{CS}_g \subseteq S^\epsilon_{M, 4 d(\epsilon)},   (40)

where M corresponds to the total number of weights.

Proof. First, recall that (assuming a single head) the parameterized function \gamma(\cdot) from Eq. 26 takes the following form:

\gamma_{ij}(X; W_q, W_k, W_v) = ( \sum_k e^{[(X W_q)(X W_k)^\top]_{ik}} [X W_v]_{kj} ) / ( \sum_{m,n} e^{[(X W_q)(X W_k)^\top]_{mn}} ).   (41)

This should be recognized as the ratio of two analytic functions (because the exponential is analytic), parameterized by the elements of the weight matrices.
Following Theorem 1, for every fixed set of weights W_g = [W_q, W_k, W_v], there exist polynomials \{p^{W_g}_{ij}(x)\}, \{q^{W_g}_{ij}(x)\} of Euclidean degree d(\epsilon) = O(log(\epsilon)) such that

max_{x \in [-1,1]^M} | \sum_k e^{[(X W_q)(X W_k)^\top]_{ik}} [X W_v]_{kj} - p^{W_g}_{ij}(x) | \le \epsilon/2,   (42)
max_{x \in [-1,1]^M} | \sum_{m,n} e^{[(X W_q)(X W_k)^\top]_{mn}} - q^{W_g}_{ij}(x) | \le \epsilon/2.   (43)

This means that d( \gamma_{ij}(X; W_q, W_k, W_v), p^{W_g}_{ij}(x)/q^{W_g}_{ij}(x) ) \le \epsilon and that \gamma_{ij}(\cdot; W_q, W_k, W_v) \in S^\epsilon_{M, d(\epsilon)}. In particular, this shows that \mathcal{Y}^H_g \subset S^\epsilon_{M, d(\epsilon)}, since every element of \mathcal{Y}^H_g has the form of Eq. 41. A similar conclusion can be reached for \mathcal{Y}^H_l and \mathcal{Y}^{CS}_l, since elements of these families have the same functional form as those of \mathcal{Y}^H_g. All that remains is \mathcal{Y}^{CS}_g. In this case, it suffices to note that polynomial composition creates polynomials with Euclidean degree bounded by the sum of the degrees of the polynomials involved in the composition. Indeed, the above analysis shows that elements of \mathcal{Y}^{CS}_g are within \epsilon distance from rational functions of the form

p^{W_g}_{ij}( (1/L) \sum_{k=1}^{L} p^{W_l}_{ijk}(x) / q^{W_l}_{ijk}(x) ) / q^{W_g}_{ij}( (1/L) \sum_{k=1}^{L} p^{W_l}_{ijk}(x) / q^{W_l}_{ijk}(x) ),   (44)

for some polynomials \{p^{W_l}_{ijk}(x)\}, \{q^{W_l}_{ijk}(x)\} of degree d(\epsilon). Upon expanding the polynomials and expressing terms over a common denominator, this expression should be recognized as the ratio of two polynomials, each of degree 4 d(\epsilon), following polynomial composition. Thus, we conclude that \mathcal{Y}^{CS}_g \subset S^\epsilon_{M, 4 d(\epsilon)}. This proves our claim. In other words, Proposition 2 shows that our proposed family of attention mechanisms belongs to S^\epsilon_{M, 4 d(\epsilon)}, which possesses more expressive power (since 4 d(\epsilon) > d(\epsilon)) than the family S^\epsilon_{M, d(\epsilon)} to which global attention mechanisms without composition, such as H-transformers (Y^H_l + \psi(Y^H_g)), belong.
We claim that this is the reason why our proposed approach performs better than the approach of Zhu & Soricut (2021) and other similar mechanisms in which linear combinations of attentions, rather than compositions, are used. In other words, for the same number of parameters (weights), our proposed approach can capture more complex attentive interactions than that of Zhu & Soricut (2021) and related mechanisms, which leads to a richer set of attentions over which to train the network and, ultimately, better performance.

A.6.1 EXPERIMENTAL SETUP

We follow the experimental setup described in (Xiong et al., 2021), with a few exceptions in hyperparameters. Specifically, we use the same Transformer encoder backbone, consisting of 2 Transformer layers with pre-norm (Xiong et al., 2020), 4 attention heads, and model dimension 256 across all tasks. We then replace the self-attention sub-layer with efficient attention alternatives, including the slice attention of CST. To obtain label outputs for classification, we additionally use a classification head with 2 fully-connected layers of the same hidden dimension as the feed-forward network in the backbone and a ReLU layer between them. We aggregate the output sequence from the encoder by mean pooling across the length dimension and feed it to the classification head. For a fair comparison between models with different characteristics, we optimize the learning rate for the baseline Transformer model on each task while keeping all other hyperparameters fixed. This yields a strong baseline, which is advantageous for the standard Transformer. We then evaluate each model with a learning rate search within a small bracket [0.5 l_0, l_0, 2 l_0], where l_0 is the base learning rate providing the highest validation accuracy for the Transformer on each task, and report the test accuracy of the model trained with the learning rate that gives the highest validation accuracy within the bracket. We note that our experiments have stronger baselines than those found in the literature (Tay et al., 2020b; Xiong et al., 2021), as highlighted in Table 1. For the other efficient Transformer models, we perform a model-specific hyperparameter search as described below and report the test accuracy of the best models.
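The bracket search protocol above can be sketched in a few lines. This is an illustration of the selection rule only; `val_acc_by_lr` and its accuracy values are hypothetical placeholders, not measured numbers.

```python
def pick_learning_rate(val_acc_by_lr, base_lr):
    # Evaluate the bracket {0.5*l0, l0, 2*l0} around the base rate l0
    # and keep the rate with the highest validation accuracy.
    bracket = [0.5 * base_lr, base_lr, 2 * base_lr]
    return max(bracket, key=lambda lr: val_acc_by_lr[lr])

# Hypothetical validation accuracies for one task:
accs = {5e-4: 0.61, 1e-3: 0.64, 2e-3: 0.62}
best = pick_learning_rate(accs, base_lr=1e-3)
```

The same rule is applied per model and per task; only the test accuracy of the bracket winner (chosen on validation data) is reported.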
For the Long Range Arena benchmark, we performed experiments on our PyTorch implementation of the original open-source JAX/Flax code (Tay et al., 2020b), following its dataset preparation procedure, verifying the results, and implementing our model on top of it. Since we found that performance on LRA tasks has relatively high variance, especially on certain tasks such as ListOps, we first performed a hyperparameter search with the standard Transformer, starting from the base setup in (Xiong et al., 2021). The search space covered the number of layers {2, 4}, the number of attention heads {2, 4}, the model dimension {128, 256}, and the learning rate {1, 2, 5} × 10^{-3,-4,-5,-6}. In addition, for better convergence, we increased the number of training steps on some tasks. The hyperparameters determined by the search can be found in Table 6. As this step is essentially hyperparameter tuning for the baseline, our baseline Transformer is stronger than those reported in (Tay et al., 2020b) and (Xiong et al., 2021); we compare them in Table 1. While fixing all other hyperparameters, we performed a learning rate sweep using the bracket described in Section 5.1. For each model and task, we pick the best model on the validation dataset with early stopping and report its accuracy on the test dataset. For model training, we used the AdamW optimizer (Loshchilov & Hutter, 2017) with linear learning rate warmup and decay but without weight decay; the other training hyperparameters are shown in Table 6. Since each efficient Transformer has its own hyperparameters that can affect performance, we tried a few different values around the default values used in each paper.
Specifically, they are: the number of random features {64, 128, 256} in Performer (Choromanski et al., 2020); the projection length {64, 128, 256} in Linformer (Wang et al., 2020) and Luna (Ma et al., 2021); the number of landmarks {64, 128, 256} in Nyströmformer (Xiong et al., 2021); the number of hashes {2, 4} in Reformer (Kitaev et al., 2019); the numerical rank of off-diagonal blocks {16, 32, 64, 128} in H-Transformer-1D (Zhu & Soricut, 2021); and the local window segment size {8, 16, 32} in Long-Short Transformer (Zhu et al., 2021). For Long-Short Transformer, we optionally used an additional 1D depth-wise convolution with kernel size 35, as described in (Zhu et al., 2021). For Scatterbrain (Chen et al., 2021a), we follow the default settings for the model-specific hyperparameters as in the official code release. Specifically, for the Text task, we use a query/key cluster size of 16 and 16 features; for the ListOps task, a query/key cluster size of 32 and 32 features; for the Image and Pathfinder tasks, a cluster size of 16 and 16 features; and for the Retrieval task, a query/key cluster size of 64 and 64 features. Additionally, we use 2 hashes for all tasks. While each option produces different performance at different computation and memory costs, we found that the variants of each model have similar accuracy; we therefore report only the best accuracy among them and the corresponding complexity, as comparing them is out of the scope of this paper.

A.6.2 ABLATION STUDY ON LRA

We investigate the effect of each component in CST by performing an ablation study. We set a slice length of 16 without extension for the local attention as the base configuration. We then train a CST variant from scratch with one component changed at a time. Specifically, we consider variations in: 1) slice length, 2) extended local attention, 3) positional embedding, and 4) composition of MSMRA. The test accuracies of the variants are given in Table 7.
We note that all variants of CST consistently outperform the other models, except for one case of the global-first composition.

Slice Length

First, we vary the slice length L and test how the local attention range and the abstraction length of the global attention affect accuracy. Intuitively, one can expect a trade-off between fine-grained token interaction modeling and coarse-grained global context capturing, since in the extreme cases either the local attention or the global attention converges to the standard self-attention. While each task has its own best L, shorter slice lengths tend to give better overall accuracy. One possible interpretation is that L = 8 or 16 lets the local attention provide better-formed tokens to the global attention in a dynamic manner.

Local Attention Extension Ratio

While slight improvements are observed on the Retrieval and Image tasks, contrary to our expectation, extended local attention does not always improve accuracy on the LRA tasks. We conjecture this is because the LRA datasets have tokens of very fine granularity, i.e., byte-level or pixel-level, so that the abstraction of the extended local attention output may carry redundant information into the global attention. We therefore evaluate the effect of the extension ratio on a dataset with coarser-grained tokenization in Section 5.2.

Position Embedding

We compare MS-VIPE with the conventional learned positional embedding applied at the bottom embedding layer of the network. MS-VIPE shows consistent improvement over the conventional positional embedding, except for the ListOps task, where the fine-grained order of each token is essential.

Composition of MSMRA

Instead of the proposed composition of MSMRA, which has a serial connection of MSMRA in the local-global order, we try two different connections, i.e., parallel and global-local. Both degrade accuracy compared to the local-global composition, supporting the effectiveness of our design.

Local Modeling with MLP

As shown in (Tolstikhin et al., 2021) and (Liu et al., 2021a), a multi-layer perceptron (MLP) can effectively replace attention without loss of accuracy in some applications. In CST, since the slice length is fixed and relatively short, the use of an MLP instead of attention can be considered another source of efficiency at a small increase in parameters. We evaluate a variant of CST with a 2-layer MLP replacing the local attention, with 2 different hidden dimensions d_hidden = 16 or 32, while L = 16 and α = 1. An MLP applied to each slice is equivalent to a 1D convolution with the kernel size and the stride both set to the slice length. We observe that using attention for the local modeling module is still preferred in terms of accuracy.

Slice aggregation

We further evaluate aggregation methods other than the simple mean pooling. Similarly to (Rae et al., 2019), we try max pooling and a linear projection. Again, a linear projection is equivalent to a 1D convolution with the kernel size and stride set to L. While showing comparable results overall, both lead to degraded performance on the Image and Pathfinder tasks. One might expect better accuracy from the linear projection, as it is a general form of the other pooling methods, but this result can be interpreted as a limitation of a simple linear projection for modeling sequence translations. Since mean pooling already generalizes well, a linear projection with better initialization would be expected to converge to the mean pooling or slightly better.
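The claim that a linear projection generalizes the pooling methods can be verified directly: with uniform weights 1/L, slice-wise linear projection reproduces mean pooling exactly. A minimal pure-Python sketch (the weights and example slice are illustrative):

```python
def mean_pool(slice_rows):
    # Average the L rows of a slice into a single vector.
    d = len(slice_rows[0])
    n = len(slice_rows)
    return [sum(r[j] for r in slice_rows) / n for j in range(d)]

def linear_projection_pool(slice_rows, weights):
    # Aggregation by linear projection over the slice; equivalent to a
    # 1D convolution with kernel size and stride both equal to L.
    d = len(slice_rows[0])
    return [sum(w * r[j] for w, r in zip(weights, slice_rows))
            for j in range(d)]

L = 4
sl = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
uniform = [1.0 / L] * L
```

With `uniform` weights the two aggregations coincide; learned (non-uniform) weights give the projection strictly more freedom, which is why better initialization could let it converge to mean pooling or slightly better.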

A.7.1 EXPERIMENTAL DETAIL OF AUTOREGRESSIVE LANGUAGE MODELING

For the autoregressive language modeling experiment, we use an open-source language modeling framework (Schlag et al., 2021) and plug in our CST implementation. We follow the full training setup, including dataset preparation and hyperparameters, given in (Dai et al., 2019a), and evaluate with the small-sized network configuration: 16 layers, 8 attention heads, model dimension 128, and hidden dimension 2,048. We train all models for 500K steps with a batch size of 96 using the Adam optimizer (Kingma & Ba, 2014) and a cosine annealing learning rate scheduler with 2,000 warmup steps and a base learning rate of 2.5 × 10^{-4}. We also conduct autoregressive language modeling experiments with sequence length 1024. Here, we roughly follow the experimental setup used in (Chen et al., 2021a). The larger model has model dimension 512 instead of 128, and we change the learning rate to 5 × 10^{-4}, the number of steps to 90K, and the batch size to 32, keeping all remaining hyperparameters the same as for the small model. Different choices of the slice length L and the extension ratio α can result in the same local attention range. For instance, the combination L = 8, α = 3 has the same key/value lengths as the L = 16, α = 1 setting, i.e., (α + 1)L/2 = 16 in both cases. However, since the abstraction length and the resulting granularity of the global attention depend only on L, these hyperparameter choices can affect performance differently. We examine how performance is affected by this configuration. To this end, we evaluate CST with combinations of slice lengths {8, 16, 32} and extension ratios α ∈ {1, 2, 3}. To ensure that the current query token has no interaction with future key tokens, we design the causal mask and local attention extension scheme to better suit CSA, as described in Section 4.1, and use it throughout this experiment.
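The overlap between (L, α) settings can be made explicit. The sketch below assumes the per-query key/value span (α + 1)L/2 implied by the 0.5(α + 1)NL term in the attention complexity expression used later for PG-19; under that assumption, L = 8 with α = 3 and L = 16 with α = 1 attend over the same number of keys, while the number of global abstractions N/L still differs.

```python
def local_kv_length(L, alpha):
    # Effective key/value length of the extended local attention per
    # query, assuming the 0.5 * (alpha + 1) * N * L complexity term.
    return (alpha + 1) * L // 2

def num_abstractions(N, L):
    # Granularity of the global attention depends only on L.
    return N // L
```

So the two settings match in local range but not in global granularity: for N = 512, L = 8 yields 64 slice abstractions while L = 16 yields 32, which is why they can perform differently despite equal key/value lengths.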

A.7.2 AUTOREGRESSIVE LANGUAGE MODELING WITH MATCHED ATTENTION COMPLEXITY

TO TRANSFORMERS ON PG-19

We additionally evaluate CST on the PG-19 dataset (Rae et al., 2019) against Transformers whose attention complexities are matched, i.e., O(N^2/L^2 + 0.5(α + 1)NL) ≃ O(N^2). We use the GPT-2 (Radford et al., 2019) implementation in the Huggingface Transformers library (Wolf et al., 2020) and use the same architecture to implement CST equipped with CSA and MS-VIPE. We train all models for 1M training steps with a batch size of 32. We use the AdamW optimizer with learning rate 1e-3, β = [0.9, 0.99], and weight decay 0.1. We report the validation and test perplexities in Table 9. The efficient attention in CST enables a longer context length at the same attention computation complexity, leading to significantly improved perplexity compared to the Transformer baseline. Note that, with the current architecture, the overall complexity can be higher for CST in these cases, since there is still a complexity increase in the linear layers when they process longer inputs. We leave the application of caching techniques, which would save complexity in the linear layers and match the overall network complexity while retaining the long-range attention, as future work.

A.8 MORE INFORMATION AND DETAILS ON BIDIRECTIONAL MASKED LANGUAGE MODELING AND GLUE BENCHMARK

In this section, we present more details about the experimental setup and results on the bidirectional masked language modeling (MLM) task (Devlin et al., 2018) and the natural language understanding tasks in the GLUE benchmark (Wang et al., 2018), with discussion.

A.8.1 EXPERIMENTAL SETUP

We pre-train Transformer models with the masked language modeling (MLM) objective on BookCorpus (Zhu et al., 2015) and English Wikipedia. After pre-training, the models are fine-tuned on downstream natural language understanding tasks from the GLUE benchmark (Wang et al., 2018). We follow the experimental setup of Devlin et al. (2018) for both pre-training and fine-tuning, including datasets, masking rate, batch size, and optimizer settings, with several exceptions. We pre-train the models for 900K steps with sequences of 128 tokens. This corresponds to the first phase of BERT pre-training, which we consider sufficient for evaluation on GLUE tasks while expediting the experiments. Starting from the baseline network architecture of BERT-base (Devlin et al., 2018), we replace its self-attention layer with different types of efficient attention, such as Nyströmformer, Performer, Luna, and FNet. For the downstream NLU tasks, we also follow the setup of Devlin et al. (2018) with one minor change: instead of prepending a [CLS] token to the input sequence, we use mean pooling to obtain the classifier input vector from the sequence output, as discussed later in this section.

A.8.2 RESULTS AND DISCUSSION

As shown in Table 5, CST outperforms the competing efficient Transformers in terms of MLM perplexity. While CST consistently achieves lower perplexity than most of the efficient Transformers, its perplexity decreases as the slice length and extension ratio increase. However, CST does not outperform BERT-base, in contrast to our results on the LRA benchmark. We conjecture this is due to the fine-grained masking strategy in the original MLM objective. The MLM pre-training stage involves replacing random tokens in the input sequence with a dummy token that the model must predict. After applying the positional embedding, replacing a standard token with a dummy token, i.e., [MASK], translates to injecting high-frequency noise into the input sequence. CST and other efficient attention alternatives are approximations to the full attention, which must truncate high-frequency modes in exchange for faster evaluation. In MLM pre-training, all of the input sequences have high-frequency content that attention approximations have difficulty capturing. Meanwhile, the sequences in the LRA benchmark are long, low-frequency sequences that can be compressed efficiently. An alternative pre-training method could eventually be developed specifically for efficient Transformers to circumvent this issue. On the downstream NLU classification tasks, CST also performs best among the efficient Transformers. But since the pre-trained models under-perform BERT, a similar degradation trend is observed when evaluating on the downstream tasks after fine-tuning. One drawback of the current version of CST in the original setting is that the use of a [CLS] token is not trivial. The [CLS] token can be seen as a global token that summarizes the sequence in the context of the given classification task, and it is often prepended to the input sequence to gather information from the sequence for classification.
In CST, naively prepending it limits the attention range in the local attention part and also makes the summarization in the global attention highly implicit. We believe this can be addressed by an additional, well-designed architectural component that maintains global tokens and feeds them to the classifier. We leave this as future work and instead use mean pooling in the experiments for all models. While the efficiency gain is reduced by the short sequence length, one can still expect a reduction in total FLOPs from using CST. For instance, in the simple case of N = 128, L = 64, and α = 1, the complexity of CST is O(128 · 64 + 128^2/64^2), which is almost half of that of the standard Transformer, O(128^2).

A.9 LIMITATIONS AND FUTURE WORK

A.9.1 LIMITATIONS

We summarize the limitations of CST discussed in the previous sections. As discussed in A.8, there is a mismatch between token-wise random masking for MLM pre-training and abstractive attention. In the abstraction of sequence slices for global attention, high-frequency modes are truncated, i.e., smoothed out, which leads to loss of information. While similar arguments can be made for other Transformers with abstractive attention, we plan to analyze this carefully and address the issue by designing an alternative pre-training method. There are also limitations in the causal mask and slice embedding introduced in Section 4.1. In the global attention, no information about tokens in the current slice is taken into account, because of the query shift and the attention mask excluding diagonal elements. As a result, some information from coarse-grained modeling is missing in the current design. This happens only during the training phase, not at test time, since next-token prediction with a sliding window always has access to all tokens in the sequence up to the current time step. Therefore, this mismatch may limit the full modeling capability of CST.
To address this issue, we plan to design a more suitable yet efficient training objective that allows full use of the abstractions in the global attention during training. We leave the design of more advanced autoregressive modeling as future work. Finally, CST requires additional computation of complexity O(N/L) for the Q, K, and V transformations of the global attention, applied to the local attention output. However, this increase in complexity is negligible compared to the overall computation, especially when N/L is small, as discussed in Section 4.3.
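As a concrete illustration of the complexity comparison above, the following sketch counts the dominant attention-score terms for standard self-attention versus Composite Slice Attention. Constant factors are dropped, so these are order-of-magnitude estimates only, and the function names are our own illustrative choices.

```python
# Illustrative count of the dominant attention-score terms, following the
# asymptotic expressions in the paper. Constants are dropped throughout.

def standard_attention_cost(n: int) -> int:
    """O(N^2) score matrix of full self-attention."""
    return n * n

def csa_cost(n: int, l: int) -> int:
    """O(N*L) local attention plus O((N/L)^2) global attention over slices."""
    return n * l + (n // l) ** 2

# The example from the text: N = 128, L = 64, extension ratio alpha = 1.
n, l = 128, 64
print(standard_attention_cost(n))  # 16384
print(csa_cost(n, l))              # 128*64 + 2**2 = 8196, roughly half
```

The gap widens quickly for longer sequences, since the quadratic term in CSA is divided by L².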

A.9.2 FUTURE WORK

There are several potential avenues for extending CST. First, our work generalizes naturally to multi-level slicing with more than two levels. This would further improve the expressiveness of the model while adding degrees of freedom for better efficiency, enabling its application to much longer sequences. Second, as discussed in Section 4.1, a more advanced design of the autoregressive sequence modeling scheme would further improve performance. Third, while CST is based on fixed-length slicing of the sequence, dynamic or semantic slicing would improve applicability to non-stationary sequences. With end-to-end training, the model could learn the optimal slicing of the data for a given task and provide layer-wise dynamic tokens; the main challenge here would be an efficient realization of dynamic slicing. Finally, as a meaningful next step, adapting CST to vision tasks or training a CST-based large language model would broaden its practical impact.



Normalization by the sum of the binary mask, i.e., the number of nonzero mask values, instead of the slice length L, avoids biases in the mean computation induced by masked tokens.
Any finite bound may be used and has little impact as long as all expressions are subject to the same constraints.
It can be shown that d(·, ·) is in fact a proper metric on S_{M,d}.
CSA reduces to the standard self-attention: to the global attention at L = 1, or to the local attention at L = N with V = XW_v^2.
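The mask-normalized mean mentioned above can be sketched as follows. Shapes and the function name are illustrative assumptions; the point is that dividing by the valid-token count, rather than by the slice length L, keeps padded positions from dragging the slice mean toward zero.

```python
import numpy as np

def masked_slice_mean(x: np.ndarray, mask: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """x: (num_slices, L, d) token features; mask: (num_slices, L) binary, 1 = valid.

    Normalizes each slice mean by the number of valid tokens, not by L.
    """
    masked = x * mask[..., None]                 # zero out masked tokens
    counts = mask.sum(axis=1, keepdims=True)     # (num_slices, 1) valid-token counts
    return masked.sum(axis=1) / (counts + eps)   # (num_slices, d)

x = np.ones((2, 4, 3))
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]], dtype=float)
means = masked_slice_mean(x, mask)
# Both slices recover a mean of 1.0, even though the first is half padding;
# dividing by L = 4 instead would bias the first slice's mean down to 0.5.
```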



Figure 1: Illustration of Composite Slice Attention. CSA consists of a full-resolution local attention with computational complexity O(N L) and a low-resolution global attention with complexity O(N 2 /L 2 ).

Figure 2: LRA score (y-axis), relative speed (x-axis), relative GPU memory usage (circle radius), and parameter count (color). With competitive efficiency and smaller model sizes, CSTs outperform strong efficient Transformer baselines as well as the standard Transformer by significant margins. Speed and memory measurements may vary across devices and implementations.

Figure 3: Architectures of CST and CSA.

Figure 4: Comparison between linear-complexity attention and CSA. CSA, with complexity O(N²/L² + NL), is more efficient than linear-complexity methods with O(NM) for practical sequence lengths and choices of the model-specific hyperparameters M and L. Because the effective number of tokens per abstraction increases, the performance of an abstractive attention with linear complexity can degrade as the sequence length grows.
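The cost terms compared in Figure 4 can be tabulated directly. The specific values of M and L below are arbitrary examples, not the paper's experimental settings; they simply show that for moderate M the O(NL + N²/L²) cost of CSA can stay below the O(NM) cost of a linear-complexity method across practical lengths.

```python
# Dominant cost terms from Figure 4: a linear-complexity method with M
# abstraction tokens, O(N*M), versus CSA, O(N*L + (N/L)^2).

def cost_linear(n: int, m: int) -> int:
    return n * m

def cost_csa(n: int, l: int) -> int:
    return n * l + (n // l) ** 2

m, l = 256, 64  # example hyperparameters, chosen for illustration only
for n in (1024, 4096, 16384):
    print(n, cost_linear(n, m), cost_csa(n, l))
```

With these settings CSA is cheaper at every length shown, and its quadratic term N²/L² only becomes noticeable once N greatly exceeds L².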

(37) as the ϵ-ball surrounding S_{M,d} in the topology induced on S_{M,∞} by the metric. Clearly, for any fixed dimension M, ϵ > 0, and Euclidean degrees d′ > d, the family of functions S^ϵ_{M,d′} is more expressive than S^ϵ_{M,d}, since the latter is a proper subset of the former, i.e., S_{M,d} ⊂ S_{M,d′}. Next, we show that the families of functions Y^CS_g and Y^H_g are in fact subsets of S^ϵ_{M,d′} and S^ϵ_{M,d}, respectively, for appropriately chosen values of ϵ, d′, and d with d′ > d.

Test accuracy on the Long Range Arena (LRA) benchmark.

Autoregressive language modeling perplexity on WikiText-103 with N = 256. Results for other models are taken from (Nguyen et al., 2021).


Bidirectional MLM and NLU transfer learning evaluation on GLUE benchmark.


Hyperparameters for the LRA benchmark: network configuration (D_embed, D_model, D_FFN, d_h, n_head, n_layer, p_dropout).

Performance of CST with various slice lengths L, extension ratios α, positional embeddings P, and combinations of local and global attentions, on the LRA benchmark.

Performance of CST with different local modeling and slice aggregation methods, on the LRA benchmark.

Autoregressive language modeling results on PG-19. Models within double-lined sections have matched attention complexities. The model at the top of each section is the baseline Transformer; the others are CST variants.

