TRANSFORMER-QL: A STEP TOWARDS MAKING TRANSFORMER NETWORK QUADRATICALLY LARGE

Abstract

Transformer networks have shown outstanding performance on many natural language processing tasks. However, the context length (the number of previous tokens on which the output states depend) of a Transformer network grows at best linearly with the memory and computational power used. This limitation prevents a Transformer network from having a very long context in resource-limited applications. In this work, we propose a class of transformer networks, namely Transformer-QL (Quadratically Large), in which the context length can grow at best quadratically with the memory and computational power used. We have empirically evaluated a Transformer-QL model on three long range language modeling datasets. The results show that Transformer-QL can provide significant improvements over other state of the art networks.

1. INTRODUCTION

Since its introduction in Vaswani et al. (2017), the Transformer network has overtaken its predecessor, the Recurrent Neural Network (RNN), in almost every natural language processing task. However, one limitation of the Transformer network is its high requirement of memory and computational power. In a vanilla Transformer network, the memory and computational requirements grow quadratically with the sequence length, and thus with the context length. In an effort to overcome the above limitation, Transformer-XL (Dai et al., 2019) and Compressive Transformer (Rae et al., 2020) have recently been proposed. However, in both networks, the context length can grow at best linearly with the memory and computation usage. An alternative strategy has been explored in Li et al. (2019); Ye et al. (2019); Child et al. (2019); Zaheer et al. (2020); Beltagy et al. (2020); Wang et al. (2020b); Kitaev et al. (2020); Katharopoulos et al. (2020); Choromanski et al. (2020); Wang et al. (2020a). All these works propose to replace the vanilla self-attention network with a different one of linear or log-linear memory and computation complexity, leading to novel transformer architectures with overall linear or log-linear cost. Although they provide an improvement over the quadratic cost of the vanilla transformer network, the achieved cost is still, at best, linear. Besides, since those techniques are based on either sparsification or compression of the self-attention mechanism, they struggle to accumulate long distance information (Gupta & Berant, 2020). Several works, such as Burtsev & Sapunov (2020); Ainslie et al. (2020); Gupta & Berant (2020), have proposed to increase the context length by introducing a global attention which attends to every input token and is thus capable of capturing long distance dependencies. However, capturing long distance dependencies using those approaches involves extreme compression of the state space by the global attention mechanism.
Moreover, even though they perform well on several tasks, their performance on the language modeling task has not been tested. Another line of work (Zhang et al., 2019; Pappagari et al., 2019) has suggested using a hierarchical arrangement of transformer networks to capture document-wide dependencies. However, the applicability of those networks requires a hierarchical structure in the data itself. Moreover, those techniques have been proposed for document compression rather than language modeling. In this paper, we propose a class of transformer architectures, namely Transformer-QL (Quadratically Large), to alleviate the problem of capturing long distance dependencies. Similar to multi-scale transformer networks (Donahue et al., 2019; Subramanian et al., 2020; Zhao et al., 2020; Dai et al., 2020), Transformer-QL captures the contextual information at multiple temporal scales: finer scales to capture recent past information and coarser scales to capture distant past information. Additionally, like Transformer-XL, it keeps the hidden states of past segments in memory and uses them to process future segments, causing the context length to grow beyond the current segment. Overall, the context length in Transformer-QL can grow up to quadratically with the memory/computational usage. The contributions of the work are as follows:

• We have proposed a novel class of transformer architectures, namely Transformer-QL, in which the context length can be made to grow linearly with memory and computation cost. Further, employing a linear cost self-attention layer like Wang et al. (2020b); Katharopoulos et al. (2020), the context length of Transformer-QL can be made to grow quadratically in both memory and computational cost.

• We have empirically evaluated a Transformer-QL model on three long range language modeling datasets. The results show significant improvement in perplexity score over Transformer-XL and Compressive Transformer.

The organization of the paper is as follows.
In Section 2, the proposed Transformer-QL architecture is introduced along with its background. Section 3 provides an empirical evaluation of Transformer-QL; the section also studies the sensitivity of Transformer-QL to several hyperparameters. Finally, Section 4 draws conclusions and suggests future directions for the work.

2.1. TERMINOLOGY AND NOTATIONS

In a transformer network, the input sequence is partitioned into smaller segments of fixed length. Each segment is processed independently of the other segments. We refer to the number of tokens in each segment as the segment length. In a transformer network with recurrent memory, like Transformer-XL, the hidden states of the recent past segments are preserved in a fixed length memory. We refer to the number of tokens in each layer of the memory unit as the memory length. For an output state (i.e. an output state of the last layer), we use the term context length to refer to the number of past tokens on which the output state depends. In a transformer network, different output states might have different context lengths. We refer to the minimum and maximum of the context lengths over all the output states of a network as the minimum context length and the maximum context length of the network respectively. We refer to the sum of the segment length and the memory length as the window length. We denote the segment length, memory length, window length and model dimension by n_s, n_m, n_w and d_m respectively. Thus, we have n_w = n_s + n_m. We also use the notations s_t^l and m_t^l to denote the output and memory of the l-th layer at time step t for l = 1, 2, ..., L, where L is the total number of layers. The output and memory of the embedding layer at time step t are denoted by s_t^0 and m_t^0 respectively. The number of heads in the self-attention layers is denoted by H.

2.2. BACKGROUND

Transformer A transformer network consists of a stacked collection of multiple transformer layers. Each transformer layer contains a multi-head self-attention layer followed by a position-wise feed forward layer. Though the memory and computational cost of the position-wise feed forward layer is linear in the length of the input sequence, the multi-head self-attention layer has a quadratic cost. The transformer network tackles the quadratic memory and computational cost by dividing the input sequence into smaller segments and applying the transformer network to each segment independently. However, such a method limits the context lengths to the segment length. Dai et al. (2019) has named this problem the context fragmentation problem.

Transformer-XL Dai et al. (2019) has proposed Transformer-XL to solve the context fragmentation problem. In Transformer-XL, instead of discarding the hidden states after the computation of a segment, they are saved in memory (please refer to Figure 3). During the computation of the following segments, the self-attention is applied over the hidden states of both the current segment and the memory, which increases the context length without a quadratic increase in the memory and computational cost. In fact, the memory/computational cost of the self-attention layer of Transformer-XL grows quadratically only with the segment length n_s and linearly with the memory length n_m. On the other hand, the context lengths get increased by a length of n_m per layer. By keeping n_s small and making n_m large enough, the memory and computational cost of Transformer-XL can be made close to linear with respect to the context lengths. Rae et al. (2020) has proposed to improve the memory/computational cost of Transformer-XL further by keeping part of the memory states in a compressed form. However, even with this improvement, the memory and computational cost can be at best linear in the context length.
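The Transformer-XL memory mechanism described above amounts to a sliding window over hidden states. Below is a minimal NumPy sketch of the idea; the function name `xl_memory_update` and the toy shapes are our own illustration, not part of the paper or of any Transformer-XL implementation.

```python
import numpy as np

def xl_memory_update(mem, seg, n_m):
    """Transformer-XL style memory update: append the current segment's
    hidden states to the memory and keep only the last n_m positions."""
    return np.concatenate([mem, seg], axis=0)[-n_m:]

# toy example: hidden dimension 1, memory length n_m = 4, segment length 2
mem = np.zeros((4, 1))              # initial (zero) memory
seg = np.ones((2, 1))               # hidden states of the current segment
mem = xl_memory_update(mem, seg, n_m=4)   # memory now ends with the segment
```

The self-attention of the next segment then attends over both `seg` and `mem`, which is how the context extends by n_m tokens per layer.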


Figure 1: High Level Visualization of the Proposed Model. The model processes the input tokens at multiple temporal scales. Each scale has several transformer layers with recurrent memory. The output of the last layer of one scale is compressed to form the input of the next scale. As the segment length gets reduced because of compression, the memory length is increased to make the total length (i.e. segment length + memory length) of all the layers the same. In the figure, the blue boxes represent the hidden states of the current time step whereas the red boxes represent the memory states.

Overview In this paper, we explore increasing the context length by compressing the hidden states hierarchically. The high level view of our architecture is shown in Figure 1. As shown in the figure, the model processes the input sequence at several scales of temporal granularity. Each scale consists of several transformer layers with recurrent memory. The output of the last layer of one scale is compressed, causing the temporal granularity as well as the segment length to reduce. As the segment length reduces, we simultaneously increase the memory length to keep the total length (i.e. segment length + memory length) of the layer constant. The new segment and memory are then fed as the input to the first layer of the next scale. The resulting architecture is similar to the multi-scale transformer architectures (Donahue et al., 2019; Subramanian et al., 2020; Zhao et al., 2020). Additionally, Transformer-QL keeps a recurrent memory to store the hidden states of previous segments. Therefore, in Transformer-QL, the layers belonging to a finer scale process contextual information in a fine-grained manner, but have a smaller context length.
On the other hand, a layer belonging to a coarser scale processes information in a coarse-grained manner, but has a longer context length (please refer to Figure 5 for a detailed illustration of the context lengths of Transformer-QL layers).

The Compression Function For compression, we use either average pooling or max pooling with pool size and stride both equal to c, where c is the rate by which we compress the states while transitioning from one scale to the next. Let s_t^l and m_t^l be the output and memory states of the l-th layer, with lengths n_s^l and n_m^l respectively. We apply the compression function on the concatenation of m_t^l and s_t^l.

The Memory Updates In Transformer-XL with segment length n_s and memory length n_m, the segment of length n_s is shifted into the memory. In other words, the memory for the next time step is computed as m_{t+1}^l = concat(m_t^l, s_t^l)[-n_m:] for every layer l. However, in Transformer-QL, the granularities of layers belonging to different scales are different. More precisely, a segment of length n_s^0 belonging to scale 1 is compressed into a segment of length n_s^0 / c^{i-1} at a layer belonging to scale i. Thus, in Transformer-QL, we update the memory of a layer l belonging to scale i as m_{t+1}^l = concat(m_t^l, s_t^l[: n_h^i])[-n_m^i :], where n_h^i = n_s^0 / c^{i-1} and n_m^i are respectively the shift length and the memory length at scale i, for i = 1, 2, .... The complete algorithm of Transformer-QL is shown in Figure 2.

Droppath Regularization Since the output of the last layer of every scale is summed in the accumulation layer, the path through a higher scale forms a deeper network while the path through a lower scale forms a shallower network.
Consequently, layers in the higher scales might remain under-fitted due to lack of gradient flow through the deeper network, while the layers in the lower scales get over-fitted. To alleviate this problem, we have introduced droppath regularization. In the accumulation layer, let each output be computed as s_o = (1/l) Σ_{i=1}^{l} s_i, where s_i represents the (possibly over-sampled) output of scale i and l is the total number of scales. In droppath regularization with droppath probability p, we drop the outputs of all the scales below j, for j = 2, 3, ..., l, from the accumulated output with probability p/(l-1) each. More precisely, we generate a random number u from the uniform distribution and compute the output as s_o = (1/(l-j+1)) Σ_{i=j}^{l} s_i if u ∈ ((j-2)p/(l-1), (j-1)p/(l-1)). For u ≥ p, no droppath is applied.
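The droppath rule above can be sketched in a few lines. This is a minimal illustration assuming per-scale outputs that support `+` and `/` (scalars or NumPy arrays); the function name `droppath_combine` is our own, not from the paper.

```python
import random

def droppath_combine(scale_outputs, p, training=True):
    """Combine per-scale outputs with droppath regularization (sketch).
    With probability p/(l-1) for each j = 2..l, drop all scales below j
    and average the survivors; with probability 1 - p, average all scales."""
    l = len(scale_outputs)
    if training and l > 1:
        u = random.random()
        if u < p:
            # u in ((j-2)p/(l-1), (j-1)p/(l-1)) selects the cut point j
            j = int(u * (l - 1) / p) + 2      # j in {2, ..., l}
            kept = scale_outputs[j - 1:]      # scales j..l (1-based)
            return sum(kept) / len(kept)
    return sum(scale_outputs) / l
```

At evaluation time (`training=False`) all scales are always averaged, as in the plain accumulation layer.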

2.4. THE COMPLEXITY

The memory/computational complexity of a Transformer-XL network (Dai et al., 2019) with segment length n_s, memory length n_m and L layers is Θ((α(n_m, n_s) + n_s) L), where α(·, ·) is the complexity of the self-attention layer. The context length n_c of the network is Θ(n_m L). Since α(n_m, n_s) = Ω(n_m + n_s) (Li et al., 2019; Ye et al., 2019; Child et al., 2019; Zaheer et al., 2020; Beltagy et al., 2020; Wang et al., 2020b; Kitaev et al., 2020; Katharopoulos et al., 2020; Choromanski et al., 2020), the memory and computational complexity of Transformer-XL in terms of the context length is Ω(n_c). Similarly, the memory and computational complexity of Compressive Transformer (Rae et al., 2020) in terms of the context length is Ω(n_c / c), where c is the compression rate. Therefore, the memory/computational complexity of both the Transformer-XL network and the Compressive Transformer network is at least linear in the context length. Consequently, increasing the context length in both networks requires an at least linear increase in both memory and computational requirements. On the other hand, for a Transformer-QL network with L Transformer-XL layers and i compression layers, the context length n_c becomes Θ(c^i (n_s + n_m)) = O(c^{log_c n_s} (n_s + n_m)) = O(n_s (n_s + n_m)), where n_s = n_s^0 and n_m = n_m^0 are the segment and memory length in scale 1 of the network. Note that, since at most i = log_c n_s compression layers can be used in Transformer-QL, we have c^i = O(c^{log_c n_s}) = O(n_s). If we set n_m = O(n_s), we have n_c = O(n_s^2). However, the time and memory complexity of a Transformer-QL network is Θ(α(n_s, n_m) L + (n_s + n_m) i) = Θ(α(n_s, n_m)(L + i)). Since α(n_s, n_m) = Ω(n_s + n_m) and we set n_m = O(n_s), the memory/computational complexity of Transformer-QL becomes Ω(n_s (L + i)). Therefore, the memory/computational complexity of Transformer-QL in terms of the context length is Ω(√n_c (L + i)) = Ω(√n_c (L + log_c n_s)).
Thus, the complexity of Transformer-QL can be at best sub-linear in the context length. Moreover, if we set the compression rate c to n_s, the memory and computational complexity can be at best Θ(√n_c) or, in other words, the context length can be at best quadratic in the memory and computational cost. In Appendix B, we provide an algorithm to compute a tight estimate of the context length of a Transformer-QL network. In the appendix, we have also provided a detailed illustration of the dependency structure of the hidden states of a Transformer-QL network on the past tokens.
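To make the quadratic growth concrete, the following sketch evaluates the Θ(c^i (n_s + n_m)) context-length estimate numerically; the helper name `ql_context_length` and the chosen toy values (n_s a power of two, c = 2, n_m = n_s) are our own assumptions for illustration.

```python
def ql_context_length(n_s, n_m, c, i):
    """Context-length estimate Theta(c^i * (n_s + n_m)) for a Transformer-QL
    network with i compression layers, following the analysis above."""
    assert c ** i <= n_s, "at most log_c(n_s) compression layers can be used"
    return c ** i * (n_s + n_m)

# with n_m = n_s, c = 2 and the maximal i = log2(n_s), the context length
# equals 2 * n_s^2, i.e. quadratic in n_s, while per-layer cost stays O(n_s)
n_s = 64
i_max = n_s.bit_length() - 1        # log2(64) = 6
print(ql_context_length(n_s, n_s, 2, i_max))
```

Doubling n_s (and n_m with it) therefore roughly quadruples the reachable context length at only about twice the per-layer cost, which is the source of the "quadratically large" behavior.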

3. EMPIRICAL EVALUATION

In this section, we empirically evaluate the efficacy of Transformer-QL on the long range language modeling task. Towards that goal, we compare the results of Transformer-QL with those of Transformer-XL (Dai et al., 2019) and Compressive Transformer (Rae et al., 2020). Then we evaluate the sensitivity of Transformer-QL to several hyperparameters. Datasets We compare Transformer-QL against the above two networks on three long range language modeling datasets: SimpleBooks-2 (Nguyen, 2019), SimpleBooks-92 (Nguyen, 2019) and WikiText-103 (Merity et al., 2017). All three datasets preserve paragraph and section structures of their sources, making them suitable for the long range language modeling task. The statistics of the three datasets are shown in Table 4 of Appendix C. As shown in Table 2, Transformer-QL performs relatively worse for small model dimensions like 512, and the relative improvement increases as the model dimension increases. We speculate that the relatively worse performance of Transformer-QL for smaller model dimensions is caused by the difficulty of compressing hidden states when switching from one scale to the next. To alleviate the problem, Donahue et al. (2019) have proposed to increase the model dimension as the model transits from a lower scale to a higher one. On the other hand, Dai et al. (2020) have suggested a novel query-only-pooling to solve the problem. We leave trying those approaches in Transformer-QL as future work.

3.3. EFFECT OF CONTEXT LENGTH

In this section, we study the relative improvement in perplexity scores obtained by Transformer-QL over Transformer-XL for varying context lengths. The results are shown in Table 3. From the table, it can be noticed that the relative improvement obtained by Transformer-QL is larger when the context length of the Transformer-XL network is smaller in the first place. For example, for the test setting n_s = 8 and n_m = 8, the relative improvement is as high as 8.80%. On the other hand, for the test setting n_s = 2 and n_m = 30, the relative improvement is only 2.76%. This can be explained by the fact that for segment and memory lengths 2 and 30, the average context length of Transformer-XL is already large enough (241) to provide a good result. By extending the average context length from 241 to 332, Transformer-QL provides only a small improvement, following the law of diminishing returns (Hestness et al., 2017).

4. CONCLUSION AND FUTURE WORK

In this work, we have proposed a class of transformer networks, namely Transformer-QL, in which the context length can grow quadratically with the memory and computational usage. Our empirical evaluation shows that Transformer-QL can perform significantly better than other long range language modeling networks like Transformer-XL and Multi-scale Transformer by exploiting a longer context length. Furthermore, it can perform significantly better than Compressive Transformer by exploiting the contextual information more effectively. In our empirical evaluation, we have evaluated a Transformer-QL network with only one compression layer. In future, we want to evaluate networks with more than one compression layer. Also, we have empirically found that the performance of a Transformer-QL network can be worse than that of Transformer-XL when the model dimension is small. As future work, we want to explore different methods for removing this limitation.

A TRANSFORMER-XL ALGORITHM

The algorithm for Transformer-XL is shown in Figure 3.

B CONTEXT LENGTH OF TRANSFORMER-QL

A tight estimate of the minimum context length of a Transformer-QL network can be computed using the algorithm of Figure 4. For simplicity, we have assumed that all the division operations result in integer outputs. We have also assumed that there is at least one layer in every scale. The maximum context length can be obtained by adding n_s to the minimum context length. Additionally, in Figure 5, we have shown the detailed computation of the minimum/maximum context length with an example. In the figure, the notation s_{t1:t2}^l is used to denote a hidden state of the l-th layer that depends on the t1-th to t2-th tokens of the input sequence. In the example of the figure, each output state depends on at least 44 previous tokens. In other words, the minimum context length of the network is 44. On the other hand, in a Transformer-XL network with the same segment length, memory length and number of layers, the minimum context length would have been only 4 × 4 = 16.

C STATISTICS OF DATASETS

The statistics of the datasets are shown in Table 4 . 



To get the final output of the network, we causally combine the (possibly over-sampled) outputs of the last layer of each scale and pass them through several transformer layers (following Subramanian et al. (2020); Zhao et al. (2020)) to learn a deep representation of the output.

Function Compute(s, m, l):
    s' ← MultiHeadSAttn_l(s, m)
    s'' ← LayerNorm(s + s')
    return LayerNorm(s'' + PoswiseFF_l(s''))

(b) One Transformer-XL layer over hidden states s and memory m. (d) Shift: shift the current hidden states s into memory m.

Figure 2: Transformer-QL Algorithm.
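The scale-aware memory update of Section 2 (only the first n_h^i positions of a scale-i segment are shifted into that layer's memory) can be sketched as follows. This is a minimal NumPy illustration; the name `ql_memory_update` and the toy lengths are our own assumptions, not from the paper's code.

```python
import numpy as np

def ql_memory_update(mem, seg, shift_len, n_m):
    """Transformer-QL memory update at a layer of scale i: shift only the
    first shift_len = n_s^0 / c^(i-1) positions of the segment into the
    memory, then keep the last n_m positions."""
    return np.concatenate([mem, seg[:shift_len]], axis=0)[-n_m:]

# toy scale-2 layer with base segment length n_s0 = 8 and compression rate
# c = 2: the shift length is n_s0 / c**(2-1) = 4
mem = np.zeros((12, 1))             # scale-2 memory, length n_m^2 = 12
seg = np.ones((8, 1))               # compressed segment at scale 2
mem = ql_memory_update(mem, seg, shift_len=8 // 2, n_m=12)
```

With `shift_len` equal to the full segment length this reduces to the Transformer-XL update, so scale 1 behaves exactly like Transformer-XL.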

The compression function is applied on the concatenation of m_t^l and s_t^l to get the output s_t^{l+1} of length n_s^{l+1} = (n_s^l + n_m^l)/c (for simplicity, assume that n_s^l + n_m^l is divisible by c). If n_s^{l+1} > n_s^l, we take the last n_s^l elements of s_t^{l+1} to form the output of the compression layer. Finally, we keep a recurrent memory m_t^{l+1} of length n_s^l + n_m^l − n_s^{l+1}, so that n_s^l + n_m^l = n_s^{l+1} + n_m^{l+1} holds.
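The pooling step itself (pool size = stride = c, max or average) can be sketched in a few lines of NumPy; the helper name `compress` and the toy shapes are our own illustration under the divisibility assumption stated above.

```python
import numpy as np

def compress(states, c, mode="max"):
    """Pool a (length, d_m) array of hidden states with pool size and
    stride both equal to c, as in the compression layer described above.
    The length must be divisible by c."""
    length, d = states.shape
    blocks = states.reshape(length // c, c, d)   # non-overlapping windows
    return blocks.max(axis=1) if mode == "max" else blocks.mean(axis=1)

# compress the concatenation of a memory of length 4 and a segment of
# length 4 with rate c = 2: the output has length (n_s + n_m) / c = 4
x = np.arange(8, dtype=float).reshape(8, 1)
y = compress(x, c=2)                 # max-pooled: [[1], [3], [5], [7]]
```

Average pooling (`mode="mean"`) is the other option mentioned in the text; both halve the temporal resolution while preserving the model dimension.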

We compare the Transformer-QL network with the following two networks. Transformer-XL (Dai et al., 2019): Transformer-XL is similar to the vanilla Transformer with two modifications. It uses a recurrent memory to store and access the hidden states of past time steps. The recurrent memory enables increasing the minimum context length up to n_m L, where n_m is the memory length and L is the number of layers. It also uses relative positional embeddings of tokens instead of absolute positional embeddings. Compressive Transformer (Rae et al., 2020): Like Transformer-XL, Compressive Transformer also uses a recurrent memory. However, Compressive Transformer keeps part of the recurrent memory in a compressed format, and thus has an increased context length over Transformer-XL.

Figure 3: Forward pass of Transformer-XL. The functions Compute and Shift are given in Figures 2b and 2d respectively.

Figure 5: Dependency of hidden states on the past tokens in a Transformer-QL network.

Table 2: Improvement in test perplexity score (lower is better) of Transformer-QL over Transformer-XL for three different model dimensions. The fourth and fifth columns show the test perplexities obtained by Transformer-QL and Transformer-XL respectively.

Experimental Details For the experiments with Transformer-XL and Compressive Transformer, we have used an 8-layer network. For the experiments with Transformer-QL, we have used a network with 3 layers in scale 1, 3 layers in scale 2 and 2 layers in the output block. Thus, the Transformer-QL network has a total of eight layers, as in Transformer-XL and Compressive Transformer. We set the compression rate of Compressive Transformer to 2. In Transformer-QL, we have used a max-pooling layer with pool size 2 and stride 2 as the compression layer. Thus, both Transformer-QL and Compressive Transformer have a compression rate of 2. For the experiments on SimpleBooks-92 and WikiText-103, we have set the model dimension to 1536 and used an initial learning rate of 1 × 10^-4. On the other hand, for the experiments on SimpleBooks-2, we have set the model dimension to 256 and the learning rate to 2.5 × 10^-4. All the models have been trained using the Adam optimizer. We set the droppath probability of Transformer-QL to 0.3. The details of other hyperparameters can be found in Appendix E.

Results The results of the comparison are shown in Table 1. The results are grouped by the window length n_w of the test model, as the lower bound of the memory and computation requirement directly depends on it. On all the datasets and settings, Transformer-XL performs worst among the three. The worst performance of Transformer-XL is not surprising, as it has the smallest average context length (shown in the fifth column) for a given n_w. However, Compressive Transformer has a slightly larger average context length than Transformer-QL. Yet, Transformer-QL has performed similarly to or significantly better than Compressive Transformer in all the settings, which indicates that Transformer-QL can exploit the contextual information more effectively than Compressive Transformer.

3.2 EFFECT OF MODEL DIMENSION

In this section, we investigate the effect of the model dimension on the performance of Transformer-QL.
Towards that goal, we have performed experiments on the WikiText-103 dataset with varying model dimensions. For each experiment, we compare the test perplexity of Transformer-QL with that of Transformer-XL. The results are shown in Table 2. The improvement in test perplexity of Transformer-QL over Transformer-XL has been computed by subtracting the test perplexity of Transformer-QL from that of Transformer-XL. The relative improvement is computed by dividing this improvement by the test perplexity of Transformer-XL.

Table 3: Relative improvements in perplexity scores (lower is better) obtained by Transformer-QL over Transformer-XL on the WikiText-103 dataset. The second column shows the test segment (n_s) and memory (n_m) lengths. The third and fourth columns respectively show the average test context length (n_c) of the Transformer-XL and Transformer-QL networks.

Xingxing Zhang, Furu Wei, and Ming Zhou. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Volume 1: Long Papers, pp. 5059-5069. Association for Computational Linguistics, 2019.

Yucheng Zhao, Chong Luo, Zheng-Jun Zha, and Wenjun Zeng. Multi-scale group transformer for long sequence modeling in speech separation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pp. 3251-3257. ijcai.org, 2020.

Table 4: Statistics of the datasets used in the experiments.

D COMPARISON WITH MULTI-SCALE TRANSFORMER

In this section, we empirically compare Transformer-QL with Multi-scale Transformer (Subramanian et al., 2020). Our implementation of Multi-scale Transformer is the same as Transformer-QL without any recurrent memory. The resulting Multi-scale Transformer is similar to the bottom-up model of Subramanian et al. (2020). We have used the same hyperparameter settings as Transformer-QL to train the Multi-scale Transformer. The result is shown in Table 5. From the table, we can see that Multi-scale Transformer is widely beaten by Transformer-QL even when the Multi-scale Transformer has been trained and tested with a larger window length.

Table 5: Comparison of Transformer-QL with Multi-scale Transformer (MS-Transformer). The third and fourth columns respectively show the segment length (n_s) and the memory length (n_m) used during training. The fifth, sixth and seventh columns respectively show the segment, memory and window (n_w = n_s + n_m) lengths used to compute the test perplexities. The eighth column shows the average context length (n_c) of the test models.

E HYPERPARAMETER SETTING

We used the following values of hyperparameters for the experiments on the SimpleBooks-2 dataset: For the WikiText-103 dataset and model dimension 1536, the following values of hyperparameters are used: For training models of model dimensions 512 and 1024, we have used initial learning rates of 0.0005 and 0.00025 respectively, keeping the rest of the hyperparameters the same.

