TRANSFORMER-QL: A STEP TOWARDS MAKING TRANSFORMER NETWORK QUADRATICALLY LARGE

Abstract

Transformer networks have shown outstanding performance on many natural language processing tasks. However, the context length (the number of previous tokens on which the output states depend) of a Transformer network grows at best linearly with the memory and computational power used. This limitation prevents a Transformer network from having a very long context in resource-limited applications. In this work, we propose a class of transformer networks, namely Transformer-QL (Quadratically Large), in which the context length can grow at best quadratically with the memory and computational power used. We have empirically evaluated a Transformer-QL model on three long-range language modeling datasets. The results show that Transformer-QL can provide significant improvements over other state-of-the-art networks.

1. INTRODUCTION

Since its introduction in Vaswani et al. (2017), Transformer networks have overtaken their predecessor, Recurrent Neural Networks (RNNs), in almost every natural language processing task. However, one limitation of the Transformer network is its high requirement of memory and computational power. In a vanilla Transformer network, the memory and computational requirement grows quadratically with the sequence length, and thus with the context length. In an effort to overcome this limitation, Transformer-XL (Dai et al., 2019) and Compressive Transformer (Rae et al., 2020) have been recently proposed. However, in both networks, the context length can grow at best linearly with the memory and computation used. An alternative strategy has been explored in Li et al. (2019); Ye et al. (2019); Child et al. (2019); Zaheer et al. (2020); Beltagy et al. (2020); Wang et al. (2020b); Kitaev et al. (2020); Katharopoulos et al. (2020); Choromanski et al. (2020); Wang et al. (2020a). All these works propose to replace the vanilla self-attention network with a different one with linear or log-linear memory and computation complexity, leading to novel transformer architectures with overall linear or log-linear cost. Although they improve over the quadratic cost of the vanilla transformer network, the achieved cost is still, at best, linear. Besides, since those techniques are based on either sparsification or compression of the self-attention mechanism, they struggle to accumulate long-distance information (Gupta & Berant, 2020). Several works, such as Burtsev & Sapunov (2020); Ainslie et al. (2020); Gupta & Berant (2020), have proposed to increase the context length by introducing a global attention which attends to every input token and is thus capable of capturing long-distance dependencies. However, capturing long-distance dependencies using those approaches involves extreme compression of the state space by the global attention mechanism. Moreover, even though they perform well on several tasks, their performance on the language modeling task has not been tested. Another line of work (Zhang et al., 2019; Pappagari et al., 2019) has suggested using hierarchical arrangements of transformer networks to capture document-wide dependencies. However, applicability of those networks requires hierarchical structure in the data itself.
Moreover, those techniques have been proposed for document compression rather than language modeling. In this paper, we propose a class of transformer architectures, namely Transformer-QL (Quadratically Large), to alleviate the problem of capturing long-distance dependencies. Similar to multi-scale transformer networks (Donahue et al., 2019; Subramanian et al., 2020; Zhao et al., 2020; Dai et al., 2020), Transformer-QL captures contextual information at multiple temporal scales: finer scales to capture recent past information and coarser scales to capture distant past information. Additionally, like Transformer-XL, it keeps the hidden states of past segments in memory and uses them to process future segments, causing the context length to grow beyond the current segment. Overall, the context length in Transformer-QL can grow up to quadratically with the memory/computational usage. The contributions of the work are as follows:

• We have proposed a novel class of transformer architectures, namely Transformer-QL, in which the context length can be made to grow linearly with memory and computation cost. Further, employing a linear-cost self-attention layer like Wang et al. (2020b); Katharopoulos et al. (2020), the context length of Transformer-QL can be made to grow quadratically in both memory and computational cost.

• We have empirically evaluated a Transformer-QL model on three long-range language modeling datasets. The results show significant improvement in perplexity score over Transformer-XL and Compressive Transformer.

The organization of the paper is as follows. In Section 2, the proposed Transformer-QL architecture along with its background is introduced. Section 3 provides an empirical evaluation of Transformer-QL; the section also studies the sensitivity of Transformer-QL to several hyperparameters. Finally, in Section 4, conclusions are drawn and future directions of the work are suggested.
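Several of the works cited in this section replace softmax self-attention with a linear-cost variant. As a rough illustration of the idea (a sketch in the spirit of Katharopoulos et al. (2020), not any paper's actual implementation; all names here are ours), a positive feature map phi lets the key-value product be formed once and shared across queries, so cost grows linearly with sequence length:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized (linear-cost) attention sketch: softmax(QK^T)V is
    replaced by phi(Q) (phi(K)^T V), normalized per query. The (d, d)
    product phi(K)^T V costs O(n d^2) rather than O(n^2 d).
    Non-causal, single head."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qf, Kf = phi(Q), phi(K)                              # (n, d) each
    kv = Kf.T @ V                                        # (d, d), built once
    z = Qf @ Kf.sum(axis=0)                              # (n,) normalizer
    return (Qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 16, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)                  # (16, 4)
```

Because phi is strictly positive, the implied attention weights are non-negative and sum to one per query, so each output row is a convex combination of the value rows, just as in softmax attention.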

2. METHOD

2.1. TERMINOLOGY AND NOTATIONS

In a transformer network, the input sequence is partitioned into smaller segments of fixed length. Each segment is processed independently of the other segments. We refer to the number of tokens in each segment as the segment length. In a transformer network with recurrent memory, like Transformer-XL, the hidden states of the recent past segments are preserved in a fixed-length memory. We refer to the number of tokens in each layer of the memory unit as the memory length. For an output state (i.e. an output state of the last layer), we use the term context length to refer to the number of past tokens on which the output state depends. In a transformer network, different output states might have different context lengths. We refer to the minimum and maximum of the context lengths over all the output states in a network as the minimum context length and maximum context length of the network, respectively. We refer to the sum of the segment length and the memory length as the window length. We denote the segment length, memory length, window length and model dimension by n_s, n_m, n_w and d_m respectively; thus, we have n_w = n_s + n_m. We also use the notations s_t^l and m_t^l to denote the output and memory of the l-th layer at time step t, for l = 1, 2, ..., L, where L is the total number of layers. The output and memory of the embedding layer at time step t are denoted by s_t^0 and m_t^0 respectively. The number of heads in the self-attention layers is denoted by H.
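To make the terminology concrete, the following toy helper (our own illustration, not code from the paper) splits a token sequence into segments of length n_s and pairs each segment with the up-to-n_m tokens that precede it as memory, so that a full window has length n_w = n_s + n_m:

```python
def windows(tokens, n_s, n_m):
    """Split `tokens` into segments of length n_s and pair each segment
    with a memory of (up to) the n_m tokens preceding it, mimicking how
    a recurrent-memory transformer sees its input.
    Returns a list of (memory, segment) pairs."""
    out = []
    for start in range(0, len(tokens), n_s):
        segment = tokens[start:start + n_s]
        memory = tokens[max(0, start - n_m):start]
        out.append((memory, segment))
    return out

tokens = list(range(12))              # a toy token sequence
pairs = windows(tokens, n_s=4, n_m=4)
mem, seg = pairs[2]
# Once the memory is full, the window length is n_w = n_s + n_m.
print(len(mem) + len(seg))            # 8
```

The first segment has an empty memory, so early output states have shorter context lengths than later ones, which is why the minimum and maximum context lengths of a network can differ.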

2.2. BACKGROUND

Transformer. A transformer network consists of a stack of multiple transformer layers. Each transformer layer contains a multi-head self-attention layer followed by a position-wise feed-forward layer. Though the memory and computational cost of the position-wise feed-forward layer is linear in the length of the input sequence, the multi-head self-attention layer has a quadratic cost. The transformer network tackles the quadratic memory and computational cost by dividing the input sequence into smaller segments and applying the transformer network on each segment independently. However, this limits the context length to the segment length. Dai et al. (2019) named this problem the context fragmentation problem.

Transformer-XL. Dai et al. (2019) proposed Transformer-XL to solve the context fragmentation problem. In Transformer-XL, instead of discarding the hidden states after the computation of a segment, they are saved in memory (please refer to Figure 3). During the computation of the following segments, self-attention is applied over the hidden states of both the current segment and the memory.
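A minimal sketch of this recurrence (our own simplification, omitting the projection matrices, multiple heads and relative positional encodings of the actual model) might look as follows: queries come from the current segment only, while keys and values span the memory concatenated with the segment, so outputs can look back past the segment boundary:

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, n_s, n_m = 8, 4, 4           # model dim, segment and memory length

def attend(segment, memory):
    """One self-attention step in the style of Transformer-XL: attention
    is computed over the memory concatenated with the current segment."""
    window = np.concatenate([memory, segment], axis=0)   # (n_m + n_s, d_m)
    scores = segment @ window.T / np.sqrt(d_m)           # (n_s, n_m + n_s)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ window

memory = np.zeros((n_m, d_m))     # empty memory before the first segment
for _ in range(3):                # process three consecutive segments
    segment = rng.normal(size=(n_s, d_m))
    out = attend(segment, memory)
    memory = out[-n_m:]           # cache the newest hidden states
print(out.shape)                  # (4, 8)
```

In the real model the cached states are held fixed (no gradient flows into the memory), and each layer keeps its own memory, which is how the effective context grows beyond a single segment.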





