TRANSFORMER-QL: A STEP TOWARDS MAKING TRANSFORMER NETWORK QUADRATICALLY LARGE

Abstract

Transformer networks have shown outstanding performance on many natural language processing tasks. However, the context length (the number of previous tokens on which the output states depend) of a Transformer network grows at best linearly with the memory and computational power used. This limitation prevents a Transformer network from having a very long context in resource-limited applications. In this work, we propose a class of transformer networks, namely Transformer-QL (Quadratically Large), in which the context length can grow at best quadratically with the memory and computational power used. We empirically evaluate a Transformer-QL model on three long-range language modeling datasets. The results show that Transformer-QL can provide significant improvements over other state-of-the-art networks.

1. INTRODUCTION

Since its introduction in Vaswani et al. (2017), the Transformer network has overtaken its predecessor, the Recurrent Neural Network (RNN), in almost every natural language processing task. However, one limitation of the Transformer network is its high memory and computational requirements: in a vanilla Transformer network, the memory and computational cost grows quadratically with the sequence length, and thus with the context length.

In an effort to overcome the above limitation, Transformer-XL (Dai et al., 2019) and the Compressive Transformer (Rae et al., 2020) have recently been proposed. However, in both networks the context length can grow at best linearly with the memory and computation used. An alternative strategy has been explored in Li et al. (2019); Ye et al. (2019); Child et al. (2019); Zaheer et al. (2020); Beltagy et al. (2020); Wang et al. (2020b); Kitaev et al. (2020); Katharopoulos et al. (2020); Choromanski et al. (2020); Wang et al. (2020a). All of these works replace the vanilla self-attention network with one of linear or log-linear memory and computation complexity, leading to novel transformer architectures with overall linear or log-linear cost. Although they improve on the quadratic cost of the vanilla transformer network, the achieved cost is still, at best, linear. Moreover, since these techniques are based on either sparsification or compression of the self-attention mechanism, they struggle to accumulate long-distance information (Gupta & Berant, 2020). Several works, such as Burtsev & Sapunov (2020); Ainslie et al. (2020); Gupta & Berant (2020), have proposed to increase the context length by introducing a global attention that attends to every input token and is thus capable of capturing long-distance dependencies. However, capturing long-distance dependencies with those approaches involves extreme compression of the state space by the global attention mechanism, and, even though they perform well on several tasks, their performance on language modeling has not been tested. Another line of work (Zhang et al., 2019; Pappagari et al., 2019) suggests using a hierarchical arrangement of transformer networks to capture document-wide dependencies. However, those networks require a hierarchical structure in the data itself, and they were proposed for document compression rather than language modeling.

In this paper, we propose a class of transformer architectures, namely Transformer-QL (Quadratically Large), to alleviate the problem of capturing long-distance dependencies. Similar to multi-scale transformer networks (Donahue et al., 2019; Subramanian et al., 2020; Zhao et al., 2020;
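The quadratic cost of vanilla self-attention discussed above can be made concrete with a minimal sketch (NumPy, single head, no softmax or masking): the attention score matrix has shape L x L, so its memory footprint quadruples every time the sequence length L doubles.

```python
import numpy as np

def attention_scores(q, k):
    """Vanilla self-attention scores for a single head.

    q, k: arrays of shape (L, d). The result is an (L, L) matrix,
    so memory grows quadratically with sequence length L.
    """
    d = q.shape[-1]
    return (q @ k.T) / np.sqrt(d)

# Bytes needed just for the float32 score matrix at a few lengths:
for L in (512, 1024, 2048):
    print(L, L * L * 4)  # doubling L quadruples this number
```

This is only the score matrix of one head of one layer; a full model multiplies the cost by the number of heads and layers, which is why context length in a vanilla Transformer is tied linearly (per token) and quadratically (per sequence) to available memory.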

