CLUSTER-FORMER: CLUSTERING-BASED SPARSE TRANSFORMER FOR QUESTION ANSWERING

Abstract

Transformer has become ubiquitous in the deep learning field. One of the key ingredients underpinning its success is the self-attention mechanism, which allows fully-connected contextual encoding over input tokens. However, despite its effectiveness in modeling short sequences, self-attention struggles to handle inputs with extremely long-range dependencies, as its complexity grows quadratically w.r.t. the sequence length. Therefore, long sequences are often encoded by Transformer in chunks using a sliding window. In this paper, we propose Cluster-Former, a novel clustering-based sparse Transformer that performs attention across chunked sequences. The proposed framework pivots on two unique types of Transformer layer: the Sliding-Window Layer and the Cluster-Former Layer, which encode local sequence information and global context jointly and iteratively. This new design allows information integration beyond local windows, which is especially beneficial for question answering (QA) tasks that rely on long-range dependencies. Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.

1. INTRODUCTION

Long-range contextual understanding has proven critical in many natural language processing (NLP) tasks. For example, the relevant context for correctly answering an open-domain question can span thousands of words. Encoding long sequences via deep neural networks, however, remains an expensive and challenging task due to high demands on training time and GPU memory. Traditional sequence modeling methods (Hochreiter & Schmidhuber, 1997) encode long sequences in chronological order, which incurs high latency. In place of sequential encoding, recent models such as Transformer (Vaswani et al., 2017) instead use simultaneous self-attention over the entire input, which has been successfully adopted in many NLP tasks such as textual entailment (Devlin et al., 2019), dependency parsing (Zhou & Zhao, 2019), and summarization (Lewis et al., 2019). A caveat with Transformer, though, is that building full connections over long sequences translates to quadratic growth in memory demand and computational complexity w.r.t. sequence length.

One way to efficiently encode long sequences is to first chunk a sequence into much shorter ones with a sliding window, then build connections between the shorter sequences (Figure 1(a)). For example, Child et al. (2019), Beltagy et al. (2020) and Zaheer et al. (2020) apply sparse attention to chunked sequences in hand-designed patterns in order to gather information from the chunks (Figure 1(b)). Choi et al. (2017) and Wang et al. (2019) first use a simpler model to filter chunked sequences, then process the selected sequences with fully-connected self-attention. Rae et al. (2019) make use of shared memory across chunked sequences to build connections between them. However, these methods cannot encode long-range dependencies with as much flexibility or accuracy as fully-connected self-attention, due to their dependence on hand-designed patterns. Recently, several studies (Kitaev et al., 2020; Tay et al., 2020b) propose to further improve the sparse attention mechanism by hashing or sorting the hidden states into different buckets (Figure 1(c)). These works mainly explore tasks with relatively short sequences, such as sentence-level Machine Translation (MT), where the number of hashing vectors is relatively small (fewer than 16 in Kitaev et al. (2020)), allowing randomly initialized hashing vectors to hash hidden states into correct buckets. However, how to use hashing-based attention in the context of long sequences (e.g., up to thousands of words) remains unexplored territory.
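To make the sliding-window idea concrete, the following is a minimal NumPy sketch (not the authors' implementation) of attention restricted to fixed-size chunks: each token attends only within its own window, so the cost is O(L * w) rather than O(L^2) for sequence length L and window size w. All names and sizes here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(h, window):
    # Attend only within fixed-size chunks: O(L * window) vs O(L^2).
    L, d = h.shape
    out = np.zeros_like(h)
    for start in range(0, L, window):
        chunk = h[start:start + window]        # (w, d) local window
        scores = chunk @ chunk.T / np.sqrt(d)  # attention limited to the window
        out[start:start + window] = softmax(scores) @ chunk
    return out

rng = np.random.default_rng(0)
h = rng.standard_normal((1024, 64))           # hypothetical hidden states
out = sliding_window_attention(h, window=128)
print(out.shape)  # (1024, 64)
```

The trade-off this sketch exposes is exactly the one the paragraph above describes: tokens in different windows never interact, so any cross-chunk dependency must be recovered by an additional mechanism.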

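The bucketing idea behind hashing- or clustering-based sparse attention can likewise be sketched in a few lines. The version below is a simplified illustration, not the method of Kitaev et al. (2020) or of this paper: hidden states are assigned to their nearest centroid (here, random centroids stand in for learned or hashed ones), and self-attention runs only among states sharing a bucket, so tokens far apart in the sequence can still attend to each other if their content is similar.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cluster_attention(h, centroids):
    # Assign each hidden state to its nearest centroid (cosine similarity),
    # then run self-attention only among states sharing a cluster.
    L, d = h.shape
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    assign = (hn @ cn.T).argmax(axis=1)        # (L,) cluster id per token
    out = np.zeros_like(h)
    for c in range(len(centroids)):
        idx = np.where(assign == c)[0]         # tokens in cluster c, any position
        if idx.size == 0:
            continue
        chunk = h[idx]
        scores = chunk @ chunk.T / np.sqrt(d)
        out[idx] = softmax(scores) @ chunk
    return out

rng = np.random.default_rng(0)
h = rng.standard_normal((512, 32))             # hypothetical hidden states
centroids = rng.standard_normal((8, 32))       # hypothetical: 8 random centroids
out = cluster_attention(h, centroids)
print(out.shape)  # (512, 32)
```

With random centroids and short sequences this assignment can be adequate, which is the regime the cited MT work operates in; the open question raised above is how to make such bucketing reliable when sequences stretch to thousands of tokens.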
