BLOCK SKIM TRANSFORMER FOR EFFICIENT QUESTION ANSWERING

Abstract

Transformer-based encoder models have achieved promising results on natural language processing (NLP) tasks, including question answering (QA). Unlike in sequence classification or language modeling tasks, the hidden states at all positions are used for the final prediction in QA. However, not all of the context is needed to answer the question. Following this idea, we propose Block Skim Transformer (BST) to improve and accelerate transformer-based QA models. The key idea of BST is to identify the context that must be further processed and the blocks that can be safely discarded early during inference. Critically, we learn this information from the self-attention weights. As a result, the model's hidden states are pruned along the sequence dimension, achieving a significant inference speedup. We also show that this extra training objective improves model accuracy. As a plug-in to transformer-based QA models, BST is compatible with other model compression methods and does not change the existing network architecture. BST improves QA models' accuracy on different datasets and achieves a 1.6× speedup on the BERT-large model.

1. INTRODUCTION

With the rapid development of neural networks for NLP tasks, the Transformer (Vaswani et al., 2017), built on the multi-head attention (MHA) mechanism, represents a recent major leap (Goldberg, 2016). It has become a standard building block of recent NLP models. The Transformer-based BERT (Devlin et al., 2018) model further advances model accuracy by introducing self-supervised pre-training and has reached state-of-the-art accuracy on many NLP tasks. One of the most challenging tasks in NLP is question answering (QA) (Huang et al., 2020). Our key insight is that when human beings answer a question given a passage as context, they do not devote the same level of comprehension to every sentence in the paragraph. Most of the content is quickly skimmed over with little attention. In the Transformer architecture, however, every token goes through the same amount of computation, which suggests that we can discard many of the tokens in the early layers. This redundancy induces high execution overhead along the input sequence dimension. To mitigate these inefficiencies in QA tasks, we propose to devote more computation to the blocks that are likely to contain the actual answer while terminating the other blocks early during inference. By doing so, we reduce the overhead of processing irrelevant text and accelerate model inference. Meanwhile, by feeding the attention mechanism the knowledge of the answer position directly during training, both the attention mechanism and the QA model's accuracy are improved. In this paper, we provide the first empirical study of attention feature maps, showing that an attention map carries enough information to locate the answer span. We then propose Block Skim Transformer (BST), a plug-and-play module for transformer-based models that accelerates them on QA tasks.
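As a minimal, illustrative sketch of this empirical observation (the block size, function names, and synthetic attention matrix below are our own assumptions, not details taken from the paper), one can pool the attention mass that question tokens assign to each context block and check which block stands out:

```python
import numpy as np

def block_attention_scores(attn, question_idx, block_size):
    """Score each block by the total attention mass that question
    tokens assign to the tokens inside the block.

    attn: (seq_len, seq_len) attention weights of one head, where
          attn[i, j] is the attention from token i to token j.
    question_idx: indices of the question tokens.
    block_size: number of tokens per block.
    """
    seq_len = attn.shape[0]
    n_blocks = seq_len // block_size
    # Attention from question tokens to every position, summed over queries.
    to_each_token = attn[question_idx].sum(axis=0)            # (seq_len,)
    # Pool token-level scores into block-level scores.
    blocks = to_each_token[: n_blocks * block_size].reshape(n_blocks, block_size)
    return blocks.sum(axis=1)                                 # (n_blocks,)

# Toy example: 16 tokens, blocks of 4; the question (tokens 0-3)
# attends mostly to block 2 (tokens 8-11), where the answer sits.
rng = np.random.default_rng(0)
attn = rng.random((16, 16))
attn[:4, 8:12] += 5.0                    # question focuses on block 2
attn /= attn.sum(axis=1, keepdims=True)  # row-normalize like a softmax output
scores = block_attention_scores(attn, question_idx=np.arange(4), block_size=4)
print(scores.argmax())  # block 2 receives the highest score
```

A block-level ranking like this is the kind of signal the paper's empirical study suggests is already present in the attention map.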
By treating the attention weight matrices as feature maps, the CNN-based Block Skim module extracts information from the attention mechanism to make a skim decision. With the predicted block mask, BST skips irrelevant context blocks, which do not enter subsequent layers' computation. Besides, we devise a new training paradigm that jointly trains the Block Skim objective with the native QA objective, giving the attention mechanism extra optimization signals regarding the answer position directly. In our evaluation, we show that BST improves the QA accuracy and F1 score on all the datasets and models we evaluated. Specifically, BERT-large is accelerated by 1.6× without any accuracy loss and by nearly 1.8× with less than 0.5% F1 score degradation. This paper makes the following three contributions.
• We show, for the first time, that an attention map is effective for locating the answer position in the input sequence.
• We propose Block Skim Transformer (BST), which leverages the attention mechanism to improve and accelerate transformer models on QA tasks. The key is to extract information from the attention mechanism during processing and intelligently predict which blocks to skim.
• We evaluate BST on several Transformer-based model architectures and QA datasets and demonstrate BST's efficiency and generality.
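The joint training objective and the inference-time block pruning described above can be sketched as follows. The loss weight alpha, the keep threshold, and all function names are illustrative assumptions on our part, not the paper's exact formulation:

```python
import numpy as np

def skim_loss(block_logits, block_labels):
    """Binary cross-entropy over per-block skim predictions.
    block_logits: (n_blocks,) raw scores from a Block Skim module.
    block_labels: (n_blocks,) 1 if the block contains answer tokens, else 0.
    """
    p = 1.0 / (1.0 + np.exp(-block_logits))  # sigmoid
    eps = 1e-12
    return -np.mean(block_labels * np.log(p + eps)
                    + (1 - block_labels) * np.log(1 - p + eps))

def joint_loss(qa_loss, block_logits, block_labels, alpha=1.0):
    """Joint objective: the native QA loss plus a weighted skim loss
    (shown here for a single layer for simplicity)."""
    return qa_loss + alpha * skim_loss(block_logits, block_labels)

def prune_blocks(hidden, block_logits, block_size, threshold=0.0):
    """At inference, keep only the blocks whose skim logit clears the
    threshold; discarded blocks do not enter subsequent layers.
    hidden: (seq_len, d_model) hidden states of the current layer."""
    keep = block_logits > threshold              # (n_blocks,)
    token_mask = np.repeat(keep, block_size)     # expand to token level
    return hidden[token_mask]

# Toy example: 16 tokens in 4 blocks; only block 2 survives pruning.
hidden = np.zeros((16, 8))
logits = np.array([-2.0, -1.0, 3.0, -0.5])
pruned = prune_blocks(hidden, logits, block_size=4)
print(pruned.shape)  # (4, 8)
```

The sequence-length reduction in `prune_blocks` is where the speedup comes from: every subsequent layer's attention and feed-forward cost shrinks with the number of surviving tokens.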

2. RELATED WORK

Recurrent Models with Skimming. The idea of skipping or skimming irrelevant sections or tokens of the input sequence has been studied in NLP models, especially recurrent neural networks (RNNs) (Rumelhart et al., 1986) and long short-term memory (LSTM) networks (Hochreiter & Schmidhuber, 1997). LSTM-Jump (Yu et al., 2017) uses a policy-gradient reinforcement learning method to train an LSTM model that decides how many time steps to jump at each state. Hyper-parameters control the number of tokens read before a jump, the maximum jump distance, and the maximum number of jumps. Skim-RNN (Seo et al., 2018) dynamically decides the dimensionality and RNN model size to be used at the next time step. Specifically, they adopt a "big" and a "small" RNN model and select the "small" one for skimming. Structural-Jump-LSTM (Hansen et al., 2018) uses two agents to decide whether to jump a small step to the next token or structurally to the next punctuation mark. Skip-RNN (Campos et al., 2017) learns to skip state updates, thereby reducing the computation graph size. BST differs from these works in two ways. First, previous works make skimming decisions based on the hidden states or embeddings during processing, whereas we are the first to analyze and utilize the attention relationship for skimming. Second, our work is based on the Transformer model (Vaswani et al., 2017), which has outperformed recurrent models on most NLP tasks. Transformer with Input Reduction. In contrast to the aforementioned recurrent models, Transformer-based models compute all input sequence tokens in parallel. As such, skimming can be regarded as a reduction along the sequence dimension. PoWER-BERT (Goyal et al., 2020) eliminates input tokens during processing based on the attention scores each token receives. For fine-tuning on downstream tasks, Goyal et al. propose a soft-extract layer so the model can be trained jointly.
Funnel-Transformer (Dai et al., 2020) proposes a novel pyramid architecture in which the input sequence length is reduced gradually, regardless of semantic clues. For tasks requiring full-length output, such as masked language modeling and extractive question answering, Funnel-Transformer up-samples along the sequence dimension to recover the original length. Universal Transformer (Dehghani et al., 2018) proposes a dynamic halting mechanism that determines the number of refinement steps for each token. Different from these works, BST utilizes the attention information between question and token pairs and skims the input sequence at block granularity accordingly. Efficient Transformer. There have also been many attempts to design efficient Transformers (Zhou et al., 2020; Wu et al., 2019; Tay et al., 2020). Well-studied model compression methods for Transformer models include pruning (Guo et al., 2020), quantization (Wang & Zhang, 2020), distillation (Sanh et al., 2019), and weight sharing. Many works focus on dedicated efficient attention mechanisms, given attention's quadratic complexity in sequence length (Kitaev et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020). BST operates on the input sequence dimension, is orthogonal to these techniques, and is therefore compatible with them. We demonstrate this feasibility with the weight-sharing model ALBERT (Lan et al., 2019) in Sec. 5.

