BLOCK SKIM TRANSFORMER FOR EFFICIENT QUESTION ANSWERING

Abstract

Transformer-based encoder models have achieved promising results on natural language processing (NLP) tasks, including question answering (QA). Unlike sequence classification or language modeling, QA uses the hidden states at all positions for the final prediction. However, not all of the context is needed to answer the question. Following this idea, we propose Block Skim Transformer (BST) to improve and accelerate transformer-based QA models. The key idea of BST is to identify, early during inference, the context blocks that must be processed further and those that can be safely discarded. Critically, we learn this information from the self-attention weights. As a result, the model's hidden states are pruned along the sequence dimension, yielding a significant inference speedup. We also show that this extra training objective improves model accuracy. As a plugin to transformer-based QA models, BST is compatible with other model compression methods and requires no change to existing network architectures. BST improves QA accuracy on different datasets and achieves a 1.6× speedup on the BERT-large model.

1. INTRODUCTION

With the rapid development of neural networks for NLP tasks, the Transformer (Vaswani et al., 2017), built on the multi-head attention (MHA) mechanism, marks a major leap forward (Goldberg, 2016). It has become a standard building block of recent NLP models. The Transformer-based BERT (Devlin et al., 2018) model further advances model accuracy by introducing self-supervised pre-training and has reached state-of-the-art accuracy on many NLP tasks. One of the most challenging tasks in NLP is question answering (QA) (Huang et al., 2020). Our key insight is that when humans answer a question using a passage as context, they do not comprehend every sentence of the paragraph with equal care. Most of the content is quickly skimmed over with little attention paid to it. In the Transformer architecture, by contrast, all tokens go through the same amount of computation, which suggests that we can exploit this observation by discarding many of the tokens in the early layers of the Transformer. This redundancy induces high execution overhead along the input sequence dimension. To mitigate these inefficiencies in QA tasks, we propose to devote more attention to the blocks that are more likely to contain the actual answer while terminating the other blocks early during inference. Doing so reduces the overhead of processing irrelevant text and accelerates model inference. Meanwhile, by feeding the attention mechanism the knowledge of the answer position directly during training, we improve both the attention mechanism and the QA model's accuracy. In this paper, we provide the first empirical study of attention feature maps to show that an attention map can carry enough information to locate the answer span. We then propose Block Skim Transformer (BST), a plug-and-play module for transformer-based models that accelerates them on QA tasks.
By treating the attention weight matrices as feature maps, the CNN-based Block Skim module extracts information from the attention mechanism to make a skim decision. With the predicted block mask, BST skips irrelevant context blocks, which do not enter subsequent layers' computation. Besides, we devise a new training paradigm that jointly trains the Block Skim
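The skim-and-prune mechanism described above can be sketched as follows. This is a minimal illustration and not the paper's implementation: the learned CNN classifier over attention feature maps is replaced here by a simple heuristic that keeps a block when the attention mass it receives exceeds the uniform share, and the function names (`block_skim_mask`, `prune_blocks`) are hypothetical.

```python
import numpy as np

def block_skim_mask(attn, block_size, threshold):
    """Decide which context blocks to keep, given one head's attention map.

    attn: (seq_len, seq_len) attention weights (each row sums to 1).
    Returns a boolean mask over blocks. NOTE: this heuristic stands in for
    the CNN-based Block Skim classifier described in the paper; a block is
    kept when the total attention it receives exceeds `threshold` times the
    share it would receive under uniform attention.
    """
    seq_len = attn.shape[0]
    num_blocks = seq_len // block_size
    # Attention mass received by each key position, summed over queries.
    received = attn[:, :num_blocks * block_size].sum(axis=0)
    per_block = received.reshape(num_blocks, block_size).sum(axis=1)
    uniform_share = attn.sum() / num_blocks  # expected mass if attention were uniform
    return per_block > threshold * uniform_share

def prune_blocks(hidden, mask, block_size):
    """Drop the hidden states of skimmed blocks along the sequence dimension,
    so they do not enter subsequent layers' computation."""
    keep = np.repeat(mask, block_size)
    return hidden[keep]
```

In the actual model, the block mask would be predicted per layer and the surviving hidden states passed to the next Transformer layer, shrinking the sequence length and hence the quadratic attention cost.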

