MULTI-SPAN QUESTION ANSWERING USING SPAN-IMAGE NETWORK

Abstract

Question-answering (QA) models aim to find an answer given a question and context. Language models like BERT are used to associate a question with its context to find an answer span. Prior art on QA focuses on finding the single best answer. There is a need for multi-span QA models that output the top-K likely answers to questions such as "Which companies did Elon Musk start?" or "What factors cause global warming?" In this work, we introduce the Span-Image architecture, which can learn to identify multiple answers in a context for a given question. This architecture can incorporate prior information about the span length distribution or valid span patterns (e.g., the end index has to be larger than the start index), thus eliminating the need for post-processing. The Span-Image architecture outperforms the state of the art in top-K answer accuracy on the SQuAD dataset and in multi-span answer accuracy on an Amazon internal dataset.

1. INTRODUCTION

Answering questions posed as text to search engines or spoken to virtual assistants like Alexa has become a key feature of information retrieval systems. Publicly available reading comprehension datasets, including WikiQA (Yang et al., 2015), TriviaQA (Joshi et al., 2017), NewsQA (Trischler et al., 2016), and SQuAD (Rajpurkar et al., 2016), have fostered research in QA models. SQuAD is one of the most widely used reading comprehension benchmarks and has an active leaderboard with many participants. Even though there are models that beat human-level accuracy on SQuAD, these QA systems can do well by learning only context and type-matching heuristics (Weissenborn et al., 2017) and may still be far from true language understanding, since they are not robust to adversarial sentences (Jia & Liang, 2017). To better measure performance, SQuAD v2.0 (Rajpurkar et al., 2018) extends v1.1 by allowing questions that have no explicit answer in the given paragraph.

QA can be modeled as the task of predicting the span (i.e., start and end indices) of an answer given a question and an input paragraph. To find the answer span, language representation models such as BERT can be used to associate a question with a given paragraph (Devlin et al., 2019). BERT is pre-trained on unsupervised tasks using large corpora. Its input representation permits a pair, which is well suited to having a question and a passage as input. By fine-tuning BERT on SQuAD, a QA model can be obtained. Questions without an answer are treated as having a span that begins and ends with the special BERT token [CLS]. In this way, a BERT-based QA model can offer an actual answer or "no-answer" to all questions in the SQuAD v1.1 and v2.0 datasets. Prior work on QA assumes the presence of a single answer or the lack of any answer (Seo et al., 2016; Devlin et al., 2019).
Furthermore, prior work assumes a separable probability distribution function (pdf) for the start and end indices of an answer span, which leads to a separable loss function. This approach has two major disadvantages: 1) it prevents the QA model from predicting multiple spans without post-processing; 2) since a separable pdf is used, the QA model cannot learn to evaluate the compatibility of start and end indices, and thus suffers from performance degradation. Pang et al. (2019) consider a hierarchical answer span by sorting the product of start and end probabilities to support multiple spans. However, they still assume a separable pdf for the start and end indices. To the best of our knowledge, a multi-span QA architecture has not been proposed.

We introduce the Span-Image architecture to enable multi-span answers (or multiple answers) given a question and a paragraph. Each pixel (i, j) in the span-image corresponds to a span starting at the ith position and ending at the jth. Typical image processing layers, such as 2D convolutional layers, are used. The Span-Image architecture enables the model to couple start and end indices to check their compatibility. Constraints such as "the end index has to be bigger than the start index" can be automatically embedded into the model. Moreover, other span characteristics, such as "shorter answers are more likely to occur" (see Figure 1), can be learned by the model, thus eliminating the need for post-processing or regularization.

Figure 1: Span length histogram on SQuAD shows that shorter answers are more likely to occur. The Span-Image architecture can incorporate this prior information since each output pixel predicts a span whose length is known by the model.

Our contributions are summarized below:

• We present the Span-Image architecture, a novel method that enables multi-span answer prediction.
• Specially designed image channels in the proposed architecture help the QA model capture span characteristics and eliminate the need for post-processing.
• The Span-Image network is modular and can be added to most DNN-based QA models without requiring changes to the previous layers.

2. MULTI-SPAN PREDICTION

In this section, we first review the BERT QA architecture and then present our span-image architecture, which consumes BERT outputs.

2.1. BERT QA ARCHITECTURE

The QA task in BERT uses a separable pdf: p(s_S, s_E) = p(s_S) × p(s_E), where s_S and s_E denote one-hot variables of length N for the start and end indices of a paragraph of length N, respectively. Therefore, the BERT QA architecture assumes the start and end index probabilities to be independent of each other. Given predicted probabilities p_BERT(s_S) and p_BERT(s_E) as outputs of BERT, a question q of length M, and a passage g of length N, the QA loss function for fine-tuning BERT is

Loss(q, p, t_S, t_E) = H(t_S, p_BERT(s_S)) + H(t_E, p_BERT(s_E)),

where H is the cross-entropy function and t is the target span with start and end indices t_S and t_E, respectively. BERT has two separate outputs for the start and end indices, which makes it impossible for the model to check the compatibility of s_S and s_E, or to utilize information such as span length (i.e., s_E - s_S) in its predictions. Figure 2 shows the BERT QA architecture.
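The separable loss above can be sketched in a few lines of NumPy. This is a toy illustration: the logits and target indices are random stand-ins for actual BERT outputs and labels.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def separable_qa_loss(start_logits, end_logits, t_start, t_end):
    """Standard BERT QA loss: cross-entropy on two independent pdfs,
    one over start indices and one over end indices."""
    p_s = softmax(start_logits)
    p_e = softmax(end_logits)
    return -np.log(p_s[t_start]) - np.log(p_e[t_end])

# Toy paragraph of length N = 5 with target span (start=1, end=3).
rng = np.random.default_rng(0)
loss = separable_qa_loss(rng.normal(size=5), rng.normal(size=5), 1, 3)
print(loss > 0)   # True
```

Because the loss factorizes over start and end, nothing in this objective penalizes incompatible pairs such as an end index that precedes the start index.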


Figure 2: BERT QA architecture predicts start and end index probability distributions separately.

2.2. SPAN-IMAGE NETWORK

The span-image network does not dictate a separable pdf for the start and end indices; hence p(s_S, s_E) ≠ p(s_S) × p(s_E). Given a question q and paragraph g, BERT outputs a D-dimensional vector sequence BERT(q, g) of length M + N + 2 (see Figure 2). Let us denote the last N vectors in the sequence, which correspond to the paragraph g, by BERT_g(q, g). Using two affine transformations denoted by W_S and W_E, each of which has D units, we create two vector sequences W_S(BERT_g(q, g)) and W_E(BERT_g(q, g)) of length N. A pixel at location (i, j) has D channels and is given by

span_im_{i,j} = W_S(BERT_g(q, g))_i ∘ W_E(BERT_g(q, g))_j,

where ∘ denotes the element-wise multiplication of the D-dimensional vectors at the ith and jth locations of W_S(BERT_g(q, g)) and W_E(BERT_g(q, g)), respectively. Hence, the span-image span_im, shown in Figure 3, is a 3-dimensional tensor of depth D and of height and width N. This enables us to borrow techniques such as 2-dimensional convolutional filtering, max-pooling, and ReLU from convolutional neural network (CNN) architectures for image classification. The output of the span-image network is an N × N logit-image, logit_im, with a single channel (i.e., a logit for each possible pixel/span). Each channel in span_im is a matrix of rank 1. Therefore, each channel is separable and has limited potential beyond the separable approach described in Section 2.1. However, applying two-dimensional convolutional layers improves performance and makes logit_im non-separable, thus eliminating the independence assumption on the start and end indices. The probability of each span can be computed by applying a sigmoid function to each pixel of logit_im, or a softmax over all of its pixels. Using a sigmoid makes no assumption on the number of spans, while using a softmax assumes a single span in every paragraph. The best function to use depends on the QA dataset.
For example, in our experiments, using softmax gave the best results when fine-tuning BERT on SQuAD, while sigmoid performed better on our internal multi-span dataset. Denoting p(s_S = i, s_E = j) by p_{i,j} for simplicity, span probabilities for multi-span and single-span datasets are computed by

p^sigmoid_{i,j} = sigmoid(logit_im_{i,j}), if the training dataset can have multi-span answers,
p^softmax_{i,j} = softmax(logit_im)_{i,j}, if the training dataset only has single-span answers.

The target image, target_im, is a binary image with zeros at every pixel except those corresponding to target spans (i.e., target_im(i, j) = 1 for any target span in g with start index i and end index j). Given logit_im and target_im, the loss function using sigmoid is

Loss(q, p, target_im) = Σ_{i,j} H(target_im_{i,j}, p^sigmoid_{i,j}) / N²,

and the loss function for softmax is

Loss(q, p, target_im) = H(target_im, p^softmax),

where H is the cross-entropy function. Note that target_im and p^softmax are joint pdfs over s_S and s_E, while p^sigmoid_{i,j} is a pdf for the binary variable indicating whether (i, j) is an answer span or not.
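The span-image construction and the two loss functions can be sketched in NumPy. The sizes, random weights, and the stand-in logit image below are hypothetical; in the full model the logits come from the convolutional layers rather than a channel sum.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: N paragraph tokens, D hidden units.
N, D = 4, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(N, D))                   # stands in for BERT_g(q, g)
s = h @ (rng.normal(size=(D, D)) / D**0.5)    # W_S(BERT_g(q, g))
e = h @ (rng.normal(size=(D, D)) / D**0.5)    # W_E(BERT_g(q, g))

# span_im[i, j, :] = s[i] * e[j] (element-wise product over D channels),
# so each channel is a rank-1 outer product before any conv layers.
span_im = np.einsum('id,jd->ijd', s, e)
assert np.linalg.matrix_rank(span_im[:, :, 0]) == 1

# Stand-in logit image (from the conv layers in the actual architecture).
logit_im = span_im.sum(axis=-1)

target_im = np.zeros((N, N))
target_im[1, 3] = 1.0                         # one answer span: start 1, end 3

def sigmoid_loss(logit_im, target_im):
    """Per-pixel binary cross-entropy, averaged over N^2 pixels (multi-span case)."""
    p = sigmoid(logit_im)
    return -np.mean(target_im * np.log(p) + (1 - target_im) * np.log(1 - p))

def softmax_loss(logit_im, target_im):
    """Cross-entropy on the joint pdf over all N^2 candidate spans (single-span case)."""
    z = np.exp(logit_im - logit_im.max())
    p = z / z.sum()
    return -np.sum(target_im * np.log(p))

print(sigmoid_loss(logit_im, target_im) > 0)   # True
print(softmax_loss(logit_im, target_im) > 0)   # True
```

The rank-1 assertion makes the separability point concrete: before convolution, each channel of span_im carries no more information than the separable baseline.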

2.2.1. INCORPORATING SPAN CHARACTERISTICS

Typical post-processing of BERT QA output involves checking for valid spans and sorting them by the product of the start/end index probabilities. As shown in Figure 1, a priori information about the span length distribution can be used to break ties or to prefer between spans that have close probabilities. In BERT QA, this can only be achieved by implementing a post-processing technique that penalizes spans based on their lengths. Our span-image architecture, however, has an inherent capability to learn and incorporate such patterns. We simply create a new channel, span_ch, and our model learns how to utilize this channel to capture span characteristics during training:

span_ch(i, j) = -1, if j < i (invalid span),
span_ch(i, j) = j - i, if 0 ≤ j - i < ς,
span_ch(i, j) = ς, if j - i ≥ ς.

span_ch is concatenated to span_im, increasing its depth to D + 1.
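The span-length channel is straightforward to construct; here is a minimal NumPy sketch, with arbitrary values for N and ς (spelled sigma in code):

```python
import numpy as np

def span_length_channel(N, sigma):
    """Span-length channel: -1 marks invalid spans (end before start),
    valid spans carry their length j - i, clipped at sigma."""
    i = np.arange(N)[:, None]   # start index
    j = np.arange(N)[None, :]   # end index
    return np.where(j < i, -1, np.minimum(j - i, sigma))

ch = span_length_channel(5, 3)
print(ch[0, 4])   # 3: length 4, clipped to sigma
print(ch[2, 0])   # -1: invalid, end before start
```

Broadcasting the two index vectors fills the whole N × N channel in one pass, so it can be precomputed once and concatenated to every span image.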

3. EXPERIMENTS

3.1. IMPLEMENTATION DETAILS

In multi-span QA tasks, we compare the standard BERT-QA model by Devlin et al. (2019), which has separate output layers for start and end index prediction, against the following variants of the span-image architecture described in Section 2.2:

• bert-qa: The BERT base model available from the Transformers (Wolf et al., 2019) library as bert-base-uncased.
• bert-ms-sigmoid: The BERT base model augmented with the span-image network, which consists of two 2D convolution layers with 100 and 50 filters, respectively. Both layers use 3x3 filters. The output layer applies a sigmoid activation to span-image pixels to enable multiple span predictions.
• bert-ms-softmax: Replaces the sigmoid activation of bert-ms-sigmoid with a softmax activation. Softmax serves as useful regularization when the task dictates exactly one answer (no-answer counts as a "null" answer).
• bert-ms-sigmoid-sl: The bert-ms-sigmoid model with a span-length indicator channel concatenated to the span-image.
• bert-ms-softmax-sl: The bert-ms-softmax model with a span-length indicator channel concatenated to the span-image.
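The convolutional head shared by these variants can be sketched with a naive NumPy convolution. The 'same' zero padding and the final 1x1 convolution down to a single logit channel are our assumptions for illustration; the text only specifies two 3x3 layers with 100 and 50 filters.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 2D cross-correlation with 'same' zero padding.
    x: (H, W, C_in), w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty((x.shape[0], x.shape[1], w.shape[-1]))
    for a in range(x.shape[0]):
        for b in range(x.shape[1]):
            out[a, b] = np.einsum('ijc,ijco->o', xp[a:a + k, b:b + k], w)
    return out

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical sizes: N x N span image with D channels.
N, D = 8, 16
rng = np.random.default_rng(0)
span_im = rng.normal(size=(N, N, D))
w1 = rng.normal(size=(3, 3, D, 100)) * 0.05   # first conv layer, 100 filters
w2 = rng.normal(size=(3, 3, 100, 50)) * 0.05  # second conv layer, 50 filters
w3 = rng.normal(size=(1, 1, 50, 1)) * 0.05    # assumed 1x1 projection to one logit

logit_im = conv2d_same(relu(conv2d_same(relu(conv2d_same(span_im, w1)), w2)), w3)[..., 0]
print(logit_im.shape)   # (8, 8): one logit per candidate span
```

A production implementation would of course use a deep learning framework's batched convolution rather than this explicit loop.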

3.2. SQUAD

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100,000+ crowd-sourced question/answer pairs (Rajpurkar et al., 2016). The SQuAD v2.0 task extends SQuAD v1.1 with question/answer pairs in which there may be no answer to the question. We use SQuAD v2.0 in all our experiments, since this makes the QA task more realistic and challenging.

3.3. INTERNAL AMAZON MULTI-ANSWER DATASET

Consumers use the total quantity information of a product to compare its value against similar products. To provide Amazon customers with accurate quantity information, we formulate quantity extraction as a multi-span question answering problem, where the question is "what is the total quantity of this item in terms of its unit of measure?", and the context is the textual item description provided by sellers. The unit of measure can be volume (e.g., liquid detergent), weight (e.g., powder detergent), or count (e.g., number of loads) depending on the product type. To compute the total quantity correctly, all relevant quantities need to be extracted from the seller-provided text and multiplied. For example, if the product title is "Original Roast Ground Coffee K Cups, Caffeinated, 36 ct - 12.4 oz Box, Pack of 2", then the applicable question is "what is the count?", and the total quantity is 36 x 2 = 72. Our dataset consists of manual labels accumulated over time. We split 80,000 of the 450,000 labels into a test set. The average number of answer spans in our dataset is 1.8.
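Given the spans predicted by a multi-span model, the downstream computation is a simple product. A minimal sketch (the function name and input format are ours, not part of the dataset pipeline):

```python
from math import prod

def total_quantity(spans):
    """Multiply the numeric quantities extracted as answer spans
    by the multi-span QA model."""
    return prod(int(s) for s in spans)

# "Original Roast Ground Coffee K Cups, ... 36 ct - 12.4 oz Box, Pack of 2"
print(total_quantity(["36", "2"]))   # 72
```

This is why the task is inherently multi-span: missing either "36" or "2" yields a wrong total quantity, so a single best-answer model cannot solve it.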

3.4. RESULTS

The performance metrics for fine-tuning BERT on the SQuAD v2.0 dataset are given in Table 1. All models perform similarly, with the exception of bert-ms-sigmoid. This is expected: since each question in SQuAD v2.0 has at most one answer, the softmax activation exploits this information through its inherent normalization, which can be seen as a projection onto the single-answer constraint. Since the span-image architecture enables BERT to generate multi-span answers, we use top-K accuracy to compare bert-qa with bert-ms-softmax-sl for K = 1, 3, 5, and 10. As shown in Table 2, both models perform similarly for K = 1, but bert-ms-softmax-sl performs significantly better on all metrics for K > 1.
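The top-K accuracy metric can be sketched as follows; the span scores below are illustrative values, not model outputs:

```python
def top_k_accuracy(span_scores, target_spans, k):
    """Fraction of examples whose target span appears among the
    k highest-scoring candidate spans.
    span_scores: one dict per example, mapping (start, end) -> score."""
    hits = 0
    for scores, target in zip(span_scores, target_spans):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += target in top_k
    return hits / len(target_spans)

scores = [{(0, 1): 0.9, (2, 3): 0.4, (1, 3): 0.1}]
print(top_k_accuracy(scores, [(2, 3)], 1))   # 0.0: target is not the top span
print(top_k_accuracy(scores, [(2, 3)], 2))   # 1.0: target is in the top 2
```

The metric rewards a model for ranking the correct span anywhere in its top K, which is exactly what multi-span prediction needs to be credited for.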

3.4.1. AMAZON INTERNAL MULTI-SPAN ANSWER DATASET

Top-K accuracy is a proxy for capturing answer quality on SQuAD when models output multiple answers. Our internal quantity dataset, in contrast, is a true multi-span answer dataset on which we fine-tune BERT. We measure performance on multi-span answer prediction (i.e., whether all relevant information to compute the total quantity has been extracted or not). On this task, since the span count can be any number, we compare bert-ms-sigmoid and bert-ms-sigmoid-sl with bert-qa. Both span-image models perform better, with bert-ms-sigmoid-sl performing best. Making the QA model aware of span length leads to small improvements. To improve bert-qa's multi-span prediction performance, we introduce a post-processing step in which a weighted penalty term for the span length is added to the start/end index probabilities. Results for different span-length penalty weights (λ's) are given in Table 5. Span-length penalization improves the performance of bert-qa, but it still performs worse than the span-image models when an answer is present in the input text.
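The span-length penalty for bert-qa can be sketched as below. The additive log-domain form of the penalty and the toy probabilities are our assumptions for illustration; the paper does not specify the exact functional form.

```python
import numpy as np

def penalized_span_scores(p_start, p_end, lam):
    """Hypothetical post-processing for a separable QA model:
    score(i, j) = log p_start[i] + log p_end[j] - lam * (j - i),
    computed only for valid spans with j >= i."""
    N = len(p_start)
    return {(i, j): np.log(p_start[i]) + np.log(p_end[j]) - lam * (j - i)
            for i in range(N) for j in range(i, N)}

p_s = np.array([0.7, 0.2, 0.1])   # toy start-index probabilities
p_e = np.array([0.1, 0.2, 0.7])   # toy end-index probabilities

s0 = penalized_span_scores(p_s, p_e, 0.0)
s1 = penalized_span_scores(p_s, p_e, 2.0)
print(max(s0, key=s0.get))   # (0, 2): with no penalty, the long span wins
print(max(s1, key=s1.get))   # with a strong penalty, a shorter span is preferred
```

Note the contrast with the span-image models: here the length preference is a fixed hand-tuned penalty, whereas the span-length channel lets the model learn it from data.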

4. CONCLUSION

QA models can answer complex questions by predicting a span in a given paragraph or context. To associate a question with a context, large language-representation models, which can be trained on big corpora, are utilized. As the performance of QA models improves, the need for more realistic scenarios grows. In this work, we propose the span-image network to predict multiple spans of an answer. We measure its performance using top-K accuracy on SQuAD and the exact match of all spans on an internal Amazon multi-span QA dataset. While performing similarly on top-1 accuracy, the span-image network significantly outperforms separable prediction for K > 1.



One can claim that attention heads in the transformer network correlate tokens, but this does not happen explicitly as in our architecture, where the probability of each possible span is computed jointly.



Figure 3: Span-image network implements convolution layers as typically used in image classification tasks.

Table 1: SQuAD v2.0 results for top answer prediction

Table 2: SQuAD results using top-K answers

Table 3: Question-answer pairs in the multi-answer dataset

Table 4: Multi-answer dataset results

Table 5: Multi-answer dataset results using penalty weight λ

