MULTI-SPAN QUESTION ANSWERING USING SPAN-IMAGE NETWORK

Abstract

Question-answering (QA) models aim to find an answer given a question and context. Language models like BERT are used to associate a question with its context to find an answer span. Prior art on QA focuses on finding the single best answer. There is a need for multi-span QA models that output the top-K likely answers to questions such as "Which companies did Elon Musk start?" or "What factors cause global warming?" In this work, we introduce the Span-Image architecture, which can learn to identify multiple answers in a context for a given question. This architecture can incorporate prior information about the span length distribution or valid span patterns (e.g., the end index has to be larger than the start index), thus eliminating the need for post-processing. The Span-Image architecture outperforms the state-of-the-art in top-K answer accuracy on the SQuAD dataset and in multi-span answer accuracy on an Amazon internal dataset.

1. INTRODUCTION

Answering questions posted as text to search engines or spoken to virtual assistants like Alexa has become a key feature in information retrieval systems. Publicly available reading comprehension datasets, including WikiQA (Yang et al., 2015), TriviaQA (Joshi et al., 2017), NewsQA (Trischler et al., 2016), and SQuAD (Rajpurkar et al., 2016), have fostered research in QA models. SQuAD is one of the most widely used reading comprehension benchmarks and has an active leaderboard with many participants. Even though there are models that beat human-level accuracy on SQuAD, these QA systems can do well by learning only context- and type-matching heuristics (Weissenborn et al., 2017) but may still be far from true language understanding, since they are not robust to adversarial sentences (Jia & Liang, 2017). To better measure performance, SQuAD v2.0 (Rajpurkar et al., 2018) extends v1.1 by allowing questions that have no explicit answer in a given paragraph.

QA can be modeled as the task of predicting the span (i.e., start and end indices) of an answer given a question and an input paragraph. To find the answer span, language representation models such as BERT can be used to associate a question with a given paragraph (Devlin et al., 2019). BERT is pre-trained on unsupervised tasks using large corpora. Its input representation permits a pair, which is well suited to taking a question and a passage as input. By fine-tuning BERT on SQuAD, a QA model can be obtained. Questions without an answer are treated as having a span that begins and ends with the special BERT token: [CLS]. In this way, a BERT-based QA model can offer an actual answer or 'no-answer' to all questions in the SQuAD v1.1 and v2.0 datasets. Prior work on QA assumes the presence of a single answer or the absence of any answer (Seo et al., 2016; Devlin et al., 2019).
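Under this single-span formulation, the answer is typically extracted at inference time by searching for the highest-scoring valid (start, end) pair as a post-processing step. A minimal sketch of that standard procedure (the function name and toy logits are our own illustration, not from the paper):

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Pick the highest-scoring (start, end) pair with end >= start.

    Under the separable-pdf assumption, the span score is just the sum
    of independent start and end logits; compatibility of the two
    indices (end >= start, bounded length) must be enforced here,
    outside the model, as post-processing.
    """
    n = len(start_logits)
    best, best_score = (0, 0), -np.inf
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

start = np.array([0.1, 2.0, 0.3, 0.0])
end = np.array([0.2, 0.1, 1.5, 0.4])
print(best_span(start, end))  # → (1, 2)
```

Note that the model itself never sees the end >= start constraint; it lives entirely in this search loop.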
Furthermore, they assume a separable probability distribution function (pdf) for the start and end indices of an answer span, which leads to a separable loss function. This approach has two major disadvantages: 1) it prevents the QA model from predicting multiple spans without post-processing; 2) since a separable pdf is used, the QA model cannot learn to evaluate the compatibility of start and end indices, and thus suffers from performance degradation. Pang et al. (2019) consider a hierarchical answer span by sorting the product of start and end probabilities to support multiple spans. However, they still assume a separable pdf for the start and end indices. To the best of our knowledge, a multi-span QA architecture has not been proposed.

We introduce the Span-Image architecture to enable multi-span answers (or multiple answers) given a question and a paragraph. Each pixel (i, j) in the span image corresponds to a span starting at the i-th position and ending at the j-th position. Typical image-processing layers, such as 2D convolutions, are then applied. The Span-Image architecture enables the model to couple start and end indices to check for their compatibility. Constraints such as "the end index has to be bigger than the start index" can be automatically embedded into the model. Moreover, other span characteristics, such as "shorter answers are more likely to occur" (see Figure 1), can be learned by the model, thus eliminating the need for post-processing or regularization.

Figure 1: Span length histogram on SQuAD shows that shorter answers are more likely to occur. The Span-Image architecture can incorporate this prior information, since each output pixel predicts a span whose length is known to the model.

Our contributions are summarized as follows:
• We present the Span-Image architecture, a novel method that enables multi-span answer prediction.
• Specially designed image channels in the proposed architecture help the QA model capture span characteristics and eliminate the need for post-processing.
• The Span-Image network is modular and can be added to most DNN-based QA models without requiring changes to the previous layers.
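To make the span-image idea concrete, the sketch below builds an N x N multi-channel image from token representations. The exact pairwise combination and channel design are our assumptions for illustration (the paper does not specify them here); a 2D convolutional network would then consume this image to score every span:

```python
import numpy as np

def build_span_image(token_vecs):
    """Build a 3-channel N x N 'span image' from N token vectors.

    Pixel (i, j) corresponds to the span starting at position i and
    ending at position j.
    Channel 0: pairwise start/end token features (dot product here,
               as an illustrative choice).
    Channel 1: normalized span length, letting the network learn
               priors such as 'shorter answers are more likely'.
    Channel 2: validity mask encoding the end >= start constraint.
    """
    n = len(token_vecs)
    img = np.zeros((3, n, n))
    img[0] = token_vecs @ token_vecs.T                # pairwise features
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    img[1] = (j - i) / n                              # span-length channel
    img[2] = (j >= i).astype(float)                   # valid-span mask
    return img

img = build_span_image(np.random.randn(5, 8))
print(img.shape)  # → (3, 5, 5)
```

Because span length and validity are explicit input channels, the model can learn length priors and hard constraints directly instead of relying on post-processing.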

2. MULTI-SPAN PREDICTION

In this section, we first review the BERT QA architecture and then present our Span-Image architecture, which consumes BERT outputs.

2.1. BERT QA ARCHITECTURE

The QA task in BERT uses a separable pdf: p(s_S, s_E) = p(s_S) × p(s_E), where s_S and s_E denote one-hot variables of length N for the start and end indices of a paragraph of length N, respectively. The BERT QA architecture therefore assumes the start- and end-index probabilities to be independent of each other. Given predicted probabilities p_BERT(s_S) and p_BERT(s_E) as outputs of BERT, a question q of length M, and a passage g of length N, the QA loss function for fine-tuning BERT is given by Loss(q, g, t_S, t_E) = H(t_S, p_BERT(s_S)) + H(t_E, p_BERT(s_E)), where H is the cross-entropy function and t is the target span with start and end indices t_S and t_E, respectively. BERT has two separate outputs for the start and end indices, which makes it impossible for the model to check the compatibility of s_S and s_E or to utilize information such as the span length (i.e., s_E - s_S) in its predictions. Figure 2 shows the BERT QA architecture.
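The separable loss above can be written out directly: a cross-entropy term on the start distribution plus an independent cross-entropy term on the end distribution. A minimal numeric sketch (function names and toy logits are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bert_qa_loss(start_logits, end_logits, t_start, t_end):
    """Separable BERT QA loss: H(t_S, p(s_S)) + H(t_E, p(s_E)).

    With one-hot targets, each cross-entropy term reduces to the
    negative log-probability of the true index. Because the two
    terms never interact, the model gets no training signal about
    whether a given (start, end) pair is compatible.
    """
    p_s = softmax(start_logits)
    p_e = softmax(end_logits)
    return -np.log(p_s[t_start]) - np.log(p_e[t_end])

loss = bert_qa_loss(np.array([0.0, 2.0, 0.0]), np.array([0.0, 0.0, 2.0]), 1, 2)
print(loss)
```

Swapping t_start and t_end (an invalid span with end < start) would produce exactly the same loss value here, which is the coupling failure the Span-Image architecture addresses.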



Footnote: One could argue that attention heads in the transformer network correlate tokens, but this coupling does not happen explicitly, as it does in our architecture, where the probability of each possible span is computed jointly.
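The contrast between a separable pdf and a jointly computed one can be sketched as follows: instead of normalizing start and end probabilities independently, a single softmax is taken over all valid pixels of the N x N span-score map, so invalid spans (end < start) receive exactly zero probability. This is our illustrative reading of the joint computation, not the paper's exact implementation:

```python
import numpy as np

def joint_span_probs(span_scores):
    """One softmax over all valid (i, j) pixels of an N x N score map.

    Invalid spans with j < i are masked out *before* normalization,
    so the constraint end >= start is built into the distribution
    itself rather than enforced by post-processing.
    """
    n = span_scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool))     # keep j >= i only
    z = np.where(mask, span_scores, -np.inf)        # mask invalid spans
    e = np.exp(z - z[mask].max())                   # exp(-inf) -> 0
    return e / e.sum()

p = joint_span_probs(np.random.randn(4, 4))
print(p.sum())  # → 1.0
```

A separable model, by contrast, distributes probability mass over all N x N index pairs, including the impossible lower-triangular ones.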

