MACHINE READING COMPREHENSION WITH ENHANCED LINGUISTIC VERIFIERS

Abstract

We propose two linguistic verifiers for span-extraction style machine reading comprehension to tackle two challenges: how to evaluate the syntactic completeness of predicted answers, and how to utilize the rich context of long documents. Our first verifier rewrites a question by replacing its interrogatives with the predicted answer phrase, and then builds a cross-attention scorer between the rewritten question and the segment, so that answer candidates are scored in a position-sensitive context. Our second verifier builds a hierarchical attention network to represent the segments of a passage, in which neighbouring segments of long passages are recurrently connected and can contribute to the current segment-question pair's inference for answerability classification and boundary determination. We combine these two verifiers into a pipeline and apply it to the SQuAD2.0, NewsQA, and TriviaQA benchmark sets. Our pipeline achieves significant improvements in both exact match and F1 scores over state-of-the-art baselines.

1. INTRODUCTION

Teaching a machine to read and comprehend large-scale textual documents is a promising and longstanding goal of natural language understanding. This field, known as machine reading comprehension (MRC) (Zhang et al., 2019; 2020c), has achieved impressive milestones in recent years thanks to the release of large-scale benchmark datasets and pretrained contextualized language models (CLMs). For example, on the well-studied span-extraction style SQuAD2.0 dataset^1, the current best results under the pretraining+fine-tuning framework, employing ALBERT (Lan et al., 2020), are 90.7% exact match (EM) and 93.0% F1, exceeding the human-level scores of 86.8% and 89.5% (Rajpurkar et al., 2016; 2018) by a large margin.

MRC is traditionally defined as a question-answering task that outputs answers given passage-question pairs as input. Considering the types of answers, Chen (2018) classified MRC tasks into four categories: cloze-style filling of gaps in a question (Ghaeini et al., 2018), multiple choice from several options (Zhang et al., 2020a), span extraction of the answer from the passage (Rajpurkar et al., 2016; 2018; Trischler et al., 2017), and free-style answer generation and summarization from the passage (Nguyen et al., 2016). MRC is widely applicable to applications rich in question-style queries, such as information retrieval and task-oriented conversations. For a detailed survey of this field, please refer to (Zhang et al., 2020c) for the recent research roadmap, datasets, and future directions.

In this paper, we focus on span-extraction style MRC with unanswerable questions. Rajpurkar et al. (2018) introduced 50K+ unanswerable questions to construct the SQuAD2.0 dataset. Unanswerable questions were constructed by rewriting originally answerable questions via inserting or removing negation words, substituting antonyms, swapping entities, imposing mutual exclusion, and adding impossible conditions.
Plausible answers, which correspond to spans in the given passage, are attached to these unanswerable questions. Numerous verifiers have been proposed to score the answerability of questions. For example, Hu et al. (2019) proposed a read-then-verify system that explicitly verifies the legitimacy of the predicted answer. An answer verifier is designed to decide whether or not the predicted answer is entailed by the input snippets (i.e., segments of the input passage). Their system achieved F1 scores of 74.8% and 74.2% on SQuAD2.0's dev and test sets, respectively. Zhang et al. (2020b) proposed a pipeline with two verifiers: a sketchy reading verifier that briefly investigates the overall interactions between a segment and a question through a binary classification network following the CLM, and an intensive reading module that includes a span extractor and an answerability verifier. These two verifiers are interpolated to yield the final decision on answerability. This framework achieved significant improvements (F1 scores of 90.9% and 91.4% on the dev and test sets of SQuAD2.0, respectively) over the strong ALBERT baseline (Lan et al., 2020).

Minimizing the span losses of the start and end positions of answers to answerable questions dominates current pretraining+fine-tuning frameworks. However, there is still a need for fine-grained verifiers that predict questions' answerability by exploiting the interaction between answer-injected questions and passages. It is valuable to score the linguistic correctness of the predicted answer by replacing the interrogatives in a question with that answer, then checking whether the result is a linguistically correct sentence and, in addition, whether the given passage entails the rewritten question.
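The answer-injection step can be sketched in a few lines. The interrogative patterns and the string-level rewriting heuristic below are illustrative assumptions; the actual scoring of the rewritten question is done by a CLM-based entailment network, which is omitted here.

```python
import re

# Illustrative 5W1H interrogative patterns (an assumption, not the
# paper's exact matching rules).
INTERROGATIVES = re.compile(
    r"\b(what|who|whom|whose|when|where|which|why|how)\b", re.IGNORECASE
)

def rewrite_question(question: str, predicted_answer: str) -> str:
    """Replace the first interrogative in `question` with the predicted
    answer span and turn the question into a declarative sentence."""
    rewritten = INTERROGATIVES.sub(predicted_answer, question, count=1)
    return rewritten.rstrip("?").strip() + "."

q = ("What implication can be derived for P and NP "
     "if P and co-NP are established to be unequal?")
print(rewrite_question(q, "P is not equal to NP"))
```

A complete-sentence answer yields a fluent declarative statement, while an unbalanced span such as "not equal to NP" does not; the cross-attention scorer is intended to prefer the former.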
For example, there are two reference answers ("P is not equal to NP" as a complete sentence, "not equal" as a verb phrase) to the question "What implication can be derived for P and NP if P and co-NP are established to be unequal?". When a system predicts "not equal to NP." with unbalanced arguments (i.e., containing the object argument NP yet missing the subject argument P), it is scored 0 in exact match and discounted in F1. Intuitively, "P is not equal to NP/not equal implication ..." should be scored higher than "not equal to NP implication ...". This motivates our first, position-sensitive, question-rewriting verifier: we score the correctness and completeness of the predicted answer and the rewritten question with existing CLMs, and build an entailment network that takes the cross-attention between the rewritten question and the passage as input.

On the other hand, the original passage/document is frequently too long to be used directly in pretraining+fine-tuning frameworks. For example, as reported in (Gong et al., 2020), passages in the TriviaQA training set (Joshi et al., 2017) contain 2,622 tokens on average under the BERT tokenizer (Kudo & Richardson, 2018). Current CLMs cannot accept arbitrarily long token sequences, so we are forced to cut the passage into fixed-length segments (e.g., 512 tokens with strides such as 128 or 256). The reference input to MRC is then a fixed-length segment instead of the whole passage. When the manually annotated answer to the question is outside the scope of a segment, the question is annotated as unanswerable regardless of its answerability in the whole passage. This biases answerable questions, since an individual segment loses its context, even though the segment may implicitly contain clues for correctly answering the question.
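The sliding-window segmentation described above can be sketched as follows. `stride` is interpreted here as the number of tokens shared between neighbouring segments, the convention of common tokenizer APIs; this is an illustrative assumption, not the paper's exact preprocessing.

```python
def segment_passage(tokens, max_len=512, stride=128):
    """Split a long token sequence into fixed-length, overlapping segments,
    as done before feeding a long passage to a CLM with a bounded input size."""
    segments, start = [], 0
    while True:
        segments.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the final segment reaches the end of the passage
        start += max_len - stride  # neighbours overlap by `stride` tokens
    return segments

# A toy passage of 10 "tokens", windows of 4 with an overlap of 2:
print(segment_passage(list(range(10)), max_len=4, stride=2))
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Note how an answer spanning tokens 4-5 falls entirely inside two of the segments, while a question about tokens 8-9 is unanswerable from the first segment alone even though the full passage contains the answer.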
For example, in SQuAD2.0, only one answer span is provided per question regardless of the multiple appearances (104,674 appearances of answer texts for the 86,821 answerable questions) of the same answer text in the given paragraph. For long-text MRC, Gong et al. (2020) proposed recurrent chunking mechanisms, first employing reinforcement learning to centre the candidate answer span in the segment and then building a recurrent network to transfer contextual information among segments. Considering that data preparation and multi-turn reasoning for training and inference are non-trivial under a reinforcement learning framework, we propose a different solution in this paper. We build a hierarchical attention network (HAN) (Yang et al., 2016) on sentence-level, segment-level, and finally paragraph-level cross-attentions to questions for extracting and verifying candidate answers. Since the HAN framework connects segment sequences in a recurrent way, Gong et al. (2020)'s recurrent chunking can be regarded as a special case of ours.

We combine these two types of verifiers into a pipeline and apply it to three span-extraction MRC benchmark sets: SQuAD2.0 (Rajpurkar et al., 2018), NewsQA (Trischler et al., 2017), and TriviaQA (Joshi et al., 2017) (Wikipedia part). Our pipeline achieves significant improvements in both exact match and F1 scores over state-of-the-art baselines.

2.1. ANSWER-INJECTED REWRITING OF QUESTIONS

The overwhelming majority of questions in span-extraction MRC datasets fall within the scope of 5W1H, as listed in Table 1. NewsQA's statistics come from its website^2. For SQuAD2.0 and TriviaQA, we compute statistics over the train and validation sets together, using the 5W1H interrogatives and their extensions (e.g., "whom" and "whose" are counted as who-style question indicators).
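As a minimal sketch of how such statistics can be gathered, the snippet below buckets each question by the first 5W1H interrogative it contains, folding extensions such as "whom" and "whose" into the who-style bucket. The token-level matching heuristic and the extension mapping are illustrative assumptions.

```python
from collections import Counter

# Extensions folded into their base 5W1H bucket (an assumption).
EXTENSIONS = {"whom": "who", "whose": "who"}
FIVE_W_ONE_H = {"what", "who", "when", "where", "why", "how"}

def interrogative_type(question: str) -> str:
    """Return the 5W1H bucket of the first interrogative found,
    or 'other' if the question contains none."""
    for token in question.lower().split():
        token = token.strip("?,.\"'")
        token = EXTENSIONS.get(token, token)
        if token in FIVE_W_ONE_H:
            return token
    return "other"

questions = [
    "Whose theorem is this?",
    "Where was the treaty signed?",
    "Name the capital of France.",
]
print(Counter(interrogative_type(q) for q in questions))
```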



^1 https://rajpurkar.github.io/SQuAD-explorer/
^2 https://www.microsoft.com/en-us/research/project/newsqa-dataset

