UNCERTAINTY-BASED ADAPTIVE LEARNING FOR READING COMPREHENSION

Anonymous authors
Paper under double-blind review

Abstract

Recent years have witnessed a surge of successful applications of machine reading comprehension. Central to these tasks is the availability of massive amounts of labeled data, which facilitates the training of large-scale neural networks. However, in many real-world problems, annotated data are expensive to gather, not only because of the time and budget required, but also because of domain-specific restrictions such as privacy for healthcare data. In this regard, we propose an uncertainty-based adaptive learning algorithm for reading comprehension, which interleaves data annotation and model updating to mitigate the demand for labels. Our key techniques are two-fold: 1) an unsupervised uncertainty-based sampling scheme that queries the labels of the most informative instances with respect to the currently learned model; and 2) an adaptive loss minimization paradigm that simultaneously fits the data and controls the degree of model updating. We demonstrate on a benchmark dataset that 25% fewer labeled samples suffice to achieve similar, or even improved, performance. Our results provide strong evidence that for label-demanding scenarios, the proposed approach offers a practical guide to data collection and model training.

1. INTRODUCTION

The goal of machine reading comprehension (MRC) is to train an AI model that is able to understand natural language text (e.g. a passage) and answer questions related to it (Hirschman et al., 1999); see Figure 1 for an example. MRC has been one of the most important problems in natural language processing thanks to its various successful applications, such as smooth-talking AI speaker assistants, a technology recently highlighted among the 10 breakthrough technologies by MIT Technology Review (Karen, 2019). Of central importance to MRC is the availability of benchmark question-answering datasets, where a larger dataset often enables the training of more informative neural networks. In this regard, a number of benchmark datasets have been proposed in recent years with the aim of pushing forward the development of MRC. A partial list includes SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), MS MARCO (Nguyen et al., 2016), and Natural Questions (Kwiatkowski et al., 2019). While the emergence of these high-quality datasets has stimulated a surge of research and a large volume of MRC deployments, it is often challenging to go beyond the scale of current neural network architectures, since it is extremely expensive to obtain massive amounts of labeled data. The barrier of data collection can be seen from SQuAD: the research group at Stanford University spent 1,547 working hours annotating the SQuAD dataset, at a cost of over $14,000. This issue has also been recognized and addressed by AI companies. However, even equipped with machine-learning-assisted labeling tools (e.g. Amazon SageMaker Ground Truth), it is still expensive to hire and educate expert workers for annotation. What makes the issue more serious is the rise of security and privacy concerns in various problems, which prevents researchers from efficiently scaling their projects to diverse domains.
For example, all annotators are advised to complete a series of trainings on privacy rules, such as the Health Insurance Portability and Accountability Act, before they can work on medical records. In this work, we tackle this challenge by proposing a computationally efficient learning algorithm that is amenable to label-demanding problems. Unlike prior MRC methods that separate data annotation and model training, our algorithm interleaves these two phases. Our algorithm, in spirit, resembles the theme of active learning (Balcan et al., 2007), where the promise of active learning is that we can always concentrate on fitting only the most informative examples without suffering degraded performance. While a considerable number of works have shown that active learning often guarantees exponential savings in labels, the analysis typically holds for linear classification models (Awasthi et al., 2017; Zhang, 2018; Zhang et al., 2020). In stark contrast, less is explored for the more practical neural network based models, since it is nontrivial to extend important concepts such as the large margin of linear classifiers to neural networks. As a remedy, we consider an unsupervised sampling scheme based on the uncertainty of the instances (Settles, 2009). Our sampling scheme is adaptive (i.e. active) in the sense that it chooses instances that the currently learned model is most uncertain about. To this end, we recall that the purpose of MRC is to take as input a passage and a question, and to find the most accurate answer from the passage.
Roughly speaking, this can be thought of as a weight assignment problem, where we need to calculate how likely each word span in the passage is to be the correct answer. Ideally, we would hope that the algorithm assigns 1 to the correct answer and 0 to the remaining spans, leading to a large separation between the correct answer and the incorrect ones. Alternatively, if the algorithm assigns, say, 0.5 to two different answers and 0 to all others, then it is very uncertain about its response; this is a strong signal that we need to query an expert for the correct answer, i.e. perform active labeling. Our uncertainty-based sampling scheme is essentially motivated by this observation: the uncertainty of an instance (i.e. a pair of passage and question) is defined as the gap between the weight of the best candidate answer and that of the second best. We present a more formal description in Section 2. After identifying these most uncertain, and hence most informative, instances, we query their labels and use them to update the model. In this phase, in addition to minimizing the widely used entropy-based loss function, we consider an adaptive regularizer with two important properties. First, it enforces that the new model does not deviate far from the current model, since 1) with reasonable initialization we expect that the initial model performs reasonably well; and 2) we do not want to overfit the data even if they are recognized as informative. Second, the regularizer has a coefficient that increases with the iterations. Namely, as the algorithm proceeds, the stability of model updating outweighs loss minimization. In Section 2 we elaborate on the concrete form of our objective function. It is also worth mentioning that since in each iteration the algorithm only fits the uncertain instances, the model updating is faster than in traditional methods. The pipeline is illustrated in Figure 2.
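As a concrete illustration of this best-versus-second-best gap, the following sketch computes the margin of a vector of candidate-answer weights. The function name and the NumPy-based setup are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def margin_uncertainty(span_scores: np.ndarray) -> float:
    """Margin-style uncertainty: the gap between the best and
    second-best candidate-answer weights. A small gap means the
    model is uncertain, so the instance is informative."""
    if span_scores.size < 2:
        return 0.0
    top_two = np.sort(span_scores)[-2:]     # two largest weights
    return float(top_two[1] - top_two[0])   # best minus second best

# A confident prediction (0.9 vs 0.05) has a large margin,
# while an uncertain one (0.5 vs 0.5) has margin ~0.
confident = margin_uncertainty(np.array([0.9, 0.05, 0.05]))
uncertain = margin_uncertainty(np.array([0.5, 0.5, 0.0]))
```

Instances with the smallest margin would then be the first candidates for expert annotation.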
Given abundant unlabeled instances, our algorithm first evaluates their uncertainty and detects the most informative ones, marked in red. Then we send these instances to an expert to obtain the ground-truth answers, marked in yellow. With the newly added labeled samples, it is possible to perform incremental updating of the MRC model.

Roadmap. We summarize our main technical contributions below and discuss more related works in Section 5. In Section 2 we present a detailed description of the core components of our algorithm, and in Section 3 we provide an end-to-end learning paradigm for MRC with implementation details. In Section 4, we demonstrate the efficacy of our algorithm in terms of exact match, F1 score, and label savings. Finally, we conclude this paper in Section 6.
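One selection round of this annotate-then-update pipeline can be sketched as follows. This is a minimal illustration that assumes the candidate-answer weights have already been computed by the current model; the names `active_learning_round` and `oracle` are hypothetical:

```python
import numpy as np

def active_learning_round(model_scores, unlabeled_ids, budget, oracle):
    """Rank unlabeled instances by the best-vs-second-best margin and
    send the lowest-margin (most uncertain) ones to the expert."""
    margins = {}
    for i in unlabeled_ids:
        top_two = np.sort(model_scores[i])[-2:]  # top two span weights
        margins[i] = top_two[1] - top_two[0]     # small gap = uncertain
    # Query labels for the `budget` most uncertain instances.
    chosen = sorted(unlabeled_ids, key=lambda i: margins[i])[:budget]
    return {i: oracle(i) for i in chosen}

# Toy example with three instances; instance 1 is the most uncertain
# (its two best weights, 0.5 and 0.45, are nearly tied).
scores = {0: [0.9, 0.1, 0.0], 1: [0.5, 0.45, 0.05], 2: [0.8, 0.1, 0.1]}
labels = active_learning_round({k: np.array(v) for k, v in scores.items()},
                               [0, 1, 2], budget=1, oracle=lambda i: f"ans{i}")
```

The returned labeled pairs would then feed the incremental model update described above.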

1.1. SUMMARY OF CONTRIBUTIONS

We consider the problem of learning an MRC model in the label-demanding context, and we propose a novel algorithm that interleaves data annotation and model updating. In particular, there are two core components to this end: an unsupervised uncertainty-based sampling scheme that only queries labels of the most informative instances with respect to the currently learned model, and an adaptive loss minimization paradigm that simultaneously fits the data and controls the degree of model updating.
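As a minimal sketch of the adaptive loss described in Section 1, the objective can be written as the data-fitting loss plus a proximity term whose coefficient grows with the iteration counter, so later rounds update more conservatively. The linear schedule `lam0 * t` and the squared-L2 proximity term are illustrative assumptions rather than the paper's exact form:

```python
import numpy as np

def adaptive_objective(theta, theta_prev, data_loss, t, lam0=0.1):
    """Adaptive loss: fit the newly labeled data while an
    iteration-increasing proximity term keeps the new model close
    to the current one. `lam0` and the schedule `lam0 * t` are
    illustrative choices, not the paper's exact coefficients."""
    proximity = np.sum((theta - theta_prev) ** 2)  # squared L2 deviation
    return data_loss(theta) + lam0 * t * proximity

# Toy usage: a fixed data loss and a model one unit away from the
# previous one; the penalty grows as the iteration counter t grows.
theta, theta_prev = np.array([1.0, 2.0]), np.array([1.0, 1.0])
early = adaptive_objective(theta, theta_prev, lambda th: 0.5, t=1)
late = adaptive_objective(theta, theta_prev, lambda th: 0.5, t=10)
```

The growing coefficient realizes the trade-off noted earlier: as the algorithm proceeds, stability of model updating gradually outweighs loss minimization.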



Figure 1: An illustrative example in the SQuAD dataset (Rajpurkar et al., 2016).

