UNCERTAINTY-BASED ADAPTIVE LEARNING FOR READING COMPREHENSION

Anonymous authors
Paper under double-blind review

Abstract

Recent years have witnessed a surge of successful applications of machine reading comprehension. Of central importance to these tasks is the availability of massive amounts of labeled data, which facilitate the training of large-scale neural networks. However, in many real-world problems, annotated data are expensive to gather, not only because of time and budget costs, but also because of domain-specific restrictions such as privacy for healthcare data. In this regard, we propose an uncertainty-based adaptive learning algorithm for reading comprehension, which interleaves data annotation and model updating to mitigate the demand for labeling. Our key techniques are two-fold: 1) an unsupervised uncertainty-based sampling scheme that queries the labels of the most informative instances with respect to the currently learned model; and 2) an adaptive loss minimization paradigm that simultaneously fits the data and controls the degree of model updating. We demonstrate on the benchmark dataset that 25% fewer labeled samples suffice to guarantee similar, or even improved, performance. Our results provide strong evidence that for label-demanding scenarios, the proposed approach offers a practical guide to data collection and model training.
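The abstract's first technique, uncertainty-based sampling, can be illustrated by a standard pool-based active learning skeleton. The sketch below is not the paper's implementation; it is a minimal illustration under common assumptions (entropy of the model's predictive distribution as the uncertainty score; a hypothetical `model` object with `fit`/`predict_proba` methods and an `oracle` that supplies labels on request).

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each row of an (n_samples, n_classes) probability matrix.

    Higher entropy = the model is less certain about that instance.
    """
    eps = 1e-12  # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_most_uncertain(probs, unlabeled_idx, k):
    """Pick the k unlabeled instances with the highest predictive entropy."""
    ent = predictive_entropy(probs[unlabeled_idx])
    order = np.argsort(-ent)  # sort by descending entropy
    return [unlabeled_idx[i] for i in order[:k]]

def active_learning_loop(model, X, oracle, init_labeled, budget, batch_size):
    """Interleave annotation and training: query labels for the most
    uncertain instances, then refit the model, until the budget is spent.

    `model` and `oracle` are placeholders, not APIs from the paper.
    """
    labeled = list(init_labeled)
    unlabeled = [i for i in range(len(X)) if i not in labeled]
    for _ in range(budget // batch_size):
        model.fit(X[labeled], oracle(labeled))
        probs = model.predict_proba(X)
        query = select_most_uncertain(probs, unlabeled, batch_size)
        labeled.extend(query)                               # annotate queried items
        unlabeled = [i for i in unlabeled if i not in query]
    return labeled
```

For example, given predictions `[[0.99, 0.01], [0.5, 0.5], [0.9, 0.1]]`, the selector returns index 1, the near-uniform (most uncertain) row. The paper's second technique, adaptive loss minimization, would replace the plain `model.fit` call with an update rule that also bounds how far the model moves per round; its exact form is not specified in this excerpt.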

1. INTRODUCTION

The goal of machine reading comprehension (MRC) is to train an AI model that is able to understand natural language text (e.g. a passage) and answer questions related to it (Hirschman et al., 1999); see Figure 1 for an example. MRC has been one of the most important problems in natural language processing thanks to its various successful applications, such as smooth-talking AI speaker assistants, a technology recently highlighted as one of the 10 breakthrough technologies by MIT Technology Review (Karen, 2019). Of central importance to MRC is the availability of benchmark question-answering datasets, where a larger dataset often enables the training of a more powerful neural network. In this regard, a number of benchmark datasets have been proposed in recent years with the aim of pushing forward the development of MRC. A partial list includes SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), MSMARCO (Nguyen et al., 2016), and Natural Questions (Kwiatkowski et al., 2019). While the emergence of these high-quality datasets has stimulated a surge of research and a large volume of MRC deployments, it is often challenging to go beyond the scale of current neural network architectures, in that it is extremely expensive to obtain massive amounts of labeled data. The barrier of data collection can be seen from SQuAD: the research group at Stanford University spent 1,547 working hours on the annotation of the SQuAD dataset, at a cost of over $14,000. AI companies have also set out to address this issue. However, even with machine learning assisted labeling tools (e.g. Amazon SageMaker Ground Truth), it is still expensive to hire and educate expert workers for annotation. What makes the issue more serious is the rise of security and privacy concerns in various problems, which prevents researchers from efficiently scaling their projects to diverse domains.
For example, all annotators are advised to undergo a series of trainings on privacy rules, such as the Health Insurance Portability & Accountability Act, before they can work on medical records. In this work, we tackle the challenge by proposing a computationally efficient learning algorithm that is amenable to label-demanding problems. Unlike prior MRC methods that separate data annotation and model training, our algorithm interleaves these two phases. Our algorithm, in spirit,

