ADAPTIVE SELF-TRAINING FOR NEURAL SEQUENCE LABELING WITH FEW LABELS

Abstract

Neural sequence labeling is an important technique employed for many Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), slot tagging for dialog systems and semantic parsing. Large-scale pre-trained language models obtain very good performance on these tasks when fine-tuned on large amounts of task-specific labeled data. However, such large-scale labeled datasets are difficult to obtain for several tasks and domains due to the high cost of human annotation as well as privacy and data access constraints for sensitive user applications. This is exacerbated for sequence labeling tasks requiring such annotations at the token level. In this work, we develop techniques to address the label scarcity challenge for neural sequence labeling models. Specifically, we develop self-training and meta-learning techniques for training neural sequence taggers with few labels. While self-training serves as an effective mechanism to learn from large amounts of unlabeled data, meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels. Extensive experiments on six benchmark datasets, including two for massive multilingual NER and four slot tagging datasets for task-oriented dialog systems, demonstrate the effectiveness of our method. With only 10 labeled examples per class for each task, our method obtains a 10% improvement over state-of-the-art systems, demonstrating its effectiveness in the low-resource setting.

1. INTRODUCTION

Motivation. Deep neural networks typically require large amounts of training data to achieve state-of-the-art performance. Recent advances with pre-trained language models like BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019) have reduced this annotation bottleneck. In this paradigm, large neural network models are trained on massive amounts of unlabeled data in a self-supervised manner. However, the success of these large-scale models still relies on fine-tuning them on large amounts of labeled data for downstream tasks. For instance, our experiments show a 27% relative improvement on average when fine-tuning BERT with the full training set (2.5K-705K labels) vs. fine-tuning with only 10 labels per class. This poses several challenges for many real-world tasks. Not only is acquiring large amounts of labeled data for every task expensive and time-consuming, it is also infeasible in many cases due to data access and privacy constraints. This issue is exacerbated for sequence labeling tasks that require annotations at the token and slot level, as opposed to instance-level classification tasks. For example, an NER task can have slots like B-PER and I-PER marking the beginning and intermediate tokens of person names, O marking tokens outside any span, and similar slots for the names of locations and organizations. Similarly, language understanding models for dialog systems rely on effective identification of what the user intends to do (intents) and the corresponding values as arguments (slots) for use by downstream applications. Therefore, fully supervised neural sequence taggers are expensive to train for such tasks, given the requirement of thousands of annotations for hundreds of slots across many different intents. Semi-supervised learning (SSL) (Chapelle et al., 2010) is one of the promising paradigms to address labeled data scarcity by making effective use of large amounts of unlabeled data in addition to task-specific labeled data.
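To make the token-level annotation requirement concrete, the following sketch shows how BIO labels encode entity spans. The sentence, label set, and helper function are illustrative assumptions, not taken from the paper or any dataset it uses.

```python
# Illustrative example: token-level BIO labels encode entity spans.
tokens = ["Ada", "Lovelace", "visited", "London"]
labels = ["B-PER", "I-PER", "O", "B-LOC"]

def extract_spans(tokens, labels):
    """Collect (entity_type, text) spans from a BIO-labeled sequence."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append(current)
            current = (lab[2:], [tok])          # start a new span
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)              # continue the open span
        else:                                   # "O" (or a stray I-) closes it
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

print(extract_spans(tokens, labels))
# [('PER', 'Ada Lovelace'), ('LOC', 'London')]
```

Note that every token needs a label, so annotating one sentence already requires several labeling decisions, which is what makes large labeled sets costly for these tasks.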
Self-training (ST; Scudder, 1965), one of the earliest SSL approaches, has recently shown state-of-the-art performance for tasks like image classification (Li et al., 2019; Xie et al., 2020), performing on par with supervised systems while using very few training labels. In contrast to such instance-level classification tasks, sequence labeling tasks have dependencies between token-level labels, which makes pseudo-labels from the teacher noisier and error propagation more severe. To this end, we address some key challenges on how to construct an informative held-out validation set for token-level re-weighting. Prior works (Ren et al., 2018; Shu et al., 2019) for instance classification construct this validation set by random sampling. However, sequence labeling tasks involve many slots (e.g., WikiAnn has 123 slots over 41 languages) with variable difficulty and distribution in the data. With random sampling, the model oversamples from the most populous categories and slots. This is particularly detrimental for low-resource languages in the multilingual setting. To this end, we develop an adaptive mechanism to create the validation set on the fly, considering the diversity and uncertainty of the model for different slot types. Furthermore, we leverage this validation set for token-level loss estimation and for re-weighting pseudo-labeled sequences from the teacher in the meta-learning setup. While prior works (Li et al., 2019; Sun et al., 2019; Bansal et al., 2020) on meta-learning for image and text classification leverage multi-task learning to improve a target classification task based on several similar tasks, in this work we focus on a single sequence labeling task, making our setup more challenging altogether.

Our task and framework overview. We focus on sequence labeling tasks with only a few annotated samples (e.g., K = {5, 10, 20, 100}) per slot type for training and large amounts of task-specific unlabeled data.
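The adaptive validation-set construction described above could be sketched as follows. This is a hypothetical simplification rather than the paper's implementation: the `build_validation_set` function, its loss-decay heuristic, and the fixed per-slot quota are assumptions used only to illustrate selecting a diverse, uncertainty-aware set instead of sampling uniformly at random.

```python
import random
from collections import defaultdict

def build_validation_set(examples, losses_prev, losses_curr, per_slot=2, seed=0):
    """Hypothetical sketch: pick labeled examples per slot type, preferring
    those whose loss decayed least between iterations (a proxy for the
    model still being uncertain about them).

    `examples` is a list of (tokens, labels, slot_type) items; the loss
    dicts map example index -> scalar loss at the previous/current step.
    Returns the selected example indices.
    """
    rng = random.Random(seed)
    by_slot = defaultdict(list)
    for idx, (_, _, slot) in enumerate(examples):
        by_slot[slot].append(idx)
    selected = []
    for slot, idxs in sorted(by_slot.items()):
        # Small loss decay = still hard for the model; sample those first.
        idxs = sorted(idxs, key=lambda i: losses_prev[i] - losses_curr[i])
        selected.extend(idxs[:per_slot])  # a quota from every slot type
    rng.shuffle(selected)
    return selected
```

The per-slot quota is what prevents the oversampling problem the paragraph describes: rare slot types contribute as many validation examples as populous ones.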
Figure 1 shows an overview of our framework with the following components: (i) Self-training: Our self-training framework leverages a pre-trained language model as a teacher and co-trains a student model with iterative knowledge exchange. (ii) Adaptive labeled data acquisition for validation: Our few-shot learning setup assumes a small number of labeled training samples per slot type. The labeled data from multiple slot types are not equally informative for the student model to learn from. While prior works in meta-learning randomly sample some labeled examples for the held-out validation set, we develop an adaptive mechanism to create this set on the fly. To this end, we leverage loss decay as a proxy for model uncertainty to select informative labeled samples for the student model to learn from, in conjunction with the re-weighting mechanism in the next step. (iii) Meta-learning for sample re-weighting: Since pseudo-labeled samples from the teacher can be noisy, we employ meta-learning to re-weight them to improve the student model's performance on the held-out validation set obtained from the previous step. In contrast to prior work (Ren et al., 2018) that re-weights whole instances for classification, our re-weighting operates at the token level for sequences.
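The teacher-student pseudo-labeling in component (i) can be illustrated with a minimal sketch. Note the hedge: the re-weighting below uses teacher confidence as a simplified stand-in, whereas the framework described above learns token weights via meta-learning on the held-out validation set; both function names are assumptions for illustration only.

```python
def pseudo_label(teacher_probs):
    """The teacher assigns each token its argmax label plus a confidence
    score; the student is then trained on these pseudo-labels."""
    labeled = []
    for probs in teacher_probs:
        label = max(range(len(probs)), key=lambda k: probs[k])
        labeled.append((label, probs[label]))
    return labeled

def token_weights(labeled, floor=0.0):
    """Simplified stand-in for meta-learned re-weighting: down-weight
    low-confidence pseudo-labels so noisy tokens contribute less to the
    student's loss. (The framework above learns these weights instead;
    confidence weighting is only an illustration.)"""
    return [max(floor, conf) for _, conf in labeled]

# One token sequence with per-class teacher probabilities (3 classes).
probs = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.1, 0.8, 0.1]]
labeled = pseudo_label(probs)
weights = token_weights(labeled)
print(labeled)   # [(0, 0.9), (0, 0.4), (1, 0.8)]
print(weights)   # [0.9, 0.4, 0.8]
```

In the full loop, the student's weighted loss over pseudo-labeled tokens is combined with supervised training on the few gold labels, and the updated student periodically replaces the teacher for the next round of pseudo-labeling.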



Figure 1: MetaST framework.

