ADAPTIVE SELF-TRAINING FOR NEURAL SEQUENCE LABELING WITH FEW LABELS

Abstract

Neural sequence labeling is an important technique employed for many Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), slot tagging for dialog systems, and semantic parsing. Large-scale pre-trained language models obtain strong performance on these tasks when fine-tuned on large amounts of task-specific labeled data. However, such large-scale labeled datasets are difficult to obtain for many tasks and domains due to the high cost of human annotation as well as privacy and data access constraints for sensitive user applications. This challenge is exacerbated for sequence labeling tasks, which require such annotations at the token level. In this work, we develop techniques to address the label scarcity challenge for neural sequence labeling models. Specifically, we develop self-training and meta-learning techniques for training neural sequence taggers with few labels. While self-training serves as an effective mechanism to learn from large amounts of unlabeled data, meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels. Extensive experiments on six benchmark datasets, including two for massive multilingual NER and four slot tagging datasets for task-oriented dialog systems, demonstrate the effectiveness of our method. With only 10 labeled examples per class for each task, our method obtains a 10% improvement over state-of-the-art systems, demonstrating its effectiveness in the low-resource setting.

1. INTRODUCTION

Motivation. Deep neural networks typically require large amounts of training data to achieve state-of-the-art performance. Recent advances with pre-trained language models like BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019) have reduced this annotation bottleneck. In this paradigm, large neural network models are trained on massive amounts of unlabeled data in a self-supervised manner. However, the success of these large-scale models still relies on fine-tuning them on large amounts of labeled data for downstream tasks. For instance, our experiments show a 27% relative improvement on average when fine-tuning BERT with the full training set (2.5K-705K labels) vs. fine-tuning with only 10 labels per class. This poses several challenges for many real-world tasks. Not only is acquiring large amounts of labeled data for every task expensive and time-consuming, but it is also infeasible in many cases due to data access and privacy constraints. This issue is exacerbated for sequence labeling tasks that require annotations at the token and slot level, as opposed to instance-level classification tasks. For example, an NER task can have tags like B-PER and I-PER marking the beginning and intermediate tokens of a person-name span, with O marking tokens outside any span, and similar tags for the names of locations and organizations. Similarly, language understanding models for dialog systems rely on effective identification of what the user intends to do (intents) and the corresponding values as arguments (slots) for use by downstream applications. Therefore, fully supervised neural sequence taggers are expensive to train for such tasks, given the requirement of thousands of annotations for hundreds of slots across the many different intents. Semi-supervised learning (SSL) (Chapelle et al., 2010) is one of the promising paradigms to address labeled data scarcity by making effective use of large amounts of unlabeled data in addition to task-specific labeled data.
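To make the token-level annotation scheme concrete, the following illustrative snippet (not from the paper; the sentence and helper function are invented for exposition) shows a BIO-tagged sequence and how labeled spans are recovered from it:

```python
# Illustrative BIO tagging for NER: B-/I- mark the beginning/inside of
# an entity span; O marks tokens outside any span. Every token needs a
# label, which is what makes sequence-labeling annotation costly.
tokens = ["Ada", "Lovelace", "visited", "London", "."]
labels = ["B-PER", "I-PER", "O", "B-LOC", "O"]

def extract_spans(tokens, labels):
    """Collect (entity_type, text) spans from a BIO-tagged sequence."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):            # a new span begins
            if current:
                spans.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)          # continue the open span
        else:                               # O tag or invalid I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

print(extract_spans(tokens, labels))
# [('PER', 'Ada Lovelace'), ('LOC', 'London')]
```

The per-token labeling illustrates why annotation cost grows with both sentence length and the number of slot types.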
Self-training (ST; Scudder, 1965), one of the earliest SSL approaches, has recently shown state-of-the-art performance for tasks like image classification (Li et al., 2019; Xie et al., 2020), performing on par with supervised systems while using very few training labels. In contrast to such instance-level classification tasks, sequence labeling tasks have dependencies
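The generic self-training loop underlying these approaches can be sketched with a deliberately tiny toy model (a 1-D threshold classifier; the function names and data here are hypothetical, not the paper's neural tagger): a teacher trained on the few gold labels pseudo-labels the unlabeled pool, and only confident pseudo-labels are added back for retraining.

```python
# Minimal self-training sketch (toy 1-D classifier, illustrative only).
# "Training" fits a threshold at the midpoint between class means;
# confidence is distance from the threshold.

def fit_threshold(points):
    """Train the toy model: threshold = midpoint of the class means."""
    pos = [x for x, y in points if y == 1]
    neg = [x for x, y in points if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def self_train(labeled, unlabeled, margin=2.0, rounds=3):
    for _ in range(rounds):
        t = fit_threshold(labeled)                      # train teacher
        confident = [x for x in unlabeled if abs(x - t) > margin]
        # add confident pseudo-labels to the training set
        labeled += [(x, int(x > t)) for x in confident]
        unlabeled = [x for x in unlabeled if abs(x - t) <= margin]
    return fit_threshold(labeled)                       # final student

gold = [(0.0, 0), (10.0, 1)]            # few gold labels
pool = [1.0, 2.0, 8.5, 9.0, 5.1]        # unlabeled pool
print(self_train(gold, pool))
```

The confidence margin is the toy stand-in for pseudo-label filtering; points near the boundary (like 5.1 above) are never pseudo-labeled, mirroring how low-confidence predictions are held out to limit error propagation.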

