EMPIRICAL ANALYSIS OF UNLABELED ENTITY PROBLEM IN NAMED ENTITY RECOGNITION

Abstract

In many scenarios, named entity recognition (NER) models severely suffer from the unlabeled entity problem, where the entities of a sentence may not be fully annotated. Through empirical studies performed on synthetic datasets, we find two causes of performance degradation. One is the reduction of annotated entities and the other is treating unlabeled entities as negative instances. The first cause has less impact than the second one and can be mitigated by adopting pretrained language models. The second cause seriously misguides a model in training and greatly affects its performance. Based on the above observations, we propose a general approach that can almost eliminate the misguidance brought by unlabeled entities. The key idea is to use negative sampling which, to a large extent, avoids training NER models with unlabeled entities. Experiments on synthetic datasets and real-world datasets show that our model is robust to the unlabeled entity problem and surpasses prior baselines. On well-annotated datasets, our model is competitive with the state-of-the-art method.¹
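The negative sampling idea mentioned above can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function name, parameters, and span-enumeration details are illustrative assumptions. Instead of treating every unannotated span as a negative instance, only a small random subset of unannotated spans is drawn as negatives, so most unlabeled entities never contribute an incorrect training signal.

```python
import random

def training_instances(tokens, labeled_entities,
                       num_negatives=5, max_span_len=5, seed=0):
    """Sketch of span-level negative sampling for NER training.

    labeled_entities: set of (start, end, label) spans, end exclusive.
    Rather than labeling all unannotated spans "O" (which mislabels
    any entity the annotators missed), sample only a few of them.
    """
    rng = random.Random(seed)
    positive_spans = {(s, e) for (s, e, _) in labeled_entities}
    # Enumerate candidate spans up to max_span_len that are unannotated.
    candidates = [
        (s, e)
        for s in range(len(tokens))
        for e in range(s + 1, min(s + 1 + max_span_len, len(tokens) + 1))
        if (s, e) not in positive_spans
    ]
    # Draw a small subset of candidates to serve as negatives.
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    positives = [(s, e, lab) for (s, e, lab) in labeled_entities]
    return positives, [(s, e, "O") for (s, e) in negatives]
```

Because the number of sampled negatives is small relative to the number of candidate spans, the probability that any particular unlabeled entity (say, an unannotated "New York") is drawn as a false negative stays low.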

1. INTRODUCTION

Named entity recognition (NER) is an important task in information extraction. Previous methods typically cast it as a sequence labeling problem by adopting the IOB tagging scheme (Mesnil et al., 2015; Huang et al., 2015; Ma & Hovy, 2016; Akbik et al., 2018; Qin et al., 2019). A representative model is Bi-LSTM CRF (Lample et al., 2016). The great success achieved by these methods relies on massive amounts of correctly labeled data. However, in some real-world scenarios, not all the entities in the training corpus are annotated. For example, in some NER tasks (Ling & Weld, 2012), the datasets contain too many entity types, or a mention may be associated with multiple labels. Since manual annotation under these conditions is very hard, some entities are inevitably neglected by human annotators. The situation in distantly supervised NER (Ren et al., 2015; Fries et al., 2017) is even more serious. To reduce manual annotation, distant supervision (Mintz et al., 2009) is applied to automatically produce labeled data. As a result, large numbers of entities in the corpus are missed due to the limited coverage of knowledge resources. We refer to this as the unlabeled entity problem, which largely degrades the performance of NER models.

Several approaches have been used in prior works to alleviate this problem. Fuzzy CRF and AutoNER (Shang et al., 2018b) allow models to learn from phrases that may be potential entities. However, since these phrases are obtained through a distantly supervised phrase mining method (Shang et al., 2018a), many unlabeled entities in the training data may still not be recalled. In the setting where only unlabeled corpora and an entity ontology are available, Mayhew et al. (2019); Peng et al. (2019) employ positive-unlabeled (PU) learning (Li & Liu, 2005) to unbiasedly and consistently estimate the task loss. In their implementations, they build distinct binary classifiers for different labels. Nevertheless, the unlabeled entities still impact the classifiers of the corresponding entity types and, importantly, the model cannot disambiguate neighboring entities. Partial CRF (Tsuboi et al., 2008) is an extension of the commonly used CRF (Lafferty et al., 2001) that supports learning from incomplete annotations. Yang et al. (2018); Nooralahzadeh et al. (2019); Jie et al. (2019) use it to circumvent training with false negatives. However, as fully annotated corpora are still required

¹ Our source code is available at https://github.com/LeePleased/NegSampling-NER.

