OPTIMIZING BI-ENCODER FOR NAMED ENTITY RECOGNITION VIA CONTRASTIVE LEARNING

Abstract

We present a bi-encoder framework for named entity recognition (NER), which applies contrastive learning to map candidate text spans and entity types into the same vector representation space. Prior work predominantly approaches NER as sequence labeling or span classification. We instead frame NER as a representation learning problem that maximizes the similarity between the vector representations of an entity mention and its type. This formulation handles nested and flat NER alike, and can better leverage noisy self-supervision signals. A major challenge of this bi-encoder formulation for NER lies in separating non-entity spans from entity mentions. Instead of explicitly labeling all non-entity spans as the same class Outside (O), as in most prior methods, we introduce a novel dynamic thresholding loss, learned in conjunction with the standard contrastive loss. Experiments show that our method performs well in both supervised and distantly supervised settings, for nested and flat NER alike, establishing new state of the art across standard datasets in the general domain (e.g., ACE2004, ACE2005, CoNLL2003) and high-value verticals such as biomedicine (e.g., GENIA, NCBI, BC5CDR, JNLPBA). We release the code at github.com/microsoft/binder.

1. INTRODUCTION

Named entity recognition (NER) is the task of identifying text spans associated with named entities and classifying them into a predefined set of entity types such as person, location, etc. As a fundamental component in information extraction systems (Nadeau & Sekine, 2007), NER has been shown to benefit various downstream tasks such as relation extraction (Mintz et al., 2009), coreference resolution (Chang et al., 2013), and fine-grained opinion mining (Choi et al., 2006). Inspired by recent success in open-domain question answering (Karpukhin et al., 2020) and entity linking (Wu et al., 2020; Zhang et al., 2021a), we propose an efficient BI-encoder for NameD Entity Recognition (BINDER). Our model employs two encoders to separately map text and entity types into the same vector space, and it is able to reuse the vector representations of text for different entity types (or vice versa), resulting in faster training and inference. Based on the bi-encoder representations, we propose a unified contrastive learning framework for NER, which enables us to overcome the limitations of popular NER formulations (shown in Figure 1), such as difficulty in handling nested NER with sequence labeling (Chiu & Nichols, 2016; Ma & Hovy, 2016), complex learning and inference for span-based classification (Yu et al., 2020; Fu et al., 2021), and challenges in learning with noisy supervision (Straková et al., 2019; Yan et al., 2021). Through contrastive learning, we encourage the representation of an entity type to be similar to those of its corresponding entity spans, and dissimilar to those of other text spans. Additionally, existing work labels all non-entity tokens or spans as the same class Outside (O), which can introduce false negatives when the training data is partially annotated (Das et al., 2022; Aly et al., 2021).
We instead introduce a novel dynamic thresholding loss in contrastive learning, which learns candidate-specific dynamic thresholds to distinguish entity spans from non-entity ones. To the best of our knowledge, we are the first to optimize a bi-encoder for NER via contrastive learning. We conduct extensive experiments to evaluate our method in both supervised and distantly supervised settings. Experiments demonstrate that our method achieves the state of the art on a wide range of NER datasets, covering both the general and biomedical domains. In supervised NER, compared to the previous best results, our method obtains a 2.4%-2.9% absolute improvement in F1 on standard nested NER datasets such as ACE2004 and ACE2005, and a 1.2%-1.9% absolute improvement on standard flat NER datasets such as BC5-chem, BC5-disease, and NCBI. In distantly supervised NER, our method obtains a 1.5% absolute improvement in F1 on the BC5CDR dataset. We further study the impact of various design choices in our method, and conduct breakdown analyses at the entity type and token levels, which reveal potential growth opportunities.

Figure 1: Left: The architecture of BINDER. The entity type encoder and the text encoder are isomorphic and fully decoupled Transformer models. In the vector space, the anchor point represents the special token [CLS] from the entity type encoder. Through contrastive learning, we maximize the similarity between the anchor and the positive token (Jim), and minimize the similarity between the anchor and negative tokens. The dotted gray circle (delimited by the similarity between the anchor and [CLS] from the text encoder) represents a threshold that separates entity tokens from non-entity tokens. To reduce clutter, data points representing candidate spans from the input text are not shown. Right: A comparison of BINDER with existing NER solutions along three dimensions: 1) whether it can be applied to nested NER without special handling; 2) whether it can be trained with noisy supervision without special handling; 3) whether it offers fast training and inference.
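The dynamic thresholding idea described above can be made concrete in a small sketch. The following is a minimal PyTorch illustration, not the paper's exact loss: we treat the similarity between the entity type embedding and the text-side [CLS] embedding as an instance-specific threshold, ask each positive candidate to outrank both the threshold and all negatives in a contrastive softmax, and ask the threshold to outrank every negative. The function name, the temperature value, and the exact logit construction are our assumptions.

```python
import torch
import torch.nn.functional as F

def dynamic_threshold_contrastive_loss(type_vec, cand_vecs, cls_vec,
                                       pos_mask, tau=0.07):
    """Sketch: the type-vs-[CLS] similarity acts as a learned,
    candidate-specific threshold separating entities from non-entities."""
    sims = cand_vecs @ type_vec / tau          # (N,) candidate-type similarities
    thr = torch.dot(cls_vec, type_vec) / tau   # scalar threshold score
    neg_sims = sims[~pos_mask]
    target = torch.zeros(1, dtype=torch.long)  # index 0 should win the softmax

    # Each positive candidate should score above the threshold and all negatives.
    pos_loss = 0.0
    for s in sims[pos_mask]:
        logits = torch.cat([s.view(1), thr.view(1), neg_sims]).view(1, -1)
        pos_loss = pos_loss + F.cross_entropy(logits, target)

    # The threshold itself should score above every negative candidate.
    neg_logits = torch.cat([thr.view(1), neg_sims]).view(1, -1)
    neg_loss = F.cross_entropy(neg_logits, target)
    return pos_loss + neg_loss
```

In this sketch, minimizing the loss pushes entity candidates above the threshold and non-entity candidates below it, so at inference one can keep exactly the spans whose similarity exceeds the type-vs-[CLS] score, with no fixed global cutoff.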

2. METHOD

In this section, we present the design of BINDER, a novel architecture for NER tasks. As our model is built upon a bi-encoder framework, we first provide the necessary background on encoding both entity types and text with a Transformer-based (Vaswani et al., 2017) bi-encoder. Then, we describe how we derive entity type and mention span representations from the bi-encoder's embedding output. Based on that, we introduce two contrastive learning objectives for NER, using token-level and span-level similarity respectively.

2.1. BI-ENCODER FOR NER

The overall architecture of BINDER is shown in Figure 1. Our model is built upon a bi-encoder architecture, which has mostly been explored for dense retrieval (Karpukhin et al., 2020). Following recent work, our bi-encoder consists of two isomorphic and fully decoupled Transformer models (Vaswani et al., 2017): an entity type encoder and a text encoder. For NER tasks, we consider two types of inputs: entity type descriptions and the text in which to detect named entities. At a high level, the entity type encoder produces a type representation for each entity type of interest (e.g., person in Figure 1), and the text encoder outputs a representation for each input token in the given text where named entities are potentially mentioned (e.g., Jim in Figure 1). Then, we enumerate all span candidates based on the corresponding token representations and match them with each entity type in the vector space. As shown in Figure 1, we maximize the similarity between the entity type and the positive spans, and minimize the similarity of negative spans.
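To make the enumerate-and-match step concrete, here is a toy sketch under stated assumptions: real BINDER uses two full BERT-style Transformers, whereas this stand-in uses embedding tables as "encoders", forms a span vector by concatenating and projecting its start and end token vectors, and scores every span against every type with a dot product. The class name, dimensions, and the start/end concatenation are our illustrative choices, not the paper's specification.

```python
import torch
import torch.nn as nn

class TinyBiEncoderNER(nn.Module):
    """Toy bi-encoder: embedding tables stand in for the two Transformers."""
    def __init__(self, vocab_size=100, dim=32, max_span_len=4):
        super().__init__()
        self.type_encoder = nn.Embedding(vocab_size, dim)  # stand-in type encoder
        self.text_encoder = nn.Embedding(vocab_size, dim)  # stand-in text encoder
        self.span_proj = nn.Linear(2 * dim, dim)           # [start; end] -> span vector
        self.max_span_len = max_span_len

    def forward(self, type_ids, text_ids):
        type_vecs = self.type_encoder(type_ids)            # (T, dim)
        tok_vecs = self.text_encoder(text_ids)             # (L, dim)
        spans, span_vecs = [], []
        L = text_ids.size(0)
        # Enumerate all candidate spans up to max_span_len tokens long.
        for i in range(L):
            for j in range(i, min(L, i + self.max_span_len)):
                spans.append((i, j))
                span_vecs.append(self.span_proj(torch.cat([tok_vecs[i], tok_vecs[j]])))
        span_vecs = torch.stack(span_vecs)                 # (S, dim)
        scores = span_vecs @ type_vecs.t()                 # (S, T) span-type similarities
        return spans, scores
```

Because the two encoders are fully decoupled, the type vectors can be computed once and reused across every sentence (and the token vectors reused across every type), which is the source of the speed advantage noted above.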

Note that Das et al. (2022) apply contrastive learning to NER in a few-shot setting. In this paper, we focus on supervised NER and distantly supervised NER.

