Semi-connected Joint Entity Recognition and Relation Extraction of Contextual Entities in Family History Records

Abstract

Entity extraction is an important step in document understanding. Higher-accuracy extraction of fine-grained entities can be achieved by combining Named Entity Recognition (NER) and Relation Extraction (RE) models. In this paper, a semi-connected joint model is proposed that performs both NER and RE. This joint model uses relations between entities to infer context-dependent fine-grained named entities in text corpora. The RE module is prevented from conveying information to the NER module, which reduces error accumulation during training. On our data, this improves the fine-grained NER F1-score over the existing state of the art from 0.4753 to 0.8563. This result opens the way to further applications in historical document processing, such as automated searching of the historical documents used in economics research and family history.

1. Introduction

Named Entity Recognition (NER), also called entity extraction or entity identification, is a natural language processing (NLP) technique that automatically identifies named entities (for example, names, places, or dates) in a text and classifies them into predefined categories. A coarse-grained label such as name is often sufficient, for instance when an application only needs to identify a company's employees from a paragraph of text. Although coarse-grained entity recognition is sufficient for many applications, family history applications require fine-grained classification, because they must know the relationships between entities in addition to their classes. Many companies and research organizations work in the fields of family history and historical document understanding. Family history work helps people learn about their heritage and form connections with their ancestors. To facilitate this work, these organizations often automate the extraction of information en masse from historical documents. One part of this process is called entity extraction. For example, in family history, digital text is searched for particular entities, including names of parents, names of children, birth dates, and marriage dates. These entities are extracted and compared to each other to build family tree charts in a process sometimes called indexing. To index historical documents precisely, coarse-grained labels such as name or date are not enough; fine-grained labels such as spouse name and marriage date are necessary. Furthermore, these fine-grained labels often depend on the document's internal context, in particular on the relationships between entities within a record.
These organizations do not have models that can accurately extract such context-dependent fine-grained entities, so they instead extract important words as coarse-grained entities (such as person, place, or date) and manually label them as fine-grained classifications. This problem is even more pronounced for records written in languages such as French with less labeled data.
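
To make the distinction between coarse-grained and fine-grained labels concrete, the following minimal sketch shows how a relation between two entities can turn a coarse label such as name or date into a fine-grained label such as spouse name or marriage date. It is a hand-written, rule-based illustration only, not the learned model proposed in this paper. The class names, the relation names (`spouse_of`, `married_on`, `child_of`, `born_on`), the `FINE_LABELS` table, and the example record are all hypothetical assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    coarse: str  # coarse-grained label, e.g. "name" or "date"


@dataclass(frozen=True)
class Relation:
    head: int  # index of the entity the relation starts from
    tail: int  # index of the entity whose label is refined
    kind: str  # hypothetical relation name, e.g. "spouse_of"


# Hypothetical lookup: (coarse label of tail, relation kind) -> fine label.
FINE_LABELS = {
    ("name", "spouse_of"): "spouse name",
    ("name", "child_of"): "child name",
    ("date", "married_on"): "marriage date",
    ("date", "born_on"): "birth date",
}


def refine(entities, relations):
    """Return one label per entity, refined wherever a relation applies."""
    labels = [e.coarse for e in entities]
    for r in relations:
        key = (entities[r.tail].coarse, r.kind)
        if key in FINE_LABELS:
            labels[r.tail] = FINE_LABELS[key]
    return labels


# Invented example record with three coarse-grained entities.
entities = [
    Entity("Jean Martin", "name"),
    Entity("Marie Dubois", "name"),
    Entity("12 mai 1874", "date"),
]
relations = [
    Relation(head=0, tail=1, kind="spouse_of"),
    Relation(head=0, tail=2, kind="married_on"),
]

fine = refine(entities, relations)
for entity, label in zip(entities, fine):
    print(f"{entity.text}: {label}")
```

In this sketch the first entity keeps its coarse label because no relation points to it, while the other two receive fine-grained labels only because of their relation to the first. This is the sense in which fine-grained labels are contextual to the relationships within a record.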

2. Related Work

In 2018, Bekoulis et al. revolutionized the field of named entity recognition (NER) by conceptualizing joint entity and relation extraction as a multi-head selection problem (Bekoulis et al., 2018). Their paper demonstrated that relation extraction was a helpful tool for improving entity extraction accuracy. Their original model used a bidirectional LSTM for sentence encoding. That same year, other work reported improved results using ELMo (Peters et al., 2018) for sentence encoding (Sanh et al., 2019). After the introduction of BERT (Devlin

