Semi-connected Joint Entity Recognition and Relation Extraction of Contextual Entities in Family History Records

Abstract

Entity extraction is an important step in document understanding. Higher-accuracy extraction of fine-grained entities can be achieved by combining Named Entity Recognition (NER) and Relation Extraction (RE) models. In this paper, a semi-connected joint model is proposed that performs both NER and RE. This joint model uses relations between entities to infer context-dependent fine-grained named entities in text corpora. The RE module is prevented from conveying information to the NER module, which reduces error accumulation during training. On our data, this improves the fine-grained NER F1-score over the existing state of the art from .4753 to .8563. The result opens the way to further applications in historical document processing, such as automated searching of the historical documents used in economics research and family history.

1. Introduction

Named Entity Recognition (NER), also called entity extraction or entity identification, is a natural language processing (NLP) technique that automatically identifies named entities (names, places, or dates, for example) in a text and classifies them into predefined categories. A coarse-grained label such as name is often sufficient, for example when an application only needs to identify the employees of a company from a paragraph of text. Although coarse-grained entity recognition is sufficient for many applications, family history applications require fine-grained classification, because they need to know the relationships between entities in addition to their classes.
There are many companies and research organizations in the fields of family history and historical document understanding. Family history work helps people learn about their heritage and form connections with their ancestors. To facilitate this work, these organizations often automate the extraction of information en masse from historical documents. One part of this process is entity extraction: digital text is searched for particular entities, such as names of parents, names of children, birth dates, and marriage dates. These entities are extracted and compared to each other to build family tree charts in a process sometimes called indexing. To index historical documents precisely, coarse-grained labels such as name or date are not enough; fine-grained labels such as spouse name and marriage date are necessary. Furthermore, these fine-grained labels often depend on the document's internal context, in particular on the relationships between entities within a record. These organizations do not have models that can accurately extract such context-dependent fine-grained entities, so they instead extract important words as coarse-grained entities (such as person, place, or date) and manually assign them fine-grained labels. This problem is even more pronounced for records written in languages such as French, for which less labeled data is available.
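To make the notion of a context-dependent fine-grained entity concrete, the following minimal Python sketch shows how a coarse-grained label combined with a relation can determine a fine-grained label. It is only an illustration, not the model proposed in this paper: the class names, the relation names (`mother_of`, `birth_date_of`), and the `FINE_LABELS` lookup table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    coarse: str  # coarse-grained label, e.g. "NAME" or "DATE"


@dataclass(frozen=True)
class Relation:
    head: int  # index of the entity the relation starts from
    tail: int  # index of the entity the relation points to
    kind: str  # hypothetical relation name, e.g. "mother_of"


# Hypothetical lookup: (coarse label of head, relation kind) -> fine-grained label.
FINE_LABELS = {
    ("NAME", "mother_of"): "MotherName",
    ("NAME", "father_of"): "FatherName",
    ("DATE", "birth_date_of"): "BirthDate",
}


def refine(entities, relations):
    """Return one fine-grained label per entity, falling back to the coarse label."""
    fine = [e.coarse for e in entities]
    for r in relations:
        key = (entities[r.head].coarse, r.kind)
        if key in FINE_LABELS:
            fine[r.head] = FINE_LABELS[key]
    return fine


entities = [
    Entity("Marie Dupont", "NAME"),
    Entity("Jean Dupont", "NAME"),
    Entity("12 mars 1850", "DATE"),
]
relations = [Relation(0, 1, "mother_of"), Relation(2, 1, "birth_date_of")]
print(refine(entities, relations))
```

In this toy record the two names receive the same coarse label, and only the relation between them identifies the first one as a mother's name. The second name stays at its coarse label because no relation starts from it.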

2. Related Work

Figure 1: Transcription of records into family trees. Example of a handwritten birth record, written in French, which after transcription (using the handwriting recognition model) is passed through our joint entity relation model in order to identify the entities and their respective relations. Note that the transcription is not perfect: it contains spelling mistakes, grammatical inconsistencies, etc., which make it harder for the model to find the correct entities and relations.

In 2018, Bekoulis et al. revolutionized the field of named entity recognition (NER) by conceptualizing joint entity and relation extraction as a multi-head selection problem (Bekoulis et al., 2018). Their paper demonstrated that relation extraction is a helpful tool for improving entity extraction accuracy. Their original model used a bidirectional LSTM for sentence encoding. That same year, papers saw improved results using ELMo (Peters et al., 2018) for sentence encoding (Sanh et al., 2019). After the introduction of BERT (Devlin et al., 2019) in 2019, several papers saw even better results when using BERT for sentence encoding instead of ELMo. Since then, papers in the field have improved entity recognition accuracy by creating joint models that use relation extraction. The ways these joint models are implemented vary. One approach is to have the joint model ask questions about the data (Li et al., 2019; Zhao et al., 2020). Another approach uses distantly supervised data augmentation to reduce the impact of negative labels on the joint model (Xie et al., 2021). Wang and Lu improve relation extraction by filling an entity-relation table (Wang and Lu, 2020). Hierarchical approaches to relation extraction are particularly effective when the relations themselves are hierarchical (Han et al., 2018; Takanobu et al., 2019; Zhang et al., 2021). The most successful approaches for entity recognition insert markers into the sentences. This is done in models such as PURE (Zhong and Chen, 2021). These markers reduce the need for embeddings, which reduces the memory needed and improves inference speed. Papers that use this approach are in the top 3 micro-F1 scores for both entity extraction and relation extraction on the benchmark datasets CoNLL 2003, ACE2004, ACE2005, and SciERC (Ye et al., 2021).
The models from existing research perform well on benchmark datasets. However, they fail to perform on more complicated datasets, such as ones with fine-grained or contextual entities. Most of these models cannot find all the relations in a multi-sentence corpus because they cannot map a relationship between two entities in separate sentences. However, cross-sentence relation extraction is necessary to find the nuanced relationships in family history records.
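The marker-based approach mentioned above can be illustrated with a short Python sketch. It wraps a subject span and an object span in typed marker tokens before the sentence would be passed to an encoder. The marker strings (`<S:NAME>`, `</O:NAME>`), the function name `insert_markers`, and the span convention (inclusive token indices, non-overlapping spans) are assumptions made for this illustration; they are not taken from PURE or from any other cited system.

```python
def insert_markers(tokens, subject, obj):
    """Wrap the subject and object spans of a token list in typed markers.

    `subject` and `obj` are (start, end, type) triples with inclusive token
    indices. The spans are assumed not to overlap.
    """
    s_start, s_end, s_type = subject
    o_start, o_end, o_type = obj
    out = []
    for i, tok in enumerate(tokens):
        # Opening markers go immediately before the first token of a span.
        if i == s_start:
            out.append(f"<S:{s_type}>")
        if i == o_start:
            out.append(f"<O:{o_type}>")
        out.append(tok)
        # Closing markers go immediately after the last token of a span.
        if i == s_end:
            out.append(f"</S:{s_type}>")
        if i == o_end:
            out.append(f"</O:{o_type}>")
    return out


tokens = "Marie Dupont is the mother of Jean".split()
marked = insert_markers(tokens, subject=(0, 1, "NAME"), obj=(6, 6, "NAME"))
print(" ".join(marked))
```

Because the markers are ordinary tokens in the input sequence, a relation classifier can read the span boundaries and types directly from the encoded sentence rather than from separately stored span representations.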

3. Problem Description

Existing research does not adequately perform fine-grained entity extraction on contextual entities. Traditional methods may be able to identify names and dates, but they cannot distinguish Mother's Name from Sister's Name, or tell apart the dates corresponding to different events in a record. Much of the difficulty comes from the distance between entities in a paragraph. Combining entity recognition with relation extraction significantly improves the accuracy of contextual fine-grained entity extraction for the automatic indexing systems used by researchers performing automated historical document analysis. This contributes a novel solution that significantly reduces the manual annotation effort when indexing records without sacrificing recognition accuracy. The indexing of family history records requires an NER model that identifies contextual fine-grained entities. However, it is difficult to train NER models to do that on their own. Relation extraction can help find the context in the corpus by identifying relations between entities. A joint model is needed

