COARSE-TO-FINE KNOWLEDGE GRAPH DOMAIN ADAPTATION BASED ON DISTANTLY-SUPERVISED ITERATIVE TRAINING

Abstract

Modern supervised learning neural network models require a large amount of manually labeled data, which makes the construction of domain-specific knowledge graphs time-consuming and labor-intensive. In parallel, although there has been much research on named entity recognition and relation extraction based on distantly supervised learning, constructing a domain-specific knowledge graph from large collections of textual data without manual annotations is still an urgent problem to be solved. In response, we propose an integrated framework for adapting and re-learning knowledge graphs from a coarse domain (biomedical) to a finer-grained domain (oncology). In this framework, we apply distant supervision to cross-domain knowledge graph adaptation. Consequently, no manual data annotation is required to train the model. We introduce a novel iterative training strategy to facilitate the discovery of domain-specific named entities and triples. Experimental results indicate that the proposed framework can perform domain adaptation and knowledge graph construction efficiently.

1. INTRODUCTION

The triples in a knowledge graph (KG) encode the relationships between various entities, providing rich semantic background knowledge for many natural language processing (NLP) tasks, such as natural language representation Liu et al. (2020), question answering Saxena et al. (2020), image captioning Zhang et al. (2021a), and text classification Jiang et al. (2020). Consequently, automatically constructing knowledge graphs directly from natural texts has attracted close attention from researchers in recent years Kertkeidkachorn & Ichise (2017); Rossanez et al. (2020); Stewart & Liu (2020). KG construction from text generally involves two primitive steps: named entity recognition (NER) and relation extraction (RE). Named entity recognition aims to identify the types of entities mentioned in text sequences, such as people and places in the open domain, or diseases, medicines, and disease symptoms in the biomedical domain. Relation extraction, also known as triple extraction, aims to identify the relationship between two entities, such as the birthplace relationship between a person and a place in the open domain, or the therapeutic relationship between a drug and a disease in the biomedical domain. NER and RE are the essential information-extraction steps for constructing a KG from text. In addition to NER and RE, constructing a KG usually includes other steps such as coreference resolution, entity linking, knowledge fusion, and ontology extraction. To facilitate model evaluation, this paper mainly focuses on information extraction and then constructs a KG.

In fine-domain KG construction scenarios, there are usually existing resources available, such as coarse-domain biomedical KGs, which generally cover broader concepts and more commonsense knowledge. When constructing an oncology KG, a biomedical KG is thus available. However, few studies have focused on adapting a KG from a coarse domain (e.g., biomedical) to a fine domain (e.g., oncology) where a large collection of unlabeled textual data is available, which motivates the work in this paper. Distant supervision Smirnova & Cudré-Mauroux (2018) is an intuitive way to transfer a coarse-domain KG to fine domains. Distant supervision provides labels for data with the help of an external
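The distant-supervision idea above can be sketched in a few lines: a sentence that mentions both the head and tail entities of a known KG triple is labeled with that triple's relation, so no manual annotation is needed. The toy KG triples and sentences below are hypothetical illustrations, not data from the paper.

```python
# Toy sketch of distant supervision for relation labeling.
# Assumptions (hypothetical, not from the paper): a small coarse-domain
# KG of (head, relation, tail) triples and unlabeled target-domain text.

KG_TRIPLES = {
    ("aspirin", "treats", "headache"),
    ("metformin", "treats", "type 2 diabetes"),
}

SENTENCES = [
    "Patients took aspirin to relieve a headache within an hour.",
    "Metformin is the first-line therapy for type 2 diabetes.",
    "Aspirin was first synthesized in 1897.",  # no matching tail -> no label
]

def distant_label(sentences, triples):
    """Label a sentence with relation r if it mentions both the head
    and tail entities of some KG triple (h, r, t)."""
    labeled = []
    for sent in sentences:
        low = sent.lower()
        for head, rel, tail in triples:
            if head in low and tail in low:
                labeled.append((sent, head, rel, tail))
    return labeled

for sent, head, rel, tail in distant_label(SENTENCES, KG_TRIPLES):
    print(f"({head}, {rel}, {tail}) <- {sent}")
```

Here simple substring matching stands in for entity mention detection; the resulting labels are noisy (a sentence may mention both entities without expressing the relation), which is precisely why iterative re-training strategies such as the one proposed in this paper are needed.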

