COARSE-TO-FINE KNOWLEDGE GRAPH DOMAIN ADAPTATION BASED ON DISTANTLY-SUPERVISED ITERATIVE TRAINING

Abstract

Modern supervised neural network models require large amounts of manually labeled data, which makes the construction of domain-specific knowledge graphs time-consuming and labor-intensive. In parallel, although there has been much research on named entity recognition and relation extraction based on distantly-supervised learning, constructing a domain-specific knowledge graph from large collections of textual data without manual annotations remains an urgent problem to be solved. In response, we propose an integrated framework for adapting and re-learning knowledge graphs from a coarse domain (biomedical) to a finer domain (oncology). In this framework, we apply distant supervision to cross-domain knowledge graph adaptation; consequently, no manual data annotation is required to train the model. We introduce a novel iterative training strategy to facilitate the discovery of domain-specific named entities and triples. Experimental results indicate that the proposed framework can perform domain adaptation and knowledge graph construction efficiently.

1. INTRODUCTION

The triples in a knowledge graph (KG) encode the relationships between various entities, providing rich semantic background knowledge for natural language processing (NLP) tasks such as natural language representation Liu et al. (2020), question answering Saxena et al. (2020), image captioning Zhang et al. (2021a), and text classification Jiang et al. (2020). Consequently, automatically constructing knowledge graphs directly from natural texts has attracted considerable attention from researchers in recent years Kertkeidkachorn & Ichise (2017); Rossanez et al. (2020); Stewart & Liu (2020). KG construction from text generally involves two primitive steps: named entity recognition (NER) and relation extraction (RE). Named entity recognition aims to identify the types of entities mentioned in text sequences, such as people, places, etc. in the open domain, or diseases, medicines, disease symptoms, etc. in the biomedical domain. Relation extraction, also known as triple extraction, aims to identify the relationship between two entities, such as the birthplace relationship between people and places in the open domain, or the therapeutic relationship between drugs and diseases in the biomedical domain. NER and RE are the necessary information extraction steps for constructing a KG from text. In addition to NER and RE, constructing a KG usually includes other steps such as coreference resolution, entity linking, knowledge fusion, and ontology extraction. In order to facilitate model evaluation, this paper mainly focuses on information extraction and then constructs a KG.

In the construction of fine-domain KGs, some existing resources are usually available, such as biomedical KGs in the coarse domain, which generally cover broader concepts and more commonsense knowledge. When constructing an oncology KG, a biomedical KG is thus available. However, few studies have focused on adapting a KG from the coarse domain (e.g., biomedical) to the fine domain (e.g., oncology) where a large collection of unlabeled textual data is available, which motivates the work in this paper.

Distant supervision Smirnova & Cudré-Mauroux (2018) is an intuitive way to transfer a coarse-domain KG to fine domains. Distant supervision provides labels for data with the help of an external knowledge base, which saves the time of manual labeling. For distantly-supervised NER, we can build distant labels by matching unlabeled sentences with external semantic dictionaries or knowledge bases. The matching strategies usually include string matching Zhao et al. (2019), regular expressions Fries et al. (2017), and some heuristic rules. Distantly-supervised RE holds an assumption Mintz et al. (2009): if two entities participate in a relation, then any sentence that contains those two entities might express that relation. Following this assumption, any sentence mentioning a pair of entities that have a relation according to the knowledge base will be labeled with this relation Smirnova & Cudré-Mauroux (2018). Therefore, the KG in the coarse domain can potentially be used as a knowledge base for distant supervision, thus avoiding a large number of manual annotations. However, using only the coarse-domain KG as the knowledge base might limit the model's ability to discover domain-specific named entities and triples in the fine domain, which in turn limits the construction of the fine-domain KG. To address these problems, in this paper we propose a novel coarse-to-fine knowledge graph domain adaptation (KGDA) framework. Our KGDA framework utilizes an iterative training strategy to enhance the model's ability to discover fine-domain entities and triples, thereby facilitating fast and effective coarse-to-fine KG domain adaptation.

Overall, the contributions of our work are as follows:

• An integrated framework for adapting and re-learning a KG from a coarse domain to a fine domain is proposed. As a case study, the biomedical domain and the oncology domain are considered the coarse domain and the fine domain, respectively.

• Our model does not require human-annotated samples, relying on distant supervision for cross-domain KG adaptation, and an iterative training strategy is applied to discover domain-specific named entities and new triples.

• The proposed method can be adapted to various pre-trained language models (PLMs) and can be easily applied to different coarse-to-fine KGDA tasks. It is so far the simplest data-driven approach for learning a KG from free text data with the help of a coarse-domain KG.

• Experimental results demonstrate the effectiveness of the proposed KGDA framework. We will release the source code and the data used in this work to fuel further research. The constructed oncology KG will be hosted as a web service to be used by the general public.

2.1. PIPELINE-BASED METHODS FOR KG CONSTRUCTION

The pipeline-based methods apply carefully-crafted linguistic and statistical patterns to extract co-occurring noun phrases as triples. There are many off-the-shelf toolkits available, for example, Stanford CoreNLP Manning et al. (2014), NLTK Thanaki (2017), and spaCy, which can be used for NER tasks; ReVerb Fader et al. (2011), OLLIE Schmitz et al. (2012), and Stanford OpenIE Angeli et al. (2015) can be used for the information extraction task. There have been multiple pipelines Mehta et al. (2019); Rossanez et al. (2020) developed as well, consisting of modules targeting the different functionalities needed for KG construction. However, the pre-defined rules of off-the-shelf toolkits are generally tailored to specific domains; such methods are not domain-agnostic, and a new set of rules will be needed for a new domain.

2.2. DATA-DRIVEN METHODS FOR KG CONSTRUCTION

With the development of representation learning in language models, researchers began to apply data-driven models to KG construction tasks. Based on how the model is trained, these works can be divided into three categories: fully-supervised methods Zhao et al. (2019); Li et al. (2022b), semi-supervised methods Zahera et al. (2021), and weakly-supervised methods Yu et al. (2021). We will introduce the fully-supervised and weakly-supervised methods in this section. Specifically, the NER, RE, and entity linking tasks in the KG construction pipeline can all be solved by fully-supervised learning methods such as long short-term memory (LSTM) networks Hochreiter & Schmidhuber (1997); Zeng et al. (2017). Graph neural network methods have also
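To make the distantly-supervised NER labeling discussed above concrete, the following is a minimal sketch of building distant BIO labels by string matching against an external dictionary. The dictionary entries and entity types here are illustrative toys standing in for a coarse-domain KG vocabulary, not the paper's actual resources.

```python
# Distant-supervision sketch for NER: match sentences against a dictionary
# (a stand-in for a coarse-domain KG vocabulary) to produce BIO tags
# without any manual annotation. Dictionary contents are illustrative.
ENTITY_DICT = {
    ("lung", "cancer"): "Disease",
    ("cisplatin",): "Drug",
}

def distant_bio_tags(tokens):
    """Greedy longest-match string matching against the dictionary."""
    tags = ["O"] * len(tokens)
    max_len = max(len(key) for key in ENTITY_DICT)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span first so "lung cancer" beats "lung".
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + span])
            if key in ENTITY_DICT:
                etype = ENTITY_DICT[key]
                tags[i] = f"B-{etype}"
                for j in range(i + 1, i + span):
                    tags[j] = f"I-{etype}"
                i += span
                matched = True
                break
        if not matched:
            i += 1
    return tags

tokens = "Cisplatin is used to treat lung cancer".split()
print(distant_bio_tags(tokens))
# ['B-Drug', 'O', 'O', 'O', 'O', 'B-Disease', 'I-Disease']
```

Note that such distant labels are inherently noisy: dictionary mismatches and ambiguous surface forms produce false negatives and false positives, which is part of what motivates iterative refinement.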
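The distant-supervision assumption for RE (Mintz et al., 2009) described above can likewise be sketched in a few lines: every sentence mentioning an entity pair that is related in the knowledge base is labeled with that relation. The KB triples below are illustrative, not drawn from the paper's data.

```python
# Sketch of distantly-supervised RE labeling: a sentence containing a pair
# of entities related in the KB is labeled with that relation.
# KB triples here are toy examples.
KB_TRIPLES = {
    ("cisplatin", "lung cancer"): "treats",
}

def distant_re_labels(sentences):
    """Label each sentence with every KB relation whose entity pair it mentions."""
    labeled = []
    for sent in sentences:
        low = sent.lower()
        for (head, tail), rel in KB_TRIPLES.items():
            if head in low and tail in low:
                labeled.append((sent, head, rel, tail))
    return labeled

sents = [
    "Cisplatin is commonly used to treat lung cancer.",
    "Lung cancer incidence has risen over the last decade.",
]
print(distant_re_labels(sents))
# [('Cisplatin is commonly used to treat lung cancer.', 'cisplatin', 'treats', 'lung cancer')]
```

The assumption is deliberately optimistic: a sentence may mention both entities without expressing the relation, so the resulting labels are noisy and downstream models must tolerate that noise.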
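The iterative training strategy at the heart of the framework is not spelled out in this section; as a heavily simplified, hypothetical illustration of why iteration helps discovery, the toy loop below replaces the trained extractor with a crude co-occurrence heuristic (any capitalized token in a sentence containing a known entity becomes a candidate). Entities discovered in one round let the next round reach sentences that the seed KG alone could not label.

```python
# Toy illustration of iterative expansion (hypothetical; the paper's actual
# procedure trains a PLM-based extractor rather than using this heuristic).
def iterative_expand(seed_entities, corpus, rounds=2):
    kg = set(seed_entities)
    for _ in range(rounds):
        discovered = set()
        for sent in corpus:
            tokens = sent.rstrip(".").split()
            # Only sentences matched by the current KG yield candidates,
            # mimicking distant labeling with the current knowledge base.
            if any(t.lower() in kg for t in tokens):
                discovered |= {t.lower() for t in tokens if t[0].isupper()}
        if discovered <= kg:   # nothing new discovered: converged
            break
        kg |= discovered       # expand the KG for the next round
    return kg

corpus = [
    "Cisplatin treats Osteosarcoma.",
    "Osteosarcoma responds to Methotrexate.",
]
# "Methotrexate" is only reachable in round 2, after round 1 adds
# "osteosarcoma" to the KG.
print(sorted(iterative_expand({"cisplatin"}, corpus)))
# ['cisplatin', 'methotrexate', 'osteosarcoma']
```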

