COARSE-TO-FINE KNOWLEDGE GRAPH DOMAIN ADAPTATION BASED ON DISTANTLY-SUPERVISED ITERATIVE TRAINING

Abstract

Modern supervised neural network models require large amounts of manually labeled data, which makes the construction of domain-specific knowledge graphs time-consuming and labor-intensive. In parallel, although there has been much research on named entity recognition and relation extraction based on distantly supervised learning, constructing a domain-specific knowledge graph from large collections of textual data without manual annotation remains an urgent problem. In response, we propose an integrated framework for adapting and re-learning knowledge graphs from one coarse domain (biomedical) to a finer-grained domain (oncology). In this framework, we apply distant supervision to cross-domain knowledge graph adaptation; consequently, no manual data annotation is required to train the model. We also introduce a novel iterative training strategy to facilitate the discovery of domain-specific named entities and triples. Experimental results indicate that the proposed framework can perform domain adaptation and knowledge graph construction efficiently.

1. INTRODUCTION

The triples in a knowledge graph (KG) encode the relationships between entities, providing rich semantic background knowledge for various natural language processing (NLP) tasks, such as natural language representation Liu et al. (2020), question answering Saxena et al. (2020), image captioning Zhang et al. (2021a), and text classification Jiang et al. (2020). Consequently, automatically constructing knowledge graphs directly from natural text has attracted close attention from researchers in recent years Kertkeidkachorn & Ichise (2017); Rossanez et al. (2020); Stewart & Liu (2020). KG construction from text generally involves two primitive steps: named entity recognition (NER) and relation extraction (RE). Named entity recognition aims to identify the types of entities mentioned in text sequences, such as person and place in the open domain, or disease, medicine, and disease symptom in the biomedical domain. Relation extraction, also known as triple extraction, aims to identify the relationship between two entities, such as the birthplace relationship between a person and a place in the open domain, or the therapeutic relationship between a drug and a disease in the biomedical domain. NER and RE are the necessary information extraction steps for constructing a KG from text. Beyond NER and RE, constructing a KG usually includes further steps such as coreference resolution, entity linking, knowledge fusion, and ontology extraction. To facilitate model evaluation, this paper mainly focuses on information extraction and the subsequent KG construction. When constructing a fine-domain KG, some existing resources are usually available, such as coarse-domain biomedical KGs, which generally cover broader concepts and more commonsense knowledge. When constructing an oncology KG, a biomedical KG is thus available.
However, few studies have focused on adapting a KG from a coarse domain (e.g., biomedical) to a fine domain (e.g., oncology) where a large collection of unlabeled textual data is available, which motivates the work in this paper. Distant supervision Smirnova & Cudré-Mauroux (2018) is an intuitive way to transfer a coarse-domain KG to fine domains. Distant supervision provides labels for data with the help of an external knowledge base, which saves the time of manual labeling. For distantly-supervised NER, we can build distant labels by matching unlabeled sentences with external semantic dictionaries or knowledge bases; the matching strategies usually include string matching Zhao et al. (2019), regular expressions Fries et al. (2017), and heuristic rules. Distantly-supervised RE holds an assumption Mintz et al. (2009): if two entities participate in a relation, then any sentence that contains those two entities might express that relation. Following this assumption, any sentence mentioning a pair of entities that have a relation according to the knowledge base will be labeled with this relation Smirnova & Cudré-Mauroux (2018). Therefore, the KG in the coarse domain can be used as a knowledge base for distant supervision, avoiding a large number of manual annotations. However, only using the KG of the coarse domain as the knowledge base might limit the model's ability to discover domain-specific named entities and triples in the fine domain, which further limits the construction of the fine-domain KG. To address these problems, we propose a novel coarse-to-fine knowledge graph domain adaptation (KGDA) framework. Our KGDA framework utilizes an iterative training strategy to enhance the model's ability to discover fine-domain entities and triples, thereby facilitating fast and effective coarse-to-fine KG domain adaptation.
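The distant-supervision assumption described above can be sketched in a few lines. The toy dictionary, the toy triple store, and the `distant_labels` helper below are our own illustrative stand-ins, not the paper's implementation:

```python
# Hypothetical sketch of distant supervision: label a raw sentence using a
# coarse-domain KG. The toy KG below is illustrative only.

# Toy coarse-domain KG: entity -> type, and (head, tail) -> relation.
ENTITIES = {"aspirin": "drug", "headache": "disease"}
TRIPLES = {("aspirin", "headache"): "treats"}

def distant_labels(sentence: str):
    """Match KG entities by string lookup, then label any co-occurring
    entity pair with the relation recorded in the KG (Mintz et al.'s
    assumption)."""
    tokens = sentence.lower().split()
    found = [t for t in tokens if t in ENTITIES]
    relations = []
    for head in found:
        for tail in found:
            rel = TRIPLES.get((head, tail))
            if rel:
                relations.append((head, rel, tail))
    return found, relations

ents, rels = distant_labels("Aspirin relieves headache in most patients")
# ents -> ['aspirin', 'headache']; rels -> [('aspirin', 'treats', 'headache')]
```

Real systems replace the whitespace lookup with dictionary-driven span matching, but the labeling logic is the same: co-occurrence plus a KB lookup, with no human annotation.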
Overall, the contributions of our work are as follows:
• An integrated framework for adapting and re-learning a KG from a coarse domain to a fine domain is proposed. As a case study, the biomedical domain and the oncology domain are considered the coarse domain and fine domain, respectively.
• With distant supervision, our model does not require human-annotated samples for cross-domain KG adaptation, and the iterative training strategy is applied to discover domain-specific named entities and new triples.
• The proposed method can be adapted to various pre-trained language models (PLMs) and can be easily applied to different coarse-to-fine KGDA tasks. It is so far the simplest data-driven approach for learning a KG from free text with the help of a coarse-domain KG.
• Experimental results demonstrate the effectiveness of the proposed KGDA framework. We will release the source code and the data used in this work to fuel further research. The constructed oncology KG will be hosted as a web service for the general public.

2. RELATED WORK

2.1. PIPELINE-BASED METHODS FOR KG CONSTRUCTION

The pipeline-based methods apply carefully-crafted linguistic and statistical patterns to extract co-occurring noun phrases as triples, and many off-the-shelf toolkits are available for this purpose. We introduce fully-supervised and weakly-supervised methods in this section. Specifically, the NER, RE, and entity linking tasks in the KG construction pipeline can all be solved by fully-supervised learning methods such as long short-term memory neural networks (LSTM) Hochreiter & Schmidhuber (1997); Zeng et al. (2017); graph neural network methods have also been applied. On the other hand, distant supervision, a weakly-supervised learning method, can replace manual annotation with an existing external knowledge base. Previous studies have applied distantly-supervised learning to NER Zheng et al. (2021) and RE Wei (2021); Zhang et al. (2021b). Thus, in this work, we adopt the distant-supervision scheme in the proposed KGDA framework. It should be noted that the KG of a coarse domain (e.g., biomedical) generally does not contain the complete knowledge of its finer sub-domains (e.g., oncology). When we use the coarse-domain KG for distant supervision, labels in the target domain are therefore limited by the source domain, making it less effective to discover new knowledge. To address this issue, we introduce an iterative strategy to gradually update the model via distant supervision while using the partially-trained model to discover new entities and relations from the data of the target fine domain.

3.1. NOTATION AND TASK DEFINITION

An unstructured sentence s = [w_1, w_2, ..., w_n] is a sequence of tokens, where n is its length. A dataset D is a collection of unstructured sentences, i.e., D = {s_1, s_2, ..., s_m}. A knowledge graph, denoted K, is a collection of triples t = (e_i, r_j, e_k), where e_i ∈ E and e_k ∈ E are the head and tail entities, respectively, and r_j ∈ V is the relation between e_i and e_k. We denote the coarse-domain KG as K_c and the fine-domain KG as K_f. In a typical KG domain adaptation scenario, we have an existing coarse-domain KG and a large amount of unlabeled text in the fine domain. For example, when constructing the oncology KG, we can utilize the existing biomedical KG and collect oncology-related literature as unlabeled text. The KG constructed from the fine-domain data then includes triples overlapping with the coarse-domain KG as well as new triples representing domain-specific knowledge. Specifically, the fine-domain KG contains the following three types of triples:
• Overlapping triples T_O: triples that also exist in the coarse-domain KG, indicating knowledge overlap between the coarse and fine domains.
• Triples of new relations but overlapping entities T_R: triples whose entity pairs both exist in the coarse-domain KG but with no relationship indicated between them.
• Triples of new entities T_E: triples with at least one entity not existing in the coarse-domain KG; consequently, the relationship is also unknown in the coarse domain.
Both T_R and T_E constitute knowledge specific to the fine domain. The goal of the coarse-to-fine KGDA task is to adapt the KG from the coarse domain to the fine domain and to leverage the coarse-domain knowledge to guide the mining of new knowledge specific to the fine domain. We keep the definitions of entity types and relation types from the coarse-domain KG when constructing the fine-domain KG.
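Under these definitions, categorizing an extracted triple reduces to two set-membership tests against K_c. The sketch below is our own illustration (function and variable names are not from the paper, and T_R is simplified to "exact triple absent from K_c"):

```python
# Classify a triple extracted from fine-domain text into the three
# categories of Section 3.1, given the coarse-domain KG K_c.

def classify_triple(triple, kc_triples, kc_entities):
    head, rel, tail = triple
    if head not in kc_entities or tail not in kc_entities:
        return "T_E"  # at least one entity is new to the coarse domain
    if (head, rel, tail) in kc_triples:
        return "T_O"  # overlapping triple
    return "T_R"      # known entities, new relation between them

# Toy coarse-domain KG for illustration.
kc_entities = {"aspirin", "headache", "fever"}
kc_triples = {("aspirin", "treats", "headache")}

assert classify_triple(("aspirin", "treats", "headache"), kc_triples, kc_entities) == "T_O"
assert classify_triple(("aspirin", "treats", "fever"), kc_triples, kc_entities) == "T_R"
assert classify_triple(("osimertinib", "treats", "fever"), kc_triples, kc_entities) == "T_E"
```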

3.2. ITERATIVE TRAINING FRAMEWORK

While it is trivial to identify the overlapping entities E_O and triples T_O by distant supervision, NER and RE models trained on the entire corpus will not be able to recognize the fine-domain-specific named entities and triples (T_R and T_E), because the distant-supervision labels are generated by matching against K_c. We therefore introduce an iterative training strategy to construct T_R and T_E from the text and adapt the knowledge from K_c to K_f. The overall framework of the iterative training scheme is shown in Fig. 1, and the detailed pseudo-code can be found in Algorithm 1. Rather than performing distantly-supervised training on the whole unlabeled corpus, the core mechanism of the proposed iterative training is to split the whole unlabeled dataset into n disjoint sub-datasets. Before building the distant-supervision corpus for a sub-dataset, the previously trained models are used to predict on it to obtain fine-domain-specific knowledge, which is conducive to mining T_R and T_E. As shown in Figure 1, we first preprocess the acquired fine-domain text corpus; preprocessing includes handling special characters, word segmentation, and filtering sentences with human-defined rules (such as sentence length). Our framework then involves two neural network models: an NER model and an RE model. We replace the PLM's output layer with a classifier head as the NER model model_N and fine-tune it by minimizing the cross-entropy loss on the distant-supervision NER corpus, applying the BIO scheme Li et al. (2012) to generate label sequences. For the first part of the text corpus D_1, the distant-supervision method is applied to construct the NER training corpus corp_N and the RE training corpus corp_R, and model_N and the RE model model_R are trained on these corpora, respectively. For the remaining parts of the corpus D_i, we apply the previously trained model_N and model_R to extract the entities and triples in the fine domain, and select the high-confidence entities E_conf and high-confidence triples T_conf as fine-domain-specific knowledge (line 7).
Then we take K_c, E_conf, and T_conf as the external knowledge bases for constructing the distant-supervision corpora corp_N and corp_R (line 8). Finally, we use the overlapping triples T_O and the high-confidence triples T_conf to construct the fine-domain knowledge graph (line 17). We show the details of get_distant_corpus in Algorithm 2 and get_specific_knowledge in Algorithm 3.
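The control flow of Algorithm 1 can be condensed into a short runnable toy. The helper functions below are simplistic stand-ins that we define ourselves; the real framework fine-tunes neural NER and RE models and applies confidence filtering instead:

```python
# Runnable toy sketch of Algorithm 1's control flow: discover, then
# re-label with the enlarged knowledge base, subset by subset.

def get_distant_corpus(D_i, kb):
    # Toy stand-in: "label" every sentence that mentions a KB entry.
    return {s for s in D_i if any(e in s for e in kb)}

def get_specific_knowledge(D_i, kb):
    # Toy stand-in: pretend the trained models discover capitalized tokens.
    return {w for s in D_i for w in s.split() if w.istitle()} - kb

def run_kgda(subsets, K_c):
    kb = set(K_c)                         # KB grows with discovered knowledge
    corpus = get_distant_corpus(subsets[0], kb)
    for D_i in subsets[1:]:               # iterate over the remaining subsets
        kb |= get_specific_knowledge(D_i, kb)   # 1) discover new knowledge
        corpus |= get_distant_corpus(D_i, kb)   # 2) re-label with larger KB
    return kb, corpus                     # 3) real code retrains here

subsets = [["aspirin treats headache"], ["Osimertinib inhibits EGFR"]]
kb, corpus = run_kgda(subsets, {"aspirin", "headache"})
# "Osimertinib" is discovered in iteration 2, so its sentence enters the corpus
```

The key property this toy preserves is that knowledge mined from earlier subsets expands the distant-supervision labels for later subsets.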

3.3. CONSTRUCTING DISTANTLY-SUPERVISED CORPUS

Through distant supervision we can only match entity pairs that have a relationship, and we use them as positive samples. We then construct negative samples with the NULL relation by the following two schemes: 1) randomly sampling two entities that have no relationship defined in the coarse domain; 2) randomly sampling a word from the out-of-domain words W_O (i.e., words that are not entities defined in the coarse domain) as one of the entities. The parameter ratio_n controls the ratio of negative samples (constructed by either scheme) to the total sample size. The parameter ratio_o controls the ratio of entity pairs constructed by the second scheme (i.e., via sampling words outside the domain) to the number of negative samples.

Algorithm 1 (continued):
 2: corp_N, corp_R, E_O, T_O = get_distant_corpus(D_1, K_c, E_conf, T_conf, W_O)
 3: train_NER(model_N, corp_N)
 4: train_RE(model_R, corp_R)
 5: i = 2
 6: while i <= n do
 7:   E_new, E_conf, T_new, T_conf = get_specific_knowledge(D_i, K_c, E_new, E_conf, T_new, T_conf)
 8:   corp'_N, corp'_R, E'_O, T'_O = get_distant_corpus(D_i, K_c, E_conf, T_conf, W_O)
 9:   corp_N = corp_N ∪ corp'_N
10:   corp_R = corp_R ∪ corp'_R
11:   E_O = E_O ∪ E'_O
12:   T_O = T_O ∪ T'_O
13:   train_NER(model_N, corp_N)
14:   train_RE(model_R, corp_R)
15:   i = i + 1
16: end while
17: K_f = build_kg(T_O, T_conf)
18: return K_f

Algorithm 2 (continued):
 5:   entities = entity_matching(D_i^j, K_c, E_conf)
 6:   E_O = E_O ∪ entities
 7:   corp_N = corp_N ∪ build_NER_sample(D_i^j, entities)
 8:   triples_k, triples_c = entity_pair_matching(D_i^j, K_c, T_conf)
 9:   triples = triples_k ∪ triples_c
10:   triples_n = get_negative_triples(D_i^j, W_O, triples, ratio_n, ratio_o)
11:   corp_R = corp_R ∪ get_samples(triples)
12:   corp_R = corp_R ∪ get_samples(triples_n)
13:   T_O = T_O ∪ triples_k
14:   j = j + 1
15: end while
16: return corp_N, corp_R, E_O, T_O
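A possible implementation of the negative-sampling step (get_negative_triples, Algorithm 2 line 10) is sketched below. The signature is simplified to operate on entity and word sets rather than sentences, and the rejection-sampling loop is our own design choice:

```python
# Hedged sketch of the two negative-sampling schemes for RE training.
# ratio_n: fraction of negatives among all samples; ratio_o: fraction of
# negatives built from out-of-domain words.
import random

def get_negative_triples(entities, W_O, positives, ratio_n, ratio_o, rng=random):
    # negatives / (positives + negatives) = ratio_n
    n_neg = int(len(positives) * ratio_n / (1 - ratio_n))
    positive_pairs = {(h, t) for h, _, t in positives}
    negatives = []
    while len(negatives) < n_neg:
        if rng.random() < ratio_o:
            # scheme 2: one argument is an out-of-domain word
            pair = (rng.choice(sorted(W_O)), rng.choice(sorted(entities)))
        else:
            # scheme 1: two in-domain entities with no known relation
            pair = (rng.choice(sorted(entities)), rng.choice(sorted(entities)))
        if pair not in positive_pairs and pair[0] != pair[1]:
            negatives.append((pair[0], "NULL", pair[1]))
    return negatives

negs = get_negative_triples({"aspirin", "headache", "fever", "ibuprofen"},
                            {"patient", "therapy"},
                            [("aspirin", "treats", "headache")] * 4,
                            ratio_n=0.2, ratio_o=0.3)
# 4 positives with ratio_n = 0.2 yield one NULL-relation negative
```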
In addition to K_c from the source domain, we use K_c, E_conf, and T_conf together as knowledge bases for constructing the distantly-supervised corpus. This ensures that the NER and RE models can identify the overlapping knowledge between K_c and K_f while being guided to discover new knowledge specific to the fine domain.

Algorithm 3 (continued):
17:   T_new = merge_triple(T_new, triples)
18:   j = j + 1
19: end while
20: T_conf = get_confidence_triple(T_new, th_pt, th_ft)
21: return E_new, E_conf, T_new, T_conf

As shown in Algorithm 2, for building the distantly-supervised NER corpus corp_N, each sentence D_i^j is first string-matched against the knowledge bases K_c and E_conf to extract the entities in the sentence (line 5). The matched entities are then merged into the overlapping entities E_O, and the NER label sequences generated through the BIO strategy are merged into corp_N (lines 6 and 7). For building the distantly-supervised RE corpus corp_R, we first take K_c and T_conf as knowledge bases and use entity-pair matching to find the triples triples_k (based on K_c) and triples_c (based on T_conf) appearing in the sentence D_i^j (lines 8 and 9). We then build negative triples with parameters ratio_n and ratio_o (line 10). Finally, we construct the RE corpus from triples, triples_n, and the corresponding sentences through a predefined relation sample template (lines 11 and 12).
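The BIO label generation of line 7 can be illustrated with a whitespace-tokenized toy; the function and its inputs below are simplified stand-ins for the paper's pipeline:

```python
# Minimal sketch of generating BIO labels for a distantly-matched sentence.

def bio_labels(tokens, entities):
    """entities maps a tuple of tokens to an entity type, e.g.
    ('lung', 'cancer') -> 'disease'."""
    labels = ["O"] * len(tokens)
    for span, etype in entities.items():
        n = len(span)
        for i in range(len(tokens) - n + 1):
            # only label spans whose first token is still unlabeled
            if tuple(tokens[i:i + n]) == span and labels[i] == "O":
                labels[i] = f"B-{etype}"
                labels[i + 1:i + n] = [f"I-{etype}"] * (n - 1)
    return labels

tokens = "gefitinib is used for lung cancer".split()
labels = bio_labels(tokens, {("lung", "cancer"): "disease", ("gefitinib",): "drug"})
# -> ['B-drug', 'O', 'O', 'O', 'B-disease', 'I-disease']
```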

3.4. DISCOVERING FINE-DOMAIN SPECIFIC KNOWLEDGE

Recall that in the proposed iterative training framework, the whole unlabeled dataset is divided into n sub-datasets D_i, i = 1...n. Fine-domain-specific knowledge discovery is performed on each sub-dataset except the first one, i.e., D_i, i = 2...n (lines 5 to 16 in Algorithm 1). For each new sub-dataset D_i, we use the previously updated models model_N and model_R to predict new entities and triples. Afterward, the sub-dataset is used for updating model_N and model_R via distantly-supervised training. As noisy or incorrect entities and triples could be discovered during this procedure, we developed a filtering mechanism to keep only the entities and triples with higher confidence. Specifically, we filter the discovered entities and triples by two rules: 1) the probability of a new entity or triple predicted by the corresponding model should be greater than the pre-defined threshold th_pe or th_pt, respectively; 2) the cumulative frequency of a new entity or triple discovered from datasets D_2 to D_i should be greater than the pre-defined threshold th_fe or th_ft, respectively. As shown in Algorithm 3, for discovering new entities E_new, we apply the trained model_N on dataset D_i and obtain entities that are disjoint with K_c (lines 5 and 6). Then we merge these entities with the previously discovered entity set E_new (line 7). Finally, we select the high-confidence entities E_conf by the prediction probability and cumulative frequency as described above (line 10). For the discovery of new triples T_new, we enumerate entity pairs that are disjoint with K_c (lines 13-15). We then use the trained RE model and the predefined sample template to predict the relationship of each entity pair and delete the triples whose predicted relationship is NULL (line 16). The remaining processing is similar to the discovery of new entities.
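The two filtering rules can be sketched as a small bookkeeping structure. The class layout, and the choice to keep the best prediction probability per entity, are our own assumptions for illustration:

```python
# Sketch of the confidence filter for discovered entities (Section 3.4):
# keep an entity only if its best predicted probability exceeds th_pe AND
# its cumulative frequency across D_2..D_i exceeds th_fe.
from collections import defaultdict

class EntityPool:
    def __init__(self, th_pe=0.95, th_fe=2):
        self.th_pe, self.th_fe = th_pe, th_fe
        self.freq = defaultdict(int)        # cumulative frequency per entity
        self.best_prob = defaultdict(float) # best model probability per entity

    def add(self, entity, prob):
        self.freq[entity] += 1
        self.best_prob[entity] = max(self.best_prob[entity], prob)

    def confident(self):
        return {e for e in self.freq
                if self.best_prob[e] > self.th_pe and self.freq[e] > self.th_fe}

pool = EntityPool()
for p in (0.98, 0.97, 0.99):       # entity seen in three sub-datasets
    pool.add("osimertinib", p)
pool.add("rare-term", 0.99)        # high probability but seen only once
# pool.confident() -> {'osimertinib'}
```

An analogous pool with thresholds th_pt and th_ft would filter the discovered triples.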
After Algorithm 3, the discovered entities specific to the fine domain are stored in E_conf, and the discovered triples T_R (new relations, overlapping entities) and T_E (new entities) are stored in T_conf. In the next iteration, Algorithm 2 then uses the updated E_conf and T_conf for building the distant-supervision corpus. This iterative design facilitates the interplay between two complementary tasks over a fixed number of unannotated samples in the fine target domain: distantly-supervised training of the NER and RE models, and the discovery of new knowledge using the trained NER and RE models. It thus improves the efficiency of performing KG domain adaptation and construction without any annotation.

4. EXPERIMENTS

In this work, we used the adaptation of KG from the biomedical domain (coarse) to the oncology domain (fine) as an example to demonstrate the workflow of the KGDA framework, as well as to evaluate its effectiveness in practice. Implementation details of the experiment are also provided, along with the publicly-available data and the containerized environment in the released source code, for easy replication of the experiment and the development of other KG methods.

4.1. DATASET

We downloaded papers from 12 international journals in the oncology domain (journal details can be found in the supplemental materials). The PDF files of the papers were cleaned and converted to sentences. In total, we included nearly 240,000 sentences as the unlabeled text corpus D of the oncology domain. The coarse-domain KG K_c used in this work is the biomedical KG available at https://idea.edu.cn/bios.html, which defines 18 entity types and 19 relationship types and includes 5.2 million English entities and 7.34 million triples. The lists of entity types and relationship types can be found in the supplementary materials.

4.2. EVALUATION

Similar to previous work Mintz et al. (2009), we evaluate our method in two schemes: held-out evaluation and manual evaluation. For the held-out evaluation, we reserve a part of the text corpus D as the test set. During testing, we compare the prediction results of the NER and RE models with the labels matched against K_c and calculate the precision, recall, and F1 on the held-out dataset. Specifically, we use seqeval (https://github.com/chakki-works/seqeval) to evaluate the micro-averaged precision, recall, and F1 of NER. When evaluating the RE model, we perform relation classification on the triples existing in K_c whose entity pairs appear in the held-out corpus. Finally, the weighted-average precision, recall, and F1 from the held-out evaluation are reported. As the labels of the testing samples in the held-out evaluation are all inferred by distant supervision from the coarse domain, this scheme can only evaluate whether the trained model captures the knowledge in the coarse domain; it cannot evaluate the ability of the models to discover new knowledge in the fine domain. Therefore, we also adopt a manual evaluation scheme consisting of the evaluation of: 1) the entities specific to the fine domain E_conf, which are not present in K_c; 2) the triples of new relations T_R; 3) the triples of new entities T_E. We randomly sampled 50 cases each of E_conf, T_R, and T_E, and asked one physician to manually label whether the entities and triples are correct. As the number of named entity and triple instances expressed in the corpus is unknown, we cannot estimate the recall of the fine-domain KG; therefore, we only report the precision of E_conf, T_R, and T_E. We fully recognize that the discovery of new knowledge in the fine domain is an indispensable task for this work, and we are recruiting more medical experts to conduct a human reader study and performance evaluation of the proposed model.
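For intuition, the micro-averaged span metrics can be computed by hand as follows. This is a stdlib sketch over (start, end, type) spans; the actual evaluation uses the seqeval package on BIO label sequences:

```python
# Held-out scheme in miniature: compare model-predicted entity spans
# against distant labels derived from K_c and compute micro-averaged
# precision, recall, and F1.

def micro_prf(gold_spans, pred_spans):
    tp = len(gold_spans & pred_spans)            # exact span-and-type matches
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 1, "drug"), (4, 6, "disease")}       # distant labels from K_c
pred = {(0, 1, "drug"), (2, 3, "disease")}       # model predictions
p, r, f = micro_prf(gold, pred)
# -> p = 0.5, r = 0.5, f = 0.5
```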

4.3. IMPLEMENTATION SETTINGS

We divide the corpus D into six equal subsets, each containing around 40,000 sentences. We used D_1 to D_5 for model training and KG construction and reserved D_6 for held-out evaluation. We tested BERT Kenton & Toutanova (2019), Bio_ClinicalBERT Alsentzer et al. (2019), and biomed_RoBERTa Gururangan et al. (2020) for initializing the NER and RE models. Our experiments were run on an Ubuntu machine with three NVIDIA 1080Ti graphics cards. The learning rate, batch size, and number of epochs are set to 2e-5, 20, and 4, respectively. The hyperparameters th_fe, th_pe, th_ft, and th_pt are set to 2, 0.95, 3, and 0.97, respectively. The parameters ratio_n and ratio_o that control negative sampling are set to 0.2 and 0.3.

The numbers of all discovered entities (E_O), triples (T_O), new entities with high confidence (E_conf), triples representing new relations with overlapping entities (T_R), and triples representing new relations with new entities (T_E) are shown in Table 3, with each row corresponding to one pre-trained language model. The numbers of E_O and T_O show minor differences among the pre-trained language models, possibly due to conflicts in string matching against the knowledge bases. E_conf, T_R, and T_E represent knowledge specific to the fine domain. We sampled 50 cases each from E_conf, T_R, and T_E for manual evaluation; the results are shown in Table 4.

4.6. KNOWLEDGE GRAPH CONSTRUCTION IN THE FINE DOMAIN

As our ultimate goal, we can construct the KG in the fine domain by combining T_O, T_R, and T_E. We selected biomed_RoBERTa as the backbone language model for KGDA and constructed the knowledge graph correspondingly. An example of the KG we built is shown in the supplementary material.

4.7. ABLATION STUDY

We investigated the impact of three techniques employed by KGDA on its held-out performance by removing the corresponding component from the framework:
• w/o (cumulative): when using corpus D_i to train the NER and RE models, the cumulative corpus is not used; i.e., lines 9 and 10 in Algorithm 1 are deleted, and corp'_N and corp'_R from line 8 are used directly as corp_N and corp_R.
• w/o (iter): the iterative training strategy is removed and only K_c is used as the external knowledge base.
• w/o (iter, type): the iterative training strategy is removed and the entity type is deleted from the RE template; in this variant, the template is "[CLS] head entity [SEP] tail entity [SEP] sentence".
The results of the ablation analysis are shown in Table 5. Comparing the complete framework with w/o (cumulative), we can see that using the accumulated data through iterations is beneficial for improving the generalization ability of the NER and RE models. The held-out performance of the model without iteration indicates that the iterative training strategy can not only discover knowledge specific to the fine domain but also maintain the ability to discover knowledge overlapping between the coarse and fine domains. The RE performance of w/o (iter) is slightly better than that of w/o (iter, type), indicating that specifying the entity types is helpful for improving the performance of the RE task.

5. CONCLUSION AND DISCUSSION

In this paper, we propose an integrated, end-to-end framework for knowledge graph domain adaptation using distant supervision, which can construct a KG from fully unlabeled raw text with the guidance of an existing KG. To deal with the potential challenge in distant supervision that the knowledge discovered from the new domain might be limited, we propose an iterative training strategy that divides an unlabeled corpus into multiple sub-corpora. For each new sub-corpus, we combine the knowledge in the coarse domain with the knowledge identified from the previous sub-corpora for distantly-supervised training. By adopting the iterative training strategy, our proposed KGDA framework can discover not only knowledge that overlaps with the coarse domain but also knowledge specific to the fine domain and unknown to the coarse domain, thus enabling coarse-to-fine domain adaptation. We implemented the adaptation from a biomedical KG to the oncology domain in our experiments and verified the effectiveness of the KGDA framework through held-out and manual evaluations. Several limitations and challenges remain beyond the current work toward more effective and accurate KG construction. Firstly, a more thorough evaluation with a human reader study is needed to validate that new knowledge relevant (not only correct) to the target domain can be discovered by KGDA. Secondly, it has been recognized by the field that distant supervision inevitably introduces noisy labels Liang et al. (2020); Zhang et al. (2021b); a denoising step is thus usually needed but is not implemented in the current version of KGDA. Thirdly, KGs have already been constructed in domains related to oncology and cancer research; we will investigate schemes that allow adaptation from multiple sources (not only the coarse domain) to better leverage this existing knowledge.
Another type of crucial prior information for this work is clinical ontology; we will integrate the relationships defined in ontologies and entity descriptions to enhance the model. Fourthly, an essential premise of KGDA is the assumption that the source and target domains share the same sets of entity types and relation types, which can limit the knowledge discovered from the fine domain. We will investigate data mining techniques to adaptively add or remove entity and relation types in the fine domain. Finally, many new large-scale pre-trained language models, such as GPT-3, have been developed in recent years. While our model uses variations of BERT (biomed_RoBERTa and Bio_ClinicalBERT) as backbone networks, KGDA can easily be adapted to other language models.






Figure 1: The overall framework of iterative training KGDA.

Algorithm 1 Iterative training KGDA framework
Input: text corpus D = {D_1, D_2, ..., D_n}, coarse-domain KG K_c, out-of-domain words W_O
Parameter: initialized NER model model_N, initialized RE model model_R
Output: fine-domain KG K_f
 1: Let new entities E_new = {}, new entities with high confidence E_conf = {}, new triples T_new = {}, new triples with high confidence T_conf = {}

Algorithm 2 Constructing the distantly-supervised corpus
Input: sub-corpus D_i, coarse-domain KG K_c, new entities with high confidence E_conf, new triples with high confidence T_conf, out-of-domain words W_O
Parameter: negative sample ratio ratio_n, out-of-domain sample ratio ratio_o
Output: distant-supervision NER corpus corp_N, distant-supervision RE corpus corp_R, overlapping entities E_O, overlapping triples T_O
 1: Let corp_N = {}, corp_R = {}, E_O = {}, T_O = {}
 2: sentence_num = len(D_i)
 3: j = 1
 4: while j <= sentence_num do



Algorithm 3 Discovering fine-domain specific knowledge
Input: sub-corpus D_i, coarse-domain KG K_c, new entities E_new, new entities with high confidence E_conf, new triples T_new, new triples with high confidence T_conf
Parameter: NER model model_N, RE model model_R, probability threshold of the entity th_pe, frequency threshold of the entity th_fe, probability threshold of the triple th_pt, frequency threshold of the triple th_ft
Output: E_new, E_conf, T_new, T_conf
 1: ...
 ...
10: E_conf = get_confidence_entity(E_new, th_pe, th_fe)
11: j = 1
12: while j <= sentence_num do

Table 1: Held-out evaluation of the NER model.

The results of the NER and RE models evaluated on the held-out dataset are shown in Table 1 and Table 2, respectively. The KGDA frameworks initialized by the three pre-trained language models (BERT, Bio_ClinicalBERT, and biomed_RoBERTa) all show good performance in the held-out evaluations, demonstrating the robustness of our framework.

Table 3: The number of entities and triples.

Table 4: Results of manual evaluations.

