JAKET: JOINT PRE-TRAINING OF KNOWLEDGE GRAPH AND LANGUAGE UNDERSTANDING

Abstract

Knowledge graphs (KGs) contain rich information about world knowledge, entities, and relations, and can thus serve as a valuable supplement to existing pre-trained language models. However, it remains a challenge to efficiently integrate information from KGs into language modeling, and understanding a knowledge graph in turn requires related textual context. We propose a novel joint pre-training framework, JAKET, to model both the knowledge graph and language. The knowledge module and language module provide essential information to mutually assist each other: the knowledge module produces embeddings for entities in text, while the language module generates context-aware initial embeddings for entities and relations in the graph. Our design enables the pre-trained model to easily adapt to unseen knowledge graphs in new domains. Experimental results on several knowledge-aware NLP tasks show that our proposed framework achieves superior performance by effectively leveraging knowledge in language understanding.

1. INTRODUCTION

Pre-trained language models (PLMs) leverage large-scale unlabeled corpora to conduct self-supervised training. They have achieved remarkable performance in various NLP tasks, exemplified by BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019b), XLNet (Yang et al., 2019), and the GPT series (Radford et al., 2018; 2019; Brown et al., 2020). It has been shown that PLMs can effectively characterize linguistic patterns in text and generate high-quality context-aware representations (Liu et al., 2019a). However, these models struggle to grasp world knowledge about entities and relations (Poerner et al., 2019; Talmor et al., 2019), which is very important in language understanding. Knowledge graphs (KGs) represent entities and relations in a structural way. They can also mitigate the sparsity problem in text modeling. For instance, a language model may require tens of instances of the phrase "labrador is a kind of dog" in its training corpus before it implicitly learns this fact. In comparison, a knowledge graph can use two entity nodes, "labrador" and "dog", and a relation edge "is a" between these nodes to precisely represent this fact. Recently, some efforts have been made to integrate knowledge graphs into PLMs. Most of them combine the token representations in the PLM with representations of aligned KG entities. The entity embeddings in those methods are either pre-computed from an external source by a separate model (Zhang et al., 2019; Peters et al., 2019), which may not be easily aligned with the language representation space, or directly learned as model parameters (Févry et al., 2020; Verga et al., 2020), which often suffer from over-parameterization due to the large number of entities. Moreover, all the previous works share a common challenge: when the pre-trained model is fine-tuned in a new domain with a previously unseen knowledge graph, it struggles to adapt to the new entities, relations, and structure.
Therefore, we propose JAKET, a Joint pre-trAining framework for KnowledgE graph and Text. Our framework contains a knowledge module and a language module, which mutually assist each other by providing the information each requires for more effective semantic analysis. The knowledge module leverages a graph attention network (Veličković et al., 2017) to provide structure-aware entity embeddings for language modeling, while the language module produces contextual representations that serve as initial embeddings for KG entities and relations given their descriptive text. Thus, in both modules, content understanding is based on related knowledge and rich context. On one hand, the joint pre-training effectively projects entities/relations and text into a shared semantic latent space, which eases semantic matching between them. On the other hand, as the knowledge module produces representations from descriptive text, it avoids the over-parameterization issue, since entity embeddings are no longer part of the model's parameters. To resolve the cyclic dependency between the two modules, we propose a novel two-step language module composed of two submodules, LM1 and LM2. LM1 provides embeddings for both LM2 and the KG. The entity embeddings from the KG are in turn fed into LM2, which produces the final representation. LM1 and LM2 can be easily instantiated as the first several transformer layers and the remaining layers of a pre-trained language model such as BERT or RoBERTa. Furthermore, we design an entity context embedding memory with periodic update, which speeds up pre-training by 15x. The pre-training tasks are all self-supervised, including entity category classification and relation type prediction for the knowledge module, and masked token prediction and masked entity prediction for the language module. A great benefit of our framework is that it can easily adapt to unseen knowledge graphs in the fine-tuning phase.
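The two-step decomposition described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the layer counts, hidden size, and the simple additive fusion of entity embeddings into mention positions are all illustrative assumptions; the paper splits an actual pre-trained model such as RoBERTa into LM1 and LM2.

```python
import torch
import torch.nn as nn

class TwoStepLanguageModule(nn.Module):
    """Hypothetical sketch of JAKET's two-step language module.

    LM1 (lower transformer layers) produces contextual embeddings that
    would also initialize KG entity/relation embeddings from their
    descriptive text; LM2 (upper layers) consumes token states fused
    with KG entity embeddings and outputs the final representations.
    """

    def __init__(self, vocab_size=100, dim=32, n_lower=2, n_upper=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        make_layer = lambda: nn.TransformerEncoderLayer(
            dim, n_heads, dim * 2, batch_first=True)
        self.lm1 = nn.TransformerEncoder(make_layer(), n_lower)  # LM1
        self.lm2 = nn.TransformerEncoder(make_layer(), n_upper)  # LM2

    def forward(self, token_ids, entity_emb=None, entity_mask=None):
        # Step 1: LM1 yields context embeddings (also reused for the KG).
        h = self.lm1(self.embed(token_ids))
        # Fuse structure-aware entity embeddings (from the knowledge
        # module) into the positions of entity mentions; addition is a
        # simplification of the paper's fusion.
        if entity_emb is not None:
            h = h + entity_mask.unsqueeze(-1) * entity_emb
        # Step 2: LM2 produces the final token representations.
        return self.lm2(h)
```

In this sketch, `entity_mask` marks which token positions correspond to entity mentions; in the full framework those positions would receive graph-attention outputs from the knowledge module.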
As the initial embeddings of entities and relations come from their descriptive text, JAKET is not confined to any fixed KG. With the learned ability to integrate structural information during pre-training, the framework is extensible to novel knowledge graphs with previously unseen entities and relations, as illustrated in Figure 1. We conduct empirical studies on several knowledge-aware natural language understanding (NLU) tasks, including few-shot relation classification, question answering, and entity classification. The results show that JAKET achieves the best performance compared with strong baseline methods on all the tasks, including those with a previously unseen knowledge graph.
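The entity context embedding memory with periodic update mentioned above can be sketched as a simple cache. The refresh interval, the flat tensor storage, and the `encode_fn` callback (standing in for re-encoding entity descriptions with LM1) are all assumptions for illustration; the paper's actual update schedule and encoder differ.

```python
import torch

class EntityContextMemory:
    """Hypothetical sketch of an entity context embedding memory.

    Instead of re-encoding every entity description at every training
    step, cached context embeddings are looked up, and the cache is
    refreshed only periodically with the current language module.
    """

    def __init__(self, num_entities, dim, refresh_every=1000):
        self.memory = torch.zeros(num_entities, dim)  # cached embeddings
        self.refresh_every = refresh_every
        self.step = 0

    def lookup(self, entity_ids):
        # Cheap lookup replaces repeated description encoding.
        return self.memory[entity_ids]

    def maybe_refresh(self, encode_fn, entity_ids):
        # Periodically re-encode descriptions (e.g., with LM1) so the
        # cache tracks the evolving language module.
        self.step += 1
        if self.step % self.refresh_every == 0:
            with torch.no_grad():
                self.memory[entity_ids] = encode_fn(entity_ids)
```

The staleness introduced between refreshes is the trade-off that buys the reported pre-training speedup; a short refresh interval keeps the cached embeddings close to what the current language module would produce.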

2. RELATED WORK

Pre-trained language models have been shown to be very effective in various NLP tasks, including ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019b), and XLNet (Yang et al., 2019). Built upon large-scale corpora, these pre-trained models learn effective representations for various semantic structures and linguistic relationships. They are trained on self-supervised tasks such as masked language modeling and next sentence prediction. Recently, much effort has been devoted to investigating how to integrate knowledge into PLMs (Levine et al., 2019; Soares et al., 2019; Liu et al., 2020; Guu et al., 2020). These approaches can be grouped into two categories: 1. Explicitly injecting entity representations into the language model, where the representations are either pre-computed from external sources (Zhang et al., 2019; Peters et al., 2019) or directly learned as model parameters (Févry et al., 2020; Verga et al., 2020).



Figure 1: A simple illustration on the novelty of our proposed model JAKET.

For example, ERNIE (THU) (Zhang et al., 2019) pre-trains the entity embeddings on a knowledge graph using TransE (Bordes et al., 2013), while EAE (Févry et al., 2020) learns the representations through pre-training objectives along with all the other model parameters. K-BERT (Liu et al., 2020) represents entities by the embeddings of their surface-form tokens (i.e., entity names), which contain much less semantic information than descriptive text. Moreover, it only injects the KG during the fine-tuning phase rather than jointly pre-training on the KG and text.

