TOWARDS PRINCIPLED REPRESENTATION LEARNING FOR ENTITY ALIGNMENT

Abstract

Knowledge graph (KG) representation learning for entity alignment has recently received great attention. Compared with conventional methods, these embedding-based approaches are considered more robust in highly heterogeneous and cross-lingual entity alignment scenarios, as they do not rely on the quality of machine translation or feature extraction. Despite the significant improvements that have been made, there is little understanding of how embedding-based entity alignment methods actually work. Most existing methods rest on the foundation that a small number of pre-aligned entities can serve as anchors connecting the embedding spaces of two KGs, but the rationality of this foundation has not been investigated. In this paper, we define a typical paradigm abstracted from the existing methods, and analyze how the representation discrepancy between two potentially aligned entities is implicitly bounded by a predefined margin in the scoring function used for embedding learning. However, this margin-based bound is not guaranteed to be tight enough for alignment learning. We mitigate this problem by proposing a new approach that explicitly learns KG-invariant and principled entity representations while preserving the original infrastructure of existing methods. In this sense, the model not only pursues the closeness of aligned entities in geometric distance, but also aligns the neural ontologies of two KGs to eliminate the discrepancy in feature distribution and underlying ontology knowledge. Our experiments demonstrate consistent and significant performance improvements over existing embedding-based entity alignment methods, including several state-of-the-art ones.

1. INTRODUCTION

Knowledge Graphs (KGs), such as DBpedia (Auer et al., 2007) and Wikidata (Vrandečić & Krötzsch, 2014), have become crucial data resources for many AI applications. Although a large-scale KG offers structured knowledge derived from millions of real-world facts, it is still incomplete by nature, and downstream applications always demand more knowledge. To resolve this issue, the task of entity alignment (EA) has been proposed, which identifies potentially aligned entities across different KGs to facilitate knowledge fusion and exchange. Recently, embedding-based entity alignment (EEA) methods (Chen et al., 2017; Zhu et al., 2017; Wang et al., 2018; Guo et al., 2019; Ye et al., 2019; Wu et al., 2019; Sun et al., 2020a; Fey et al., 2020) have been prevailing in this area. Their common idea is to encode semantics into embeddings and estimate entity similarities by embedding distance. During this process, a small number of aligned entity pairs (a.k.a. seed alignment) are required as supervision to align (or merge) the embedding spaces of the KGs. These methods either learn an alignment function f_a to minimize the difference between the two entity embeddings in each seed (Wang et al., 2018), or directly map aligned entities to a single embedding vector (Sun et al., 2017). Meanwhile, they also leverage a shared scoring function f_s to encode semantics into representations, such that two underlying aligned entities connected to the respective sides of a seed will have similar characteristics in their feature expression.

Although the effectiveness of current EEA methods has been empirically demonstrated (Sun et al., 2020b), little effort has been made toward theoretical analysis. In this paper, we fill this gap by formally defining the paradigm leveraged by current methods. We show that the representation discrepancy of an underlying aligned entity pair is bounded, in an indirect way, by a margin λ in the scoring function f_s.
Unfortunately, we further find that this margin-based bound cannot be set as tight as expected, meaning that little constraint can be imposed on entities with few neighbors.

To mitigate this problem, we propose neural ontology driven entity alignment (abbr. NeoEA), in which the entity representations are optimized jointly with a neural ontology. An ontology (Baader et al., 2005) usually comprises axioms that define the legitimate relationships among entities and relations. These axioms make a KG principled (i.e., constrained by rules). For example, an "Object Property Domain" axiom in OWL 2 (Baader et al., 2005) specifies the valid head entities for a particular relation (e.g., the head entities of the relation "birthPlace" should be in the class "Person"), and it thus determines the head entity distribution of this relation. The neural ontology in this paper, by contrast, is deduced in reverse from the entity distributions. We expect that aligning the high-level neural ontologies will diminish the discrepancy in feature distributions, as well as in ontology knowledge, between two KGs.

The main contributions of this paper are threefold:

• We define the paradigm of current EEA methods, and demonstrate that the embedding discrepancy within each potential alignment pair is implicitly bounded by the margin in the scoring function. We show that this bound cannot be made as tight as desired.

• We propose NeoEA to learn KG-invariant as well as principled representations by aligning the neural axioms of two KGs. We prove that minimizing their difference substantially aligns the corresponding ontology-level knowledge, without assuming the existence of real ontology data.

• We conduct experiments to verify the effectiveness of NeoEA with several state-of-the-art methods as baselines. The results show that NeoEA consistently and significantly improves the performance of EEA methods.
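To make the margin-based scoring function concrete, the sketch below implements a TransE-style f_s with a margin ranking loss; this is one common instantiation rather than the paper's exact formulation, and all embedding values and the margin are hypothetical, chosen only for illustration.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE-style plausibility score: lower is more plausible."""
    return np.linalg.norm(h + r - t)

def margin_loss(pos_score, neg_score, margin=1.0):
    """Margin-based ranking loss: pushes a positive triple to score
    at least `margin` lower than its corrupted (negative) version."""
    return max(0.0, pos_score - neg_score + margin)

# Toy 2-d embeddings (hypothetical values, for illustration only).
h = np.array([0.1, 0.2])
r = np.array([0.3, 0.1])
t_pos = np.array([0.4, 0.3])   # true tail: h + r matches t_pos
t_neg = np.array([-0.5, 0.9])  # corrupted tail

loss = margin_loss(transe_score(h, r, t_pos), transe_score(h, r, t_neg))
print(loss)  # 0.0: the negative already scores worse than the margin
```

Once the loss reaches zero, the margin no longer constrains the embeddings, which is exactly why the bound it induces on aligned-pair discrepancy cannot be made arbitrarily tight.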

2.1. METHODOLOGY

We first summarize the common paradigm employed by most existing EEA methods (Chen et al., 2017; Sun et al., 2017; Zhu et al., 2017; Sun et al., 2018; Wang et al., 2018; Pei et al., 2019a; Guo et al., 2019; Wu et al., 2019; Ye et al., 2019; Sun et al., 2020a):

Definition 1 (Embedding-based Entity Alignment). The input of EEA is two KGs G1 = (E1, R1, T1) and G2 = (E2, R2, T2), and a small subset of aligned entity pairs S ⊂ E1 × E2 that serves as seeds to connect G1 with G2. An EEA model consists of two neural functions: an alignment function f_a, which regularizes the embeddings of the pairwise entities in S; and a scoring function f_s, which scores the representations based on the joint triple set T1 ∪ T2. EEA estimates the alignment score of an arbitrary entity pair (e1_i, e2_j) by the geometric distance d(e1_i, e2_j) between their embeddings.

It is worth noting that existing EEA methods adopt different settings for relation seed alignment. Some works (Chen et al., 2017; Zhu et al., 2017) assume that all aligned relation pairs are known in advance. Others (Sun et al., 2017; 2018) suppose that the number of relations is much smaller than that of entities, i.e., |R| ≪ |E|, which means that the training data for aligning relations is sufficient. In this paper, we do not explore the details of the relation seed setting; we assume that the relation representations of a well-trained EEA model are aligned.

Existing works have explored a diversity of choices for f_a. The pioneering work MTransE (Chen et al., 2017) proposed to learn a mapping matrix that casts an entity representation e1_i into the feature space of G2. SEA (Pei et al., 2019a) and OTEA (Pei et al., 2019b) extended this approach by leveraging adversarial training to learn the projection matrix.
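The inference step of Definition 1 — scoring an arbitrary entity pair by the geometric distance between its embeddings — can be sketched as follows; the Euclidean distance and the toy 2-dimensional embedding matrices are hypothetical illustration choices, not the definition's prescribed instantiation.

```python
import numpy as np

def alignment_scores(E1, E2):
    """Pairwise geometric distances d(e1_i, e2_j) between all
    entities of G1 (rows of E1) and G2 (rows of E2)."""
    diff = E1[:, None, :] - E2[None, :, :]  # shape (n1, n2, dim)
    return np.linalg.norm(diff, axis=-1)

def predict_alignment(E1, E2):
    """For each entity of G1, return the nearest entity of G2."""
    return alignment_scores(E1, E2).argmin(axis=1)

# Toy 2-d embedding matrices (hypothetical values).
E1 = np.array([[0.0, 0.0], [1.0, 1.0]])
E2 = np.array([[1.1, 0.9], [0.1, -0.1]])
print(predict_alignment(E1, E2))  # [1 0]
```

In practice the same distance matrix also yields ranking metrics such as Hits@k, since each row can be sorted to rank all candidate entities of G2.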
Recently, a simpler yet more efficient choice has been widely used, which directly maps each (e1_i, e2_i) ∈ S to a single embedding vector e_i (Sun et al., 2017; Zhu et al., 2017; Trsedya et al., 2019; Guo et al., 2019). Also, researchers (Wang et al., 2018; Pei et al., 2019a; Wu et al., 2019) started to leverage a softer way of incorporating seed information, in which the distance between the entities in a positive pair (i.e., supervised data in S) is minimized, while that of a negative pair is enlarged. As the most common choice, we consider f_a as

f_a = Σ_{(e1_i, e2_i) ∈ S} d(e1_i, e2_i) + Σ_{(e1_i, e2_j) ∈ S⁻} max(0, γ − d(e1_i, e2_j)),

where S⁻ denotes a set of sampled negative pairs and γ is a margin hyperparameter.
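As a minimal sketch of the soft seed-based alignment function described above — minimizing the distance of positive seed pairs while pushing sampled negative pairs apart — one could write the following; the entity indices, embedding values, and margin are hypothetical, and the hinge formulation is one common variant rather than a specific method's exact loss.

```python
import numpy as np

def f_a(emb1, emb2, seeds, neg_pairs, gamma=1.0):
    """Soft alignment loss: pull seed pairs together, and push
    sampled negative pairs at least `gamma` apart (hinge)."""
    pos = sum(np.linalg.norm(emb1[i] - emb2[j]) for i, j in seeds)
    neg = sum(max(0.0, gamma - np.linalg.norm(emb1[i] - emb2[j]))
              for i, j in neg_pairs)
    return pos + neg

# Toy embeddings of G1 and G2 (hypothetical values).
emb1 = np.array([[0.0, 0.0], [1.0, 1.0]])
emb2 = np.array([[0.0, 0.0], [3.0, 3.0]])

# Seed pair (0, 0) coincides, and negative pair (1, 1) is already
# farther apart than gamma, so the total loss is zero.
print(f_a(emb1, emb2, seeds=[(0, 0)], neg_pairs=[(1, 1)]))  # 0.0
```

Note that, as with the margin in f_s, the hinge term contributes nothing once a negative pair is farther apart than gamma, so it imposes no further constraint on those entities.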