TOWARDS PRINCIPLED REPRESENTATION LEARNING FOR ENTITY ALIGNMENT

Abstract

Knowledge graph (KG) representation learning for entity alignment has recently received great attention. Compared with conventional methods, these embedding-based ones are considered more robust for highly heterogeneous and cross-lingual entity alignment scenarios, as they do not rely on the quality of machine translation or feature extraction. Despite the significant improvement that has been made, there is little understanding of how embedding-based entity alignment methods actually work. Most existing methods rest on the foundation that a small number of pre-aligned entities can serve as anchors to connect the embedding spaces of two KGs, yet the rationality of this foundation has not been investigated. In this paper, we define a typical paradigm abstracted from the existing methods, and analyze how the representation discrepancy between two potentially-aligned entities is implicitly bounded by a predefined margin in the scoring function for embedding learning. However, such a margin is not guaranteed to be tight enough for alignment learning. We mitigate this problem by proposing a new approach that explicitly learns KG-invariant and principled entity representations while preserving the original infrastructure of existing methods. In this sense, the model not only pursues the closeness of aligned entities in geometric distance, but also aligns the neural ontologies of the two KGs to eliminate the discrepancy in feature distribution and underlying ontology knowledge. Our experiments demonstrate consistent and significant performance improvement over existing embedding-based entity alignment methods, including several state-of-the-art ones.

1. INTRODUCTION

Knowledge Graphs (KGs), such as DBpedia (Auer et al., 2007) and Wikidata (Vrandečić & Krötzsch, 2014), have become crucial data resources for many AI applications. Although a large-scale KG offers structured knowledge derived from millions of real-world facts, it is still incomplete by nature, and downstream applications always demand more knowledge. To address this issue, the task of entity alignment (EA) has been proposed, which exploits the potentially-aligned entities among different KGs to facilitate knowledge fusion and exchange.

Recently, embedding-based entity alignment (EEA) methods (Chen et al., 2017; Zhu et al., 2017; Wang et al., 2018; Guo et al., 2019; Ye et al., 2019; Wu et al., 2019; Sun et al., 2020a; Fey et al., 2020) have become prevalent in this area. Their common idea is to encode semantics into embeddings and estimate similarities by embedding distance. During this process, a small number of aligned entity pairs (a.k.a. seed alignment) are required as supervision to align (or merge) the embedding spaces of the KGs. These methods either learn an alignment function f_a to minimize the difference between the two entity embeddings in each seed pair (Wang et al., 2018), or directly map aligned entities to one embedding vector (Sun et al., 2017). Meanwhile, they also leverage a shared scoring function f_s to encode semantics into representations, such that two underlying aligned entities that connect to respective sides of a seed shall have similar characteristics in their feature expression.

Although the effectiveness of current EEA methods has been empirically demonstrated (Sun et al., 2020b), little effort has been made on theoretical analysis. In this paper, we fill this gap by formally defining a paradigm leveraged by the current methods. We show that the representation discrepancy of an underlying aligned entity pair is bounded in an indirect way by a margin λ in the scoring
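To make this setup concrete, the interplay between a shared scoring function f_s and an alignment function f_a can be sketched as follows. This is a minimal NumPy illustration, assuming a TransE-style translational f_s and a margin-based ranking loss; all names and values are illustrative, not the implementation of any particular method cited above:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Toy embeddings: a triple (h, r, t) from one KG, a corrupted head h_neg,
# and a seed-aligned counterpart e2 for h from the other KG.
h, r, t, h_neg, e2 = (rng.normal(size=dim) for _ in range(5))

def score(h, r, t):
    """TransE-style scoring function f_s: lower score = more plausible triple."""
    return float(np.linalg.norm(h + r - t))

def margin_loss(pos, neg, margin=1.0):
    """Margin-based ranking loss: keep negative scores at least
    `margin` above positive ones (the lambda in the paper's analysis)."""
    return max(0.0, pos - neg + margin)

def alignment_loss(e1, e2):
    """Alignment function f_a: pull the embeddings of a seed pair together."""
    return float(np.linalg.norm(e1 - e2))

# One training signal combines both objectives; gradient steps on this
# quantity would update the embeddings in a real pipeline.
loss = margin_loss(score(h, r, t), score(h_neg, r, t)) + alignment_loss(h, e2)
```

Under this sketch, only the seed pairs receive the explicit f_a signal; non-seed aligned entities are constrained only indirectly, through the margin in f_s, which is the gap the analysis in this paper formalizes.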

