INTERPRETING KNOWLEDGE GRAPH RELATION REPRESENTATION FROM WORD EMBEDDINGS

Abstract

Many models learn representations of knowledge graph data by exploiting its low-rank latent structure, encoding known relations between entities and enabling unknown facts to be inferred. To predict whether a relation holds between entities, embeddings are typically compared in the latent space following a relation-specific mapping. Whilst their predictive performance has steadily improved, how such models capture the underlying latent structure of semantic information remains unexplained. Building on recent theoretical understanding of word embeddings, we categorise knowledge graph relations into three types and, for each type, derive explicit requirements for their representations. We show that empirical properties of relation representations and the relative performance of leading knowledge graph representation methods are justified by our analysis.

1. INTRODUCTION

Knowledge graphs are large repositories of binary relations between words (or entities) in the form of (subject, relation, object) triples. Many models for representing entities and relations have been developed, so that known facts can be recalled and previously unknown facts can be inferred, a task known as link prediction. Recent link prediction models (e.g. Bordes et al., 2013; Trouillon et al., 2016; Balažević et al., 2019b) learn entity representations, or embeddings, of far lower dimensionality than the number of entities, by capturing latent structure in the data. Relations are typically represented as a mapping from the embedding of a subject entity to those of related object entities. Although the performance of link prediction models has steadily improved for nearly a decade, relatively little is understood of the low-rank latent structure that underpins them, which we address in this work. The outcomes of our analysis can be used to aid and direct future knowledge graph model design.

We start by drawing a parallel between the entity embeddings of knowledge graphs and context-free word embeddings, e.g. as learned by Word2Vec (W2V) (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014). Our motivating premise is that the same latent word features (e.g. meaning(s), tense, grammatical type) give rise to the patterns found in different data sources, i.e. manifesting in word co-occurrence statistics and determining which words relate to which. Different embedding approaches may capture such structure in different ways, but if it is fundamentally the same, an understanding gained from one embedding task (e.g. word embedding) may benefit another (e.g. knowledge graph representation). Furthermore, the relatively limited but accurate data used in knowledge graph representation differs materially from the highly abundant but statistically noisy text data used for word embeddings.
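To make the link prediction setup concrete, the following sketch scores (subject, relation, object) triples with two standard relation mappings, a TransE-style translation (Bordes et al., 2013) and a DistMult-style bilinear product. The entities, relation, dimensionality and random vectors are illustrative toys, not the learned embeddings of any cited model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimensionality

# Toy entity and relation embeddings. In practice these are learned from
# observed triples (e.g. by gradient descent); random vectors here carry
# no real semantics and serve only to show the scoring machinery.
entities = {e: rng.normal(size=d) for e in ["paris", "france", "berlin", "germany"]}
relations = {"capital_of": rng.normal(size=d)}

def score_transe(s, r, o):
    # TransE-style score: the relation acts as a translation in the latent
    # space; higher (less negative) means a more plausible triple.
    return -float(np.linalg.norm(entities[s] + relations[r] - entities[o]))

def score_distmult(s, r, o):
    # DistMult-style bilinear score: the relation re-weights each latent
    # dimension before comparing subject and object embeddings.
    return float(np.sum(entities[s] * relations[r] * entities[o]))

# Link prediction: rank all candidate objects for (paris, capital_of, ?).
candidates = sorted(entities,
                    key=lambda o: score_transe("paris", "capital_of", o),
                    reverse=True)
```

Trained models differ in their choice of relation-specific mapping, but all reduce link prediction to exactly this kind of ranking of candidate entities by score.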
As such, theoretically reconciling the two embedding methods may lead to unified and improved embeddings learned jointly from both data sources.

Recent work (Allen & Hospedales, 2019; Allen et al., 2019) theoretically explains how semantic properties are encoded in word embeddings that (approximately) factorise a matrix of pointwise mutual information (PMI) from word co-occurrence statistics, as known for W2V (Levy & Goldberg, 2014). Semantic relationships between words, specifically similarity, relatedness, paraphrase and analogy, are proven to manifest as linear geometric relationships between rows of the PMI matrix (subject to known error terms), of which word embeddings can be considered low-rank projections. This explains, for example, the observations that similar words have similar embeddings and that embeddings of analogous word pairs share a common "vector offset" (e.g. Mikolov et al., 2013b).
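As a minimal illustration of this factorisation view, the sketch below builds a PMI matrix from invented co-occurrence counts and takes word embeddings as a low-rank projection of its rows via truncated SVD, a simple stand-in for the implicit factorisation performed by W2V. The word list, counts and rank k are all assumptions of the toy example:

```python
import numpy as np

# Toy symmetric co-occurrence counts between five words (rows = columns);
# the numbers are invented purely for illustration.
words = ["king", "queen", "man", "woman", "apple"]
C = np.array([[10., 6., 5., 2., 0.],
              [ 6., 10., 2., 5., 0.],
              [ 5., 2., 10., 6., 1.],
              [ 2., 5., 6., 10., 1.],
              [ 0., 0., 1., 1., 10.]])

# Pointwise mutual information: PMI(i, j) = log[ p(i, j) / (p(i) p(j)) ].
total = C.sum()
p_ij = C / total
p_i = C.sum(axis=1) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_ij / np.outer(p_i, p_i))
pmi[np.isneginf(pmi)] = 0.0  # a common convention for zero counts

# Word embeddings as a low-rank projection of the PMI matrix, obtained by
# truncating its SVD at rank k.
U, S, Vt = np.linalg.svd(pmi)
k = 2
E = U[:, :k] * np.sqrt(S[:k])                   # rank-k word embeddings
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # best rank-k reconstruction
```

With realistic corpus statistics (rather than this tiny matrix), rows of such embeddings exhibit the linear relationships described above, e.g. analogous word pairs sharing a common vector offset.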

