INTERPRETING KNOWLEDGE GRAPH RELATION REPRESENTATION FROM WORD EMBEDDINGS

Abstract

Many models learn representations of knowledge graph data by exploiting its low-rank latent structure, encoding known relations between entities and enabling unknown facts to be inferred. To predict whether a relation holds between entities, embeddings are typically compared in the latent space following a relation-specific mapping. Whilst their predictive performance has steadily improved, how such models capture the underlying latent structure of semantic information remains unexplained. Building on recent theoretical understanding of word embeddings, we categorise knowledge graph relations into three types and for each derive explicit requirements of their representations. We show that empirical properties of relation representations and the relative performance of leading knowledge graph representation methods are justified by our analysis.

1. INTRODUCTION

Knowledge graphs are large repositories of binary relations between words (or entities) in the form of (subject, relation, object) triples. Many models for representing entities and relations have been developed, so that known facts can be recalled and previously unknown facts can be inferred, a task known as link prediction. Recent link prediction models (e.g. Bordes et al., 2013; Trouillon et al., 2016; Balažević et al., 2019b) learn entity representations, or embeddings, of far lower dimensionality than the number of entities, by capturing latent structure in the data. Relations are typically represented as a mapping from the embedding of a subject entity to those of related object entities. Although the performance of link prediction models has steadily improved for nearly a decade, relatively little is understood of the low-rank latent structure that underpins them, which we address in this work. The outcomes of our analysis can be used to aid and direct future knowledge graph model design.

We start by drawing a parallel between the entity embeddings of knowledge graphs and context-free word embeddings, e.g. as learned by Word2Vec (W2V) (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014). Our motivating premise is that the same latent word features (e.g. meaning(s), tense, grammatical type) give rise to the patterns found in different data sources, i.e. manifesting in word co-occurrence statistics and determining which words relate to which. Different embedding approaches may capture such structure in different ways, but if it is fundamentally the same, an understanding gained from one embedding task (e.g. word embedding) may benefit another (e.g. knowledge graph representation). Furthermore, the relatively limited but accurate data used in knowledge graph representation differs materially from the highly abundant but statistically noisy text data used for word embeddings.
As such, theoretically reconciling the two embedding methods may lead to unified and improved embeddings learned jointly from both data sources.

Recent work (Allen & Hospedales, 2019; Allen et al., 2019) theoretically explains how semantic properties are encoded in word embeddings that (approximately) factorise a matrix of pointwise mutual information (PMI) from word co-occurrence statistics, as known for W2V (Levy & Goldberg, 2014). Semantic relationships between words, specifically similarity, relatedness, paraphrase and analogy, are proven to manifest as linear geometric relationships between rows of the PMI matrix (subject to known error terms), of which word embeddings can be considered low-rank projections. This explains, for example, the observations that similar words have similar embeddings and that embeddings of analogous word pairs share a common "vector offset" (e.g. Mikolov et al., 2013b). We extend this insight to identify geometric relationships between PMI-based word embeddings that correspond to other relations, i.e. those of knowledge graphs. Such relation conditions define relation-specific mappings between entity embeddings (i.e. relation representations) and so provide a "blueprint" for knowledge graph representation models. Analysing the relation representations of leading knowledge graph representation models, we find that various properties, including their relative link prediction performance, accord with predictions based on these relation conditions, supporting the premise that a common latent structure is learned by word and knowledge graph embedding models, despite the significant differences between their training data and methodology.
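To make the premise concrete, the PMI factorisation underlying such word embeddings can be sketched in a few lines of NumPy. The co-occurrence counts, vocabulary size and embedding rank below are all toy assumptions for illustration, not data from the paper; the point is only that embeddings arise as a low-rank factorisation of the PMI matrix.

```python
import numpy as np

# Toy co-occurrence counts C[i, j]: how often word j appears in the
# context window of word i (hypothetical 4-word vocabulary).
C = np.array([[0, 8, 2, 1],
              [8, 0, 1, 2],
              [2, 1, 0, 6],
              [1, 2, 6, 0]], dtype=float)

total = C.sum()
p_ij = C / total                              # joint probability P(w_i, c_j)
p_i = C.sum(axis=1, keepdims=True) / total    # marginal P(w_i)
p_j = C.sum(axis=0, keepdims=True) / total    # marginal P(c_j)

# Pointwise mutual information; zero counts give log(0) = -inf,
# which we map to 0 (a common simplification of the PMI matrix).
with np.errstate(divide="ignore"):
    pmi = np.log(p_ij / (p_i * p_j))
pmi[np.isneginf(pmi)] = 0.0

# Word embeddings as a rank-k truncated SVD of the PMI matrix.
k = 2
U, S, Vt = np.linalg.svd(pmi)
word_emb = U[:, :k] * np.sqrt(S[:k])          # target-word embeddings
ctx_emb = Vt[:k, :].T * np.sqrt(S[:k])        # context-word embeddings

# Their product reconstructs the PMI matrix up to rank-k error,
# i.e. the embeddings are low-rank projections of the PMI rows.
approx = word_emb @ ctx_emb.T
```

W2V's negative-sampling objective corresponds to a shifted variant of this matrix (Levy & Goldberg, 2014); the plain PMI/SVD version above is the simplest form of the same idea.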
In summary, the key contributions of this work are:

• to use recent understanding of PMI-based word embeddings to derive the geometric attributes a relation representation requires to map subject word embeddings to all related object word embeddings (relation conditions), which partition relations into three types (§3);

• to show that both the per-relation ranking and the classification performance of leading link prediction models correspond to how well each model satisfies the appropriate relation conditions, i.e. how closely its relation representations match the geometric form derived theoretically (§4.1); and

• to show that properties of knowledge graph representation models fit predictions based on relation conditions, e.g. the strength of a relation's relatedness aspect is reflected in the eigenvalues of its relation matrix (§4.2).

2. BACKGROUND

Knowledge graph representation: Recent knowledge graph models typically represent entities e_s, e_o as vectors e_s, e_o ∈ R^{d_e}, and relations as transformations in the latent space from subject to object entity embedding, where the dimension d_e (e.g. 200) is far lower than the number of entities n_e (e.g. > 10^4). Such models are distinguished by their score function, which defines (i) the form of the relation transformation, e.g. matrix multiplication and/or vector addition; and (ii) the measure of proximity between a transformed subject embedding and an object embedding, e.g. dot product or Euclidean distance. Score functions can be non-linear (e.g. Dettmers et al., 2018), or linear, in which case they are sub-categorised as additive, multiplicative or both. We focus on linear models due to their simplicity and strong performance at link prediction (including state-of-the-art). Table 1 shows the score functions of competitive linear knowledge graph embedding models spanning the sub-categories: TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), TuckER (Balažević et al., 2019b) and MuRE (Balažević et al., 2019a).

Additive models apply a relation-specific translation to a subject entity embedding and typically use Euclidean distance to evaluate proximity to object embeddings. A generic additive score function is given by φ(e_s, r, e_o) = −‖e_s + r − e_o‖²₂ + b_s + b_o, for entity biases b_s, b_o ∈ R. A simple example is TransE, where b_s = b_o = 0. Multiplicative models have the generic score function φ(e_s, r, e_o) = e_s⊤ R e_o, i.e. a bilinear product of the entity embeddings and a relation-specific matrix R. DistMult is a simple example with R diagonal, and so cannot model asymmetric relations; ComplEx (Trouillon et al., 2016) extends DistMult to complex-valued embeddings, allowing asymmetric relations to be modelled. In TuckER, each relation-specific R = W ×₃ r is a linear combination of d_r "prototype" relation matrices in a core tensor W ∈ R^{d_e × d_r × d_e} (×_n denoting tensor product along mode n), facilitating multi-task learning across relations. Some models, e.g. MuRE, combine both multiplicative (R) and additive (r) components.

Word embedding: Algorithms such as Word2Vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) generate low-dimensional word embeddings that perform well on downstream tasks (Baroni et al., 2014). Such models predict the context words (c_j) observed around a target word (w_i) in a text corpus using shallow neural networks. Whilst recent language models (e.g. Devlin et al., 2018; Peters et al., 2018) achieve strong performance using contextualised word embeddings, we focus on "context-free" embeddings since knowledge graph entities have no obvious context and, importantly, context-free embeddings offer insight into embedding interpretability.

Table 1: Score functions of representative linear link prediction models. R ∈ R^{d_e × d_e} and r ∈ R^{d_e} are the relation matrix and translation vector, W ∈ R^{d_e × d_r × d_e} is the core tensor and b_s, b_o ∈ R are the entity biases.
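The additive and multiplicative score functions discussed above can be sketched directly in NumPy. All dimensions and values below are toy assumptions for illustration, and the exact sign and bias conventions of MuRE are simplified; this is a sketch of the generic forms, not the models' reference implementations.

```python
import numpy as np

# Hypothetical dimensions: d_e entity dim, d_r relation dim.
d_e, d_r = 4, 3
rng = np.random.default_rng(0)

e_s, e_o = rng.normal(size=d_e), rng.normal(size=d_e)  # entity embeddings
r = rng.normal(size=d_e)                # relation translation vector
R = np.diag(rng.normal(size=d_e))       # diagonal relation matrix (DistMult)
W = rng.normal(size=(d_e, d_r, d_e))    # TuckER core tensor
w_r = rng.normal(size=d_r)              # relation-specific vector for TuckER
b_s, b_o = 0.1, -0.2                    # entity biases

def score_transe(e_s, r, e_o):
    # Additive: negative squared Euclidean distance, biases fixed to 0.
    return -np.sum((e_s + r - e_o) ** 2)

def score_distmult(e_s, R, e_o):
    # Multiplicative: bilinear product; diagonal R makes it symmetric
    # in (e_s, e_o), hence it cannot model asymmetric relations.
    return e_s @ R @ e_o

def score_tucker(e_s, W, w_r, e_o):
    # R = W x_3 w_r: contract the relation mode (size d_r) of the core
    # tensor, giving a linear combination of d_r "prototype" matrices.
    R_full = np.einsum("idj,d->ij", W, w_r)
    return e_s @ R_full @ e_o

def score_mure(e_s, R, r, e_o, b_s, b_o):
    # Combines a multiplicative (R) and an additive (r) component,
    # plus entity biases (sign conventions simplified here).
    return -np.sum((R @ e_s + r - e_o) ** 2) + b_s + b_o
```

Note that a subject embedding translated exactly onto the object (e_o = e_s + r) attains TransE's maximum score of 0, and that swapping subject and object leaves the DistMult score unchanged, consistent with its inability to represent asymmetric relations.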

