INTERPRETING KNOWLEDGE GRAPH RELATION REPRESENTATION FROM WORD EMBEDDINGS

Abstract

Many models learn representations of knowledge graph data by exploiting its low-rank latent structure, encoding known relations between entities and enabling unknown facts to be inferred. To predict whether a relation holds between entities, embeddings are typically compared in the latent space following a relation-specific mapping. Whilst their predictive performance has steadily improved, how such models capture the underlying latent structure of semantic information remains unexplained. Building on recent theoretical understanding of word embeddings, we categorise knowledge graph relations into three types and for each derive explicit requirements of their representations. We show that empirical properties of relation representations and the relative performance of leading knowledge graph representation methods are justified by our analysis.

1. INTRODUCTION

Knowledge graphs are large repositories of binary relations between words (or entities) in the form of (subject, relation, object) triples. Many models for representing entities and relations have been developed, so that known facts can be recalled and previously unknown facts can be inferred, a task known as link prediction. Recent link prediction models (e.g. Bordes et al., 2013; Trouillon et al., 2016; Balažević et al., 2019b) learn entity representations, or embeddings, of far lower dimensionality than the number of entities, by capturing latent structure in the data. Relations are typically represented as a mapping from the embedding of a subject entity to those of related object entities. Although the performance of link prediction models has steadily improved for nearly a decade, relatively little is understood of the low-rank latent structure that underpins them, which we address in this work. The outcomes of our analysis can be used to aid and direct future knowledge graph model design.

We start by drawing a parallel between the entity embeddings of knowledge graphs and context-free word embeddings, e.g. as learned by Word2Vec (W2V) (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014). Our motivating premise is that the same latent word features (e.g. meaning(s), tense, grammatical type) give rise to the patterns found in different data sources, i.e. manifesting in word co-occurrence statistics and determining which words relate to which. Different embedding approaches may capture such structure in different ways, but if it is fundamentally the same, an understanding gained from one embedding task (e.g. word embedding) may benefit another (e.g. knowledge graph representation). Furthermore, the relatively limited but accurate data used in knowledge graph representation differs materially from the highly abundant but statistically noisy text data used for word embeddings.
As such, theoretically reconciling the two embedding methods may lead to unified and improved embeddings learned jointly from both data sources. Recent work (Allen & Hospedales, 2019; Allen et al., 2019) theoretically explains how semantic properties are encoded in word embeddings that (approximately) factorise a matrix of pointwise mutual information (PMI) from word co-occurrence statistics, as known for W2V (Levy & Goldberg, 2014). Semantic relationships between words, specifically similarity, relatedness, paraphrase and analogy, are proven to manifest as linear geometric relationships between rows of the PMI matrix (subject to known error terms), of which word embeddings can be considered low-rank projections. This explains, for example, the observations that similar words have similar embeddings and that embeddings of analogous word pairs share a common "vector offset" (e.g. Mikolov et al., 2013b).

Table 1: Score functions of representative linear link prediction models. R ∈ R^(d_e×d_e) and r ∈ R^(d_e) are the relation matrix and translation vector, W ∈ R^(d_e×d_r×d_e) is the core tensor and b_s, b_o ∈ R are the entity biases.

Model                              | Linear Subcategory                    | Score Function
TransE (Bordes et al., 2013)       | additive                              | −‖e_s + r − e_o‖²
DistMult (Yang et al., 2015)       | multiplicative (diagonal)             | e_s^⊤ R e_o
TuckER (Balažević et al., 2019b)   | multiplicative                        | W ×_1 e_s ×_2 r ×_3 e_o
MuRE (Balažević et al., 2019a)     | multiplicative (diagonal) + additive  | −‖R e_s + r − e_o‖² + b_s + b_o

We extend this insight to identify geometric relationships between PMI-based word embeddings that correspond to other relations, i.e. those of knowledge graphs. Such relation conditions define relation-specific mappings between entity embeddings (i.e. relation representations) and so provide a "blue-print" for knowledge graph representation models. Analysing the relation representations of leading knowledge graph representation models, we find that various properties, including their relative link prediction performance, accord with predictions based on these relation conditions, supporting the premise that a common latent structure is learned by word and knowledge graph embedding models, despite the significant differences between their training data and methodology. In summary, the key contributions of this work are:

• to use recent understanding of PMI-based word embeddings to derive geometric attributes of a relation representation required for it to map subject word embeddings to all related object word embeddings (relation conditions), which partition relations into three types (§3);

• to show that both the per-relation ranking and classification performance of leading link prediction models correspond to how well the model satisfies the appropriate relation conditions, i.e. how closely its relation representations match the geometric form derived theoretically (§4.1); and

• to show that properties of knowledge graph representation models fit predictions based on relation conditions, e.g. the strength of a relation's relatedness aspect is reflected in the eigenvalues of its relation matrix (§4.2).

2. BACKGROUND

Knowledge graph representation: Recent knowledge graph models typically represent entities e_s, e_o as vectors e_s, e_o ∈ R^(d_e), and relations as transformations in the latent space from subject to object entity embedding, where the dimension d_e is far lower (e.g. 200) than the number of entities n_e (e.g. > 10⁴). Such models are distinguished by their score function, which defines (i) the form of the relation transformation, e.g. matrix multiplication and/or vector addition; and (ii) the measure of proximity between a transformed subject embedding and an object embedding, e.g. dot product or Euclidean distance. Score functions can be non-linear (e.g. Dettmers et al., 2018), or linear and sub-categorised as additive, multiplicative or both. We focus on linear models due to their simplicity and strong performance at link prediction (including state-of-the-art). Table 1 shows the score functions of competitive linear knowledge graph embedding models spanning the sub-categories: TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), TuckER (Balažević et al., 2019b) and MuRE (Balažević et al., 2019a). In TuckER, each relation-specific R = W ×_3 r is a linear combination of d_r "prototype" relation matrices in a core tensor W ∈ R^(d_e×d_r×d_e) (×_n denoting tensor product along mode n), facilitating multi-task learning across relations. Some models, e.g. MuRE, combine both multiplicative (R) and additive (r) components.
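To make the score functions of Table 1 concrete, the following sketch implements them with numpy on random toy parameters (all dimensions and values here are illustrative placeholders, not trained embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d_e, d_r = 8, 4                        # toy entity/relation dimensions
e_s, e_o = rng.normal(size=(2, d_e))   # subject/object entity embeddings
r_vec = rng.normal(size=d_e)           # translation vector (additive models)
R_diag = rng.normal(size=d_e)          # diagonal relation matrix (DistMult/MuRE)
W = rng.normal(size=(d_e, d_r, d_e))   # TuckER core tensor
r_tuck = rng.normal(size=d_r)          # TuckER relation embedding
b_s = b_o = 0.0                        # entity biases (MuRE)

def transe(e_s, r, e_o):
    """Additive: -||e_s + r - e_o||^2."""
    return -np.sum((e_s + r - e_o) ** 2)

def distmult(e_s, R_diag, e_o):
    """Multiplicative (diagonal): e_s^T diag(R) e_o."""
    return e_s @ (R_diag * e_o)

def tucker(e_s, W, r, e_o):
    """Multiplicative: W x_1 e_s x_2 r x_3 e_o."""
    return np.einsum('i,irj,r,j->', e_s, W, r, e_o)

def mure(e_s, R_diag, r, e_o, b_s, b_o):
    """Diagonal multiplicative + additive: -||R e_s + r - e_o||^2 + b_s + b_o."""
    return -np.sum((R_diag * e_s + r - e_o) ** 2) + b_s + b_o

# A diagonal R makes DistMult symmetric in e_s and e_o:
assert np.isclose(distmult(e_s, R_diag, e_o), distmult(e_o, R_diag, e_s))
```

The final assertion illustrates the point made below: with a diagonal R, DistMult scores a triple identically in both directions and so cannot model asymmetric relations.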

Additive models apply a relation-specific translation to a subject entity embedding and typically use Euclidean distance to evaluate proximity to object embeddings. A generic additive score function is given by φ(e_s, r, e_o) = −‖e_s + r − e_o‖² + b_s + b_o. A simple example is TransE, where b_s = b_o = 0. Multiplicative models have the generic score function φ(e_s, r, e_o) = e_s^⊤ R e_o, i.e. a bilinear product of the entity embeddings and a relation-specific matrix R. DistMult is a simple example with R diagonal, and so cannot model asymmetric relations (Trouillon et al., 2016).

Word embedding: Algorithms such as Word2Vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) generate low-dimensional word embeddings that perform well on downstream tasks (Baroni et al., 2014). Such models predict the context words (c_j) observed around a target word (w_i) in a text corpus using shallow neural networks. Whilst recent language models (e.g. Devlin et al., 2018; Peters et al., 2018) achieve strong performance using contextualised word embeddings, we focus on "context-free" embeddings since knowledge graph entities have no obvious context and, importantly, such embeddings offer insight into embedding interpretability. Levy & Goldberg (2014) show that, for a dictionary of n_e unique words and embedding dimension d_e ≪ n_e, W2V's loss function is minimised when its embeddings w_i, c_j form matrices W, C ∈ R^(d_e×n_e) that factorise a pointwise mutual information (PMI) matrix of word co-occurrence statistics, PMI(w_i, c_j) = log [P(w_i, c_j) / (P(w_i) P(c_j))], subject to a shift term. This result relates W2V to earlier count-based embeddings and specifically PMI, which has a history in linguistic analysis (Turney & Pantel, 2010). From its loss function, GloVe can be seen to perform a related factorisation. Recent work (Allen & Hospedales, 2019; Allen et al., 2019) shows how the semantic relationships of similarity, relatedness, paraphrase and analogy are encoded in PMI-based word embeddings, by recognising such embeddings as low-rank projections of high-dimensional rows of the PMI matrix, termed PMI vectors. Those semantic relationships are described in terms of multiplicative interactions between co-occurrence probabilities (subject to defined error terms), which correspond to additive interactions between (logarithmic) PMI statistics, and hence between PMI vectors. Thus, under a sufficiently linear projection, those semantic relationships correspond to linear relationships between word embeddings.
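The factorisation result can be sketched directly: build a PMI matrix from toy co-occurrence counts and recover low-rank word/context embeddings via a truncated SVD (the counts, rank and variable names below are illustrative; W2V's shift term is log k for k negative samples, set to k = 1 here so the shift vanishes):

```python
import numpy as np

# Toy co-occurrence counts: C[i, j] = #(target word i observed with context word j).
C = np.array([[10., 2., 0.5],
              [ 2., 8., 4. ],
              [ 0.5, 4., 6. ]])
total = C.sum()
p_wc = C / total                       # joint P(w_i, c_j)
p_w = p_wc.sum(axis=1, keepdims=True)  # marginal P(w_i)
p_c = p_wc.sum(axis=0, keepdims=True)  # marginal P(c_j)
PMI = np.log(p_wc / (p_w * p_c))       # PMI(w_i, c_j) = log P(w,c)/(P(w)P(c))

# W2V's loss is minimised when embeddings factorise PMI - log k; a rank-d
# factorisation via SVD gives one such pair of embedding matrices:
k, d = 1, 2
U, S, Vt = np.linalg.svd(PMI - np.log(k))
W_emb = U[:, :d] * np.sqrt(S[:d])      # target-word embeddings (rows)
C_emb = Vt[:d].T * np.sqrt(S[:d])      # context-word embeddings (rows)
# W_emb @ C_emb.T is the best rank-d approximation of the (shifted) PMI matrix.
```

Note the non-uniqueness mentioned below: rescaling W_emb by any invertible matrix and C_emb by its inverse transpose yields the same product, which is why individual embedding dimensions are not directly interpretable.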
Note that although the relative geometry reflecting semantic relationships is preserved, the direct interpretability of dimensions, as in PMI vectors, is lost, since either embedding matrix can be arbitrarily scaled/rotated if the other is inversely transformed. We state the relevant semantic relationships on which we build, denoting the set of unique dictionary words by E:

• Paraphrase: word subsets W, W* ⊆ E are said to paraphrase if they induce similar distributions over nearby words, i.e. p(E|W) ≈ p(E|W*), e.g. {king} paraphrases {man, royal}.

• Analogy: a common example of an analogy is "woman is to queen as man is to king"; more generally, an analogy can be defined as any set of word pairs {(w_i, w*_i)}_{i∈I} for which it is semantically meaningful to say "w_a is to w*_a as w_b is to w*_b" ∀a, b ∈ I.

Where one word subset paraphrases another, the sums of their embeddings are shown to be equal (subject to the independence of words within each set), e.g. w_king ≈ w_man + w_royal. An interesting connection is established between the two semantic relationships: a set of word pairs A = {(w_a, w*_a), (w_b, w*_b)} is an analogy if {w_a, w*_b} paraphrases {w*_a, w_b}, in which case the embeddings satisfy w*_a − w_a ≈ w*_b − w_b (the "vector offset").
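Both results follow from linearity of the projection, which a purely synthetic check makes explicit ("PMI vectors" are constructed so that the paraphrase holds exactly, and the projection is an arbitrary linear map; real data would introduce the error terms noted above):

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 50, 5                         # "PMI vector" dimension, embedding dimension
M = rng.normal(size=(d, D))          # arbitrary linear projection (low-rank map)

p_man, p_woman, p_royal = rng.normal(size=(3, D))
# Construct PMI vectors so {king} paraphrases {man, royal} exactly, etc.:
p_king = p_man + p_royal
p_queen = p_woman + p_royal

w = {name: M @ p for name, p in [('man', p_man), ('woman', p_woman),
                                 ('royal', p_royal), ('king', p_king),
                                 ('queen', p_queen)]}

# Paraphrase: w_king = w_man + w_royal survives the projection.
assert np.allclose(w['king'], w['man'] + w['royal'])
# Analogy / vector offset: w_queen - w_woman = w_king - w_man.
assert np.allclose(w['queen'] - w['woman'], w['king'] - w['man'])
```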

3. FROM ANALOGIES TO KNOWLEDGE GRAPH RELATIONS

Analogies from the field of word embeddings are our starting point for developing a theoretical basis for representing knowledge graph relations. The relevance of analogies stems from the observation that for an analogy to hold (see §2), its word pairs, e.g. {(man, king), (woman, queen), (girl, princess)}, must be related in the same way, comparably to subject-object entity pairs under a common knowledge graph relation. Our aim is to develop the understanding of PMI-based word embeddings (henceforth word embeddings) to identify the mathematical properties necessary for a relation representation to map subject word embeddings to all related object word embeddings. Considering the paraphrasing word sets {king} and {man, royal} corresponding to the word embedding relationship w_king ≈ w_man + w_royal (§2), royal can be interpreted as the semantic difference between man and king, fitting intuitively with the relationship w_royal ≈ w_king − w_man. Fundamentally, this relationship holds because the difference between words that co-occur (i.e. occur more frequently than if independent) with king and those that co-occur with man reflects those words that co-occur with royal. We refer to this difference in co-occurrence distribution as a "context shift", from man (subject) to king (object). Allen & Hospedales (2019) effectively show that where multiple word pairs share a common context shift, they form an analogy whose embeddings satisfy the vector offset relationship. This result seems obvious where the context shift mirrors an identifiable word, the embedding of which is approximated by the common vector offset, e.g. queen and woman are related by the same context shift, i.e. w_queen ≈ w_woman + w_royal, thus w_queen − w_woman ≈ w_king − w_man. However, the same result holds, i.e. an analogy is formed with a common vector offset between embeddings, for an arbitrary (common) context shift that may reflect no particular word.
Importantly, these context-shift relations evidence a case in which it is known how a relation can be represented, i.e. by an additive vector (comparable to TransE) if entities are represented by word embeddings. More generally, this provides an interpretable foothold into relation representation. Note that not all sets of word pairs considered analogies exhibit a clear context-shift relation, e.g. in the analogy {(car, engine), (bus, seats)}, the difference between words co-occurring with engine and car is not expected to reflect the corresponding difference between bus and seats. This illustrates how analogies are a loosely defined concept, e.g. their implicit relation may be semantic or syntactic, with several sub-categories of each (e.g. see Gladkova et al. (2016)). The same is readily observed for the relations of knowledge graphs. This likely explains the observed variability in "solving" analogies by use of vector offset (e.g. Köper et al., 2015; Karpinska et al., 2018; Gladkova et al., 2016) and suggests that further consideration is required to represent relations (or solve analogies) in general.

We have seen that the existence of a context-shift relation between a subject and object word implies a (relation-specific) geometric relationship between word embeddings; the latter thus provides a necessary condition for the relation to hold. We refer to this as a "relation condition" and aim to identify relation conditions for other classes of relation. Once identified, relation conditions define a mapping from subject embeddings to all related object embeddings, by which related entities might be identified with a proximity measure (e.g. Euclidean distance or dot product). This is the precise aim of a knowledge graph representation model, but loss functions are typically developed heuristically.
Given the existence of many representation models, we can verify identified relation conditions by contrasting the per-relation performance of various models with the extent to which their loss function reflects the appropriate relation conditions. Note that since relation conditions are necessary rather than sufficient, they do not guarantee that a relation holds, i.e. false positives may arise. Whilst we seek to establish relation conditions based on PMI word embeddings, the data used to train knowledge graph embeddings differs significantly from the text data used by word embeddings, and the relevance of conditions ultimately based on PMI statistics may seem questionable. However, where a knowledge graph representation model implements relation conditions and measures proximity between embeddings, the parameters of word embeddings necessarily provide a potential solution that minimises the loss function (many equivalent solutions may exist due to symmetry, as is typical for neural network architectures). We now define relation types and identify their relation conditions (underlined); we then consider the completeness of this categorisation.

• Similarity: Semantically similar words induce similar distributions over the words they co-occur with. Thus their PMI vectors and word embeddings are similar (Fig 1a).

• Relatedness: The relatedness of two words can be considered in terms of the words S ⊆ E with which both co-occur similarly. S defines the nature of relatedness, e.g. milk and cheese are related by S = {dairy, breakfast, ...}; and |S| reflects the strength of relatedness. Since the PMI vector components corresponding to S are similar (Fig 1b), embeddings of S-related words have similar components in the subspace V_S that spans the projected PMI vector dimensions corresponding to S. The rank of V_S is thus anticipated to reflect relatedness strength. Relatedness can be seen as a weaker and more variable generalisation of similarity, which is its limiting case where S = E and hence rank(V_S) = d_e.

• Context-shift: As discussed above, words related by a common difference between their distributions of co-occurring words, defined as context-shifts, share a common vector offset between their word embeddings. Context might be considered added (e.g. man to king), termed a specialisation (Fig 1c), subtracted (e.g. king to man), or both (Fig 1d). These relations are 1-to-1 (subject to synonyms) and include an aspect of relatedness due to the word associations in common. Note that specialisations include hyponyms/hypernyms, and context-shifts include meronyms.

• Generalised context-shift: Context-shift relations generalise to 1-to-many, many-to-1 and many-to-many relations, where the added/subtracted context may come from a (relation-specific) context set (Fig 1e), e.g. any city or anything bigger. The potential scope and size of context sets adds variability to these relations. The limiting case in which the context set is "small" reduces to a 1-to-1 context-shift (above) and the embedding difference is a known vector offset. In the limiting case of a "large" context set, the added/subtracted context is essentially unrestricted, such that only the relatedness aspect of the relation, and thus a common subspace component of embeddings, is fixed.
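The relatedness condition can be illustrated in PMI-vector space: S-related words share components on the dimensions corresponding to S, so projecting onto those dimensions makes them coincide (a synthetic sketch with exact equality on S; real PMI vectors would only be approximately equal there, and the rank of the projection, here 10, stands in for relatedness strength):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 40
S_idx = np.arange(10)                  # PMI dimensions defining the theme S (e.g. "dairy")
shared = rng.normal(size=S_idx.size)   # common components on S

def pmi_vec():
    v = rng.normal(size=D)             # word-specific associations elsewhere
    v[S_idx] = shared                  # identical components on S => S-related
    return v

p_milk, p_cheese = pmi_vec(), pmi_vec()

P_S = np.zeros((D, D))
P_S[S_idx, S_idx] = 1.0                # projection onto the S dimensions

# S-related words coincide under the projection, despite differing elsewhere:
assert np.allclose(P_S @ p_milk, P_S @ p_cheese)
assert not np.allclose(p_milk, p_cheese)
```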

3.1. CATEGORISING REAL KNOWLEDGE GRAPH RELATIONS

Analysing the relations of popular knowledge graph datasets, we observe that they indeed imply (i) a relatedness aspect reflecting a common theme (e.g. both entities are animals or geographic terms); and (ii) contextual themes specific to the subject and/or object entities. Further, relations fall under a hierarchy of three relation types: highly related (R); generalised specialisation (S); and generalised context-shift (C). As above, "generalised" indicates that context differences are not restricted to be 1-to-1. From Fig 1, it can be seen that type R relations are a special case of type S, which are in turn a special case of type C; thus type C encompasses all considered relations. Whilst there are many ways to classify relations, e.g. by hierarchy or transitivity, the proposed relation conditions delineate relations by the required mathematical form (and complexity) of their representation. Table 2 shows a categorisation of the relations of the WN18RR dataset (Dettmers et al., 2018), comprising 11 relations and 40,943 entities.* An explanation for the category assignment is given in Appx. A. Analysing the commonly used FB15k-237 dataset (Toutanova et al., 2015) reveals its relations to be almost exclusively of type C, precluding a contrast of performance per relation type, and hence that dataset is omitted from our analysis. Instead, we categorise a random subsample of 12 relations from the NELL-995 dataset (Xiong et al., 2017), containing 75,492 entities and 200 relations (see Tables 8 and 9 in Appx. B).

3.2. RELATIONS AS MAPPINGS BETWEEN EMBEDDINGS

Given the relation conditions of a relation type, we now consider mappings that satisfy them, and thereby loss functions able to identify relations of each type, evaluating proximity between mapped entity embeddings by dot product or Euclidean distance. We then contrast our theoretically derived loss functions, specific to a relation type, with those of several knowledge graph models (Table 1) to predict identifiable properties and the relative performance of different knowledge graph models for each relation type.

R: Identifying S-relatedness requires testing both entity embeddings e_s, e_o for a common subspace component V_S, which can be achieved by projecting both embeddings onto V_S and comparing their images. Projection requires multiplication by a matrix P_r ∈ R^(d×d) and cannot be achieved additively, except in the trivial limiting case of similarity (P_r = I), when r ≈ 0 can be added.

S/C: The relation conditions require testing for both S-relatedness and relation-specific entity component(s) (v_r^s, v_r^o). This is achieved by (i) multiplying both entity embeddings by a relation-specific projection matrix P_r that projects onto the subspace spanning the low-rank projection of the dimensions corresponding to S, v_r^s and v_r^o (i.e. testing for S-relatedness while preserving relation-specific entity components); and (ii) adding a relation-specific vector r = v_r^o − v_r^s to the transformed subject entity embeddings. Comparing the transformed entity embeddings by dot product equates to (P_r e_s + r)^⊤ P_r e_o; and by Euclidean distance to ‖P_r e_s + r − P_r e_o‖² = ‖P_r e_s + r‖² − 2(P_r e_s + r)^⊤ P_r e_o + ‖P_r e_o‖² (cf. MuRE: ‖R e_s + r − e_o‖²).

Contrasting these theoretically derived loss functions with those of knowledge graph models (Table 1), we make the following predictions (P1 and P2). To elaborate, our core prediction P1(b) anticipates that: (i) additive-only models (e.g. TransE) are not suited to identifying the relatedness aspect of relations, except in limiting cases of similarity (requiring a zero vector); (ii) multiplicative-only models (e.g. DistMult) should perform well on type R relations, but are not suited to identifying entity-specific features of types S/C (an asymmetric relation matrix in TuckER may help compensate); and (iii) the loss function of MuRE closely resembles that derived for type C relations, which generalise all others, and it is thus expected to perform best overall.

4. EVIDENCE LINKING KNOWLEDGE GRAPH AND WORD EMBEDDINGS

We test whether the predictions P1 and P2, made on the basis of word embeddings, apply to knowledge graph relations by analysing the performance and properties of competitive knowledge graph models. We compare TransE, DistMult, TuckER and MuRE, which entail different forms of relation representation, on all WN18RR relations and a similar number of NELL-995 relations (spanning all relation types). All models have a comparable number of free parameters.

4.1. P1: JUSTIFYING THE RELATIVE PERFORMANCE OF KNOWLEDGE GRAPH MODELS

Ranking performance: Tables 3 and 4 report Hits@10 for each relation and include the relation type as well as known confounding influences: the percentage of relation instances in the training and test sets (approximately equal), the actual number of instances in the test set (causing some results to be highly granular), the Krackhardt hierarchy score (see Appx. E) (Krackhardt, 2014; Balažević et al., 2019a), and the maximum and average shortest path between any two related nodes. A further confounding effect is dependence between relations: Lacroix et al. (2018) and Balažević et al. (2019b) independently show that constraining the rank of relation representations is beneficial for datasets with many relations due to multi-task learning, particularly when the number of instances per relation is low. This is expected to benefit TuckER on the NELL-995 dataset (200 relations). As predicted in P1(a), all models tend to perform best at type R relations, with a clear performance gap to other relation types. Also, performance on type S relations appears higher in general than on type C. In accordance with P1(b), additive-only models (TransE, MuRE_I) perform worst on average, since all relation types involve (multiplicative) relatedness; their best performance is achieved on type R relations, which can be represented by a small/zero additive vector. Multiplicative-only DistMult performs well, sometimes best, on type R relations, fitting expectation as it can represent those relations and has no inessential parameters, e.g. that may overfit to noise, which may explain instances where MuRE performs slightly worse. As expected, MuRE performs best overall (particularly on WN18RR), and most strongly on type S and C relations, predicted to require both multiplicative and additive components. The comparable performance of TuckER on NELL-995 may be explained by its multi-task learning ability. Other anomalous results also closely align with confounding factors.
For example, all models perform poorly on the hypernym relation, despite its relative abundance of training data (40% of all instances), which may be explained by its hierarchical nature (Khs ≈ 1 and long paths). The same may explain the reduced performance on the relations also_see and agentcollaborateswithagent. As found previously (Balažević et al., 2019a), none of the models considered are well suited to modelling hierarchical structures. We also note that the percentage of training instances of a relation is not a dominant factor in performance, as would be expected if all relations could be equally represented.

Classification performance: We further evaluate whether P1 holds when comparing knowledge graph models by classification accuracy on WN18RR. Independent predictions of whether a given triple is true or false are not commonly evaluated; instead, metrics such as mean reciprocal rank and Hits@k are reported, which compare the prediction of a test triple against all evaluation triples. Not only is this computationally costly, the evaluation is flawed if an entity is related to l > k others (k is often 1 or 3): a correct prediction validly falling within the top l but not the top k would appear incorrect under the metric. Some recent works also note the importance of standalone predictions (Speranskaya et al., 2020; Pezeshkpour et al., 2020). Since for each relation there are n_e² possible entity-entity relationships, we sub-sample by computing predictions only for those (e_s, r, e_o) triples for which the (e_s, r) pairs appear in the test set. We split positive predictions (σ(φ(e_s, r, e_o)) > 0.5) between (i) known truths, i.e. training or test/validation instances; and (ii) other, the truth of which is not known. We then compute per-relation accuracy over the true training instances ("train") and true test/validation instances ("test"); and the average number of "other" triples predicted true per (e_s, r) pair.
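The classification protocol reduces to thresholding each triple's score independently; a minimal sketch (the scores and labels below are made-up placeholders standing in for a model's outputs and known truths):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(scores, threshold=0.5):
    # A triple is predicted true iff sigmoid(score) > 0.5, i.e. score > 0.
    return sigmoid(scores) > threshold

# Placeholder scores phi(e_s, r, e_o) for four candidate triples:
scores = np.array([2.1, -0.3, 0.8, -1.7])
labels = np.array([True, False, True, True])   # known truth of each triple

preds = classify(scores)
accuracy = (preds == labels).mean()            # per-relation accuracy, in essence
```

Under this scheme, "other" triples are simply those whose prediction is positive but whose label is absent from the dataset; their count per (e_s, r) pair is what Table 5's final columns report.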
Table 5 shows results for MuRE_I, DistMult, TuckER and MuRE. All models achieve near perfect training accuracy. The additive-multiplicative MuRE gives the best test set performance, followed (surprisingly) closely by MuRE_I, with the multiplicative models (DistMult and TuckER) performing poorly on all but type R relations, in line with P1(b), with near-zero performance on most type S/C relations. Since ground truth labels for "other" triples predicted to be true are not in the dataset, we analyse a sample of "other" true predictions for one relation of each type (see Appx. G). From this, we estimate that TuckER is relatively accurate but pessimistic (∼0.3 correct of 0.5 predicted ≈ 60%), MuRE_I is optimistic but inaccurate (∼2.3 of 7.5 ≈ 31%), whereas MuRE is both optimistic and accurate (∼1.1 of 1.5 ≈ 73%).

5. CONCLUSION

Many low-rank knowledge graph representation models have been developed, yet little is known of the latent structure they learn. We build on recent understanding of PMI-based word embeddings to theoretically establish a set of geometric properties of relation representations (relation conditions) required to map PMI-based word embeddings of subject entities to related object entities under knowledge graph relations. These conditions partition relations into three types and provide a basis to consider the loss functions of existing knowledge graph models. Models that satisfy the relation conditions of a particular type have a known set of model parameters that minimise the loss function, i.e. the parameters of PMI embeddings, together with potentially many equivalent solutions. We show that the better a model's architecture satisfies a relation's conditions, the better its performance at link prediction, evaluated under both rank-based metrics and accuracy. Overall, we generalise recent theoretical understanding of how particular semantic relations, e.g. similarity and analogy, are encoded between PMI-based word embeddings to the general relations of knowledge graphs. In doing so, we provide evidence in support of our initial premise: that common latent structure is exploited by both PMI-based word embeddings (e.g. W2V) and knowledge graph representation.

A CATEGORISING WN18RR RELATIONS

Categorisation of NELL-995 relations and the explanation for the category assignment are shown in Tables 8 and 9 respectively.



*We omit the relation "similar_to" since its instances have no discernible structure, and only 3 occur in the test set, all of which are the inverse of a training example and trivial to predict.





Figure 1: Relationships between PMI vectors (black rectangles) of subject/object words for different relation types. PMI vectors capture co-occurrence with every dictionary word: strong associations (PMI > 0) are shaded (blue define the relation, grey are random other associations); red dash = relatedness; black dash = context sets.

Categorisation completeness: Taking intuition from Fig 1 and considering PMI vectors as sets of word features, these relation types can be interpreted as set operations: similarity as set equality; relatedness as subset equality; and context-shift as a relation-specific set difference. Since for any relation each feature must either remain unchanged (relatedness), change (context shift) or else be irrelevant, we conjecture that the above relation types give a complete partition of semantic relations.

Comparison by dot product gives (P_r e_s)^⊤ (P_r e_o) = e_s^⊤ P_r^⊤ P_r e_o = e_s^⊤ M_r e_o (for the relation-specific symmetric matrix M_r = P_r^⊤ P_r); Euclidean distance gives ‖P_r e_s − P_r e_o‖² = (e_s − e_o)^⊤ M_r (e_s − e_o) = ‖P_r e_s‖² − 2 e_s^⊤ M_r e_o + ‖P_r e_o‖².
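These identities are straightforward to verify numerically (a sketch: P_r is built as an orthogonal projection onto a random 3-dimensional subspace, and the offset version from the S/C case is checked too):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
e_s, e_o = rng.normal(size=(2, d))
r_vec = rng.normal(size=d)               # relation-specific offset (S/C case)

# Orthogonal projection onto a random 3-dim subspace: P = B B^T, B orthonormal.
B, _ = np.linalg.qr(rng.normal(size=(d, 3)))
P = B @ B.T
M = P.T @ P                              # symmetric; here M = P since P^T P = P

# Dot product: (P e_s)^T (P e_o) = e_s^T M e_o
assert np.isclose((P @ e_s) @ (P @ e_o), e_s @ M @ e_o)

# Euclidean: ||P e_s - P e_o||^2 = ||P e_s||^2 - 2 e_s^T M e_o + ||P e_o||^2
lhs = np.sum((P @ e_s - P @ e_o) ** 2)
rhs = np.sum((P @ e_s) ** 2) - 2 * (e_s @ M @ e_o) + np.sum((P @ e_o) ** 2)
assert np.isclose(lhs, rhs)

# With the offset r (S/C case):
# ||P e_s + r - P e_o||^2 = ||P e_s + r||^2 - 2 (P e_s + r)^T (P e_o) + ||P e_o||^2
lhs2 = np.sum((P @ e_s + r_vec - P @ e_o) ** 2)
rhs2 = (np.sum((P @ e_s + r_vec) ** 2)
        - 2 * ((P @ e_s + r_vec) @ (P @ e_o)) + np.sum((P @ e_o) ** 2))
assert np.isclose(lhs2, rhs2)
```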

P1: The ability to learn the representation of a relation is expected to reflect: (a) the complexity of its type (R < S < C), independently of model choice; and (b) whether the relation conditions (e.g. additive/multiplicative interactions) are met by the model.

P2: Knowledge graph relation representations reflect the following type-specific properties: (a) relation matrices for relatedness (type R) relations are highly symmetric; (b) offset vectors for relatedness relations have low norm; and (c) as a proxy to the rank of V_S, the eigenvalues of a relation matrix reflect relatedness strength.

Since for TransE the logistic sigmoid cannot be applied to the score function to give a probabilistic interpretation comparable to the other models, for fair comparison we include MuRE_I, a constrained variant of MuRE with R_s = R_o = I, as a proxy to TransE. Implementation details are included in Appx. D. For evaluation, we generate 2n_e evaluation triples for each test triple (n_e = |E| denoting the number of entities) by fixing the subject entity e_s and relation r and replacing the object entity e_o with each entity in turn, and then keeping e_o and r fixed and varying e_s. Each model's scores for the evaluation triples are ranked to give the standard metric Hits@10 (Bordes et al., 2013), i.e. the fraction of times a true triple appears in the top 10 ranked evaluation triples.
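The evaluation loop can be sketched as follows (`score_fn` is a stand-in for any model's score function; the rank here counts strictly-greater scores, one simple tie-handling convention among several in use):

```python
import numpy as np

def hits_at_k(score_fn, test_triples, n_entities, k=10):
    """Fraction of test triples whose true entity ranks in the top k
    among all corruptions of the object (and, separately, the subject)."""
    hits = []
    for (s, r, o) in test_triples:
        # Fix (s, r), replace the object with every entity in turn:
        obj_scores = np.array([score_fn(s, r, e) for e in range(n_entities)])
        hits.append(np.sum(obj_scores > obj_scores[o]) < k)   # rank of true o
        # Then fix (r, o) and vary the subject:
        sub_scores = np.array([score_fn(e, r, o) for e in range(n_entities)])
        hits.append(np.sum(sub_scores > sub_scores[s]) < k)   # rank of true s
    return float(np.mean(hits))
```

For example, a score function that uniquely favours the true triple yields Hits@10 of 1.0, while one that ranks the true object behind every corruption drags the metric toward 0.
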

Our analysis identifies the best-performing model per relation type, as predicted by P1(b): multiplicative-only DistMult for type R and additive-multiplicative MuRE for types S and C, providing a basis for dataset-dependent model selection. The per-relation insight into where models perform

Categorisation of WN18RR relations.

Hits@10 per relation on WN18RR.

Hits@10 per relation on NELL-995.

Per-relation prediction accuracy for MuRE$_I$ (M$_I$), (D)istMult, (T)uckER and (M)uRE (WN18RR).

Table 6 shows the symmetry score ($\in [-1, 1]$, indicating perfect anti-symmetry to perfect symmetry; see Appx. F) for the relation matrix of TuckER, and the norm of the relation vectors of TransE, MuRE$_I$ and MuRE on the WN18RR dataset. As expected, type R relations have materially higher symmetry than the other relation types, fitting the prediction of how TuckER compensates for having no additive component. All additive models learn relation vectors of noticeably lower norm for type R relations, which in the limiting case (similarity) require no additive component, than for types S or C.

Relation matrix symmetry score $\in [-1, 1]$ for TuckER; and relation vector norm for TransE, MuRE$_I$ and MuRE (WN18RR).

describes how each WN18RR relation was assigned to its respective category.

Explanation for the WN18RR relation category assignment.

Categorisation of NELL-995 relations.

Explanation for the NELL-995 relation category assignment.

The test set of NELL-995 created by Xiong et al. (2017) contains only 10 of the 200 relations present in the training set. To ensure a fair representation of all training set relations in the validation and test sets, we create new validation and test splits by combining the original validation and test sets with the training set and randomly selecting 10,000 triples each from the combined dataset.

"Other" facts as predicted by MuRE

"Other" facts as predicted by DistMult.

ACKNOWLEDGEMENTS

Carl Allen and Ivana Balažević were supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh.


(R) (equalizer_NN_2, set_off_VB_5) (extrapolation_NN_1, maths_NN_1) (sewer_NN_2, stitcher_NN_1) (trail_VB_2, trail_VB_2) (constellation_NN_2, satellite_NN_3) (spread_VB_5, circularize_VB_3) (lard_VB_1, vegetable_oil_NN_1) (worship_VB_1, worship_VB_1) (shrink_VB_3, subtraction_NN_2) (flaunt_NN_1, showing_NN_2) (snuggle_NN_1, draw_close_VB_3) (steer_VB_1, steer_VB_1) (continue_VB_10, proceed_VB_1) (extrapolate_VB_3, synthesis_NN_3) (train_VB_3, training_NN_1) (sort_out_VB_1, sort_out_VB_1) (support_VB_6, defend_VB_5) (strategist_NN_1, machination_NN_1) (scratch_VB_3, skin_sensation_NN_1)
(S) (r._e._byrd_NN_1, military_advisor_NN_1) (malcolm_x_NN_1, emancipationist_NN_1) (central_america_NN_1, c._am._nation_NN_1) (nuptse_NN_1, urban_center_NN_1) (r._e._byrd_NN_1, aide-de-camp_NN_1) (north_platte_river_NN_1, urban_center_NN_1) (malcolm_x_NN_1, environmentalist_NN_1) (ticino_NN_1, urban_center_NN_1) (tampa_bay_NN_1, urban_center_NN_1) (oslo_NN_1, urban_center_NN_1) (the_nazarene_NN_1, christian_JJ_1) (aegean_sea_NN_1, aegean_island_NN_1) (tidewater_region_NN_1, south_NN_1) (zaire_river_NN_1, urban_center_NN_1) (thomas_aquinas_NN_1, church_father_NN_1) (cowpens_NN_1, war_of_am._ind._NN_1) (r._e._byrd_NN_1, executive_officer_NN_1)
(C) (drive_NN_12, badminton_NN_1) (slugger_NN_1, baseball_player_NN_1) (turn_VB_12, plow_NN_1) (turn_VB_12, till_VB_1) (etymologizing_NN_1, etymologize_VB_2) (rna_NN_1, chemistry_NN_1) (assist_NN_2, softball_game_NN_1) (hit_NN_1, hit_VB_1) (matrix_algebra_NN_1, diagonalization_NN_1) (metrify_VB_1, versify_VB_1) (council_NN_2, assembly_NN_4) (fire_VB_3, flaming_NN_1) (cabinetwork_NN_2, woodworking_NN_1) (trial_impression_NN_1, publish_VB_1) (throughput_NN_1, turnout_NN_4) (ring_NN_4, chemical_chain_NN_1) (cabinetwork_NN_2, bottom_VB_1) (turn_VB_12, plowman_NN_1) (cream_VB_1, cream_NN_2) (libidinal_energy_NN_1, charge_NN_9) (cabinetwork_NN_2, upholster_VB_1)

D IMPLEMENTATION DETAILS

All algorithms are re-implemented in PyTorch with the Adam optimizer (Kingma & Ba, 2015), minimising binary cross-entropy loss with hyper-parameters that work well for all models (learning rate: 0.001, batch size: 128, number of negative samples: 50). Entity and relation embedding dimensionality is set to $d_e = d_r = 200$ for all models except TuckER, for which $d_r = 30$ (Balažević et al., 2019b).
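A minimal sketch of this training setup follows. The DistMult-style scorer, entity/relation counts and single training step are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

n_entities, n_relations, d_e = 1000, 11, 200  # toy sizes (d_e as stated)

class LinkPredictor(nn.Module):
    """Illustrative DistMult-style scorer: phi(s, r, o) = <e_s, w_r, e_o>."""
    def __init__(self):
        super().__init__()
        self.E = nn.Embedding(n_entities, d_e)
        self.R = nn.Embedding(n_relations, d_e)

    def forward(self, s, r, o):
        return (self.E(s) * self.R(r) * self.E(o)).sum(-1)

model = LinkPredictor()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCEWithLogitsLoss()

# One illustrative step: a batch of 128 positives, 50 negatives each.
s = torch.randint(0, n_entities, (128,))
r = torch.randint(0, n_relations, (128,))
o = torch.randint(0, n_entities, (128,))
neg_o = torch.randint(0, n_entities, (128, 50))

pos = model(s, r, o)                        # (128,)
neg = model(s[:, None], r[:, None], neg_o)  # (128, 50) via broadcasting
logits = torch.cat([pos[:, None], neg], dim=1)
labels = torch.cat([torch.ones(128, 1), torch.zeros(128, 50)], dim=1)

loss = loss_fn(logits, labels)
opt.zero_grad()
loss.backward()
opt.step()
```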

E KRACKHARDT HIERARCHY SCORE

The Krackhardt hierarchy score measures the proportion of node pairs (x, y) where there exists a directed path x → y but not y → x; it takes a value of one for all directed acyclic graphs, and zero for cycles and cliques (Krackhardt, 1994; Balažević et al., 2019a). Let $M \in \mathbb{R}^{n \times n}$ be the binary reachability matrix of a directed graph $G$ with $n$ nodes, with $M_{i,j} = 1$ if there exists a directed path from node $i$ to node $j$ and 0 otherwise. The Krackhardt hierarchy score of $G$ is defined as:

$$\mathrm{Khs}_G = \frac{\sum_{i,j} M_{i,j}\,(1 - M_{j,i})}{\sum_{i,j} M_{i,j}}. \quad (1)$$
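Assuming the score is the fraction of reachable ordered pairs that are reachable in one direction only, it can be computed from the reachability matrix as follows (a sketch, not the authors' code):

```python
import numpy as np

def krackhardt_hierarchy(M):
    """Krackhardt hierarchy score from a binary reachability matrix M.

    Counts ordered node pairs reachable in one direction only, as a
    fraction of all reachable ordered pairs.
    """
    M = np.asarray(M, dtype=bool)
    reachable = M.sum()
    if reachable == 0:
        return 0.0
    one_way = (M & ~M.T).sum()  # i reaches j, but j does not reach i
    return one_way / reachable

# A two-node chain (a DAG) scores 1; a two-node cycle scores 0.
assert krackhardt_hierarchy([[0, 1], [0, 0]]) == 1.0
assert krackhardt_hierarchy([[1, 1], [1, 1]]) == 0.0
```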

F SYMMETRY SCORE

The symmetry score $\in [-1, 1]$ (Hubert & Baker, 1979) for a relation matrix $R \in \mathbb{R}^{d_e \times d_e}$ is defined such that 1 indicates a symmetric and -1 an anti-symmetric matrix.

G "OTHER" PREDICTED FACTS

Tables 10 to 13 show a sample of the unknown triples (i.e. those formed using the WN18RR entities and relations but not present in the dataset) for the derivationally_related_form (R), instance_hypernym (S) and synset_domain_topic_of (C) relations at a range of probabilities ($\sigma(\phi(e_s, r, e_o)) \approx \{0.4, 0.6, 0.8, 1\}$), as predicted by each model. True triples are indicated in bold; instances where a model predicts an entity is related to itself are indicated in blue.

Table 13: "Other" facts as predicted by MuRE. Columns: Relation (Type); $\sigma(\phi(e_s, r, e_o)) \approx 0.4$; $\approx 0.6$; $\approx 0.8$.
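A correlation-style stand-in for the symmetry score of Appx. F can be sketched as below. This is an assumption for illustration: the exact Hubert & Baker (1979) statistic differs in its normalisation, but shares the same endpoints of 1 for symmetric and -1 for anti-symmetric matrices.

```python
import numpy as np

def symmetry_score(R):
    """Correlation between R[i, j] and R[j, i] over off-diagonal entries.

    Returns 1 for a symmetric matrix and -1 for an anti-symmetric one
    (simplified stand-in for the Hubert & Baker, 1979 statistic).
    """
    R = np.asarray(R, dtype=float)
    mask = ~np.eye(R.shape[0], dtype=bool)   # ignore the diagonal
    return np.corrcoef(R[mask], R.T[mask])[0, 1]

R_sym = np.array([[0., 1, 2], [1, 0, 3], [2, 3, 0]])    # symmetric
R_anti = np.array([[0., 1, -2], [-1, 0, 3], [2, -3, 0]])  # anti-symmetric
assert np.isclose(symmetry_score(R_sym), 1.0)
assert np.isclose(symmetry_score(R_anti), -1.0)
```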

