RETHINKING IDENTITY IN KNOWLEDGE GRAPH EMBEDDING

Abstract

Knowledge Graph Embedding (KGE) is a common method to complete real-world Knowledge Graphs (KGs) by learning embeddings of entities and relations. Beyond specific KGE models, previous work proposes a general framework based on groups. A group has a special element, the identity, which corresponds to the identity relation in KGs and implies that identity should be represented uniquely. However, we find that this uniqueness cannot be modeled by bilinear based models, revealing an inconsistency between the framework and the models. To this end, we propose a solution named Unit Ball Bilinear Model (UniBi). In addition to its theoretical superiority, UniBi is more interpretable and improves performance by preventing ineffective learning with the least constraints. Experiments demonstrate that UniBi models the uniqueness and verify its interpretability and performance.

1. INTRODUCTION

Knowledge Graphs (KGs) store human knowledge in the form of triples (h, r, t), each representing a relation r between a head entity h and a tail entity t (Ji et al., 2021). KGs benefit many downstream tasks and applications, e.g., recommender systems (Zhang et al., 2016), dialogue systems (He et al., 2017) and question answering (Mohammed et al., 2018). Since real-world KGs are usually incomplete, researchers are interested in predicting missing links to complete them. As a common solution, Knowledge Graph Embedding (KGE) completes KGs by learning low-dimensional representations of entities and relations. Beyond the great advances in specific KGE models (Trouillon et al., 2016; Hitchcock, 1927; Chami et al., 2020; Liu et al., 2017; Nickel et al., 2011; Bordes et al., 2013), several works also attempt to unify these models with general frameworks, such as promising ones based on groups (Yang et al., 2020; Xu & Li, 2019; Ebisu & Ichise, 2018). A group is an abstraction of an operation on a set, like addition on the integers. Just as that case has a special number 0, each group has a unique element, the identity. From the perspective of groups, this element requires that its correspondent in KGs, the identity relation, be represented uniquely. However, we find that such uniqueness cannot be modeled by bilinear based models, which reveals an inconsistency between the framework and the models.

To present the problem more clearly, we first introduce some notation. A model with score function s(h, r, t) models the uniqueness of identity if s(h, r, h) > s(h, r, t) for all h ≠ t holds if and only if r is identity, and the universal representation of identity is unique. The score function s(·) of a bilinear based model is h^⊤ R t, where h, R, t are the representations of h, r, and t. In terms of such uniqueness, bilinear based models have two flaws. On the one hand, Fig. 1(a) demonstrates that e_1^⊤ I e_1 < e_1^⊤ I e_2 is possible, which means that the relation matrices per se do not model identity perfectly. On the other hand, Fig. 1(b) shows that even if a matrix, e.g., I, models identity, its scaled version kI also models identity and thus breaks the uniqueness. Obviously, modeling this property requires both entities and relations to be restricted, which reduces expressiveness. To avoid this side effect, we make the cost negligible by minimizing the constraints, one per entity or relation, while modeling the desired property. To be specific, we normalize the vectors of the entities and the spectral radii of the matrices of the relations to 1. Since the model captures entities in a unit ball as shown in Fig. 1(c), we name it Unit Ball Bilinear Model (UniBi). In addition to its theoretical superiority, UniBi is more powerful and interpretable, since modeling identity uniquely requires normalizing the scales, which barely contain any useful knowledge. On the one hand, scale normalization prevents ineffective learning on scales and makes UniBi focus more on useful knowledge. On the other hand, it reveals the connection between the relative ratio of singular values and the complexity of relations.

(Figure 1: (a) unnormalized entities e_1, e_2; (b) a scaled entity ke_1; (c) UniBi captures entities in a unit ball.)

Experiments verify that UniBi models identity uniquely with improvements in performance and interpretability. Therefore, UniBi reconciles the framework and bilinear based models in terms of identity and paves the way for further studies of both.

2. PRELIMINARIES

2.1 BACKGROUND

A Knowledge Graph K is a set that contains facts about a set of entities E and a set of relations R. Each fact is stored as a triple (e_i, r_j, e_k) ∈ E × R × E, where e_i and r_j denote the i-th entity and the j-th relation, respectively. KGE aims to predict the missing links in K by learning embeddings for each entity and relation via a score function s : E × R × E → R. To verify the performance of a KGE method, K is first divided into K_train and K_test. Then, the method is trained on K_train to learn the embeddings e and r (or R) for each entity e ∈ E and relation r ∈ R. Finally, for each query (e_i, r_j, ?) from K_test and each candidate entity e_k ∈ E, the model is expected to give a higher rank to e_k if (e_i, r_j, e_k) ∈ K and a lower rank if (e_i, r_j, e_k) ∉ K. In addition to the above tail prediction, models are also tested on head prediction, conversely; we transform it to tail prediction by introducing reciprocal relations, following Lacroix et al. (2018). A group is the abstraction of an operation on a set. For example, the additive group of integers consists of addition and all integers, and its identity element is 0. Since we do not expand on this concept in the main text, we put the formal definition in Appendix B. To avoid confusion, we use bold to denote the group element identity and italics to denote the identity relation and other relations. We give the definition of modeling identity uniquely as follows. Definition 1. A KGE model can uniquely model identity means that s(h, r, h) > s(h, r, t), ∀h, t ∈ E, h ≠ t holds if and only if r is identity, and its universal representation, i.e., one that works in any KG, is unique.
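The reduction of head prediction to tail prediction via reciprocal relations, mentioned above, can be sketched as follows. This is our own toy illustration (the integer-id encoding and the relation-id offset are assumptions, not the paper's implementation):

```python
# Sketch of the reciprocal-relation transform (Lacroix et al., 2018):
# head prediction (?, r, t) is reduced to tail prediction (t, r', ?)
# by adding a reciprocal relation r' = r + n_rel for every relation r.

def add_reciprocals(triples, n_rel):
    """triples: list of (head, rel, tail) integer ids; n_rel: #relations."""
    augmented = list(triples)
    for h, r, t in triples:
        augmented.append((t, r + n_rel, h))  # reciprocal triple
    return augmented

train = [(0, 0, 1), (1, 1, 2)]  # toy KG with 2 relations
aug = add_reciprocals(train, n_rel=2)
print(aug)  # [(0, 0, 1), (1, 1, 2), (1, 2, 0), (2, 3, 1)]
```

After this transform, a single tail-prediction pipeline answers both query directions.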

2.2. OTHER NOTATIONS

We utilize Ê and R to denote the sets of all possible representations of entities and relations, and we use e ∈ Ê and R ∈ R to denote the embedding vector of the entity e and the transformation matrix specific to the relation r. Furthermore, we use ∥·∥ to denote the L2 norm of vectors, and ∥·∥_F and ρ(·) to denote the Frobenius norm and the spectral radius of a matrix. In this paper, we focus on the n-dimensional real space R^n, which means Ê ⊆ R^n and R ⊆ R^{n×n}. We also consider vector spaces over the complex numbers C^n or the hypercomplex space H^n, since they are isomorphic to R^{2n} and R^{4n} as real vector spaces.
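The three quantities above can be computed directly; a minimal numpy sketch (the random test matrix is our own assumption), where, as in Section 4.4.2, ρ(·) is identified with the maximum singular value:

```python
import numpy as np

# The norms used throughout the paper: L2 norm of a vector, Frobenius
# norm of a matrix, and the spectral radius rho(R) taken as the largest
# singular value of R.
rng = np.random.default_rng(0)
e = rng.normal(size=4)
R = rng.normal(size=(4, 4))

l2 = np.linalg.norm(e)                          # ||e||
fro = np.linalg.norm(R, ord='fro')              # ||R||_F
rho = np.linalg.svd(R, compute_uv=False).max()  # rho(R) = sigma_max

# sigma_max never exceeds the Frobenius norm (sum of squared singular values).
assert rho <= fro + 1e-12
```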

3. RELATED WORK

Previous work on KGE can be roughly divided into three categories: distance based, bilinear based, and others. Distance based models choose the Euclidean distance for their score functions. TransE (Bordes et al., 2013), inspired by Word2Vec (Mikolov et al., 2013) in Natural Language Processing, proposes the first distance based model, which uses translation as the linear transformation: s(h, r, t) = -∥h + r - t∥. TransH (Wang et al., 2014) and TransR (Lin et al., 2015) find that TransE has difficulty handling complex relations and thus apply linear projections before translation. Apart from translation, RotatE (Sun et al., 2019) first introduces rotation as the transformation, and RotE (Chami et al., 2020) further combines translation and rotation. Some works also introduce hyperbolic spaces (Balazevic et al., 2019b; Chami et al., 2020; Wang et al., 2021). Apparently, distance based models are capable of uniquely modeling identity, since the distance between a vector and itself is 0. In contrast, bilinear based models have score functions in the bilinear form s(h, r, t) = h^⊤ R t. RESCAL (Nickel et al., 2011) is the first bilinear based model, whose relation matrices are unconstrained. Although RESCAL is expressive, it contains too many parameters and tends to overfit. DistMult (Yang et al., 2015) simplifies these matrices into diagonal ones. ComplEx (Trouillon et al., 2016) further introduces complex values to model the skew-symmetry pattern. Analogy (Liu et al., 2017) uses block-diagonal matrices to model the analogical pattern and subsumes DistMult, ComplEx, and HolE (Nickel et al., 2016). Moreover, QuatE (Zhang et al., 2019) extends complex values to quaternions, and GeomE (Xu et al., 2020) utilizes geometric algebra to subsume all these models. However, these studies have not noticed that bilinear based models fail to uniquely model identity.
Corresponding to the identity relation, identity is a special element in a group, which was first introduced to KGE by Dihedral (Xu & Li, 2019) and HolE (Nickel et al., 2016). Based on groups, NagE (Yang et al., 2020) proposes a general framework to incorporate previous methods. Although they believe that the matrix representation of identity is I, they ignore the fact that identity may not be uniquely modeled by specific models. In addition, other works using black-box networks (Dettmers et al., 2018; Nguyen et al., 2018; Yao et al., 2019; Schlichtkrull et al., 2018; Zhang et al., 2020b) or additional information (An et al., 2018; Ren et al., 2016) are beyond the scope of this paper.

4. METHOD

In this section, we first discuss the relationship between the identity element in groups and the identity relation in KGs, and give the condition for bilinear based models to model identity uniquely in Section 4.1. We then propose a model named UniBi that satisfies this condition with the least constraints in Section 4.2, and an efficient modeling for UniBi in Section 4.3. In addition, we discuss its improvements in performance and interpretability via scale normalization in Section 4.4.

4.1. IDENTITY AND IDENTITY

In previous work, a framework based on groups has been proposed (Yang et al., 2020). It states that all relations should be embedded as elements of a group. Groups have a special element, the identity, which corresponds to the identity relation in KGs. This correspondence implies that identity should have a unique representation. Counter-intuitively, we find that bilinear based models in fact fail to model this uniqueness. We demonstrate two cases in which bilinear based models violate it. On the one hand, Fig. 1(a) demonstrates that e_1^⊤ I e_1 < e_1^⊤ I e_2 is possible, which means that the matrix of a relation per se is not guaranteed to model identity. On the other hand, Fig. 1(b) shows that even if a matrix, e.g., I, models identity, its scaled version kI, where k > 0, k ≠ 1, also models identity, which contradicts the uniqueness quantification. Therefore, we give a formal definition based on Definition 1 as follows, to investigate how to modify bilinear based models to model this uniqueness. Definition 2. A bilinear model can uniquely model identity means: ∃! R ∈ R, ∀h, t ∈ Ê, h ≠ t, h^⊤ R h > h^⊤ R t, (1) where ∃! is the uniqueness quantification.
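Both flaws can be checked numerically. The following is our own toy sketch (the specific vectors and the scale factor are assumptions chosen to expose the failure cases):

```python
import numpy as np

# Flaw (a): with unnormalized entities, even R = I can prefer a wrong
# tail: e1.T @ I @ e1 < e1.T @ I @ e2 when e2 has a larger norm and a
# similar direction to e1.
e1 = np.array([1.0, 0.0])
e2 = np.array([3.0, 0.1])            # longer vector, nearly aligned with e1
I = np.eye(2)
assert e1 @ I @ e1 < e1 @ I @ e2     # the identity matrix fails here

# Flaw (b): if entities ARE normalized, any k*I with k > 0 induces the
# same comparisons as I, so the representation of identity is not unique.
u1 = e1 / np.linalg.norm(e1)
u2 = e2 / np.linalg.norm(e2)
k = 5.0
assert (u1 @ (k * I) @ u1 > u1 @ (k * I) @ u2) == (u1 @ I @ u1 > u1 @ I @ u2)
```

These two cases are exactly why the constraints introduced below restrict both the entity norms and the relation matrices.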

4.2. UNIT BALL BILINEAR MODEL

From the above examples, it is clear that the embeddings of both entities and relations need to be constrained. Obviously, modeling identity requires both entities and relations to be restricted, which will reduce expressiveness. To solve this dilemma, we make the cost negligible by minimizing the constraints, one per entity or relation, while modeling the desired property. To be specific, we normalize the vectors of the entities and the spectral radii of the matrices of the relations to 1 by setting Ê = {e | ∥e∥ = 1, e ∈ R^n} and R = {R | ρ(R) = 1, R ∈ R^{n×n}}. We name the proposed model Unit Ball Bilinear Model (UniBi), since it captures entities in a unit ball as shown in Fig. 1(c). The score function of UniBi is: s(h, r, t) = h^⊤ R t, ∥h∥ = ∥t∥ = 1, ρ(R) = 1. (2) We then have the following theorem. Theorem 1. UniBi is capable of uniquely modeling identity in terms of Definition 2. Proof. Please refer to Appendix A.2.

4.3 EFFICIENT MODELING

4.3.1. EFFICIENT MODELING FOR SPECTRAL RADIUS

Although the proposed model has been proven to model identity uniquely, it still has a practical disadvantage, since it is difficult to directly represent all matrices whose spectral radius is 1. In addition, it is also time-consuming to calculate the spectral radius ρ(·) via singular value decomposition (SVD). To avoid unnecessary decomposition, we divide a relation matrix into three parts R = R_h Σ R_t, where R_h, R_t are orthogonal matrices and Σ = Diag[σ_1, . . . , σ_n] is a positive semidefinite diagonal matrix, and we maintain the independence of these three components during training. It then becomes simple to obtain matrices whose spectral radius is 1, namely R_h Σ R_t / σ_max, and we transform the score function Eq. 2 into the following form: s(h, r, t) = h^⊤ R_h Σ R_t t / (σ_max ∥h∥ ∥t∥), (3) where σ_max is the maximum among the σ_i.
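A minimal numpy sketch of this parameterization (the QR-based construction of random orthogonal factors is our own assumption for the demo, not the paper's training procedure):

```python
import numpy as np

# Sketch of Eq. 3: R = R_h @ Sigma @ R_t with orthogonal R_h, R_t and a
# nonnegative diagonal Sigma; dividing by sigma_max yields spectral
# radius 1 without running an SVD at score time.
rng = np.random.default_rng(1)
n = 4
R_h, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal factor
R_t, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal factor
sigma = np.abs(rng.normal(size=n))               # nonnegative singular values

def score(h, t):
    s = h @ R_h @ np.diag(sigma) @ R_t @ t
    return s / (sigma.max() * np.linalg.norm(h) * np.linalg.norm(t))

# Since R_h and R_t are orthogonal, the singular values of the combined
# matrix are exactly sigma, so dividing by sigma.max() normalizes rho to 1.
R = R_h @ np.diag(sigma) @ R_t / sigma.max()
rho = np.linalg.svd(R, compute_uv=False).max()
assert abs(rho - 1.0) < 1e-9
```

The score is then bounded in [-1, 1] by construction, as Proposition 1 in the appendix shows.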

4.3.2. EFFICIENT MODELING FOR ORTHOGONAL MATRIX

In addition, we find that the calculation of a full orthogonal matrix is still time-consuming (Tang et al., 2020). To this end, we only consider block-diagonal orthogonal matrices, where each block is a low-dimensional orthogonal matrix. Specifically, we use k-dimensional rotation matrices to build R_h and R_t. Taking R_h as an example, R_h = Diag[SO(k)_1, . . . , SO(k)_{n/k}], where SO(k)_i denotes the i-th special orthogonal matrix, that is, a rotation matrix. However, rotation matrices only represent the orthogonal matrices whose determinant is 1 and not those whose determinant is -1. To this end, we introduce two diagonal sign matrices of order n, S_h, S_t ∈ S, where S = {S | S_ij = ±1 if i = j, 0 if i ≠ j}. Thus, we can rewrite the score function Eq. 3 as: s(h, r, t) = h^⊤ R_h S_h Σ S_t R_t t / (σ_max ∥h∥ ∥t∥). (4) However, the sign matrices S_h and S_t are discrete. To address this problem, we notice that S_h, Σ, S_t can be merged into a matrix Ξ with Ξ_ij = s_i s_j σ_i if i = j and 0 if i ≠ j, (5) where s_i = (S_h)_ii, s_j = (S_t)_jj, i, j = 1, . . . , n and Ξ = Diag[ξ_1, . . . , ξ_n]. Thus, we incorporate the discrete matrices S_h, S_t into the continuous matrix Ξ: s(h, r, t) = h^⊤ R_h Ξ R_t t / (|ξ_max| ∥h∥ ∥t∥), (6) where |ξ_max| is the maximum among the |ξ_i|.
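A numpy sketch of the block-diagonal construction for k = 2 (the specific angles and ξ values are our own toy choices):

```python
import numpy as np

# Block-diagonal R_h built from 2x2 rotation blocks, combined with a
# signed continuous diagonal Xi that absorbs S_h @ Sigma @ S_t.
def block_rotations(thetas):
    """Diag[SO(2)_1, ..., SO(2)_{n/2}] from n/2 rotation angles."""
    n = 2 * len(thetas)
    R = np.zeros((n, n))
    for i, th in enumerate(thetas):
        c, s = np.cos(th), np.sin(th)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

R_h = block_rotations([0.3, 1.2])
R_t = block_rotations([-0.7, 0.4])
xi = np.array([0.9, -0.5, 1.3, -0.2])  # continuous, signs allowed

def score(h, t):
    s = h @ R_h @ np.diag(xi) @ R_t @ t
    return s / (np.abs(xi).max() * np.linalg.norm(h) * np.linalg.norm(t))

# Rotation blocks are orthogonal, so the singular values of the full
# matrix are |xi|; dividing by |xi|_max again gives spectral radius 1.
R = R_h @ np.diag(xi) @ R_t / np.abs(xi).max()
assert abs(np.linalg.svd(R, compute_uv=False).max() - 1.0) < 1e-9
```

Negative ξ entries play the role of the -1 determinant cases that pure rotations cannot express.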

4.4. OTHER BENEFITS FROM SCALE NORMALIZATION

In addition to its theoretical superiority, UniBi is more powerful and interpretable, since modeling identity requires normalizing the scales, which barely contain any useful knowledge. On the one hand, it is obvious that modeling identity uniquely needs to avoid the cases in Fig. 1(a) and Fig. 1(b), which requires normalizing the scales of entities and relations. On the other hand, it is counter-intuitive that the scale information is useless for bilinear based models. Scale information is treated as useless because what really matters is not the absolute values but the relative ranks of the scores, and scale contributes nothing to the ranks, since they remain the same after we multiply these scores by a factor greater than zero: s′(h, r, t) = (k_e h)^⊤ (k_r R)(k_e t) = k_e² k_r (h^⊤ R t) = k_e² k_r · s(h, r, t), where k_e, k_r > 0. Therefore, we treat learning on scales as ineffective.
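The rank invariance under rescaling can be verified directly; a small numpy sketch with randomly chosen embeddings (our own toy setup):

```python
import numpy as np

# Rescaling all entities by k_e and a relation matrix by k_r multiplies
# every score by k_e^2 * k_r > 0, so the ranking over candidate tails
# is unchanged: the scales carry no ranking information.
rng = np.random.default_rng(2)
h = rng.normal(size=4)
R = rng.normal(size=(4, 4))
tails = rng.normal(size=(10, 4))     # 10 candidate tail embeddings

k_e, k_r = 3.0, 0.5
scores = tails @ (h @ R)             # s(h, r, t) for each candidate t
scaled = (k_e * tails) @ ((k_e * h) @ (k_r * R))

assert np.allclose(scaled, k_e**2 * k_r * scores)
assert (np.argsort(scores) == np.argsort(scaled)).all()  # same ranking
```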

4.4.1. PERFORMANCE

As illustrated in Fig. 2, UniBi has better performance, since it prevents ineffective learning with the least constraints. On the one hand, by preventing ineffective learning, UniBi focuses more on learning useful knowledge, which helps improve performance. On the other hand, it pays a negligible cost in expressiveness, since it adds only one equality constraint to each entity or relation, which is negligible when the dimension d is high. In other words, although our scale normalization is a double-edged sword, its negative effect is negligible, which leads to better performance overall. It should be noted that the loss in expressiveness may outweigh the gain in learning if scale normalization is replaced by a stricter one. For example, if we constrain the matrix to be orthogonal, the cost in expressiveness is no longer negligible, since an orthogonal matrix requires each of its singular values to be 1, which amounts to d equality constraints.

(Figure 3: (a) a toy example with entities Alice, Bill, and David and relations isParentOf and isSonOf, illustrating hptr, tphr, and complexity; (b) the correspondence between the singular values σ_i and the aggregation of entities.)

4.4.2. INTERPRETABILITY

In addition to performance, scale normalization also helps us to understand complex relations. Complex relations are defined by whether hptr (heads per tail of a relation) or tphr (tails per head of a relation) is higher than a specific threshold 1.5 (Wang et al., 2014). Accordingly, all relations are divided into 4 types, i.e., 1-1, 1-N, N-1, and N-N. However, we think this division is too coarse-grained and suggest a fine-grained continuous metric, complexity, instead. To better demonstrate this idea, we give an example in Fig. 3(a) and the definition of complexity as follows. Definition 3. The complexity of a relation is the sum of its hptr and tphr. Intuitively, complex relations are handled by aggregating entities through projection (Wang et al., 2014; Lin et al., 2015), which implies that the higher the complexity of a relation, the stronger its aggregation effect, and vice versa. We note that this aggregation effect can be well characterized by the relative ratio, or imbalance degree, of the singular values of UniBi. For any relation matrix R = UΣV^⊤, both U and V are isometries, and only the singular values of the scaling matrix Σ contribute to the aggregation. Moreover, the singular values of UniBi are less than or equal to 1, since the spectral radius, i.e., the maximum singular value, is normalized. This shows a promising correspondence between the singular values of our model and the aggregation, and further to the complexity, as demonstrated in Fig. 3(b). Therefore, we can use singular values to represent the complexity of relations, which increases the interpretability of UniBi. It is worth mentioning that this interpretability can be transferred to other bilinear based models if they normalize the spectral radius of their relation matrices as UniBi does.
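Definition 3 can be computed directly from the triples; a sketch using the Fig. 3(a)-style family example (the specific triples are our own toy data):

```python
from collections import defaultdict

# Sketch of Definition 3: complexity(r) = hptr(r) + tphr(r), where hptr
# is the average number of distinct heads per tail and tphr the average
# number of distinct tails per head.
def complexity(triples, rel):
    heads_of = defaultdict(set)   # tail -> set of heads
    tails_of = defaultdict(set)   # head -> set of tails
    for h, r, t in triples:
        if r == rel:
            heads_of[t].add(h)
            tails_of[h].add(t)
    hptr = sum(len(v) for v in heads_of.values()) / len(heads_of)
    tphr = sum(len(v) for v in tails_of.values()) / len(tails_of)
    return hptr + tphr

# isParentOf: two parents per child and two children per parent (N-N)
triples = [("Alice", "isParentOf", "Bill"), ("David", "isParentOf", "Bill"),
           ("Alice", "isParentOf", "Carol"), ("David", "isParentOf", "Carol")]
print(complexity(triples, "isParentOf"))  # 2.0 heads/tail + 2.0 tails/head = 4.0
```

A strict 1-1 relation would score 2.0, the minimum; the metric grows continuously with the aggregation on either side.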

5. EXPERIMENT

In this section, we give the experiment settings in Section 5.1. We verify that UniBi is capable of uniquely modeling identity while previous bilinear based models are not in Section 5.2. UniBi is comparable to previous SOTA bilinear models on the link prediction task, as shown in Section 5.3. In addition, we demonstrate the robustness of UniBi in Section 5.4 and the interpretability regarding complexity in Section 5.5. Dataset We evaluate models on three commonly used benchmarks, i.e., WN18RR (Dettmers et al., 2018), FB15k-237 (Toutanova & Chen, 2015) and YAGO3-10-DR (Akrami et al., 2020). They were proposed by removing the reciprocal triples that cause data leakage in WN18, FB15K and YAGO3-10, respectively. Their statistics are listed in Tbl. 1.

Evaluation metrics

We use Mean Reciprocal Rank (MRR) and Hits@k (k = 1, 3, 10) as the evaluation metrics. MRR is the average inverse rank of the correct entities, which is insensitive to outliers. Hits@k denotes the proportion of correct entities ranked within the top k. Baselines Here we consider two specific versions of UniBi: UniBi-O(2) and UniBi-O(3), which use rotation matrices in 2 and 3 dimensions to construct the orthogonal matrices. To be specific, we use unit complex values and unit quaternions to model the 2D and 3D rotations with 2 × 2 and 4 × 4 matrices, respectively. For more details, see Appendix G.1. UniBi is compared with these bilinear models: RESCAL (Nickel et al., 2011), CP (Hitchcock, 1927), ComplEx (Trouillon et al., 2016), and QuatE (Zhang et al., 2019). In addition, it is also compared to other models: RotatE (Sun et al., 2019), MurE (Balazevic et al., 2019b), RotE (Chami et al., 2020), TuckER (Balazevic et al., 2019a) and ConvE (Dettmers et al., 2018). Optimization We adopt the reciprocal setting (Lacroix et al., 2018), which creates a reciprocal relation r′ for each r and a new triple (e_k, r′_j, e_i) for each (e_i, r_j, e_k) ∈ K. Instead of using Cross Entropy directly (Lacroix et al., 2018; Zhang et al., 2020a; 2019), we add an extra scalar γ > 0 before the softmax function, since UniBi is bounded, which brings an upper bound to the loss and makes the model difficult to optimize, as discussed by Wang et al. (2017). The loss is L = -∑_{(h,r,t)∈K_train} log [exp(γ · s(h, r, t)) / ∑_{t′∈E} exp(γ · s(h, r, t′))] + λ · Reg(h, r, t), where Reg(h, r, t) is the regularization term and λ > 0 is its factor. Specifically, we take Reg(h, r, t) to be DURA (Zhang et al., 2020a) in experiments, since it significantly outperforms other regularization terms. In addition, γ is set to 1 for previous methods and greater than 1 for UniBi, and we set the dimension n to 500. For other details on the implementation, see Appendix G.2.
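The two metrics can be sketched in a few lines, given the rank of the correct entity for each query (the rank values below are made-up toy numbers):

```python
import numpy as np

# MRR is the mean inverse rank of the correct entity; Hits@k is the
# fraction of queries whose correct entity is ranked within the top k.
def mrr(ranks):
    return float(np.mean(1.0 / np.asarray(ranks)))

def hits_at(ranks, k):
    return float(np.mean(np.asarray(ranks) <= k))

ranks = [1, 2, 1, 10, 4]
print(mrr(ranks))          # (1 + 0.5 + 1 + 0.1 + 0.25) / 5 = 0.57
print(hits_at(ranks, 3))   # 3 of 5 ranks are <= 3 -> 0.6
```

Because MRR weights by the inverse rank, a single badly ranked query (rank 10 above) barely moves the average, which is the insensitivity to outliers mentioned above.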

5.2. MODELING IDENTITY UNIQUELY

In this part, we verify that 1) UniBi is capable of uniquely modeling identity while previous models fail, and 2) the constraints on the embeddings of entities and relations are both indispensable. We explicitly add identity as a new relation to the benchmarks and use its corresponding matrix to determine whether the uniqueness is modeled. In particular, this matrix is supposed to converge to the identity matrix I or a scaled version of it. To evaluate this, we introduce a new metric, the imbalance degree ∆ = ∑_i (σ_i/σ_max - 1)². We first compare UniBi with CP (Hitchcock, 1927) and RESCAL (Nickel et al., 2011), the least and most expressive bilinear models, on FB15k-237. Besides, we also apply DURA (Zhang et al., 2020a) to the models to explore whether these methods are able to uniquely model identity under extra regularization. As demonstrated in Fig. 4(a), the imbalance degree ∆ of UniBi converges to 0 while the others fail, which verifies that UniBi is capable of uniquely modeling identity. In addition, the imbalance of the other models decreases to some extent when using DURA, yet they are still unable to uniquely model identity. Then, to show that UniBi converges to identity uniquely, we use two matrices R_1 and R_2 to model it independently. As shown in Fig. 4(b), the error between R_1 and R_2 also converges to 0, which means that they both converge to I. We then perform an ablation study to verify that both the entity constraint (EC) and the relation constraint (RC) are needed to model identity uniquely. The experiments show that using either constraint alone is not enough to model the uniqueness of identity, as illustrated in Fig. 4(c). This verifies the existence of the problems shown in Fig. 1(a) and Fig. 1(b).
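A sketch of the imbalance degree, reading the (garbled) formula as ∆ = ∑_i (σ_i/σ_max − 1)², which is 0 exactly when all singular values are equal:

```python
import numpy as np

# Imbalance degree Delta = sum_i (sigma_i / sigma_max - 1)^2. It is 0
# iff all singular values coincide, i.e., the matrix is a scaled
# isometry; for the identity relation it should converge to 0.
def imbalance(R):
    s = np.linalg.svd(R, compute_uv=False)
    return float(np.sum((s / s.max() - 1.0) ** 2))

print(imbalance(np.eye(3)))               # 0.0 for the identity matrix
print(imbalance(5.0 * np.eye(3)))         # still 0.0: Delta is scale-invariant
print(imbalance(np.diag([1.0, 0.5, 0.25])))  # positive for unequal singular values
```

The scale invariance of ∆ is deliberate: it separates the shape of the spectrum from the overall scale, which Section 4.4 argues carries no information.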

5.3. MAIN RESULTS

In this part, we demonstrate that the constraints help UniBi achieve better performance. We mainly compare our model with previous SOTA models, i.e., CP (Hitchcock, 1927) and ComplEx (Trouillon et al., 2016), using DURA regularization. Although these models have been implemented by Zhang et al. (2020a), the dimensions of CP and ComplEx there are very high and they have not been tested on YAGO3-10-DR, so we reimplement them in this paper. In addition, we further remove the constraints of UniBi-O(n) as ablations to eliminate the influence of other factors. In Tbl. 2, UniBi achieves results comparable to previous bilinear based models and the unconstrained versions. UniBi is only slightly, and justifiably, below RESCAL on WN18RR, since RESCAL requires much more time and space.

Table 2: Evaluation results on the WN18RR, FB15k-237 and YAGO3-10-DR datasets (columns: Model, then MRR, Hits@1, Hits@10 for each dataset). We reimplement RotE, CP, RESCAL, and ComplEx with n = 500 (denoted by †), while we take results on WN18RR and FB15k-237 from the original papers and results on YAGO3-10-DR from Akrami et al. (2020). Best results are in bold while the second-best are underlined.

5.4. UNIBI PREVENTS INEFFECTIVE LEARNING

In this part, we verify that the superiority of UniBi comes from preventing ineffective learning. We conduct further comparisons without regularization. In addition, we also adopt the EC and RC settings from Section 5.2 to study the effect of both constraints. All experiments are conducted on WN18RR. On the one hand, the performance of UniBi decreases slightly, while that of the others decreases significantly, when the regularization term is removed, as demonstrated in Fig. 5(a). This shows that the learning of UniBi is less dependent on extra regularization, since it is better at learning by preventing the ineffective part. On the other hand, we illustrate the MRR of UniBi and its ablation models on the validation set as the epoch grows in Fig. 5(b). It shows that either constraint alone alleviates overfitting to some extent but fails to prevent the later decline, since the scale of the unconstrained part may diverge. Thus, both constraints are verified to be indispensable for preventing ineffective learning and for the performance of UniBi.

5.5. CORRELATION TO COMPLEXITY

To verify the statement in Section 4.4.2, we study the connection between the singular values and the complexity of each relation on the three benchmarks, where complexity is calculated following Definition 3. Furthermore, we measure the singular values of a relation by the imbalance degree ∆. To differentiate the ∆ of a relation r from that of its reciprocal relation r′, we use ∆_r and ∆_{r′} to denote them. As demonstrated in Fig. 6, we find that the singular values are highly correlated with the complexity of a relation. Furthermore, we notice that ∆_r and ∆_{r′} are very close even if a relation is unbalanced (1-N or N-1), which shows that complexity is handled by aggregation regardless of direction.

6. CONCLUSION

In this paper, we reveal that previous bilinear based models fail to model the uniqueness of the relation identity required by general frameworks based on groups. To address this problem at negligible cost, we propose UniBi, which is proven to handle it by capturing entities in a unit ball. Furthermore, UniBi benefits from the scale normalization required for modeling identity uniquely. On the one hand, scale normalization prevents ineffective learning and leads to better performance; on the other hand, it reveals that the relative ratio of singular values corresponds to the complexity of relations, which improves interpretability. Therefore, UniBi reconciles the framework and bilinear based models in terms of identity and paves the way for further studies of both.

APPENDIX A PROOFS

A.1 UNIBI IS BOUNDED

Proposition 1. UniBi is bounded: s(h, r, t) ∈ [-1, 1].

Proof. By the Cauchy–Schwarz inequality and the fact that the spectral norm bounds the L2 norm of a matrix-vector product, we have |h^⊤ R t| ≤ ∥h∥ ∥R t∥ ≤ ∥h∥ ρ(R) ∥t∥ = 1.

A.2 PROOF OF THEOREM 1

Proof. On the one hand, if R = I, then ∀h, t ∈ Ê, h ≠ t, we have h^⊤ I h > h^⊤ I t by the properties of the cosine. On the other hand, ∀R ∈ R, R ≠ I, we can always give a counterexample. Using singular value decomposition (SVD), we have R = U Σ V^⊤, where Σ = Diag[σ_1, . . . , σ_n] with σ_i ≥ 0 and U, V orthogonal matrices. Since ρ(R) = 1, we have σ_max = max_i σ_i = 1. Besides, since U and V are orthogonal matrices that do not change the norm of vectors, we have ∥U^⊤ h∥ = ∥V^⊤ t∥ = 1, and we write ĥ = U^⊤ h and t̂ = V^⊤ t for simplicity. We consider three scenarios and discuss them separately.

(1) If all singular values of R are equal, we have Σ = I and s(h, r, t) = h^⊤ U Σ V^⊤ t = ĥ^⊤ I t̂ = ĥ^⊤ t̂. (11) This is just the cosine function again; it attains its maximum when ĥ = t̂, i.e., U^⊤ h = V^⊤ t. If U V^⊤ = I, this contradicts the assumption that R ≠ I. If U V^⊤ ≠ I, then the maximizing t satisfies h = (U^⊤)^{-1} V^⊤ t = U V^⊤ t and thus h ≠ t, which means h^⊤ R h < h^⊤ R t in this situation.

(2) If not all singular values of R are equal and U = V, then there exist i, j ∈ {1, . . . , n} with σ_i ≠ σ_j; assume σ_i > σ_j. Take any h ∈ Ê with ĥ_j = (σ_i/σ_j − ϵ) ĥ_i, where ϵ ∈ (0, σ_i/σ_j − 1), and take t̂_k = ĥ_k for k ≠ i, j, t̂_i = ĥ_j, and t̂_j = ĥ_i. (12) Note that ĥ = U^⊤ h, t̂ = U^⊤ t and h ≠ t. Then h^⊤ R t = ĥ^⊤ Σ t̂ = σ_i ĥ_i t̂_i + σ_j ĥ_j t̂_j + ∑_{k≠i,j} σ_k ĥ_k t̂_k = σ_i ĥ_i ĥ_j + σ_j ĥ_j ĥ_i + ∑_{k≠i,j} σ_k ĥ_k², (13) and similarly, h^⊤ R h = ĥ^⊤ Σ ĥ = σ_i ĥ_i² + σ_j ĥ_j² + ∑_{k≠i,j} σ_k ĥ_k². (14) Subtracting Eq. 13 from Eq. 14, we have h^⊤ R h − h^⊤ R t = σ_i (ĥ_i² − ĥ_i ĥ_j) + σ_j (ĥ_j² − ĥ_i ĥ_j) = σ_i ĥ_i (ĥ_i − ĥ_j) − σ_j ĥ_j (ĥ_i − ĥ_j) = (σ_i ĥ_i − σ_j ĥ_j)(ĥ_i − ĥ_j). Substituting ĥ_j = (σ_i/σ_j − ϵ) ĥ_i gives σ_i ĥ_i − σ_j ĥ_j = ϵ σ_j ĥ_i and ĥ_i − ĥ_j = (1 − σ_i/σ_j + ϵ) ĥ_i, so the difference equals ϵ σ_j ĥ_i² (1 − σ_i/σ_j + ϵ) < 0, since ϵ < σ_i/σ_j − 1 implies 1 − σ_i/σ_j + ϵ < 0. This means h^⊤ R h < h^⊤ R t in this case.

(3) If not all singular values of R are equal and U ≠ V, there exists k ∈ {1, . . . , n} such that σ_k = σ_max = 1, since ρ(R) = 1. Then we take ĥ_i = t̂_i = 1 if i = k and 0 otherwise. Note that ĥ = t̂ but h ≠ t, since U ≠ V. Then h^⊤ R t = ĥ^⊤ Σ t̂ = σ_k = 1. Since Proposition 1 proves that UniBi is bounded by [-1, 1], we have h^⊤ R h ≤ 1 = h^⊤ R t, which means that s(h, r, h) > s(h, r, t) does not always hold.

In summary, UniBi satisfies ∀h ≠ t, h^⊤ R h > h^⊤ R t iff R = I, which means that UniBi can model identity uniquely in terms of Definition 2.

A.3 PROOF OF THEOREM 3

Proof. If ∥e∥ = 1 and ρ(R) = 1, the equation for the entity part obviously holds, and by Proposition 1 we have ∥R e∥ ≤ ρ(R) ∥e∥ = ρ(R) = 1; similarly, ∥e^⊤ R∥ ≤ 1. Moreover, this condition becomes necessary and sufficient if ∃ ê such that either ∥ê^⊤ R∥ = 1 or ∥R ê∥ = 1. To prove this, suppose ∥e^⊤ R∥ ≤ 1 and ∥R e∥ ≤ 1 for all e. Applying SVD to any R, we get R = U Σ V^⊤. Denote by σ the vector of singular values, so that Σ = Diag(σ); then for all unit vectors e with ē = V^⊤ e, we have ∥σ ∘ ē∥ ≤ 1. If ∃i with σ_i > 1, then taking ē_i = 1 and ē_j = 0 for j ≠ i shows that ∥σ ∘ ē∥ > 1, a contradiction; therefore ∀i, σ_i ≤ 1. Moreover, if ∀i, 0 < σ_i < 1, then for ē = V^⊤ ê we have ∥σ ∘ ē∥² = ∑_i σ_i² ē_i² < ∑_i ē_i² = 1, which violates ∥R ê∥ = ∥Σ V^⊤ ê∥ = ∥σ ∘ ē∥ = 1. Therefore ∃k with σ_k = 1, and we have ρ(R) = 1.

A.4 PROOF OF PROPOSITION 2

Proof. If a relation r is invertible, there exists an inverse relation r⁻¹ such that r ∘ r⁻¹ = r⁻¹ ∘ r = identity, where ∘ is the composition of relations. Note that the identity relation means that no two different entities appear in a triple under it; that is, KGs only contain triples of the form (e, identity, e) for e ∈ E. Consider a complex relation that has multiple tail entities for a head entity, that is, ∃h, t_1, t_2 such that (h, r, t_1) and (h, r, t_2) hold. In order to remap the entities t_1 and t_2 back to h, the inverse relation r⁻¹ has to contain the triples (t_1, r⁻¹, h) and (t_2, r⁻¹, h). Although r ∘ r⁻¹ maps h to h, r⁻¹ ∘ r fails to map t_1 and t_2 back to themselves separately, since both (t_1, r⁻¹ ∘ r, t_1) and (t_1, r⁻¹ ∘ r, t_2) hold. Similarly, if a relation has multiple heads, we obtain the same conclusion. Therefore, all complex relations are inherently non-invertible.
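The argument above can be checked concretely by treating relations as sets of (head, tail) pairs; this is our own toy sketch of the proof, not code from the paper:

```python
# Relations as sets of (head, tail) pairs; compose(r1, r2) applies r1
# first and then r2, matching the left-to-right composition in the proof.
def compose(r1, r2):
    return {(h, t2) for (h, t1) in r1 for (t1b, t2) in r2 if t1 == t1b}

r = {("h", "t1"), ("h", "t2")}            # one head, two tails (a 1-N relation)
r_inv = {(t, h) for (h, t) in r}          # the natural inverse candidate

print(compose(r, r_inv))   # {('h', 'h')}: the head is mapped back fine
print(compose(r_inv, r))   # four pairs: t1 and t2 are conflated, not identity
```

The second composition contains (t1, t2) and (t2, t1) alongside the diagonal pairs, so no candidate inverse can make a 1-N relation compose to identity.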

B BACKGROUND OF GROUP AND MONOID

Definition 4 (Jacobson (2012)). A monoid is a triple (M, p, 1) in which M is a non-vacuous set, p is an associative binary composition (or product) in M, and 1 is an element of M such that p(1, a) = a = p(a, 1) for all a ∈ M. Definition 5 (Jacobson (2012)). A group G (or (G, p, 1)) is a monoid all of whose elements are invertible.

C DISCUSS OF DISTANCE BASED MODELS

We mentioned that distance based models can model identity uniquely, and here we give the corresponding proof. We only consider models that can be written in the basic form s(h, r, t) = -∥Rh - t∥. Note that translation is also covered by this form if we treat translation as a linear transformation. Furthermore, the model must ensure that I ∈ R and Ê = R^n; other peculiar scenarios are not considered in this discussion. Theorem 2. A distance based model s(h, r, t) = -∥Rh - t∥ with I ∈ R and Ê = R^n can model identity uniquely. Proof. On the one hand, if R = I, then for any h ≠ t we have s(h, r, t) = -∥h - t∥ < 0 = -∥h - h∥ = s(h, r, h). On the other hand, if R ≠ I and R is not a singular matrix, we can take h with Rh ≠ h and t = Rh, and have s(h, r, h) = -∥Rh - h∥ < 0 = -∥Rh - Rh∥ = -∥Rh - t∥ = s(h, r, t). Therefore, distance based models are born to uniquely model identity.

Figure 7: The scale curves of relations during training; we use indices rather than names to denote different relations for simplicity. Note that 1) non-convergence exists in every case, 2) the better the result, the more the scales converge, and 3) regularization cannot stop the fluctuation of the scales. (Better viewed in color; zoom in and note the difference in the vertical coordinates.)

D COMPARISON TO PREVIOUS RESTRICTIONS

Although entity normalization and restricting R to orthogonal matrices, a special case of ρ(R) = 1, are common in KGE models, our constraint differs from these approaches as follows. (1) The constraint on relations based on the spectral radius, i.e., ρ(R) = 1, is first proposed in our work; it is indispensable and irreplaceable for modeling identity uniquely without sacrificing performance. If we instead force R to be orthogonal while keeping entity normalization, performance drops significantly, e.g., MRR: 0.488 → 0.471 for UniBi-O(2) on WN18RR. (2) Only the combination of the entity and relation constraints succeeds in modeling identity uniquely, as shown in Figure 4(c). This suggests that the constraints on entities and relations should be treated as a whole rather than as a combination of two unrelated parts. (3) At first glance, our constraint looks very similar to that of TransR (Lin et al., 2015), but in fact there are three differences. i) For TransR, the constraints ∥h∥ ≤ 1, ∥hM_r∥ ≤ 1, and ∥tM_r∥ ≤ 1 are imposed directly; in contrast, ∥R⊤h∥ ≤ 1 and ∥Rt∥ ≤ 1 are deduced from our constraints ∥e∥ = 1 and ρ(R) = 1. ii) TransR does not normalize the entities, and we have shown in Fig. 1(a) that normalization is necessary for a bilinear based method to model identity uniquely. iii) TransR is a distance based model, and as shown in Appendix C, distance based models do not need to consider the problem of identity, so the motivation of TransR differs from ours. Therefore, we believe that our combination of constraints is novel, since it introduces a new constraint on the relation and combines the two parts deliberately rather than arbitrarily.
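As a hedged sketch (ours, not the paper's implementation), one simple way to realize the relation constraint is to clip singular values at 1 via an SVD; the bounds ∥R⊤e∥ ≤ 1 and ∥Re∥ ≤ 1 then follow automatically from entity normalization. `project_relation` is a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

def project_relation(R):
    # Clip singular values at 1, one way to enforce a rho(R) = 1 style
    # constraint (the paper's exact projection may differ).
    U, s, Vt = np.linalg.svd(R)
    return U @ np.diag(np.minimum(s, 1.0)) @ Vt

R = project_relation(rng.normal(size=(n, n)))
e = rng.normal(size=n)
e /= np.linalg.norm(e)  # entity normalization ||e|| = 1

# The deduced bounds hold without being imposed directly.
assert np.linalg.norm(R.T @ e) <= 1.0 + 1e-9
assert np.linalg.norm(R @ e) <= 1.0 + 1e-9
```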

E INEFFECTIVE LEARNING

Here, we demonstrate that ineffective learning does exist, which means the scale is not only redundant but also harmful. As shown in Fig. 7, we take RESCAL as an example of this phenomenon. We notice that 1) non-convergence exists in every case, 2) the better the result, the more the scales converge, and 3) regularization cannot stop the fluctuation of the scales. We believe these cases illustrate, on the one hand, the positive correlation between constraining the scale and performance, and, on the other hand, that regularization alone cannot eliminate the fluctuations that may interfere with learning. Therefore, we conclude that the scale is harmful and learning on it is ineffective, and that a hard constraint rather than a regularization term is needed to prevent this completely.

F FROM GROUP TO MONOID

Since the singular values are less than or equal to 1, some matrices do not have corresponding inverse elements in R. Therefore, UniBi violates the definition of a group, which requires that each element have an inverse inside the set. However, we argue that some relations are inherently non-invertible and were misidentified as invertible by previous work (Yang et al., 2020; Sun et al., 2019; Xu & Li, 2019). For example, in Fig. 8(a), if two relations are inverse elements of each other, their composition must be identity. However, the composition of IsParentOf and IsChildOf is not identity, since the child of Alice's parent is "Alice or David" rather than "Alice itself", as demonstrated in Fig. 8(b) and Fig. 8(c). Note that we are not saying these two relations are unrelated, but rather that their composition does not strictly fit the definition of identity. Moreover, the fact that they are both non-invertible does not mean they cannot form the inversion pattern (Sun et al., 2019), since the two concepts are defined differently and do not conflict. Beyond this specific example, we generalize to all complex relations, i.e., relations that appear in both triples (e₁, r, e₂) and (e₁, r, e₃) (or (e₃, r, e₂)), where e₁, e₂, e₃ ∈ E and e₁ ̸= e₂ ̸= e₃. Furthermore, we have the following proposition. Proposition 2. ∀r ∈ R, if r is a complex relation, then r is non-invertible. Proof. Please refer to Appendix A.4. Since such non-invertible relations are extensively present in KGs, the requirement that each element have its inverse in the set should be abandoned. Thus, rather than a group, we think a monoid is a better structure for relations by their definitions (Definitions 4 and 5).
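The argument can be made concrete with a small script (our own illustration; entity names follow Fig. 8, and relations are modeled as sets of head-tail pairs composed left to right):

```python
# Compose two relations given as sets of (head, tail) pairs.
def compose(r1, r2):
    return {(h, t2) for (h, t1) in r1 for (m, t2) in r2 if t1 == m}

entities = {"Alice", "Bob", "David"}
is_child_of = {("Alice", "Bob"), ("David", "Bob")}  # Bob has two children
is_parent_of = {(t, h) for (h, t) in is_child_of}

identity = {(e, e) for e in entities}

# IsChildOf . IsParentOf maps Alice to both Alice and David,
# so the composition cannot equal identity: IsChildOf is non-invertible.
comp = compose(is_child_of, is_parent_of)
assert ("Alice", "David") in comp  # the spurious triple
assert comp != identity
```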

G IMPLEMENT DETAILS G.1 ROTATION MATRICES OF UNIBI

In Section 5, we propose two variants of UniBi, i.e., UniBi-O(2) and UniBi-O(3). Here, we give the rotation matrices SO(k) they use:

$$
SO(2) = \begin{pmatrix} x & -y \\ y & x \end{pmatrix},
\qquad
SO(3) = \begin{pmatrix} p & -q & -u & -v \\ q & p & v & -u \\ u & -v & p & q \\ v & u & -q & p \end{pmatrix},
$$

where x, y, p, q, u, v ∈ R, x² + y² = 1, and p² + q² + u² + v² = 1.
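For illustration (our own sketch; the angle parameterization and the helper names `so2_block` and `block_diag_rotation` are our assumptions, while the paper parameterizes by x, y directly), a relation matrix assembled from SO(2) blocks on the diagonal is orthogonal, so every singular value equals 1 and the ρ(R) = 1 constraint holds by construction:

```python
import numpy as np

def so2_block(theta):
    # SO(2) block [[x, -y], [y, x]] with x = cos(theta), y = sin(theta),
    # so x^2 + y^2 = 1 holds automatically.
    x, y = np.cos(theta), np.sin(theta)
    return np.array([[x, -y], [y, x]])

def block_diag_rotation(thetas):
    # Stack SO(2) blocks along the diagonal of a relation matrix.
    k = len(thetas)
    R = np.zeros((2 * k, 2 * k))
    for i, th in enumerate(thetas):
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = so2_block(th)
    return R

R = block_diag_rotation([0.3, 1.2, -0.7])
assert np.allclose(R.T @ R, np.eye(6))                      # orthogonal
assert np.allclose(np.linalg.svd(R, compute_uv=False), 1.0)  # all sigma = 1
```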

G.2 HYPERPARAMETERS

We fix the dimension of all models to 500, except RESCAL on WN18RR, which is set to 256 following Zhang et al. (2020a). We choose Adam (Kingma & Ba, 2015) as the optimizer and fix the learning rate at 1e-3. We set the maximum number of epochs to 200 and apply an early stopping strategy. We set the scaling factor γ to 1 for all models except UniBi, for which we search γ over {1, 5, 10, 15, 20, 25, 30}. For the regularization factor λ, we search over {1, 5e-1, 1e-1, 5e-2, 1e-2, 5e-3, 1e-3} for all models except UniBi, and over {0.5, 1, 1.5, 2, 2.5, 3} for UniBi. Furthermore, we do not search for λ1 and λ2 in Eq. 25; as in the original paper of DURA (Zhang et al., 2020a), we adopt their settings and set λ1 = 0.5, λ2 = 1.5 for ComplEx (Trouillon et al., 2016) and CP (Hitchcock, 1927), and λ1 = λ2 = 1 for the other cases. The search results are listed in Tbl. 3. We search for the batch size in {100, 1000}. In addition, we implemented all experiments in PyTorch on a single NVIDIA GeForce GTX 1080 Ti graphics card. We repeat each experiment five times and report means and standard deviations. For the experiments in Section 5.4, we set γ for UniBi without DURA to 10, 15, and 25 for WN18RR, FB15k-237, and YAGO3-10-DR, respectively.

H TIME AND SPACE COMPLEXITY

Since UniBi needs to compute additional constraints, it spends more time and space. Here, we compare the time and space consumption of UniBi with CP, ComplEx, and RESCAL on FB15k-237, with the batch size set to 1000. As demonstrated in Fig. 9(a) and Fig. 9(b), UniBi takes a little more time than CP and ComplEx, and more time to compute the regularization since it is more complex. Nevertheless, UniBi occupies a similar amount of space to CP and ComplEx. In addition, we note that in Section 5.3, UniBi outperforms all bilinear based models except RESCAL on WN18RR, and RESCAL needs significantly more time and space than the other models, so it is not as efficient. In summary, although UniBi takes a little longer, it strikes a good balance between complexity and performance. In future work, we will consider other constraints on the relation instead of the spectral radius.

I INHERENT REGULARIZATION TERM

Here we derive a necessary condition of UniBi and then deduce its corresponding Lagrangian function.

Theorem 3. UniBi has a necessary condition that ∥e∥ = 1, ∥Re∥ ≤ 1, and ∥R⊤e∥ ≤ 1.

Proof. Please refer to Appendix A.3.
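For intuition, the bounds in Theorem 3 follow from a one-line inequality chain, assuming UniBi's constraints σ_max(R) ≤ 1 and ∥e∥ = 1 (this is only a sketch; the full proof is in Appendix A.3):

```latex
\|Re\|_2 \le \sigma_{\max}(R)\,\|e\|_2 \le 1,
\qquad
\|R^{\top}e\|_2 \le \sigma_{\max}(R^{\top})\,\|e\|_2
             = \sigma_{\max}(R)\,\|e\|_2 \le 1 .
```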

J EXTRA EXPERIMENTS OF SCALES

To demonstrate that the phenomenon in Fig. 1(b) happens, we carry out additional experiments by explicitly adding the identity relation and monitoring its scale in different models. We run all experiments five times and report means and variances. As shown in Fig. 10, all models except UniBi exhibit fluctuation in the spectral radius of identity, which verifies the phenomenon in Fig. 1(b).

K COMPARISON OF LEARNING AND MODELING IDENTITY

Readers may wonder why we do not simply add the identity relation to the training set and take this indirect approach. The reason is that learning on identity per se does not help performance on other relations, and may even harm overall results, as shown in Fig. 11. We think this is because the triples of identity are too numerous compared to those of other relations, as shown in Tbl., so the model neglects learning the other relations.



Footnotes:
- We also prove this more rigorously in Appendix C.
- We present in more detail the similarities and differences between our constraints and those of our predecessors in Appendix D.
- Further discussion in Appendix E.
- We discuss this characteristic in more depth from the perspective of group in Appendix F.
- Detailed in Appendix H.
- Definition 4 and Definition 5.



Figure 1: The flaws of bilinear based models and our solution in terms of modeling the uniqueness of identity. (a) Identity matrix fails to model identity. (b) Scaled identity matrix could also model identity. (c) An illustration of UniBi. All entities are embedded in the unit sphere and stay in the unit ball after relation specific transformations.

Figure 2: How modeling identity improves performance. (a) An illustration of how scale affects performance: on the plus side, it has little effect on expressiveness; on the minus side, it causes ineffective learning. (b) Modeling identity requires scale normalization, which removes the effect of scales. (c) UniBi improves performance since it only deals with scales and not the other factors.

Figure 3: Complexity and contraction. (a) A toy example showing how to calculate complexity. (b) The aggregation corresponds to the singular values less than 1.

Figure 4: UniBi is capable of uniquely modeling identity. (a) The imbalance degree (∆) of UniBi converges to 0 while those of the others diverge. (b) The errors between different matrices modeling identity converge to 0 on different datasets. (c) Both the entity constraint (EC) and the relation constraint (RC) are indispensable for UniBi to model identity.

(a) Ablation of regularization. (b) Ablation of constraints.

Figure 5: UniBi benefits from preventing ineffective learning. (a) UniBi maintains its performance without regularization while other models do not. (b) Neither the entity constraint (EC) nor the relation constraint (RC) alone stops the decline in performance.

Figure 6: The imbalance degree (∆) and complexity (#hptr + #tphr) of relations in WN18RR, FB15k-237, and YAGO3-10-DR, respectively. The two metrics are highly correlated, and the imbalance of a relation (∆_r) and that of its reciprocal (∆_r′) are very close.

Figure 7: Why learning on the scales is ineffective. We test RESCAL with three settings: (a) no regularization, (b) Frobenius norm as regularization, (c) DURA as regularization. We use indices rather than names to denote different relations for simplicity. We notice that 1) non-convergence exists in every case, 2) the better the result, the more the scales converge, and 3) regularization cannot stop the fluctuation of the scales. (Best viewed in color; zoom in and note the difference in the vertical coordinates.)

Figure 8: Why some relations are non-invertible. (a) A toy example. (b) True case. (c) False case.

Figure 10: Bilinear based models learning identity have different scales. (Zoom in to see the fluctuation of RESCAL-DURA.)

Figure 11: Directly adding identity triples may hurt performance.

Statistics of the benchmark datasets.

Hyperparameters found by grid search. λ is the regularization coefficient, γ is the scaling factor, b is the batch size.

Statistics of triples of identity and other relations in benchmark datasets.


If we ignore the other implicit constraints, the optimization of UniBi can be rewritten in the form of a constrained optimization problem, where f(h, r, t) is a loss function. Furthermore, Eq. 23 corresponds to a Lagrangian function. We notice that if we set the factors λ_i^h = λ_k^t = λλ1 and μ_j^h = μ_j^t = λλ2, where λ, λ1, λ2 > 0, we obtain the following expression from Eq. 24 by discarding constant terms, which is equivalent to the optimization of an unconstrained model under DURA (Zhang et al., 2020a), previously the best regularization term for bilinear based models. Therefore, we can derive a more general version of DURA (DURA-G for simplicity) from Eq. 24.
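Since Eqs. 23-25 did not survive extraction, here is a hedged reconstruction of their shape, inferred from Theorem 3 and the DURA form; the exact indexing over samples and the assignment of multipliers to constraints are our assumptions, not the paper's statement:

```latex
% Eq. 23 (sketch): constrained problem
\min_{h,R,t} f(h,r,t)
\quad \text{s.t.}\quad \|h\|^2 = \|t\|^2 = 1,\;
\|Rh\|^2 \le 1,\; \|R^{\top}t\|^2 \le 1 .
% Eq. 24 (sketch): Lagrangian
\mathcal{L} = f(h,r,t)
+ \lambda^h \big(\|Rh\|^2 - 1\big) + \lambda^t \big(\|R^{\top}t\|^2 - 1\big)
+ \mu^h \big(\|h\|^2 - 1\big) + \mu^t \big(\|t\|^2 - 1\big) .
% With \lambda^h = \lambda^t = \lambda\lambda_1, \mu^h = \mu^t = \lambda\lambda_2,
% dropping constants yields a DURA-style objective (Eq. 25, sketch):
f(h,r,t) + \lambda\Big(\lambda_1\big(\|Rh\|^2 + \|R^{\top}t\|^2\big)
+ \lambda_2\big(\|h\|^2 + \|t\|^2\big)\Big) .
```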

DURA-G

At first glance, this may seem to be nothing more than a worthless trick. However, from the viewpoint of DURA, DURA-G is nonsensical and cannot be deduced from its perspective of distance based models (please refer to Section 4.3 in the original paper for more details), while it makes sense from the perspective of the Lagrangian function. Although DURA-G has an additional hyperparameter to search and can lead to better results, we do not use it in our experiments for three reasons: 1) for a fairer comparison with previous models, 2) the improvement is marginal, and 3) we prefer not to distract the reader from our key ideas.

