DECENTRALIZED KNOWLEDGE GRAPH REPRESENTATION LEARNING

Abstract

Knowledge graph (KG) representation learning methods have achieved competitive performance in many KG-oriented tasks, among which the best ones are usually based on graph neural networks (GNNs), a powerful family of networks that learns the representation of an entity by aggregating the features of its neighbors and itself. However, many KG representation learning scenarios only provide the structure information that describes the relationships among entities, leaving entities with no input features. In this case, existing aggregation mechanisms are incapable of inducing embeddings for unseen entities, as these entities have no pre-defined features to aggregate. In this paper, we present a decentralized KG representation learning approach, decentRL, which encodes each entity from and only from the embeddings of its neighbors. For optimization, we design an algorithm that distills knowledge from the model itself, such that the output embeddings can continuously gain knowledge from the corresponding original embeddings. Extensive experiments show that the proposed approach performed better than many cutting-edge models on the entity alignment task, and achieved competitive performance on the entity prediction task. Furthermore, under the inductive setting, it significantly outperformed all baselines on both tasks.

1. INTRODUCTION

Knowledge graphs (KGs) support many data-driven applications (Ji et al., 2020). Recently, learning low-dimensional representations (a.k.a. embeddings) of entities and relations in KGs has received increasing attention (Rossi et al., 2020). We find that existing models for KG representation learning share similar characteristics with those for word representation learning. For example, TransE (Bordes et al., 2013), a well-known translational KG embedding model, interprets a triple (e_1, r, e_2) as e_1 + r ≈ e_2, where e_1, e_2, r denote the subject, the object and their relationship, respectively, and the boldfaces denote the corresponding embeddings. If we view e_1 as a word in sentences, and e_2 as well as many other objects of e_1 as the context words, then TransE and many other KG embedding models (Wang et al., 2014; Dettmers et al., 2018; Nguyen et al., 2018; Kazemi & Poole, 2018; Sun et al., 2019) learn representations in a form similar to that used in Skip-gram (Mikolov et al., 2013a), where the input representation is learned to predict the representations of the context (i.e., the neighbors). Based on this observation, we explore decentralized KG representation learning, in which each entity is encoded from and only from its neighbors, and present a more efficient but still simple way to realize this concept on the most popular graph attention network (GAT) (Velickovic et al., 2018), as well as its many variants (Sun et al., 2020; Vashishth et al., 2020). We illustrate the methodology with the decentralized attention network (DAN), which is based on the vanilla GAT. DAN is able to support KG representation learning for unseen entities with only structure information, which is essentially different from using self features (e.g., attribute information) as in existing graph embedding models (Hamilton et al., 2017; Bojchevski & Günnemann, 2018; Hettige et al., 2020). Furthermore, the neighbors in DAN act as a whole in giving attention, which makes DAN more robust and more expressive than the conventional graph attention mechanism (Velickovic et al., 2018).
Another key problem in decentralized KG representation learning is how to estimate and optimize the output embeddings. If we distribute the information of an entity over its neighbors, then, conversely, the original embedding e_i of this entity also learns how to effectively participate in the aggregations of its different neighbors. Suppose that we have obtained an output representation g_i from DAN for entity e_i; we can simply estimate and optimize g_i by aligning it with e_i. But directly minimizing the L1/L2 distance between g_i and e_i may be insufficient. Specifically, these two embeddings have completely different roles and functions in the model, and the shared information may not reside in the same dimensions. Therefore, maximizing the mutual information between them is a better choice. Different from existing works like MINE (Belghazi et al., 2018) or InfoNCE (van den Oord et al., 2018), in this paper we design a self-knowledge distillation algorithm, called auto-distiller. It alternately optimizes g_i and its potential target e_i, such that g_i can automatically and continuously distill knowledge from the original representation e_i across different batches. The main contributions of this paper are as follows. (1) We propose decentralized KG representation learning, and present DAN as a prototype of the graph attention mechanism under the open-world setting. (2) We design an efficient knowledge distillation algorithm to support DAN in generating representations of unseen entities. (3) We implement an end-to-end framework based on DAN and auto-distiller. The experiments show that it achieved superior performance on two prevalent KG representation learning tasks (i.e., entity alignment and entity prediction), and also significantly outperformed cutting-edge models under the open-world setting.

2. BACKGROUND

Knowledge Graph. A KG can be viewed as a multi-relational graph, in which nodes represent entities in the real world and edges have specific labels to represent different relationships between entities. Formally, we define a KG as a 3-tuple G = (E, R, T), with E and R denoting the sets of entities and relationships, respectively, and T denoting the set of relational triples.

KG Representation Learning. Conventional models are mainly based on the idea of Skip-gram. According to the types of their score functions, these models can be divided into three categories: translational models (e.g., TransE (Bordes et al., 2013) and TransR (Lin et al., 2015a)), semantic matching models (e.g., DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016)) and neural models (e.g., ConvE (Dettmers et al., 2018) and RSN (Guo et al., 2019)). We refer interested readers to the surveys (Wang et al., 2017; Ji et al., 2020) for details. Recently, GNN-based models have received great attention in this field and are closely related to this paper. Specifically, R-GCN (Schlichtkrull et al., 2018), AVR-GCN (Ye et al., 2019) and CompGCN (Vashishth et al., 2020) introduce different relation-specific composition operations to combine neighbors and the corresponding relations before neighbor aggregation. RDGCN (Wu et al., 2019) refactors KGs as dual relation graphs (Monti et al., 2018), in which edge labels are represented as nodes for graph convolution. All the aforementioned GNN-based models choose GCNs and/or GATs to aggregate the neighbors of an entity, in which an identity matrix is added to the adjacency matrix. This operation is helpful when elements have self features, but poses a problem in learning the representations of unseen entities, to which no self features are attached. Differently, decentRL relies fully on the neighbor context to attend to the neighbors of each entity in linear complexity, which is efficient and easy to deploy.

Entity Alignment. Entity alignment aims to find potentially aligned entity pairs in two different KGs G_1 = (E_1, R_1, T_1) and G_2 = (E_2, R_2, T_2), given a limited number of aligned pairs as training data S ⊂ E_1 × E_2. Oftentimes, G_1 and G_2 are merged into a joint KG G = (E, R, T), which enables the models to learn representations in a unified space.

Entity Prediction. Entity prediction (a.k.a. KG completion (Bordes et al., 2013)) seeks to find the missing subject e_1 or object e_2, given an incomplete relational triple (?, r, e_2) or (e_1, r, ?). It is worth noting that performance on the entity prediction task can be greatly improved by complex deep networks, as it relies on predictive ability rather than embedding quality (Guo et al., 2019). Hence, many cutting-edge models cannot obtain promising results in entity alignment (Guo et al., 2019; Sun et al., 2020). Differently, entity alignment directly compares the distances between learned entity embeddings, which clearly reflects the quality of the output representations. Few models demonstrate consistently good performance on both tasks, whereas decentRL is capable of achieving competitive, even better, performance compared with the respective state-of-the-art models.

3. DECENTRALIZED REPRESENTATION LEARNING

In the decentralized setting, the representation of an entity e_i is aggregated from and only from its neighbors N_i = {e_1, e_2, ..., e_{|N_i|}}. As an entity may have many neighbors that are unequally informative (Velickovic et al., 2018), involving an attention mechanism is a good choice.

3.1. GRAPH ATTENTION NETWORKS

We start by introducing the graph attention network (GAT) (Velickovic et al., 2018), which leverages linear self-attention to operate on spatially close neighbors. For an entity e_i, GAT aggregates the representations of its neighbors N_i and itself into a single representation c_i as follows:

c_i = \sum_{e_j \in N_i \cup \{e_i\}} a_{ij} W e_j,   (1)

where a_{ij} is the learnable attention score from e_i to e_j, and W is the weight matrix. To obtain a_{ij}, a linear attention mechanism is used here:

a_{ij} = softmax(σ(a^T [W_1 e_i ‖ W_2 e_j])),   (2)

where a is a weight vector that converts the concatenation of two embeddings into a scalar attention score, and ‖ denotes the concatenation operation. W_1 and W_2 are two weight matrices, and σ is the activation function, usually LeakyReLU (Xu et al., 2015). GAT computes the attention scores of an entity e_i to its neighbors in linear complexity, which is very efficient when applied to large-scale graphs.
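To make the mechanism concrete, the following is a minimal NumPy sketch of the GAT aggregation and linear attention described above (single attention head, no bias term; the function names and shapes are our own illustration, not from any released code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # numerically stable softmax
    return e / e.sum()

def gat_aggregate(e_i, neighbors, W, W1, W2, a, slope=0.2):
    """One GAT aggregation step: the self embedding e_i joins both the
    attention computation and the aggregation, as in vanilla GAT."""
    members = [e_i] + list(neighbors)
    # score_j = LeakyReLU(a^T [W1 e_i || W2 e_j])
    scores = np.array([np.dot(a, np.concatenate([W1 @ e_i, W2 @ e_j]))
                       for e_j in members])
    scores = np.where(scores > 0, scores, slope * scores)  # LeakyReLU
    att = softmax(scores)
    # c_i = sum_j a_ij W e_j
    return sum(w * (W @ e_j) for w, e_j in zip(att, members))
```

Note how e_i appears both as the query in every score and as a member of the aggregated set; this is exactly the dependence that the decentralized variant removes.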

3.2. DECENTRALIZED ATTENTION NETWORKS

Intuitively, if e_i is the embedding of an unseen entity, it is of little use in computing the attention scores (as it is just a randomly initialized vector). Thus, purely relying on its neighbors may be a good choice. Specifically, to obtain the decentralized attention scores, one may simply sum the attention scores from all other neighbors, i.e., a_{ij} = softmax(\sum_{e_k \in N_i \setminus \{e_j\}} a_{kj}). However, this leads to a problem: the sum only represents the attention of each individual neighbor to e_j. In this case, a high attention score from one neighbor e_k to e_j can dominate the value of a_{ij}, but it does not mean that e_j is more important for e_i. Therefore, all neighbors should act as a whole in giving attention. Towards this end, we propose decentralized attention networks (DANs). Formally, to obtain the decentralized attention weight a_{ij}, we have to feed the attention layer with two types of input: the neighbor context vector n_i (i.e., the query) and the candidate neighbor embedding e_j (i.e., the key and value). Separately controlling the iterations of these two variables in a multi-layer model is evidently inefficient. Instead, we realize this operation with a second-order attention mechanism. For layer k, DAN calculates the decentralized attention score a^k_{ij} as

a^k_{ij} = softmax(σ(a_k^T [W^k_1 d^{k-1}_i ‖ W^k_2 d^{k-2}_j])),   (3)

which attends e_i's neighbor context to e_j. Then, we can obtain the output of layer k by

d^k_i = \sum_{e_j \in N_i} a^k_{ij} W^k d^{k-2}_j.   (4)

It is worth noting that we perform convolutions on layer k-2, as the score a^k_{ij} attends to the neighbor representations in layer k-2. This keeps the consistency and ensures that the output representations are consecutive. It also enhances the correlation of the outputs of different layers and forms the second-order graph attention mechanism. For the first layer of DAN, we initialize d^0_i and d^{-1}_j as follows:

d^0_i = \frac{1}{|N_i|} \sum_{e_j \in N_i} W^0 e_j,   d^{-1}_j = e_j.   (5)
Here, we simply use a mean aggregator to obtain the decentralized embedding d^0_i of layer 0, but other aggregators, such as pooling, may be employed as well. This simple mean aggregator can also be regarded as a CBOW model with a dynamic window size. For the architecture and implementation of DAN, please refer to Appendix A.
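The initialization and a single DAN layer described above can be sketched as follows (again a minimal single-head NumPy illustration with our own helper names; relation embeddings and residual connections are omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dan_init(neighbor_embs, W0):
    """Layer-0 decentralized embedding: a mean aggregator over the
    neighbors' original embeddings; d^{-1}_j is simply e_j itself."""
    return np.mean([W0 @ e_j for e_j in neighbor_embs], axis=0)

def dan_layer(d_km1_i, d_km2_nbrs, W1, W2, W, a, slope=0.2):
    """Second-order attention: the query is the neighbor-context vector
    d^{k-1}_i; keys/values are the neighbors' layer-(k-2) embeddings.
    The self embedding of entity i never enters the computation."""
    scores = np.array([np.dot(a, np.concatenate([W1 @ d_km1_i, W2 @ d_j]))
                       for d_j in d_km2_nbrs])
    scores = np.where(scores > 0, scores, slope * scores)  # LeakyReLU
    att = softmax(scores)
    return sum(w * (W @ d_j) for w, d_j in zip(att, d_km2_nbrs))
```

For the first layer, `dan_layer` would be called with `d_km1_i = dan_init(...)` and `d_km2_nbrs` set to the original neighbor embeddings, matching the initialization above.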

3.3. INSIGHT OF DAN

We compare GAT with DAN in Figure 1. Although the attention layers leveraged by DAN and GAT are identical, the decentralized structure has two significant strengths:

Inductive representation learning. In GAT, the self embedding e_i participates in both the aggregation of neighbors (Equation 1) and the calculation of attention scores (Equation 2). Therefore, when e_i is an open entity, its embedding is completely randomly initialized, and the attention scores computed by GAT are almost meaningless. By contrast, DAN generates the embedding of e_i without requiring its embedding at any point. This characteristic enables DAN to induce embeddings for unseen entities.

Robustness. When calculating the attention scores for an entity e_i, GAT takes only e_i as the query, so the importance of the other neighbors is overlooked, which may lead to biased attention computation. Moreover, it is generally known that most entities in KGs have only a small number of neighbors (Li et al., 2017). Due to the lack of training examples, the embeddings of these entities are not as informative as those with more neighbors. Therefore, they may not be capable of serving as queries for computing attention scores, so GAT cannot obtain reliable attention scores in some cases. By contrast, the queries in DAN are neighbor context vectors, which have richer semantics and also enable DAN to compute attention scores in an unbiased way.

Furthermore, the computational complexity of DAN is almost identical to that of GAT, except that DAN has an additional mean aggregator. From Figure 1 and Equation 5, we can see that such an aggregator is evidently simpler than the linear attention layer, which means that its computational cost (both time and space) is almost negligible. Therefore, DAN is an efficient model.

4. DECENTRALIZED REPRESENTATION ESTIMATION

The final output representation g_i of DAN for e_i can be optimized in a self-supervised fashion by minimizing the L1/L2 distance between g_i and e_i. However, such distance estimation pursues a precise match at every dimension of the two embeddings, while ignoring the implicit structure information across different dimensions.

4.1. MUTUAL INFORMATION MAXIMIZATION

As mentioned in Section 1, the original embedding e_i also serves as one of the neighbor embeddings when aggregating the decentralized embeddings of its neighbors, which implies that e_i itself preserves the latent information used to support its neighbors. Inspired by MINE (Belghazi et al., 2018), InfoNCE (van den Oord et al., 2018) and DIM (Hjelm et al., 2019), in this paper we do not try to optimize g_i by reconstructing the original representation e_i. Instead, we implicitly align them by maximizing the mutual information I(g_i, e_i). Specifically, we define a learnable function f : R^D ⊗ R^O → R to estimate the mutual information density (van den Oord et al., 2018) between g_i and the copy of e_i (the reason for using the copied vector will be explained shortly):

f(g_i, ê_i) = exp(g_i^T W_f ê_i + b_f),   (6)

where D and O are the dimensions of the output and input representations, respectively, W_f and b_f are the weight matrix and bias, respectively, and ê_i denotes the copy of e_i. We expect that f(g_i, ê_i) is significantly larger than f(g_i, ê_j) for j ≠ i. Following InfoNCE, the objective can be written as

I(g_i, ê_i) = E_{X_i} [ log ( f(g_i, ê_i) / \sum_{e_j \in X_i} f(g_i, ê_j) ) ],   (7)

where X_i = {e_1, ..., e_{|X_i|}} contains |X_i| - 1 sampled negative entities plus the target entity e_i. Maximizing this objective maximizes a lower bound on the mutual information between g_i and ê_i (van den Oord et al., 2018).
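The density function and the InfoNCE-style objective can be sketched as follows (a single-sample NumPy illustration with our own names; in practice the expectation is taken over mini-batches, and the log-sum would be computed in a numerically stable way):

```python
import numpy as np

def density(g_i, e_hat, Wf, bf):
    """Bilinear mutual-information density: f(g, e) = exp(g^T W_f e + b_f)."""
    return np.exp(g_i @ Wf @ e_hat + bf)

def infonce_term(g_i, e_hat_pos, e_hat_negs, Wf, bf):
    """One sample of the InfoNCE-style objective:
    log f(g_i, pos) / sum over the candidate set (target + negatives).
    The value is always <= 0; maximizing it tightens the MI lower bound."""
    pos = density(g_i, e_hat_pos, Wf, bf)
    total = pos + sum(density(g_i, e, Wf, bf) for e in e_hat_negs)
    return np.log(pos / total)
```

The candidate set here plays the role of X_i: the target copy ê_i plus |X_i| - 1 sampled negatives.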

4.2. AUTO-DISTILLER

Note that in Equations (6) and (7), we actually use the copy of the original representation, which leads to a completely different optimization process compared with existing works. Specifically, methods like InfoNCE or DIM jointly optimize the two inputs of the density function, as both variables are the outputs of deep neural models that need to be updated by back-propagation. But in decentRL, e_i is just a randomly initialized vector, and its gradient in Equation (7) may conflict with the gradient it receives as an input neighbor when learning the decentralized representations of its neighbors. On the other hand, such joint optimization also prevents e_i from learning the neighborhood information at the bottom layer, and compels this variable to match g_i. To address this problem, we view e_i as a teacher and g_i as a student learning from the teacher (Tian et al., 2020). Our aim is to let this teacher continuously gain more knowledge to teach the student, which we call auto-distiller. Therefore, our final objective is

argmax_{g_i, f} E_{X_i} [ log ( f(g_i, ê_i) / \sum_{e_j \in X_i} f(g_i, ê_j) ) ] + argmax_{e_i} \sum_{e_j \in N_i} E_{X_j} [ log ( f(g_j, ê_j) / \sum_{e_k \in X_j} f(g_j, ê_k) ) ],   (8)

and it has two important characteristics:

Lemma 1 (automatic distillation). Optimizing the first term of Equation (8) for entity e_i naturally contributes to optimizing the second term for the neighbors of e_i, which means that the conventional mini-batch training procedure can be applied.

Lemma 2 (lower bound). The mutual information between g_i and ê_i is still lower-bounded in auto-distiller.

Proof. See Appendix B.

5. EXPERIMENTS

We evaluated decentRL on two prevalent KG representation learning tasks, namely entity alignment and entity prediction. As few existing models show state-of-the-art performance on both tasks, we picked the state-of-the-art methods for each task and compared decentRL with them. To probe the effectiveness of decentRL, we also conducted an ablation study and additional experiments. Due to space limitations, please see Appendix C for more analytic results.

5.1. DATASETS

Entity Alignment Datasets. We consider the JAPE dataset DBP15K (Sun et al., 2017) , which is widely used by existing studies. It includes three entity alignment settings, each of which contains two KGs of different languages. For example, ZH-EN indicates Chinese-English alignment on DBpedia. Entity Prediction Datasets. We consider four datasets: FB15K, WN18, FB15K-237, and WN18RR (Bordes et al., 2013; Dettmers et al., 2018; Toutanova & Chen, 2015) . The former two have been used as benchmarks for many years, while the latter two are the corrected versions, as FB15K and WN18 contain a considerable amount of redundant data (Dettmers et al., 2018) .

5.2. EXPERIMENT SETUP

For both tasks, we initialized the original entity embeddings, relation embeddings and weight matrices with the Xavier initializer (Glorot & Bengio, 2010). To learn cross-KG embeddings for the entity alignment task, we incorporated a contrastive loss (Sun et al., 2020; Wang et al., 2018) to cope with the aligned entity pairs S, which can be written as

L_a = \sum_{(i,j) \in S^+} ||g_i - g_j|| + \sum_{(i',j') \in S^-} α [λ - ||g_{i'} - g_{j'}||]_+,   (9)

where S^+ and S^- are the positive entity pair set and the sampled negative entity pair set, respectively, ||·|| denotes the L2 distance between two embeddings, [·]_+ = max(·, 0), and α and λ are hyper-parameters. By jointly minimizing the two types of losses, decentRL is able to learn cross-KG embeddings for entity alignment. Similarly, for entity prediction, we need to choose a decoder to enable decentRL to predict missing entities (Vashishth et al., 2020). We chose two simple models, TransE (Bordes et al., 2013) and DistMult (Yang et al., 2015), for the main experiments, which are sufficient to achieve comparable performance against the state of the art.
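The contrastive alignment loss above can be sketched as follows (a NumPy illustration; the hinge [·]_+ is max(·, 0), and the α and λ defaults here are only illustrative, not the tuned hyper-parameters):

```python
import numpy as np

def alignment_loss(pos_pairs, neg_pairs, alpha=0.1, lam=1.5):
    """Contrastive alignment loss: pull aligned (positive) pairs together,
    push sampled negative pairs apart until they clear the margin lambda."""
    pos = sum(np.linalg.norm(g_i - g_j) for g_i, g_j in pos_pairs)
    neg = sum(alpha * max(lam - np.linalg.norm(g_i - g_j), 0.0)
              for g_i, g_j in neg_pairs)
    return pos + neg
```

A positive pair contributes its L2 distance directly, while a negative pair contributes nothing once its distance exceeds λ, so only "too close" negatives are penalized.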

5.3. ENTITY ALIGNMENT RESULTS

Table 1 depicts the entity alignment results on the JAPE dataset. We observe that: (1) decentRL significantly outperformed all the methods on Hits@1 and MRR, which empirically shows the advantage of decentRL in learning high-quality representations. (2) The Hits@10 scores of decentRL were slightly below those of AliNet. We argue that this is because decentRL is a purely end-to-end model that does not incorporate the additional data augmentation used in AliNet (Sun et al., 2020), which may improve the Hits@10 results. Moreover, decentRL is much easier to optimize, as it does not need to coordinate the hyper-parameters of each part of a pipeline. Also, there is no conflict in combining decentRL with the data augmentation algorithm for further improvement. We also evaluated the graph-based models in the open-world setting. Specifically, we first split the testing entity set into two subsets, namely the known entity set and the unknown entity set. Then, those triples in the training set that contain unknown entities (in the non-open-world setting, all triples are used in training) were moved to the testing triple set and were only available during the testing phase. We followed a classical dataset split used in the entity alignment task, with 20% of the entities in the original testing set sampled as open entities. Table 6 in Appendix C.1 compares the datasets before and after re-splitting. The experimental results are shown in Figure 2. We find that decentRL outperformed GAT and AliNet (the second-best model) on all metrics. Although its performance dropped slightly compared with that in the closed setting, the results of the others (especially GAT, which uses only the self representation as the "query") suffered more under this open-world setting. Overall, decentRL is capable of achieving state-of-the-art performance on both open and conventional entity alignment tasks.

5.4. ENTITY PREDICTION RESULTS

We also evaluated decentRL on the entity prediction task, in comparison with different GNN-based models: D-GCN (Marcheggiani & Titov, 2017), R-GCN (Schlichtkrull et al., 2018), W-GCN (Shang et al., 2019) and CompGCN (Vashishth et al., 2020). The results on FB15K-237 are shown in Table 2, from which we observe that: (1) decentRL significantly outperformed all the other models on many metrics, especially MR (mean rank). This demonstrates that decentRL can learn better representations for both popular entities (valued by the MRR metric) and long-tail entities (valued by the MR metric). (2) decentRL boosted DistMult to almost state-of-the-art performance on FB15K-237. The simpler model, TransE, also gained great improvement on all metrics. The reason may be that DAN discovered better aggregation weights and our auto-distiller continuously refined the output representations. The corresponding results on open entity prediction are shown in Figure 3. We added a state-of-the-art yet more complicated GNN-based model, CompGCN + ConvE, to the comparison, and observe that decentRL + DistMult outperformed all the other models under this open-world setting, which verifies its effectiveness in inductive learning with only structure data. decentRL + TransE achieved the second-best performance, followed by CompGCN + ConvE. Overall, decentRL provided the decoders with better representations and supported them in achieving competitive and even better performance than the cutting-edge models.

Table 3 : Entity prediction results on FB15K and WN18.


Table 4 : Entity prediction results on FB15K-237 and WN18RR.

Table 5 : Ablation study of entity alignment on DBP15K (average of 5 runs).

The detailed results on entity prediction are shown in Tables 3 and 4, respectively. On the conventional benchmarks FB15K and WN18, which have been widely used for many years, decentRL with only simple decoders achieved competitive or even better performance compared with the state-of-the-art models. Furthermore, decentRL greatly improved the best results on MR, as it can more efficiently aggregate neighbor information to learn high-quality representations for "challenging" entities. On the other hand, we find that the performance of decentRL on FB15K-237 and WN18RR is not as promising as that in Table 3, although it still achieved the best Hits@10 and MR on FB15K-237. We argue that this may be caused by the insufficient predictive ability of the simple decoders (Guo et al., 2019). However, we currently do not plan to adapt decentRL to complex decoders like ConvE, as such complicated architectures largely increase the time and space complexity. For example, CompGCN with ConvE needs at least two days of training even on a small dataset like FB15K-237. Overall, some simple linear KG representation learning models (i.e., TransE, DistMult, and ComplEx) benefited greatly from decentRL, and even outperformed some cutting-edge models.

Figure 4 : Hits@1 results of each layer and the concatenation. The results of AliNet are from (Sun et al., 2020). It has no L3 and L4 scores as its best performance was achieved by a two-layer model.

5.5. COMPARISON WITH ALTERNATIVE MODELS

To exemplify the effectiveness of each module in decentRL, we derived a series of alternative models from decentRL and report the experimental results on entity alignment in Table 5. "centRL" denotes the model that uses DAN but with self-loops added to the adjacency matrix. From the results, we observe that all the models in the table achieved state-of-the-art performance on the Hits@1 metric, as DAN, which leverages all neighbors as queries, can better summarize the neighbor representations of entities. On the other hand, we also find that decentRL + auto-distiller outperformed all the other alternatives. The centralized model centRL + auto-distiller had a performance drop compared with the decentralized one. The main reason is that entities in centRL also participated in their own aggregations, which disturbed the roles of the original representations. Please see Appendix C.2 for the corresponding results under the open entity alignment setting.

5.6. COMPARISON OF THE OUTPUT EMBEDDINGS OF EACH LAYER

We also compared the performance of each layer in decentRL and AliNet. As shown in Figure 4, decentRL consistently outperformed AliNet on each layer except the input layer. As mentioned before, decentRL does not take the original representation of an entity as input, but this representation can still gain knowledge by participating in the aggregations of its neighbors and then teach the student (i.e., the corresponding output representation). The performance of the input layer was not as good as that of AliNet, because the latent information in this layer may not be aligned in each dimension. On the other hand, we also observe that concatenating the representations of all layers in decentRL further improved the performance, with a maximum increase of 0.025 (0.023 for AliNet). Furthermore, decentRL gains more benefit from increasing the number of layers, while the performance of AliNet starts to drop when the number of layers exceeds 2 (Sun et al., 2020).

A DECENTRALIZED ATTENTION NETWORKS

A.1 ARCHITECTURE

We illustrate a four-layer decentralized attention network (DAN) in Figure 5. Following existing works such as AliNet (Sun et al., 2020) and CompGCN (Vashishth et al., 2020) (see also Ye et al., 2019; Wu et al., 2019), we also combine the relation embeddings in the aggregation. The original entity embeddings (i.e., g^{-1}), relation embeddings and weight matrices are randomly initialized before training. At step 0, we initialize g^0 from the original entity embeddings with the mean aggregator. At step 1, g^{-1} and g^0 are fed into DAN. Then, we combine the hidden representations with the relation embeddings (steps 2, 3). Finally, we obtain the output of the first layer, g^1 (step 4). Repeating steps 1-4, we can sequentially obtain the outputs of the subsequent layers. In Figure 5, grey, blue, light blue and orange nodes denote original entity embeddings, decentralized entity embeddings, hidden entity embeddings and relation embeddings, respectively. Taking the first layer as an example: as DAN requires the output embeddings of the two previous layers as input, we randomly initialize the original embeddings as g^{-1} (identical to d^{-1}), and use an appropriate aggregator to generate the initial decentralized embeddings g^0 (identical to d^0).

A.2 IMPLEMENTATION DETAILS

Infrastructure. Following GAT (Velickovic et al., 2018), we adopt dropout (Srivastava et al., 2014) and layer normalization (Ba et al., 2016) for each module in DAN. To fairly compare decentRL with other models, we do not leverage the multi-head attention mechanism (Vaswani et al., 2017; Velickovic et al., 2018), which has not been used in other GNN-based models, although it could be easily integrated. Furthermore, we use residual connections (He et al., 2016) between different layers of DAN to avoid over-smoothing, in which we not only take the output of the previous layer as the "residual", but also involve the output of the mean aggregator (i.e., g^0_i):

g^k_i := g^0_i + g^{k-1}_i + g^k_i.

For simplicity, in the rest of the paper, we still use g^k_i to denote the output at layer k.

Adaptation to different tasks. For different KG representation learning tasks, we also consider different adaptation strategies to achieve better performance. For the entity alignment task, we follow the existing GNN-based models (Sun et al., 2020; Wang et al., 2018) and concatenate the output representations of all layers as the final output:

g_i = [g^1_i ‖ ... ‖ g^K_i].

On the other hand, the entity prediction task values predictive ability rather than the learned representations (Guo et al., 2019). We thus only use the output of the last layer as the final output representation, which allows us to choose a larger batch size or hidden size to obtain better performance:

g_i = g^K_i.

To enhance the predictive ability of the decoders, here we only regard the mutual information-based loss as a kind of regularization (similar to MINE (Belghazi et al., 2018)), and thus re-scale the loss weight to 0.001.
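The residual connection and the task-specific readout above can be sketched as follows (our own helper names; shapes only, no training logic):

```python
import numpy as np

def combine_residual(g0_i, g_prev_i, g_k_i):
    """Residual connection used in DAN: g^k_i := g^0_i + g^{k-1}_i + g^k_i,
    where g^0_i is the mean-aggregator output."""
    return g0_i + g_prev_i + g_k_i

def final_output(layer_outputs, task="alignment"):
    """Task-specific readout: concatenate all layer outputs for entity
    alignment; keep only the last layer's output for entity prediction."""
    if task == "alignment":
        return np.concatenate(layer_outputs)
    return layer_outputs[-1]
```

With K layers of hidden size d, the alignment readout is a K*d-dimensional vector, while the prediction readout stays d-dimensional, which is what permits the larger batch or hidden sizes mentioned above.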

B AUTOMATIC KNOWLEDGE DISTILLATION

B.1 INSIGHT

Existing works usually choose to jointly optimize the two input variables of the density function f, in which the two variables can be regarded as the outputs of two different models. For example, InfoNCE uses an encoder to obtain the latent representations and another model to summarize those representations into one context vector. This is similar to DeepInfoMax and DeepGraphInfo, which also leverage two models to obtain local features and summarize global features, respectively. However, in our case, the mutual information that we want to maximize is between the input and output of the same model, where the input vectors are just randomly initialized raw embeddings. We argue that jointly optimizing the original input e_i with the output g_i in Equation (7) may drive e_i to completely match g_i. To resolve this problem, we only use the copy of e_i when estimating the mutual information density between e_i and g_i. In other words, we do not update the gradient of e_i in Equation (7), which leads to a natural knowledge distillation architecture. Specifically, we separately optimize e_i and g_i in different training examples or batches. The first step corresponds to the former part of Equation (8):

argmax_{g_i, f} E_{X_i} [ log ( f(g_i, ê_i) / \sum_{e_j \in X_i} f(g_i, ê_j) ) ].   (13)

Here, e_i serves as a "pre-trained" teacher model to teach a "student"; hence, the learnable parameters are g_i and f. As aforementioned, e_i needs to participate in learning the representations of its neighbors, during which it gains knowledge to teach its student g_i. This step is achieved by the latter part of Equation (8):

argmax_{e_i} \sum_{e_j \in N_i} E_{X_j} [ log ( f(g_j, ê_j) / \sum_{e_k \in X_j} f(g_j, ê_k) ) ],   (14)

where our aim is to find the optimal e_i that maximizes the mutual information between the original and output representations of its neighbors.

B.2 THE LOWER-BOUND OF MUTUAL INFORMATION

We do not really need to explicitly separate the training procedure into the two steps described in Appendix B.1, as is widely done in adversarial learning. Instead, this knowledge distillation mechanism can be achieved automatically across different mini-batches. Specifically, if we expand Equation (13) a little, we obtain:

(N_i, Θ, f) = argmax_{N_i, Θ, f} E_{X_i} [ log f(G(N_i), ê_i) / Σ_{e_j ∈ X_i} f(G(N_i), ê_j) ],

where N_i = {e_j | e_j ∈ N_i} is the original neighbor embedding set of e_i, and Θ denotes the parameters of our decentralized model G. As the optimal Θ for the model depends on the neighbor representation set N_i, and the optimal density function f in turn relies on the output of the model, it is impossible to search the whole space for the best parameters. In practice, we choose to optimize a weaker lower bound on the mutual information I(g_i, ê_i) (Tian et al., 2020). In this case, a relatively optimal neighbor embedding e*_x in Equation (15) is:

e*_x = argmax_{e_x} E_{X_i} [ log f(G(N_i), ê_i) / Σ_{e_j ∈ X_i} f(G(N_i), ê_j) ],

and we have:

I(g_i, ê_i | e*_x) = E_{X_i} [ log f(G({e_1, ..., e*_x, ..., e_{|N_i|}}), ê_i) / Σ_{e_j ∈ X_i} f(G({e_1, ..., e*_x, ..., e_{|N_i|}}), ê_j) ]
                   ≤ E_{X_i} [ log f*(G*(N*_i), ê_i) / Σ_{e_j ∈ X_i} f*(G*(N*_i), ê_j) ]   (18)
                   ≤ I(g*_i, ê_i) - log(|X_i|) ≤ I(g*_i, ê_i),   (19)

where * denotes the optimal setting of the corresponding parameters. Equations (18) and (19) follow from the conclusion of InfoNCE, given that |X_i| is large enough. The above inequalities suggest that optimizing e_x alone also lower-bounds the mutual information, without requiring the other parameters to be perfectly assigned. Considering that the entity e_x may have more than one neighbor, we can optimize those cases together:

e*_x = argmax_{e_x} Σ_{e_j ∈ N_x} E_{X_j} [ log f(G(N_j), ê_j) / Σ_{e_k ∈ X_j} f(G(N_j), ê_k) ].

Evidently, the above equation is identical to Equation (14), which means that optimizing Equation (15) subsequently contributes to optimizing the original neighbor representations.
Therefore, the proposed architecture can automatically distill knowledge, in different mini-batches, from the original representations into the output representations.
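This automatic alternation can be made concrete with a toy example, entirely our own construction (the embedding table, the mean-aggregator stand-in for G, and the neighbor lists are assumptions): in a mini-batch centered on entity i, the raw embedding e_i is detached as the teacher, so gradients reach the raw embeddings only of i's neighbors. An entity's own e_i is therefore updated exactly in those batches where it serves as a neighbor, with no explicit two-step schedule.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb = torch.nn.Embedding(5, 8)           # raw embeddings e_0, ..., e_4 (toy)
W = torch.nn.Linear(8, 8, bias=False)    # stand-in for the decentralized model G
neighbors = {0: [1, 2], 1: [0, 3], 2: [4, 0]}  # fixed toy neighbor lists

def batch_loss(entity_ids):
    losses = []
    for i in entity_ids:
        n = torch.tensor(neighbors[i])
        g_i = W(emb(n).mean(dim=0))              # g_i built from neighbors only
        logits = g_i @ emb.weight.detach().t()   # f(g_i, ê_j) against detached ê
        losses.append(F.cross_entropy(logits.unsqueeze(0), torch.tensor([i])))
    return torch.stack(losses).mean()
```

After backpropagating `batch_loss([0])`, only the rows of `emb.weight` belonging to entities 1 and 2 (the neighbors of entity 0) carry gradient, while row 0 stays untouched: e_0 will instead be trained by the batches in which it appears as a neighbor.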

C FURTHER ANALYSIS

C.1 DATASET DETAILS

The detailed statistics of the entity alignment datasets are shown in Table 6. Although we set only 20% of entities in the testing set as open entities, more than 20% of the triples were actually removed from the training set. For details of the datasets used in entity prediction, we refer readers to (Bordes et al., 2013) and (Dettmers et al., 2018).

C.2 ABLATION STUDY ON OPEN ENTITY ALIGNMENT

We also conducted an ablation study on the open entity alignment task, as shown in Table 7. The experimental results are, in principle, consistent with those on conventional entity alignment. The proposed architecture (decentRL + auto-distiller) still outperformed the other alternatives. By contrast, the performance of the centralized model with auto-distiller dropped significantly, whereas it had almost identical performance to decentRL + infoNCE in Table 5. Another point worth noting is that the gap on Hits@10 narrowed in the open entity alignment task, which may be because the training data shrank considerably after removing the triples referring to unseen entities.

We also evaluated decentRL under different dimension settings. The results are shown in Table 8. As the input dimension (i.e., embedding size) increased, the performance of decentRL improved quickly: with dimension 128 it achieved performance comparable to the state-of-the-art methods (e.g., AliNet with dimension 300), and it outperformed them at dimension 256. Furthermore, decentRL continually benefited from larger hidden sizes; even when the dimension was set to 512, the improvement was still significant.



CONCLUSION

In this paper, we proposed decentralized KG representation learning, which explores a new and straightforward way to learn representations in an open-world setting with only structure information. The corresponding end-to-end framework achieved highly competitive performance on both the entity alignment and entity prediction tasks.



Figure 1: Comparing graph attention network (GAT) with decentralized attention network (DAN).

Figure 2: Open entity alignment results on DBP15K. Bars with dotted lines denote the performance drop compared with the corresponding results in the non-open setting; the same applies to the following figures.

Figure 3: MRR results on open FB15K-237.

Figure 5: Overview of a four-layer decentralized attention network (DAN). Best viewed in color. Grey, blue, light blue and orange nodes denote original entity embeddings, decentralized entity embeddings, hidden entity embeddings and relation embeddings, respectively. Taking the first layer as an example: since DAN requires the output embeddings of the two previous layers as input, we randomly initialize the original embeddings as g^{-1} (identical to d^{-1}) and use an appropriate aggregator to generate the initial decentralized embeddings g^0 (identical to d^0).

Result comparison of entity alignment on DBP15K.

Entity prediction results on FB15K-237.

Entity prediction results on FB15K-237 and WN18RR.

"†" denotes methods executed using the source code with the provided best parameter settings.


Table 6: Statistics of the entity alignment datasets.

Table 7: Ablation study on open entity alignment. Average of 5 runs.

Table 8: Performance of decentRL with different dimensions. Average of 5 runs.

