EXPLAINABLE SUBGRAPH REASONING FOR FORECASTING ON TEMPORAL KNOWLEDGE GRAPHS

Abstract

Modeling time-evolving knowledge graphs (KGs) has recently gained increasing interest. Graph representation learning has become the dominant paradigm for link prediction on temporal KGs, but embedding-based approaches largely operate in a black-box fashion, lacking the ability to interpret their predictions. This paper provides a link forecasting framework that reasons over query-relevant subgraphs of temporal KGs and jointly models the structural dependencies and the temporal dynamics. In particular, we propose a temporal relational attention mechanism and a novel reverse representation update scheme to guide the extraction of an enclosing subgraph around the query. The subgraph is expanded by iteratively sampling temporal neighbors and by attention propagation. Our approach provides human-understandable evidence explaining the forecast. We evaluate our model on four benchmark temporal knowledge graphs for the link forecasting task. While being more explainable, our model obtains a relative improvement of up to 20% on Hits@1 compared to the previous best temporal KG forecasting method. We also conduct a survey with 53 respondents, and the results show that the evidence extracted by the model for link forecasting is aligned with human understanding.

1. INTRODUCTION

Reasoning, a process of inferring new knowledge from available facts, has long been considered an essential topic in AI research. Recently, reasoning on knowledge graphs (KGs) has gained increasing interest (Das et al., 2017; Ren et al., 2020; Hildebrandt et al., 2020). A knowledge graph is a graph-structured knowledge base that stores factual information in the form of triples (s, p, o), e.g., (Alice, livesIn, Toronto). In particular, s (subject) and o (object) are expressed as nodes and p (predicate) as an edge type. Most knowledge graph models assume that the underlying graph is static. However, in the real world, facts and knowledge can change with time. For example, (Alice, livesIn, Toronto) becomes invalid after Alice moves to Vancouver. To accommodate time-evolving multi-relational data, temporal KGs have been introduced (Boschee et al., 2015), where a temporal fact is represented as a quadruple by extending the static triple with a timestamp t indicating that the triple is valid at t, e.g., (Barack Obama, visit, India, 2010-11-06). In this work, we focus on forecasting on temporal KGs, where we infer future events based on past events. Forecasting on temporal KGs can improve a plethora of downstream applications such as decision support in personalized health care and finance. These use cases often require the predictions made by the learning models to be interpretable, such that users can understand and trust them. However, current machine learning approaches (Trivedi et al., 2017; Jin et al., 2019) for temporal KG forecasting operate in a black-box fashion: they design an embedding-based score function to estimate the plausibility of a quadruple. These models cannot clearly show which evidence contributes to a prediction and lack explainability, making them less suitable for many real-world applications.
Explainable approaches can generally be categorized into post-hoc interpretable methods and integrated transparent methods (Došilović et al., 2018). Post-hoc interpretable approaches (Montavon et al., 2017; Ying et al., 2019) aim to interpret the results of a black-box model, while integrated transparent approaches (Das et al., 2017; Qiu et al., 2019; Wang et al., 2019) have an explainable internal mechanism. In particular, most integrated transparent approaches for KGs (Lin et al., 2018; Hildebrandt et al., 2020) employ path-based methods to derive an explicit reasoning path and demonstrate a transparent reasoning process. Path-based methods focus on finding the answer to a query within a single reasoning chain. However, many complicated queries require multiple supporting reasoning chains rather than just one reasoning path. Recent work (Xu et al., 2019; Teru et al., 2019) has shown that reasoning over local subgraphs substantially boosts performance while maintaining interpretability. However, these explainable models cannot be applied to temporal graph-structured data because they do not take time information into account. This work aims to design a transparent forecasting mechanism on temporal KGs that can generate informative explanations of its predictions. In this paper, we propose xERTE, an explainable reasoning framework for forecasting future links on temporal knowledge graphs, which employs a sequential reasoning process over local subgraphs. To answer a query of the form (subject e_q, predicate p_q, ?, timestamp t_q), xERTE starts from the query subject, iteratively samples relevant edges of entities included in the subgraph, and propagates attention along the sampled edges. After several rounds of expansion and pruning, the missing object is predicted from the entities in the subgraph. Thus, the extracted subgraph can be seen as a concise and compact graphical explanation of the prediction.
To guide the subgraph to expand in the direction of the query's interest, we propose a temporal relational graph attention (TRGA) mechanism. We pose temporal constraints on message passing to preserve the causal nature of the temporal data. Specifically, we update the time-dependent hidden representation of an entity e_i at a timestamp t by attentively aggregating messages from its temporal neighbors that were linked with e_i prior to t. We call such temporal neighbors the prior neighbors of e_i. Additionally, we use an embedding module consisting of stationary entity embeddings and a functional time encoding, enabling the model to capture both global structural information and temporal dynamics. Furthermore, we develop a novel representation update mechanism to mimic human reasoning behavior. When humans perform a reasoning process, their perceived profiles of observed entities update as new clues are found. Thus, it is necessary to ensure that all entities in a subgraph can receive messages from prior neighbors newly added to the subgraph. To this end, the proposed representation update mechanism enables every entity to receive messages from its farthest prior neighbors in the subgraph. The major contributions of this work are as follows. (1) We develop xERTE, the first explainable model for predicting future links on temporal KGs. The model is based on a temporal relational attention mechanism that preserves the causal nature of the temporal multi-relational data. (2) Unlike most black-box embedding-based models, xERTE visualizes the reasoning process and provides an interpretable inference graph to emphasize important evidence. (3) The dynamic pruning procedure enables our model to perform reasoning on large-scale temporal knowledge graphs with millions of edges. (4) We apply our model to forecast future links on four benchmark temporal knowledge graphs.
The results show that our method achieves on average a better performance than current state-of-the-art methods, thus providing a new baseline. (5) We conduct a survey with 53 respondents to evaluate whether the extracted evidence is aligned with human understanding.

2. RELATED WORK

Representation learning is an expressive and popular paradigm underlying many KG models. The embedding-based approaches for knowledge graphs can generally be categorized into bilinear models (Nickel et al., 2011; Yang et al., 2014; Ma et al., 2018a), translational models (Bordes et al., 2013; Lv et al., 2018; Sun et al., 2019; Hao et al., 2019), and deep-learning models (Dettmers et al., 2017; Schlichtkrull et al., 2018). However, these methods cannot exploit the rich temporal dynamics available on temporal knowledge graphs. To this end, several studies have addressed temporal knowledge graph reasoning (García-Durán et al., 2018; Ma et al., 2018b; Jin et al., 2019; Goel et al., 2019; Lacroix et al., 2020; Han et al., 2020a;b; Zhu et al., 2020). The published approaches are largely black-box, lacking the ability to interpret their predictions. Recently, several explainable reasoning methods for knowledge graphs have been proposed (Das et al., 2017; Xu et al., 2019; Hildebrandt et al., 2020; Teru et al., 2019). However, these explainable methods can only deal with static KGs, while our model is designed for interpretable forecasting on temporal KGs.

3. PRELIMINARIES

Let E and P represent a finite set of entities and predicates, respectively. A temporal knowledge graph is a collection of timestamped facts written as quadruples. A quadruple q = (e_s, p, e_o, t) represents a timestamped and labeled edge between a subject entity e_s ∈ E and an object entity e_o ∈ E, where p ∈ P denotes the edge type (predicate). The temporal knowledge graph forecasting task aims to predict unknown links at future timestamps based on observed past events.

Definition 1 (Temporal KG forecasting). Let F represent the set of all ground-truth quadruples, and let (e_q, p_q, e_o, t_q) ∈ F denote the target quadruple. Given a query (e_q, p_q, ?, t_q) derived from the target quadruple and a set of observed prior facts O = {(e_i, p_k, e_j, t_l) ∈ F | t_l < t_q}, the temporal KG forecasting task is to predict the missing object entity e_o. Specifically, we consider all entities in the set E as candidates and rank them by their likelihood of forming a true quadruple together with the given subject-predicate pair at timestamp t_q.

For a given query q = (e_q, p_q, ?, t_q), we build an inference graph G_inf to visualize the reasoning process. Unlike in temporal KGs, where a node represents an entity, each node in G_inf is an entity-timestamp pair. The inference graph is a directed graph in which a link points from a node with a later timestamp to a node with an earlier timestamp.

Definition 2 (Node in Inference Graph and its Temporal Neighborhood). Let E represent all entities, F denote all ground-truth quadruples, and let t represent a timestamp. A node in an inference graph G_inf is defined as an entity-timestamp pair v = (e_i, t), e_i ∈ E. We define the set of one-hop prior neighbors of v as N_{v=(e_i,t)} = {(e_j, t') | (e_i, p_k, e_j, t') ∈ F ∧ t' < t}. For simplicity, we denote the one-hop prior neighbors as N_v. Similarly, we define the set of one-hop posterior neighbors of v as N'_{v=(e_i,t)} = {(e_j, t') | (e_j, p_k, e_i, t') ∈ F ∧ t' > t}. We denote them as N'_v for short. We provide an example in Figure 4 in the appendix to illustrate the inference graph.
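For illustration, the neighborhoods of Definition 2 can be read directly off a list of quadruples. The sketch below uses hypothetical entity and predicate names and integer timestamps; it is not the authors' implementation:

```python
from typing import List, Set, Tuple

Quad = Tuple[str, str, str, int]  # (subject, predicate, object, timestamp)

def prior_neighbors(quads: List[Quad], entity: str, t: int) -> Set[Tuple[str, int]]:
    """One-hop prior neighbors of node v = (entity, t): pairs (e_j, t')
    such that (entity, p_k, e_j, t') is a known fact with t' < t."""
    return {(o, ts) for (s, p, o, ts) in quads if s == entity and ts < t}

def posterior_neighbors(quads: List[Quad], entity: str, t: int) -> Set[Tuple[str, int]]:
    """One-hop posterior neighbors of v = (entity, t): pairs (e_j, t')
    such that (e_j, p_k, entity, t') is a known fact with t' > t."""
    return {(s, ts) for (s, p, o, ts) in quads if o == entity and ts > t}
```

Note that only edges where the node's entity appears as subject contribute prior neighbors, matching the definition above; practical implementations often add reverse edges to the KG so that both directions are reachable.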

4. OUR MODEL

We describe xERTE in a top-down fashion: Section 4.1 provides an overview, and Sections 4.2 to 4.7 explain each module.

4.1. SUBGRAPH REASONING PROCESS

Our model conducts the reasoning process on a dynamically expanded inference graph G_inf extracted from the temporal KG. We show a toy example in Figure 1. Given a query q = (e_q, p_q, ?, t_q), we initialize G_inf with the node v_q = (e_q, t_q) consisting of the query subject and the query time. The inference graph expands by sampling prior neighbors of v_q. For example, suppose that (e_q, p_k, e_j, t') is a valid quadruple with t' < t_q; we then add the node v_1 = (e_j, t') into G_inf and link it with v_q, where the link is labeled with p_k and points from v_q to v_1. We use an embedding module to assign each node and predicate included in G_inf a temporal embedding that is shared across queries. The main goal of the embedding module is to let the nodes access query-independent information and get a broad view of the graph structure, since the following temporal relational graph attention (TRGA) layer only performs query-dependent message passing locally. Next, we feed the inference graph into the TRGA layer, which takes node embeddings and predicate embeddings as input, produces a query-dependent representation for each node by passing messages on the small inference graph, and computes a query-dependent attention score for each edge. As explained in Section 4.5, we propagate the attention of each node to its prior neighbors using the edge attention scores. Then we further expand G_inf by sampling the prior neighbors of the nodes in G_inf. Unchecked, the expansion would grow rapidly and cover almost all nodes after a few steps. To prevent the inference graph from exploding, we reduce the number of edges by pruning the edges that gain less attention. As the expansion and pruning iterate, G_inf accumulates more and more information from the temporal KG. After running L inference steps, the model selects the entity with the highest attention score in G_inf as the prediction of the missing query object, where the inference graph itself serves as a graphical explanation.

Figure 1: Model Architecture. We take the second inference step (l = 2) as an example. Each directed edge points from a source node to its prior neighbor. Nodes that have not been sampled are shown with a distinct marker. a^l_i denotes the attention score of node v_i at the l-th inference step. α^l_{i,j} is the attention score of the edge between node i and its prior neighbor j at the l-th inference step. Note that all scores are query-dependent. For simplicity, we do not show edge labels (predicates) in the figure.

4.2. NEIGHBORHOOD SAMPLING

We define the set of edges between node v = (e_i, t) and its prior neighbors N_v as Q_v, where q_v ∈ Q_v is a prior edge of v. To reduce the complexity, we sample a subset of prior edges Q̂_v ⊆ Q_v at each inference step. We denote the remaining prior neighbors and posterior neighbors of node v after the sampling as N̂_v and N̂'_v, respectively. Note that there might be multiple edges between node v and its prior neighbor u because of multiple predicates. If at least one edge between v and u has been sampled, we add u into N̂_v. The sampling can be uniform if there is no bias, or it can be temporally biased using a non-uniform distribution. For instance, we may want to sample more edges closer to the current time point, as events that took place long ago may have less impact on the inference. Specifically, we propose three different sampling strategies: (1) Uniform sampling. Each prior edge q_v ∈ Q_v has the same probability of being selected: P(q_v) = 1/|Q_v|. (2) Time-aware exponentially weighted sampling. We temporally bias the neighborhood sampling using an exponential distribution and assign each prior edge the probability P(q_v = (e_i, p_k, e_j, t')) = exp(t' − t) / Σ_{(e_i, p_l, e_m, t'') ∈ Q_v} exp(t'' − t), which negatively correlates with the time difference between node v and its prior neighbor (e_j, t'). Note that t' and t'' are prior to t. (3) Time-aware linearly weighted sampling. We use a linear function to bias the sampling. Compared to the second strategy, quadruples that occurred at early stages have a higher probability of being sampled. Overall, we have empirically found that the second strategy is most beneficial to our framework; we provide a detailed ablation study in Section 5.2.
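The exponentially weighted strategy can be sketched as follows, using only the standard library; the edge format is hypothetical, and the without-replacement draw uses the Efraimidis-Spirakis weighted-key trick rather than whatever the authors implemented:

```python
import math
import random

def exp_sampling_probs(prior_edges, t):
    """P(edge with timestamp t') is proportional to exp(t' - t); since
    t' < t, recent edges get weights near 1 and old edges decay fast."""
    weights = [math.exp(ts - t) for (_, _, _, ts) in prior_edges]
    total = sum(weights)
    return [w / total for w in weights]

def sample_prior_edges(prior_edges, t, n, seed=0):
    """Draw up to n prior edges without replacement, biased toward
    recency (Efraimidis-Spirakis: key = u**(1/p), keep the n largest)."""
    rng = random.Random(seed)
    probs = exp_sampling_probs(prior_edges, t)
    keys = [rng.random() ** (1.0 / p) for p in probs]
    order = sorted(range(len(prior_edges)), key=lambda i: -keys[i])
    return [prior_edges[i] for i in order[:n]]
```

Because the weights depend only on the time difference t − t', the probability ratio between two edges one time unit apart is exactly e, which is what makes long-past events rapidly negligible.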

4.3. EMBEDDING

In temporal knowledge graphs, graph structures are no longer static, as entities and their links evolve over time. Thus, entity features may change and exhibit temporal patterns. In this work, the embedding of an entity e_i ∈ E at time t consists of a static low-dimensional vector and a functional representation of time. The time-aware entity embedding is defined as e_i(t) = [ē_i || Φ(t)]^T ∈ R^{d_S + d_T}. Here, ē_i ∈ R^{d_S} represents the static embedding that captures time-invariant features and global dependencies over the temporal KG. Φ(·) denotes a time encoding that captures temporal dependencies between entities (Xu et al., 2020). We provide more details about Φ(·) in Appendix I. || denotes the concatenation operator. d_S and d_T represent the dimensionalities of the static embedding and the time embedding, which can be tuned according to the temporal fraction of the given dataset. We also tried the temporal encoding presented in Goel et al. (2019), which has significantly more parameters, but did not observe considerable improvements. Besides, we assume that predicate features do not evolve. Thus, we learn a stationary embedding vector p_k for each predicate p_k.
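A minimal numerical sketch of this embedding, assuming a cosine-based functional time encoding in the spirit of Xu et al. (2020); the frequencies ω and phases φ stand in for learned parameters:

```python
import numpy as np

def time_encoding(t, omega, phi):
    """Functional time encoding Phi(t): d_T cosine features whose
    frequencies (omega) and phases (phi) would be learned in the model."""
    d_t = omega.shape[0]
    return np.sqrt(1.0 / d_t) * np.cos(omega * t + phi)

def time_aware_embedding(e_static, t, omega, phi):
    """e_i(t) = [static embedding || Phi(t)]: concatenation of the
    d_S-dim static part with the d_T-dim time encoding."""
    return np.concatenate([e_static, time_encoding(t, omega, phi)])
```

Tuning d_S versus d_T then trades off how much capacity goes to time-invariant features versus temporal patterns, as described above.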

4.4. TEMPORAL RELATIONAL GRAPH ATTENTION LAYER

Here, we propose a temporal relational graph attention (TRGA) layer for identifying the evidence in the inference graph relevant to a given query q. The input to the TRGA layer is a set of entity embeddings e_i(t) and predicate embeddings p_k in the given inference graph. The layer produces a query-dependent attention score for each edge and a new set of hidden representations as its output. Similar to GraphSAGE (Hamilton et al., 2017) and GAT (Veličković et al., 2017), the TRGA layer performs a local representation aggregation. To avoid misusing future information, we only allow message passing from prior neighbors to posterior neighbors. Specifically, for each node v in the inference graph, the aggregation function fuses the representations of node v and of the sampled prior neighbors N̂_v to output a time-aware representation for v. Since entities may play different roles depending on the predicate they are associated with, we incorporate the predicate embeddings in the attention function to exploit relational information. Instead of treating all prior neighbors with equal importance, we take the query information into account and assign a varying importance level to each prior neighbor u ∈ N̂_v by calculating a query-dependent attention score

e^l_vu(q, p_k) = (W^l_sub (h^{l-1}_v || p^{l-1}_k || h^{l-1}_{e_q} || p^{l-1}_q)) · (W^l_obj (h^{l-1}_u || p^{l-1}_k || h^{l-1}_{e_q} || p^{l-1}_q)),

where e^l_vu(q, p_k) is the attention score of the edge (v, p_k, u) regarding the query q = (e_q, p_q, ?, t_q), p_k corresponds to the predicate between node u and node v, and p_k and p_q are predicate embeddings. h^{l-1}_v denotes the hidden representation of node v at the (l−1)-th inference step. When l = 1, i.e., for the first layer, h^0_v = W_v e_i(t') + b_v, where v = (e_i, t'). W^l_sub and W^l_obj are two weight matrices for capturing the dependencies between query features and node features.
Then, we compute the normalized attention score α^l_vu(q, p_k) using the softmax function:

α^l_vu(q, p_k) = exp(e^l_vu(q, p_k)) / Σ_{w ∈ N̂_v} Σ_{p_z ∈ P_vw} exp(e^l_vw(q, p_z)),

where P_vw represents the set of labels of edges that connect nodes v and w. Once obtained, we aggregate the representations of the prior neighbors and weight them using the normalized attention scores:

h̃^l_v(q) = Σ_{u ∈ N̂_v} Σ_{p_k ∈ P_vu} α^l_vu(q, p_k) h^{l-1}_u(q).

We combine the hidden representation h^{l-1}_v(q) of node v with the aggregated neighborhood representation h̃^l_v(q) and feed them into a fully connected layer with a LeakyReLU activation function σ(·):

h^l_v(q) = σ(W^l_h (γ h^{l-1}_v(q) + (1 − γ) h̃^l_v(q)) + b^l_h),

where h^l_v(q) denotes the representation of node v at the l-th inference step, and γ is a hyperparameter. Further, we use the same layer to update the predicate embeddings: p^l_k = W^l_h p^{l-1}_k + b^l_h. Thus, the predicates are projected into the same embedding space as the nodes and can be utilized in the next inference step.
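The attention computation and the gated update can be sketched numerically as follows (NumPy; randomly initialized matrices stand in for the learned parameters, and a single node with a flat list of neighbors replaces the per-predicate bookkeeping):

```python
import numpy as np

def edge_attention(h_v, h_u, p_k, h_q, p_q, W_sub, W_obj):
    """Unnormalized score e_vu(q, p_k): dot product between the projected
    subject-side and object-side concatenations [h || p_k || h_q || p_q]."""
    sub = W_sub @ np.concatenate([h_v, p_k, h_q, p_q])
    obj = W_obj @ np.concatenate([h_u, p_k, h_q, p_q])
    return float(sub @ obj)

def softmax(scores):
    """Numerically stabilized softmax over a 1-D score array."""
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

def update_node(h_v, neighbor_hs, alphas, W_h, b_h, gamma=0.5):
    """Gated update: LeakyReLU(W_h(gamma*h_v + (1-gamma)*sum_u a_u h_u) + b)."""
    h_agg = sum(a * h for a, h in zip(alphas, neighbor_hs))
    z = W_h @ (gamma * h_v + (1.0 - gamma) * h_agg) + b_h
    return np.where(z > 0, z, 0.01 * z)  # LeakyReLU with slope 0.01
```

Because the query representations h_{e_q} and p_q enter both projections, the same neighbor can receive very different attention under different queries, which is what makes the scores query-dependent.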

4.5. ATTENTION PROPAGATION AND SUBGRAPH PRUNING

Given the edge attention scores in the inference graph, we compute the attention score a^l_{v,q} of node v regarding query q at the l-th inference step as

a^l_{v,q} = Σ_{u ∈ N̂'_v} Σ_{p_z ∈ P_uv} α^l_uv(q, p_z) a^{l-1}_{u,q}.

Thus, we propagate the attention of each node to its prior neighbors. As stated in Definition 2, each node in the inference graph is an entity-timestamp pair. To assign each entity a unique attention score, we aggregate the attention scores of all nodes that contain the same entity:

a^l_{e_i,q} = g({a^l_{v,q} | v(e) = e_i, v ∈ V_{G_inf}}),

where a^l_{e_i,q} denotes the attention score of entity e_i, V_{G_inf} is the set of nodes in the inference graph G_inf, v(e) represents the entity included in node v, and g(·) represents a score aggregation function. We try two score aggregation functions g(·), i.e., summation and mean; an ablation study shows that the summation aggregation performs better. To demonstrate which evidence is important for the reasoning process, we assign each edge in the inference graph a contribution score. Specifically, the contribution score of an edge (v, p_k, u) is defined as c_vu(q, p_k) = α^l_vu(q, p_k) a^l_{v,q}, where node u is a prior neighbor of node v associated with the predicate p_k. We prune the inference graph at each inference step and keep only the K edges with the largest contribution scores. We set the attention score of entities not included in the inference graph to zero. Finally, we rank all entity candidates according to their attention scores and choose the entity with the highest score as our prediction.
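A toy sketch of attention propagation, per-entity aggregation, and top-K pruning (pure Python dictionaries; node identifiers and scores are illustrative, not the authors' data structures):

```python
def propagate_attention(node_scores, edges, edge_alpha):
    """Each node passes its score to its prior neighbors:
    a_u += alpha_vu * a_v for every edge (v, p, u) in the graph."""
    new_scores = {}
    for (v, p, u), alpha in zip(edges, edge_alpha):
        new_scores[u] = new_scores.get(u, 0.0) + alpha * node_scores.get(v, 0.0)
    return new_scores

def entity_scores(node_scores):
    """Sum node scores over nodes (e_i, t) sharing the same entity e_i
    (the summation aggregator, which performed best in the ablation)."""
    agg = {}
    for (entity, t), a in node_scores.items():
        agg[entity] = agg.get(entity, 0.0) + a
    return agg

def prune_edges(edges, edge_alpha, node_scores, k):
    """Keep the K edges with the largest contributions c_vu = alpha_vu * a_v."""
    contrib = sorted(zip(edges, edge_alpha),
                     key=lambda ea: -(ea[1] * node_scores.get(ea[0][0], 0.0)))
    return [edge for edge, _ in contrib[:k]]
```

Since the edge attention scores are normalized per node, propagation conserves each node's attention mass across its sampled prior edges, and pruning simply discards the edges that carry the least of it.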

4.6. REVERSE REPRESENTATION UPDATE MECHANISM

When humans perform a reasoning process, the perceived profile of an entity may change during the inference as new evidence joins the reasoning process. For example, suppose we want to predict the profitability of company A. We knew that A has the largest market share, which gives us a high expectation of A's profitability. However, new evidence shows that conglomerate B enters this market as a strong competitor. Although the new evidence is not directly related to A, it indicates that there will be strong competition between A and B, which lowers our expectation of A's profitability. To mimic this human reasoning behavior, we should ensure that all existing nodes in the inference graph G_inf can receive messages from nodes newly added to G_inf. However, since G_inf expands once at each inference step, it may include l-hop neighbors of the query subject at the l-th step. The vanilla solution is to iterate the message passing l times at the l-th inference step, which means that we would need to run the message passing (1 + L) · L/2 times in total for L inference steps. To avoid this quadratic increase of message-passing iterations, we propose a novel reverse representation update mechanism. Recall that, to avoid violating temporal constraints, we use prior neighbors to update nodes' representations, and at each inference step, we expand G_inf by adding prior neighbors of each node in G_inf. For example, assuming that we are at the fourth inference step, for a node that was added at the second step, we only need to aggregate messages from nodes added at the third and fourth steps. Hence, we can update the representations of nodes in the reverse order of the steps at which they were added to G_inf. Specifically, at the l-th inference step, we first update the representations of nodes added at the (l−1)-th inference step, then the nodes added at the (l−2)-th, and so forth until l = 0, as shown in Algorithm 1 in the appendix.
In this way, we compute messages along each edge in G inf only once and ensure that every node can receive messages from its farthest prior neighbor.
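The update schedule itself reduces to a single reverse loop over the insertion steps; a sketch with a hypothetical bookkeeping dict mapping each step to the nodes it added:

```python
def reverse_update_order(nodes_added_at_step, L):
    """At the L-th inference step, update nodes added at step L-1 first,
    then L-2, ..., down to step 0 (the query node). Because prior
    neighbors are always added at later steps, one pass in this order
    lets every node aggregate from its newest prior neighbors while
    computing each edge's message only once."""
    order = []
    for l in range(L - 1, -1, -1):
        order.extend(nodes_added_at_step.get(l, []))
    return order
```

This is why the total message-passing cost stays linear in the number of edges per inference step instead of growing quadratically with L.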

4.7. LEARNING

We split the quadruples of a temporal KG into train, validation, and test sets by timestamps, ensuring (timestamps of training set) < (timestamps of validation set) < (timestamps of test set). We use the binary cross-entropy as the loss function:

L = − (1/|Q|) Σ_{q ∈ Q} (1/|E^inf_q|) Σ_{e_i ∈ E^inf_q} [ y_{e_i,q} log(a^L_{e_i,q} / Σ_{e_j ∈ E^inf_q} a^L_{e_j,q}) + (1 − y_{e_i,q}) log(1 − a^L_{e_i,q} / Σ_{e_j ∈ E^inf_q} a^L_{e_j,q}) ],

where E^inf_q represents the set of entities in the inference graph of the query q, y_{e_i,q} is the binary label indicating whether e_i is the answer for q, and Q denotes the set of training quadruples. a^L_{e_i,q} denotes the attention score of e_i at the final inference step. We list all model parameters in Table 2 in the appendix. In particular, we jointly learn the embeddings and the other model parameters by end-to-end training.
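The loss can be sketched in a few lines (pure Python; `scores` maps each entity in a query's inference graph to its final attention score a^L, and `labels` holds the binary targets):

```python
import math

def xerte_loss(batch):
    """Binary cross-entropy over attention scores normalized within each
    query's inference graph, averaged over entities and then over queries.
    batch: list of (scores, labels), each mapping entity -> value."""
    total = 0.0
    for scores, labels in batch:
        norm = sum(scores.values())
        per_query = 0.0
        for entity, a in scores.items():
            p = a / norm  # normalized attention, treated as a probability
            y = labels[entity]
            per_query += y * math.log(p) + (1 - y) * math.log(1 - p)
        total += per_query / len(scores)
    return -total / len(batch)
```

Normalizing within E^inf_q means candidates compete only against the other entities in the same inference graph, not against the full entity set.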

5. EXPERIMENTS

5.1. DATASETS AND BASELINES

Our benchmark datasets are drawn from the Integrated Crisis Early Warning System (ICEWS) (Boschee et al., 2015) and YAGO (Mahdisoltani et al., 2013). We report results in terms of MRR and Hits@1/3/10 (%); the best results among all models are shown in bold.

5.2. RESULTS AND ABLATION STUDIES

Comparison results

Table 1 summarizes the time-aware filtered results of the link prediction task on the ICEWS and YAGO datasets. The time-aware filtering scheme only filters out triples that are genuine at the query time, while the filtering scheme applied in prior work (Jin et al., 2019; Zhu et al., 2020) filters all triples that occurred at any point in history. A detailed explanation is provided in Appendix D. Overall, xERTE outperforms all baseline models on ICEWS14/05-15/18 in MRR and Hits@1/3/10 while being more interpretable. Compared to the strongest baseline, RE-Net, xERTE obtains relative improvements of 5.60% in MRR and 15.15% in Hits@1, averaged over ICEWS14/05-15/18. Notably, xERTE achieves larger gains in Hits@1 than in Hits@10, which confirms the assumption that subgraph reasoning helps xERTE make sharp predictions by exploiting local structures. On the YAGO dataset, xERTE achieves results comparable to RE-Net in terms of MRR and Hits@1/3. To assess the importance of each component, we conduct several ablation studies and report their results in the following.
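The time-aware filtering scheme can be sketched as follows (hypothetical toy data): before ranking, we remove only those competing candidates that themselves form a genuine triple at the query time, whereas the older scheme would also remove candidates that were true at any other timestamp.

```python
def time_aware_filtered_rank(candidate_scores, answer, true_objects_at_t):
    """Rank of the ground-truth answer after removing candidates that
    also form a genuine triple (s, p, e, t) at the query time t.
    Objects that were true only at other timestamps stay in the ranking."""
    kept = {e: s for e, s in candidate_scores.items()
            if e == answer or e not in true_objects_at_t}
    ranking = sorted(kept, key=lambda e: -kept[e])
    return ranking.index(answer) + 1
```

MRR and Hits@k then follow directly from these per-query ranks.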

Representation update analysis

We train a model without the reverse representation update mechanism to investigate how this mechanism contributes to our model. Since the reverse representation update ensures that each node can receive messages from all its prior neighbors in the inference graph, we expect this mechanism to help nodes mine the available information. This update mechanism should be especially important for nodes that have only been involved in a small number of events: since the historical information of such nodes is quite limited, it is very challenging to forecast their future behavior. In Figures 2a and 2b, we plot Hits@1 and Hits@10 against the number of nodes in the inference graph. It can be observed that the model with the reverse update mechanism performs better in general. In particular, this update mechanism significantly improves the performance when the query subject has only a small number of neighbors in the subgraph, which meets our expectation.

Time-aware representation analysis and node attention aggregation

To verify the importance of the time embedding, we evaluate the performance of a model without time encoding. As shown in Figure 2c, removing the time-dependent part from the entity representations degrades the model's performance significantly. Recall that each node in the inference graph G_inf is associated with a timestamp; hence, the same entity might appear in several nodes of G_inf with different timestamps. To get a unified attention score for each entity, we aggregate the attention scores of all nodes that contain the same entity. Figure 2d shows that the summation aggregator brings a considerable gain on ICEWS14.

Sampling analysis

We run experiments with the different sampling strategies proposed in Section 4.2. To assess the necessity of the time-aware weighted sampling, we also propose a deterministic variant, where we chronologically sort the prior edges of node v by their timestamps and select the most recent N edges to build the subset Q̂_v. The experimental results are provided in Table 3 in the appendix. We find that the sampling strategy has a considerable influence on the model's performance. Sampling strategies that are biased towards recent quadruples perform better. Specifically, the exponentially time-weighted strategy outperforms both the linearly time-weighted strategy and the deterministic last-N-edges strategy.

Time cost analysis

The time cost of xERTE is affected not only by the scale of a dataset but also by the number of inference steps L. Thus, we measure the inference time and predictive power for different settings of L and show the results in Figures 2e and 2f. We see that the model achieves the best performance with L = 3, while the training time increases significantly as L grows. To make the computation more efficient, we develop a series of segment operations for subgraph reasoning. Please see Appendix G for more details.

5.3. GRAPHICAL EXPLANATION AND HUMAN EVALUATION

The extracted inference graph provides a graphical explanation of the model's prediction. As introduced in Section 4.5, we assign each edge in the inference graph a contribution score. Thus, users can trace back the important evidence on which the prediction mainly depends. We study a query chosen from the test set, where we predict whom Catherine Ashton will visit on Nov. 9, 2014, and show the final inference graph in Figure 3. In this case, the model's prediction is Oman, and (Catherine Ashton, express intent to meet or negotiate, Oman, 2014-11-04) is the most important evidence supporting this answer. To assess whether the evidence is informative for users in an objective setting, we conduct a survey in which respondents evaluate the relevance of the extracted evidence to the prediction. More concretely, we set up an online quiz consisting of 7 rounds. Each round is centered around a query sampled from the test set of ICEWS14/ICEWS05-15. Along with the query and the ground-truth answer, we present the human respondents with two pieces of evidence from the inference graph with high contribution scores and two pieces with low contribution scores, in randomized order. Specifically, each piece of evidence is based on a chronological reasoning path that connects the query subject with an object candidate. For example, given a query (police, arrest, ?, 2014-12-28), an extracted clue is that police made statements to lawyers on 2014-12-08, and lawyers were then criticized by citizens on 2014-12-10. In each round, we ask the participants three questions: to choose the most relevant evidence, to choose the most irrelevant evidence, and to sort the pieces of evidence according to their relevance. We then rank the evidence according to the contribution scores computed by our model and check whether the relevance order given by the respondents matches that estimated by our model. We surveyed 53 participants, and the average accuracy over all questions is 70.5%.
Moreover, based on a majority vote, 18 out of 21 questions were answered correctly, indicating that the extracted inference graphs are informative, and the model is aligned with human intuition. The complete survey and a detailed evaluation are reported in Appendix H.

6. CONCLUSION

We proposed an explainable reasoning approach for forecasting links on temporal knowledge graphs. The model extracts a query-dependent subgraph from a given temporal KG and performs an attention propagation process to reason on it. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method. We conducted a survey about the evidence included in the extracted subgraph. The results indicate that the evidence is informative for humans.

APPENDIX

Figure 4: The inference graph for the query (e_0, p_1, ?, t_3). The entity at an arrow's tail, the predicate on the arrow, and the entity and the timestamp at the arrow's head build a true quadruple. Specifically, the true quadruples in this graph are: {(e_0, p_1, e_1, t_1), (e_0, p_2, e_1, t_2), (e_0, p_3, e_2, t_2), (e_0, p_1, e_2, t_0)}. Note that t_3 is posterior to t_0, t_1, t_2.

Algorithm 1: Reverse Representation Update at the L-th Inference Step
Input: inference graph G_inf; nodes in the inference graph V; nodes that have been added to G_inf at the l-th inference step V_l; sampled prior neighbors N_v; hidden representations h^{L-1}_v at the (L-1)-th step; entity embeddings e_i; weight matrices W^L_sub, W^L_obj, and W^L_h; query q = (e_q, p_q, ?, t_q); update ratio γ.
Output: hidden representations h^L_v at the L-th inference step, ∀v ∈ V.
1: for l = L-1, ..., 0 do
2:   for v ∈ V_l do
3:     for u ∈ N_v do
4:       e^L_vu(q, p_k) = W^L_sub(h^{L-1}_v ‖ p_k ‖ h^{L-1}_{e_q} ‖ p_q) · W^L_obj(h^{L-1}_u ‖ p_k ‖ h^{L-1}_{e_q} ‖ p_q)
5:       α^L_vu(q, p_k) = exp(e^L_vu(q, p_k)) / Σ_{w ∈ N_v} Σ_{p_z ∈ P_vw} exp(e^L_vw(q, p_z))
6:     end for
7:     ĥ^L_v(q) = Σ_{u ∈ N_v} Σ_{p_z ∈ P_vu} α^L_vu(q, p_z) h^{L-1}_u(q)
8:     h^L_v(q) = σ(W^L_h(γ h^{L-1}_v(q) + (1 - γ) ĥ^L_v(q)) + b^L_h)
9:   end for
10: end for
11: return h^L_v, ∀v ∈ V
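To make the update concrete, here is a minimal NumPy sketch of lines 4–8 of Algorithm 1 for a single node v. It is an illustrative simplification, not the paper's implementation: random weight matrices stand in for the learned W^L_sub, W^L_obj, W^L_h, each neighbor carries exactly one predicate (the full model normalizes over predicate sets P_vu), and σ is assumed to be tanh.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def update_node(h_v, h_q, p_q, neighbors, W_sub, W_obj, W_h, b_h, gamma):
    """One reverse-update step for a single node v (Algorithm 1, lines 4-8).

    neighbors: list of (h_u, p_k) pairs - the hidden state of each sampled
    prior neighbor u and the embedding of the connecting predicate p_k.
    """
    # Line 4: unnormalized attention logits via two query-aware projections.
    logits = []
    for h_u, p_k in neighbors:
        sub = W_sub @ np.concatenate([h_v, p_k, h_q, p_q])
        obj = W_obj @ np.concatenate([h_u, p_k, h_q, p_q])
        logits.append(sub @ obj)
    # Line 5: normalize over all sampled (neighbor, predicate) pairs of v.
    alpha = softmax(np.array(logits))
    # Line 7: attention-weighted aggregation of neighbor messages.
    h_agg = sum(a * h_u for a, (h_u, _) in zip(alpha, neighbors))
    # Line 8: mix the old representation with the aggregate (ratio gamma).
    return np.tanh(W_h @ (gamma * h_v + (1 - gamma) * h_agg) + b_h)

rng = np.random.default_rng(0)
d = 4
W_sub, W_obj = rng.normal(size=(d, 4 * d)), rng.normal(size=(d, 4 * d))
W_h, b_h = rng.normal(size=(d, d)), np.zeros(d)
h_v, h_q, p_q = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
neighbors = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(3)]
out = update_node(h_v, h_q, p_q, neighbors, W_sub, W_obj, W_h, b_h, gamma=0.8)
```

Because the loop in Algorithm 1 runs from l = L-1 down to 0, this per-node update is applied to the most recently added nodes first, so information flows back toward the query subject.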

A RELATED WORK

A.1 KNOWLEDGE GRAPH MODELS

Representation learning is an expressive and popular paradigm underlying many KG models. The key idea is to embed entities and relations into a low-dimensional vector space. The embedding-based approaches for knowledge graphs can generally be categorized into bilinear models (Nickel et al., 2011; Balažević et al., 2019), translational models (Bordes et al., 2013; Sun et al., 2019), and deep-learning models (Dettmers et al., 2017; Schlichtkrull et al., 2018). Besides, several studies (Hao et al., 2019; Lv et al., 2018; Ma et al., 2017) explore the ontology of entity types and relation types and utilize type-based semantic similarity to produce better knowledge embeddings. However, the above methods cannot exploit the rich temporal dynamics available on temporal knowledge graphs. To this end, several studies have addressed link prediction on temporal knowledge graphs (Leblay & Chekol, 2018; García-Durán et al., 2018; Ma et al., 2018b; Dasgupta et al., 2018; Trivedi et al., 2017; Jin et al., 2019; Goel et al., 2019; Lacroix et al., 2020). Ma et al. (2018b) developed extensions of static knowledge graph models by adding timestamp embeddings to their score functions. Besides, García-Durán et al. (2018) suggested a straightforward extension of some existing static knowledge graph models that utilizes a recurrent neural network (RNN) to encode predicates with temporal tokens derived from given timestamps. Also, HyTE (Dasgupta et al., 2018) embeds time information in the entity-relation space by associating a temporal hyperplane with each timestamp. However, these models cannot generalize to unseen timestamps because they only learn embeddings for observed timestamps. Additionally, the methods are largely black-box and lack the ability to interpret their predictions, while our main focus is an integrated transparency mechanism that yields human-understandable results.

A.2 EXPLAINABLE REASONING ON KNOWLEDGE GRAPHS

Recently, several explainable reasoning methods for knowledge graphs have been proposed (Das et al., 2017; Xu et al., 2019; Hildebrandt et al., 2020). Das et al. (2017) proposed a reinforcement learning-based path searching approach that presents the query subject and predicate to an agent and lets it perform a policy-guided walk to the correct object entity. The reasoning paths produced by the agent can explain the prediction results to some extent. Also, Hildebrandt et al. (2020) framed the link prediction task as a debate game between two reinforcement learning agents that extract evidence from knowledge graphs, allowing users to understand the decisions made by the agents. More closely related to our work, Xu et al. (2019) model a sequential reasoning process by dynamically constructing an input-dependent subgraph. The difference is that these explainable methods can only deal with static KGs, while our model is designed for forecasting on temporal KGs.

B WORKFLOW

We show the workflow of the subgraph reasoning process in Figure 5. The model conducts the reasoning process on a dynamically expanding inference graph G_inf extracted from the temporal KG. This inference graph provides an interpretable graphical explanation of the final prediction. Given a query q = (e_q, p_q, ?, t_q), we initialize the inference graph with the query entity e_q and define the tuple (e_q, t_q) as the first node of the inference graph (Figure 5a). The inference graph expands by sampling neighbors that have been linked with e_q prior to t_q, as shown in Figure 5b. This expansion would proceed so rapidly that it would cover almost all nodes after a few steps. To prevent the inference graph from exploding, we constrain the number of edges by pruning those that are less related to the query (Figure 5c). Here, we use the query-dependent temporal relational attention mechanism proposed in Section 4.4 to identify the importance of nodes in the inference graph for query q and to aggregate information from each node's local neighbors. Next, we sample the prior neighbors of the remaining nodes in the inference graph to expand it further, as shown in Figure 5d. As this process iterates, the inference graph incrementally gains more and more information from the temporal KG. After running L inference steps, the model selects the entity with the highest attention score in G_inf as the prediction of the missing query object, where the inference graph itself serves as a graphical explanation.

We provide the statistics of the datasets in Table 4. Since we split each dataset into subsets by timestamps, ensuring (timestamps of the training set) < (timestamps of the validation set) < (timestamps of the test set), a considerable number of entities in the test sets are unseen. We report the number of entities in each subset in Table 5.
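The expand-and-prune loop described above can be sketched in a few lines of Python. This is a schematic, not the actual implementation: `sample_prior_neighbors` and `attention_score` are hypothetical stand-ins for the model's temporal neighbor sampler and the query-dependent temporal relational attention of Section 4.4.

```python
import heapq

def subgraph_reasoning(query, sample_prior_neighbors, attention_score,
                       num_steps=3, max_edges=10):
    """Sketch of the iterative inference-graph expansion.

    An edge is a (source_node, predicate, target_node) triple, where each
    node is an (entity, timestamp) pair and the target is a prior neighbor.
    """
    e_q, p_q, _, t_q = query
    nodes = {(e_q, t_q)}                 # initialize with the query node
    edges = set()
    frontier = {(e_q, t_q)}
    for _ in range(num_steps):
        # Expansion: sample neighbors linked with frontier nodes prior to t_q.
        for node in frontier:
            edges |= set(sample_prior_neighbors(node))
        # Pruning: keep only the edges most relevant to the query.
        edges = set(heapq.nlargest(max_edges, edges, key=attention_score))
        # Newly reached nodes form the next frontier.
        frontier = {edge[2] for edge in edges} - nodes
        nodes |= frontier
    return nodes, edges

# Toy demo: entity i links back to entities i+1 and i+2 one step earlier.
def toy_sampler(node):
    ent, t = node
    return [((ent, t), "p", (ent + 1, t - 1)), ((ent, t), "p", (ent + 2, t - 1))]

toy_nodes, toy_edges = subgraph_reasoning(
    (0, "p", None, 10), toy_sampler,
    attention_score=lambda e: -e[2][0],   # toy score: prefer small entity ids
    num_steps=2, max_edges=4)
```

Note that the prune step reconsiders all edges gathered so far, so an edge admitted early can still be dropped once more query-relevant evidence arrives.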

D EVALUATION PROTOCOL

For each quadruple q = (e_s, p, e_o, t) in the test set G_test, we create two queries: (e_s, p, ?, t) and (e_o, p^{-1}, ?, t), where p^{-1} denotes the reciprocal relation of p. For each query, the model ranks all entities E^inf_q in the final inference graph according to their attention scores. If the ground-truth entity does not appear in the final subgraph, we set its rank to |E| (the number of entities in the dataset). Let ψ_{e_s} and ψ_{e_o} denote the ranks of e_s and e_o for the two queries, respectively. We evaluate our model using standard metrics from the link prediction literature: the mean reciprocal rank, MRR = 1/(2·|G_test|) Σ_{q ∈ G_test} (1/ψ_{e_s} + 1/ψ_{e_o}), and Hits@k (k ∈ {1, 3, 10}), the percentage of times that the true entity candidate appears in the top k of the ranked candidates. In this paper, we consider two different filtering settings. The first follows the ranking technique described in Bordes et al. (2013), where we remove from the list of corrupted triples all triples that appear either in the training, validation, or test set. We name it static filtering. Trivedi et al. (2017), Jin et al. (2019), and Zhu et al. (2020) use this filtering setting for reporting their results on temporal KG forecasting. However, this filtering setting is not appropriate for evaluating link prediction on temporal KGs. For example, consider the test quadruple (Barack Obama, visit, India, 2015-01-25) and the object prediction query (Barack Obama, visit, ?, 2015-01-25), and assume we have observed the quadruple (Barack Obama, visit, Germany, 2013-01-18) in the training set. Under static filtering, (Barack Obama, visit, Germany) is considered a genuine triple at the timestamp 2015-01-25 and is filtered out, because the triple (Barack Obama, visit, Germany) appears in the training set in the quadruple (Barack Obama, visit, Germany, 2013-01-18).
However, the triple (Barack Obama, visit, Germany) is only temporally valid on 2013-01-18, not on 2015-01-25. Therefore, we apply another filtering scheme that is more appropriate for the link forecasting task on temporal KGs, which we name time-aware filtering. In this case, we only filter out the triples that are genuine at the timestamp of the query. In other words, if the triple (Barack Obama, visit, Germany) does not hold at the query time 2015-01-25, the quadruple (Barack Obama, visit, Germany, 2015-01-25) is considered corrupted and is kept in the candidate list. We report the time-aware filtered results of baselines and our model in Table 1.
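The distinction between the two filtering settings can be captured in a short sketch. The function below is an illustrative simplification of time-aware filtering, not the paper's evaluation code; entity and relation names are toy strings.

```python
def time_aware_filter(candidates, query, known_quadruples, answer):
    """Sketch of time-aware filtering: when ranking candidate objects for
    (s, p, ?, t), drop every candidate o (other than the ground truth) for
    which (s, p, o, t) is genuine at exactly timestamp t.  Static filtering
    would instead drop o whenever (s, p, o) holds at ANY timestamp.
    """
    s, p, _, t = query
    return [o for o in candidates
            if o == answer or (s, p, o, t) not in known_quadruples]

# The example from the text: Germany was visited on 2013-01-18, not on the
# query date, so time-aware filtering keeps it as a corrupted candidate.
known = {("Obama", "visit", "Germany", "2013-01-18"),
         ("Obama", "visit", "India", "2015-01-25")}
query = ("Obama", "visit", None, "2015-01-25")
kept = time_aware_filter(["India", "Germany"], query, known, answer="India")
```

Under static filtering, Germany would have been removed from the candidate list, artificially easing the ranking task.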

E IMPLEMENTATION

We implement our model and all baselines in PyTorch (Paszke et al., 2019). We tune the hyperparameters of our model using a grid search. We set the learning rate to 0.0002, the batch size to 128, and the number of inference steps L to 3. Please see the source code for detailed hyperparameter settings. We implement TTransE, TA-TransE/TA-DistMult, and RE-Net based on the code provided in Jin et al. (2019). We use the released code to implement DE-SimplE, TNTComplEx, and CyGNet. We use the binary cross-entropy loss to train these baselines and optimize hyperparameters according to the MRR on the validation set. Besides, we use the datasets augmented with reciprocal relations to train all baseline models.

F REASONING

In this section, we explain an additional reason why we have to update node representations along edges selected in previous inference steps. We show our intuition with a simple query in Figure 6 and two inference steps. For simplicity, we do not apply the pruning procedure here. First, we check the equations without updating node representations along previously selected edges.

Segment Sum Given a vector x ∈ R^d and another vector s ∈ R^d that indicates the segment index of each element in x, the segment sum operator returns the summation of each segment. For example, let x = [3, 1, 5]^T and s = [0, 0, 1]^T, which means the first two elements of x belong to the 0-th segment and the last element belongs to the 1-st segment. The segment sum operator returns [4, 5]^T as the output. It is realized by creating a sparse matrix Y ∈ R^{n×d}, where n denotes the number of segments. We set 1 at positions {(s[i], i), ∀i ∈ {0, ..., d-1}} of Y and pad the other positions with zeros. Finally, we multiply Y with x to get the sum of each segment.
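The matrix-multiplication construction of the segment sum can be reproduced directly in NumPy. This is a minimal sketch (a dense matrix stands in for the sparse Y):

```python
import numpy as np

def segment_sum(x, s):
    """Segment sum via matrix multiplication.

    x: values (length d); s: segment index of each element (length d).
    Builds Y with a 1 at each position (s[i], i) so that row i of Y @ x
    sums exactly the elements belonging to segment i.
    """
    d = len(x)
    n = int(s.max()) + 1            # number of segments
    Y = np.zeros((n, d))
    Y[s, np.arange(d)] = 1.0        # ones at positions (s[i], i)
    return Y @ x

# The example from the text: x = [3, 1, 5], s = [0, 0, 1].
sums = segment_sum(np.array([3.0, 1.0, 5.0]), np.array([0, 0, 1]))
```

Here `sums` is `[4. 5.]`: the 0-th segment sums 3 and 1, and the 1-st segment contains only 5. In practice the same effect is achieved with a sparse matrix or a scatter-add, which avoids materializing the dense Y.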

Segment Softmax

The standard softmax function σ: R^K → R^K is defined as σ(z)_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j). The segment softmax function has two inputs: z ∈ R^K contains the elements to normalize, and s ∈ R^K denotes the segment index of each element. It is then defined as σ(z)_i = exp(z_i) / Σ_{j ∈ {k | s_k = s_i, ∀k ∈ {0, ..., K}}} exp(z_j), where s_i denotes the segment that z_i is in. The segment softmax function can be calculated in two steps: 1. We apply the exponential function to each element of z and then apply the segment sum operator to get a denominator vector d. We need to broadcast d such that it aligns with z, i.e., d[i] is the summation of segment s[i]. 2. We apply element-wise division between exp(z) and the broadcast d.
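The two steps above translate directly into NumPy. This is a minimal sketch of the segment softmax (with a standard numerical-stability shift, which cancels within each segment):

```python
import numpy as np

def segment_softmax(z, s):
    """Segment softmax: softmax normalized within each segment of z.

    Step 1: segment-sum the exponentials to get per-segment denominators.
    Step 2: broadcast the denominators back via s and divide element-wise.
    """
    e = np.exp(z - z.max())          # shift for numerical stability
    d = len(z)
    n = int(s.max()) + 1
    Y = np.zeros((n, d))
    Y[s, np.arange(d)] = 1.0         # same construction as the segment sum
    denom = Y @ e                    # step 1: per-segment sums of exp(z)
    return e / denom[s]              # step 2: broadcast and divide

z = np.array([3.0, 1.0, 5.0])
s = np.array([0, 0, 1])
probs = segment_softmax(z, s)
```

Each segment of `probs` sums to 1, so the operator normalizes attention scores per inference graph without iterating over the graphs in a batch.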

H SURVEY

In this section, we provide the online survey (see Section 5.3 in the main body) and the evaluation statistics based on 53 respondents. To avoid biasing the respondents, we did not inform them about the nature of our project. Further, all questions were permuted at random. The quiz consists of 7 rounds. In each round, we sample a query from the test set of ICEWS14/ICEWS05-15. Along with the query and the ground-truth object, we present the users with two pieces of evidence extracted from the inference graph with high contribution scores and two pieces of evidence with low contribution scores, in randomized order. The respondents are asked to judge the relevance of the evidence to the query on two levels, namely relevant or less relevant. There are three questions in each round that ask the participants to identify the most relevant evidence, identify the most irrelevant evidence, and rank the four pieces of evidence according to their relevance. The answer to the first question is classified as correct if a participant selects one of the two statements with high contribution scores as the most relevant evidence. Similarly, the answer to the second question is classified as correct if the participant selects one of the two statements with low contribution scores as the most irrelevant evidence. For the relevance ranking task, the answer is classified as correct if the participant ranks the two statements with high contribution scores higher than the two statements with low contribution scores.

H.1 POPULATION

We provide information about the gender, age, and education level of the respondents in Figure 7.

H.2 AI QUIZ

You will participate in a quiz consisting of eight rounds. Each round is centered around an international event. Along with the event, we also show you four reasons that explain why the given event happened. While some evidence may be informative and explain the occurrence of this event, other evidence may be irrelevant to it. Your task is to find the most relevant evidence and the most irrelevant evidence, and then sort all four pieces of evidence according to their relevance. Don't worry if you feel that you cannot make an informed decision: guessing is part of this game! Additional remarks: please don't look for external information (e.g., Google, Wikipedia) or talk to other respondents about the quiz. However, you are allowed to use a dictionary if you need vocabulary clarifications.

Example

Given an event, please rank the following pieces of evidence according to their relevance to the given event. In particular, please select the most relevant reason, select the most irrelevant reason, and rank the relevance from high to low. Event: French government made an optimistic comment about China on 2014-11-24.

I ADDITIONAL ANALYSIS OF TIME-AWARE ENTITY REPRESENTATIONS

We use a generic time encoding (Xu et al., 2020) defined as Φ(t) = √(1/d) [cos(ω_1 t + φ_1), ..., cos(ω_d t + φ_d)] to generate the time-variant part of entity representations (please see Section 4.2 for more details). Time-aware representations have considerable influence on the temporal attention mechanism. To make our point, we conduct a case study and extract the edges' attention scores from the final inference graph. Specifically, we study the attention scores of the interactions between military and student at different timestamps for the query (student, criticize, ?, Nov. 17, 2014). We list the results of the model with time encoding in Table 7 and the results of the model without time encoding in Table 8. As shown in Table 7, by means of the time encoding, quadruples that share the same subject, predicate, and object still receive different attention scores. Specifically, quadruples that occurred recently tend to have higher attention scores. This makes our model more interpretable and effective. For example, given the three quadruples {(country A, accuse, country B, t_1), (country A, express intent to negotiate with, country B, t_2), (country A, cooperate with, country B, t_3)}, country A probably has a good relationship with B at t if (t_1 < t_2 < t_3 < t) holds. However, there would be a strained relationship between A and B at t if (t > t_1 > t_2 > t_3) holds. Thus, we can see that time information is crucial to the reasoning, and attention values should be time-dependent. In comparison, Table 8 shows that the triple (military, use conventional military force, student) has randomly differing attention scores at different timestamps, which is less interpretable.



Footnotes:
1. Throughout this work, we add reciprocal relations for every quadruple, i.e., we add (e_o, p^{-1}, e_s, t) for every (e_s, p, e_o, t). Hence, the restriction to predicting object entities does not lead to a loss of generality.
2. Prior neighbors linked with e_i as the subject entity, e.g., (e_j, p_k, e_i, t), are covered using reciprocal relations.
3. We found that CyGNet does not perform subject prediction in its evaluation code and does not report time-aware filtered results. Its performance drops significantly after fixing the code.
4. Code and datasets are available at https://github.com/TemporalKGTeam/xERTE
5. RE-Net: https://github.com/INK-USC/RE-Net
6. DE-SimplE: https://github.com/BorealisAI/de-simple
7. TNTComplEx: https://github.com/facebookresearch/tkbc
8. CyGNet: https://github.com/CunchaoZ/CyGNet



Figure 2: Ablation study. Unlike Table 1, which reports results on the whole test set, here we filter out test quadruples that contain unseen entities. (a)-(b) We compare the model with and without the reverse representation update in terms of raw Hits@1 (%) and Hits@10 (%) on ICEWS14, respectively. (c) Temporal embedding analysis on YAGO. We refer to the model without temporal embeddings as xERTE-Static. (d) Attention score aggregation function analysis on ICEWS14: raw MRR (%) and Hits@1/3/10 (%). (e) Inference time (seconds) on the test set of ICEWS14 for different inference step settings L ∈ {1, 2, 3, 4}. (f) Raw MRR (%) on ICEWS14 for different inference step settings L.

Figure 3: The inference graph for the query (Catherine Ashton, Make a visit, ?, 2014-11-09) from ICEWS14. The largest cyan node represents the object predicted by xERTE. The cyan node with the entity Catherine Ashton and the timestamp 2014-11-09 represents the given query subject and the query timestamp. The node size indicates the value of the node attention score. Also, the edges' color indicates the contribution score of the edge, where darkness increases as the contribution score goes up. The entity at an arrow's tail, the predicate on the arrow, and the entity and the timestamp at the arrow's head build a true quadruple.

Figure 5: Inference step by step illustration. Node attention scores are attached to the nodes. Gray nodes are removed by the pruning procedure.

Figure 6: A simple example with two inference steps for illustrating reverse node representation update schema. The graph is initialized with the green node. In the first step (the left figure), orange nodes are sampled; and in the second step (the right figure), blue nodes are sampled. Each directed edge points from a source node to its prior neighbor.

Figure 7: Information about the respondent population.

The four datasets have established themselves in the research community as benchmark datasets for temporal KGs. The ICEWS dataset contains information about political events with time annotations.

Table 1: Results of future link prediction on four datasets. Compared metrics are time-aware filtered MRR (%) and Hits@1/3/10 (%).

Model parameters.

Comparison between model variants with different sampling strategies on ICEWS14: raw MRR (%).

Table 4: Dataset statistics.

Table 5: Unseen entities (newly emerging entities) in the validation and test sets. |E_tr| denotes the number of entities in the training set, |E_tr+val| the number of entities in the training and validation sets, and |E| the number of entities in the whole dataset.

Table 6: Reduction of the time cost for a batch on ICEWS14.

Table 7: Attention scores of the interactions between military and student at different timestamps (with time encoding).

Table 8: Attention scores of the interactions between military and student at different timestamps (without time encoding).


h^l_i denotes the hidden representation of node i at the l-th inference step.

First inference step: h^1_0 is updated with the representations of the sampled one-hop prior neighbors, i.e., h^0_1, h^0_2, h^0_3.

Second inference step: Note that h^2_0 is again updated with h^0_1, h^0_2, h^0_3 and has nothing to do with h^1_4, h^1_5, h^1_6, h^1_7, h^1_8, i.e., the two-hop neighbors. In comparison, if we update the node representations along previously selected edges, the update in the second step changes: nodes 1-3 first receive messages from their one-hop prior neighbors (nodes 4-8), and then pass this information on to the query subject (node 0), so that h^2_0 also reflects the two-hop neighborhood.

G SEGMENT OPERATIONS

The degree of entities in temporal KGs, e.g., ICEWS, varies from thousands to single digits. Thus, the size of the inference graph also differs across queries. To optimize batch training, we define an array that records all nodes in the inference graphs of a batch of queries. Each node is represented by a tuple of (inference graph index, entity index, timestamp, node index), where the node index is a unique index that distinguishes the same node across different inference graphs. Note that the inference graphs of two queries may overlap, i.e., they may contain the same nodes. However, the query-dependent node representations are distinct in different inference graphs. To avoid mixing information across different queries, we need to make sure that tensor operations can be applied separately to the nodes of different inference graphs. Instead of iterating through each inference graph, we develop a series of segment operations based on matrix multiplication. The segment operations significantly reduce the time cost; we report the improvement in time efficiency on ICEWS14 in Table 6. Two examples of segment operations, the segment sum and the segment softmax, are described above.

