TILP: DIFFERENTIABLE LEARNING OF TEMPORAL LOGICAL RULES ON KNOWLEDGE GRAPHS

Abstract

Compared with static knowledge graphs, temporal knowledge graphs (tKGs), which capture the evolution and change of information over time, are more realistic and general. However, because the notion of time introduces additional complexity into rule learning, accurate reasoning over such graphs, e.g., predicting new links between entities, remains a difficult problem. In this paper, we propose TILP, a differentiable framework for temporal logical rule learning. By designing a constrained random walk mechanism and introducing temporal operators, we ensure the efficiency of our model. We present temporal feature modeling in tKGs, e.g., recurrence, temporal order, interval between pairs of relations, and duration, and incorporate it into our learning process. We compare TILP with state-of-the-art methods on two benchmark datasets and show that our framework can improve upon the performance of baseline methods while providing interpretable results. In particular, we consider scenarios in which training samples are limited, data is biased, and the time ranges of training and inference differ. In all these cases, TILP performs much better than the state-of-the-art methods.

1. INTRODUCTION

Knowledge graphs (KGs) contain facts (e_s, r, e_o) representing a relation r between a subject entity e_s and an object entity e_o, e.g., (David Beckham, plays for, Real Madrid). In the real world, many relations are time-dependent, e.g., a player joining a team for a season, a politician holding a position for a certain period, or two persons' marriage lasting for decades. To represent the evolution and change of information, temporal knowledge graphs (tKGs) have been introduced. tKGs extend the triples (e_s, r, e_o) into quadruples (e_s, r, e_o, I), where the interval I indicates the valid period of the fact, e.g., (David Beckham, plays for, Real Madrid, [2003, 2007]). Automated reasoning over KGs, such as link prediction, i.e., inferring missing facts from existing facts, is a common task in real-world applications. However, the introduction of temporal information makes this task more difficult: the important dynamic interactions between entities cannot be captured by learning methods developed for static KGs. Recently, a few embedding-based frameworks have been proposed to address this limitation, e.g., HyTE (Dasgupta et al. (2018)), TNTComplEx (Lacroix et al. (2020)), and DE-SimplE (Goel et al. (2019)). The common principle adopted by these models is to create time-dependent embeddings for entities and relations. Alternatively, first-order inductive logical reasoning methods have some desirable features relative to embedding methods when applied to KGs, as they provide interpretable and robust inference results. Since the resulting logical rules contain temporal information in tKGs, we call them temporal logical rules. Some recent works, e.g., StreamLearner (Omran et al. (2019)) and TLogic (Liu et al. (2021)), have introduced frameworks for temporal KG reasoning. However, several issues remain unaddressed.
First, these statistical methods count, from the graph, the number of paths that support a given rule as its confidence estimate. Such independent rule learning ignores the interactions between different rules extracted from the same positive example: given certain rules, the confidence of some rules might be enhanced, while that of others may be diminished. Second, these methods cannot handle the similarity between different rules. Given a reliable rule, it is reasonable to believe that the confidence of a similar rule, e.g., one with the same predicates but a slightly different temporal pattern, is also high; however, its confidence as estimated by these methods can be quite low if it appears infrequently in the dataset. Finally, the performance of these timestamp-based methods on interval-based tKGs has not been demonstrated, and the temporal relations between intervals are more complex than those between timestamps. All these problems are addressed by our neural-network-based framework. In this paper, we propose TILP, a differentiable inductive learning framework. TILP benefits from a novel constrained random walk mechanism and an extended module for temporal feature modeling. We achieve performance comparable to the state-of-the-art methods, while providing logical explanations for the inference results. More specifically, our main contributions are summarized as follows:
• TILP, a novel differentiable temporal inductive logic framework, is introduced, based on constrained random walks on temporal knowledge graphs and temporal feature modeling. It is the first differentiable approach that can learn temporal logical rules from tKGs without restrictions.
• Experiments are conducted on two benchmark datasets, i.e., WIKIDATA12k and YAGO11k, where our framework shows comparable or improved performance relative to the state-of-the-art methods.
For test queries, our framework has the advantage that it provides both a ranked list of candidates and explanations for the prediction.
• The superiority of our method over existing methods is demonstrated in several scenarios, such as when training samples are limited, data is biased, or the time ranges of training and testing differ.

2. RELATED WORKS

Embedding-based methods. Recently, embedding-based methods for tKGs have emerged for more accurate link prediction. The common principle of these methods is to create time-dependent embeddings for entities and relations, e.g., HyTE (Dasgupta et al. (2018)), TA-ComplEx (García-Durán et al. (2018)), TNTComplEx (Lacroix et al. (2020)), and DE-SimplE (Goel et al. (2019)). These embeddings are in most cases plugged into standard scoring functions. Further, other works, e.g., TAE-ILP (Jiang et al. (2016)) and TimePlex (Jain et al. (2020)), have investigated explicit temporal feature modeling and merged it into the embedding algorithms. The main weakness of embedding-based methods is their lack of interpretability, as well as their failure when previously unobserved entities, relations, or timestamps are present during inference.
Logical-rule-based methods. Logical-rule-based methods for link prediction on tKGs are mainly based on random walks. Although these works show the ability to learn temporal rules, they perform random walks in a very restricted manner, which impairs the quality of the learned rules. For example, Dynnode2vec (Mahdavi et al. (2018)) and Change2vec (Bian et al. (2019)) both process tKGs as a set of graph snapshots at different times, on which random walks are performed separately. DynNetEmbedd (Nguyen et al. (2018)) requires the edges in walks to move forward in time. StreamLearner (Omran et al. (2019)) first extracts rules from static random walks and then extends them separately into the time domain; the consequence is that all body atoms in the extended rules share the same timestamp. TLogic (Liu et al. (2021)), the most recent work, extracts temporal logical rules from defined temporal random walks. Its temporal constraints for temporal random walks are built on timestamps instead of intervals and are fixed during learning; this inflexibility impairs its ability to learn temporal constraints.
Furthermore, both StreamLearner and TLogic, which can truly learn temporal logical rules, are statistical methods that estimate rule confidence by counting the number of rule groundings and body groundings.
Differentiable rule learning. Several works utilize neural network architectures for rule learning, e.g., Neural-LP (Yang et al. (2017)), NTP (Rocktäschel & Riedel (2017)), DeepProblog (Manhaeve et al. (2018)), ∂ILP (Evans & Grefenstette (2018)), RuLES (Ho et al. (2018)), IterE (Zhang et al. (2019)), dNL-ILP (Payani & Fekri (2019)), and NLProlog (Weber et al. (2019)). These works mainly focus on static KGs and lack the ability to capture temporal patterns. Converting from static KGs to temporal KGs is not a trivial extension. For example, Neural-LP reduces the rule learning problem to matrix multiplication; by defining operators and a neural control system, this framework realizes logical rule learning in an end-to-end fashion. However, the extra temporal constraints in logical rules break the Markovian property of random walks, which serves as the foundation of these past frameworks.
Link prediction. This task is to predict missing links using observed facts from the same tKG. Specifically, given a query (e_s, r, ?, I) and the observed facts from the same tKG G, a ranked list of candidates for the missing object is required. For subject prediction, the query is formulated as (e_o, r^{-1}, ?, I). Compared with static link prediction, temporal link prediction is much harder: even given the same subject (or object) and relation, the correct answer can change with different query intervals.

3. PRELIMINARIES

Temporal relation. The temporal relation (TR) between two timestamps t and t′ can take the form before, equal, or after, denoted by t < t′, t = t′, and t > t′, respectively. For the TR between two intervals I and I′, there are 13 different relations in total, given by Allen's interval algebra (Allen (1983)). For example, I := [t_s, t_e] is before I′ := [t′_s, t′_e] iff t_e < t′_s. However, the resulting temporal logical rules would become too specific if all 13 types were used directly. Thus, we group them into 3 classes: TR ∈ {before, touching, after}, where TR denotes the possible TR between two intervals, before is defined as above, after is the converse of before, and touching is the group of the other 11 types. In other words, touching is used when two intervals overlap or make contact. It should be noted that a timestamp can be considered a special kind of interval whose start time equals its end time; thus, our definition of TR can also be used to describe the TR between timestamps.
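As a concrete illustration, the grouping of Allen's 13 interval relations into the three coarse classes can be sketched as a small function; the tuple encoding of intervals is our own assumption, not notation from the paper:

```python
def temporal_relation(I, J):
    """Coarse TR between intervals I = (ts, te) and J = (ts2, te2):
    'before' if I ends strictly before J starts, 'after' for the converse,
    and 'touching' for the remaining 11 Allen relations (any overlap or
    contact). A timestamp t is the degenerate interval (t, t)."""
    ts, te = I
    ts2, te2 = J
    if te < ts2:
        return "before"
    if ts > te2:
        return "after"
    return "touching"
```

For example, `temporal_relation((2003, 2007), (2008, 2010))` returns `"before"`, while two identical timestamps fall under `"touching"`.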

Temporal logical rule.

A temporal logical rule of length l ∈ ℕ is defined as

P_{l+1}(E_1, E_{l+1}, I_{l+1}) ← ∧_{i=1}^{l} P_i(E_i, E_{i+1}, I_i) ∧ (∧_{j=1}^{l} ∧_{k=j+1}^{l+1} TR_{j,k}(I_j, I_k))    (1)

where E_i ∈ E denotes entity variables, I_i ∈ I denotes interval variables, P_i ∈ R denotes predicates, which are grounded relations in logical rules, and TR_{j,k} ∈ {before, touching, after} denotes grounded temporal relations between intervals. The left arrow in the rule is read as "entails", i.e., the rule body on the right entails the rule head on the left. The rule head contains a head predicate P_{l+1}, also called the target predicate. For the link prediction task, target predicates are given, so we denote the target predicate by P_h in the following sections. I_{l+1} is called the query interval. Similarly, P_i and I_i for i ∈ [1, l] are called body predicates and body intervals, respectively. The rule is called "chain-like" because the rule body corresponds to a walk from E_1 to E_{l+1}. A rule is grounded by substituting the variables E and I with constants. For example, a grounding of the temporal logical rule

ReceiveAward(E_1, E_2, I_2) ← NominatedFor(E_1, E_2, I_1) ∧ touching(I_1, I_2)

is given by the edges (Alice Bradley Sheldon, receive award, Nebula Award for Best Novelette, [1977, 1977]) and (Alice Bradley Sheldon, nominated for, Nebula Award for Best Novelette, [1977, 1977]) in the WIKIDATA12k dataset. Since logical rules can be violated, rule confidence, the probability that a rule is correct, needs to be estimated.
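A chain-like rule of this form can be held in a small data structure; the field names below are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class TemporalRule:
    """Chain-like temporal rule: body atoms P_i(E_i, E_{i+1}, I_i) form a
    walk from E_1 to E_{l+1}; tr maps an interval index pair (j, k) to the
    required coarse temporal relation TR_{j,k}(I_j, I_k)."""
    head: str   # target predicate P_{l+1}
    body: list  # body predicates P_1 .. P_l, in walk order
    tr: dict    # {(j, k): "before" | "touching" | "after"}

    @property
    def length(self):
        return len(self.body)

# The ReceiveAward example from the text: the nomination interval I_1
# must touch the query interval I_2.
rule = TemporalRule(head="ReceiveAward",
                    body=["NominatedFor"],
                    tr={(1, 2): "touching"})
```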

4. CONSTRAINED RANDOM WALK

Path constraint. Temporal logical rules can be considered as constraints on random walks over a tKG. Generally, these constraints fall into two classes: Markovian and non-Markovian. With Markovian constraints, the next-state probability depends only on the current state, i.e., a random walk is performed without considering previously visited edges. Otherwise, we must record the previously visited edges to ensure that the non-Markovian constraints are satisfied. For the temporal logical rule given in (1), the Markovian constraints include the body predicates, i.e., P_i for i ∈ [1, l], and the TRs between the query interval and every body interval, i.e., TR_{i,l+1} for i ∈ [1, l]. The non-Markovian constraints include the pairwise TRs between body intervals, i.e., TR_{j,k} for j ∈ [1, l-1] and k ∈ [j+1, l]. Filtering operators f for these constraints are defined as

f_P((e_s, r, e_o, I)) = 1 if r = P, 0 otherwise.    (2)

f_TR((e_s, r, e_o, I), (e′_s, r′, e′_o, I′)) = 1 if g(I, I′) = TR, 0 otherwise.    (3)

g(I, I′) = before if t_e < t′_s; touching if I and I′ overlap; after if t_s > t′_e,    (4)

where I := [t_s, t_e] and I′ := [t′_s, t′_e].
Constrained random walk. Since a successful random walk must satisfy both classes of constraints, we first perform the random walk under the Markovian constraints and then filter the results according to the non-Markovian ones. To ensure the efficiency of our framework, we use matrix operators built from the filtering operators. Given a query (e_s, r, ?, I), for every pair of entities e_x, e_y ∈ E, the operator M_{i,CM_i} ∈ {0, 1}^{|E|×|E|} for step i under the corresponding Markovian constraints CM_i = {P_i, TR_{i,l+1}} is defined as

(M_{i,CM_i})_{x,y} = max_{F ∈ F_{y,x}} f_{CM_i}(F) = max_{F ∈ F_{y,x}} f_{P_i}(F) f_{TR_{i,l+1}}(F, (e_s, r, ?, I)),    (5)

where (M_{i,CM_i})_{x,y} denotes the (x, y) entry of M_{i,CM_i}, F denotes a single fact, and F_{y,x} denotes the set of facts from e_y to e_x.
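A minimal sketch of the operator in Eq. (5), under an assumed fact encoding (subject index, relation, object index, interval); the helper `coarse_tr` mirrors the grouping of Allen relations defined in Section 3:

```python
import numpy as np

def coarse_tr(I, J):
    """Coarse TR between intervals (ts, te) and (ts2, te2)."""
    (ts, te), (ts2, te2) = I, J
    if te < ts2:
        return "before"
    if ts > te2:
        return "after"
    return "touching"

def markov_operator(facts, num_entities, predicate, tr_to_query, query_interval):
    """Build M_{i,CM_i}: entry (x, y) is 1 iff some fact from e_y to e_x
    has the required predicate and the required TR to the query interval;
    taking the max over facts caps each entry at 1."""
    M = np.zeros((num_entities, num_entities), dtype=np.int8)
    for s_idx, rel, o_idx, interval in facts:
        if rel == predicate and coarse_tr(interval, query_interval) == tr_to_query:
            M[o_idx, s_idx] = 1  # step e_y -> e_x is stored at entry (x, y)
    return M
```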
The essence of the operator is an adjacency matrix under Markovian constraints, with entries capped at 1. With these operators, we can find all paths between any pair of entities that satisfy the Markovian constraints. Suppose we start from entity e_s. After l steps of random walk under the corresponding Markovian constraints, the process is written as

v_{i+1} = M_{i,CM_i} v_i for 1 ≤ i ≤ l,    (6)

where v_i ∈ ℕ^{|E|} is an indicator vector, e.g., in v_1 only the entry for e_s is set to 1, all other entries being 0. When the walk arrives at some entities, the corresponding entries become greater than 0. With these indicator vectors, we can obtain every single constrained random walk W^{(n)}_{CM} for n ∈ ℕ, given by ((e^{(n)}_1, r^{(n)}_1, e^{(n)}_2, I^{(n)}_1), . . . , (e^{(n)}_l, r^{(n)}_l, e^{(n)}_{l+1}, I^{(n)}_l)). For the non-Markovian constraints CN_{j,k} = {TR_{j,k}}, we apply the corresponding filtering functions to these walks:

f_CN(W^{(n)}_{CM}) = ∏_{j=1}^{l-1} ∏_{k=j+1}^{l} f_{CN_{j,k}}((e^{(n)}_j, r^{(n)}_j, e^{(n)}_{j+1}, I^{(n)}_j), (e^{(n)}_k, r^{(n)}_k, e^{(n)}_{k+1}, I^{(n)}_k)) = ∏_{j=1}^{l-1} ∏_{k=j+1}^{l} f_{TR_{j,k}}((e^{(n)}_j, r^{(n)}_j, e^{(n)}_{j+1}, I^{(n)}_j), (e^{(n)}_k, r^{(n)}_k, e^{(n)}_{k+1}, I^{(n)}_k))    (7)

The set of filtered walks S_WC := {W^{(n)}_{CM} : f_CN(W^{(n)}_{CM}) = 1} is the final output of our algorithm. In this process, we also remove walks that involve repeated edges. By introducing new filtering functions, this framework admits further possibilities. For example, using the numerical comparison filtering functions of Wang et al. (2019), our method is able to learn logical rules with numerical features such as a person's age, height, and weight.
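The two-phase procedure, Markovian propagation via Eq. (6) followed by non-Markovian filtering via Eq. (7), can be sketched as follows; the walk encoding and the externally supplied `coarse_tr` comparator are our assumptions:

```python
import numpy as np

def propagate(operators, start_idx, num_entities):
    """Eq. (6): v_{i+1} = M_{i,CM_i} v_i, starting from the indicator of e_s.
    Nonzero entries of the result mark entities reachable by some walk that
    satisfies every Markovian constraint."""
    v = np.zeros(num_entities, dtype=np.int64)
    v[start_idx] = 1
    for M in operators:
        v = M @ v
    return v

def filter_non_markovian(walks, tr_constraints, coarse_tr):
    """Eq. (7): keep a walk (a list of facts (s, r, o, interval)) only if
    every required pairwise TR between body intervals holds."""
    return [w for w in walks
            if all(coarse_tr(w[j - 1][3], w[k - 1][3]) == tr
                   for (j, k), tr in tr_constraints.items())]
```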

5. LEARNING TEMPORAL LOGICAL RULES

The framework mainly consists of two stages: rule learning followed by rule application. In the learning stage, given a set of positive examples for the target predicates, we find all paths from the subject entity to the object entity in the tKG. We then extract temporal logical rules from these paths and estimate their confidence. The parameterised distributions of the different temporal features introduced in this section are also measured at this stage. In the application stage, given a query and a set of temporal logical rules related to the target predicate, we find all random walks that satisfy these rules. By calculating the arrival rate of each rule and aggregating over all rules, we obtain the temporal logical rule score for each candidate. In addition, we use all the evidence related to these candidates to evaluate their temporal feature scores which, together with the temporal logical rule scores, form the final scores.
Rule confidence. For the extracted temporal logical rules, we need to estimate their confidence for rule application. Instead of estimating confidence for every single rule, we create attention vectors for every type of constraint and re-use them across different rules. Sharing the confidence of constraints creates a joint and robust learning process and largely reduces the number of model parameters. In our temporal logical rules, there are two types of constraints: predicates and temporal relations. For every target predicate, we create a set of attention vectors that denote the confidence of using a certain constraint. Many factors can affect these attention vectors, e.g., the target predicate, the query interval, the rule length, entity properties, and so on. To make this problem tractable, we make some simplifications: we suppose that the attention vectors of predicates and TRs depend only on the target predicate and the rule length.
Furthermore, to deal with rules of varying lengths, an attention vector over rule lengths is also required. Inspired by Neural-LP (Yang et al. (2017)), we design a set of mapping functions based on an RNN.

Figure 1: The RNN-based system for solving the attention vectors.

Specifically, as shown in Fig. 1, let w_Len ∈ ℝ^L be the attention vector over rule lengths, where L is the maximum length. For a given length l ∈ [1, L], at each step i ∈ [1, l] we calculate the attention vector of the predicate, (w_P)^l_i ∈ ℝ^{|R|}, and of its TR to the query interval, (w_TR)^l_{i,l+1} ∈ ℝ^{|TR|}. We also calculate the attention vector of the pairwise TR between body intervals, (w_TR)^l_{j,k} ∈ ℝ^{|TR|} for j ∈ [1, l-1] and k ∈ [j+1, l], using the corresponding states:

h^l_i = update(h^l_{i-1}, f_embdd(X_l, P_h))    (8)
(w_P)^l_i = softmax(W_P h^l_i + b_P)    (9)
(w_TR)^l_{i,l+1} = softmax(W_TR h^l_i + b_TR)    (10)
(w_TR)^l_{j,k} = softmax(W′_TR [h^l_j; h^l_k] + b′_TR)    (11)
w_Len = softmax(W_Len f_embdd(X_0, P_h) + b_Len)    (12)

where h^l_i ∈ ℝ^d denotes the i-th-step state in an l-length rule with feature dimension d, h^l_0 is set to 0, X_l ∈ ℝ^{|R|×d} denotes the embedding matrix for target predicate P_h, f_embdd(X, p) := u_p^T X is an embedding lookup function in which u_p is a one-hot indicator vector for input p, i.e., f_embdd(X_l, P_h) ∈ ℝ^d, and W_P ∈ ℝ^{|R|×d}, W_TR ∈ ℝ^{|TR|×d}, W′_TR ∈ ℝ^{|TR|×2d}, W_Len ∈ ℝ^{L×d}, b_P ∈ ℝ^{|R|}, b_TR, b′_TR ∈ ℝ^{|TR|}, b_Len ∈ ℝ^L are all learnable parameters.
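A toy numpy sketch of Eqs. (8)-(10): a recurrent state per step feeds softmax heads that produce attention over predicates and over the three TRs. The tanh-RNN cell and all dimensions here are illustrative stand-ins for the paper's `update` function:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_rel, n_tr = 8, 5, 3                  # toy feature / vocabulary sizes
Wh = rng.normal(size=(d, d))              # recurrent weights (the update cell)
Wx = rng.normal(size=(d, d))
W_P, b_P = rng.normal(size=(n_rel, d)), np.zeros(n_rel)
W_TR, b_TR = rng.normal(size=(n_tr, d)), np.zeros(n_tr)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_vectors(x_head, length):
    """x_head: d-dim embedding of the target predicate P_h.
    Returns per-step attention over predicates (Eq. 9) and TRs (Eq. 10)."""
    h = np.zeros(d)                       # h^l_0 = 0
    w_P, w_TR = [], []
    for _ in range(length):
        h = np.tanh(Wh @ h + Wx @ x_head) # stand-in for update(h, f_embdd)
        w_P.append(softmax(W_P @ h + b_P))
        w_TR.append(softmax(W_TR @ h + b_TR))
    return w_P, w_TR
```

Each output vector is a probability distribution, matching the softmax heads in Eqs. (9)-(10).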
Given these attention vectors, the confidence of an l-length rule given by (1) is written as:

score(Rule) = f_embdd(w_Len, l) ∏_{i=1}^{l} f_embdd((w_P)^l_i, P_i) f_embdd((w_TR)^l_{i,l+1}, TR_{i,l+1}) ∏_{j=1}^{l-1} ∏_{k=j+1}^{l} f_embdd((w_TR)^l_{j,k}, TR_{j,k})

where f_embdd(w_Len, l) ∈ ℝ denotes the confidence of length l, f_embdd((w_P)^l_i, P_i) ∈ ℝ denotes the confidence of predicate P_i, f_embdd((w_TR)^l_{i,l+1}, TR_{i,l+1}) ∈ ℝ denotes the confidence of temporal relation TR_{i,l+1}, and f_embdd((w_TR)^l_{j,k}, TR_{j,k}) ∈ ℝ denotes the confidence of temporal relation TR_{j,k}.
Temporal feature modeling. Our framework focuses only on the connecting paths between the two entities in a query, and the TR between intervals is discretized, which might impair model performance. To address these limitations, we introduce temporal feature modeling, which involves extra evidence and continuous distribution measurements. Inspired by TimePlex (Jain et al. (2020)), we design an extended module. The main features considered here are Recurrence, TemporalOrder, RelationPairInterval, and Duration.
• Recurrence describes the distribution of recurrences of the same relation. Different from TimePlex's description (e_s, r, e_o, *), we measure this feature with the more general form (e_s, r, *, *). For each relation r, we use a parameter (p_rec)_r to denote the probability that this relation will happen again for the same subject. It should be noted that inverse relations r^{-1} are also involved.
• TemporalOrder describes the distribution of the temporal order between pairs of relations happening for the same subject, i.e., (e_s, r, *, I), (e_s, r′, *, I′). This feature is implied by the TRs in temporal logical rules to some extent; however, for two touching intervals, we still cannot tell which one happens earlier. Thus, a parameter (p_order)_{r,r′} is adopted for every pair of relations r and r′ to denote the probability that r happens earlier than r′.
• RelationPairInterval describes the distribution of the time gaps between pairs of relations happening for the same subject. Different from TimePlex, given a pair of relations r and r′, we consider two types of distributions for their time gap, Gaussian and exponential, with parameters (μ_pair)_{r,r′}, (σ_pair)_{r,r′} and (λ_pair)_{r,r′}, respectively. The Gaussian distribution is preferred when there is a roughly fixed gap, such as between the birth date and death date of the same person, while the exponential distribution is more suitable for two strongly correlated relations.
• Duration describes the distribution of the interval length of every relation. We suppose the duration of each relation r follows a Gaussian distribution with parameters (μ_d)_r, (σ_d)_r. In large tKGs, it is common for the exact dates of some facts to be missing; with this feature, we can estimate these missing dates and improve model performance.
With these temporal feature distributions, we can further evaluate the candidates of a query. Similar to TimePlex, a linear function of the probability is used as the scoring function, i.e., ϕ_rec, ϕ_order, ϕ_pair. However, our model uses more evidence in the evaluation: the evidence of TimePlex only includes facts occurring between the known entity and each candidate, while our model extends this with the constrained random walks. Given a query (e_s, r, ?, I) and a candidate e_c, the evidence used in the temporal feature modeling module includes the facts between e_s and e_c, the facts between e_c and other entities, and the constrained random walks from e_s to e_c (see Appendix A). The training is divided into two phases. In the first phase, the attention vectors for predicates, TRs, and rule length are learned by maximizing the score of correct candidates. In the second phase, all the distribution parameters of the temporal features are fitted with training samples.
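As a toy illustration of how a rule's confidence is assembled from the shared attention weights, the product defined in this section can be sketched with plain dictionary lookups standing in for the learned attention vectors; all names here are hypothetical:

```python
def rule_score(rule, w_len, w_pred, w_tr_query, w_tr_pair):
    """Multiply the length weight, each body predicate's weight, each TR to
    the query interval, and each pairwise TR between body intervals."""
    l = rule["length"]
    s = w_len[l]
    for i, (p, tr) in enumerate(zip(rule["preds"], rule["tr_query"]), start=1):
        s *= w_pred[(l, i)][p] * w_tr_query[(l, i)][tr]
    for (j, k), tr in rule["tr_pairs"].items():
        s *= w_tr_pair[(l, j, k)][tr]
    return s

# The length-1 ReceiveAward rule from Section 3, with made-up weights:
rule = {"length": 1, "preds": ["NominatedFor"],
        "tr_query": ["touching"], "tr_pairs": {}}
score = rule_score(rule,
                   w_len={1: 0.5},
                   w_pred={(1, 1): {"NominatedFor": 0.8}},
                   w_tr_query={(1, 1): {"touching": 0.9}},
                   w_tr_pair={})
```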
Then we train the weight parameters of the temporal feature modeling module with frozen attention vectors, i.e., ϕ_TLR is used for prediction in the first phase, and ϕ_TILP is adopted in the second.
Datasets. We evaluate on two interval-based benchmark datasets, WIKIDATA12k and YAGO11k. The temporal specificity of facts in these datasets can be at the year, month, or day level, although month and day data are absent from the majority of examples; we remove month and day information from such facts to achieve a more uniform data representation. For datasets with higher granularity, we would expect improved performance due to more precise temporal relations.

6. EXPERIMENTS

For the link prediction task on data of the form (e_s, r, e_o, I), we generate a ranked list of candidates for both object prediction (e_s, r, ?, I) and subject prediction (e_o, r^{-1}, ?, I). The maximum rule length is set to 5 for both datasets. The standard metrics mean reciprocal rank (MRR), hit@1, and hit@10 are used to compare the methods. Similar to Jain et al. (2020), we perform time-aware filtering, which gives a more valid performance evaluation. We compare TILP with state-of-the-art baselines along two dimensions: static vs. temporal, and embedding-based vs. logical-rule-based. The results for all embedding-based models are from Jain et al. (2020). Ablation studies on the temporal feature modeling module (TILP w/o tfm) are also conducted.
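The evaluation protocol can be sketched as follows; one common reading of time-aware filtering is that, before ranking, any other entity that is also a correct answer for the same query at the same time is removed (the dictionary-based scoring interface is our assumption):

```python
def filtered_rank(scores, true_entity, also_correct):
    """Rank of the true answer among candidates, after removing other
    entities that are also correct for this (query, interval) pair."""
    true_score = scores[true_entity]
    rank = 1
    for ent, s in scores.items():
        if ent != true_entity and ent not in also_correct and s > true_score:
            rank += 1
    return rank

def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and hit@k over a list of per-query ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits
```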

6.2. RESULTS AND ANALYSIS

The results of the experiments are shown in Table 1, with the efficiency study given in Appendix C. The performance of TILP is comparable with the best temporal embedding method across all metrics, with TimePlex performing slightly better than TILP on half of the evaluated metrics. Note that, because of the human-interpretable form of TILP's predictions, its temporal logical rules provide explanatory support for the predictions, while the interpretation of TimePlex embeddings would be more opaque. Some examples of learned rules and groundings from TILP on the WIKIDATA12k dataset are given below.
Rule 1: memberOf(E_1, E_4, I_4) ← memberOf(E_1, E_2, I_1) ∧ memberOf^{-1}(E_2, E_3, I_2) ∧ memberOf(E_3, E_4, I_3) ∧ (∧_{i=1}^{3} ∧_{j=i+1}^{4} touching(I_i, I_j))
Grounding: E_1 = Somalia, E_2 = International Development Association, E_3 = Kingdom of the Netherlands, E_4 = International Finance Corporation, I_1 = [1962, present], I_2 = [1961, present], I_3 = [1956, present], I_4 = [1962, present].
Rule 2: receiveAward(E_1, E_4, I_4) ← nominatedFor(E_1, E_2, I_1) ∧ nominatedFor^{-1}(E_2, E_3, I_2) ∧ receiveAward(E_3, E_4, I_3) ∧ before(I_1, I_2) ∧ (∧_{i=1}^{2} after(I_i, I_3)) ∧ (∧_{j=1}^{3} before(I_j, I_4))
Grounding: E_1 = ZDF, E_2 = International

Granularity of temporal relations:

We observe that the well-known TLogic approach, which uses a point-in-time representation of temporal facts, does not perform as well as more recently developed methods on these two interval-based tKGs. Several factors likely contribute to this phenomenon. First, the foundation of the temporal walks in TLogic is the relative order of time points; without temporal relations between intervals, TLogic cannot simply be extended to interval-based tKGs. Second, to improve model efficiency, TLogic uses a sampling strategy that controls the total number of temporal walks. This strategy impairs model performance, since there is no guarantee of completing long-distance random walks; indeed, many of the temporal logical rules we found in these two datasets are of length 5, which can be challenging for TLogic.
Temporal feature modeling: These experiments suggest that time-aware learning is an important component of link prediction in tKGs, since all static learning methods are outperformed by their counterparts with temporal learning abilities. This leaves us with a comparison of logical-rule-based methods to embedding-based methods: the former integrate temporal relations into the logical rule representation, while the latter use time-dependent embeddings. Both approaches can be implemented in a variety of ways; for example, previous explicit temporal feature modeling approaches have used Recurrence, TemporalOrder, RelationPairInterval, and Duration. Representing temporal intervals with continuous distributions has demonstrated greater success than these previous models, as shown by the evaluation of both TimePlex and TILP (our model). However, our extensions of the temporal feature modeling module are non-trivial, since the evidence from constrained random walks cannot be found by any embedding-based method.

6.3. MORE DIFFICULT PROBLEM SETTINGS

Our model uses neurally generated symbolic representations of temporal and entity relationships, while performing on par with state-of-the-art embedding-based methods. Looking beyond raw performance, symbolic representations convey several advantages for understanding prediction quality in tKGs. To demonstrate these strengths, we propose the following more difficult problem settings, which address data efficiency, model robustness, and transfer learning. All these scenarios are important and challenging tasks in KG reasoning, and the temporal information in tKGs makes the problems even harder. A few related works (Mirtaheri et al. (2020), Xu et al. (2021), Liu et al. (2021)) offer partial solutions, but there is still room for improvement. To simplify the discussion, we restrict the comparison of our model to the standard and highest-performing methods for temporal link prediction: TLogic (logical-rule-based) and TimePlex (embedding-based).
Few training samples. Training data is often expensive to obtain for new scenarios, so the performance of a method under limited training data is an important consideration. We parametrically examine the relative performance of our model in this low-data scenario and demonstrate its data efficiency using the MRR metric.

7. CONCLUSION

TILP, the first differentiable framework for temporal logical rule learning, has been proposed for the link prediction task on temporal knowledge graphs. Experiments on two standard datasets indicate that TILP achieves performance comparable to the state-of-the-art embedding-based methods while additionally providing logical explanations for the link predictions. In addition, we consider some important learning problems in temporal knowledge graphs, where TILP outperforms most baselines. An interesting direction for future work is to predict event intervals with temporal logical rules. In such a task, the learned rules must contain both numerical values and temporal relations, a situation which should further benefit from the expressive power of the logical rules considered in the TILP framework.

A DETAILS OF TEMPORAL FEATURE MODELING

Given a query (e_s, r, ?, I) and a candidate e_c, the facts used in this module include three parts: the set of edges from e_c to e_s, denoted by F_{c,s}; the set of edges from e_c to other entities, denoted by F_{c,*}; and the set of paths from e_s to e_c, denoted by S_WC(e_s, e_c). Let R^(1), R^(2), R^(3) ⊆ R denote the sets of relations existing in F_{c,s}, F_{c,*} and S_WC(e_s, e_c), respectively. Given a relation r′, let (T_s)^1_{r′}, (T_s)^2_{r′}, (T_s)^3_{r′} ∈ ℝ be the start times in F_{c,s}, F_{c,*} and S_WC(e_s, e_c), respectively, that are closest to the start time of the query interval, denoted by t_s. In addition, we introduce a scoring function ϕ to integrate different probabilities:

ϕ(e_c; h, w, b) = Σ_{r′ ∈ R(e_c)} exp(w_{r,r′}) (h_{r,r′} + b_{r,r′}) / Σ_{r″ ∈ R(e_c)} exp(w_{r,r″})    (17)

where ϕ(e_c) ∈ ℝ denotes the score of e_c, the probability h ∈ ℝ^{|R|×|R|} corresponds to the query relation r and another relation r′, i.e., h_{r,r′} ∈ ℝ, R(e_c) ⊆ R denotes the set of relations related to e_c, and w, b ∈ ℝ^{|R|×|R|} are learnable parameters, i.e., w_{r,r′}, b_{r,r′} ∈ ℝ. In our model, given a temporal feature x ∈ ℝ related to candidate e_c, query relation r, and another relation r′, its probability h_{r,r′} may follow one of three types of distributions: 1) Bernoulli with parameter p ∈ ℝ^{|R|×|R|}, i.e., h_{r,r′} = (p_{r,r′})^x (1 − p_{r,r′})^{1−x}; 2) Gaussian with parameters μ, σ ∈ ℝ^{|R|×|R|}, i.e., h_{r,r′} = N(x; μ_{r,r′}, σ_{r,r′}); 3) exponential with parameter λ ∈ ℝ^{|R|×|R|}, i.e., h_{r,r′} = λ_{r,r′} exp(−λ_{r,r′} x). For example, with a yearly resolution in the YAGO11k dataset, we found that the RelationPairInterval between a person's birth and graduation is normally distributed with mean 22 and standard deviation 1, and that between a person's birth and death is also normally distributed, with mean 70 and standard deviation 6. Letting r = wasBornIn, r′ = graduatedFrom, r″ = diedIn, we have μ_{r,r′} = 22, σ_{r,r′} = 1, μ_{r,r″} = 70, σ_{r,r″} = 6.
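Using the birth/graduation and birth/death parameters above, the Gaussian feature probabilities and their aggregation via the softmax-weighted score of Eq. (17) can be sketched as follows; the uniform weights and zero biases are placeholders for the learned w and b:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density N(x; mu, sigma) used for RelationPairInterval features."""
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

def phi(relations, h, w, b):
    """Eq. (17): softmax over w_{r,r'} mixes probabilities h_{r,r'} + b_{r,r'}."""
    z = sum(math.exp(w[r2]) for r2 in relations)
    return sum(math.exp(w[r2]) * (h[r2] + b[r2]) for r2 in relations) / z

# Hypothetical candidate born in 1872 who graduated in 1919 and died in
# 1986, under the fitted parameters quoted above.
h = {"graduatedFrom": normal_pdf(47, 22, 1),   # birth->graduation gap, tiny
     "diedIn": normal_pdf(114, 70, 6)}         # birth->death gap, also tiny
w = {"graduatedFrom": 0.0, "diedIn": 0.0}      # placeholder learned weights
b = {"graduatedFrom": 0.0, "diedIn": 0.0}
score = phi({"graduatedFrom", "diedIn"}, h, w, b)
```

Both densities are vanishingly small, so such a candidate receives an extremely low RelationPairInterval score.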
Given a query (?, wasBornIn, Nashville, [1872, 1872]) and a candidate e_c = Cass Canfield, we first check the known facts related to e_c, which include (Cass Canfield, graduatedFrom, Harvard University, [1919, 1919]) and (Cass Canfield, diedIn, New York City, [1986, 1986]). Then the temporal feature, i.e., the RelationPairInterval, related to e_c, r and r′ is x = |1872 − 1919| = 47. Thus, h_{r,r′} = N(x; μ_{r,r′}, σ_{r,r′}) = N(47; 22, 1) = 7.65 × 10^{−137}. Similarly, h_{r,r″} = N(|1872 − 1986|; 70, 6) = 1.39 × 10^{−13}. These probabilities h_{r,r′}, h_{r,r″} are integrated with (17) to form the scoring function ϕ_pair(e_c). For the other temporal features, the calculation of the scoring functions is similar, with two differences: 1) since Recurrence is only related to the query relation, we use (18), which can be considered a simplified version of (17); 2) both Recurrence and TemporalOrder follow Bernoulli distributions, i.e., x = 0 or 1. More details are given below.
• Recurrence describes the probability distribution of the recurrence of relation r on entity e_c, and is considered in F_{c,s} and F_{c,*}. We suppose this probability follows a Bernoulli distribution with parameters p_rec,1, p_rec,2 ∈ ℝ^{|R|} in F_{c,s} and F_{c,*}, respectively. The following scoring function is defined:

ϕ_rec(e_c; h_rec, w_rec, b_rec) = (w_rec)_r (h_rec)_r + (b_rec)_r    (18)

where the temporal feature related to e_c and r is x = 1(r ∈ R(e_c)), and its probability (h_rec)_r = ((p_rec)_r)^x (1 − (p_rec)_r)^{1−x} ∈ ℝ is based on p_rec ∈ ℝ^{|R|}.
• TemporalOrder describes the probability distribution of the temporal order of two relations r and r′, and is considered in F_{c,s}, F_{c,*} and S_WC(e_s, e_c). We suppose this probability follows a Bernoulli distribution with parameters p_order,1, p_order,2, p_order,3 ∈ ℝ^{|R|×|R|} in F_{c,s}, F_{c,*} and S_WC(e_s, e_c), respectively.
Given R(e c ) = R (1) , we calculate ϕ order,1 (e c ; h order,1 , w order,1 , b order,1 ) with ( 17), where h order,1 ∈ R |R|×|R| is based on p order,1 with x = 1(t s < (T s ) 1 r ′ ), and w order,1 , b order,1 ∈ R |R|×|R| are learnable parameters. Similarly, given R(e c ) = R (2) , we calculate ϕ order,2 (e c ; h order,2 , w order,2 , b order,2 ) with parameters h order,2 , w order,2 , b order,2 ∈ R |R|×|R| , where h order,2 is based on p order,2 with x = 1(t s < (T s ) 2 r ′ ); Given R(e c ) = R (3) , we calculate ϕ order,3 (e c ; h order,3 , w order,3 , b order,3 ) with parameters h order,3 , w order,3 , b order,3 ∈ R |R|×|R| , where h order,3 is based on p order,3 with x = 1(t s < (T s ) 3 r ′ ). (h rec ) r = ((p rec ) r ) x (1 -(p rec ) r ) 1-x ∈ R is based on p rec ∈ R |R| , i.e., • RelationP airInterval describes the distribution of the time gap between two relations r and r ′ , and is considered in F c,s , F c,s and S W C (es,ec) . We suppose this probability follows a Gaussian or exponential distribution with parameters µ pair,1 , σ pair,1 , λ pair,1 , µ pair,2 , σ pair,2 , λ pair,2 , µ pair,3 , σ pair,3 , λ pair,3 ∈ R |R|×|R| in F c,s , F c,s and S W C (es,ec) , respectively. Given R(e c ) = R (1) , we calculate ϕ pair,1 (e c ; h pair,1 , w With these temporal feature distributions, we can further evaluate the candidates given a query. We fisrt combine the scores in each part with ( 19)-( 21). Then the scores from different parts are combined to obtain the temporal feature modeling score ϕ tf m given in ( 14). ϕ tf m,1 (e c ) =γ rec,1 ϕ rec,1 (e c ) + γ order,1 ϕ order,1 (e c ) + γ pair,1 ϕ pair,1 (e c ) (19) ϕ tf m,2 (e c ) =γ rec,2 ϕ rec,2 (e c ) + γ order,2 ϕ order,2 (e c ) + γ pair,2 ϕ pair,2 (e c ) (20) ϕ tf m,3 (e c ) =γ order,3 ϕ order,3 (e c ) + γ pair,3 ϕ pair,3 (e c ) where all γ ≥ 0 are learnable weights. 
For each part, the sum of weights is equal to 1, i.e., γ_{rec,1} + γ_{order,1} + γ_{pair,1} = 1, γ_{rec,2} + γ_{order,2} + γ_{pair,2} = 1, and γ_{order,3} + γ_{pair,3} = 1.
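The scoring machinery above can be sketched in a few lines. This is a minimal scalar version, assuming dictionary-based parameters rather than the |R|×|R| tensors of the paper; function and key names are illustrative, not from the TILP codebase.

```python
import math

def prob(x, dist, params):
    """Probability h_{r,r'} of temporal feature x under one of the
    three distributions used for temporal feature modeling."""
    if dist == "bernoulli":  # x in {0, 1}
        p = params["p"]
        return p ** x * (1 - p) ** (1 - x)
    if dist == "gaussian":   # x is a time gap, e.g. RelationPairInterval
        mu, sigma = params["mu"], params["sigma"]
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    if dist == "exponential":
        lam = params["lam"]
        return lam * math.exp(-lam * x)
    raise ValueError(dist)

def phi(related, h, w, b):
    """Eq. (17): softmax-weighted combination of shifted probabilities
    h_{r,r'} + b_{r,r'} over the relations r' in R(e_c)."""
    denom = sum(math.exp(w[rp]) for rp in related)
    return sum(math.exp(w[rp]) * (h[rp] + b[rp]) for rp in related) / denom
```

For the Cass Canfield example, `prob(47, "gaussian", {"mu": 22, "sigma": 1})` reproduces the order of magnitude 10^{−137} quoted in the text, so such a candidate receives a vanishingly small pair score.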

B DATASET DISCUSSION

There are four benchmark temporal knowledge graph datasets: ICEWS (Lautenschlager et al. (2015)), GDELT (Leetaru & Schrodt (2013)), WIKIDATA (Leblay & Chekol (2018)), and YAGO (Mahdisoltani et al. (2014)). As for the temporal representation, the first two datasets use timestamps, while the last two use intervals. Compared with timestamp-based tKGs, interval-based tKGs are both more general and more difficult to learn. Thus, we mainly focus on the WIKIDATA and YAGO datasets in our experiments. WIKIDATA is a large knowledge base built on Wikipedia. To form the WIKIDATA12k dataset, a subgraph with temporal information was extracted by Dasgupta et al. (2018); it is guaranteed that every node is related to multiple facts, and the top 24 most frequent relations are selected. YAGO is another large knowledge graph built from multilingual Wikipedias. Similarly, temporally associated facts were distilled from YAGO3 to form the YAGO11k dataset (Dasgupta et al. (2018)); in this dataset, every node is connected by more than one edge, and the top 10 most frequent relations are selected. Both WIKIDATA12k and YAGO11k contain many time-sensitive relations, such as 'residence', 'position held', 'member of sports team', 'member of', and 'educated at' in WIKIDATA12k, and 'worksAt', 'playsFor', 'isAffiliatedTo', 'hasWonPrize', and 'owns' in YAGO11k. These time-sensitive relations make the link prediction task in these two datasets more challenging. For example, in the WIKIDATA12k dataset, a person can become a member of different teams, hold different positions, and receive different awards in various periods. Thus, it is necessary to model temporal information for link prediction tasks in tKGs.

D DETAILS OF THE MORE DIFFICULT PROBLEM SETTINGS

Few training samples. In the experiments, we randomly reduce the number of samples in the training set and evaluate the different models on the two datasets. To alleviate the effects of different data distributions, we repeat this experiment for 5 rounds. The results are shown in Fig. 2, where we draw the average MRR curve with error bars. When the training set size decreases, TILP outperforms all the baseline methods. Through constrained random walks, TILP is able to capture all the patterns related to a query relation, which are independent of entities; reducing the training set size only changes the frequency of different patterns. In contrast, embedding-based methods require enough training samples to learn good embeddings of entities and relations. In this setting, our method is also better than TLogic, which demonstrates the advantage of the neural-network-based logical rule learning framework over statistical methods.

Biased data. The original, randomly generated test set contains few queries of rare relations. To guarantee enough test queries for each relation, we adjust the distribution of queries in the test set, trying to make the numbers of queries of different relations equal. If queries of a certain relation are not enough, we randomly choose half of them. In Fig. 3, the average change of MRR for the different models is shown, where error bars have been suppressed for readability. We conclude that the attention vectors of predicates, temporal relations, and rule length are relation-dependent in TILP, making it less susceptible than other methods to data imbalance. In contrast, embeddings of entities are shared among all relations, making embedding-based methods more susceptible to data imbalance. As a statistical method, TLogic also fails since it cannot build dependencies between different rules.

Time shifting. The time ranges of the training, validation, and test sets are [−431, 2006], [2006, 2011], and [2011, 2022], respectively. The results are shown in Table 2.
One major limitation of most time-aware embedding-based methods is the use of absolute timestamps as anchors, preventing generalization to either time-shifting or inductive settings (Liu et al. (2021)). With this limitation in mind, TILP extracts temporal logical rules with relative temporal relations, providing greater flexibility, e.g., transfer learning to arbitrary temporal periods. In this setting, our method is still better than TLogic, which builds its model on timestamps and ignores the necessity of learning all possible temporal patterns from data.



Given a candidate e_c, we consider facts on both e_c and e_s, i.e., F_{c,s} := {(e_c, *, e_s, *)}; facts on e_c but not on e_s, i.e., F̄_{c,s} := {(e_c, *, *, *)} − F_{c,s}; and constrained random walks S^{WC} between e_s and e_c. The temporal feature modeling score ϕ_tfm of the candidate is given as

ϕ_tfm(e_c) = γ_1 ϕ_{tfm,1}(e_c) + γ_2 ϕ_{tfm,2}(e_c) + γ_3 ϕ_{tfm,3}(e_c)   (14)

where ϕ_{tfm,1}(e_c), ϕ_{tfm,2}(e_c), ϕ_{tfm,3}(e_c) are the scoring functions given F_{c,s}, F̄_{c,s}, and S^{WC}_{(e_s,e_c)}, respectively, and γ_1, γ_2, γ_3 ≥ 0 are the corresponding weights. The details of these scoring functions are shown in Appendix A.

α_c := N_c / N. By aggregating all the rules learned for the target predicate, we obtain the logical score ϕ_TLR. Combining it with the temporal feature modeling score ϕ_tfm via corresponding weights γ_TLR, γ_tfm ≥ 0, the final score ϕ_TILP becomes

ϕ_TLR(e_c) = Σ_{Rule} α_c(Rule) score(Rule)   (15)
ϕ_TILP(e_c) = γ_TLR ϕ_TLR(e_c) + γ_tfm ϕ_tfm(e_c)   (16)
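Eqs. (15) and (16) are plain weighted sums, which can be sketched as follows; the pair-list representation of rules is an illustrative simplification, not the paper's actual data layout.

```python
def phi_tlr(rules):
    """Eq. (15): sum over rules of alpha_c(Rule) * score(Rule).
    `rules` is a list of (arriving_rate, rule_score) pairs."""
    return sum(alpha * score for alpha, score in rules)

def phi_tilp(s_tlr, s_tfm, gamma_tlr, gamma_tfm):
    """Eq. (16): weighted combination of the logical-rule score
    and the temporal-feature-modeling score."""
    assert gamma_tlr >= 0 and gamma_tfm >= 0
    return gamma_tlr * s_tlr + gamma_tfm * s_tfm
```

For example, two rules with arriving rates 0.5 and 0.25 and scores 1.0 and 2.0 give ϕ_TLR = 1.0, which is then blended with ϕ_tfm by the learned weights.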

We categorize the baselines along two axes: static vs. temporal, and logical-rule-based vs. embedding-based. The static logical-rule-based methods include Neural-LP (Yang et al. (2017)) and AnyBURL (Meilicke et al. (2020)). The temporal logical-rule-based method is TLogic (Liu et al. (2021)). The static embedding-based model is ComplEx (Trouillon et al. (2016)). The temporal embedding-based models include TA-ComplEx (García-Durán et al. (2018)), HyTE (Dasgupta et al. (2018)), DE-SimplE (Goel et al. (2019)), TNT-Complex (Lacroix et al. (2020)), and TimePlex (Jain et al. (2020)).

Emmy Award for best drama series, E_3 = DR, E_4 = Peabody Awards, I_1 = [2005, 2005], I_2 = [2009, 2009], I_3 = [1997, 1997], I_4 = [2013, 2013].

h_rec ∈ R^{|R|}, and w_rec, b_rec ∈ R^{|R|} are learnable parameters. Given R(e_c) = R^{(1)}, we have ϕ_{rec,1}(e_c; h_{rec,1}, w_{rec,1}, b_{rec,1}) with parameters h_{rec,1}, w_{rec,1}, b_{rec,1} ∈ R^{|R|}; given R(e_c) = R^{(2)}, we have ϕ_{rec,2}(e_c; h_{rec,2}, w_{rec,2}, b_{rec,2}) with parameters h_{rec,2}, w_{rec,2}, b_{rec,2} ∈ R^{|R|}.
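As a concrete illustration, the recurrence score of eq. (18) reduces to a scaled, shifted Bernoulli probability. This scalar sketch uses made-up parameter values; in the model, p_rec, w_rec, b_rec are learned per relation.

```python
def phi_rec(r_in_relations, p_rec, w_rec, b_rec):
    """Eq. (18): x = 1(r in R(e_c)); the Bernoulli probability
    h = p^x (1-p)^(1-x) is scaled by w_rec and shifted by b_rec."""
    x = 1 if r_in_relations else 0
    h = p_rec ** x * (1 - p_rec) ** (1 - x)
    return w_rec * h + b_rec
```

With p_rec = 0.8, a candidate on which the query relation recurs scores 0.8, while one on which it does not scores only 0.2 (before scaling and shifting).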

Figure 2: Link prediction performance with few training samples

Figure 3: Link prediction performance with biased data

Temporal knowledge graph. A temporal knowledge graph (tKG) G is a collection of facts represented by quadruples (e_s, r, e_o, I). Such a fact, also called an edge or link, states that relation r holds from the subject entity e_s to the object entity e_o during interval I. We define an interval I by its start time t_s and end time t_e, i.e., I = [t_s, t_e]. To allow bidirectional random walks, we imagine the existence of inverse edges, i.e., (e_o, r^{-1}, e_s, I). The sets of entities, relations, timestamps, and intervals are denoted by E, R, T, and I, respectively.
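A minimal container for such a tKG, materializing the imaginary inverse edges so that random walks can traverse edges in both directions; the class and entity names are illustrative, not from the TILP implementation.

```python
from collections import defaultdict

class TKG:
    """Adjacency-list store of quadruples (e_s, r, e_o, [t_s, t_e])."""

    def __init__(self):
        # subject -> list of (relation, object, (t_s, t_e))
        self.out = defaultdict(list)

    def add(self, e_s, r, e_o, interval):
        t_s, t_e = interval
        assert t_s <= t_e, "interval must have t_s <= t_e"
        self.out[e_s].append((r, e_o, interval))
        # Inverse edge (e_o, r^-1, e_s, I) to allow bidirectional walks.
        self.out[e_o].append((r + "^-1", e_s, interval))

g = TKG()
g.add("David Beckham", "playsFor", "Real Madrid", (2003, 2007))
```

After the single `add` call, both the forward edge and its inverse are reachable, so a walk can step from Real Madrid back to David Beckham via playsFor^-1.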

f_P and f_TR denote filtering operators for predicate P and temporal relation TR, respectively, and g is the TR evaluation function with I := [t_s, t_e] and I' := [t'_s, t'_e].
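The TR evaluation function g compares two intervals. The exact set of temporal relations used by TILP is not spelled out in this excerpt; the sketch below assumes a small Allen-style set (before/after/equal/overlap) purely for illustration.

```python
def tr_eval(I, J):
    """Illustrative TR evaluation between I = [t_s, t_e] and
    J = [t'_s, t'_e]; returns a coarse temporal relation label."""
    (ts, te), (ts2, te2) = I, J
    if te < ts2:
        return "before"   # I ends strictly before J starts
    if te2 < ts:
        return "after"    # J ends strictly before I starts
    if ts == ts2 and te == te2:
        return "equal"
    return "overlap"      # any shared time point, including touching
```

Because rules are built from such relative relations rather than absolute timestamps, the same rule body matches regardless of when the pattern occurs, which is what enables the time-shifting generalization discussed in the experiments.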

Link prediction performance on the two benchmark datasets

Suppose that by applying a temporal logical Rule, we find a total of N successful random walks, N_c of which arrive at entity e_c. With the assumption that each walk contributes equally, we can calculate the arriving rate
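The arriving rate α_c := N_c / N is just a normalized endpoint count over successful walks; a short sketch (the list-of-endpoints input is an illustrative simplification):

```python
from collections import Counter

def arriving_rates(walk_endpoints):
    """alpha_c := N_c / N for the endpoints of the N successful
    random walks found when applying one temporal logical rule."""
    n = len(walk_endpoints)
    counts = Counter(walk_endpoints)
    return {e: c / n for e, c in counts.items()}
```

For instance, if 4 walks end at entities a, a, b, c, then a receives α = 0.5 and b, c each receive α = 0.25, since every walk contributes equally.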

Detailed dataset introduction and statistics are given in Appendix B. These datasets contain temporal facts in the form (e s , r, e o , I), e.g. (John Okell, worksAt, SOAS University of London,

Details of the setting and the results are shown in Appendix D. It is observed that when the training set size decreases, TILP outperforms all the baseline methods. Through constrained random walks, TILP is able to capture all the patterns related to a query relation, which are independent of entities; reducing the training set size only changes the frequency of different patterns. In contrast, embedding-based methods require enough training samples to learn good embeddings of entities and relations.

Biased data. Another common problem in knowledge graph learning is data imbalance: for some rare relations, it is hard to collect samples. For example, in the WIKIDATA12k dataset, which contains 40621 edges in total, the numbers of edges for the relations capitalOf and residence are only 86 and 80, respectively. Achieving a good model for every relation in the dataset therefore requires the ability to handle biased representation frequencies. Details of the setting and the results are shown in Appendix D. From the results, we conclude that the attention vectors of predicates, temporal relations, and rule length are relation-dependent in TILP, making it less susceptible than other methods to data imbalance. In contrast, embeddings of entities are shared among all relations, making embedding-based methods more susceptible to data imbalance.

Link prediction performance with time shifting setting

Given R(e_c) = R^{(1)}, we calculate ϕ_{pair,1}(e_c; h_{pair,1}, w_{pair,1}, b_{pair,1}) with (17), where h_{pair,1} ∈ R^{|R|×|R|} is based on {μ_{pair,1}, σ_{pair,1}} or λ_{pair,1} with x = |t_s − (T_s)^1_{r'}|, and w_{pair,1}, b_{pair,1} ∈ R^{|R|×|R|} are learnable parameters. Similarly, given R(e_c) = R^{(2)}, we calculate ϕ_{pair,2}(e_c; h_{pair,2}, w_{pair,2}, b_{pair,2}) with parameters h_{pair,2}, w_{pair,2}, b_{pair,2} ∈ R^{|R|×|R|}, where h_{pair,2} is based on {μ_{pair,2}, σ_{pair,2}} or λ_{pair,2} with x = |t_s − (T_s)^2_{r'}|; given R(e_c) = R^{(3)}, we calculate ϕ_{pair,3}(e_c; h_{pair,3}, w_{pair,3}, b_{pair,3}) with parameters h_{pair,3}, w_{pair,3}, b_{pair,3} ∈ R^{|R|×|R|}, where h_{pair,3} is based on {μ_{pair,3}, σ_{pair,3}} or λ_{pair,3} with x = |t_s − (T_s)^3_{r'}|.

• Duration describes the probability distribution of the interval length of every relation. We suppose this probability follows a truncated Gaussian distribution with parameters μ_d, σ_d ∈ R^{|R|} on the interval [0, +∞). It is common in large tKGs that the exact dates of some events are missing; with this feature, we can estimate these missing dates and improve model performance. For example, given a fact or query with an incomplete interval I = [t_s, ?], we generate a duration t_d ∼ ψ((μ_d)_r, (σ_d)_r, 0, +∞), where r is the corresponding relation, and set Î = [t_s, t_s + t_d].
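Completing an interval with the Duration feature can be sketched by rejection sampling from the truncated Gaussian ψ; rejection sampling is one simple way to realize ψ, assumed here for illustration rather than taken from the paper.

```python
import random

def sample_duration(mu, sigma, low=0.0, rng=random):
    """Sample t_d from a Gaussian truncated to [low, +inf)
    by simple rejection sampling."""
    while True:
        t_d = rng.gauss(mu, sigma)
        if t_d >= low:
            return t_d

def complete_interval(t_s, mu_r, sigma_r):
    """Given an incomplete interval I = [t_s, ?], estimate
    I_hat = [t_s, t_s + t_d] with t_d ~ psi(mu_r, sigma_r, 0, +inf)."""
    t_d = sample_duration(mu_r, sigma_r)
    return (t_s, t_s + t_d)
```

For instance, a playsFor fact known to start in 2003 with a missing end date would be completed with a relation-specific duration drawn from the learned distribution, guaranteeing a non-negative interval length.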

Table 3 shows the dataset statistics with a yearly time resolution, where G, E, R, T, I are the sets of edges, entities, relations, timestamps, and intervals, respectively.

Dataset statistics with a yearly time resolution

N_path is the maximum number of paths for a given rule that can be found in a single example. The time complexity of the tfm module (distribution parameter estimation) is O(N_pos(δ + |R| + L N_rule N_path)), where δ is the maximum node degree. Similarly, for the rule application process, the time complexity of the TLR module is O(N_qry N'_rule (L|G| + L² N'_path)), where N_qry is the number of queries, N'_rule is the maximum number of rules for a given target predicate, and N'_path is the maximum number of paths for a given rule that can be found in a single query. The time complexity of the tfm module is O(N_qry(Kδ + K|R| + L N'_rule N'_path)), where K is the maximum number of candidates for a single query. Since the path search for each positive example and the rule application for each query are both independent, these processes are run in parallel. Given a maximum rule length of L = 5, the rule search of the TLR module on a 4-CPU machine takes 1740.6 s on the WIKIDATA12k training set and 571.9 s on the YAGO11k training set. The distribution parameter estimation of the tfm module takes 44.1 s on the WIKIDATA12k training set and 6.7 s on the YAGO11k training set. The rule application of the TLR module takes 1522.8 s on the WIKIDATA12k validation set and 529.6 s on the YAGO11k validation set. The scoring of the tfm module takes 1713.8 s on the WIKIDATA12k validation set and 527.8 s on the YAGO11k validation set.

ACKNOWLEDGMENTS

This work was supported by a sponsored research award by Cisco Research.

