WEIGHTED CLOCK LOGIC POINT PROCESS

Abstract

Datasets involving multivariate event streams are prevalent in numerous applications. We present a novel framework for modeling temporal point processes called clock logic neural networks (CLNN) which learn weighted clock logic (wCL) formulas as interpretable temporal rules by which some events promote or inhibit other events. Specifically, CLNN models temporal relations between events using conditional intensity rates informed by a set of wCL formulas, which are more expressive than related prior work. Unlike conventional approaches of searching for generative rules through expensive combinatorial optimization, we design smooth activation functions for components of wCL formulas that enable a continuous relaxation of the discrete search space and efficient learning of wCL formulas using gradient-based methods. Experiments on synthetic datasets demonstrate our model's ability to recover the ground-truth rules and improve computational efficiency. In addition, experiments on real-world datasets show that our models perform competitively when compared with state-of-the-art models.

1. INTRODUCTION AND RELATED WORK

Multivariate event streams are an emerging type of data that involve occurrences of different types of events in continuous time. Event streams are observed in a wide range of applications, including but not limited to finance (Bacry et al., 2015), politics (O'Brien, 2010), system maintenance (Gunawardana et al., 2011), healthcare (Weiss & Page, 2013), and social networks (Farajtabar et al., 2015). As opposed to time series data, which typically comprise continuous-valued variables evolving at regular discrete time stamps, event streams involve events occurring irregularly and asynchronously in continuous time. Modeling the dynamics in event streams is important for a wide range of scientific and industrial processes, such as predicting the occurrence of events of interest or understanding why some deleterious events occur so as to possibly prevent their occurrence. A (multivariate) temporal point process (TPP) provides a formal mathematical framework for representing event streams, where a conditional intensity rate for each event measures its occurrence rate at any time given the historical events in the stream (Daley & Vere-Jones, 2003; Aalen et al., 2008). There has been a proliferation of research around TPPs in recent years, particularly around the use of neural networks for modeling conditional intensity rates as a function of historical occurrences (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017; Xu et al., 2017; Gao et al., 2020; Zhang et al., 2020; Zuo et al., 2020). One stream of research studies graphical event models (GEMs) as a compact and interpretable graphical representation for TPPs, where the conditional intensity rate for any particular event depends only on the history of a subset of the events (Didelez, 2008; Gunawardana & Meek, 2016).
While any TPP can be represented as a GEM, various models make assumptions about the parametric form of conditional intensity rates for the sake of learnability, for instance that rates are piece-wise constant with respect to occurrences within historical windows (Gunawardana et al., 2011; Bhattacharjya et al., 2018). Ordinal GEMs (OGEMs) (Bhattacharjya et al., 2020; 2021) are a recent model from this family where a conditional intensity rate depends on the order in which parent events occur within the most recent historical time period. A temporal logic point process (TLPP) framework was proposed as an alternate way to lend some interpretability to TPPs by modeling intensity rates using temporal logic rules (Li et al., 2020). Although the initial work pre-specified temporal logic rules, recent work has introduced a temporal logic rule learner (TELLER) for automatically discovering rules (Li et al., 2021). There is, however, a scalability issue, since TELLER relies on an expensive branch-and-price algorithm to search for temporal logic rules in a discrete space. Another important limitation of this work is that TELLER's rules are not informative enough to explain how the interval length between ordered events impacts the conditional intensity rate. For instance, when predicting the occurrence of diabetes, the rule "insulin injection happens 20 minutes before eating a meal" is more informative and accurate in predicting "blood glucose remains normal" than the rule "insulin injection happens before eating a meal", as the latter rule cannot expose the interval between 'insulin injection' and 'eating a meal'. To tackle the above limitations, we propose novel atomic predicates that enrich the expressiveness of temporal logic rules, as well as a differentiable framework to learn rules in an end-to-end manner.
This work introduces a differentiable neuro-symbolic framework, the clock logic neural network (CLNN), to model TPPs by learning weighted clock logic (wCL) formulas as explanations. First, event streams are converted into continuous-time clock signals representing the time interval between the last occurrence of an event and the current time. Next, we propose novel wCL formulas that describe the underlying temporal relations together with relative interval lengths, enabling the design of a CLNN to learn the generative mechanisms. Instead of searching for temporal logic rules in a vast discrete space, CLNN associates every neuron with an order representation or a logical operator and assigns weights to edges to reflect the importance of various inputs, which relaxes the search space to be continuous. Moreover, architecture weights are introduced into CLNN to make the formula structure search differentiable. wCL formula-informed intensity rates are carefully designed so that the parameters appearing in the rules can be learned through maximum likelihood estimation using gradient-based approaches. CLNN is tested on synthetic datasets to show that it can recover the ground-truth rules, as well as on real-world datasets to demonstrate its model-fitting performance.

2. PRELIMINARIES

2.1. NOTATION & BACKGROUND

Let L denote the set of event labels, and M = |L| the number of event labels. An event stream is a sequence of events with time stamps, denoted D = {(l_1, t_1), (l_2, t_2), ..., (l_N, t_N)}, where t_i ∈ R_+ denotes a time stamp between the beginning time t_0 = 0 and end time t_{N+1} = T, and l_i ∈ L is the event label that happens at t_i. We refer to 'event label' and 'label' interchangeably. Every event label l ∈ L has an associated conditional intensity rate describing the occurrence rate of label l at t given the history up to t. In multivariate temporal point processes, conditional intensity rates describe the dynamics of events. Let H_t = {(l_i, t_i) : t_i < t} denote the historical events up to time t. The conditional intensity rate of event label l is denoted λ_l(t|H_t). Specifically, λ_l(t|H_t) describes the expected number of occurrences of event label l in an infinitesimal interval [t, t+Δt] given the history H_t, i.e., λ_l(t|H_t) = lim_{Δt→0} E[N_l(t + Δt) - N_l(t) | H_t] / Δt, where N_l(t) denotes the number of occurrences of event label l up to t.

Example 1 A running example of an event stream with 11 events of 4 labels is shown in Figure 1(a).

[Figure 1 caption: (a) a running example of an event stream; (b) the overall workflow of the proposed method (POC: paired order cell, SOC: singleton order cell, AC: architecture cell; details presented in Sections 2.2 to 3.3).]
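As a concrete illustration of the notation (a hypothetical sketch, not code from the paper), an event stream can be stored as a list of (label, time) pairs, with the counting process N_l(t) obtained by counting occurrences of a label up to time t:

```python
from typing import List, Tuple

Event = Tuple[str, float]

def count_process(stream: List[Event], label: str, t: float) -> int:
    """N_l(t): number of occurrences of `label` up to time t."""
    return sum(1 for (l, ti) in stream if l == label and ti <= t)

# The history H_13 from Example 1:
H13 = [("A", 1), ("B", 3), ("A", 6), ("D", 8), ("C", 10), ("D", 12)]
```

For instance, `count_process(H13, "A", 13)` counts both occurrences of A in the running example.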

2.2. ORDER REPRESENTATIONS FOR EVENT STREAMS

The overall workflow of the proposed framework is visualized in Figure 1(b). The raw event streams first go through a masking function to generate masked event streams, which are then transformed into event clocks using a clocking function. The event clocks are given as inputs to the clock logic neural network (CLNN) to learn interpretable wCL formulas and the intensity rate of event occurrences. The following sections provide a detailed explanation of each module in Figure 1(b). We are interested in exploring the effect of the temporal ordering between event labels, and of the occurrences of causal event labels in a historical window, on the occurrence rate of a particular event label, where the generative mechanism is expressed as interpretable formulas. An event stream up to t may include multiple occurrences of the same event label, so a masking function is required to mask out duplicated event labels in the history when accessing the ordering information at any t. Here we adopt a technique similar to Bhattacharjya et al. (2020) for extracting distinct event labels from H_t.

Definition 1 (Masking Function) A masking function Γ(·) takes an event stream as input and returns a new event stream that is a subset of the input stream and contains no duplicated event labels. Mathematically, Γ(·) is applied to H_t = {(l_i, t_i)} and converts it into a new stream H′_t = {(l_j, t_j) ∈ H_t : l_j ≠ l_{j′} if j ≠ j′}.

We consider the following two masking functions as per Bhattacharjya et al. (2020) due to their simplicity: 'first' masking and 'last' masking. The 'first' (resp. 'last') masking function keeps the first (resp. last) occurrence of each event label in an event stream.

Example 1 (cont.) Let H_13 = {(A, 1), (B, 3), (A, 6), (D, 8), (C, 10), (D, 12)}. The 'first' masking function converts it to H′_13 = {(A, 1), (B, 3), (D, 8), (C, 10)}, and the 'last' masking function converts it to H′_13 = {(B, 3), (A, 6), (C, 10), (D, 12)}.
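A minimal sketch of the two masking functions (the function name is ours, not the paper's):

```python
from typing import List, Tuple

Event = Tuple[str, float]

def mask(stream: List[Event], mode: str = "first") -> List[Event]:
    """'first' keeps the earliest occurrence of each label, 'last' the latest;
    the result is returned sorted by time, per Definition 1."""
    kept = {}
    for label, t in stream:
        if label not in kept:
            kept[label] = t
        elif mode == "first":
            kept[label] = min(kept[label], t)
        else:
            kept[label] = max(kept[label], t)
    return sorted(kept.items(), key=lambda e: e[1])
```

Applied to H_13 above, this reproduces the 'first' and 'last' masked streams of the running example.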
With the masked event history H′_t, we define two order representations: one for the order relationship between any two event labels, and one for the occurrence of an event within a historical window of t.

Definition 2 (Paired Order Representation (POR)) A paired order representation is defined as [l_i, l_j] ∈ [L]^2, where [L]^2 denotes the two-element permutations of a subset of L. A paired order representation for H′_t can be obtained by arranging any two distinct labels in H′_t in sequential order.

Definition 3 (Singleton Order Representation (SOR)) A singleton order representation is denoted [l_j, u_{l_j}] ∈ L × R_+, representing that event label l_j ∈ L occurred within the past u_{l_j} time units, where u_{l_j} is a variable learned through a process explained in Section 3.

2.3. WEIGHTED CLOCK LOGIC FORMULA

To adapt H′_t to continuous-time signals that can be described by logical statements, we extract clock signals from H′_t to describe the time passed since the last occurrence of a label. A clocking function is introduced to convert t_j into a clock signal c_j denoting the length of the time interval between t_j and t.

Definition 4 (Clocking Function) A clocking function Ξ(·) converts H′_t into a vector of clock signals C′(t) = [c_1(t), c_2(t), ..., c_M(t)]^T ∈ R^M_+, with c_i(t) denoting the clock signal for event label i ∈ L, where c_i(t) = t - t_j if (l_j, t_j) ∈ H′_t and l_j = i, and c_i(t) = Z otherwise. Note that Z is a user-defined, large positive number indicating that event label i does not occur in H′_t.

Example 1 (cont.) Taking the 'first' masked event stream H′_13 = {(A, 1), (B, 3), (D, 8), (C, 12)} as an example, the event clocks are extracted as C′(13) = [12, 10, 1, 5]^T.

The event clocks essentially provide the ordering between any two event labels, in that the difference between any two event labels' clock signals reflects which event label happens first. As shown in the diabetes prediction example in the Introduction, the time interval between ordered events is notably important in explaining and predicting an event label's occurrence. In contrast to Li et al. (2020; 2021), which only learn the temporal ordering relation between event labels, we define a paired order predicate (POP) with a learnable parameter u_{l_i l_j} to describe the time interval between two ordered event labels l_i and l_j, and a singleton order predicate (SOP) with a learnable parameter u_{l_j} to describe the occurrence of label l_j within a historical window u_{l_j}, as follows.

Definition 5 (Paired Order Predicate) A POP describes the order between two labels l_i, l_j ∈ L, l_i ≠ l_j, denoted π^{l_i,l_j}_pop := g(c_{l_i}, c_{l_j}) = c_{l_i} - c_{l_j} > u_{l_i l_j}, where u_{l_i l_j} ∈ R is a parameter to learn.
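The clocking function of Definition 4 can be sketched as follows (assuming, as in the text, a large constant Z for absent labels):

```python
def clocks(masked, labels, t, Z=1e6):
    """C'(t): for each label i, c_i(t) = t - t_j if i occurs (at t_j) in the
    masked history, and c_i(t) = Z otherwise."""
    last = {label: tj for (label, tj) in masked}
    return [t - last[l] if l in last else Z for l in labels]
```

Applied to the 'first' masked stream of the running example at t = 13, this yields the clocks C′(13) shown above.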
A positive u_{l_i l_j} means l_i happened before l_j for at least u_{l_i l_j} time units, and a negative u_{l_i l_j} means l_j happened before l_i for at most -u_{l_i l_j} time units. A POP is used in the POC of Figure 1(b).

Definition 6 (Singleton Order Predicate) An SOP describes a causal label l_j ∈ L occurring within the past u_{l_j} time units, defined as π^{l_j}_sop := c_{l_j} - u_{l_j} < 0, where u_{l_j} ∈ R_+ is a learnable parameter.

Instead of taking a heuristic approach to some underlying combinatorial search problem for a given set of temporal predicates (Bhattacharjya et al., 2020; 2021; Li et al., 2021) to uncover the effective order relations, this work proposes a differentiable learning model to learn suitable singleton and paired order predicates among all possible choices of order predicates through a gradient-based approach. The scheme of weighted signal temporal logic (wSTL) in Yan et al. (2021; 2022) is exploited to build weighted clock logic (wCL) formulas that are logical compositions of singleton and paired order predicates. The syntax of wCL is recursively defined as (Mehdipour et al., 2021):

ϕ := π^{l_i,l_j}_pop | π^{l_j}_sop | ¬ϕ | ϕ_1^{w_1} ∧ ϕ_2^{w_2} ∧ ⋯ ∧ ϕ_k^{w_k} | ϕ_1^{w_1} ∨ ϕ_2^{w_2} ∨ ⋯ ∨ ϕ_k^{w_k},

where ϕ_1, ..., ϕ_k are wCL formulas, ¬ denotes negation, ∧ denotes logical conjunction, ∨ denotes logical disjunction, and w_j ≥ 0, j = 1, ..., k, are non-negative weights assigned to ϕ_1, ..., ϕ_k in the conjunction and disjunction operations. A wCL formula can describe the characteristics of H_t, thus the conditional intensity rate of event l given H_t can be equivalently denoted λ_{l|ϕ}(t).

Remark 7 The syntax above means each wCL formula can be built from predicates π^{l_i,l_j}_pop or π^{l_j}_sop by recursively applying the ¬, ∧, or ∨ operations.

Example 1 (cont.) A wCL formula example is ϕ = (c_A - c_B > 1)^1 ∧ (c_C < 3)^{0.05}.
The first and second clauses read "A happened before B for at least one day" and "C happened less than 3 days ago", respectively. Note that ϕ is satisfied by the event stream up to t = 13 in Figure 1(a). The two clauses have weights of 1 and 0.05, reflecting that the first clause is more important than the second.
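Before introducing the smooth semantics of Section 3.1, the example formula can be read crisply (a Boolean sketch that ignores the weights; names are ours):

```python
def phi(c):
    """Crisp reading of ϕ = (c_A - c_B > 1) ∧ (c_C < 3) on clock signals c."""
    return (c["A"] - c["B"] > 1) and (c["C"] < 3)

C13 = {"A": 12, "B": 10, "C": 1, "D": 5}   # event clocks at t = 13
```

On the running example's clocks, both clauses hold (c_A - c_B = 2 > 1 and c_C = 1 < 3), so ϕ is satisfied.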

3.1. TRUTH DEGREE OF WEIGHTED CLOCK LOGIC

To quantitatively measure the satisfaction degree of a wCL formula ϕ over the event clocks C′(t), i.e., how well ϕ describes the underlying patterns of C′(t), we propose smooth activation functions (AFs) to compute the truth degree, denoted p(C′, ϕ, t) ∈ [0, 1], defined as (Riegel et al., 2020):

p(C′, π^{l_i,l_j}_pop, t) = sigmoid(c_{l_i}(t) - c_{l_j}(t) - u_{l_i l_j}),   (2)
p(C′, π^{l_j}_sop, t) = sigmoid(u_{l_j} - c_{l_j}(t)),   (3)
p(C′, ¬ϕ, t) = 1 - p(C′, ϕ, t).   (4)

In contrast to the combinatorial search over temporal logic predicates in Li et al. (2021), the smooth design of the AFs in (2)-(4) benefits the maximum likelihood estimation problem shown later in Section 3.6 by allowing the parameters of the POPs and SOPs to be learned through gradient-based methods. Next, we present the design of the AF for the ∧ operator, using a 2-ary conjunction operator to motivate the design. Let p_∧ = p(C′, ϕ_1^{w_1} ∧ ϕ_2^{w_2}, t) ∈ [0, 1]. Intuitively, p_∧ is low when either input is low, and p_∧ is high when both inputs are high. Here we adopt an idea similar to Sen et al. (2022) for capturing low and high. A user-defined hyperparameter α ∈ [1/2, 1] is introduced to aid the interpretability of low and high, such that p_∧ represents high if p_∧ ∈ [α, 1] and low if p_∧ ∈ [0, 1-α]. Considering the importance weights, a low input with a zero weight should not impact the output, and p_∧ should be low when both inputs are low. With these considerations, the AF for the ∧ operator is defined as follows (see Appendix A for more details):

p(C′, ϕ_1^{w_1} ∧ ϕ_2^{w_2} ∧ ⋯ ∧ ϕ_k^{w_k}, t) = f(β - Σ_{j=1}^k w_j (1 - p(C′, ϕ_j, t))),   (5)
subject to β - Σ_{j=1}^k w_j (1 - α) ≥ α,  β - Σ_{j=1}^k w_j α ≤ 1 - α,

where f(z) = max{0, min{z, 1}} clamps the truth degree into [0, 1], and w_j ≥ 0 and β ≥ 0 are parameters to learn.
By De Morgan's law (Hurley, 2014), the AF for the ∨ operator is defined as

p(C′, ϕ_1^{w_1} ∨ ϕ_2^{w_2} ∨ ⋯ ∨ ϕ_k^{w_k}, t) = f(1 - β + Σ_{j=1}^k w_j p(C′, ϕ_j, t)),   (6)
subject to 1 - β + Σ_{j=1}^k w_j α ≥ α,  1 - β + Σ_{j=1}^k w_j (1 - α) ≤ 1 - α.

An event stream with M event labels generates P^2_M = M!/(M-2)! paired order predicates and M singleton order predicates. If a conjunction or disjunction operator takes these predicates as inputs, how it recognizes the effective order predicates in describing the event dynamics becomes a critical issue. By carefully designing the AFs in (5)-(6), the logical operators exhibit the following properties for recognizing effective inputs. This is a critical advantage over Bhattacharjya et al. (2020; 2021); Li et al. (2021) in that it allows a differentiable search for the suitable predicates among all possible choices of order predicates in an end-to-end manner. Here we illustrate the properties for ∧ with two inputs, which generalize to k-ary inputs (see Appendix B for more details).

Theorem 8 The AF for the ∧ operator with two inputs exhibits the following properties. 1) Non-impact for zero weights: if w_j = 0, j = 1, 2, then p(C′, ϕ_j, t) has no impact on p(C′, ϕ_1 ∧ ϕ_2, t). 2) Impact ordering: if p(C′, ϕ_1, t) = p(C′, ϕ_2, t) and w_1 ≥ w_2, then ∂p(C′, ϕ_1 ∧ ϕ_2, t)/∂p(C′, ϕ_1, t) ≥ ∂p(C′, ϕ_1 ∧ ϕ_2, t)/∂p(C′, ϕ_2, t). 3) Monotonicity: f(β - Σ_{j=1}^2 w_j (1 - p(C′, ϕ_j, t))) ≤ f(β - Σ_{j=1}^2 w_j (1 - (p(C′, ϕ_j, t) + d))) for d ≥ 0.
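The activation functions in (2)-(6) can be sketched in plain Python (a hypothetical scalar implementation; the paper trains with gradient-based optimization over tensors instead):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def clamp(z: float) -> float:
    """f(z) = max{0, min{z, 1}}."""
    return max(0.0, min(z, 1.0))

def p_pop(c_i: float, c_j: float, u: float) -> float:
    """Truth degree of the POP c_i - c_j > u, eq. (2)."""
    return sigmoid(c_i - c_j - u)

def p_sop(c_j: float, u: float) -> float:
    """Truth degree of the SOP c_j < u, eq. (3)."""
    return sigmoid(u - c_j)

def p_and(ps, ws, beta):
    """Soft conjunction f(beta - sum w_j (1 - p_j)), eq. (5)."""
    return clamp(beta - sum(w * (1 - p) for w, p in zip(ws, ps)))

def p_or(ps, ws, beta):
    """Soft disjunction f(1 - beta + sum w_j p_j), eq. (6)."""
    return clamp(1 - beta + sum(w * p for w, p in zip(ws, ps)))
```

With w = (1, 1) and β = 1, p_and is 1 when both inputs are 1 and 0 when either input is 0; p_or is the De Morgan dual, i.e., p_or(ps) = 1 - p_and([1 - p for p in ps]) for the same w and β.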

3.2. LEARNING OF PAIRED ORDER REPRESENTATION

With the smooth AFs designed in (2)-(6), a neuro-symbolic model called the clock logic neural network (CLNN) can be designed for any given wCL formula ϕ, in which every neuron has a corresponding symbolic representation. A typical CLNN for ϕ = (π^{A,B}_pop ∧ π^{B,C}_pop) ∨ π^A_sop is visualized in Figure 2(b), which can be considered as the discrete structure obtained by learning the parameters of the model in Figure 2(a) and keeping the dominant components. Here ϕ can be interpreted as "(A happens before B for at least u_AB time units and B happens before C for at least u_BC time units) or A happens within the past u_A time units." This part describes the continuous relaxation of the search space by designing a paired order cell, a singleton order cell, and an architecture cell for learning the paired order representations, the singleton order representations, and the formula structure, respectively.

Paired Order Cell (POC).

A POC is a directed acyclic graph (DAG) comprising two paired order predicate (POP) nodes and one logical node for the ∧ operator, shown as an orange block in Figure 2(a). Of the two POP nodes, π^{l_i,l_j}_pop denotes "l_i happened before l_j for at least u_{l_i,l_j} time units" and π^{l_j,l_i}_pop denotes "l_j happened before l_i for at least u_{l_j,l_i} time units". Each POP has an associated weight w^{l_i,l_j}_pop or w^{l_j,l_i}_pop to be learned, and the ∧ operator forces one of the two weight parameters to dominate the other such that the learned POR is consistent with the event stream. For example, the POC in Figure 2(a) aims to learn the POR between A and B, whose discretized version would be either π^{A,B}_pop or π^{B,A}_pop. An event stream with M event labels can generate P^2_M = M!/(M-2)! PORs between any two event labels, resulting in P^2_M/2 POCs. Similar to learning the POR between any two events, the discrete order representations for the entire history H_t can be learned using a POP selection node (as shown in Figure 2(a)) that takes the outputs of all the POCs as input and identifies the important PORs. The learning of the POCs essentially becomes learning the w, β in (5) for the POCs and the POP selection node, as well as u_{l_i l_j} in (2) for the POPs, through back propagation. The discrete PORs can be acquired by keeping the top-k strongest POCs and the dominant POPs.
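A paired order cell's forward pass can be sketched as follows (a simplified scalar version with hypothetical parameter names; each POP keeps its own interval parameter):

```python
import math

def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))
def clamp(z): return max(0.0, min(z, 1.0))

def poc_forward(c_i, c_j, u_ij, u_ji, w_ij, w_ji, beta):
    """Soft-AND over the two opposing POPs of a POC; training drives one of
    w_ij, w_ji toward zero so that a single POR dominates."""
    p_ij = sigmoid(c_i - c_j - u_ij)   # pi^{i,j}: i before j by more than u_ij
    p_ji = sigmoid(c_j - c_i - u_ji)   # pi^{j,i}: j before i by more than u_ji
    return clamp(beta - w_ij * (1 - p_ij) - w_ji * (1 - p_ji))
```

When w_ji = 0, the cell's output depends only on π^{i,j}, matching the non-impact property of Theorem 8.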

3.3. LEARNING OF SINGLETON ORDER REPRESENTATION

Singleton Order Cell (SOC). The learning of SORs is accomplished by an SOC, displayed as a green block in Figure 2(a). An SOC is a DAG comprising M singleton order predicate (SOP) nodes and one SOP selection node for the ∧ operator. An SOP node represents π^{l_j}_sop; it takes c_{l_j}(t) as input and returns the truth degree of π^{l_j}_sop over c_{l_j}(t). The SOP selection node has the same functionality as the POP selection node. The ∧ operator in the SOP selection node assigns a non-negative weight to every SOP node and learns the importance weights w and β to extract the dominant SORs affecting the conditional intensity rate the most. The learning of the SOC is thus learning the w, β in (5) for the SOP selection node and u_{l_j} in (3) for the SOPs through back propagation. The discrete SORs can be determined by keeping the top-k strongest SOPs.
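Analogously, a singleton order cell's forward pass is a soft-AND over the M SOPs (again a scalar sketch with hypothetical names):

```python
import math

def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))
def clamp(z): return max(0.0, min(z, 1.0))

def soc_forward(c, u, w, beta):
    """c: clock signals; u: per-label window lengths; w, beta: SOP selection
    node parameters. Returns f(beta - sum_j w_j (1 - sigmoid(u_j - c_j)))."""
    ps = [sigmoid(uj - cj) for cj, uj in zip(c, u)]
    return clamp(beta - sum(wj * (1 - pj) for wj, pj in zip(w, ps)))
```

SOPs with near-zero learned weights drop out of the output, which is how the cell extracts the dominant SORs.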

3.4. LEARNING OF FORMULA STRUCTURE

Architecture Cell (AC). For a given set of PORs or SORs, their conjunction or disjunction will behave differently and have distinct meanings. For instance, given two causal formulas ϕ_1 = (c_A - c_B > 1)^1 ∧ (c_C < 5)^1 and ϕ_2 = (c_A - c_B > 1)^1 ∨ (c_C < 5)^1 for the occurrence of event label D, ϕ_1 means "(A happens before B for at least 1 time unit) and (C happens within the past 5 time units) simultaneously will cause D to happen", whereas ϕ_2 means "(A happens before B for at least 1 time unit) or (C happens within the past 5 time units) alternatively will cause D to happen." The aforementioned cells can learn the order representations; nevertheless, whether their outputs should be connected by the ∧ or the ∨ operator needs to be determined. Here we consider the outputs of the POCs and the SOCs as having two choices of being connected, by a ∧ or a ∨ operator, each of which is associated with an architecture weight α^∧_arc or α^∨_arc that enables continuous learning over the two choices; this is also called differentiable architecture search (Liu et al., 2019). An architecture cell is introduced for learning the model architecture, comprising two logical nodes representing a ∧ operator and a ∨ operator, as well as a logical selection node (LSN), shown as the blue block in Figure 2(a). Let p = {p_1, ..., p_k} denote the set of inputs for each logical operator. The conjunction operator takes p as input and returns p_∧ = f(β^∧ - Σ_{j=1}^k w^∧_j (1 - p_j)), and the disjunction operator takes p as input and returns p_∨ = f(1 - β^∨ + Σ_{j=1}^k w^∨_j p_j). The LSN, represented by ⊖, takes p_∧ and p_∨ as inputs and returns their weighted sum, where the weights are computed using the softmax of the architecture weights:

p_⊖ = p(C′, ϕ, t) = Σ_{m∈{∧,∨}} (e^{α^m_arc} / Σ_{m′∈{∧,∨}} e^{α^{m′}_arc}) p_m.   (7)
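The architecture cell's continuous relaxation can be sketched as a softmax mixture of the two operators' outputs (hypothetical scalar code):

```python
import math

def clamp(z): return max(0.0, min(z, 1.0))

def ac_forward(ps, w_and, b_and, w_or, b_or, a_and, a_or):
    """p = softmax(alpha) . (p_and, p_or): a differentiable choice of operator."""
    p_and = clamp(b_and - sum(w * (1 - p) for w, p in zip(w_and, ps)))
    p_or = clamp(1 - b_or + sum(w * p for w, p in zip(w_or, ps)))
    e_and, e_or = math.exp(a_and), math.exp(a_or)
    return (e_and * p_and + e_or * p_or) / (e_and + e_or)
```

After training, the operator with the larger architecture weight is retained, discretizing the cell.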
The task of architecture search then reduces to learning the architecture weights α^∧_arc, α^∨_arc and the w, β in (5)-(6) for the two logical operators, which can be executed simultaneously while learning the parameters in the POCs and SOCs. The outcome of the architecture search process is a discrete architecture obtained by retaining the logical operator with the strongest architecture weight.

3.5. WCL-INFORMED INTENSITY FUNCTION

The output of a CLNN is the truth degree of ϕ over C′ at t, which is incorporated into the modeling of the conditional intensity rates. The modeling process aims to discover the generative mechanism as wCL formulas for every l ∈ L. In other words, a larger value of p(C′, ϕ, t) should reflect that ϕ has a greater impact on the occurrence of a particular label. For example, if the wCL formula affecting the occurrence of event label D is given as ϕ = (π^{A,B}_pop)^{w_1} ∧ (π^C_sop)^{w_2}, then a satisfied ϕ, or a high truth degree of ϕ, has a strong impact on the occurrence of D, where the impact can be promoting or inhibiting. In terms of the relation between the truth degree and the conditional intensity rate, the higher the truth degree p(C′, ϕ, t), the greater its impact on λ_{D|ϕ}. Note that the occurrence of one event label may depend on multiple wCL formulas. This work assumes that the impacts of multiple formulas are additive in predicting the intensity rate, similar to Li et al. (2020). To incorporate a set of wCL formulas Φ = {ϕ_1, ϕ_2, ..., ϕ_n} into the modeling of the conditional intensity rate, we define a wCL formula-informed conditional intensity rate as:

λ_{l|Φ}(t) = exp(Σ_{i=1}^n w_{ϕ_i} p(C′, ϕ_i, t) + ρ),   (8)

where w_{ϕ_i} is the weight of ϕ_i, and ρ is a bias term that allows for spontaneous occurrence without influence from Φ.
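The wCL formula-informed intensity rate in (8) can be sketched in one line (the truth degrees would come from the n CLNNs; names are ours):

```python
import math

def intensity(truth_degrees, w_phi, rho):
    """lambda_{l|Phi}(t) = exp(sum_i w_{phi_i} p(C', phi_i, t) + rho)."""
    return math.exp(sum(w * p for w, p in zip(w_phi, truth_degrees)) + rho)
```

With the synthetic-data setting w_ϕ = 3 and ρ = -5 (Section 4.2), a fully satisfied formula multiplies the base rate e^{-5} by e^{3}.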

3.6. MAXIMUM LIKELIHOOD ESTIMATION

Suppose event stream D contains n_l occurrences of event l, with occurrence time stamps t_{l_1}, t_{l_2}, ..., t_{l_{n_l}}. Let t_{l_0} = 0 and t_{l_{n_l+1}} = T. Based on the conditional intensity function in (8), the likelihood for label l over the event stream is calculated as (Daley & Vere-Jones, 2003):

L_l = [Π_{i=0}^{n_l - 1} exp(-∫_{t_{l_i}}^{t_{l_{i+1}}} λ_{l|Φ}(s) ds) λ_{l|Φ}(t_{l_{i+1}})] exp(-∫_{t_{l_{n_l}}}^{T} λ_{l|Φ}(s) ds).

The corresponding log-likelihood for event label l is

LL_l = -∫_0^T λ_{l|Φ}(s) ds + Σ_{i=1}^{n_l} log λ_{l|Φ}(t_{l_i}),

and the total log-likelihood of all the events in D is thus LL_D = Σ_{l∈L} LL_l. During the training process, we train the model parameters for each event label separately. Specifically, the maximum likelihood estimation problem for event label l can be formulated as follows:

min  -LL_l   (10)
s.t. ∀ϕ ∈ Φ, ∀1 ≤ k ≤ K^∧_ϕ:  β_k - Σ_{i∈I_k} w_{i,k} (1-α) ≥ α,  β_k - Σ_{i∈I_k} w_{i,k} α ≤ 1-α,
     ∀ϕ ∈ Φ, ∀1 ≤ k′ ≤ K^∨_ϕ:  1 - β_{k′} + Σ_{i∈I_{k′}} w_{i,k′} α ≥ α,  1 - β_{k′} + Σ_{i∈I_{k′}} w_{i,k′} (1-α) ≤ 1-α,
     w_{i,k} ≥ 0, β_k ≥ 0, w_{i,k′} ≥ 0, β_{k′} ≥ 0, u_{l_j} ≥ 0,

where K^∧_ϕ (resp. K^∨_ϕ) is the number of ∧ (resp. ∨) operators in ϕ, and I_k (resp. I_{k′}) denotes the inputs to the k-th ∧ (resp. k′-th ∨) operator. Please see Appendix A for more details about the above formulation. The overall learning framework is shown in Figure 3: the forward propagation computes LL_l using n CLNNs, each learning a wCL formula ϕ_i, and the backward propagation updates the parameters of the n CLNNs using projected gradient descent.
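The log-likelihood LL_l can be approximated numerically; the sketch below uses a left Riemann sum for the integral (the paper does not specify its integration scheme, so this is an assumption):

```python
import math

def log_likelihood(lmbda, occ_times, T, n_steps=1000):
    """LL_l = sum_i log lambda(t_{l_i}) - integral_0^T lambda(s) ds, with the
    integral approximated by a left Riemann sum over n_steps grid points."""
    dt = T / n_steps
    integral = sum(lmbda(k * dt) for k in range(n_steps)) * dt
    return sum(math.log(lmbda(t)) for t in occ_times) - integral
```

For a constant intensity λ = 2 on [0, 1] with one occurrence, this returns log 2 - 2, matching the closed form.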

4. EXPERIMENTS

We conduct several experiments on synthetic and real-world datasets to demonstrate the efficacy of our proposed model and compare it with state-of-the-art (SOTA) models. The experiments are run using the AdamW optimizer in PyTorch (1.10.2) on a Windows 10 desktop with a 16-core CPU (i7, 3.60GHz) and 32 GB RAM. Our code is available at https://ICLR-CLNN.

4.1. MODELS

Multivariate Hawkes Process (MHP) [(Bacry et al., 2017)]: A conventional multivariate Hawkes process utilizing an exponential kernel function to describe the conditional intensity rate, which involves a decay rate and an infectivity matrix characterizing the inter-dependence among events. This model is implemented in the tick library, where the learning problem is posed as a convex quadratic programming problem with a fixed decay rate.

Proximal Graphical Event Model (PGEM) [(Bhattacharjya et al., 2018)]: A type of GEM that models event data by considering whether a parent in some underlying graph happens in a proximal (recent) window.

Published as a conference paper at ICLR 2023

Ordinal Graphical Event Model (OGEM) [(Bhattacharjya et al., 2020; 2021)]: An ordinal GEM that models the impact of the order of events on the conditional intensity rate. OGEM-tab (resp. OGEM-tree) refers to an OGEM that adopts a tabular (resp. tree) representation of orders.

[Table 1 excerpt: ground truth φ1 = (c_A - c_B > 1)^1 ∧ (c_A - c_C > 3)^1; CLNN's rule: (c_A - c_B > 1.21)^1.52 ∧ (c_A - c_C > 3.00)^1.41 ∧ (c_A - c_D > 0.82)^0.33 ∧ (c_B - c_C > 4.33)^0 ∧ (c_B - c_D > 10.69)^0 ∧ (c_D - c_C > -6. ...]

Temporal Logic Rule Learner (TELLER) [(Li et al., 2021)]: A method to learn first-order temporal logic rules explaining the generative mechanism of TPPs. The rule discovery process is formulated as a maximum likelihood estimation problem solved by a branch-and-price algorithm.

4.2. SYNTHETIC DATASETS

The first part of this experiment demonstrates CLNN's capability to recover ground-truth rules using three synthetic datasets generated by CLNN with pre-specified formula structures and parameters, including u_{l_i l_j} in π^{l_i,l_j}_pop, the importance weights w and bias β in (5) for the logical operators, and the w_ϕ and ρ in (8) for the conditional intensity rate.

Experimental Setting. Each synthetic dataset contains 1,000 event streams partitioned into three sets: training (70%), validation (15%), and test (15%). Every dataset is generated using a wCL formula with w_ϕ = 3 and ρ = -5. The truth value threshold is set as α = 0.5, and the clock signal representing an event not occurring in H′_t is set as Z = 1.5 T_max, where T_max is the maximal ending time among all the event streams. During the training process, we initialize the parameters using four approaches (see Appendix C.5 for more details) and report the best one; CLNN aims to recover the manually set parameters.

Results. The ground-truth rule φ1 for generating the first synthetic dataset (Syn-1) with L = {A, B, C, D}, and the rules discovered by CLNN, TELLER, and OGEM-tab, are summarized in Table 1. Results for the other synthetic datasets are presented in Appendix C. The rules are learned using the 'last' masking method, which was also used for data generation. The experimental results show an accurate recovery performance of CLNN in terms of order representation recovery and parameter identification. The unweighted version of the ground-truth rule reads: "If A happens before B for at least 1 time unit and A happens before C for at least 3 time units, then D will happen". TELLER's rule only reflects the temporal relation between events A, B, C, and D but is unable to capture the temporal relation between A and B or between A and C, so it does not match the ground-truth rule. In OGEM-tab's rule, [l] denotes a single parent.
We show the top 3 excitation and inhibitory rules from OGEM-tab, where excitation (resp. inhibition) means λ_{l|Φ} is higher (resp. lower) than the λ_{l|Φ} obtained with all w_{ϕ_i} = 0. The excitation rules of OGEM-tab do not match the ground-truth rule. In contrast, the rule discovered by CLNN (ϕ_1) assigns larger weights to the paired order predicates π^{A,B}_pop = (c_A - c_B > 1.21) and π^{A,C}_pop = (c_A - c_C > 3.00) and small weights to the other predicates, where the interval values of 1.21 and 3.00 are both learned. By ignoring the small weights, ϕ_1 can be interpreted as "If A happens before B for at least 1.21 time units and A happens before C for at least 3.00 time units, then D will happen", meaning the paired order representations discovered by CLNN match the ground truth well. Moreover, CLNN's rules are more expressive than TELLER's and OGEM's, as they provide a detailed interval length between two ordered labels. To show the computational efficiency of our gradient-based learning, we compare the runtimes of CLNN and TELLER on the synthetic datasets in Table 2. Notably, CLNN not only recovers the correct order representations but is also two orders of magnitude faster on average (5.62 s vs. 635.99 s). In addition, CLNN can learn more expressive order representations that describe both the order relation between two events and their interval length.
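For context on how event streams can be sampled from a specified intensity (the paper does not detail its generator, so this is a generic sketch), Ogata's thinning algorithm draws candidate times from a dominating Poisson process and accepts them with probability λ(t)/λ_max:

```python
import random

def ogata_thinning(lmbda, T, lam_max, seed=0):
    """Sample event times on [0, T] from a conditional intensity lmbda(t, history),
    assuming lmbda(t, h) <= lam_max for all t and histories h."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(lam_max)              # candidate inter-arrival time
        if t > T:
            return events
        if rng.random() < lmbda(t, events) / lam_max:
            events.append(t)                       # accept with prob lambda/lam_max
```

History-dependent intensities, such as the wCL-informed rate, simply read `events` inside `lmbda`; for a constant intensity of 1, the expected number of events on [0, 1000] is 1000.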

4.3. REAL-WORLD DATASETS

LinkedIn [(Xu et al., 2017)]. An event dataset related to the job-hopping records of 3,000 LinkedIn users across 82 IT companies. Each event stream records a user's check-in time stamps for different companies or the time stamps of role changes within the same company. We filter the dataset to popular companies as per Bhattacharjya et al. (2020), resulting in 1,000 users.

MIMIC II [(Saeed et al., 2011)]. An event dataset concerning the health records of patients from Intensive Care Unit (ICU) visits over 7 years. A patient's event stream records each visit's time stamp and the corresponding diagnosis. We filter out sequences with few visits, resulting in 650 patients.

Stack Overflow [(Grant & Betts, 2013)]. An event dataset related to the badges awarded to users on the question-answering website Stack Overflow. Each user's event stream records the badges they receive at various time stamps. We keep the event streams with one or more of 20 types of badges and sample 1,000 users from the dataset used in Du et al. (2016).

Experimental Setup. Each dataset is partitioned into three sets: training (70%), validation (15%), and test (15%). For simplicity, the u_{l_i l_j} are set to 0 to study the ordering representations. The truth value threshold is α = 0.5 and Z = 1.5 T_max, the same as the setting for the synthetic datasets; the number of subformulas is n = 5, and the parameters are initialized as random numbers drawn uniformly from [0, 1). CLNN is trained on the training set, and the validation set is utilized for model selection during training. Model fit is evaluated using the log-likelihood on the test set.

Results. We follow Bhattacharjya et al. (2018; 2020; 2021) in using the log-likelihood to evaluate model performance. The log-likelihood on the real-world datasets is reported in Table 3, where DR denotes the difference ratio: the difference between CLNN and the best SOTA divided by the absolute value of the best SOTA.
CLNN's result is the better of the 'first' and 'last' maskings. Notably, CLNN outperforms the baseline models on the LinkedIn dataset (13.40% advantage) and achieves a competitive result on the MIMIC II dataset (only a 1.63% loss). PGEM achieves a better result on the Stack Overflow dataset. In Stack Overflow, one type of badge can be awarded only after a user receives a particular badge multiple times; for example, the 'Epic' badge is awarded only after earning 200 daily reputation points 50 times, which depends on the 'Mortarboard' badge acquired while answering or asking questions. CLNN and OGEMs apply masking methods to the data, which may not capture such dependence. In contrast, PGEM models the data without masking, making it more suitable for this dataset. Table 3: Dataset information and log-likelihood for all models on the real-world datasets.

Rules Effect

CLNN:
ϕ1 = (c_D > c_H)^{0.90} ∧ (c_I > c_J)^{0.72}  (Inhibitory)
ϕ2 = (c_B < 0.45)^{0.58} ∧ (c_D < 0.05)^{0.66}  (Excitation)
ϕ3 = (c_B > c_F)^{0.50} ∧ (c_J > c_D)^{0.47}  (Inhibitory)
ϕ4 = (c_A < 0.84)^{0.76} ∧ (c_H < 1.09)^{0.50}  (Inhibitory)
TELLER: [A, F], [C, F], [E, F], [B, F], [D, F]  (Excitation)
OGEM-tab: [F], [F, A]  (Excitation); [A]  (Inhibitory)

Table 4: Formulas and their effects as learned by CLNN, TELLER, and OGEM-tab on company F of LinkedIn.

Case Study. The primary strength of CLNN over the SOTA models is that it describes the generative mechanism as wCL formulas, which are more expressive and potentially provide more detailed information. CLNN can thus serve as a valuable tool for assisting domain specialists in knowledge discovery from event data. Here we showcase this strength with an illustrative example. We select the experimental result on company F of the LinkedIn dataset to demonstrate the expressivity of CLNN's rules, shown in Table 4. We specify the model to learn five formulas, four of which are inhibitory and one of which exhibits excitation. One inhibitory formula has a weight of 0.05 and is therefore not reported in Table 4. Each formula shows the dominant singleton or paired order predicates. Notably, CLNN learns expressive wCL formulas that describe how the logical composition of paired and/or singleton order predicates affects a role change in company F. CLNN's rules are more expressive than TELLER's and as expressive as OGEM-tab's for describing the occurrence of a causal event within a specific historical window.

5. CONCLUSION

In this paper, we proposed a novel neuro-symbolic model, CLNN, to learn interpretable wCL formulas from multivariate event data. Experimental results on synthetic and real-world datasets demonstrate CLNN's expressiveness in recovering ground-truth rules in multivariate temporal point processes. Further, CLNN can be trained with gradient-based methods, which improves learning speed compared to the SOTA.

A FORMULATION OF LOGICAL CONSTRAINTS & OBJECTIVE FUNCTION

The optimization problem in (10) is formulated by maximizing the log-likelihood subject to the logical constraints for the ∧ and ∨ operators. This section details the formulation of the two logical constraints and of the resulting constrained optimization problem. Without loss of generality, we illustrate the formulation for the ∧ operator; the constraints for the ∨ operator follow from those for ∧ via De Morgan's law.

• Logical constraints for the ∧ operator. Let $x, y \in [0,1]$ denote the inputs of the ∧ operator, and let $f(x,y)$ denote its quantitative satisfaction. The conventional characterization of ∧ is: 1) $f(x,y)$ is low when either input is low, and 2) $f(x,y)$ is high when both inputs are high. However, we associate each input with a nonnegative weight, and an input with zero weight should not affect the output; in particular, a low input carrying zero weight must not drive the output low. We therefore require only the weaker characteristics: 1) $f(x,y)$ is low when both inputs are low, and 2) $f(x,y)$ is high when both inputs are high. We introduce a user-defined hyperparameter $\alpha \in [\tfrac{1}{2}, 1]$ to delimit low vs. high: $x \in [0, 1-\alpha)$ is low and $x \in [\alpha, 1]$ is high. The above characteristics then read (Sen et al., 2022)
$$f(x,y) \le 1-\alpha \;\; \forall x, y \in [0, 1-\alpha), \qquad f(x,y) \ge \alpha \;\; \forall x, y \in [\alpha, 1]. \eqno(14)$$
We follow a specific choice of $f$ based on a triangular norm (t-norm) and define the quantitative satisfaction of ∧ as (Riegel et al., 2020)
$$p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t) = f\Big(\beta - \sum_{j=1}^{2} w_j\big(1 - p(C', \phi_j, t)\big)\Big), \quad \text{subject to} \quad \beta - \sum_{j=1}^{2} w_j(1-\alpha) \ge \alpha, \;\; \beta - \sum_{j=1}^{2} w_j\,\alpha \le 1-\alpha, \eqno(15)$$
where $f(z) = \max\{0, \min\{z, 1\}\}$ clamps the truth value into the range $[0,1]$.

• Logical constraints for the ∨ operator.
By De Morgan's law, the quantitative satisfaction function and the logical constraints for the ∨ operator with two inputs are
$$p(C', \phi_1^{w_1} \vee \phi_2^{w_2}, t) = f\Big(1 - \beta + \sum_{j=1}^{2} w_j\, p(C', \phi_j, t)\Big), \eqno(16)$$
$$\text{subject to} \quad 1 - \beta + \sum_{j=1}^{2} w_j\,\alpha \ge \alpha, \qquad 1 - \beta + \sum_{j=1}^{2} w_j(1-\alpha) \le 1-\alpha. \eqno(17)$$
Figure 4 illustrates the characteristics of the activation functions for the ∧ and ∨ operators. Figures 4(a) and 4(b) show the truth value of the ∧ operator with α = 0.7 and α = 0.9, respectively: $f(x,y)$ is close to 0 when both $x$ and $y$ are low and close to 1 when both are high, and the unconstrained region for α = 0.9 is larger than that for α = 0.7. Figure 4(c) shows the truth value of the ∨ operator with α = 0.7, exhibiting the same low/high behavior.

In general, the quantitative satisfaction for the ∧ and ∨ operators in (14)-(17) extends to k-ary conjunction and k-ary disjunction. The k-ary conjunction is
$$p(C', \phi_1^{w_1} \wedge \cdots \wedge \phi_k^{w_k}, t) = f\Big(\beta - \sum_{j=1}^{k} w_j\big(1 - p(C', \phi_j, t)\big)\Big), \eqno(18)$$
$$\text{subject to} \quad \beta - \sum_{j=1}^{k} w_j(1-\alpha) \ge \alpha, \qquad \beta - \sum_{j=1}^{k} w_j\,\alpha \le 1-\alpha. \eqno(19)$$
The k-ary disjunction is
$$p(C', \phi_1^{w_1} \vee \cdots \vee \phi_k^{w_k}, t) = f\Big(1 - \beta + \sum_{j=1}^{k} w_j\, p(C', \phi_j, t)\Big), \eqno(20)$$
$$\text{subject to} \quad 1 - \beta + \sum_{j=1}^{k} w_j\,\alpha \ge \alpha, \qquad 1 - \beta + \sum_{j=1}^{k} w_j(1-\alpha) \le 1-\alpha. \eqno(21)$$
With the above constraints, the maximum likelihood estimation problem is
$$\min \; -LL_l \eqno(22)$$
$$\text{s.t.} \quad \forall \phi \in \Phi,\; \forall 1 \le k \le K^{\wedge}_{\phi}: \quad \beta_k - \sum_{i \in I_k} w_{i,k}(1-\alpha) \ge \alpha, \qquad \beta_k - \sum_{i \in I_k} w_{i,k}\,\alpha \le 1-\alpha, \eqno(23)$$
$$\forall \phi \in \Phi,\; \forall 1 \le k' \le K^{\vee}_{\phi}: \quad 1 - \beta_{k'} + \sum_{i \in I_{k'}} w_{i,k'}\,\alpha \ge \alpha, \qquad 1 - \beta_{k'} + \sum_{i \in I_{k'}} w_{i,k'}(1-\alpha) \le 1-\alpha. \eqno(24)$$
In this paper we set α = 0.5, so the constraints in (19) become
$$\sum_{i=1}^{k} w_i \ge 2\beta - 1, \qquad \sum_{i=1}^{k} w_i \le 2\beta - 1, \qquad 2\beta - 1 \ge 0, \qquad w_i \ge 0. \eqno(25)$$
Reformulating the constraints in (25), we have
$$\sum_{i=1}^{k} w_i = 2\beta - 1, \qquad \beta \ge 0.5, \qquad w_i \ge 0. \eqno(26)$$
These constraints hold for each conjunction operator in φ. We can therefore incorporate (26) into the objective function as a quadratic penalty, which becomes
$$\min \; -LL_l + \sum_{k=1}^{K^{\wedge}_{\phi}} \Big(\sum_{i \in I_k} w_{i,k} - 2\beta_k + 1\Big)^2, \eqno(27)$$
$$\text{subject to} \quad w_{i,k} \ge 0, \;\; \beta_k \ge 0.5, \quad \forall i \in I_k, \; \forall 1 \le k \le K^{\wedge}_{\phi}, \; \forall \phi \in \Phi.$$
Similarly, the logical constraints for the ∨ operator are given by (21). Setting α = 0.5, the constraints in (21) become
$$\sum_{i=1}^{k} w_i \ge 2\beta - 1, \qquad \sum_{i=1}^{k} w_i \le 2\beta - 1, \qquad 2\beta - 1 \ge 0, \qquad w_i \ge 0. \eqno(30)$$
Reformulating, we have
$$\sum_{i=1}^{k} w_i = 2\beta - 1, \qquad \beta \ge 0.5, \qquad w_i \ge 0. \eqno(31)$$
These constraints hold for each disjunction operator in φ. Incorporating (31) into the objective function as well, the maximum likelihood estimation problem becomes
$$\min \; -LL_l + \sum_{k=1}^{K^{\wedge}_{\phi}} \Big(\sum_{i \in I_k} w_{i,k} - 2\beta_k + 1\Big)^2 + \sum_{k'=1}^{K^{\vee}_{\phi}} \Big(\sum_{i \in I_{k'}} w_{i,k'} - 2\beta_{k'} + 1\Big)^2,$$
$$\text{subject to} \quad w_{i,k} \ge 0, \;\; \beta_k \ge 0.5, \quad \forall i \in I_k, \; \forall 1 \le k \le K^{\wedge}_{\phi}, \; \forall \phi \in \Phi,$$
$$w_{i,k'} \ge 0, \;\; \beta_{k'} \ge 0.5, \quad \forall i \in I_{k'}, \; \forall 1 \le k' \le K^{\vee}_{\phi}, \; \forall \phi \in \Phi.$$
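As a minimal sketch of the constructions above, the following Python snippet implements the clamp-based activations for the weighted ∧ and ∨ operators and the α = 0.5 penalty term (function names are ours; this is an illustration, not the released implementation):

```python
def clamp(z):
    """f(z) = max(0, min(z, 1)): clamps a truth value into [0, 1]."""
    return max(0.0, min(z, 1.0))

def weighted_and(truths, weights, beta):
    """Weighted conjunction: f(beta - sum_j w_j * (1 - p_j))."""
    return clamp(beta - sum(w * (1.0 - p) for p, w in zip(truths, weights)))

def weighted_or(truths, weights, beta):
    """Weighted disjunction via De Morgan: f(1 - beta + sum_j w_j * p_j)."""
    return clamp(1.0 - beta + sum(w * p for p, w in zip(truths, weights)))

def constraint_penalty(weights, beta):
    """Squared penalty enforcing sum_i w_i = 2*beta - 1 (the alpha = 0.5 case),
    added to the negative log-likelihood for each ∧ / ∨ operator."""
    return (sum(weights) - 2.0 * beta + 1.0) ** 2
```

With w = (1, 1) and β = 1.5 the α = 0.5 constraints hold with equality: two high inputs give truth 1, two low inputs give 0, the penalty vanishes, and an input with zero weight leaves the output unchanged.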

B PROOF OF THEOREM 8

The activation function designed for the ∧ operator satisfies the properties of nonimpact for zero weights, impact ordering, and monotonicity. Without loss of generality, we present the proof for the ∧ operator connecting two clauses; it generalizes to the ∧ operator connecting k clauses.

Proof 1. We verify each property in turn for the activation function of the ∧ operator.

• Nonimpact for zero weights. If $w_j = 0$ for $j \in \{1, 2\}$, then $p(C', \phi_j, t)$ should have no impact on $p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)$. Without loss of generality, suppose $w_1 = 0$. Then
$$p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t) = f\big(\beta - 0 \cdot (1 - p(C', \phi_1, t)) - w_2 (1 - p(C', \phi_2, t))\big) = f\big(\beta - w_2 (1 - p(C', \phi_2, t))\big),$$
so $p(C', \phi_1, t)$ has no impact on $p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)$.

• Impact Ordering

This means the truth degree of the subformula with the higher weight has a greater impact on $p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)$. Formally, we prove that if $p(C', \phi_1, t) = p(C', \phi_2, t)$ and $w_1 \ge w_2$, then
$$\frac{\partial p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)}{\partial p(C', \phi_1, t)} \ge \frac{\partial p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)}{\partial p(C', \phi_2, t)}. \eqno(35)$$
Since $f(x) = \max\{0, \min\{x, 1\}\}$, we have
$$\frac{df}{dx} = \begin{cases} 0, & x < 0, \\ 1, & 0 < x < 1, \\ 0, & x > 1. \end{cases} \eqno(36)$$
Write $z = \beta - \sum_{j=1}^{2} w_j (1 - p(C', \phi_j, t))$. If $z < 0$ or $z > 1$, then
$$\frac{\partial p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)}{\partial p(C', \phi_1, t)} = \frac{\partial p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)}{\partial p(C', \phi_2, t)} = 0. \eqno(37)$$
If $0 < z < 1$, then by the chain rule
$$\frac{\partial p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)}{\partial p(C', \phi_1, t)} = w_1 \quad \text{and} \quad \frac{\partial p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)}{\partial p(C', \phi_2, t)} = w_2.$$
Since $w_1 \ge w_2$, inequality (35) holds in every case, which proves the impact ordering property.

• Monotonicity. This means $p(C', \phi_1^{w_1} \wedge \phi_2^{w_2}, t)$ increases monotonically in each $p(C', \phi_j, t)$, i.e.,
$$f\Big(\beta - \sum_{j=1}^{2} w_j (1 - p(C', \phi_j, t))\Big) \le f\Big(\beta - \sum_{j=1}^{2} w_j (1 - p(C', \phi_j, t) - d)\Big) \quad \text{for } d \ge 0. \eqno(41)$$
First, note that the argument can be rewritten as
$$\beta - \sum_{j=1}^{2} w_j (1 - p(C', \phi_j, t)) = \beta - w_1 - w_2 + w_1\, p(C', \phi_1, t) + w_2\, p(C', \phi_2, t), \eqno(42)$$
which is monotonically increasing in $p(C', \phi_1, t)$ and $p(C', \phi_2, t)$ since $w_1, w_2 \ge 0$. Moreover, from the proof of impact ordering, $f(x) = \max\{0, \min\{x, 1\}\}$ is monotonically nondecreasing. Composing the two, (41) follows, so the monotonicity property is satisfied.
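The three properties can also be checked numerically with a small finite-difference sketch (our own illustration; in the interior region where f' = 1, the partial derivatives reduce to w1 and w2):

```python
def clamp(z):
    """f(z) = max(0, min(z, 1))."""
    return max(0.0, min(z, 1.0))

def wand(p1, p2, w1, w2, beta):
    """Weighted ∧ activation: f(beta - w1*(1 - p1) - w2*(1 - p2))."""
    return clamp(beta - w1 * (1.0 - p1) - w2 * (1.0 - p2))

def partial_diff(fn, i, args, h=1e-6):
    """Central finite difference of fn with respect to its i-th argument."""
    hi, lo = list(args), list(args)
    hi[i] += h
    lo[i] -= h
    return (fn(*hi) - fn(*lo)) / (2.0 * h)

# At an interior point (0 < z < 1) with w1 = 1.0 >= w2 = 0.5 and beta = 1.2:
# the partial w.r.t. p1 is ~w1 and w.r.t. p2 is ~w2 (impact ordering),
# and both partials are nonnegative (monotonicity).
```

Setting w1 = 0 reproduces the nonimpact property: varying p1 then leaves the output unchanged.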

C EXPERIMENT RESULTS OF SYNTHETIC DATASETS

Dataset Generation. In the experiments on synthetic datasets, we manually generate three synthetic datasets under different settings; the details and results for the first synthetic dataset are reported in Section 4.2. Each setting considers a different order representation, a different number of event labels, or a different intensity for the causal event labels.

C.1 SYNTHETIC DATASET-1 (SYN-1).

Generation process. The first synthetic dataset contains 4 event labels: A, B, C, and D, where D is the event for prediction and A, B, C are causal events. The wCL formula used to generate event D is
φ1 = (c_A − c_B > 1)^{1} ∧ (c_A − c_C > 3)^{1},
whose unweighted version reads "If A happens before B by at least 1 time unit and A happens before C by at least 3 time units, then D will happen." We consider event labels A, B, C as free predicates, whose occurrences are generated by homogeneous Poisson processes with intensity rates λ_A = λ_B = λ_C = 0.2. The procedure used to generate instances of A, B, C is described in Algorithm 1 (Chen, 2016).

Algorithm 1 Simulation of a homogeneous Poisson process with intensity rate λ.

Input:

Intensity rate λ, simulation horizon T
Output: Occurrence time stamps T = {t_k}
1: Initialize n = 0, t_0 = 0;
2: while True do
   ⋮

Algorithm 2 Simulation of an inhomogeneous Poisson process with intensity rate λ(t).
Input: Intensity rate λ(t), simulation horizon T
Output: Occurrence time stamps T = {t_k}
1: Initialize n = m = 0, t_0 = s_0 = 0, λ̄ = sup_{0≤t≤T} λ(t);
2: while s_m < T do
3:   Generate a uniform random variable u ∼ uniform(0, 1);
     ⋮
16:  end if
17: end while

Results. The rules learned by CLNN, TELLER, and OGEM-tab on the first synthetic dataset are presented in Table 5, where the paired order predicate among the two candidates with the highest weight is presented.
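The two simulation procedures above can be sketched in Python as follows: standard exponential inter-arrival sampling for Algorithm 1 and Lewis-Shedler thinning for Algorithm 2, following Chen (2016); the function names and the constant majorizing rate are our own simplifications.

```python
import math
import random

def simulate_homogeneous_poisson(lam, T, seed=0):
    """Algorithm 1: draw i.i.d. Exp(lam) inter-arrival times until horizon T."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += -math.log(1.0 - rng.random()) / lam  # exponential gap
        if t > T:
            return times
        times.append(t)

def simulate_inhomogeeneous_poisson(lam_fn, lam_bar, T, seed=0):
    """Algorithm 2 (thinning): simulate candidates at a constant majorizing
    rate lam_bar >= sup_t lam_fn(t), then accept each candidate time s with
    probability lam_fn(s) / lam_bar."""
    rng = random.Random(seed)
    s, times = 0.0, []
    while True:
        s += -math.log(1.0 - rng.random()) / lam_bar
        if s > T:
            return times
        if rng.random() <= lam_fn(s) / lam_bar:
            times.append(s)
```

For Syn-1, the free predicates A, B, C would each be drawn with λ = 0.2, and λ_{D|φ1}(t) would then be passed as `lam_fn` to the thinning routine.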


Table 5 (Syn-1). N (# events): N = 4, L = {A, B, C, D}. Ground truth: φ1 = (c_A − c_B > 1)^{1} ∧ (c_A − c_C > 3)^{1}. CLNN's rule: (c_A − c_B > 1.21)^{1.52} ∧ (c_A − c_C > 3.00)^{1.41} ∧ (c_A − c_D > 0.82)^{0.33} ∧ (c_B − c_C > 4.33)^{0} ∧ (c_B − c_D > …

It can be clearly observed that by truncating the predicates with small weights, we obtain the formula
ϕ1 = (c_A − c_B > 1.21)^{1.52} ∧ (c_A − c_C > 3.00)^{1.41},
which matches the ground-truth rule well. However, TELLER cannot capture the paired order representation between A and B or between A and C. OGEM-tab captures the order representations [A, B] and [A, C] as inhibitory causes, which contradicts the ground-truth rule.

C.2 SYNTHETIC DATASET-2 (SYN-2).

Generation Process. The second synthetic dataset contains 5 event labels: A, B, C, D, and E, where E is the event for prediction and A, B, C, D are causal events. The wCL formula used to generate the occurrence of event E is
φ2 = (c_A − c_B > 0.5)^{1} ∧ (c_A − c_C > 1.5)^{1} ∧ (c_C − c_D > 2)^{1},
whose unweighted version reads "If A happens before B by at least 0.5 time units, A happens before C by at least 1.5 time units, and C happens before D by at least 2 time units, then E will happen." The occurrences of events A, B, C, and D are generated using Algorithm 1 with λ_A = λ_B = λ_C = λ_D = 0.2. After obtaining the occurrences of A, B, C, and D, we simulate event label E using Algorithm 2, in which the intensity rate λ_{E|φ2}(t) is computed using the model shown in Figure 6.

Results. The rules learned by CLNN, TELLER, and OGEM-tab on the second synthetic dataset are presented in Table 6, where the paired order predicate with the highest weight is presented.
It can be clearly observed that by truncating the predicates with small weights, CLNN learns the wCL formula
ϕ2 = (c_A − c_B > 0.77)^{1.27} ∧ (c_A − c_C > 2.09)^{1.15} ∧ (c_C − c_D > 2.60)^{1.06}.

Table 6 (Syn-2). Ground truth: φ2 = (c_A − c_B > 0.5)^{1} ∧ (c_A − c_C > 1.5)^{1} ∧ (c_C − c_D > 2)^{1}. CLNN's rule: (c_A − c_B > 0.77)^{1.27} ∧ (c_A − c_C > 2.09)^{1.15} ∧ (c_A − c_D > −5.00)^{0.25} ∧ (c_A − c_E > −2.74)^{0.09} ∧ (c_B − c_C > −9.31)^{0.02} ∧ (c_B − c_D > −8.54)^{0.08} ∧ (c_B − c_E > 2.07)^{0} ∧ (c_C − c_D > 2.60)^{1.06} ∧ (c_C − c_E > −4.27)^{0.03} ∧ (c_D − c_E > 1.17)^{0.07}. TELLER's rule: A before E, B before E, A and B before E, A and C before E.

OGEM-tab's rule

Excitation: [D], [D, E], [E], [E, D]; Inhibitory: [D, A], [A], [A, D], [A, D, E], [E, D, A], [D, A, E], [A, E], [E, A], [D, E, A], [A, E, D], [E, A, D].

C.3 SYNTHETIC DATASET-3 (SYN-3).

The third synthetic dataset is generated with a more interesting scheme that combines the generation schemes of the first and second synthetic datasets. It includes five event labels: A, B, C, D, and E. We consider A, B, and C as the causal events for the occurrence of D, and A, B, C, and D as the causal events for the occurrence of E. The occurrences of events A, B, C are generated using Algorithm 1 with λ_A = λ_B = λ_C = 0.2. The wCL formula used to generate the occurrence of event D is
φ3,1 = (c_B − c_A > −2)^{1} ∧ (c_C − c_A > −5)^{1},
whose unweighted version reads "If A happens before B by less than 2 time units and A happens before C by less than 5 time units, then D will happen." The generation of D's occurrences follows Algorithm 2, where λ_{D|φ3,1}(t) is computed using the model shown in Figure 7. We call the third synthetic dataset at this step Syn-3.1. After obtaining the occurrences of events A, B, C, and D, we simulate the occurrence of E using the formula
φ3,2 = (c_B − c_A > −5)^{1} ∧ (c_C − c_B > −4)^{1} ∧ (c_D − c_C > −3)^{1}.
Similarly, the generation of E's occurrences follows Algorithm 2, where the intensity rate λ_{E|φ3,2}(t) is computed using the model shown in Figure 8. We call the third synthetic dataset at this step Syn-3.2.

Results.

The rules learned by CLNN, TELLER, and OGEM-tab on the cause of event D in the third synthetic dataset are presented in Table 7, where the paired order predicate with the highest weight among the two candidates is reported. It can be clearly observed that by truncating the predicates with small weights, CLNN learns the wCL formula
ϕ3,1 = (c_B − c_A > −1.85)^{1.72} ∧ (c_C − c_A > −3.90)^{1.59},
whose order representation matches the ground-truth rule well. On the other hand, TELLER's rule only reveals the temporal relation between event labels A, B, C, and D; it does not capture the temporal relation between A and B or between A and C. In addition, OGEM-tab does not capture that C is a parent event of D.

Table 7 (Syn-3.1). N (# events): N = 5, L = {A, B, C, D, E}. Ground truth: φ3,1 = (c_B − c_A > −2)^{1} ∧ (c_C − c_A > −5)^{1}. CLNN's rule: (c_B − c_A > −1.85)^{1.72} ∧ (c_C − c_A > −3.90)^{1.59} ∧ (c_D − c_A > −16.25)^{0.33} ∧ (c_C − c_B > −3.01)^{0} ∧ (c_D − c_B > −7.37)^{0.02} ∧ (c_D − c_C > −7.55)^{0}. TELLER's rule: A before D, B before D, C before D. OGEM-tab's rule: Excitation: [A], [A, B, D], [B, D, A], [D, A], [D, A, B], [B, A], [A, D], [D], [B, A, D], [D, B, A]; Inhibitory: [A, B], [B, D], [B], [A, D, B], [D, B].
Table 7: Comparison of rule discovery of ϕ3,1 for CLNN and TELLER on the Syn-3.1 dataset.

The rules learned by CLNN, TELLER, and OGEM-tab on the cause of event E in the third synthetic dataset are presented in Table 8.

To quantitatively evaluate the difference between the ground-truth rules and the rules learned by CLNN, we adopt the Jaccard similarity score to assess the learned formulas against the ground truth. Let G denote the set of paired ordering representations from the ground-truth rule and C the set of paired ordering representations from the learned rules; the Jaccard similarity score is J = |C ∩ G| / |C ∪ G|.
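The Jaccard score over sets of paired ordering representations can be sketched as follows (our own helper; pairs are encoded as tuples):

```python
def jaccard(learned, ground_truth):
    """J = |C ∩ G| / |C ∪ G| over sets of paired ordering representations."""
    C, G = set(learned), set(ground_truth)
    union = C | G
    return len(C & G) / len(union) if union else 1.0
```

For Syn-3.1, for example, both the ground truth and the truncated CLNN rule use the pairs {(B, A), (C, A)}, giving J = 1.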
For TELLER and OGEM-tab, the ordering representations are extracted from the excitation rules. The comparison of Jaccard similarity scores for the synthetic datasets is shown in Figure 9, where a Jaccard similarity score of 0 is raised to a minimum threshold of 0.05 for display purposes. It is clearly observed that the Jaccard similarity scores for CLNN are higher than those for TELLER or OGEM, implying that the rules discovered by CLNN are more consistent with the ground truth.

The discrete wCL formula learned by CLNN for event E is
ϕ3,2 = (c_B − c_A > −3.94)^{1.49} ∧ (c_C − c_B > −3.02)^{2.03} ∧ (c_D − c_C > −2.00)^{1.92}.

Table 8 (Syn-3.2). N (# events): N = 5, L = {A, B, C, D, E}. Ground truth: φ3,2 = (c_B − c_A > −5)^{1} ∧ (c_C − c_B > −4)^{1} ∧ (c_D − c_C > −3)^{1}. CLNN's rule: (c_B − c_A > −3.94)^{1.49} ∧ (c_C − c_A > −9.12)^{0.25} ∧ (c_D − c_A > −1.42)^{0.13} ∧ (c_E − c_A > −3.88)^{0.15} ∧ (c_C − c_B > −3.02)^{2.03} ∧ (c_D − c_B > −6.27)^{0.02} ∧ (c_E − c_B > −7.30)^{0.04} ∧ (c_D − c_C > −2.00)^{1.92} ∧ (c_E − c_C > −5.30)^{0.09} ∧ (c_E − c_D > −1.57)^{0.01}. TELLER's rule: A before E, B before E, C before E. OGEM-tab's rule: Excitation: [A, D], [D, A], [D, E], [E], [A, D, E], [D, E, A], [E, A], [A, E], [E, A, D], [A, E, D], [D, A, E], [E, D, A]; Inhibitory: [A], [D], [E, D].

C.5 STABILITY ANALYSIS OF CLNN'S RULES WITH RESPECT TO INITIALIZATION

To further validate the model's stability in learning wCL rules, we try different parameter initialization methods, including: 1. rand - random numbers from a uniform distribution on the interval [0, 1); 2. randn - random numbers from a normal distribution with mean 0 and variance 1; 3. ones - constant values of 1; 4. xavier - random numbers from a uniform distribution on the interval [-1/√n, 1/√n], where n is the dimension of the parameter. The rules learned by CLNN under these initializations are summarized in Table 9. Inspecting the rules for the different initialization methods shows that CLNN recovers the correct paired order representations even when the learning process starts from different positions. Moreover, the logic formulas learned by CLNN are stable, as the variance of the learned parameters is relatively small.
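The four initialization schemes can be sketched as follows (a hypothetical helper of our own; the paper's implementation details may differ):

```python
import math
import random

def init_params(n, scheme, seed=0):
    """Initialize an n-dimensional parameter vector under one of four schemes."""
    rng = random.Random(seed)
    if scheme == "rand":    # uniform on [0, 1)
        return [rng.random() for _ in range(n)]
    if scheme == "randn":   # normal with mean 0, variance 1
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    if scheme == "ones":    # constant 1
        return [1.0] * n
    if scheme == "xavier":  # uniform on [-1/sqrt(n), 1/sqrt(n)]
        bound = 1.0 / math.sqrt(n)
        return [rng.uniform(-bound, bound) for _ in range(n)]
    raise ValueError(f"unknown scheme: {scheme}")
```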

D.3 STACK OVERFLOW DATASET

Stack Overflow is a question-and-answer website spanning a wide range of domains. A badge rewarding scheme encourages users to participate in questioning and answering activities. The badge system of Stack Overflow comprises 81 types of non-topical badges, including badges that can be awarded only once and badges that can be awarded several times. The dataset in (Du et al., 2016) was obtained by first filtering out the badges that can be awarded only once, then restricting to users who acquired at least 40 badges between 2012-01-01 and 2014-01-01, and finally selecting the badges awarded more than 100 times. Our dataset retains the event streams containing one or more of the 20 specified badge types and randomly samples 1,000 users, yielding 1,000 event streams. The detailed log-likelihood for each event label in the Stack Overflow dataset is summarized in Table 13.

D.4 ANALYSIS OF EXPRESSIVENESS ON MODEL'S PERFORMANCE

In this part, we train CLNN without the singleton order cell (SOC) on the real-world datasets to assess the effectiveness of the singleton order predicates (SOPs). The comparison of log-likelihood for CLNN with and without the SOC is summarized in Table 14. As evidenced by Table 14, the log-likelihood of CLNN with SOPs is higher than without them, meaning that enriching the expressiveness of wCL formulas can better explain the generative mechanism of events.



https://x-datainitiative.github.io/tick/modules/hawkes.html https://github.com/FengMingquan-sjtu/Logic Point Processes ICLR



Figure 1: (a): An event stream example with N = 11 events of M = 4 event labels over T = 30 days. (Integer-valued time stamps are used for easy interpretation; the proposed approach also works for t_i ∈ R.) (b): The overall workflow of the proposed method (POC: paired order cell, SOC: singleton order cell, AC: architecture cell; details presented in Sections 2.2 to 3.3).

Example 1 (cont.) With first masking, an example of a paired order representation for H′_13 can be [A, B], representing "A happens before B", or [B, C], representing "B happens before C". The overall order representation for H′_13 is [A, B, D, C], which can be derived from the paired order representations [A, B], [B, D], [D, C]. A singleton order representation example for H′_13 is [B, 10.5], meaning B happened in the past 10.5 days.

Figure 2: CLNN Structure. (a): Continuous relaxation of the search space using weights. (b): The learned discrete model structure for ϕ = (π_{A,B}^{pop} ∧ π_{B,C}^{pop}) ∨ π_{A}^{sop}.

(a). The two POP nodes represent π_{l_i,l_j}^{pop} and π_{l_j,l_i}^{pop}, sharing the same parameter u_{l_i,l_j}, where π_{l_i,l_j}^{pop}

Figure 3: The overall learning framework for n wCL formulas.

…57)^{0.16}. TELLER's rule: A before D, B before D, C before D, A before D and C before D. OGEM-tab's rule: Excitation: [B], [C, B], [B, C]; Inhibitory: [A], [C, A], [A, C].

Figure 4: Plot of truth degree for (a) CLNN-∧ with α = 0.7, (b) CLNN-∧ with α = 0.9, (c) CLNN-∨ with α = 0.7.

Figure 5: Model structure of φ1 for generating the first synthetic dataset.

With the above algorithm, we can generate the occurrences of event labels A, B, and C. Next, we build a CLNN for φ1 = (c_A − c_B > 1)^{1} ∧ (c_A − c_C > 3)^{1} to calculate the conditional intensity rate λ_{D|φ1}, whose model structure is shown in Figure 5. After obtaining λ_{D|φ1}(t), we can use Algorithm 2 (Chen, 2016) to generate the occurrences of D.

12: if t_n ≤ T then
13:   return {t_k}_{k=1,2,…,n}
14: else
15:   return {t_k}_{k=1,2,…,n−1}
16:

…69)^{0} ∧ (c_D − c_C > −6.57)^{0.16}. TELLER's rule: A before D, B before D, C before D, A before D and C before D. OGEM-tab's rule: Excitation: [B], [C], [C, B], [B, C], [A, C, B], [A, B, C]; Inhibitory: [A], [B, A], [B, A, C], [C, B, A], [A, B], [A, C], [B, C, A], [C, A, B], [C, A].

Figure 6: Model structure of φ2 for generating the second synthetic dataset.

whose order representation matches the ground-truth rule well. Nevertheless, TELLER's rule only captures the ordering between A, B, and E, whereas the orderings between A and B, between A and C, and between C and D are not learned. OGEM-tab's rules only capture that event label D can excite the occurrence of event label E; they cannot capture the dependence of E's occurrence on the order relations between A and B, A and C, or C and D. (Table 6 header: Dataset: Syn-2; N (# events): N = 5, L = {A, B, C, D, E}; Ground truth: φ2.)

Figure 7: Model structure of φ3,1 for generating the occurrence of D in the Syn-3 dataset.

It is obvious that ϕ3,2 learns the temporal relations between A and B, between B and C, and between C and D. However, TELLER's rules only reflect the temporal relations between A, B, C, and E, and give no information about the temporal relations between A and B, B and C, or C and D. OGEM-tab's rules indicate that it considers event labels A, D, and E as the parent events of E, which does not match the ground-truth parent set. (Table 8 header: Dataset: Syn-3.2; N (# events): N = 5, L = {A, B, C, D, E}; Ground truth: φ3,2.)

Figure 9: Comparison of ground-truth rules with CLNN's rules in terms of Jaccard similarity score for a) Syn-1, b) Syn-2, c) Syn-3.1, d) Syn-3.2.

Comparison of rule discovery for CLNN, TELLER, and OGEM-tab on the Syn-1 dataset.

Comparison of rule discovery for CLNN and TELLER on the Syn-1 dataset.

Comparison of rule discovery for CLNN and TELLER on the Syn-2 dataset.

Comparison of rule discovery of ϕ3,2 for CLNN and TELLER on the Syn-3.2 dataset.

C.4 QUANTITATIVE COMPARISON OF CLNN'S RULES WITH GROUND TRUTH

Log likelihood for each event label in the LinkedIn dataset.

D.2 MIMIC II DATASET

The MIMIC II dataset is obtained from the intensive care unit research database, which consists of 25,328 intensive care unit stays. The records include laboratory data and therapeutic intervention profiles such as nursing progress notes, discharge summaries, and others. Here we restrict the event types to patient diagnoses and filter out the shorter event sequences with few visits, ending up with 650 patients and 15 event labels: L = {1, 2, 8, 9, 11, 12, 14, 20, 21, 22, 23, 26, 27, 42, 47}. The parameters are initialized following the same setting as for the LinkedIn dataset. The detailed log-likelihood for each event label is presented in Table 12.

Log likelihood for each event label in the MIMIC II dataset.

Log likelihood for each event label in the Stack Overflow dataset.

Comparison of log-likelihood for CLNN with and without SOP on the real-world datasets.

6. ACKNOWLEDGEMENT

This research is sponsored by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network; the National Science Foundation under Grant CMMI-1936578; and the Defense Advanced Research Projects Agency (DARPA) through Cooperative Agreement D20AC00004 awarded by the U.S. Department of the Interior (DOI), Interior Business Center. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Dataset Initialization Rules

Syn-1

In this part, we investigate the effect of the logical constraints on the model's performance and interpretability. The log-likelihood on the synthetic datasets for CLNN with and without logical constraints is summarized in Table 10.

D EXPERIMENT RESULTS OF REAL-WORLD DATASETS D.1 LINKEDIN DATASET

The LinkedIn dataset is a collection of job-hopping records of 3,000 LinkedIn users across 82 IT companies. Each event stream represents a user's check-in time stamps for different companies or role changes within the same company. Here we select 1,000 users' event streams by filtering out event streams involving uncommon companies, resulting in 10 event labels: L = {A, B, C, D, E, F, G, H, I, J}. We set the number of formulas to 5, i.e., Φ = {ϕ1, ϕ2, ϕ3, ϕ4, ϕ5}, each of which embodies the model structure shown in Figure 2(a); CLNN learns the parameters for each formula. The weight parameters in the paired order cells and singleton order cells are initialized as Gaussian random variables, and the bias terms of the conjunction and disjunction operators are initialized as 1. The architecture weights, the formula impact weights, and the biases are initialized as Gaussian random variables. The detailed log-likelihood for each event label is summarized in Table 11.

