REINFORCEMENT LOGIC RULE LEARNING FOR TEMPORAL POINT PROCESSES

Abstract

We aim to learn a set of temporal logic rules to explain the occurrence of temporal events. Leveraging the temporal point process modeling and learning framework, the rule content and rule weights are jointly learned by maximizing the likelihood of the observed noisy event sequences. The proposed algorithm alternates between a master problem, where the rule weights are updated, and a subproblem, where a new rule is searched for and included. The formulated master problem is convex and relatively easy to solve, whereas the subproblem requires searching the huge combinatorial space of rule predicates and temporal relations. To tackle this challenge, we propose a neural search policy that learns to generate new rule content as a sequence of actions. The policy parameters are trained end-to-end within the reinforcement learning framework, where reward signals can be efficiently queried by evaluating the subproblem objective. The trained policy can be used to generate new rules; moreover, well-trained policies can be directly transferred to other tasks to speed up the rule searching procedure in a new task. We evaluate our methods on both synthetic and real-world datasets, obtaining promising results.

1. INTRODUCTION

Understanding the generating process of events with irregular timestamps has long been an interesting problem. The temporal point process (TPP) is an elegant probabilistic model for modeling such irregular events in continuous time. Instead of discretizing the time horizon and converting the event data into time-series event counts, TPP models directly treat the inter-event times as random variables and can be used to predict the time-to-event as well as the future event types. Recent advances in neural-based temporal point process models have exhibited superior ability in event prediction (Du et al., 2016; Mei & Eisner, 2017). However, the lack of interpretability of these black-box models hinders their applications in high-stakes systems like healthcare.
In healthcare, it is desirable to summarize medical knowledge and clinical experience about disease phenotypes and therapies into a collection of logic rules. The discovered rules can contribute to the sharing of clinical experience and aid the improvement of treatment strategies. They can also provide explanations for the occurrence of events. For example, the following clinical report, "A 50-year-old patient, with a chronic lung disease since 5 years ago, took the booster vaccine shot on March 1st. The patient got exposed to the COVID-19 virus around May 12th, and afterward within a week began to have a mild cough and nasal congestion. The patient received treatment as soon as the symptoms appeared. After intravenous infusions at a healthcare facility for around 3 consecutive days, the patient recovered...", contains many clinical events with recorded timestamps. It is appealing to distill compact and human-readable temporal logic rules from these noisy event data. In this paper, we propose an efficient reinforcement temporal logic rule learning algorithm to automatically learn these rules from event sequences. See Fig. 1 for an illustration of the types of temporal logic rules we aim to discover, where the logic rules are in disjunctive normal form (i.e., OR-of-ANDs) with temporal ordering constraints. Our proposed reinforcement rule learning algorithm builds upon the temporal logic point process (TLPP) models (Li et al., 2020), where the intensity functions (i.e., occurrence rates) of events are informed by temporal logic rules. TLPP is intrinsically a probabilistic model that treats the temporal



logic rules as soft constraints. The learned model can tolerate the uncertainty and noisiness in events and can be directly used for future event prediction and explanation. Given this TLPP modeling framework, our reinforcement rule learning algorithm jointly learns the rule content (i.e., model structure) and rule weights (i.e., model parameters) by maximizing the likelihood of the observed events. The designed learning algorithm alternates between solving a convex master problem, where the continuous rule weight parameters are easily optimized, and solving a more challenging subproblem, where a new candidate rule that has the potential to most improve the current likelihood is discovered via reinforcement learning. New rules are progressively discovered and included until adding new rules no longer improves the objective. Specifically, we formulate the rule discovery subproblem as a reinforcement learning problem, where a neural policy is learned to efficiently navigate the combinatorial search space for a good explanatory temporal logic rule to add. The constructed neural policy emits a distribution over the prespecified logic predicate and temporal relation libraries, and generates the logic variables as actions in a sequential way to form the rule content. The generated rules can be of various lengths. Once a temporal logic rule is generated, a terminal reward signal can be efficiently queried by evaluating the current subproblem objective with the generated rule; this evaluation is computationally cheap, so there is no concern about insufficient reward samples. The neural policy is gradually improved by a risk-seeking policy gradient to learn to generate rules that optimize the subproblem objective, which is rigorously formulated from the dual variables of the master problem so as to search for a rule with the potential to best improve the current likelihood.
This proposed reinforcement logic rule learning algorithm has the following advantages: 1) We utilize a differentiable policy gradient to solve the temporal logic rule search subproblem. All the policy parameters can be learned end-to-end via policy gradient using the subproblem objective as the reward. 2) Domain knowledge or grammar constraints for the temporal logic rules can be easily incorporated by applying dynamic masks to the rule generative process at each time step. 3) The memory of how to search through the rule space is encoded in the policy parameters. The well-trained neural policies for each subproblem can be directly transferred to similar rule learning tasks to speed up the computation in new tasks, where we don't need to learn rules from scratch. Contributions. Our main contributions are the following: i) We propose an efficient and differentiable reinforcement temporal logic rule learning algorithm, which can automatically discover temporal logic rules to predict and explain events. Our method adds flexibility and explainability to temporal point process models and broadens their applications in scenarios where interpretability is important. ii) All the well-trained neural policies from each subproblem can be readily transferred to new tasks. This fits the continual learning paradigm well: the quality of the rule search policies can be continually improved across various tasks. For a new task, we can utilize the preceding tasks' memories even when we cannot access the old training data. We empirically evaluated the transferability of our neural policies and achieved promising results. iii) Our discovered temporal logic rules are human-readable. Their scientific accuracy can be easily judged by human experts, and the discovered rules may also stimulate experts' thinking.
In our paper, we considered a real healthcare dataset and mined temporal logic rules from the clinical event data. We invited doctors to verify these rules and incorporated their feedback and modifications into our experiments.

2. RELATED WORK

Temporal point process (TPP) models. TPP models can be characterized by the intensity function, and the modeling framework boils down to the design of various intensity functions that add flexibility and interpretability (Mohler et al., 2011). Recent developments in deep learning have significantly enhanced the flexibility of TPP models. (Du et al., 2016) proposed a neural point process model, named RMTPP, where the intensity function is modeled by a recurrent neural network. (Mei & Eisner, 2017) improved RMTPP by constructing a continuous-time RNN. (Zuo et al., 2020) and (Zhang et al., 2020a) recently leveraged the self-attention mechanism to capture the long-term dependencies of events. Although flexible, these neural TPP models are black-box and hard to interpret. To add transparency, (Zhang et al., 2020b) used Granger causality as a latent graph to explain point processes, where the graph structure is jointly learned via gradient descent. However, Granger causality is still limited to the mutual triggering patterns of events. Recently, (Li et al., 2020) proposed the explainable Temporal Logic Point Process (TLPP), where the intensity function is built on the basis of temporal logic rules. TLPP enables symbolic reasoning for events; however, its rules are required to be pre-specified. A follow-up work (Li et al., 2022) designed a column-generation type of temporal rule learning algorithm. However, its subproblems are solved by enumeration, which becomes intractable for long temporal logic rules, and its search memories cannot be reused for future tasks. By contrast, our proposed algorithm makes the subproblem differentiable, and the trained neural policies can be reused and transferred. Logic rule learning methods. Learning logic rules without temporal relation constraints has been studied from various perspectives. Recently, (Wang et al., 2017) learned an explanatory binary classifier using the Bayesian framework.
SATNet (Wang et al., 2019a) transformed rule mining into an SDP-relaxed MaxSAT problem. Attention-based methods (Yang & Song, 2019) were also introduced. Neural-LP (Yang et al., 2017) provided the first fully differentiable rule mining method based on TensorLog (Cohen, 2016), and (Wang et al., 2019b) extended Neural-LP to learn rules with numerical values via dynamic programming and cumulative sum operations. In addition, DRUM (Sadeghian et al., 2019) connected learning rule confidence scores with low-rank tensor approximation. (Dash et al., 2018; Wei et al., 2019) introduced a column generation (i.e., branch and price) type of MIP algorithm to learn logic rules. However, none of these logic learning methods can be directly applied to event sequences with timestamps. By contrast, we design a differentiable algorithm to learn temporal logic rules from event sequences. Learning model structures via reinforcement learning (RL). RL provides a promising approach to automatically finding best-fitting model structures, which inspired us to apply it to rule learning. For example, in AutoML, the well-known NAS (Zoph & Le, 2017) trained a recurrent neural network by RL to design the architectures of deep neural networks and achieved performance comparable with human-designed models. Similar ideas have been adopted to aid the design of explainable machine learning models. For example, RL has been successfully used to learn a causal graph that explains the data (Zhu et al., 2020). Recently, RL has been used in symbolic regression (Petersen, 2021; Landajuela, 2021), which aims to learn a set of compact mathematical expressions to explain the dynamics of complex dynamical systems. In our paper, we customize the RL algorithm to learn temporal logic rules that explain event sequences.

3.1. TEMPORAL POINT PROCESSES

Given an event sequence H_t = {t_1, t_2, ..., t_n | t_n < t} up to time t, which yields a counting process {N(t), t ≥ 0}, the dynamics of the TPP can be characterized by the conditional intensity function, denoted λ(t|H_t). By definition, λ(t|H_t)dt = E[N([t, t + dt]) | H_t], where N([t, t + dt]) denotes the number of points falling in the interval [t, t + dt]. By a standard argument (Rasmussen, 2018), one can express the joint likelihood of the events H_t as

p({t_1, t_2, ..., t_n | t_n < t}) = ∏_{t_i ∈ H_t} λ(t_i | H_{t_i}) · exp(−∫_0^t λ(τ | H_τ) dτ).  (1)

TPP modeling boils down to the design of intensity functions, and the model parameters can be learned by maximizing the likelihood.

3.2. TEMPORAL LOGIC RULE

A first-order logic rule is a logical connective of predicates, such as f : ∀c, ∀c', Smokes(c') ← Friend(c, c') ∧ Smokes(c). A temporal dimension is added to the predicates. A temporal logic rule is a logic rule with temporal ordering constraints. For example, f : ∀c, Covid(c, t_3) ← SymptomsAppear(c, t_2) ∧ ExposedToVirus(c, t_1) ∧ Before(t_1, t_2). For discrete events, we consider three types of temporal relations: Before(t_1, t_2) = 1{t_1 − t_2 < 0}, After(t_1, t_2) = 1{t_1 − t_2 > 0}, and Equal(t_1, t_2) = 1{t_1 = t_2}. We also treat a temporal relation as a temporal predicate, which is a boolean variable. Formally, we consider the following general temporal logic rule (for simplicity, we omit the entity index c and assume all the predicates are defined for the same entity):

f : Y(t_y) ← ⋀_{u ∈ X_f} X_u(t_u) [property predicates] ∧ ⋀_{u,u' ∈ X_f} T_{uu'}(t_u, t_{u'}) [temporal relations]  (2)

where Y(t_y) is the head predicate evaluated at time t_y, X_f is the set of predicates defined in rule f, and T_{uu'} denotes the temporal relation between predicates u and u', which can take any mutually exclusive relation from the set {Before, After, Equal, None}. Note that None indicates there is no temporal relation constraint between predicates u and u'. Here t_y, t_u, and t_{u'} are the occurrence times associated with the predicates.
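As an illustrative sketch (not the paper's implementation), the log of the likelihood in Eq. (1) can be evaluated numerically for any intensity function; here the compensator integral is approximated with a midpoint Riemann sum on a uniform grid.

```python
import math

def log_likelihood(event_times, intensity, T, n_grid=1000):
    # log of Eq. (1): sum of log-intensities at the observed events,
    # minus the compensator integral of the intensity over [0, T],
    # approximated with a midpoint Riemann sum.
    ll = sum(math.log(intensity(t)) for t in event_times)
    dt = T / n_grid
    ll -= sum(intensity((j + 0.5) * dt) for j in range(n_grid)) * dt
    return ll
```

For a homogeneous intensity λ(t) = 2 on [0, 2] with two observed events, this evaluates to 2·log 2 − 4.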

3.3. TEMPORAL LOGIC POINT PROCESS

The grounded temporal predicate x(c, t), such as Smokes(c, t), generates a sequence of discrete events {t_1, t_2, ...}, where the times at which the predicate becomes 1 (i.e., True) are recorded.

Logic-informed intensity function

The main idea of TLPP (Li et al., 2020) is to construct the intensity functions using temporal logic rules. TLPP considers complicated logical dependency patterns, which enables symbolic reasoning over temporal event sequences. Given a rule as in Eq. (2), TLPP builds the intensity function conditional on history. Only the effective combinations of historical events that make the body condition True are collected to reason about the intensity of the head predicate. We introduce a logic function g_f(·) to check the body conditions of f; g_f(·) can also incorporate temporal decaying kernels to capture the decaying effect of the evidence, as in the Hawkes process. The logic-informed feature φ_f gathers the information from history and is computed as

φ_f(t) = Σ_{{t_u}_{u∈X_f} ∈ ∏_{u∈X_f} H_t^u} g_f({t_u}_{u∈X_f})

where H_t^u denotes the historical events specific to predicate u up to time t. Suppose there is a rule set F that can be used to reason about Y. For each f ∈ F, one can compute the feature φ_f(t) as above. Assume that the rules are connected in disjunctive normal form (OR-of-ANDs). TLPP models the intensity of the head predicate {Y(t)}_{t≥0} as a log-linear function of the features,

λ(t | H_t) = exp(b_0 + Σ_{f∈F} w_f · φ_f(t))

where w = [w_f]_{f∈F} ≥ 0 are the learnable weight parameters associated with each rule, and b_0 is the learnable base intensity. All the model parameters can be learned by maximizing the likelihood defined in Eq. (1). Note that given this intensity model, the likelihood is convex with respect to w (Fahrmeir et al., 1994).
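A minimal sketch of the logic-informed feature and log-linear intensity above. The function names, the specific exponential decay kernel for g_f, and the `relations_hold` callback are illustrative assumptions, not the paper's code.

```python
import math
from itertools import product

def phi(t, body_histories, relations_hold, decay=0.1):
    # Logic-informed feature: sum over combinations of past events,
    # one event per body predicate, that occur before t and satisfy
    # the rule's temporal relations; g_f here is an exponential decay
    # kernel on the most recent piece of evidence (an assumption).
    total = 0.0
    for combo in product(*body_histories):
        if all(s < t for s in combo) and relations_hold(combo):
            total += math.exp(-decay * (t - max(combo)))
    return total

def intensity(t, features, weights, b0=0.0):
    # Log-linear intensity: lambda(t|H_t) = exp(b0 + sum_f w_f * phi_f(t)).
    return math.exp(b0 + sum(w * f(t) for w, f in zip(weights, features)))
```

For example, with one event per body predicate and a Before constraint between them, `phi` counts the single valid combination, weighted by the decay kernel.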

4. REINFORCEMENT TEMPORAL LOGIC RULE LEARNING

Our goal is to jointly learn the set of OR-of-ANDs temporal logic rules and their weights by MLE. Each rule has the general form (2). To discover each rule, the algorithm needs to navigate the combinatorial space of all combinations of property predicates and their temporal relations. Moreover, each rule can have various lengths, and the computational complexity grows exponentially with the rule length. The algorithm alternates between a master problem (rule evaluator) and a subproblem (rule generator).

4.1. OVERALL LEARNING OBJECTIVE: A REGULARIZED MLE

We formulate the overall model learning problem as a regularized MLE problem, where the objective function is the log-likelihood with a rule set complexity penalty, i.e.,

Original Problem: w*, b_0* = argmin_{w, b_0} −ℓ(w, b_0) + Ω(w)  s.t. w_f ≥ 0, f ∈ F  (5)

where F is the complete rule set and Ω(w) is a convex regularization function that takes a high value for "complex" rule sets. For example, we can set Ω(w) = λ_0 Σ_{f∈F} c_f w_f, where c_f is the rule length.
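With the linear penalty Ω(w) = λ_0 Σ_f c_f w_f, the regularized objective of Eq. (5) is a one-liner. A sketch only; `neg_log_lik` is assumed to be computed elsewhere (e.g., from Eq. (1)).

```python
def regularized_objective(neg_log_lik, weights, rule_lengths, lam0=0.1):
    # -l(w, b0) + Omega(w), where Omega(w) = lam0 * sum_f c_f * w_f
    # penalizes long rules more heavily (c_f is the length of rule f).
    return neg_log_lik + lam0 * sum(c * w for c, w in zip(rule_lengths, weights))
```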

4.1.1. RESTRICTED MASTER PROBLEM: CONVEX OPTIMIZATION

The above original problem is hard to solve because the set of variables is exponentially large and cannot be optimized simultaneously in a tractable way. We therefore start with a restricted master problem (RMP), where the search space is much smaller. For example, we can start with an empty rule set, denoted F_0 ⊂ F, and gradually expand this subset to improve the results, producing a nested sequence of subsets F_0 ⊂ F_1 ⊂ ... ⊂ F_k ⊂ .... For each F_k, k = 0, 1, ..., the restricted master problem is formulated by replacing the complete rule set F with F_k:

Restricted Master Problem: w*_{(k)}, b*_{0,(k)} = argmin_{w, b_0} −ℓ(w, b_0) + Ω(w)  s.t. w_f ≥ 0, f ∈ F_k.  (6)

Solving the RMP corresponds to the evaluation of the current candidate rules: all rules in the current set are reweighed. The optimality of the current solution can be verified via complementary slackness for convex problems, which in fact leads to the objective function of our subproblem. A proof can be found in Appendix F.
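Since the RMP is convex in (w, b_0) with only nonnegativity constraints on w, it can be solved by projected gradient descent, as sketched below. The `grad_fn` callback, returning the gradient of −ℓ + Ω with respect to (w, b_0), is a placeholder assumption.

```python
def solve_rmp(grad_fn, w0, b0, lr=0.05, n_steps=500):
    # Projected gradient descent for the restricted master problem (Eq. 6):
    # take a gradient step on (w, b0), then project each w_f back onto
    # the feasible set w_f >= 0.
    w, b = list(w0), b0
    for _ in range(n_steps):
        gw, gb = grad_fn(w, b)
        w = [max(wi - lr * gi, 0.0) for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b
```

In practice the same loop can be run with stochastic gradients (SGD) over mini-batches of event sequences.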

4.1.2. SUBPROBLEM: COMBINATORIAL PROBLEM

A subproblem is formulated to propose a new temporal logic rule that can potentially improve the optimal value of the RMP the most. Given the current solution w*_{(k)}, b*_{0,(k)} of the restricted master problem (6), the subproblem minimizes the (negative) gain:

Subproblem: min_f  [−∂ℓ(w, b_0)/∂w_f + ∂Ω(w)/∂w_f] evaluated at (w*_{(k)}, b*_{0,(k)})  (7)

where the dependence on f is through its rule-informed feature φ_f. We aim to search for a new rule whose feature minimizes the above objective. If the optimal subproblem value is negative, we include the new rule in the set. Otherwise, if the optimal subproblem value is non-negative, we have reached optimality and can stop the overall algorithm. However, this search procedure is computationally expensive: every candidate rule structure must be constructed before its feature can be evaluated. In the following, we discuss how to solve the subproblem more efficiently using reinforcement learning. We emphasize that, although this way of formulating master and subproblems has been widely used in machine learning, including the gradient grafting algorithm for learning high-dimensional linear models (Perkins et al., 2003), column generation for solving mixed integer programs (Savelsbergh, 2002; Nemhauser, 2012; Lübbecke & Desrosiers, 2005) and large linear programs (Demiriz et al., 2002), and for learning logic rules (Dash et al., 2018; Wei et al., 2019), our overall learning framework is gradient-based. For the master problem, we solve a convex optimization with continuous variables by gradient descent (or SGD). For the subproblem, although it requires searching over a combinatorial space, we convert the problem into learning a neural policy whose parameters can be learned by policy gradient.
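For the log-linear intensity of Section 3.3, ∂ℓ/∂w_f = Σ_i φ_f(t_i) − ∫_0^T φ_f(τ) λ(τ|H_τ) dτ, so the subproblem objective (7) for one candidate rule can be sketched as follows. The grid approximation of the integral and all argument names are assumptions for illustration.

```python
def subproblem_objective(phi_f, event_times, cur_intensity, T, c_f,
                         lam0=0.1, n_grid=1000):
    # Reduced cost of candidate rule f at the current RMP solution:
    #   -dl/dw_f + dOmega/dw_f = -(sum_i phi_f(t_i) - int phi_f * lambda)
    #                            + lam0 * c_f.
    # A negative value means adding rule f can improve the objective.
    dll = sum(phi_f(t) for t in event_times)
    dt = T / n_grid
    dll -= sum(phi_f((j + 0.5) * dt) * cur_intensity((j + 0.5) * dt)
               for j in range(n_grid)) * dt
    return -dll + lam0 * c_f
```

This is the quantity the reinforcement-learned policy receives (negated) as its terminal reward for a generated rule.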

4.2. SOLVING SUBPROBLEMS VIA REINFORCEMENT LEARNING

Figure 3: Illustration of generating a temporal logic rule using a neural policy (LSTM).

The subproblem formulated in Eq. (7) provides a criterion for proposing a new temporal logic rule. The subproblem itself is a minimization that aims to attain the most negative gain. However, explicitly solving the subproblem requires enumerating all possible conjunctions of the input property predicates and all possible pairwise temporal relations among the selected predicates, which is extremely computationally expensive. In this paper, instead of enumerating all conjunctions, we propose to learn a neural policy by reinforcement learning that generates the best-explanatory rules.

4.2.1. GENERATING RULES WITH RECURRENT NEURAL NETWORK

We leverage the fact that temporal logic rules can be represented as sequences of "tokens" subject to some unique structure. Using the pre-order traversal trick, we parameterize the policy by an RNN or LSTM, combined with dynamic masks, to guarantee that the generated tokens yield a valid temporal logic rule. We generate the rules one token at a time according to the pre-order traversal, as demonstrated in Fig. 3. Specifically, denote by s the state, which is the embedding of the previously generated tokens. We model the policy as π_θ(a|s) with learnable parameters θ, which quantifies the token selection probability given s. Each token/action is chosen from two predefined libraries: i) the property predicate library, and ii) the temporal relation library. Given the head predicate, we generate the body predicates and their temporal relations sequentially. Every time a property predicate is generated, we consider its temporal relations with all the previously generated property predicates. Note that the temporal relation token can be None, which means there is no temporal relation constraint. All this generative prior knowledge can be incorporated as constraints by designing dynamic masks.
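One generation step can be sketched as masked sampling from the policy's softmax over the token library. This is a stdlib sketch of the masking mechanism only; in the paper the logits come from the LSTM policy.

```python
import math
import random

def masked_sample(logits, mask, rng=random):
    # Sample a token index from softmax(logits) restricted to tokens
    # whose dynamic-mask entry is 1; masked-out tokens get probability 0,
    # which is how grammar constraints are enforced at each step.
    weights = [math.exp(l) if m else 0.0 for l, m in zip(logits, mask)]
    r, acc = rng.random() * sum(weights), 0.0
    for i, wgt in enumerate(weights):
        if wgt == 0.0:
            continue
        acc += wgt
        if r <= acc:
            return i
    return len(weights) - 1
```

At each step the mask is recomputed from the partially generated rule, e.g., forbidding a temporal relation token when no pair of property predicates is pending.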

4.2.2. RISK-SEEKING POLICY GRADIENT

The standard policy gradient objective J(θ) is defined as an expectation. This is the desired objective for control problems in which one seeks to optimize the average performance of a policy. However, the rule learning problems described in our paper search for best-fitting rules. For such problems, J(θ) may not be appropriate, as there is a mismatch between the objective being optimized and the final performance metric. We therefore consider a risk-seeking policy gradient as in (Petersen, 2021; Landajuela, 2021), which proposed an alternative objective that aims to maximize best-case performance. Following the original work (Landajuela, 2021; Petersen, 2021), we first define R_ε(θ) as the (1 − ε)-quantile of the distribution of rewards under the current policy. The new objective J_risk(θ; ε) is then given by:

J_risk(θ; ε) = E_{τ∼π(τ|θ)} [R(τ) | R(τ) ≥ R_ε(θ)]  (8)

The risk-seeking policy gradient can be estimated using roll-out samples, i.e.,

∇_θ J_risk(θ; ε) ≈ (1/N) Σ_{i=1}^N Σ_{k=1}^K [R(τ^(i)) − R̃_ε(θ)] · 1{R(τ^(i)) ≥ R̃_ε(θ)} ∇_θ log π_θ(a_k^(i) | s_k^(i))  (9)

where N is the number of episodes, K is the number of tokens (actions), R̃_ε(θ) is the empirical (1 − ε)-quantile of the batch of rewards, and 1{·} returns 1 if its condition is true and 0 otherwise. We use this estimated policy gradient to update the policy:

θ ← θ + α ∇_θ J_risk(θ; ε)  (10)

where α is the learning rate. Further, following the maximum entropy reinforcement learning framework (Haarnoja et al.), a bonus proportional to the entropy can be added to the objective to encourage exploration.
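The per-episode coefficients in Eq. (9) can be sketched as below. This is an illustrative helper under our own naming; the actual gradient estimate would multiply each coefficient by that episode's Σ_k ∇_θ log π_θ(a_k|s_k).

```python
def risk_seeking_weights(rewards, eps=0.2):
    # Empirical (1 - eps)-quantile of the batch rewards, and the
    # per-episode coefficient (R - quantile) * 1{R >= quantile} used
    # by the risk-seeking gradient estimator of Eq. (9). Episodes
    # below the quantile contribute nothing to the gradient.
    srt = sorted(rewards)
    idx = min(int((1 - eps) * len(srt)), len(srt) - 1)
    q = srt[idx]
    return [(r - q) if r >= q else 0.0 for r in rewards], q
```

Only the top-ε fraction of roll-outs shapes the update, which is what biases the policy toward best-case rather than average-case rule quality.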

5.1. SYNTHETIC DATA

We prepared 6 groups of synthetic event data, each group with a different set of ground truth rules. We considered the following baselines: TELLER (Li et al., 2022) , policy gradient without risk seeking, and brute-force method (enumerating all possible rules).

Accuracy and Scalability

For each group, we further considered 5 cases, with the to-be-searched property predicate library sized 8, 12, 16, 20, and 24, respectively. Note that only a small number of the predicates appear in the true rules; many predicates are redundant and act as background predicates. We aim to test: 1) whether our reinforcement temporal logic learning algorithm can truly uncover the rules from the noisy variable set, 2) how accurately the rule weights can be estimated, and 3) how the performance evolves as we gradually enlarge the variable set with more and more redundant variables. The ground truth rules of the different groups have different lengths and content structures. In some groups the rules share many common predicates, while in others the rules are quite distinct in content. For example, the ground truth rules in groups {1, 2, 3} are quite different in content, while the ground truth rules in groups {4, 5, 6} share many common predicates. Regarding length, in groups {1, 4} and groups {2, 5}, the number of property predicates per ground truth rule is 6 and 7, respectively. We set the number of property predicates per ground truth rule to 8 in groups {3, 6} to craft relatively long rules with intricate temporal relations. When uncovering the ground truth rules in each group, we fix 2 predicates as prior knowledge and ask the algorithm to complete the temporal logic rules. Complete results for all the datasets can be found in Appendix I. In Fig. 4, we report the learning results for the case with predicate library size 20 for all groups (with different rule content). We used 1000 event sequences as training data. Each plot in the top row uses a Venn diagram to show the true rule set and the learned rule set, from which the Jaccard similarity score (area of the intersection divided by the area of the union) is calculated.
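The Jaccard similarity used for this comparison is simply the following (treating each rule as a canonical string):

```python
def jaccard(true_rules, learned_rules):
    # Jaccard similarity between rule sets:
    # |intersection| / |union|; defined as 1.0 when both sets are empty.
    a, b = set(true_rules), set(learned_rules)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```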
Our proposed model discovered almost all the ground truth rules. Each plot in the bottom row compares the true rule weights with the learned rule weights, with the mean absolute error (MAE) reported. Almost all the true rule weights are accurately learned and the MAEs are quite low. In group-3 and group-6, we crafted long and complex rules with 8 body property predicates and various temporal relations, which yield an extremely huge search space, but our model still discovered almost all the ground truth rules. Fig. 5 illustrates the Jaccard similarity score and MAE for all cases in all 6 groups using 1000 samples. For all 6 groups, as the number of predicates in the predicate set increases, the Jaccard similarity scores decrease slightly and the MAE increases slightly, but both remain within an acceptable range. This is because as the number of redundant predicates increases, the search space expands exponentially and the complexity of searching increases dramatically. But when the size of the predicate set is moderate and the samples are sufficient, our model is very stable and reliable. Computational Efficiency. As shown in Fig. 6 (left), we display the curve of the log-likelihood versus running time for our method and the baselines. Our method uses the risk-seeking policy gradient in solving subproblems (refer to the flat-line periods in the figure). As a comparison, the vanilla (normal) policy gradient still optimizes the expectation of the reward in the subproblem. TELLER uses enumerative search to generate new rules in subproblems, although it adopts a depth-first heuristic that appends predicates to important short rules to generate long rules. The brute-force method first considers all one-body-predicate rules to optimize the likelihood by learning the rule importance, then adds all two-body-predicate rules to the model, and so on.
From the results, we see that our method uncovers the ground truth rules faster and more accurately than all the baselines in the long run. The problem is almost intractable for the brute-force method. Compared with the normal policy gradient method, the performance and accuracy of our model are also better, mainly because the risk-seeking policy gradient lets the model focus its learning on maximizing best-case performance. We conducted more experiments comparing the normal and risk-seeking policy gradients; please refer to Appendix K for details. Transferability. Our well-trained neural search policies are also transferable. As shown in Fig. 6 (right), once we obtain a collection of well-trained policies from solving each subproblem, we can reuse them on other datasets generated by the same or similar ground truth rules. The results show that this speeds up solving the subproblems and improves the efficiency and accuracy of uncovering the ground truth rules in those datasets.

5.2. HEALTHCARE DATASET

MIMIC-III is an electronic health record dataset of patients admitted to the intensive care unit (ICU) (Johnson et al., 2016). We considered patients diagnosed with sepsis (Saria, 2018; Raghu et al., 2017; Peng et al., 2018), since sepsis is one of the major causes of mortality in the ICU. Previous studies suggest that the optimal treatment strategy remains unclear, in particular regarding how to use intravenous fluids and vasopressors to support the circulatory system. There also exists clinical controversy about when and how to use these two groups of drugs to reduce side effects for patients. For this real problem, we implemented our proposed reinforcement temporal logic rule learning algorithm to learn explanatory rules and their weights and to gain insight into this problem. Discovered temporal logic rules. In Appendix C, we report all the uncovered temporal logic rules and their weights learned by our algorithm, using LowUrine and NormalUrine as the head predicates, respectively. We also invited human experts (doctors) to check the correctness of these discovered rules; the doctors consider most of these rules clinically meaningful and consistent with the pathogenesis of sepsis. The doctors' modifications of and suggestions for these algorithm-discovered rules are also provided. The experts think that Rules 1-9 capture the major lab measures, like low systolic blood pressure, high blood urea nitrogen, and low central venous pressure (appearing in Rule 2), that usually emerge before extremely low urine output. Rules 10-18 shed light on drug and treatment selection. For example, as reflected in Rule 12, Crystalloid and Dobutamine together yield a weight of 0.4535 for the patient's normal urine output.

Compared with baselines in event prediction

We considered the following SOTA baselines: 1) Recurrent Marked Temporal Point Processes (RMTPP) (Du et al., 2016), the first neural point process (NPP) model, where the intensity function is modeled by a recurrent neural network (RNN); 2) Neural Hawkes Process (NHP) (Mei & Eisner, 2017), an improved variant of RMTPP built on a continuous-time LSTM; 3) Transformer Hawkes Process (THP) (Zuo et al., 2020), which leverages the self-attention mechanism to capture long-term dependencies while remaining computationally efficient; 4) Tree-Regularized GRU (TR-GRU) (Wu et al., 2018), a deep time-series model with a designed tree regularizer that adds interpretability; 5) HExp (Lewis & Mohler, 2011), a Hawkes process with an exponential kernel; 6) Transformer (Vaswani et al., 2017), an encoder-decoder model that relies on neither recurrence nor convolutions to generate an output; 7) TELLER (Li et al., 2022), which alternates between a master problem and a subproblem (enumerative search) to learn logic rules; and 8) PG-normal, a neural search policy learned by the normal policy gradient, without risk seeking. We used mean absolute error (MAE) as the evaluation metric for the event prediction tasks on the two head predicates, 1) LowUrine and 2) NormalUrine, evaluated by predicting the times at which these events occur. Lower MAE (in hours here) indicates better model performance. The performance of our model and all baselines is compared in Tab. 1, from which one can observe that our model outperforms all the baselines in this experiment.

6. CONCLUSION

In this paper, we proposed a reinforcement temporal logic rule learning algorithm to jointly learn temporal logic rules and their weights from noisy event data. The proposed algorithm alternates between a rule generator stage and a rule evaluator stage, where a neural search policy is trained by risk-seeking policy gradient to discover new rules in the rule generator stage. The use of the neural policy makes the subproblem differentiable, and well-trained policies can easily be transferred to other tasks. We empirically evaluated our method on both synthetic and healthcare datasets, obtaining promising results.

A APPENDIX

In the following, we provide supplementary materials to better illustrate our methods and experiments.

• Section B presents different ground-truth rule structures and some experimental results on synthetic data.
• Section C presents all learned rules on MIMIC-III data, with expert identification and revision suggestions.
• Section D provides the definitions of all types of temporal relations considered in our model.
• Sections E and F elaborate on the necessary proofs, which justify our model and learning framework.
• Section G provides pseudocode to illustrate our rule generator policy, the subproblem formulation, and the complete algorithm.
• Section H describes our computing infrastructure.
• Sections I, J, and K show the comprehensive experiments on synthetic data; in Section K we compare the performance of the standard policy gradient and the risk-seeking policy gradient.
• Section L introduces the background on the MIMIC-III data and lists the chosen predicates used in our real-data experiment.
• Section M presents further experimental results on the rules modified by experts.
• Section N introduces another application of our proposed method: understanding shoppers' purchase patterns from their eye fixation event data.

B ALL LEARNED RULES ON SYNTHETIC DATA

We did experiments on 6 groups of event data, each group with a different set of ground-truth rules. For each group, we further considered 5 cases. To illustrate the different rule structures, we list all the discovered rules for Case-2 of all groups in Tab. 2, Tab. 3, and Tab. 4. Due to limited space, we do not show the learned rule content for all cases of all groups; please refer to Section I for the complete results on synthetic data.

* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F
* Y ← C ∧ B ∧ F ∧ D ∧ E ∧ G ∧ C Before B ∧ B Before F ∧ F Before D ∧ D Before E ∧ E Before G
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ G ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before G
* Y ← C ∧ B ∧ G ∧ D ∧ E ∧ H ∧ C Before B ∧ B Before G ∧ G Before D ∧ D Before E ∧ E Before H
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ H ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before H
--
* Y ← D ∧ B ∧ C ∧ A ∧ E ∧ H ∧ D Before B ∧ B Before C ∧ C Before A ∧ A Equal E ∧ E Equal H
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ G ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F ∧ F Equal G
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ G ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F ∧ F Equal G
* Y ← C ∧ B ∧ F ∧ D ∧ E ∧ G ∧ H ∧ C Before B ∧ B Before F ∧ F Before D ∧ D Before E ∧ E Before G ∧ G Equal H
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ H ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F ∧ F Equal H
* Y ← C ∧ B ∧ A ∧ D ∧ E ∧ H ∧ G ∧ C Before B ∧ B Before A ∧ A Before D ∧ D Before E ∧ E Before H ∧ H Equal G
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ H ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F ∧ F Before H
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ G ∧ H ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F ∧ F Equal G ∧ G Equal H
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ G ∧ H ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F ∧ F Equal G ∧ G Equal H
* Y ← C ∧ B ∧ F ∧ D ∧ E ∧ G ∧ H ∧ A ∧ C Before B ∧ B Before F ∧ F Before D ∧ D Before E ∧ E Before G ∧ G Equal H ∧ H Equal A
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ G ∧ H ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F ∧ F Equal G ∧ G Before H
* Y ← C ∧ B ∧ A ∧ D ∧ E ∧ H ∧ G ∧ F ∧ C Before B ∧ B Before A ∧ A Before D ∧ D Before E ∧ E Before H ∧ H Equal G ∧ G Equal F
* Y ← A ∧ B ∧ C ∧ D ∧ E ∧ F ∧ H ∧ G ∧ A Before B ∧ B Before C ∧ C Before D ∧ D Before E ∧ E Before F ∧ F Equal H ∧ H Before G
* Y ← A ∧ C ∧ H ∧ G ∧ B ∧ D ∧ E ∧ F ∧ A Before C ∧ C Before H ∧ H Before G ∧ G Before B ∧ B Before D ∧ D Before E ∧ E Before F
--

C ALL LEARNED RULES ON MIMIC-III DATA

The complete sets of learned rules on MIMIC-III data are shown in Tab. 5 and Tab. 6. We invited human experts to check our learned rules and provide revision suggestions. The rules shown in these tables have been identified by the experts as clinically meaningful and consistent with sepsis pathology. Some of the rules were learned directly by our method without any modification (marked with a blue asterisk). Others are broadly consistent with clinical facts but contain some non-pathological content; these have been slightly modified by the experts (marked with a red asterisk).

D DETAILED EXPLANATION OF TEMPORAL RELATION

In this paper, temporal relations are defined among events. For any pair of events, denoted A and B, there exist only three types of temporal relations, which can be grounded by the events' occurrence times, denoted t_A and t_B; see Table 7 below for illustrations. The temporal relation of any two events is treated as a temporal ordering constraint and can be included in a temporal logic rule as in Eq. (4). Note that when included in a rule, the temporal relation can be None, which indicates that no temporal ordering constraint between the two events is needed to satisfy the rule.
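These relations can be grounded mechanically from timestamps. A small sketch (the tolerance argument for Equal is our own assumption, since recorded timestamps rarely coincide exactly):

```python
# Sketch of grounding the pairwise temporal relations of Table 7 from event
# times. The tolerance for "Equal" is our own assumption, since recorded
# timestamps rarely coincide exactly.
def temporal_relation(t_a, t_b, tol=0.0):
    if abs(t_a - t_b) <= tol:
        return "Equal"
    return "Before" if t_a < t_b else "After"

def satisfies(relation, t_a, t_b, tol=0.0):
    # "None" imposes no ordering constraint, so it is always satisfied
    return relation == "None" or temporal_relation(t_a, t_b, tol) == relation

print(temporal_relation(1.0, 2.5))   # Before
print(satisfies("None", 9.0, 3.0))   # True
```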

E PROOF OF THE LIKELIHOOD FUNCTION OF TLPP

The likelihood function of the TLPP is a straightforward result from TPP; readers can refer to the proofs in (Rasmussen, 2018). To be self-contained, we provide a sketch of the proof here. For a specific entity $c$, given all the events associated with the head predicate $(t_1^c, t_2^c, \dots) \in [0, t)$, the likelihood function is the joint density of these events. Using the chain rule, the joint likelihood can be factorized into the conditional densities of each point given all points before it. For entity $c$, this yields

$$L^c = p^c(t_1^c \mid \mathcal{H}_0)\, p^c(t_2^c \mid \mathcal{H}_{t_1^c}) \cdots p^c(t_n^c \mid \mathcal{H}_{t_{n-1}^c}) \left(1 - F^c(t \mid \mathcal{H}_{t_n})\right).$$

By the hazard-rate definition of the intensity function,

$$\lambda^c(t) = \frac{p^c(t \mid \mathcal{H}_{t_n})}{1 - F^c(t \mid \mathcal{H}_{t_n})},$$

we have $p^c(t \mid \mathcal{H}_{t_n}) = \lambda^c(t) \exp\left(-\int_{t_n}^{t} \lambda^c(s)\, ds\right)$. Using the above equation, we get

$$L^c = \prod_{i=1}^{n} p^c(t_i^c \mid \mathcal{H}_{t_{i-1}^c}) \cdot \frac{p^c(t \mid \mathcal{H}_{t_n})}{\lambda^c(t)} = \prod_{i=1}^{n} \lambda^c(t_i^c) \exp\left(-\int_{t_{i-1}^c}^{t_i^c} \lambda^c(\tau)\, d\tau\right) \cdot \exp\left(-\int_{t_n^c}^{t} \lambda^c(\tau)\, d\tau\right) = \prod_{i=1}^{n} \lambda^c(t_i^c) \exp\left(-\int_{0}^{t} \lambda^c(\tau)\, d\tau\right).$$

Now consider the likelihood function of all entities $C = \{c_1, c_2, \dots, c_n\}$. Since it factorizes over entities, the likelihood can be written as

$$\prod_{c \in C} \prod_{t_i^c \in \mathcal{H}_t} \lambda^c(t_i^c \mid \mathcal{H}_{t_i}) \cdot \exp\left(-\int_{0}^{t} \lambda^c(\tau \mid \mathcal{H}_\tau)\, d\tau\right),$$

which completes the proof.
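The final expression can be checked numerically in the simplest homogeneous case, where the intensity is a constant and the compensator $\int_0^t \lambda\, ds$ reduces to $\lambda t$. A minimal sketch (the event times and intensity value are illustrative):

```python
import math

# Numerical check of the final likelihood formula in the simplest homogeneous
# case: constant intensity lam, so the compensator int_0^t lam ds reduces to
# lam * t. Event times and lam below are illustrative.
def log_likelihood(event_times, lam, horizon):
    # sum of log-intensities at the events minus the compensator over [0, horizon]
    return sum(math.log(lam) for _ in event_times) - lam * horizon

events = [0.7, 2.1, 3.4]
print(log_likelihood(events, lam=0.5, horizon=5.0))  # 3*log(0.5) - 2.5 ~= -4.579
```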

F OPTIMALITY CONDITION AND COMPLEMENTARY SLACKNESS

We now provide more detail on the optimality condition and the complementary slackness, which provide a sound guarantee for our learning algorithm. The original restricted convex problem is

$$\text{Original Problem:}\quad w^*, b_0^* = \arg\min_{w, b_0} -\ell(w, b_0) + \Omega(w) \quad \text{s.t. } w_f \geq 0,\ f \in \mathcal{F},$$

where $\Omega(w)$ is a convex regularization function that takes a high value for "complex" rule sets. For example, we can formulate $\Omega(w) = \lambda_0 \sum_{f \in \mathcal{F}} c_f w_f$, where $c_f$ is the rule length. The Lagrangian of the original master problem is

$$L(w, b_0, \nu) = -\ell(w, b_0) + \Omega(w) - \sum_{f \in \mathcal{F}} \nu_f w_f,$$

where $\nu_f \geq 0$ is the Lagrange multiplier associated with the non-negativity constraint on $w_f$. Since this is a convex problem, strong duality holds under mild conditions. Define $(w^*, b_0^*)$ as the primal optimum and $\nu^*$ as the dual optimum; then

$$-\ell(w^*, b_0^*) + \Omega(w^*) = \inf_{w, b_0} L(w, b_0, \nu^*) \quad \text{(strong duality)}$$
$$= \inf_{w, b_0} \Big[ -\ell(w, b_0) + \Omega(w) - \sum_{f \in \mathcal{F}} \nu^*_f w_f \Big] \leq -\ell(w^*, b_0^*) + \Omega(w^*) - \sum_{f \in \mathcal{F}} \nu^*_f w^*_f \leq -\ell(w^*, b_0^*) + \Omega(w^*). \quad (18)$$

Therefore $\sum_{f \in \mathcal{F}} \nu^*_f w^*_f = 0$, and since every term is non-negative, $\nu^*_f w^*_f = 0$ for each $f \in \mathcal{F}$. This implies the complementary slackness, i.e.,

$$w^*_f = 0 \Rightarrow \nu^*_f \geq 0, \qquad w^*_f > 0 \Rightarrow \nu^*_f = 0. \quad (19)$$

By the Karush-Kuhn-Tucker (KKT) conditions, the gradient of the Lagrangian $L(w^*, b_0^*, \nu^*)$ with respect to $(w, b_0)$ vanishes, i.e.,

$$\nu^*_f := -\frac{\partial \left[\ell(w, b_0) - \Omega(w)\right]}{\partial w_f}\Big|_{w^*, b_0^*}. \quad (20)$$

In summary, combining conditions (19) and (20), we obtain the optimality condition of the original problem: 1) if $w^*_f > 0$, then $\nu^*_f = 0$; 2) if $w^*_f = 0$, then $\nu^*_f \geq 0$, where the gradient $\nu^*_f$ can be computed via (20). At each iteration, we solve the subproblem to find the candidate rule that most violates this optimality condition, i.e., yields the most negative value of Eq. (20).
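Complementary slackness can be illustrated on a toy problem: we minimize an invented convex objective under non-negativity constraints by projected gradient descent (a stand-in for the actual master problem, with made-up targets) and verify that the multiplier vanishes on active weights and is non-negative on zero weights:

```python
# Toy numerical check of complementary slackness: minimize an invented convex
# objective f(w) = sum_f (w_f - a_f)^2 subject to w_f >= 0 (a stand-in for
# -log-likelihood + Omega) by projected gradient descent, then inspect the
# multipliers nu_f = df/dw_f at the optimum.
def grad(w, targets):
    return [2.0 * (wf - af) for wf, af in zip(w, targets)]

def projected_gradient(targets, steps=500, lr=0.1):
    w = [0.0] * len(targets)
    for _ in range(steps):
        g = grad(w, targets)
        w = [max(0.0, wf - lr * gf) for wf, gf in zip(w, g)]  # project onto w >= 0
    return w

targets = [1.0, -0.5]            # unconstrained optimum of the second weight is negative
w_star = projected_gradient(targets)
nu = grad(w_star, targets)       # nu_f = df/dw_f evaluated at w*
print(w_star)  # ~[1.0, 0.0]: the second weight hits the constraint
print(nu)      # ~[0.0, 1.0]: nu_f = 0 where w_f > 0, nu_f > 0 where w_f = 0
```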

G ALGORITHM BOX

Our method alternates between solving a restricted master problem and a subproblem. When executing the subproblem, we need to generate several candidate rules. We summarize the algorithm in Algorithm 1, Algorithm 2, and Algorithm 3. RG refers to the Rule Generator used to generate a new candidate rule when solving the subproblem; here, we parameterize the RG as an LSTM. SP is the abbreviation of Sub-Problem.

I COMPLETE RESULTS OF EXPERIMENTS ON SYNTHETIC DATA

Fig. 7 and Fig. 8 demonstrate the learning results for the cases with predicate library sizes 8, 12, 16, 20, and 24 (the predicates we need to search over) for all groups, using 1000 samples of networked events. The learning rate for solving the subproblem was on the order of 10^-2, the hidden state size of the LSTM was 32, and the learning rate for solving the restricted master problem was on the order of 10^-3. Our proposed model discovered almost all the ground-truth rules for all cases in all groups; almost all the true rule weights were accurately learned, and the MAEs are quite low. For the cases in group-3 and group-6, we considered long and complex rules with 8 body property predicates and the associated temporal relations, which yield a very large search space, but our model still discovered most of the ground-truth rules.

These final choices define the head predicate set. Another 18 predicates describe the location of the items, the value of the items, and the time of a shopper's eye fixation on one specific item. Please refer to Tab. 9 below for the complete predicate set.
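The alternating structure can be sketched schematically. RG, SP, and RMP below follow the naming in the text, but the candidate rules, their subproblem scores, and the weight update are toy stand-ins, not the actual TPP likelihood machinery:

```python
# Schematic sketch of the alternating loop in Algorithms 1-3. RG, SP, and RMP
# follow the naming in the text, but the candidate rules, their subproblem
# scores, and the weight update are toy stand-ins, not the TPP likelihood.
SCORES = {"A^B->Y": -2.0, "B^C->Y": -0.5, "A^C->Y": 0.3, "C^D->Y": 0.1}

def SP(rule):
    # subproblem objective: a negative value means the optimality condition is violated
    return SCORES[rule]

def RG(rule_set):
    # rule generator: propose the unused rule that most violates optimality
    unused = [r for r in SCORES if r not in rule_set]
    return min(unused, key=SP)

def RMP(rule_set):
    # restricted master problem: re-fit weights of all included rules (toy update)
    return {r: round(1.0 / (i + 1), 2) for i, r in enumerate(rule_set)}

rules, weights = [], {}
for _ in range(len(SCORES)):
    new_rule = RG(rules)
    if SP(new_rule) >= 0:       # no remaining rule violates optimality: stop
        break
    rules.append(new_rule)      # subproblem found a useful rule
    weights = RMP(rules)        # master problem updates all rule weights
print(rules, weights)  # ['A^B->Y', 'B^C->Y'] {'A^B->Y': 1.0, 'B^C->Y': 0.5}
```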

FinalChoice_LargestValue, FinalChoice_LastFixation, FinalChoice_LongestFixation

Rules Discussion: We display the discovered important rules in Tab. 10, which summarize the eye fixation patterns before shoppers make their choices. From the results, we have the following discoveries: 1) the final fixation is shorter; 2) the later (but not final) fixations are longer; 3) people are more likely to begin looking from the left or from the middle. Specifically, if a shopper finally chooses the item with the largest value, they may first glance over all three items at least once, or, after looking at all three items, go back to check the item they want to choose and then make the choice (Rules 1, 2, 5, and 8). People are also more used to looking from left to right (Rules 2, 3, 5, and 7). If a shopper finally chooses the item with the last eye fixation, they may only take a quick look at the items and may miss one or two of them (Rules 3, 6, and 9). If a shopper finally chooses the item with the longest eye fixation, they may spend a lot of time on most of the items, re-evaluating their values back and forth (Rules 4, 7, and 10). In summary, our discovered temporal logic rules provide insight into shoppers' purchase behaviors in terms of eye fixation patterns.



https://physionet.org/content/mimiciii/view-license/1.4/



Figure 1: Example of temporal logic rules: f1: Y ← x1 ∧ x2 ∧ x3 ∧ x4 ∧ (x1 Before x2) ∧ (x2 None x3) ∧ (x3 Before x4); f2: Y ← x5 ∧ x2 ∧ x3 ∧ x6 ∧ (x5 None x2) ∧ (x2 Equal x3) ∧ (x3 Before x6). None means no temporal order constraint.

Figure 2: Overall learning framework: an alternating process between the rule generator and the rule evaluator to discover new rules one by one. The overall learning framework is shown in Fig. 2, where the algorithm alternates between a master problem (rule evaluator) and a subproblem (rule generator).

Figure 4: Rule discovery ability and rule weight learning accuracy of our proposed model on Case-4 for all 6 groups. Blue indicates a ground-truth rule and red/yellow indicates a learned rule in the different groups.

Figure 5: Jaccard similarity score and MAE for all 6 groups. The x-axis indicates the predicate library size and the y-axis the value of the Jaccard similarity and MAE.

Figure 6: Log-likelihood trajectories. Left: comparison of our model (risk-seeking policy gradient) with several baselines (normal policy gradient, TELLER, brute-force search). Right: transfer experiments, where we first fully train a collection of rule generators and save the model parameters, then use the well-trained models to uncover new rules on different datasets. Transfer ## indicates new datasets whose ground-truth rules differ slightly from those of the datasets used to train the model; Same Rule ## indicates new datasets generated by the same ground-truth rules.

Rule 11: NormalUrine ← LowUrine ∧ Phenylephrine ∧ Dopamine ∧ (LowUrine Before Phenylephrine) ∧ (Phenylephrine Before Dopamine)
*Rule 12: NormalUrine ← LowUrine ∧ Crystalloid ∧ Dobutamine ∧ (LowUrine Equal Crystalloid) ∧ (Crystalloid Equal Dobutamine) — weight 0.4535
*Rule 13: NormalUrine ← LowUrine ∧ Phenylephrine ∧ Norepinephrine ∧ (LowUrine Equal Phenylephrine) ∧ (Phenylephrine Equal Norepinephrine) — weight 0.2113
*Rule 14: NormalUrine ← LowUrine ∧ Norepinephrine ∧ Dopamine ∧ NormalArterialBE ∧ (LowUrine Equal Norepinephrine) ∧ (Norepinephrine Equal Dopamine) ∧ (Dopamine Equal NormalArterialBE) — weight 0.5459
*Rule 15: NormalUrine ← LowUrine ∧ Norepinephrine ∧ NormalRBCcount ∧ NormalBUN ∧ (LowUrine Equal Norepinephrine) ∧ (Norepinephrine Equal NormalRBCcount) ∧ (NormalRBCcount Equal NormalBUN) — weight 0.3926
*Rule 16: NormalUrine ← LowUrine ∧ Colloid ∧ NormalArterialpH ∧ Dobutamine ∧ (LowUrine Equal Colloid) ∧ (Colloid Equal NormalArterialpH) ∧ (NormalArterialpH Equal Dobutamine) — weight 0.6430
*Rule 17: NormalUrine ← LowUrine ∧ Norepinephrine ∧ Dobutamine ∧ NormalSysBP ∧ (LowUrine Equal Norepinephrine) ∧ (Norepinephrine Equal Dobutamine) ∧ (Dobutamine Equal NormalSysBP) — weight 0.3464
*Rule 18: NormalUrine ← LowUrine ∧ Phenylephrine ∧ NormalSysBP ∧ Dopamine ∧ NormalCVP ∧ (LowUrine Equal Phenylephrine) ∧ (Phenylephrine Equal NormalSysBP) ∧ (NormalSysBP Equal Dopamine) ∧ (Dopamine Equal NormalCVP) — weight 0.5669

← Left_MaxValue_LongFixation ∧ Middle_MidValue_ShortFixation ∧ Left_MaxValue_LongFixation ∧ Right_MinValue_LongFixation ∧ Left_MaxValue_ShortFixation

Predicate. Define a set of entities C = {c1, c2, ..., cn}. A predicate is a property or relation of entities, i.e., a logic function defined over the set of entities, x(·): C × C × ··· × C → {0, 1}. For example, Smokes(c) is a property predicate and Friend(c, c′) is a relation predicate.
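A minimal illustration of property versus relation predicates as boolean functions over entities (the entity names and facts below are invented):

```python
# Minimal illustration of property vs. relation predicates as boolean
# functions over entities; the entity names and facts are invented.
entities = ["alice", "bob", "carol"]
smokers = {"alice"}
friendships = {("alice", "bob"), ("bob", "alice")}

def Smokes(c):                 # property predicate: C -> {0, 1}
    return int(c in smokers)

def Friend(c1, c2):            # relation predicate: C x C -> {0, 1}
    return int((c1, c2) in friendships)

print([Smokes(c) for c in entities])                     # [1, 0, 0]
print(Friend("alice", "bob"), Friend("alice", "carol"))  # 1 0
```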


Event prediction results.

Results of Case-2 for Group-1, Group-4. *Ground truth rules which are learned. * Ground truth rules which are not learned. * Rules that are wrongly learned.

Results of Case-2 for Group-2 and Group-5. *Ground truth rules which are learned. * Ground truth rules which are not learned. * Rules that are wrongly learned.

Results of Case-2 for Group-3 and Group-6. *Ground truth rules which are learned. * Ground truth rules which are not learned. * Rules that are wrongly learned.

Part of learned rules with LowUrine as the head predicate. *Rules that are clinically meaningful confirmed by experts. *Rules that are modified by experts.

All learned rules with NormalUrine as the head predicate. *Rules that are clinically meaningful confirmed by experts. *Rules that are modified by experts.

Table 7: Event-based temporal relations.

Temporal Relation | Mathematical Expression
A Before B        | t_A < t_B
A After B         | t_A > t_B
A Equals B        | t_A = t_B

Here $p^c(t \mid \mathcal{H}_{t_n})$ represents the conditional density and $F^c(t \mid \mathcal{H}_{t_n})$ its cumulative distribution function for any $t > t_n$. The factor $(1 - F^c(t \mid \mathcal{H}_{t_n}))$ appears in the likelihood because the unobserved point $t_{n+1}$ has not occurred up to $t$. Further, by the hazard-rate definition of the intensity function, $\lambda^c(t) = p^c(t \mid \mathcal{H}_{t_n}) / (1 - F^c(t \mid \mathcal{H}_{t_n}))$.

SP (Sub-Problem) is optimized to construct a new rule. RMP indicates the Restricted Master Problem used to update the model parameters. DynamicMask sets the location of NewBodyPred to zero.

Defined predicates for eye fixation trials. For eye fixation predicates, there are three parts: location of item, value of item, duration of eye fixation

Temporal logic rules discovered for eye fixation and final choice. Since there is only one type of temporal relation ("Before"), we omit the temporal relations in the rule representations below.


We also evaluated our model using 500 and 2000 samples with the same experimental settings. Our model achieved satisfactory performance even when the sample size was small, and its performance further improved as the sample size increased.


J REUSE THE LSTM MEMORY AND EARLY STOP MECHANISM

When should the LSTM memory be reused across subproblem iterations? The ground-truth rules in group-{1, 2, 3} are quite distinct in content, with almost no common predicates across rules. Given this prior knowledge, when training on the datasets in these groups, whenever the algorithm enters the subproblem to search for a new candidate rule, we reset the LSTM model and clear its memory. If we do not, we obtain the convergence results illustrated in Fig. 9, where we observe that keeping the LSTM memory hinders the convergence speed of the subproblems. By comparison, as illustrated in Fig. 10 (left), where we refresh the LSTM parameters whenever the algorithm enters the subproblem stage, the convergence of the subproblem is much faster. The ground-truth rules in group-{4, 5, 6}, in contrast, are similar in content and share many predicates; given this prior knowledge, reusing the LSTM across subproblems may help convergence.

Early stop mechanism. To speed up our algorithm, we propose an early stop mechanism. Once the number of iterations reaches a pre-set reasonable value and the LSTM consecutively generates identical logic rules (e.g., 10 identical rules in a row), we conclude that the LSTM has been trained enough and is able to generate a well-performing logic rule. We do not need to train the LSTM until the norm of the policy gradient falls within a very small tolerance; hence, we may stop training early. Fig. 10 (right) illustrates the training trajectories when solving the subproblem for group-1 case-1, where we reset the LSTM at the beginning of the subproblems and use the early stop mechanism. Clearly, this mechanism removes redundant iterations and shortens the training process.
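The early-stop criterion described above can be sketched as a simple check on the history of generated rules (the minimum iteration count and patience below are placeholders; the text uses, e.g., 10 identical rules):

```python
# Sketch of the early-stop rule: stop once a minimum number of subproblem
# iterations has passed and the generator has emitted the same rule
# `patience` times in a row (the text uses e.g. 10; values here are toys).
def should_early_stop(history, min_iters=5, patience=3):
    if len(history) < max(min_iters, patience):
        return False
    last = history[-patience:]
    return all(r == last[0] for r in last)

print(should_early_stop(["r1", "r2", "r3", "r3", "r3"]))  # True
print(should_early_stop(["r1", "r2", "r3"]))              # False
```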

K COMPARE WITH STANDARD POLICY GRADIENT

For the standard policy gradient, the policy $\pi_\theta(a \mid s)$ is trained end-to-end by maximizing the expected subproblem reward, i.e., $\max_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where $\tau$ is the sequence of generated tokens (i.e., a candidate rule) and $R(\tau)$ is its reward. The policy gradient can be estimated using roll-out samples, i.e., $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)}) \sum_{k=1}^{K} \nabla_\theta \log \pi_\theta(a_k^{(i)} \mid s_k^{(i)})$, where $N$ is the number of episodes and $K$ is the length of the token (action) sequence. We use this estimated policy gradient to update the policy parameters, $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate. However, in our problem the final performance of the model is measured by the single or few best-performing rules found during training, so the standard policy gradient may not be satisfactory, while the risk-seeking policy gradient may be more suitable. We compare the risk-seeking policy gradient model with the standard policy gradient model on the cases in group-3 and group-6, mainly because the ground-truth rules in these two groups are long and complex, with many body property predicates and various temporal relations; given the difficulty of recovering the ground-truth rules in these groups, the performance gap between the two models should be more visible. We set the risk-seeking quantile parameter to 0.3. The learning rate for solving the subproblem was on the order of 10^-2, the hidden state size of the LSTM was 32, and the learning rate for solving the restricted master problem was on the order of 10^-3. The results are shown in Fig. 11 and Fig. 12. Using the risk-seeking policy gradient brings significant improvements: more ground-truth rules, and, more importantly, more accurate ones were learned.
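The risk-seeking idea can be shown on a toy continuous-action problem: a one-parameter Gaussian policy with a made-up reward peaked at a = 2 (this is a stand-in, not the LSTM rule generator). Only episodes whose reward clears the empirical (1 − ε) quantile contribute to the gradient, with the quantile itself as baseline, so training chases the best-case rather than the average reward:

```python
import random

# Toy sketch of the risk-seeking policy gradient on a one-parameter Gaussian
# policy a ~ N(theta, sigma^2) with an invented reward peaked at a = 2.
# Only episodes whose reward reaches the empirical (1 - eps) quantile q
# contribute, with q as the baseline, so training chases best-case rewards.
random.seed(1)

def reward(a):
    return -(a - 2.0) ** 2          # invented reward, maximized at a = 2

def risk_seeking_step(theta, sigma=1.0, n=500, eps=0.3):
    actions = [random.gauss(theta, sigma) for _ in range(n)]
    rewards = [reward(a) for a in actions]
    q = sorted(rewards)[int((1 - eps) * n)]      # (1 - eps) reward quantile
    grad = 0.0
    for a, r in zip(actions, rewards):
        if r >= q:                               # keep only the top-eps episodes
            grad += (r - q) * (a - theta) / sigma ** 2   # d log pi / d theta
    return grad / n

theta = 0.0
for _ in range(150):
    theta += 0.5 * risk_seeking_step(theta)      # gradient ascent
print("theta after training:", round(theta, 2))  # drifts toward 2.0
```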

Predicates in Library

Figure 12: Jaccard similarity score for all 5 cases in group-6 using 1000 samples. Blue indicates the ground-truth rule, green the rule learned with the normal policy gradient, and yellow the rule learned with the risk-seeking policy gradient.

L PREDICATE DEFINITION IN MIMIC-III

MIMIC-III is an electronic health record ICU dataset released under the PhysioNet Credentialed Health Data License 1.5.0 1 . It was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). In this dataset, all patient health information is de-identified; we manually checked that the data contain no personally identifiable information or offensive content. We defined 63 predicates, covering two groups of drugs (i.e., intravenous fluids and vasopressors) and lab measurements; see Tab. 8 for details. Among all these predicates, we were interested in reasoning about two, which we define as head predicates: 1) LowUrine and 2) NormalUrine. We treated real-time urine output as the head predicate because low urine is a direct indicator of a failing circulatory system and a signal for septic shock, while normal urine reflects the effect of the drugs and treatments and the improvement of the patient's physical condition. In our experiments, lab measurement variables were converted to binary values (according to the normal ranges used in medicine) with the transition times recorded; drug predicates were recorded as 1 when the drug was applied to the patient. We extracted 2023 patient sequences and randomly selected 80% of them for training and the remaining 20% for testing. The average time horizon is 392.69 hours and the average number of events per sequence is 79.03.
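The binarization of a continuous lab series into a predicate with recorded transition times can be sketched as follows (the threshold values and urine readings below are invented, not the clinical normal ranges used in the paper):

```python
# Hypothetical sketch of converting a continuous lab series into a binary
# predicate with recorded transition times; thresholds and readings are
# invented, not the clinical normal ranges used in the paper.
def to_transitions(times, values, low, high):
    """Return (time, state) pairs at which the state flips Normal/Abnormal."""
    transitions = []
    state = None
    for t, v in zip(times, values):
        s = "Normal" if low <= v <= high else "Abnormal"
        if s != state:
            transitions.append((t, s))
            state = s
    return transitions

times = [0.0, 2.0, 4.5, 6.0, 9.0]
urine = [55, 20, 18, 60, 58]          # ml/h readings (illustrative)
print(to_transitions(times, urine, low=30, high=300))
# [(0.0, 'Normal'), (2.0, 'Abnormal'), (6.0, 'Normal')]
```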

M EXPERTS' MODIFICATION ALSO IMPORTANT

We also invited human experts (i.e., ICU doctors) to judge the correctness of our learned rules and to modify them. We compare the log-likelihood trajectories of directly updating the rule weights against updating the weights of the expert-modified rules; the results are shown in Fig. 13. They show that the experts' modification suggestions can indeed help improve the performance of our model, since the log-likelihood trajectory of the modified rules rises higher and faster.

N REAL-WORLD EXPERIMENT: EYE FIXATION

Simple choices are made by integrating noisy evidence that is sampled over time and influenced by visual attention; as a result, fluctuations in visual attention can affect choices. We aim to understand shoppers' purchase patterns given their eye fixation event data (Callaway et al., 2021). Our conjecture is that the location of the items, the shopper-assessed values of the items, and the shopper's visual habits (usually looking from left to right) will affect their final item choice. We learn temporal logic rules and their weights to understand this quantitatively. Dataset Description: Three items are randomly placed on the "left", "middle", and "right" of a supermarket shelf, each with a unique "price" (value). Each shopper evaluates the three items by eye fixation until they identify an item to purchase. The data record each shopper's fixated items (when and where) and their final purchased item. There are 30 participants, each with at most 100 independent trials. In each trial, participants were asked to look at the three items and choose the one they think is most valuable. There are 2966 trials in total; on average, a participant makes 4.3011 eye fixations per trial.
Predicate Definition: We are interested in explaining three final choices of a shopper: 1) finally choosing the item with the actual (not shopper-assessed) largest value; 2) finally choosing the item with the last eye fixation; and 3) finally choosing the item with the longest eye fixation.

