RHINO: DEEP CAUSAL TEMPORAL RELATIONSHIP LEARNING WITH HISTORY-DEPENDENT NOISE

Abstract

Discovering causal relationships between different variables from time series data has been a long-standing challenge for many domains such as climate science, finance, and healthcare. Given the complexity of real-world relationships and the nature of observations in discrete time, causal discovery methods need to consider non-linear relations between variables, instantaneous effects and historydependent noise (the change of noise distribution due to past actions). However, previous works do not offer a solution addressing all these problems together. In this paper, we propose a novel causal relationship learning framework for timeseries data, called Rhino, which combines vector auto-regression, deep learning and variational inference to model non-linear relationships with instantaneous effects while allowing the noise distribution to be modulated by historical observations. Theoretically, we prove the structural identifiability of Rhino. Our empirical results from extensive synthetic experiments and two real-world benchmarks demonstrate better discovery performance compared to relevant baselines, with ablation studies revealing its robustness under model misspecification.

1. INTRODUCTION

Time series data is a collection of data points recorded at different timestamps describing a pattern of chronological change. Identifying the causal relations between different variables and their interactions through time (Spirtes et al., 2000; Berzuini et al., 2012; Guo et al., 2020; Peters et al., 2017) is essential for many applications e.g. climate science, health care, etc. Randomized control trials are the gold standard for discovering such relationships, but may be unavailable due to cost and ethical constraints. Therefore, causal discovery with just observational data is important and fundamental to many real-world applications (Löwe et al., 2022; Bussmann et al., 2021; Moraffah et al., 2021; Wu et al., 2020; Runge, 2018; Tank et al., 2018; Hyvärinen et al., 2010; Pamfil et al., 2020) . The task of temporal causal discovery can be challenging for several reasons: (1) relations between variables can be non-linear in the real world; (2) with a slow sampling interval, everything happens in between will be aggregated into the same timestamp, i.e. instantaneous effect; (3) the noise may be non-stationary (its distribution depends on the past observations), i.e. history-dependent noise. For example, in stock markets, the announcements of some decisions from a leading company after the market closes may have complex effects (i.e. non-linearity) on its stock price immediately after the market opening (i.e. slow sampling interval and instantaneous effect) and its price volatility may also be changed (i.e. history-dependent noise). Similarly, in education, students that recently earned good marks on algebra tests should also score well on an upcoming algebra exam with little variation (i.e. history-dependent noise). To the best of our knowledge, existing frameworks' performances suffer in many real-world scenarios as they cannot address these aspects in a satisfactory way. Especially, history-dependent noise has been rarely considered in past. A large category of the preceding works, called Granger causality (Granger, 1969) , is based on the fact that cause-effect relationships can never go against time. Despite many recent advances (Wu et al., 2020; Shojaie & Michailidis, 2010; Siggiridou & Kugiumtzis, 2015; Amornbunchornvej et al., 2019; Löwe et al., 2022; Tank et al., 2018; Bussmann et al., 2021; Dang et al., 2018; Xu et al., 2019) , they all rely on the absence of instantaneous effects with a fixed noise distribution. Constraint-based methods have also been extended for time series causal discovery (Runge, 2018; 2020) , which is commonly applied by folding the time-series. This introduced new assumptions and translated the aforementioned requirements to challenges in conditional independence testing (Shah & Peters, 2020) .Additionally, they require a stronger faithfulness assumption and can only identify the causal graph up to a Markov equivalence class without detailed functional relationships. An alternative line of research leverages the development of causal discovery with functional causal models (Hyvärinen et al., 2010; Pamfil et al., 2020; Peters et al., 2013) . They can model both instantaneous and lagged effects as long as they have theoretically guaranteed structural identifiability. Unfortunately, they do not consider history-dependent noise. One central challenge of modelling this dependency is that noise depending on the lagged parents may break the model structural identifiability. For static data, Khemakhem et al. (2021) proves the structural identifiability only when this dependency is restricted to a simple functional form. Thus, the key research question is whether the identifiability can be preserved with complex historical dependencies in the temporal setting. Motivated by these requirements, we propose a novel temporal discovery framework called Rhino (deep causal temporal relationship learning with history dependent noise), which can model nonlinear lagged and instantaneous effects with flexible history-dependent noise. Our contributions are: • A novel causal discovery framework called Rhino, Revision(Q2)-Reviewer hiTaconsisting of a novel functional form of its SEMs and variational training framework, where the proposed form of its SEM combines vector auto-regression and deep learning to model non-linear lagged and instantaneous effects with history-dependent noise. • We prove that Rhino SEMs with the proposed form are structurally identifiable. To achieve this, we provide general conditions for structural identifiability with history-dependent noise, of which the form of Rhino SEMs is a special case. Furthermore, we clarify relations to several previous works. • We conduct extensive synthetic experiments with ablation studies to demonstrate the advantages of Rhino and its robustness across different settings. Additionally, we compare its performance to a wide range of baselines in two real-world discovery benchmarks.

2. BACKGROUND

In this section, we briefly introduce necessary prerequisite knowledge. In particular, we focus on structural equation models, Granger causality (Granger, 1969) and vector auto-regression. For review of more recent related work, please refer to Section 5. Structural Equation Models (SEMs) Consider X ∈ R D with D variables, SEM describes the causal relationships between them given a causal graph G: X i = f i (Pa i G , ϵ i ) (1) where Pa i G are the parents of node i and ϵ i are mutually independent noise variables. Under the context of multivariate time series, X t = X i t i∈V where V is a set of nodes with size D, the corresponding SEM given a temporal causal graph G is X i t = f i,t (Pa i G (< t), Pa i G (t), ϵ i t ), where Pa i G (< t) contains the parent values specified by G in previous time (lagged parents); Pa i G (t) are the parents at the current time t (instantaneous parents). The above SEM induces a joint distribution over the stationary time series {X t } T t=0 (see Assumption 1 in Appendix B for the definition). However, functional causal models with the above general form cannot be directly used for causal discovery due to the structural unidentifiability (Lemma 1, Zhang et al. (2015) One way to solve this is sacrificing the flexibility by restricting the functional class. For example, additive noise models (ANM), (Hoyer et al., 2008 ) X i = f i (Pa G (X i )) + ϵ i , which have recently been used for causal reasoning with non-temporal data (Geffner et al., 2022) . Granger Causality Granger causality (Granger, 1969) has been extensively used for temporal causal discovery. It is based on the idea that the series X j does not Granger cause X i if the history, X j <t , does not help the prediction of X i t for some t given the past of all other time series X k for k ̸ = j, i. Definition 2.1 (Granger Causality (Tank et al., 2018; Löwe et al., 2022) ). Given a multivariate stationary time series {X t } T t=0 and a SEM f i,t defined as X i t = f i,t (Pa i G (< t)) + ϵ i t , X j Granger causes X i if ∃l ∈ [1, t] such that X j t-l ∈ Pa i G (< t) and f i,t depends on X j t-l . Granger causality is equivalent to causal relations for directed acyclic graph (DAG) if there are no latent confounders and instantaneous effects (Peters et al., 2013; 2017) . Apart from the lack of instantaneous effects, it also ignore the history-dependent noise with independent ϵ i t . Vector Auto-regressive Model Another line of research focuses on directly fitting the identifiable SEM to the observational data with instantaneous effects. One commonly-used approach is called vector auto-regression (Hyvärinen et al., 2010; Pamfil et al., 2020) : X i t = β i + K τ =0 D j=1 B τ,ji X j t-τ + ϵ i t (5) where β i is the offset, τ is the model lag, B τ ∈ R D×D is the weighted adjacency matrix specifying the connections at time t -τ (i.e. if B τ,ji = 0 means no connection from X j t-τ to X i t ) and ϵ i t is the independent noise. Under these assumptions, the above linear SEM is structurally identifiable, which is a necessary condition for recovering the ground truth graph (Hyvärinen et al., 2010; Peters et al., 2013; Pamfil et al., 2020) . However, the above linear SEM with independent noise variables is too restrictive to fulfil the requirements described in Section 1. Therefore, the research question is how to design a structurally identifiable non-linear SEM with flexible history-dependent noise.

3. RHINO: RELATIONSHIP LEARNING WITH HISTORY DEPENDENT NOISE

This section introduces Rhino: Section 3.1 describes the novel functional form for Rhino SEMs, allowing for history-dependent noise. Section 3.2 details the associated variational inference framework for causal discovery.

3.1. MODEL FORMULATION

For a multivariate stationary time series {X t } T t=0 , we assume that their causal relations follow a temporal adjacency matrix G 0:K with maximum lag K where G τ ∈[1,K] specifies the lagged effects between X t-τ and X t , G 0 specifies the instantaneous parents. We define G τ,ij = 1 if X i t-τ → X j t and 0 otherwise. 1 We propose a novel functional form for Rhino's SEM that incorporates non-linear relations, instantaneous effects, and flexible history-dependent noise: X i t = f i (Pa i G (< t), Pa i G (t)) + g i (Pa i G (< t), ϵ i t ) ) where f i is a general differentiable non-linear function, and g i is a differentiable transform s.t. the transformed noise has a proper density. Despite of an additive structure, our formulation offers much more flexibility in both functional relations and noise distributions compared to previous works (Pamfil et al., 2020; Peters et al., 2013) . By placing few restrictions on f i , g i , it can capture functional non-linearity through f i and transform ϵ i t through a flexible function g i , depending on Pa i G (< t), to capture the history dependency of the additive noise. Next, we propose flexible functional designs for f i , g i , which must respect the relations encapsulated in G. Namely, if X j t-τ / ∈ Pa i G (< t) ∪ Pa i G (t) , then ∂f i /∂X j t-τ = 0 and similarly for g i . We design f i (Pa i G (< t), Pa i G (t)) = ζ i   K τ =0 D j=1 G τ,ji ℓ τ j X j t-τ   where ζ i and ℓ τ i (i ∈ [1, D] and τ ∈ [0, K]) are neural networks. For efficient computation, we use weight sharing across nodes and lags: ζ i (•) = ζ(•, u 0,i ) and ℓ τ j (•) = ℓ(•, u τ,j ), where u τ,i is the trainable embedding for node i at time t -τ . The design of g i needs to properly balance the flexibility and tractability of the transformed noise density. We choose a conditional normalizing flow, called conditional spline flow (Trippe & Turner, 2018; Durkan et al., 2019; Pawlowski et al., 2020) , with a fixed Gaussian noise ϵ i t for all t, i. The spline parameters are predicted using a hyper-network with a similar form to Eq. ( 7) to incorporate history dependency. The only difference is now τ is summed over [1, K] to remove the instantaneous parents. Due to the invertibility of g i , the noise likelihood conditioned on lagged parents is p gi (g i (ϵ i t )|Pa i G (< t)) = p ϵ (ϵ i t ) ∂g -1 i ∂ϵ i t .

3.2. VARIATIONAL INFERENCE FOR CAUSAL DISCOVERY

Rhino adopts a Bayesian view of causal discovery (Heckerman et al., 2006) , which aims to learn a graph posterior distribution instead of inferring a single graph. For N observed multivariate time series X (1) 0:T , . . . , X (N ) 0:T , the joint likelihood with model parameter θ is p(X (1) 0:T , . . . , X (N ) 0:T , G) = p(G) N n=1 p θ (X (n) 0:T |G). ( ) Graph Prior When designing the graph prior, we combine three components: (1) DAG constraint; (2) graph sparseness prior; (3) domain-specific prior knowledge (optional). Inspired by the NOTEARS (Zheng et al., 2018; Geffner et al., 2022; Morales-Alvarez et al., 2021) , we propose the following unnormalised prior p(G) ∝ exp -λ s ∥G 0:K ∥ 2 F -ρh 2 (G 0 ) -αh(G 0 ) -λ p ∥G 0:K -G p 0:K ∥ 2 F ( ) where h(G) = tr(e G⊙G ) -D is the DAG penalty proposed in (Zheng et al., 2018) and is 0 if and only if G is a DAG; ⊙ is the Hadamard product; G p is an optional domain-specific prior graph, which can be used when partial domain knowledge is available; λ s , λ p specify the strength of the graph sparseness and domain-specific prior terms respectively; α, ρ characterize the strength of the DAG penalty. Since the lagged connections specified in G 1:K can only follow the direction of time, only the instantaneous part, G 0 , can contain cycles. Thus, the DAG penalty is only applied to G 0 . Variational Objective Unfortunately, the exact graph posterior p(G|X (1) 0:T , . . . , X (N ) 0:T ) is intractable due to the large combinatorial space of DAGs. To overcome this challenge, we adopt variational inference (Blei et al., 2017; Zhang et al., 2018) , which uses a variational distribution q ϕ (G) to approximate the true posterior. We choose q ϕ (G) to be a product of independent Bernoulli distributions (refer to Appendix E for details). The corresponding evidence lower bound (ELBO) is log p θ X (1) 0:T , . . . , X (N ) 0:T ≥ E q ϕ (G) N n=1 log p θ (X (n) 0:T |G) + log p(G) + H(q ϕ (G)) ELBO(θ,ϕ) (11) where H(q ϕ (G)) is the entropy of q ϕ (G). From the causal Markov assumption and auto-regressive nature, we can further simplify log p θ (X (n) 0:T |G) = T t=0 D i=1 log p θ (X i,(n) t |Pa i G (< t), Pa i G (t)) and from Rhino's functional form (Eq. ( 6)) proposed in Section 3.1 log p θ (X i,(n) t |Pa i G (< t), Pa i G (t)) = log p gi z i,(n) t |Pa i G (< t) where z i,(n) t = X i,(n) t -f i (Pa i G (< t), Pa i G (t) ) and p gi is defined in Eq. (8) (Appendix A for details). The parameters θ, ϕ are learned by maximizing the ELBO, where the Gumbel-softmax gradient estimator is used (Jang et al., 2016; Maddison et al., 2016) . We also leverage augmented Lagrangian training (Hestenes, 1969; Andreani et al., 2008) , similar as Geffner et al. (2022) , to anneal α, ρ to make sure Rhino only produces DAGs (refer to Appendix B.1 in Geffner et al. (2022) ). Once Rhino is trained, the temporal causal graph can be inferred by G ∼ q ϕ (G). Treatment effect estimation As Rhino learns the causal graph and the functional relationship simultaneously, it can be extended for causal inference tasks such as treatment effect estimation (Geffner et al., 2022) . See Appendix D for details.

4. THEORETICAL CONSIDERATIONS

Here, we show the theoretical guarantees of Rhino including (1) the structural identifiability of Rhino SEMs and (2) soundness of the proposed variational inference framework. Together, they guarantee the validity of Rhino as a causal discovery method. In the end, we clarify relations to existing works.

4.1. STRUCTURAL IDENTIFIABILITY

One of the key challenges for causal discovery with a flexible functional relationship is to show the structural identifiability. Namely, we cannot find two different graphs that induce the same joint likelihood from the proposed functional causal model. In the following, we present a theorem for Rhino SEMs that summarizes our main theoretical contribution. Theorem 1 (Identifiability of Rhino SEMs). Assuming Rhino SEMs satisfy the causal Markov property, minimality, sufficiency, DAGness and the induced joint likelihood has a proper density (see Appendix B for details), and (1) all functions and induced distributions are third-order differentiable; (2) function f i is non-linear and not invertible w.r.t. any nodes in Pa i G (t); (3) the double derivative (log p gi (g i (ϵ i t )|Pa i G (< t))) ′′ w.r.t ϵ i t is zero at most at some discrete points, then the SEM with the form defined in Eq. ( 6) is structural identifiable for both bivariate and multivariate case. Sketch of proof. This theorem is a summary of a collection of theorems proved in Appendix B. The strategy is instead of directly proving the identifiability of the model, we provide identifiability conditions for a general temporal SEMs, followed by showing a generalization of Rhino SEMs satisfies these conditions. The identifiability of Rhino SEMs directly follows from it. Prove bivariate identifiability conditions for general temporal SEMs The first step is to prove the bivariate identifiability conditions that a general temporal SEMs (Eq. ( 2)) should satisfy (refer to Theorem 3 in Appendix B.1). Inspired by the techniques from Peters et al. (2013) , we proved that temporal SEMs are bivariate identifiable if (1) the model for initial conditions is identifiable; (2) the model is identifiable w.r.t. instantaneous parents. Compared to Peters et al. (2013) , we relaxed the identifiable model class condition so that it can be applied to history-dependent noise. In particular, (2) implies we only need to pay attention to instantaneous parents, rather than the entire parents (Peters et al., 2013) , and opens the door for flexible lagged dependency. Identifiability of history-dependent post non-linear model Next, we propose a novel generalization of Rhino SEMs, called history-dependent PNL. Theorem 4 and Corollary 4.1 in Appendix B.2 prove it is bivariate identifiable w.r.t. instantaneous parents (i.e. satisfy the conditions in Theorem 3) with additional assumptions (1), (2) and (3) in Theorem 1. The functional form of history-dependent PNL is defined as X i t = ν it f it Pa i G (< t), Pa i G (t) + g it Pa i G (< t), ϵ it , Pa i G (< t) , where ν is invertible w.r.t. the first argument. Due to its similarity to PNL (Zhang & Hyvarinen, 2012) and ANM (Hoyer et al., 2008) , we combine their techniques to prove our results. The bivariate identifiability of Rhino SEMs directly follows from this with ν being the identity mapping. Generalization to multivariate case In the end, we prove the above bivariate identifiability can be generalized to the multivariate case by adapting the techniques from Peters et al. (2012) and combining it with the proof strategy from step 1. Refer to Theorem 5 in Appendix B.3 for details.

4.2. VALIDITY OF VARIATIONAL OBJECTIVE AND RELATIONS TO OTHER METHODS

Next, we show the validity of the variational objective (Eq. ( 11)) in the sense that optimizing it can lead to the ground truth graph. Theorem 1 in Geffner et al. (2022) justifies the validity of the variational objective under the same set of assumptions as Rhino. Theorem 2 (Validity of variational objective (Geffner et al., 2022) ). Assuming the conditions in Theorem 1 are satisfied, and further assume no model misspecification (see Definition B.1), then the solution (θ ′ , q ′ ϕ (G)) from optimizing Eq. ( 11) with infinite data satisfies q ′ ϕ (G) = δ(G = G ′ ), where G ′ is a unique graph. In particular, G ′ = G * and p θ ′ (X 0:T ; G ′ ) = p(X 0:T ; G * ), where G * is the ground truth graph and p(X 0:T ; G * ) is the true data generating distribution. We emphasize that Theorem 2 guarantees the correctness with the global optimum, rather than characterizing the convergence during optimizing Eq. ( 11). We assume the Rhino only searches in the DAG space. In practice, with augmented Lagrangian (Wei et al., 2020; Ng et al., 2022) , we can only obtain an approximate posterior over DAGs, converging to a local optimum. Relation to other methods Many previous works of using functional causal model for causal time series discovery (Hyvärinen et al., 2010; Pamfil et al., 2020; Tank et al., 2018; Peters et al., 2013) are closely related to Rhino. Since Rhino SEMs incorporate history-dependent noise with flexible function non-linearity, it is the most flexible member of this family. Refer to Appendix C for details.

5. RELATED WORK

Assaad et al. ( 2022) provides a comprehensive overview of causal discovery method for time series. The first is Granger causality, which can be further split into (1) vector auto-regression (Wu et al., 2020; Shojaie & Michailidis, 2010; Siggiridou & Kugiumtzis, 2015; Amornbunchornvej et al., 2019) and (2) deep learning (Löwe et al., 2022; Tank et al., 2018; Bussmann et al., 2021; Dang et al., 2018; Xu et al., 2019) . Granger causality methods cannot handle instantaneous effects, which can be observed in a slow-sampling system. Additionally, they also assume a fixed noise distribution. Using funcitonal causal models can mitigate the aforementioned two problems. VARLiNGaM (Hyvärinen et al., 2010) extends the identifiability theory of linear non-Gaussian ANM (Shimizu et al., 2006) to vector auto-regression for modelling time series data. DYNOTEARS (Pamfil et al., 2020) leverages the recently proposed NOTEARS framework (Zheng et al., 2018) to continuously relax the DAG constraints for fully differentiable DAG structure learning. However, the above approach is still limited to linear functional forms. TiMINo (Peters et al., 2013) provides a general theoretical framework for temporal causal discovery with functional causal models. Our theory leverages some of their proof techniques. Unfortunately, all the aforementioned methods assume no history dependency for the noise. On the other hand, Rhino can model (1) non-linear function relations; (2) instantaneous effect; (3) and history-dependent noise at the same time. The third category is constraint-based approaches. Due to its non-parametric nature, it can handle history-dependent noise. PCMCI (Runge et al., 2019) combines PC (Spirtes et al., 2000) and conditional independence test for discovery from time series. PCMCI + (Runge, 2018; 2020) further extends PCMCI to infer both lagged and instantaneous effects. CD-NOD (Huang et al., 2020) has recently been proposed to handle non-stationary heterogeneous data, where the data distribution can shift across time. Despite their generality, they can only infer MECs; cannot learn the explicit functional forms between variables; and require a stronger assumption than minimality (i.e. faithfulness).

6. EXPERIMENTS

We release the code of Rhino for reproducing the following experiments.foot_1 .

6.1. SYNTHETIC DATA

We evaluate our method on a large set of synthetically generated datasets with known causal graphs. We use the main body of this paper to present the overall performance of our method compared to rel-evant baselines and one ablation study on the robustness to lag mismatch. In Appendix F.3, we conduct extensive analysis, including (1) on different graph type; (2) ablation on history-dependency; (3) ablation study on instantaneous effect. This set of datasets are generated by various settings (e.g. type of graphs, instantaneous/no instantaneous effect, etc.). 5 datasets are generated for each combination of settings with different seeds, yielding 160 datasets in total. In order to comprehensively test Rhino's robustness, we deliberately generated 75% of the datasets that mismatch the Rhino configurations. Details of the data generation can be found in Appendix F.1. We compare Rhino to a wide range of baselines, including VARLiNGaM (Hyvärinen et al., 2010) , PCMCI + (Runge, 2020) and DYNOTEARS (Pamfil et al., 2020) . PCMCI only outputs Markov equivalence classes (MECs). We resolve this by enumerating all DAGs in the MEC. For details on the methods, see Appendix F.2. Additionally, we include two variants of Rhino: (1) Rhino+g, where an independent Gaussian noise is used; (2) Rhino+s, where Gaussian ϵ i is transformed by an independent spline. Figure 1 presents the F 1 score of the lagged, instantaneous and temporal adjacency matrix of all methods aggregated over all datasetsfoot_2 , denoted as 'Lag', 'Inst.' and 'Temporal', respectively. Rhino achieves overall competitive or the best performance in terms of the full temporal adjacency matrix across all possible datasets, especially for lower dimensions. Comparing Rhino's lagged discovery to its two variants, the better score indicates the history-dependent noise is useful to the lagged graph discovery, contributing to the better overall F 1 performance (Appendix F.3 for ablation). Despite of the strong performance from PCMCI + , it can only identify the graph up to MECs without explicit functional relations. Computationally, PCMCI + exceeds the maximum training time of 1 week on 40 nodes (see Appendix F.3), suggesting its computation bottleneck in high dimensions. DYNOTEARS achieves inferior results in general due to limited modelling power from the linear nature. This is much clearer in high dimensions due to the increasing difficulty of the problem. We explore the behaviour of Rhino with different lag parameters other than the ground truth lag 2. From Table 1 , worse training log-likelihoods suggest that Rhino with insufficient history (lag = 1) is unable to correctly model the data and this leads to a decrease in F 1 scores. Interestingly, Rhino is also robust with longer lags. Despite of the slightly better likelihood (lag = 3), it achieves comparable performance to the model with the correct lag. Also, from their similar F 1 Lag score, it suggests the extra adjacency matrix is nearly empty.

6.2. DREAM3 GENE NETWORK

In this section, we evaluate Rhino performance with a real-world biology benchmark called DREAM3 (Prill et al., 2010; Marbach et al., 2009) . These datasets are often used to evaluate Granger causality (Khanna & Tan, 2019; Tank et al., 2018; Nauta et al., 2019; Bussmann et al., 2021) recently adopted for SEM-based method (Pamfil et al., 2020) . The dataset consists in silico measurements of gene expression levels for 5 different networks. Each network contains d = 100 genes. Each time series represents a perturbation trajectory with time length T = 21. For each network, 46 perturbation trajectories are recorded. The goal is to infer the causal structure of each network. We use the area under the ROC curve (AUROC) as the performance metric. We consider the same baselines as in the synthetic experiments (i.e. DYNOTEARS and PCMCI + ) without VARLiNGaM since its default implementation fails when the number of variables (d = 100) is greater than the series length (T = 21). Additionally, we also consider relevant Granger causality methods, including cMLP, cLSTM (Tank et al., 2018) By inspecting the hyperparameters of Rhino in Appendix G.1, instantaneous effects seem to provide no obvious help for in these datasets. It suggests the recording intervals are fast enough to avoid any aggregation effect. This explains why the Granger causality can also perform reasonably well. Unlike the strong performances of DYNOTEARS and PCMCI + in synthetic experiments, they perform poorly in DREAM3. The linear nature of DYNOTEARS seems to harm its performance drastically. PCMCI + suffers from the low independence test power under small training data. Another interesting ablation is to compare with Rhino+g, which performs on par with Rhino and achieves better scores on 2 out of 5 datasets. Although we have no access to the true noise mechanism, we suspect that the added noise is not history-dependent and highly likely to be Gaussian. Despite the model mismatch, Rhino is still one of the best methods for this problem. This further strengthens our belief in the robustness of our model under different setups.

6.3. NETSIM BRAIN CONNECTIVITY

In this section, we evaluate Rhino using fMRI imaging data, which has also been used as a benchmark for temporal causal discovery (Löwe et al., 2022; Khanna & Tan, 2019; Assaad et al., 2022) Table 3 shows the AUROCs for different methods. Remarkably, the proposed Rhino and its variants achieve significantly better AUROC compared to the baselines. Especially, Rhino obtains nearly optimal AUROC, demonstrating its robustness to the small dataset and good balances between true and false positive rates (see Appendix H). By comparing Rhino and Rhino+NoInst., we conclude that modelling instantaneous effects is important in real application, indicating the sampling interval is not frequent enough to explain everything as lagged effects. This can be double confirmed by comparing Rhino+NoInst with Granger causality, where it performs on par with the state-ofthe-art baseline when disabling the instantaneous effect. Last but not least, by comparing Rhino+g with Rhino, we find that historydependent noise is also helpful in this dataset.

7. CONCLUSION

Inferring temporal causal graphs from observational time series is an important task in many scientific fields. Especially, some applications (e.g. education, climate science, etc.) require the modelling of non-linear relationships; instantaneous effects and history-dependent noise distributions at the same time. Previous works fail to offer an appropriate solution for all three requirements. Motivated by this, we propose Rhino, which combines vector auto-regression with deep learning and variational inference to perform causal temporal relationship learning with all three requirements. Theoretically, we prove the structural identifiability of Rhino with flexible history-dependent noise, and clarify its relations to existing works. Empirical evaluations demonstrate its superior performance and robustness when Rhino is misspecified, and the advantages of history-dependent noise mechanisms. This opens an exciting route of extending Rhino to handle non-stationary time-series and unobserved confounders in future work.

8. REPRODUCIBILITY STATEMENT

Theoretical Contributions The main theoretical contribution is summarized in Theorem 1. This theorem is the result from a collection of theorems proved in Appendix B. In Appendix B, we detailed the fundamental assumptions (Assumption 1-Assumption 5) required for the all theorems. The theorem-specific assumptions are mentioned in the statement of the theorem. To ease the understanding of the proof, we also provide the skecth of proof in Theorem 1. Since Theorem 2 is directly cited from (Geffner et al., 2022) without major modification, the proof can be found in Appendix A in Geffner et al. (2022) . Empirical Evaluations For synthetic, DREAM3 and Netsim experiments, we listed the hyperparameters in Appendix F.2, Appendix G.1 and Appendix G.4, respectively. Appendix F.1 explains the synthetic data generation. For DREAM3 and Netsim, the dataset can be found in the public github repo https://github.com/sakhanna/SRU_for_GCI/tree/master/data. The post processing steps for DREAM3 and Netsim evaluations are described in Appendix G.2.

A ELBO AND LIKELIHOOD DERIVATION

The goal is to derive a lower bound for the joint likelihood p θ (X (1) 0:T , . . . , X (N ) 0:T ). p θ (X (1) 0:T , . . . , X (N ) 0:T ) = log p θ X (1) 0:T , . . . , X (N ) 0:T |G p(G)dG = log q ϕ (G) q ϕ (G) p θ X (1) 0:T , . . . , X (N ) 0:T |G p(G)dG ≥ q ϕ (G) log p θ X (1) 0:T , . . . , X (N ) 0:T |G p(G)dG + H(q ϕ (G)) (14) =E q ϕ (G) N n=1 log p θ (X (n) 0:T |G) + log p(G) + H(q ϕ (G)) where Eq. ( 14) is obtained by using Jensen's inequality. We can further simplify the likelihood p θ (X (n) 0:T |G): log p θ (X (n) 0:T |G) = log T t=0 p θ (X (n) t |X (n) <t , G) = T t=0 log p θ X (n) t |X (n) <t , G = T t=0 D i=1 log p θ X i,(n) t |Pa i G (< t), Pa i G (t) where Eq. ( 15) is obtained through Markov factorization (Lauritzen, 1996) .

B STRUCTURAL IDENTIFIABILITY

In this section, we will focus on proving the structural identifiability of Rhino SEMs. Before diving into the details, let us clarify the required assumptions. Definition B.1 (Correctly specified model). For a true data generating mechanism X i = f * i (Pa i , ϵ i ) for i = 1, . . . , D, we say a model with functional space F is correctly specified if there exists a function f i ∈ F s.t. f i = f * i for all i = 1, . . . , D. Here, we emphasize that the above definition of model specification does not require identifiability in parameters space, but in function space instead. Namely, it allows multiple sets of parameters that correspond to the same function. Our definition is more general than some of the previous work, which enforces parameter identifiability (e.g. Def 2.1 in Ma & Zhang (2021) ) Assumption 1 (Causal Stationarity (Runge, 2018) ). The time series process X t with a graph G is called causally stationary over a time index set T if and only if for all links X i t-τ → X j t in the graph X i t-τ ̸ ⊥ ⊥ X j t |X <t \{X i t-τ } holds for all t ∈ T This characterizes the nature of the time-series data generating mechanism, which validates the choice of the auto-regressive model. Assumption 2 (Causal Markov Property (Peters et al., 2017) ). Given a DAG G and a joint distribution p, this distribution is said to satisfy causal Markov property w.r.t. the DAG G if each variable is independent of its non-descendants given its parents. This is a common assumptions for the distribution induced by an SEM. With this assumption, one can deduce conditional independence between variables from the graph. Minimality is also a common assumption for SEMs (Hoyer et al., 2008; Zhang & Hyvarinen, 2012; Peters et al., 2012) , which can be regarded as a weaker version of faithfulness (Peters et al., 2017) . Assumption 4 (Causal Sufficiency). A set of observed variables V is causally sufficient for a process X t if and only if in the process every common cause of any two or more variables in V is in V or has the same value for all units in the population. This assumption implies there are no latent confounders present in the time-series data. Assumption 5 (Well-defined Density). We assume the joint likelihood induced by the Rhino SEM (Eq. ( 6)) is absolutely continuous w.r.t. a Lebesgue or counting measure and | log p(X 0:T ; G)| < ∞ for all possible G. This assumption is to make sure the induced distribution has a well-defined probability density function. It is also required for the equivalence of the global, local Markov property and Markov factorization property (Theorem 6.22 from Peters et al. ( 2017)). Assumption 6 (Rhino in DAG space). We assume the Rhino framework can only return the solutions from DAG space. Namely, the posterior distribution from Rhino can only put weights on DAGs. This assumption regularizes the search space of Rhino to be DAG space, which aligns with assumption 2 and 3 on causal graphs. In the following, we will structure the entire proof into three steps: 1. Prove a general conditions that the bivariate time series model needs to satisfy for structural identifiability. This adapts from the theorem 1 in Peters et al. (2013) . 2. Prove that a generalized form of SEM, modified from the post non-linear (PNL) model (Zhang & Hyvarinen, 2012) , satisfies the conditions mentioned in step 1. The proposed Rhino (Eq. ( 6)) is a special case of the above SEM. 3. In the end, we generalize the above indentifiability to the multivariate case.

B.1 GENERAL IDENTIFIABILITY CONDITIONS

First, we derive the conditions required for identifiability for a general bivariate time series SEM, defined as X i t = f i,t Pa i G (< t), Pa i G (t), ϵ i t . We call the above SEM transition model, since it only defines the transition behavior rather than the initial conditions. We also need to incorporate a source model, which characterizes the initial conditions: X i s = f i,s (Pa i G , ϵ i s ) for s ∈ [0, S], where S is the length for the initial conditions and Pa i G contains the parents for node i. We define p s (X 0:S ) as the induced joint distribution for the initial conditions. Now, we prove the following theorem. Theorem 3 (Identifiability conditions for bivariate time series). Assuming Assumption 1-5 are satisfied, given a bivariate temporal process X 0:T and Y 0:T that are governed by the above SEM (Eq. ( 16)) with source model (Eq. ( 17)), then the above SEM for the bivariate temporal process is structural identifiable if the following conditions are true: 1. Source model f i,s is structural identifiable for all i = 1, . . . , D and s ∈ [0, S]. 2. The transition model (Eq. ( 16)) is bivariate identifiable w.r.t the instantaneous parents. Namely, if graph G induced conditional distributions p(X t , Y t |Pa X,Y G (< t)), then ∄G ′ such that G ̸ = G ′ and the induced conditional p(X t , Y t |Pa X,Y G ′ (< t)) = p for all t ∈ [S + 1, T ]. where Pa X,Y G (< t) is the union of the lagged parents of X t and Y t under G, and Pa X,Y G ′ (< t) is the union of parents under G ′ . Proof. We prove this by contradiction. Assume we have an induced joint distribution p(X 0:T , Y 0:T ) under G, and corresponding p under G ′ . We further assume the above two conditions in the theorem are met and p = p but G ̸ = G ′ . Thus, we have D KL [p∥p] = 0. Due to the temporal nature of the model, we can further decompose it as the following: D KL [p∥p] = p(X 0:T , Y 0:T ) log p(X 0:T , Y 0:T ) p(X 0:T , Y 0:T ) dX 0:T dY 0:T =D KL [p(X 0:S , Y 0:S ) ps ∥ p(X 0:S , Y 0:S ) ps ] + p(X 0:S , Y 0:S )D KL [p(X S+1:T , Y S+1:T |X 0:S , Y 0:S )∥ p(X S+1:T , Y S+1:T |X 0:S , Y 0:S )]dX 0:S dY 0:S =D KL [p s ∥p s ] + T t=S+1 E p(X0:t-1,Y0:t-1) [D KL [p(X t , Y t |X 0:t-1 , Y 0:t-1 )∥p(X t , Y t |X 0:t-1 , Y 0:t-1 )]] =0. This means we have D KL [p s ∥p s ] = 0 and D KL [p(X t , Y t |X 0:t-1 , Y 0:t-1 )∥p(X t , Y t |X 0:t-1 , Y 0:t-1 )] = 0 almost everywhere. Inspired by the strategy used in (Peters et al., 2013) , We consider the following three scenarios: Disagree on initial conditions We assume G and G ′ disagree on the initial conditions. From the condition 1, we know the source model f i,s is identifiable. Namely, we cannot find G ̸ = G ′ with disagreement on initial conditions such that D KL [p s ∥p s ] = 0. This is a contradiction, meaning that G and G ′ must agree on the connections between initial set of nodes.

Disagree on lagged parents only

This means for all t ∈ [S + 1, T ], the instantaneous connections at t for G and G ′ are the same, and ∃t ∈ [S + 1, T ] such that Pa X,Y G (< t) ̸ = Pa X,Y G ′ (< t). We can use a similar argument as the theorem 1 in Peters et al. (2013) . W.l.o.g., we assume under G, we have X t-τ → Y t and there is no connections between them under G ′ . Thus, from Markov conditions, we have Y t ⊥ ⊥ X t-τ |X 0:t-1 ∪ Y 0:t-1 ∪ ND Y t \{Y t , X t-τ } under G ′ , where ND Y t are the non-descendants of node Y t at some time t. However, from the causal minimality and proposition 6.16 in Peters et al. (2017) , we have Y t ̸ ⊥ ⊥ X t-τ |X 0:t-1 ∪ Y 0:t-1 ∪ ND Y t \{Y t , X t-τ } under G. This means under this case, D KL [p(X t , Y t |X 0:t-1 , Y 0:t-1 )∥p(X t , Y t |X 0:t-1 , Y 0:t-1 )] ̸ = 0, which is a contradiction. Disagree also on instantaneous parents This scenarior means ∃t ∈ [S + 1, T ] such that they disagree on instantaneous parents. W.l.o.g. we assume X t → Y t under G and Y t → X t under G ′ . Let's define X 0:t-1 ∪ Y 0:t-1 = h, h Y G ⊆ h contains the values of Pa Y G (< t) under G, hY G ′ ⊆ h contains the parent values under G ′ , and h X G , hX G ′ accordingly. Thus, the induced conditional distributions from SEM (Eq. ( 16)) with G, G ′ are p(X t , Y t |h X G ∪ h Y G ) and p(X t , Y t | hX G ′ ∪ hY G ′ ) From the Markov conditions, we have p(X t , Y t |X 0:t-1 , Y 0:t-1 ) = p(X t , Y t |Pa X,Y G (< t)) Therefore, we have D KL [p(X t , Y t |h)∥p(X t , Y t |h)] =0 =D KL [p(X t , Y t |h X G ∪ h Y G )∥p(X t , Y t | hX G ′ ∪ hY G ′ )] for arbitrary h, which contradicts the strucutral identifiability w.r.t. the instantaneous parents. In summary, with the two conditions, we cannot find G ̸ = G ′ such that the induced joint p(X 0:T , Y 0:T ) = p(X 0:T , Y 0:T ), meaning that the SEMs defined as Eq. ( 16) and Eq. ( 17) are identifiable w.r.t. bivariate time series. Since one can use any identifiable static models to characterize the initial behavior of the time series, we will focus on condition 2 for the transition model. In the following, we will show that a generalization of PNL, called history-dependent PNL, satisfies condition 2 under assumptions.

B.2 IDENTIFIABILITY OF HISTORY-DEPENDENT PNL

First, we propose a generalization of PNL (Zhang & Hyvarinen, 2012 ) so that it can be historydependent. For a multivariate temporal process X 0:T , we propose history-dependent PNL as X i t = ν it f it Pa i G (< t), Pa i G (t) + g it Pa i G (< t), ϵ it , Pa i G (< t) where ν it is an invertible transformation w.r.t. the first argument. The main differences of the above SEM compared to typical PNL are (1) the invertible transformation ν it can be history dependent; (2) the inner noise distribution can also be history-dependent. Next, we show the main theorem about its bivariate identifiability w.r.t. its instantaneous parents. Theorem 4 (History-dependent PNL Bivariate Identifiability). Assume Assumption 1-5 are satisfied, all transformations in Eq. ( 18) and corresponding induced distributions are 3 rd -order differentiable. Given a bivariate temporal process X 0:T , Y 0:T , then the history-dependent PNL defined as Eq. ( 18) is bivariate identifiable w.r.t its instantaneous parents (i.e. satisfy condition 2 in Theorem 3), except for some special cases. Proof. W.l.o.g. at time t ∈ [S + 1, T ], we assume X t → Y t for instantaneous connection under G and Y t → X t under G ′ . We fix a value h for their entire history X 0:t-1 ∪ Y 0:t-1 = h. With h, we further define their lagged parents as Pa X G (< t) = h X G ⊆ h, Pa Y G (< t) = h Y G ⊆ h under G and Pa X G ′ (< t) = hX G ′ ⊆ h, Pa Y G ′ (< t) = hY G ′ under G ′ . Therefore, the SEM at time t can be written as Y t = ν f (h Y G , X t ) + g(h Y G , ϵ Y ), h Y G ( ) and X t = ν f ( hX G ′ , Y t ) + ḡ( hX G ′ , ϵ X ), hX G ′ (20) under G and G ′ , respectively. Let's assume that their induced conditional distributions at time t are equal (i.e. violating the identifiable condition (2) in Theorem 3): log p(X t , Y t |h X G ∪ h Y G ) under G = log p(X t , Y t | hX G ′ ∪ hY G ′ ) under G ′ From the Markov properties, the above equation is equivalent to log p(X t , Y t |h) = log p(X t , Y t |h) with a fixed value h of the entire history. Now, let's define α t = ν-1 (X t ) and β t = ν -1 (Y t ) where we omits the dependence of ν-1 to hX G ′ and ν -1 to h Y G . It is easy to observe that we have an invertible mapping between (X t , Y t ) and (α t , β t ). Thus, from the change of variable formula, we have log p(X t , Y t |h) = log p α,β (α t , β t |h) + log |J | and log p(X t , Y t |h) = log pα,β (α t , β t |h) + log |J | where J is the Jacobian matrix of the transformation. Thus, the equivalence of log p and log p in the (X t , Y t ) space can be translated to (α t , β t ) space. Thus, from Eq. ( 19), we have β t = Φ(α t ) + g(h Y G , ϵ Y ) ) under G. And from Eq. ( 20), we have α t = Ψ(β t ) + ḡ( hX G ′ , ϵ X ) (22) under G ′ . This forms an additive noise model between α t , β t with history-dependent noise. Next, we can use a similar proof techniques as in Hoyer et al. (2008) . Here, Φ(•) = f (h Y G , •) • ν( hX G ′ , •) and Ψ(•) = f ( hX G ′ , •) • ν(h Y G , •). We further define η 1 (α t ) = log p(α t |h) η1 (β t ) = log p(β t |h) η 2 (g(h Y G , ϵ Y )) = log p g (g(h Y G , ϵ Y )|h) η2 (ḡ( hX G ′ , ϵ X )) = log pg (ḡ( hX G ′ , ϵ X )|h ) Thus, under G (i.e. Eq. ( 21)), we have log p(α t , β t |h) = log p(β t |α t , h) + log p(α t |h) =η 2 (β t -Φ(α t )) + η 1 (α t ) Similarly, under G ′ (i.e. Eq. ( 22)), we have log p(α t , β t ) = η2 (α t -Ψ(β t )) + η1 (β t ) Based on Eq. ( 24 Due to the equivalence of log p and log p, we apply the above operations to Eq. ( 23). After some algebraic manipulation, we obtained the following differential equations for η ′′ 2 Φ ′ ̸ = 0: η ′′′ 1 - η ′′ 1 Φ ′′ Φ ′ = η ′ 2 η ′′′ 2 η ′′ 2 -2η ′′ 2 Φ ′′ Φ ′ - η ′′′ 2 η ′′ 2 Φ ′ η ′′ 1 + η ′ 2 Φ ′′′ - (Φ ′′ ) 2 Φ ′ . ( ) Interestingly, this is exactly equivalent to Eq.( 4) in Zhang & Hyvarinen (2012) . The main difference is the definition of variables and transformations in here are all history-dependent. Further, we can also observe that β t ⊥ ⊥ ḡ(h Y G , ϵ Y )|X 0:t-1 ∪ Y 0:t-1 = h. Since β t = Φ(α t ) + g(h Y G , ϵ Y ) and ḡ( hX G ′ , ϵ X ) = α t -Ψ(β t ) , it is trivial to show the determinant of the Jacobian of the transformation (α t , g) to (β t , ḡ) is 1. Thus, by a similar argument in theorem 1 from Zhang & Hyvarinen (2012) , we can derive 1 Ψ ′ = η ′′ 1 + η ′′ 2 (Φ ′ ) 2 -η ′ 2 Φ ′′ η ′′ 2 Φ ′ for η ′′ 2 Φ ′ ̸ = 0. Thus, the above two differential equations has the same form as theorem 1 in Zhang & Hyvarinen (2012) where the main difference is that all distributions and transformations involved in our case depend on history h. Therefore, we can directly cite the theorem 8 from Zhang & Hyvarinen (2012) , which proves that the above differential equations hold true only for 5 types of special cases. One can refer to Table 1 in Zhang & Hyvarinen (2012) for details. Corollary 10 from Zhang & Hyvarinen (2012) validates the choice of using nueral network for the transformation f . For completeness, we include it here with slight modification: Corollary 4.1 (Identifiability with neural netowrk f ). Assuming the assumptions in Theorem 4 are true, and the double derivative (log p g (g(Pa Y G (< t), ϵ Y )|X 0:t-1 ∪ Y 0:t-1 )) ′′ w.r.t ϵ Y is zero at most at some discrete points. If function f is not invertible w.r.t. the instantaneous parents, then, the history-dependent PNL defined as Eq. ( 18) is bivariate identifiable w.r.t. the instantaneous parents (i.e. satisfy condition 2 in Theorem 3). It is clear to see that the form of Rhino SEMs (Eq. ( 6)) is a special case of the history-dependent PNL (Eq. ( 18)), where the outer history-dependent invertible transformation ν is the identity mapping. Thus, we can directly leverage Theorem 3 together with Theorem 4 to show Rhino SEMs are identifiable w.r.t bivariate time series, and Corollary 4.1 to validate our design choice (Eq. ( 7)).

B.3 GENERALIZING TO MULTIVARIATE TIME SERIES

Previously, we prove the identifiability conditions for bivariate time series. In this section, we will generalize it to the multivariate case. Theorem 5 (Generalization to multivariate time series). Assuming the assumptions in Theorem 4 are satisfied, we further assume that the multivariate SEM defined in Eq. ( 18) satisfies: for each pair of node i, j ∈ V , the SEM X i t = ν it   fit   Pa i G (< t), Pa i G (t)\{X j t }, • X j t    + g it Pa i G (< t), ϵ it , Pa i G (< t)    is bivariate identifiable w.r.t. the input, and an identifiable source model is adopted. Then, the history-dependent PNL is identifiable except for some special cases. Proof. For this proof, we can follow the strategy used in Theorem 3 and Peters et al. (2013) . We categorize the difference of the graph G and G ′ into three types. Following the same analysis of the KL divergence of the two induced joint distributions, we can see that (1) D KL [p s ∥p s ] = 0 and D KL [p(X t |X 0:t-1 )∥p(X t |X 0:t-1 )] = 0. Disagree on initial conditions Since we assume that the source model is identifiable, this contradicts D KL [p s ∥p s ] = 0. Disagree on lagged parents only We notice that the analysis used in Theorem 3 for this disagreement can be directly translated to multivariate case. The only difference is that the notation Y t , X t is changed accordingly. Disagree also on instantaneous parents For this case, with a fixed history value h = X 0:t-1 , the aim is to compare the conditionals D KL [p(X t |X 0:t-1 = h)∥p(X t |X 0:t-1 = h)]. Thus, the problem becomes to how to generalize the bivariate identifiability for instantaneous parents to the multivariate case. We leverage the theorem 2 from Peters et al. (2012) , which proves the multivariate identifiability for any models that belongs to IFMOC. It is easy to see that if the assumptions in Theorem 5 are met, the history-dependent PNL belongs to IFMOC w.r.t. the instantaneous parents. It should be noted that the entire history-dependent PNL DOES NOT belong to IFMOC, but this does not affect our results since we only care about the instantaneous parents under this case.

C RELATION TO OTHER METHODS

VARLiNGaM (Hyvärinen et al., 2010) VARLiNGaM (Hyvärinen et al., 2010 ) is a causal discovery method for time series data based on the linear vector auto-regression, which can model both lagged and instantaneous effects. Its SEM is defined as Eq. ( 5), where the noise ϵ i t is an independent non-Gaussian noise. It is easy to observe that this is a special case of Rhino (Eq. ( 6)) by setting f i as the matrix multiplication of the weighted adjacency G 0:K with the nodes, and g i as the identity mapping. For the training objective, VARLiNGaM adopted a two stage training to sidestep the difficulty of directly optimizing the log likelihood. From the Theorem 2 for Rhino, we note that the solution from optimizing the variational objective is equivalent to maximizing the log likelihood under infinite data limit. Therefore, by setting large enough DAGness penalty coefficient α, ρ, the inferred graph from both methods should be equivalent. DYNOTEARS (Pamfil et al., 2020) The formulation of DYNOTEARS is the same as VAR-LiNGaM, which is based on linear vector auto-regression. The main novelty is the usage of the DAGness penalty h(G), which continuously relaxes the DAG constraint. The training objective is the mean square error with augmented Lagrange scheme for DAGness penalty. Thus, it is obvious that DYNOTEARS is a special case of Rhino with linear transformations and identity g i . Similarly, Theorem 2 shows the connections between the variational objective and maximum likelihood, which is equivalent to mean square error if the noise distribution is Gaussian with equal variances. cMLP cMLP (Tank et al., 2018) combines Granger causality with deep neural networks. The model formulation is X i t = f i (X 1 0:t-1 , . . . , X D 0:t-1 ) + ϵ i t where f i is a function based on MLP. Although the input is the entire history, the one that matters is the node that has the connection to X i t (i.e. lagged parents). Therefore, it is easy to see they are closely related to Rhino without instantaneous parents Pa i G (t) and history-dependent noise. Since the training objective of cMLP is based on the mean square error with sparseness constraint, by the same argument as before, the variational objective is equivalent to mean square error with equal variance Gaussian noise and large training data. TiMINo (Peters et al., 2013) TiMINo is most similar to our work among all the aforementioned methods in terms of model formulation. TiMINo proposed a very general formulation based on IF-MOC (Peters et al., 2012) and showed the conditions for structural identifiability. Rhino generalizes the TiMINO in a way such that noise history dependency can be incorporated. Thus, Rhino only belongs to IFMOC w.r.t. the instantaneous parents. Therefore, Rhino without the history-dependent noise is a TiMINo model. The training objective of TiMINo is based on the dependence minimization between the noise residuals and causes, and can only infer summary graph instead of temporal causal graph. Zhang et al. (2015) proved the equivalence of the mutual information minimization to maximum likelihood, which is equivalent to our variational objective under infinite data.

D TREATMENT EFFECT ESTIMATION

We now show how to leverage the fitted Rhino for estimating the conditional average treatment effect (CATE). For simplicity, we only consider a special case of CATE defined as CATE(a, b) = E q ϕ (G) E p(X Y t+τ |X<t,do(X I t =a),G) [X Y t+τ ] -E p(X Y t+τ |X<t,do(X I t =b),G) [X Y t+τ ] We assume the conditioning variable can only be X <t (i.e. the entire history before t), and the intervention and target variable can only be either at current time t or sometime in the future t+τ . We emphasize that this formulation is for simplicity, and Rhino can be easily generalized to more cases as Geffner et al. (2022) . Once fitted, the idea is to draw target samples X Y t+τ from the interventional distribution p(X Y t+τ |X <t , do(X I t ), G) for each graph sample G ∼ q ϕ (G). Then, unbiased Monte Carlo estimation can be used to compute CATE. For sampling from the interventional distribution, we can use the "multilated" graph G do(X I t ) to replace G, where all incoming edges to X I t are removed. The intervention samples can be obtained by simulating the Rhino with history X <t , X I t = a or b and G do(X I t ) .

D.1 CAUSAL INFERENCE RESULTS

Here, we provide the preliminary results for CATE performance of Rhino by calculating the RM-SEs of the estimated CATEs comparing to the true CATE from the interventional samples (lower is better). We present boxplots of the performance in Fig. 2 . All Rhino-based method perform similarly. Surprisingly, the CATE performance seems to have little correlation to the causal discovery performance and warrants further study in the future. 

E VARIATIONAL DISTRIBUTION FORMULATION

Here we provide the detailed formulation of the independent Bernoulli distribution q ϕ (G). Since this distribution is responsible for modelling the temporal adjacency matrix G 0:K , we use Σ k to represents the edge probability in G k . We further split the edge probability matrices into the instantaneous part Σ 0 and lagged parts Σ 1:K . To avoid the constrained optimization of Σ 1:K (i.e. the value needs to be within [0, 1]), we adopt the following formulation: σ k,ij = exp(u k,ij ) exp(u k,ij ) + exp(v k,ij ) where u k,ij ∈ U k , v k,ij ∈ V k and U k , V k ∈ R D×D for all k = 1, . . . , K. Since we do not require lagged adjacency matrix to be a DAG, U k , V k has no constraints during optimization. On the other hand, G 0 needs to be a DAG for instantaneous effect. By smart formulation, we can get rid of the length-1 cycles. The intuition is that for a pair of node i, j, only three mutually exclusive possibilities can exist: (1) i → j; (2) j → i; (3) no edge between them. Thus, instead of using a full probability matrix Σ 0 , we use three lower triangular matrices U 0 , V 0 and E 0 to characterise the above three scenarios. For node i > j, p(i → j) = exp(u ij ) exp(u ij ) + exp(v ij ) + exp(e ij ) p(j → i) = exp(v ij ) exp(u ij ) + exp(v ij ) + exp(e ij ) p(no edge) = exp(e ij ) exp(u ij ) + exp(v ij ) + exp(e ij ) . Thus, by this formulation, the corresponding instantaneous adjacency matrix will not contain length-1 cycles.

F SYNTHETIC EXPERIMENTS

F.1 DATA GENERATION We create the synthetic datasets in a four step process: 1) generate random Erdös-Rényi (ER) or scale-free (SF) graphs that specify the lagged and instantaneous causal relationships; 2) drawing random MLPs for the functional relationships as well as a random conditional spline transformation to modulate the scale of the Gaussian noise variables ϵ; 3) sample initial starting conditions and follow Eq. ( 2) with the additive noise to simulate the temporal progression; 4) removing the burn-in period and return stable timeseries. We consider four different axes of variation for the data generation: number of nodes N nodes ∈ [5, 10, 20, 40] ; ER or SF graphs; instantaneous or no instantaneous effects; and history-dependent or history-independent noise (i.e. Gaussian noise). All combinations are generated with 5 different seeds, yielding 160 different datasets. Datasets with instantaneous effects have 4 × N nodes edges in the instantaneous adjacency matrix. All datasets have 2 × N nodes connections in the lagged adjacency matrices. The MLPs for the functional relationships are fullyconnected with two hidden layers,64 units and ReLU activation. In case of history-independent noise, we are using Gaussian as the base distribution. The history dependency is modelled as a product of a scale variable obtained by the transformation of the averaged lagged parental values through a random-sampled quadratic spline, and Gaussian noise variable. The datasets with 40 nodes are generated with a series length of 400 steps, a burn-in period of 100 steps, and 100 training series. All other datasets are generated with a time-series length of 200, burnin period of 50 steps and 50 training series. We generate random interventions for all the datasets by setting the treatment variable to 10 for intervention and -10 for reference. 5000 ground-truth intervention samples are used to estimate the true treatment effect.

F.2 METHODS

All benchmarks for the synthetic experiments are run by using publicly available libraries: VARLiNGaM Hyvärinen et al. (2010) is implemenented in the lingamfoot_4 python package. PCMCI + (Runge, 2020) is implemented in Tigramitefoot_5 . We use the implementation in causalnexfoot_6 to run DYNOTEARS (Pamfil et al., 2020) . We use the default parameters for all these baselines. For PCMCI + , we enumerate all graphs in the Markov equivalence class to evaluate the causal discovery performance (see Appendix G.2 for details). For Rhino and its variants, we use the same set of hyper-parameters for all 160 datasets to demonstrates our robustness. By default, we allow Rhino and its variants to model instantaneous effect; set the model lag to be the ground truth 2 except for ablation study; the q ϕ (G) is initialized to favour sparse graphs (edge probability< 0.5); quadratic spline flow is used to for history-dependent noise. For the model formulation, we use 2 layer fully connected MLPs with 64 (5 and 10 nodes), 80 (10 nodes) and 160 (40 nodes) for all neural networks in Rhino-based methods. We also apply layer normalization and residual connections to each layer of the MLPs. For the gradient estimator, we use the Gumbel softmax method with a hard forward pass and a soft backward pass with temperature of 0.25. All spline flows uses 8 bins. The embedding sizes for transformation (i.e. Eq. ( 7) and conditional spline flow) is equal to the node number. For the sparseness penalty λ s in Eq. ( 10), we use 9 for Rhino and Rhino+s, and 5 for Rhino+g. We set ρ = 1 and α = 0 for all Rhino-based methods. with only DYNOTEARS (Pamfil et al., 2020) as an exception. We note that the PCMCI + runs on SF graphs with 40 nodes exceed our maximum run time of 1 week, showing its computational limitation in high dimensions. Nevertheless, Rhino achieves consistent performance throughout all graph settings. Ablation: history dependency Figure 4 explores the performance difference of all methods on data generated with/without history-dependent noise. Interestingly, most methods perform better on the history-dependent datasets than the history-independent ones. The possible reasons are (1) the difficulty of the discovery also depends on the randomly sampled functions; (2) the default hyperparameters of all methods are initially chosen to favor the datasets with history-dependent noise and instantaneous effects. We find that PCMCI + is the most robust across both settings, followed by Rhino and DYNOTEARS. On the other hand, the two variants of Rhino seems to be less robust. When the Rhino is correctly specified, it achieves the best performance. In summary, Rhino demonstrates reasonable robustness to history-dependency mismatch and achieves the best when correctly specified. Ablation: instantaneous effect We investigate the impact of instantaneous effects in the data. Figure 5 shows the F 1 score averaged over all possible setting combinations other than instantaneous effect. All methods seem to be robust across both settings with PCMCI + and Rhino performing the best. The score of the instantaneous adjacency matrix when instantaneous effects are disabled is not defined and therefore not plotted. For tuning the hyper-parameters of Rhino, its variants and DYNOTEARS, we split each of the 5 datasets into 80%/20% training/validation. We tune Rhino and its variants based on the validation likelihoods, and DYNOTEARS based on the validation RMSE error. For PCMCI + , we use the default settings recommended in the Tigramite package (https://github.com/jakobrunge/ tigramite). For other Granger causality baselines, refer to Table 7 -11 in Khanna & Tan (2019) . Other than the hyper-parameters reported in Table 4 , we use 1-layer MLPs with 10 hidden units for both ℓ τ,j , ζ i in Eq. ( 7) and the hyper-network for conditional spline flow (8 bins). All the MLPs use residual connections and layer-norm at every hidden layer. We use linear conditional spline flow (Dolatabadi et al., 2020) instead of the original quadratic version (Durkan et al., 2019) for better training stability. We also initialise the Bernoulli probability q ϕ (G) to favour dense graphs (i.e. edge probability > 0.5). For prior p(G), we set the initial value ρ = 1 and α = 0. For the gradient estimator, we use the Gumbel softmax method with a hard forward pass and a soft backward pass with temperature of 0.25. It is worth noting that unlike DREAM3 experiment, where the results and hyperparameters of Granger causality baselines can be directly taken from Khanna & Tan (2019) . Their setup of Netsim experiment is different from ours, where they train the baselines using a single subject and compute the corresponding AUROC, followed by averaging over subjects 2-6. Our setup is to train all methods using the entire data from subject 2-6 before computing AUROC. Thus, the hyperparameters for Granger causality are slightly different, and the AUROC increases for the baselines compared to those reported in Khanna & Tan (2019) . Rhino The hyperparameters are the same as DREAM3, except for the following: we initialise the Bernoulli probability of q ϕ (G) to have no preference (i.e. edge probability= 0.5); the λ s = 25; we use 2 layer MLPs with 64 hidden units for both functional model (Eq. ( 7)) and hyper-network with embedding size 15; the augmented Lagrangian step is 5. For Rhino variants, we use the above settings as well. DYNOTEARS, PCMCI + and VARLiNGaM For DYNOTEARS, we set lag to be 2, λ a = 0.5 and λ w = 0.5. For PCMCI + , we use parcorr independence test with lag 3. For VARLiNGaM, we use lag 2 with default settings as https://lingam.readthedocs.io/en/latest/. Granger Causality For computing AUROC, we follow the same method as Khanna & Tan (2019) ; Tank et al. (2018) by sweeping through a range of hyperparameters. Specifically, we use the same hyperparameters for SRU and eSRU as (Khanna & Tan, 2019) . For cMLP, we choose the ridge penalty as 0.43 and sweep through the group sparse penalty in range [0.1, 1]. For cLSTM, we set the ridge penalty to be 0.045, and sweep the group sparse penalty in range [0.1, 1].For TCDF, we sweep through the threshold in range [-1, 2] for the attention scores. Other than the above hyperparameters, everything else follows the setup as in Khanna & Tan (2019) .

G.5 ADDITIONAL NETSIM RESULTS

Figure 7 shows the ROC curve plot for Rhino and other baselines. It is clear that Rhino achieves significantly better TPR-FPR trade-offs compared to others. Table 8 shows additional discovery metrics of Rhino and baselines for Netsim dataset. We can observe that F1 score and SHD in general agree with AUROC reported in Table 3 , where Rhino-based methods outperform the baselines, apart from Rhino+NoInst. This again confirms the necessity of modelling instantaneous effect for realworld challenges. Rhino outperforms Rhino+g on two out of three metrics (including AUROC), which shows the advantage of history-dependent noise.

H AUROC METRIC

AUROC metric is a one of the standard metrics for evaluating the causal discovery, which measures the trade-off between the true positive rate (TPR) and false positive rate (FPR). However, during the experiments, we found out that AUROC does not necessarily correlate well with other discovery metrics. From Fig. 8 , it is clear that the F 1 score continues to increase whereas AUROC and validation likelihood starts to decrease after few steps. Since the dataset of Netsim is relatively small, this indicates the possible overfitting. This disagreement originates from the different aspects these metrics care about. For AUROC, it cares about the trade-off between TPR and FPR with various decision thresholds, and it penalizes the wrong decisions with certainty harshly. On the other hand, F 1 score cares about the final inferred binary adjacency matrix with a fixed decision threshold. For example, if we multiply the Bernoulli probability matrix by a small factor (e.g. 10 -5 ), the AUROC score will remain the same but the F 1 score will tends to 0 with the default decision threshold 0.5. Thus, model overfitting tends to drive the edge probabilities towards 1 or 0, which may help the F 1 score but these extreme decisions can result in a large decrease in the AUROC score. Thus, for small dataset, we believe AUROC is a better metric than F 1 , which also agrees with validation likelihood. In addition, the Bayesian setup of Rhino may also help with better AUROC for small dataset. From the same figure, even the large decrease of validation likelihood suggests potential model overfitting, the AUROC still maintains a reasonable value. This may be due to the Bayesian view of the causal graph, where the posterior edge probability does not converge to extreme values.



In the following, we interchange the usage of the notation G and G0:K for brevity. https://github.com/microsoft/causica/tree/v0.0.0 We note that we run each method on different dataset settings for all possible numbers of nodes. see https://lingam.readthedocs.io see https://jakobrunge.github.io/tigramite/ see https://causalnex.readthedocs.io/en/latest/



Figure 1: F 1 -scores of Rhino (light yellow) compared to all baseline methods. The different subplots show the performance for dataset with different number of nodes. 'L=2' refers to models with lag 2.

Figure 2: Comparison of the RMSE of the average treatment effects (CATEs) of the different instantiations of Rhino depending on the dimensionality. E[CATE] refers to RMSE of the expected CATE over the posterior graph distribution (i.e. G ∼ q ϕ (G)). ML ATE uses the most likely graph to calculate the ATE. These results are obtained by averaging 160 datasets, similar to the discovery setup.

Figure 3: Comparison of the F 1 score of the different baseline methods as well as Rhino (light yellow) depending on the dimensionality and the graph type. Inst. refers to the performance on the instantaneous adjacency matrix, Lag refers to the lagged adjancency matrices and temporal considers the full temporal matrix.

Figure4: Comparison of the F 1 score of the different baseline methods as well as Rhino (light yellow) depending on the dimensionality and whether the data is generated with history-depence or not. Inst. refers to the performance on the instantaneous adjacency matrix, Lag refers to the lagged adjancency matrices and temporal considers the full temporal matrix.

Figure5: Comparison of the F 1 score of the different baseline methods as well as Rhino (light yellow) depending on the dimensionality and whether the data is generated with instantaneous effects or not. Inst. refers to the performance on the instantaneous adjacency matrix, Lag refers to the lagged adjancency matrices and temporal considers the full temporal matrix.

Figure 8: The curves of orientation F 1 , AUROC and validation likelihood during training. Each curve is obtained by averaging over 5 random seeds. The validation curve agrees well with the AUROC curve, but shows an opposite trends as F 1 curve. This potentially indicates model overfitting in the later stage of training.

Comparison of the causal discovery performance of Rhino with different lag-parameters (L ∈ [1, 3]). Apart from the 3 F 1 scores, LL shows the log-likelihood of the training data.

The AUROC of the aggregated adjacency matrix for 5 DREAM3 datasets without selfconnections. DYNO. means DYNOTEARS. For Rhino and PCMCI



AUROCs of the summary graph. Rhino+NoInst is Rhino without instantaneous effects. For Rhino, VARLiNGaM, PCMCI + , results are obtained by averaging over 5 different runs.

Assumption 3 (Causal Minimality). Consider a distribution p and a DAG G, we say this distribution satisfies causal minimality w.r.t. G if it is Markovian w.r.t. G but not to any proper subgraph of G.

The first study is to test our model robustness to different types of graphs. Fig.3shows the discovery performance over ER or SF graph averaged over all other possible data setting combinations. Most methods perform better on ER graphs than on SF graphs,



The hyperparameter setup for Rhino. Node embedding is the dimensionality of u τ,i below Eq. (7); Instantaneous eff. specifies whether it models the instantaneous effect or not; Node Embed. (flow) represents the dimensionality of the node embedding for the hypernetwork used for conditional spline flow g i since the hyper-network shares the similar structure as Eq. (7); lag defines the model lag order; and λ s is the sparseness penalty in the prior (Eq. (10)); Auglag is the number of augmented Lagrangian steps, each step consists of 2000 training iterations.a nonlinear independence test based on GP, called GPDC. However, PCMCI + with GP DC is too slow to finish the training.

SHD of DREAM3 datasets.G.4 NETSIM HYPERPARAMETER SETTINGFor the Netsim experiment, we extract subject 2-6 in Sim-3.mat to form the training data and use subject 7-8 as validation dataset. Following the same settings as DREAM3 (Appendix G.1), we tune the hyperparameters of Rhino and its variants based on the validation log likelihood; DYNOTEARS with MSE on validation dataset; and use default settings of PCMCI + from Tigramite package.

annex

The ground truth graphs for DREAM3 and Netsim datasets are summary graph, which is essentially the temporal graph aggregated over time. We provide a formal definition of summary graph:Definition G.1 (Causal summary graph (Assaad et al., 2022) ). Let X t be a multivariate temporal process, and G = (V , E) be a summary graph. The edge p → q exists if and only if there exists some time t and some lag τ such that X p t-τ causes X q t at time t with a lag 0 ≤ i for p ̸ = q and with a time lag of 0 < i for p = q.Unlike the some of the Granger causality baselines, Rhino (and its variants), DYNOTEARS, VAR-LiNGaM produces the temporal adjacency matrix after training. For DREAM3 and Netsim datasets, this creates the incompatibility during evaluation. Thus, we need to aggregate the temporal graph into a summary graph before comparing to the ground truth. For binary adjacency matrix, we sum over the time steps followed by a step function, i.e. step( k G k ). Thus, there will be an edge i → j in summary graph as long as there is a connection from i to j at any timestamp. For the Bernoulli probability matrix from Rhino and its variants, we take a max(•) over the timestamp to generate the probability matrix for the summary graph.An exception is PCMCI + , which can only produce MECs for the instantaneous adjacency matrix. In such case, we will enumerate up to 10000 possible instantaneous DAGs from the MECs. Together with the lagged adjacency matrix, we will perform the above post-processing step to generate the corresponding aggregated adjacency matrix. We also estimate the corresponding edge probabilities by taking the average over all possible DAGs.For DREAM3 experiments, we ignore the self-connections by setting the diagonal of the aggregated adjacency matrix to be 0.For Netsim, self-connections are not ignored, following the same settings as Khanna & Tan (2019) .

G.3 ADDITIONAL DREAM3 RESULTS

Here, Fig. 6 shows the additional ROC curve plots for all 5 datasets in DREAM3. For the visualization purpose, we only select a single run for Rhino and this will not affect the curve much due to small standard error in Table 2 .In addition, we provides the additional metrics (Orientation F1 and SHD) for DREAM3 datasets in Table 6 and Table 7 . These results are obtained by using the same hyperparameters mentioned in Appendix G.1. In particular, we tune the threshold for rounding the continuous-valued adjacency matrix to binary adjacency matrix for both Rhino and baselines. It can be observed that the F1 and SHD agree with the trend of AUROC, where Rhino and Rhino+g achieve the best performance compared to baselines. This further supports the advantages of our proposed methods. 

