CUTS: NEURAL CAUSAL DISCOVERY FROM IRREGULAR TIME-SERIES DATA

Abstract

Causal discovery from time-series data has been a central task in machine learning. Recently, Granger causality inference has been gaining momentum due to its good explainability and high compatibility with emerging deep neural networks. However, most existing methods assume structured input data and degrade greatly when encountering data with randomly missing entries or non-uniform sampling frequencies, which hampers their application in real scenarios. To address this issue, we present CUTS, a neural Granger causal discovery algorithm that jointly imputes unobserved data points and builds causal graphs, by plugging two mutually boosting modules into an iterative framework: (i) a Latent data prediction stage, which designs a Delayed Supervision Graph Neural Network (DSGNN) to hallucinate and register irregular data that might be high dimensional and have complex distributions; (ii) a Causal graph fitting stage, which builds a causal adjacency matrix with imputed data under a sparsity penalty. Experiments show that CUTS effectively infers causal graphs from irregular time-series data, with significantly superior performance to existing methods. Our approach constitutes a promising step towards applying causal discovery to real applications with non-ideal observations.

1. INTRODUCTION

Causal interpretation of observed time-series data can help answer fundamental causal questions and advance scientific discovery in disciplines such as medicine and finance. To enable causal reasoning and counterfactual prediction, researchers in the past decades have been dedicated to discovering causal graphs from observed time-series and have made substantial progress (Gerhardus & Runge, 2020; Tank et al., 2022; Khanna & Tan, 2020; Wu et al., 2022; Pamfil et al., 2020; Löwe et al., 2022; Runge, 2021). This task is called causal discovery or causal structure learning, and usually formulates causal relationships as Directed Acyclic Graphs (DAGs). Among these causal discovery methods, Granger causality (Granger, 1969; Marinazzo et al., 2008) is attracting wide attention and demonstrates clear advantages due to its high explainability and compatibility with emerging deep neural networks (Tank et al., 2022; Khanna & Tan, 2020; Nauta et al., 2019). Despite this progress, most existing causal discovery methods assume well-structured time-series, i.e., completely sampled at an identical dense frequency. However, in real-world scenarios the observed time-series might suffer from random data missing (White et al., 2011) or be sampled with non-uniform periods. The former is usually caused by sensor limitations or transmission loss, while the latter occurs when multiple sensors have distinct sampling frequencies. Robustness to such data imperfections is urgently demanded but has not been well explored so far. When confronted with unobserved data points, some straightforward solutions fill the points with zero padding, interpolation, or other imputation algorithms, such as Gaussian Process Regression or neural-network-based approaches (Cini et al., 2022; Cao et al., 2018; Luo et al., 2018).
We will show in the experiments section that addressing missing entries by performing such trivial data imputation as a pre-processing step leads to hampered causal conclusions. To push causal discovery towards real applications, we attempt to infer reliable causal graphs from irregular time-series data. Fortunately, for data assumed to be generated by certain causal structural models (Pamfil et al., 2020; Tank et al., 2022), a well-designed neural network can fill a small proportion of missing entries decently given a plausible causal graph, which conversely improves the causal discovery, and so forth. Leveraging this benefit, we propose to conduct causal discovery and data completion in a mutually boosting manner under an iterative framework, instead of as sequential processing. Specifically, the algorithm alternates between two stages: (a) a Latent data prediction stage that hallucinates missing entries with a delayed supervision graph neural network (DSGNN), and (b) a Causal graph fitting stage that infers causal graphs from the filled data under a sparsity constraint, utilizing the extended nonlinear Granger causality scheme. We name our algorithm Causal discovery from irregUlar Time-Series (CUTS), and its main contributions are as follows: • We propose CUTS, a novel framework for causal discovery from irregular time-series data, which to the best of our knowledge is the first to address irregular time-series in causal discovery under this paradigm. Theoretically, CUTS can recover the correct causal graph under fair assumptions, as proved in Theorem 1. • In the data imputation stage we design a deep neural network, DSGNN, which successfully imputes the unobserved entries in irregular time-series data and boosts the subsequent causal discovery stage and later iterations.
• We conduct extensive experiments to show our superior performance to state-of-the-art causal discovery methods combined with widely used data imputation methods, the advantages of mutually-boosting strategies over sequential processing, and the robustness of CUTS (in Appendix Section A.4).

2. RELATED WORKS

Causal Structural Learning / Causal Discovery. Causal Structural Learning (or Causal Discovery) is a fundamental and challenging task in the fields of causality and machine learning, and existing approaches can be categorized into five classes. (i) Constraint-based approaches, which build causal graphs by conditional independence tests. The two most widely used algorithms are PC (Spirtes & Glymour, 1991) and Fast Causal Inference (FCI) (Spirtes et al., 2000), which was later extended to time-series data by Entner & Hoyer (2010). Recently, Runge et al. proposed PCMCI, which combines the above two constraint-based algorithms with linear/nonlinear conditional independence tests (Gerhardus & Runge, 2020; Runge, 2018b) and achieves high scalability on large-scale time-series data. (ii) Score-based learning algorithms, based on penalized Neural Ordinary Differential Equations (Bellot et al., 2022) or an acyclicity constraint (Pamfil et al., 2020). (iii) Convergent Cross Mapping (CCM), first proposed by Sugihara et al. (2012), which tackles nonseparable weakly connected dynamic systems by reconstructing a nonlinear state space. Later, CCM was extended to situations of synchrony (Ye et al., 2015), confounding (Benkő et al., 2020), or sporadic time series (Brouwer et al., 2021). (iv) Approaches based on the Additive Noise Model (ANM), which infer the causal graph under an additive noise assumption (Shimizu et al., 2006; Hoyer et al., 2008); Hoyer et al. (2008) extended ANM to nonlinear models with almost any nonlinearity. (v) The Granger causality approach proposed by Granger (1969), which has been widely used to analyze temporal causal relationships by testing whether one time-series aids the prediction of another. Granger causal analysis originally assumes linear relations, so that causal structures can be discovered by fitting a Vector Autoregressive (VAR) model; later, the idea was extended to nonlinear situations (Marinazzo et al., 2008).
Thanks to its high compatibility with emerging deep neural networks, Granger causal analysis is gaining momentum; we use it in our work to incorporate a neural network that imputes irregular data with high complexity.

Neural Granger Causal Discovery. With the rapid progress and wide application of deep Neural Networks (NNs), researchers have begun to utilize RNNs (or other NNs) to infer nonlinear Granger causality. Wu et al. (2022) used individual pair-wise Granger causal tests, while Tank et al. (2022) inferred Granger causality directly from component-wise NNs by enforcing sparse input layers. Building on Tank et al. (2022)'s idea, Khanna & Tan (2020) explored the possibility of inferring Granger causality with Statistical Recurrent Units (SRUs, Oliva et al. (2017)). Later, Löwe et al. (2022) extended the neural Granger causality idea to causal discovery on multiple samples with different causal relationships but similar dynamics. However, all these approaches assume fully observed time-series and show inferior results given irregular data, as shown in the experiments section. In this work, we leverage this neural Granger causal discovery idea and build a two-stage iterative scheme to jointly impute the unobserved data points and discover causal graphs.

Causal Discovery from Irregular Time-series. Irregular time-series are very common in real scenarios, but causal discovery on such data remains somewhat under-explored. When confronted with missing data, directly conducting causal inference might suffer from significant error (Runge, 2018a; Hyttinen et al., 2016). Although joint data imputation and causal discovery has been explored in static settings (Tu et al., 2019; Gain & Shpitser, 2018; Morales-Alvarez et al., 2022; Geffner et al., 2022), it is still under-explored in time-series causal discovery.
There are mainly two solutions: either discovering causal relations from the available incomplete observations (Gain & Shpitser, 2018; Strobl et al., 2018) or filling the missing values before causal discovery (Wang et al., 2020; Huang et al., 2020). To infer causal graphs from partially observed time-series, several algorithms have been proposed, such as an Expectation-Maximization approach (Gong et al., 2015), Latent Convergent Cross Mapping (Brouwer et al., 2021), a Neural-ODE based approach (Bellot et al., 2022), Partial Canonical Correlation Analysis (Partial CCA), and Generalized Lasso Granger (GLG) (Iseki et al., 2019). Other researchers introduce data imputation before causal discovery and have made progress recently. For example, Cao et al. (2018) learn to impute values by iteratively applying an RNN and Cini et al. (2022) use Graph Neural Networks, while a recently proposed data completion method by Chen et al. (2022) uses Gaussian Process Regression. In this paper, we use a deep neural network similar to Cao et al. (2018)'s, but differently, we propose to impute missing data points and discover causal graphs jointly instead of sequentially, so that the two processes mutually improve each other and achieve high performance.

3.1. NONLINEAR STRUCTURAL CAUSAL MODELS WITH IRREGULAR OBSERVATION

Let us denote by $\mathbf{X} = \{x_{1:L,i}\}_{i=1}^{N}$ a uniformly sampled observation of a dynamic system, in which $\mathbf{x}_t$ represents the sample vector at time point $t$ and consists of $N$ variables $\{x_{t,i}\}$, with $t \in \{1, \dots, L\}$ and $i \in \{1, \dots, N\}$. In this paper, we adopt the representation proposed by Tank et al. (2022) and Khanna & Tan (2020), and assume each sampled variable $x_{t,i}$ to be generated by the following model

$$x_{t,i} = f_i\left(x_{t-\tau:t-1,1},\, x_{t-\tau:t-1,2},\, \dots,\, x_{t-\tau:t-1,N}\right) + e_{t,i}, \quad i = 1, 2, \dots, N. \tag{1}$$

Here $\tau$ denotes the maximal time lag. In this paper, we focus on causal inference from irregular time series, and use a binary observation mask $o_{t,i}$ to label the missing entries, i.e., the observed value equals its latent version when $o_{t,i}$ equals 1: $\tilde{x}_{t,i} \triangleq x_{t,i} \cdot o_{t,i}$. We consider two types of recurrent data missing in practical observations:

Random Missing. Each data point of time-series $i$ is missing with a certain probability $p_i$; in our experiments the observation mask follows a Bernoulli distribution, $o_{t,i} \sim \mathrm{Ber}(1 - p_i)$.

Periodic Missing. Different variables are sampled with their own periods $T_i$. We model the sampling process for the $i$th variable with the observation function $o_{t,i} = \sum_{n=0}^{\infty} \delta(t - nT_i)$, $T_i = 1, 2, \dots$, with $\delta(\cdot)$ denoting Dirac's delta function.
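As a concrete illustration, the two missing mechanisms above can be sketched as observation-mask generators (a minimal sketch; function names and defaults are our own, not from the paper):

```python
import numpy as np

def random_missing_mask(L, N, p, rng=None):
    """Random Missing: each entry of series i is observed with prob 1 - p_i,
    i.e. o_{t,i} ~ Ber(1 - p_i)."""
    rng = np.random.default_rng(rng)
    p = np.broadcast_to(np.asarray(p, dtype=float), (N,))
    return (rng.random((L, N)) >= p).astype(float)

def periodic_missing_mask(L, periods):
    """Periodic Missing: series i is observed only at t = 0, T_i, 2*T_i, ..."""
    o = np.zeros((L, len(periods)))
    for i, T in enumerate(periods):
        o[::T, i] = 1.0
    return o
```

Masking the latent series as `x * o` then reproduces the observed data $\tilde{x}_{t,i} = x_{t,i} \cdot o_{t,i}$.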

3.2. NONLINEAR GRANGER CAUSALITY

For a dynamic system, time-series $i$ Granger causes time-series $j$ when the past values of $x_i$ aid in predicting the current and future status of $x_j$. Standard Granger causality is defined for linear relations, but has recently been extended to nonlinear relations:

Definition 1. Time-series $i$ Granger causes $j$ if and only if there exists $x'_{t-\tau:t-1,i} \neq x_{t-\tau:t-1,i}$ such that

$$f_j\left(x_{t-\tau:t-1,1}, \dots, x'_{t-\tau:t-1,i}, \dots, x_{t-\tau:t-1,N}\right) \neq f_j\left(x_{t-\tau:t-1,1}, \dots, x_{t-\tau:t-1,i}, \dots, x_{t-\tau:t-1,N}\right),$$

i.e., the past data points of time-series $i$ influence the prediction of $x_{t,j}$.

Granger causality is highly compatible with neural networks (NNs). Considering the universal approximation ability of NNs (Hornik et al., 1989), it is possible to fit a causal relationship function with component-wise MLPs or RNNs. By imposing a sparsity regularizer on the weights of network connections, as done by Tank et al. (2022) and Khanna & Tan (2020), NNs can learn the causal relationships among all $N$ variables. The inferred pair-wise Granger causal relationships can then be aggregated into a Directed Acyclic Graph (DAG), represented as an adjacency matrix $\mathbf{A} = \{a_{ij}\}_{i,j=1}^{N}$, where $a_{ij} = 1$ denotes that time-series $i$ Granger causes $j$ and $a_{ij} = 0$ otherwise. This paradigm is well explored and has shown convincing empirical evidence in recent years (Tank et al., 2022; Khanna & Tan, 2020; Löwe et al., 2022). Although Granger causality is not necessarily true causality, Peters et al. (2017) provide a justification of (time-invariant) Granger causality when assuming no unobserved variables and no instantaneous effects, as mentioned by Löwe et al. (2022) and Vowels et al. (2021). In this paper, we propose a new inference approach to successfully identify causal relationships from irregular time-series data.
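The sparse-input-layer idea above can be sketched as follows: after training one network per target series, the group norm of each input series' first-layer weights decides the Granger edge (a hypothetical helper; the weight shapes and the threshold are our assumptions, not the paper's):

```python
import numpy as np

def granger_adjacency(first_layer_weights, n_series, max_lag, thresh=1e-3):
    """first_layer_weights[j]: array of shape (hidden, n_series * max_lag),
    the sparsity-regularized input layer of the network predicting series j.
    Returns A with a_ij = 1 iff series i's weight group survives shrinkage."""
    A = np.zeros((n_series, n_series), dtype=int)
    for j, W in enumerate(first_layer_weights):
        Wg = W.reshape(W.shape[0], n_series, max_lag)
        # one group norm per input series, pooled over hidden units and lags
        group_norm = np.sqrt((Wg ** 2).sum(axis=(0, 2)))
        A[:, j] = (group_norm > thresh).astype(int)
    return A
```

An input series whose weight group is driven to zero by the regularizer contributes nothing to the prediction, so its edge is absent from the resulting adjacency matrix.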

4. IRREGULAR TIME-SERIES CAUSAL DISCOVERY

CUTS implements the causal graph as a set of Causal Probability Graphs (CPGs) $\mathcal{G} = \langle \mathbf{X}, \{\mathbf{M}_\tau\}_{\tau=0}^{\tau_{max}} \rangle$, where the element $m_{\tau,ij} \in \mathbf{M}_\tau$ represents the probability of a causal influence from $x_{t-\tau,i}$ to $x_{t,j}$, i.e., $m_{\tau,ij} = p(x_{t-\tau,i} \to x_{t,j})$. Since we assume no instantaneous effects, time-series $i$ Granger causes $j$ if and only if there exists a causal relation at at least one time lag, so we define the discovered causal graph $\tilde{\mathbf{A}}$ by taking the maximum over all time lags $\tau \in \{1, \dots, \tau_{max}\}$:

$$\tilde{a}_{ij} = \max\left(m_{1,ij}, \dots, m_{\tau_{max},ij}\right). \tag{3}$$

Specifically, if $\tilde{a}_{ij}$ is penalized to zero (or below a certain threshold), we deduce that time-series $i$ does not influence the prediction of time-series $j$, i.e., $i$ does not Granger cause $j$. During training, we alternately learn the prediction model and the CPG matrix, which are implemented by the Latent data prediction stage and the Causal graph fitting stage, respectively. Besides, proper learning strategies are designed to facilitate convergence.
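Equation (3) reduces the per-lag probability matrices to a single graph; a minimal numpy sketch (the array-shape convention is our own):

```python
import numpy as np

def aggregate_cpg(M):
    """M: array of shape (tau_max, N, N) with M[tau-1, i, j] = m_{tau,ij}.
    Returns the discovered graph, a~_ij = max over all lags (Eq. 3)."""
    return M.max(axis=0)

def causal_edges(M, threshold=0.5):
    """Binarize: i Granger causes j iff a~_ij exceeds the threshold."""
    return (aggregate_cpg(M) >= threshold).astype(int)
```

Any edge whose probability is penalized below the threshold at every lag is thereby declared non-causal.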

4.1. LATENT DATA PREDICTION STAGE

The proposed Latent data prediction stage is designed to fit the data generation function for time-series $i$ with a neural network $f_{\phi_i}$, which takes into account its parent nodes in the causal graph. Here we propose the Delayed Supervision Graph Neural Network (DSGNN) for imputing the missing entries in the observation. The inputs to DSGNN include all the historical data points (with a maximum time lag $\tau_{max}$) $x_{t-\tau:t-1,i}$ and the discovered CPGs. During training we sample the causal graph from a Bernoulli distribution, in a similar manner to Lippe et al. (2021), and the prediction $\hat{x}$ is the output of the neural network $f_{\phi_i}$:

$$\hat{x}_{t,i} = f_{\phi_i}(\mathbf{X} \odot \mathbf{S}) = f_{\phi_i}\left(x_{t-\tau:t-1,1} \odot s_{1:\tau,1i},\, \dots,\, x_{t-\tau:t-1,N} \odot s_{1:\tau,Ni}\right), \tag{4}$$

where $\mathbf{S} = \{\mathbf{S}_\tau\}_{\tau=1}^{\tau_{max}}$ with $s_{\tau,ij} \sim \mathrm{Ber}(m_{\tau,ij})$, and $\odot$ denotes the Hadamard product. $\mathbf{S}$ is sampled for each training sample in a mini-batch. The fitting is supervised by the observed data points. Specifically, we update the network parameters $\phi_i$ by minimizing the loss function

$$\mathcal{L}_{pred}\left(\hat{\mathbf{X}}, \tilde{\mathbf{X}}, \mathbf{O}\right) = \sum_{i=1}^{N} \frac{\left\langle L_2\left(\hat{x}_{1:L,i}, \tilde{x}_{1:L,i}\right),\, o_{1:L,i} \right\rangle}{\frac{1}{L}\left\langle o_{1:L,i},\, o_{1:L,i} \right\rangle}, \tag{5}$$

where $o_i$ denotes the observation mask, $\langle \cdot, \cdot \rangle$ is the dot product, and $L_2$ represents the element-wise squared error. Then, data imputation is performed as

$$\tilde{x}^{(m+1)}_{t,i} = \begin{cases} (1-\alpha)\,\tilde{x}^{(m)}_{t,i} + \alpha\,\hat{x}^{(m)}_{t,i}, & o_{t,i} = 0 \text{ and } m \geq n_1 \\ \tilde{x}^{(0)}_{t,i}, & o_{t,i} = 1 \text{ or } m < n_1 \end{cases} \tag{6}$$

Here $m$ indexes the iteration steps, and $\tilde{x}^{(0)}_{t,i}$ denotes the initial data (unobserved entries filled with a zero-order holder). $\alpha$ is selected to prevent abrupt changes of the imputed data. For the missing points, the predicted values $\hat{x}^{(m)}_{t,i}$ are not supervised by $\mathcal{L}_{pred}$ but are written into $\tilde{x}^{(m+1)}_{t,i}$ to provide a "delayed" error signal for causal graph inference. Moreover, we impute the missing values with the help of the discovered CPG $\mathcal{G}$ (sampled from a Bernoulli distribution), as illustrated in Figure 1(b), which is proved to significantly improve performance in experiments.
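The masked loss of Equation (5) and the blending update of Equation (6) can be sketched in numpy (a simplification: the actual stage trains $f_{\phi_i}$ by gradient descent; here we only show the loss value and the update rule):

```python
import numpy as np

def pred_loss(x_hat, x_tilde, o):
    """Eq. (5): squared error on observed entries only, each series
    normalized by its observed fraction <o_i, o_i> / L."""
    L = x_hat.shape[0]
    se = (x_hat - x_tilde) ** 2
    return float(((se * o).sum(axis=0) * L / o.sum(axis=0)).sum())

def impute_step(x_tilde, x_hat, o, alpha, m, n1):
    """Eq. (6): after n1 warm-up epochs, blend predictions into the
    unobserved entries; observed entries keep their values."""
    if m < n1:
        return x_tilde
    return np.where(o == 1, x_tilde, (1 - alpha) * x_tilde + alpha * x_hat)
```

A small `alpha` realizes the paper's intent of preventing abrupt changes: each unobserved entry drifts towards the network prediction over iterations rather than jumping to it.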

4.2. CAUSAL GRAPH DISCOVERY STAGE

After imputing the missing time-series, we proceed to learn the CPG in the Causal graph fitting stage. To determine the causal probability $p(x_{t-\tau,i} \to x_{t,j}) = m_{\tau,ij}$, we model this likelihood as $m_{\tau,ij} = \sigma(\theta_{\tau,ij})$, where $\sigma(\cdot)$ denotes the sigmoid function and $\theta$ is the learned parameter set. Since we assume no instantaneous effects, it is unnecessary to learn instantaneous edge directions in the CPG. In this stage we optimize the graph parameters $\theta$ by minimizing the objective

$$\mathcal{L}_{graph}\left(\hat{\mathbf{X}}, \tilde{\mathbf{X}}, \mathbf{O}, \theta\right) = \mathcal{L}_{pred}\left(\hat{\mathbf{X}}, \tilde{\mathbf{X}}, \mathbf{O}\right) + \lambda \left\|\sigma(\theta)\right\|_1,$$

where $\mathcal{L}_{pred}$ is the squared-error prediction loss defined in Equation (5) and $\|\cdot\|_1$ is the $L_1$ regularizer enforcing sparse connections in the learned CPG. If $\theta_{\tau,ij}$ is penalized towards $-\infty$ (and thus $m_{\tau,ij} \to 0$) for all $\tau \in [1, \tau_{max}]$, we deduce that time-series $i$ does not Granger cause $j$.
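The stage's objective and the Bernoulli sampling of the mask $\mathbf{S}$ can be sketched as follows (a minimal sketch; it assumes a prediction-loss value computed elsewhere, e.g. by Equation (5)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_mask(theta, rng=None):
    """Sample S with s_{tau,ij} ~ Ber(m_{tau,ij}), where m = sigma(theta)."""
    rng = np.random.default_rng(rng)
    return (rng.random(theta.shape) < sigmoid(theta)).astype(float)

def graph_loss(pred_loss_value, theta, lam):
    """L_graph = L_pred + lambda * ||sigma(theta)||_1 (sparsity on edge probs)."""
    return pred_loss_value + lam * sigmoid(theta).sum()
```

Because the penalty acts on $\sigma(\theta)$ rather than on $\theta$ itself, pushing an edge probability towards zero requires driving its logit towards $-\infty$, which is exactly the convergence behavior analyzed in Theorem 1.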

4.3. THE LEARNING STRATEGY.

The overall learning process consists of $n = n_1 + n_2 + n_3$ epochs, as illustrated in Figure 1(a): in the first $n_1$ epochs, DSGNN and the CPG are optimized without data imputation (missing entries are set to the initial guess); in the next $n_2$ epochs, the iterative model learning continues with data imputation, but the imputed data are not used for model supervision; in the last $n_3$ epochs, the learned CPG is refined with supervision from all data points (including the imputed ones).

Fine-tuning. The main training process alternates between the Latent data prediction stage and the Causal graph fitting stage. After sufficient iterations (here $n_1 + n_2$), the unobserved data points can be reliably imputed thanks to the discovered causal relations, so we can incorporate these predicted points to supervise the model and fine-tune the parameters for further improvement. In the last $n_3$ epochs the CPG is optimized with the loss function

$$\mathcal{L}_{ft}\left(\hat{\mathbf{X}}, \tilde{\mathbf{X}}\right) = \sum_{i=1}^{N} L_2\left(\hat{x}_{1:L,i}, \tilde{x}_{1:L,i}\right) + \lambda \left\|\sigma(\theta)\right\|_1.$$

Parameter Settings. During training, the $\tau$ value for the Gumbel Softmax is initially set to a relatively high value, annealed to a low value over the first $n_1 + n_2$ epochs, and then reset for the last $n_3$ epochs. The learning rates for the Latent data prediction stage and the Causal graph fitting stage are set to $lr_{data}$ and $lr_{graph}$, respectively, and gradually scheduled to $0.1\,lr_{data}$ and $0.1\,lr_{graph}$ over all $n_1 + n_2 + n_3$ epochs. The detailed hyperparameter settings are listed in Appendix Section A.3.
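The three-phase schedule can be sketched as a simple epoch dispatcher (the phase names are our own labels, not the paper's):

```python
def training_phase(epoch, n1, n2, n3):
    """Map an epoch index to the schedule of Sec. 4.3:
    - 'warmup':   no data imputation (missing entries keep the initial guess)
    - 'imputing': impute, but imputed points do not supervise the model
    - 'finetune': imputed points also supervise the model (L_ft)."""
    assert 0 <= epoch < n1 + n2 + n3
    if epoch < n1:
        return "warmup"
    if epoch < n1 + n2:
        return "imputing"
    return "finetune"
```

Within each phase, every epoch still alternates the Latent data prediction stage and the Causal graph fitting stage; only the role of the imputed points changes.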

4.4. CONVERGENCE CONDITIONS FOR GRANGER CAUSALITY.

We show in Theorem 1 that under certain assumptions, the discovered causal adjacency matrix converges to the true Granger causal matrix.

Theorem 1. Given a time-series dataset $\mathbf{X} = \{x_{1:L,i}\}_{i=1}^{N}$ generated with Equation (1), we have:
1. $\exists \lambda$ such that $\forall \tau \in \{1, \dots, \tau_{max}\}$, the causal probability matrix element $m_{\tau,ij} = \sigma(\theta_{\tau,ij})$ converges to 0 if time-series $i$ does not Granger cause $j$; and
2. $\exists \tau \in \{1, \dots, \tau_{max}\}$ such that $m_{\tau,ij}$ converges to 1 if time-series $i$ Granger causes $j$,
provided the following two conditions hold:
1. the DSGNN $f_{\phi_i}$ in the Latent data prediction stage models the generative function $f_i$ with an error smaller than an arbitrarily small value $e_{NN,i}$;
2. $\exists \lambda_0$ such that $\forall i, j = 1, \dots, N$, $\|f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=1}) - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=0})\|_2^2 > \lambda_0$, where $\mathbf{S}_{\tau,ij=l}$ denotes the set $\mathbf{S}$ with element $s_{\tau,ij} = l$.

The implications behind these two conditions can be intuitively explained. Condition 1 is intrinsically the Universal Approximation Theorem (Hornik et al., 1989) for neural networks, i.e., the network has an appropriate structure and is fed with sufficient training data. Condition 2 means there exists a threshold $\lambda_0$ to binarize $\|f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=1}) - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=0})\|$, serving as an indicator of whether time-series $i$ contributes to the prediction of $j$. The proof of Theorem 1 is detailed in Appendix Section A.1. Although the convergence condition depends on an appropriate setting of $\lambda$, we show in Appendix Section A.4.6 that our algorithm is robust to changes of $\lambda$ over a wide range.

5. EXPERIMENTS

Datasets.

We evaluate the performance of the proposed causal discovery approach CUTS on both numerical simulations and real-scenario-inspired data. The simulated datasets come from a linear Vector Autoregressive (VAR) model and a nonlinear Lorenz-96 model (Karimi & Paul, 2010), while the real-scenario-inspired datasets are from NetSim (Smith et al., 2011), an fMRI dataset describing the connecting dynamics of 15 human brain regions. Irregular observations are generated according to the following mechanisms: Random Missing (RM) is simulated by sampling over a uniform distribution with missing probability $p_i$; Periodic Missing (PM) is simulated with a sampling period $T_i$ randomly chosen for each time-series, with the maximum period being $T_{max}$. For statistically sound quantitative evaluation of the different causal discovery algorithms, we average over multiple values of $p_i$ and $T_i$ in our experiments.

Baseline Algorithms.

To demonstrate the superiority of our approach, we compare with five baseline algorithms: PCMCI, NGC, eSRU, LCCM, and NGM, where the first three are additionally combined with the data imputation methods ZOH, GP, and GRIN.

5.1. VAR SIMULATION DATASETS

VAR datasets are simulated following

$$\mathbf{x}_t = \sum_{\tau=1}^{\tau_{max}} \mathbf{A}_\tau \mathbf{x}_{t-\tau} + \mathbf{e}_t,$$

where $\mathbf{A}_\tau$ is the sparse autoregressive coefficient matrix for time lag $\tau$. Time-series $i$ Granger causes time-series $j$ if $\exists \tau \in \{1, \dots, \tau_{max}\}$ such that $a_{\tau,ij} > 0$. The objective of causal discovery is to reconstruct the non-zero elements of the causal graph $\mathbf{A}$ (where $a_{ij} = \max(a_{1,ij}, \dots, a_{\tau_{max},ij})$) with $\tilde{\mathbf{A}}$. We set $\tau_{max} = 3$, $N = 10$, and time-series length $L = 10000$ in this experiment. For the missing mechanisms, we set $p = 0.3, 0.6$ for Random Missing and $T_{max} = 2, 4$ for Periodic Missing. Experimental results are shown in the upper half of Table 1. We can see that CUTS beats PCMCI, NGC, and eSRU combined with ZOH, GP, and GRIN in most cases, except for VAR with random missing ($p = 0.3$), where PCMCI + GRIN is better by only a small margin (+0.0012). The superiority is especially prominent with a larger proportion of missing values ($p = 0.6$ for random missing and $T_{max} = 4$ for periodic missing). In contrast, the data imputation algorithms GP and GRIN provide performance gains in some scenarios but fail to boost causal discovery in others. This indicates that simply combining existing data imputation algorithms with causal discovery algorithms cannot give stable and promising results, and is thus less practical than our approach. We also beat LCCM and NGM, which originally tackle the irregular time-series problem, by a clear margin. This hampered performance may be attributed to the fact that LCCM and NGM both utilize Neural ODEs to model the dynamics and do not cope well with VAR datasets.
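A VAR generator matching the equation above can be sketched in numpy (the noise level and warm-up handling are our own choices; $a_{\tau,ij}$ encodes the edge $i \to j$, so the update uses the transposed product):

```python
import numpy as np

def simulate_var(A, L, noise_std=0.1, rng=None):
    """A: (tau_max, N, N) with a_{tau,ij} the influence of x_{t-tau,i} on x_{t,j}.
    Simulates x_t = sum_tau A_tau^T x_{t-tau} + e_t under that convention."""
    tau_max, N, _ = A.shape
    rng = np.random.default_rng(rng)
    x = np.zeros((L + tau_max, N))
    x[:tau_max] = noise_std * rng.normal(size=(tau_max, N))
    for t in range(tau_max, L + tau_max):
        x[t] = sum(x[t - k - 1] @ A[k] for k in range(tau_max))
        x[t] += noise_std * rng.normal(size=N)
    return x[tau_max:]  # drop the warm-up rows
```

Note that $\mathbf{A}_\tau$ must be scaled so the process is stable (e.g. small spectral radius), otherwise the series diverges.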

5.2. LORENZ-96 SIMULATION DATASETS

Lorenz-96 datasets are simulated according to

$$\frac{dx_{t,i}}{dt} = -x_{t,i-1}\left(x_{t,i-2} - x_{t,i+1}\right) - x_{t,i} + F,$$

where $-x_{t,i-1}(x_{t,i-2} - x_{t,i+1})$ is the advection term, $-x_{t,i}$ is the diffusion term, and $F$ is the external forcing (a larger $F$ implies a more chaotic system). In this Lorenz-96 model each time-series $x_i$ is affected by the historical values of four time-series $x_{i-2}, x_{i-1}, x_i, x_{i+1}$, so each row of the ground-truth causal graph $\mathbf{A}$ has four non-zero elements. Here we set the maximal time-series length $L = 1000$, $N = 10$, and the forcing constant $F = 10$, and show experimental results for $F = 40$ in Appendix Section A.4.7. From the results in the lower half of Table 1, one can draw similar conclusions to those on VAR datasets: CUTS outperforms baseline causal discovery methods either with or without data imputation.

Data Imputation Boosts Causal Discovery. We also disable the data imputation of Equation (6), i.e., set $\alpha = 0$, so that the Causal graph fitting stage is performed with just the initially filled data (Appendix Section A.3.2); the results are shown as "No Imputation" in Table 6. Compared with the first row, we can see that introducing data imputation boosts AUROC by 0.0032 ∼ 0.0499. We further replace our data imputation module with baseline modules (ZOH, GP, GRIN) to show the effectiveness of our design. Our algorithm beats "ZOH for Imputation", "GP for Imputation", and "GRIN for Imputation" in most scenarios.
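The Lorenz-96 simulator can be sketched with a fourth-order Runge-Kutta integrator (the step size and initialization are our own choices, not the paper's):

```python
import numpy as np

def l96_deriv(x, F):
    # dx_i/dt = -x_{i-1}(x_{i-2} - x_{i+1}) - x_i + F, indices modulo N
    return -np.roll(x, 1) * (np.roll(x, 2) - np.roll(x, -1)) - x + F

def simulate_lorenz96(N=10, L=1000, F=10.0, dt=0.01, rng=None):
    rng = np.random.default_rng(rng)
    x = F + 0.01 * rng.normal(size=N)   # perturb the unstable fixed point x = F
    out = np.empty((L, N))
    for t in range(L):
        k1 = l96_deriv(x, F)
        k2 = l96_deriv(x + 0.5 * dt * k1, F)
        k3 = l96_deriv(x + 0.5 * dt * k2, F)
        k4 = l96_deriv(x + dt * k3, F)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        out[t] = x
    return out
```

The `np.roll` shifts realize the cyclic neighbor structure, so the true graph has exactly the four parents $x_{i-2}, x_{i-1}, x_i, x_{i+1}$ per series described above.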

5.3. NETSIM DATASETS

Finetuning Stage Raises Performance. We disable the finetuning stage and find that performance drops slightly, as shown in the "No Finetuning Stage" row of Table 6. This indicates that the finetuning stage indeed helps to refine the causal discovery process.

5.5. ADDITIONAL EXPERIMENTS

We conduct additional experiments in the Appendix: results on more datasets (Section A.4.1), an ablation study on the choice of epoch numbers (Section A.4.3), ablation results on the VAR and NetSim datasets (Section A.4.2), performance on the 3-dimensional temporal causal graph (Section A.4.4), CUTS's performance superiority on regular time-series (Section A.4.5), robustness to different noise levels (Section A.4.8), robustness to hyperparameter settings (Section A.4.6), and results on Lorenz-96 with forcing constant F = 40 (Section A.4.7). We further provide implementation details and hyperparameter settings of CUTS and the baseline algorithms in Appendix Section A.3, and the pseudocode of our approach in Appendix Section A.5.

6. CONCLUSIONS

In this paper we propose CUTS, a time-series causal discovery method applicable to scenarios with irregular observations, built on nonlinear Granger causality. We conducted a series of experiments on multiple datasets with both Random Missing and Periodic Missing. Compared with previous methods, CUTS utilizes two alternating stages to discover causal relations and achieves superior performance; we show in the ablation section that these two stages mutually boost each other. Moreover, CUTS is widely applicable to time-series of different lengths, scales well to large sets of variables, and is robust to noise. Our code is publicly available at https://github.com/jarrycyx/unn. In this work we assume no latent confounders and no instantaneous effects for Granger causality. Our future work includes: (i) causal discovery in the presence of latent confounders or instantaneous effects; (ii) time-series imputation with causal models.

REPRODUCIBILITY STATEMENT

For the purpose of reproducibility, we include the source code in the supplementary files, and will publish it on GitHub upon acceptance. The dataset generation process is also included in the source code. Moreover, we provide all hyperparameters used for all methods in Appendix Section A.4.6. The experiments are deployed on a server with an Intel Core CPU and an NVIDIA RTX3090 GPU.

A.1 PROOF OF THEOREM 1

We take the expectation of $\mathcal{L}_{graph}$ over the sampled masks, where $s_{\tau,ij} \sim \mathrm{Ber}(\sigma(\theta_{\tau,ij}))$ and $c_j = \frac{L}{\langle o_{1:L,j},\, o_{1:L,j} \rangle}$. Using the REINFORCE trick (Williams, 1992), the gradient of $m_{\tau,ij}$'s logit is calculated as

$$\frac{\partial}{\partial \theta_{\tau,ij}} \mathbb{E}_{s_{\tau,ij}}\left[\mathcal{L}_{graph}\right] = \mathbb{E}_{s_{\tau,ij}}\left[ c_j\, o_{t,j} \left(x_{t,j} - f_{\phi_j}(\mathbf{X} \odot \mathbf{S})\right)^2 \frac{\partial}{\partial \theta_{\tau,ij}} \log p\left(s_{\tau,ij}\right) \right] + \lambda \sigma'(\theta_{\tau,ij}).$$

Expanding the expectation over $s_{\tau,ij} \in \{0, 1\}$ with $p(s_{\tau,ij} = 1) = \sigma(\theta_{\tau,ij})$,

$$\begin{aligned}
&= \lambda \sigma'(\theta_{\tau,ij}) + \sigma(\theta_{\tau,ij})\, c_j\, o_{t,j} \left(x_{t,j} - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=1})\right)^2 \frac{\sigma'(\theta_{\tau,ij})}{\sigma(\theta_{\tau,ij})} \\
&\quad + \left(1 - \sigma(\theta_{\tau,ij})\right) c_j\, o_{t,j} \left(x_{t,j} - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=0})\right)^2 \frac{\sigma'(\theta_{\tau,ij})}{\sigma(\theta_{\tau,ij}) - 1} \\
&= \sigma'(\theta_{\tau,ij}) \left( c_j\, o_{t,j} \left(x_{t,j} - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=1})\right)^2 - c_j\, o_{t,j} \left(x_{t,j} - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=0})\right)^2 + \lambda \right),
\end{aligned}$$

where $\mathbf{S}_{\tau,ij=l}$ denotes $\mathbf{S} = \{\mathbf{S}_\tau\}_{\tau=1}^{\tau_{max}}$ with $s_{\tau,ij}$ set to $l$, and $f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=1}) = f_{\phi_j}(x_{t-\tau:t-1,1} \odot s_{1:\tau,1j}, \dots, x_{t-\tau:t-1,N} \odot s_{1:\tau,Nj})$. According to Definition 1, time-series $i$ does not Granger cause $j$ if $\forall \tau \in \{1, \dots, \tau_{max}\}$, the prediction of $x_{t,j}$ is invariant to $x_{t-\tau,i}$. Then $\forall \tau \in \{1, \dots, \tau_{max}\}$, $f_{\phi_j}(\dots, x_{t-\tau,i}, \dots) = f_{\phi_j}(\dots, 0, \dots)$, i.e., $f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=1}) = f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=0})$. Applying the additive noise model (ANM, Equation (1)) we can derive

$$\frac{\partial}{\partial \theta_{\tau,ij}} \mathbb{E}_{s_{\tau,ij}}\left[\mathcal{L}_{graph}\right] = \sigma'(\theta_{\tau,ij}) \left( c_j\, o_{t,j} \left(e_{t,j}^2 - e_{t,j}^2\right) + \lambda \right) = \lambda \sigma'(\theta_{\tau,ij}) > 0.$$

This is a sigmoidal gradient, whose convergence is analyzed in Section A.1.3. Likewise, $\exists \tau \in \{1, \dots, \tau_{max}\}$ with $f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=1}) \neq f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=0})$ if time-series $i$ Granger causes $j$, and for such $\tau$

$$\frac{\partial}{\partial \theta_{\tau,ij}} \mathbb{E}_{s_{\tau,ij}}\left[\mathcal{L}_{graph}\right] = \sigma'(\theta_{\tau,ij}) \left( c_j\, o_{t,j} \left( \left(x_{t,j} - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=1})\right)^2 - \left(x_{t,j} - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=0})\right)^2 \right) + \lambda \right).$$
Assuming that $f_{\phi_j}(\cdot)$ accurately models the causal relations in $f_j(\cdot)$ (i.e., the DSGNN $f_{\phi_i}$ in the Latent data prediction stage models the generative function $f_i$ with an error smaller than an arbitrarily small value $e_{NN,i}$), applying Equation (1) we have

$$\begin{aligned}
\frac{\partial}{\partial \theta_{\tau,ij}} \mathbb{E}_{s_{\tau,ij}}\left[\mathcal{L}_{graph}\right]
&= \sigma'(\theta_{\tau,ij}) \left( c_j\, o_{t,j} \left( e_{t,j}^2 - \left(x_{t,j} - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=0})\right)^2 \right) + \lambda \right) \\
&= \sigma'(\theta_{\tau,ij}) \left( c_j\, o_{t,j} \left( e_{t,j}^2 - \left(e_{t,j} + \Delta f_{i,j}\right)^2 \right) + \lambda \right) \\
&= \sigma'(\theta_{\tau,ij}) \left( c_j\, o_{t,j} \left( -2 e_{t,j} \Delta f_{i,j} - \Delta^2 f_{i,j} \right) + \lambda \right),
\end{aligned}$$

where the noise term $e_{t,j} \sim \mathcal{N}(0, \sigma)$ and $\Delta f_{i,j} = f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=1}) - f_{\phi_j}(\mathbf{X} \odot \mathbf{S}_{\tau,ij=0})$. This gradient is expected to be negative when $\forall i, j = 1, \dots, N$, $\mathbb{E}(c_j \Delta^2 f_{i,j}) \geq p\lambda_0 > \lambda$, where $p$ is the missing probability, i.e., $\mathbb{E}[c_j] = p$ (here we only consider the random missing scenario). Since we can certainly find a $\lambda$ satisfying the above inequality, $\theta_{\tau,ij}$ will go towards $+\infty$ with a properly chosen $\lambda$, and $m_{\tau,ij} \to 1$. Moreover, we show in Appendix Section A.4.6 that CUTS is robust to a wide range of $\lambda$ values. When applied to real data, we use the Gumbel Softmax estimator for improved performance (Jang et al., 2016).

A.1.2 THE EFFECTS OF DATA IMPUTATION

To show why data imputation boosts causal discovery, suppose that $x_{t-\tau',j}$, a parent node of $x_{t,i}$, is unobserved and imperfectly imputed as $\hat{x}_{t-\tau',j} \neq x_{t-\tau',j}$; then $f(\dots, \hat{x}_{t-\tau',j}, \dots) \neq f(\dots, x_{t-\tau',j}, \dots)$. Let $\delta_{\tau',ij} = f(\dots, x_{t-\tau',j}, \dots) - f(\dots, \hat{x}_{t-\tau',j}, \dots)$, and

$$\begin{aligned}
\frac{\partial}{\partial \theta_{\tau,ij}} \mathbb{E}_{s_{\tau,ij}}\left[\mathcal{L}_{graph}\right]
&= \sigma'(\theta_{\tau,ij}) \left( c_i\, o_{t,i} \left( \left(e_{t,i} + \delta_{\tau',ij}\right)^2 - \left(e_{t,i} + \delta_{\tau',ij} + \Delta f_{i,j}\right)^2 \right) + \lambda \right) \\
&= \sigma'(\theta_{\tau,ij}) \left( c_i\, o_{t,i} \left( -2\left(e_{t,i} + \delta_{\tau',ij}\right) \Delta f_{i,j} - \Delta^2 f_{i,j} \right) + \lambda \right).
\end{aligned}$$

The expectation over the noise is

$$\mathbb{E}_{e_{t,i}}\!\left[ \frac{\partial}{\partial \theta_{\tau,ij}} \mathbb{E}_{s_{\tau,ij}}\left[\mathcal{L}_{graph}\right] \right] = \sigma'(\theta_{\tau,ij}) \left( c_i\, o_{t,i} \left( -2 \delta_{\tau',ij} \Delta f_{i,j} - \Delta^2 f_{i,j} \right) + \lambda \right).$$

As a result, if we cannot find a lower bound for $\delta_{\tau',ij}$, the gradient for $\theta_{\tau,ij}$ is not guaranteed to be positive or negative, and the true Granger causal relation cannot be recovered. On the other hand, if $x_{t-\tau',j}$ is appropriately imputed with $|\delta_{\tau',ij}| \leq \delta < \frac{\lambda_0}{2}$, we can find $\lambda < p\lambda_0 - p\delta$ to ensure a negative gradient, and $\theta_{\tau,ij}$ will go towards $+\infty$.

A.1.3 CONVERGENCE OF SIGMOIDAL GRADIENTS

We now analyze the update sequence driven by sigmoidal gradients with learning rate $\alpha$ (for simplicity we write $\theta$ for $\theta_{\tau,ij}$):

$$\theta_k = \theta_{k-1} + \alpha \lambda \sigma'(\theta_{k-1}).$$

This is a monotonically increasing sequence, and we show that it diverges to $+\infty$ for any $\alpha > 0$. Suppose not; then $\exists M > 0$ such that $\forall i > 0$, $\theta_i \leq M$. Since the sequence is monotonically increasing,

$$\theta_{k+1} = \theta_k + \alpha \lambda \frac{e^{-\theta_k}}{\left(1 + e^{-\theta_k}\right)^2} \geq \theta_k + \alpha \lambda \frac{e^{-\theta_k}}{\left(1 + e^{-\theta_0}\right)^2} \geq \theta_k + \alpha \lambda \frac{e^{-M}}{\left(1 + e^{-\theta_0}\right)^2},$$

so the increments are bounded below by a positive constant and $\exists k$ with $\theta_k > M$, which contradicts "$\forall i > 0$, $\theta_i \leq M$". Hence $\theta_k \to +\infty$, and for any finite $M$, $\theta_k$ exceeds $M$ in finitely many steps. Likewise, the sequence $\theta_k = \theta_{k-1} - \alpha \lambda \sigma'(\theta_{k-1})$ drops below $-M$ in finitely many steps. This enables us to choose a threshold to classify causal and non-causal edges.
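The contradiction argument can be checked numerically: the increments $\alpha\lambda\sigma'(\theta)$ shrink as $\theta$ grows, yet the iterate passes any finite bound $M$ in finitely many steps (a small demo; the step cap is a generous safety bound of ours):

```python
import math

def sigmoid_prime(t):
    s = 1.0 / (1.0 + math.exp(-t))
    return s * (1.0 - s)

def steps_to_exceed(theta0, alpha, lam, M, max_steps=10**6):
    """Iterate theta_k = theta_{k-1} + alpha*lam*sigma'(theta_{k-1})
    until theta exceeds M; returns (theta, number of steps taken)."""
    theta, k = theta0, 0
    while theta <= M and k < max_steps:
        theta += alpha * lam * sigmoid_prime(theta)
        k += 1
    return theta, k
```

For example, starting from $\theta_0 = 0$ with $\alpha\lambda = 1$, the iterate passes $M = 5$ well before the step cap, matching the finite-step claim.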

A.2 AN EXAMPLE FOR IRREGULAR TIME-SERIES CAUSAL DISCOVERY

In this section we provide a simple example of irregular causal discovery and show that our algorithm is capable of recovering causal graphs from irregular time-series. Suppose we have a dataset with three time-series $x_1, x_2, x_3$, generated with

$$x_{t,1} = e_{t,1}, \quad x_{t,2} = f_2(x_{t-1,1}) + e_{t,2}, \quad x_{t,3} = f_3(x_{t-1,1}, x_{t-1,2}) + e_{t,3},$$

where $e_1, e_2, e_3$ are noise terms following $\mathcal{N}(0, \sigma)$. We assume only $x_2$ is irregularly sampled, with missing probability $p_2$:

$$o_{t,1} = 1, \quad o_{t,2} \sim \mathrm{Ber}(1 - p_2), \quad o_{t,3} = 1,$$

where $\mathrm{Ber}(\cdot)$ denotes the Bernoulli distribution. The ground-truth causal relations are illustrated in Figure 3 (left). We use a DSGNN $f_{\phi_2}$ to fit $f_2$, supervised on the observed data points of $x_2$, i.e.,

$$\min_{\phi_2} L_2\left(x_{t,2},\, f_{\phi_2}(x_{t-1,1})\right), \quad \forall t \text{ s.t. } o_{t,2} = 1.$$

Given $f_{\phi_2}$, the unobserved values of $x_2$ can be imputed with $\hat{x}_{t,2} = f_{\phi_2}(x_{t-1,1})$, and we fit $f_3(\cdot)$ with $f_{\phi_3}(\cdot)$ in the Latent data prediction stage:

$$\arg\min_{\phi_3} L_2\left(x_{t,3},\, f_{\phi_3}(x_{t-1,1}, \hat{x}_{t-1,2})\right) = \arg\min_{\phi_3} L_2\left(x_{t,3},\, f_{\phi_3}\left(x_{t-1,1}, f_{\phi_2}(x_{t-2,1})\right)\right),$$

and the CPG $\mathbf{M}_\tau$ is optimized in the Causal graph fitting stage with

$$\arg\min_{\mathbf{M}_1} L_2\left(x_{t,3},\, f_{\phi_3}\left(x_{t-1,1}\, s_{1,13},\; f_{\phi_2}(x_{t-2,1})\, s_{1,23},\; x_{t-1,3}\, s_{1,33}\right)\right) + \lambda \sum_{i=1}^{3} \sigma(\theta_{1,i3}),$$

where $s_{1,ij}$ is sampled with the Gumbel Softmax technique of Equation (21). Since $x_{t-1,3}$ is invariant to the prediction of $x_{t,3}$ given $x_{t,1}$ and $x_{t,2}$, $s_{1,33}$ can be penalized to zero with a proper $\lambda$. We conduct an experiment to verify this example, setting $L = 10000$ and random missing probability $p_2 = 0.2$. The discovered causal relations are illustrated in Figure 3. Results show that CUTS without data imputation tends to ignore causal relations from $x_2$ (which has missing values) to other time-series; the causal relation $x_2 \to x_3$ is instead "replaced" by $x_3 \to x_3$, which leads to incorrect causal discovery results.
A.3.1 GUMBEL SOFTMAX FOR CPG OPTIMIZATION

In our proposed CUTS, causal relations are modeled with Causal Probability Graphs (CPGs), which describe the probability of Granger causal relations. However, the distributions of CPGs are discrete and cannot be updated directly with neural networks in Causal graph fitting stage. To achieve a continuous approximation of the discrete distribution, we leverage the Gumbel Softmax technique (Jang et al., 2016), which can be denoted as

s_{τ,ij} = exp((log(m_{τ,ij}) + g) / T) / [exp((log(m_{τ,ij}) + g) / T) + exp((log(1 − m_{τ,ij}) + g') / T)],

where g, g' = −log(−log(u)) with independent u ∼ Uniform(0, 1). The temperature T is set according to the "Gumbel tau" item in Table 4. During training we first set a relatively large value of T and decrease it slowly.

A.3.2 INITIAL DATA FILLING

The missing data points are filled with Zero-Order Hold (ZOH) before the iterative learning process, to provide an initial guess x̂^{(0)}. An intuitive alternative for initial filling is linear interpolation, but it hampers the successive causal discovery. For example, if x_{t-2,i} and x_{t,i} are observed and x_{t-1,i} is missing, linear interpolation fills x̂^{(0)}_{t-1,i} = (x_{t-2,i} + x_{t,i}) / 2; then x_{t,i} can be directly predicted as 2 x̂^{(0)}_{t-1,i} − x_{t-2,i}, and other time-series cannot help the prediction of x_{t,i} even if Granger causal relationships exist. To show the limitation of filling with linear interpolation, we conducted an ablation study on VAR datasets with Random Missing (p = 0.6). In this experiment, initial data filling with ZOH achieves an AUROC of 0.9766 ± 0.0074, while linear interpolation achieves an inferior 0.9636 ± 0.0145. This validates that Zero-Order Hold is the better option for initial filling.
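The information leak caused by linear interpolation can be verified on a three-point toy series. The helper below is a minimal ZOH sketch (not the paper's implementation): with linear filling, x_t is recovered exactly from its own past, so the sparsity penalty has no reason to keep edges from other series.

```python
import numpy as np

def zoh_fill(x, o):
    """Zero-Order Hold: carry the last observed value forward over missing points."""
    x = x.copy()
    last = x[0]
    for t in range(len(x)):
        if o[t]:
            last = x[t]
        else:
            x[t] = last
    return x

# Toy series with x_{t-1,i} missing between two observed neighbors
x = np.array([1.0, np.nan, 3.0])
o = np.array([True, False, True])

x_zoh = zoh_fill(x, o)
x_lin = x.copy()
x_lin[1] = 0.5 * (x[0] + x[2])  # linear interpolation: x̂_{t-1} = (x_{t-2} + x_t)/2

# With linear filling, x_t is exactly recovered by 2*x̂_{t-1} - x_{t-2},
# so the regression needs no other time-series to predict x_t.
print(x_zoh[1], 2 * x_lin[1] - x[0])  # 1.0 and 3.0
```

ZOH introduces no such deterministic shortcut, which matches the AUROC gap reported above.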

A.3.3 HYPERPARAMETERS SETTINGS

To fit the data generation function f_i we use a DSGNN f_{ϕ_i} for each time-series i. Each DSGNN contains a Multilayer Perceptron (MLP); the layer numbers and hidden feature dimensions are shown in Table 4. For the activation function we use LeakyReLU (with a negative slope of 0.05). During training we use the Adam optimizer with different learning rates for Latent data prediction stage and Causal graph fitting stage (shown as "Stage 1 Lr" and "Stage 2 Lr" in Table 4), together with a learning rate scheduler. The input step of f_{ϕ_i} also determines the chosen maximal time lag for causal discovery. For the VAR and Lorenz-96 datasets we already know the maximal time lag of the underlying dynamics (τ_max = 3), while for the NetSim datasets this parameter is chosen empirically. For the baseline algorithms we choose parameters mainly according to the original papers or official repositories (PCMCI, eSRU, NGC, GRIN; see the repository links below). For fair comparison, we applied parameter searching to determine the key hyperparameters of the baseline algorithms with the best performance. Tuned parameters are listed in Table 5. In Table 4, the learning rates decay from 10^{-2} to 10^{-3} on all four datasets, the Gumbel τ anneals from 1 to 0.1, and λ is set per dataset to 0.1, 0.3, 5, and 5, respectively.
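The "a_1 → a_2" schedules in Table 4 (learning rate decay and Gumbel temperature annealing) can be realized with a simple exponential interpolation. The helper below is a hypothetical sketch of such a scheduler, not the paper's exact implementation.

```python
import numpy as np

def exp_schedule(a1, a2, n_steps):
    """Exponentially interpolate a hyperparameter from a1 to a2 over n_steps,
    matching the "a1 -> a2" notation in Table 4 (e.g. lr 1e-2 -> 1e-3)."""
    return a1 * (a2 / a1) ** (np.arange(n_steps) / (n_steps - 1))

lrs = exp_schedule(1e-2, 1e-3, 100)   # learning rate decay
taus = exp_schedule(1.0, 0.1, 100)    # Gumbel temperature annealing
print(lrs[0], lrs[-1], taus[0], taus[-1])
```

Exponential (rather than linear) interpolation keeps the relative step-to-step change constant, which is the usual choice for both learning rates and softmax temperatures.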



Code and data links:
PCMCI: https://github.com/jakobrunge/tigramite
eSRU: https://github.com/sakhanna/SRU_for_GCI
NGC: https://github.com/iancovert/Neural-GC
GRIN: https://github.com/Graph-Machine-Learning-Group/grin
NetSim data: shared at https://www.fmrib.ox.ac.uk/datasets/netsim/sims.tar.gz



Figure 1: Illustration of the proposed CUTS, with a 3-variable example. (a) Illustration of our learning strategy described in Section 4.3, with three groups of iterations being of the same alternation scheme shown in (b) but different settings in data imputation and supervised model learning. (b) Illustration of each iteration in CUTS. The dynamics reflected by the observed time-series x 1 and x 2 are described by DSGNN in the Latent data prediction stage (left). With the modeled dynamics, unobserved data points are imputed (center) and fed into the Causal graph fitting stage for an improved graph inference (right).

Figure 2: Examples of our simulated VAR and Lorenz-96 datasets. Two of the total 10 generated time-series from the groundtruth CPG are plotted as orange and blue solid lines, while the non-uniformly sampled points are labeled with scattered points.

For baseline algorithms unable to handle irregular time-series data, i.e., NGC, PCMCI, and eSRU, we imputed the irregular time-series before feeding them to the causal discovery modules, using three data imputation algorithms: Zero-Order Hold (ZOH), Gaussian Process Regression (GP), and Multivariate Time Series Imputation by Graph Neural Network (GRIN, Cini et al. (2022)).

Figure 3: A three-time-series example demonstrating the advantages of introducing data imputation, with the groundtruth causal graph in the left column. The causal graph recovered without data imputation (middle column) shows some false positive and false negative edges, while CUTS (right column) exhibits perfect results.

Figure 4: Average MSE curves of the imputed data on VAR datasets with Random Missing / Periodic Missing (top), Lorenz-96 datasets with Random Missing / Periodic Missing (middle), and NetSim datasets with Random Missing (bottom).

Performance comparison of CUTS with (i) PCMCI, eSRU, and NGC combined with the imputation methods ZOH, GP, and GRIN, and (ii) LCCM and NGM, which do not need data imputation. Experiments are performed on VAR and Lorenz-96 datasets in terms of AUROC. Results are averaged over 10 randomly generated datasets.

Quantitative results on NetSim dataset. Results averaged over 10 human brain subjects.

Quantitative results of ablation studies. "CUTS (Full)" denotes the default settings in this paper. Here we run experiments on Lorenz-96 datasets; ablation study results on other datasets are provided in Appendix Section A.4.2.

In this ablation we remove the CPGs from Latent data prediction stage, so that x̂_{t,i} is predicted with all time-series instead of only the parent nodes. This experiment is shown as "Remove CPG for Imput." in Table 6. It is observed that introducing CPGs in data imputation is especially helpful with large quantities of missing values (p = 0.6 for Random Missing or T_max = 4 for Periodic Missing). Comparing with the scores in the first row, we can see that introducing CPGs in data imputation boosts AUROC by 0.0011 ∼ 0.0170.

Data Imputation Boosts Causal Discovery. To show that Latent data prediction stage helps Causal graph fitting stage, we disable the data imputation operation defined in Equation

Hyperparameter settings of CUTS in the aforementioned experiments. "a_1 → a_2" means the parameter increases/decreases exponentially from a_1 to a_2.

Hyperparameter settings of the baseline causal discovery and data imputation algorithms.

Quantitative results of ablation studies on the VAR dataset. "CUTS (Full)" denotes the default settings in this paper. The highest scores (or multiple ones with negligible gaps) in each column are bolded for clearer illustration.

Quantitative comparison of learning step numbers, in terms of AUROC. We set n_1, n_2, n_3 proportional to the original settings, e.g., if the original setting is n_1 = 50, n_2 = 250, n_3 = 200, then "50% Steps" means n_1 = 25, n_2 = 125, n_3 = 100.

Accuracy of CUTS on Lorenz-96 datasets with different noise levels. The accuracy is calculated in terms of AUROC.

The mean squared error (MSE) between the imputed time-series (with and without the help of the causal graph) and the groundtruth time-series during the whole training process is shown in Figure 4. We can see that under all configurations our approach successfully imputes missing values, with significantly lower MSE than the initially filled values. Furthermore, in most settings the time-series imputed without the help of the causal graph are prone to overfitting. The imputed time-series thus boost the subsequent causal discovery module, and the discovered causal graph in turn helps prevent overfitting in imputation.

Quantitative comparison for 3-dimensional temporal causal graph discovery on VAR datasets, in terms of AUROC.

Accuracy of CUTS and five other baseline causal discovery algorithms on VAR, Lorenz-96, NetSim, and DREAM-3 datasets without missing values. The accuracy is calculated in terms of AUROC.

Accuracy of causal discovery results of CUTS under different hyperparameters λ and τ max settings.

Comparison of CUTS with (i) PCMCI, eSRU, and NGC combined with the imputation methods ZOH, GP, and GRIN, and (ii) LCCM and NGM, which do not need data imputation. Results are averaged over 4 randomly generated datasets.

Algorithm 2 Causal graph fitting stage
Input: Time-series dataset {x_{1:L,1}, ..., x_{1:L,N}}; observation mask {o_{1:L,1}, ..., o_{1:L,N}}; Adam optimizer Adam(·); Gumbel softmax function Gumbel(·) described with Equation 21
Output: Causal probabilities m_{τ,ji}, ∀j = 1, ..., N
for i = 1 to N do
    x̂_{t,i} ← f_{ϕ_i}(x_{t-τ:t-1,i} ⊙ s_{1:τ,ij}), with s_{τ,ij} = Gumbel(1 − m_{τ,ij})
    L_graph(X̂, X, O, θ) = L_pred(X̂, X, O) + λ ||σ(θ)||_1
    θ ← Adam(θ, L_graph)
end for

Algorithm 3 Causal Discovery from Irregular Time-series (CUTS)
Input: Time-series dataset {x_{1:L,1}, ..., x_{1:L,N}} with time-series length L; observation mask {o_{1:L,1}, ..., o_{1:L,N}}; Zero-Order Hold (ZOH) imputation algorithm ZOH(·); Adam optimizer Adam(·)
Output: Discovered causal graph
Initialize x̂^{(0)}_{1:L,i} = ZOH(x_{1:L,i}), Causal Probability Graphs M_τ = 0, ∀τ = 1, ..., τ_max
# Warming up
for n_1 iterations do
    Update {ϕ_1, ..., ϕ_N} with Algorithm 1
    Update M_τ with Algorithm 2
end for
# Causal discovery with data imputation
for n_2 iterations do
    Update {ϕ_1, ..., ϕ_N} with Algorithm 1
    Update M_τ with Algorithm 2
end for
Reset o_{t,i} ← 1, ∀t = 1, ..., T, i = 1, ..., N
for n_3 iterations do
    Update {ϕ_1, ..., ϕ_N} with Algorithm 1
    Update M_τ with Algorithm 2
end for
for i = 1 to N do
    for j = 1 to N do
        ã_{i,j} = max(m_{1,ij}, ..., m_{τ_max,ij})
    end for
end for
return discovered causal adjacency matrix Â, where each element is ã_{i,j}
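The control flow of the iterative framework can be summarized in a short skeleton. The stage-update functions below are placeholder stubs standing in for Algorithms 1 and 2 (the real versions train the DSGNNs and the CPG logits); only the loop structure is meant to be illustrative.

```python
import numpy as np

# Hypothetical stand-ins for the Algorithm 1/2 updates
def update_prediction_stage(phi, x_hat):   # Latent data prediction stage (stub)
    return phi
def update_graph_stage(m, x_hat):          # Causal graph fitting stage (stub)
    return m

def cuts_loop(x, o, n1, n2, n3, tau_max=3):
    """Skeleton of the three-phase schedule: warm-up, discovery with imputation,
    then refinement with the observation mask reset to all-observed."""
    L, N = x.shape
    x_hat = x.copy()                       # ZOH initial filling assumed done upstream
    M = [np.zeros((N, N)) for _ in range(tau_max)]
    phi = [None] * N
    for _ in range(n1 + n2):               # warm-up, then discovery with imputation
        phi = update_prediction_stage(phi, x_hat)
        M = [update_graph_stage(m, x_hat) for m in M]
    o = np.ones_like(o)                    # reset mask: treat all points as observed
    for _ in range(n3):                    # final refinement
        phi = update_prediction_stage(phi, x_hat)
        M = [update_graph_stage(m, x_hat) for m in M]
    return np.stack(M).max(axis=0)         # summary graph: max over lags

A = cuts_loop(np.zeros((100, 3)), np.ones((100, 3)), n1=2, n2=2, n3=2)
print(A.shape)  # (3, 3)
```

With the stubs, the CPG stays at zero; the point is the sequencing of the three loops and the final max-over-lags reduction.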

ACKNOWLEDGMENTS

This work is jointly funded by the Ministry of Science and Technology of China (Grant No. 2020AAA0108202), the National Natural Science Foundation of China (Grants No. 61931012 and 62088102), the Beijing Natural Science Foundation (Grant No. Z200021), and the Project of Medical Engineering Laboratory of Chinese PLA General Hospital (Grant No. 2022SYSZZKY21).


We proved in Theorem 1 that CUTS discovers the correct Granger causality under the following assumptions:
1. DSGNN f_{ϕ_i} in Latent data prediction stage models the generative function f_i with an error smaller than an arbitrarily small value e_{NN,i};
2. In Causal graph fitting stage the loss function

Published as a conference paper at ICLR 2023

A.4 ADDITIONAL EXPERIMENTS

A.4.1 DREAM-3 EXPERIMENTS

DREAM-3 (Prill et al., 2010) is a gene expression and regulation dataset used in many causal discovery works as a quantitative benchmark (Khanna & Tan, 2020; Tank et al., 2022). This dataset contains 5 models, each representing measurements of 100 gene expression levels. Each measured trajectory has a time length of T = 21, which is too short to perform random missing or periodic missing experiments, so on DREAM-3 we only compare our approach with baselines in regular time-series scenarios. The results are shown in Table 11.

A.4.2 ABLATION STUDY ON VAR AND NETSIM DATASETS

Besides the ablation studies on Lorenz-96 datasets shown in Table 3, we additionally report those on VAR and NetSim in Tables 6 and 7. In Table 6, one can see that "CUTS (Full)" beats the other configurations in most scenarios, and the advantage is more obvious with higher missing percentages (p = 0.6 for Random Missing and T_max = 4 for Periodic Missing). On the NetSim datasets, where the data length L = 200 is quite small, "CUTS (Full)" beats the other configurations at a small missing probability (p = 0.1).

A.4.3 ABLATION STUDY FOR EPOCH NUMBERS

In our proposed CUTS, each step can be recognized as a refinement of causal discovery that builds upon previous imputation results. Since data imputation and causal discovery mutually boost each other, the performance may be affected by the settings of the learning steps. In Table 8 we conduct experiments to show the impact of different epoch numbers on the VAR, Lorenz-96, and NetSim datasets, setting n_1, n_2, n_3 proportional to the original settings.

A.4.4 PERFORMANCE ON TEMPORAL CAUSAL GRAPH

In the previous experiments, we compute causal summary graphs as ã_{i,j} = max_{τ=1,...,τ_max} m_{τ,ij}, i.e., the maximal causal effect along the time axis. CUTS also supports the discovery of the 3-dimensional temporal graph {m_{τ,ij}}. We conduct experiments to investigate the performance for temporal causal graph discovery; the results are shown in Table 10.
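The reduction from the temporal graph to the summary graph is a one-line max over the lag axis. In the sketch below, the random `m` is a hypothetical stand-in for learned CPG values.

```python
import numpy as np

rng = np.random.default_rng(0)
tau_max, N = 3, 4
# Hypothetical temporal CPG: m[tau, i, j] = causal probability of x_i -> x_j at lag tau+1
m = rng.random((tau_max, N, N))

# Causal summary graph: ã_{i,j} = max over lags of m_{τ,ij}
a_summary = m.max(axis=0)
print(a_summary.shape)  # (4, 4)
```

Evaluating AUROC directly on the full 3-dimensional array {m_{τ,ij}} (instead of on the summary) corresponds to the temporal-graph experiments reported in Table 10.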

A.4.5 CAUSAL DISCOVERY WITH STRUCTURED TIME-SERIES DATA

We show in this section that CUTS is able to recover causal relations not only from irregular time-series but also from regular time-series, which are widely used for performance comparison in previous works. We again test our algorithm on the VAR, Lorenz-96, and NetSim datasets; the results, shown in Table 11, indicate that our algorithm is superior to the baseline methods.

A.4.6 ROBUSTNESS TO HYPERPARAMETERS SETTINGS

We show that CUTS is robust to changes in hyperparameter settings, with experimental results listed in Table 12. For existing Granger-causality-based methods such as NGC (Tank et al., 2022) and eSRU (Khanna & Tan, 2020), the parameters λ and the maximum time lag τ_max often need to be tuned precisely. Empirically, λ is chosen to balance the sparsity of the inferred causal relationships against data prediction accuracy, and τ_max is chosen according to the estimated maximum time lag. In this work we find that CUTS gives similar causal discovery results across a wide range of λ (0.01 ∼ 0.3) and τ_max (3 ∼ 9).

A.4.7 LORENZ-96 DATASETS WITH F=40

We further conducted experiments on Lorenz-96 datasets with the external forcing constant F = 40 instead of the F = 10 used in Section 5.2. Our approach still produces promising results with p = 0.3 for Random Missing and T_max = 2 for Periodic Missing, achieving AUROC scores higher than 0.9, as shown in Table 13.

We also experimentally show that CUTS is robust to noise, as reported in Table 9. We choose the nonlinear Lorenz-96 datasets for this experiment (L = 1000, F = 10) and add Gaussian white noise with standard deviation σ = 0.1, 0.3, 1, respectively.

A.5 PSEUDOCODE FOR CUTS

We provide the pseudocode of the two mutually boosting modules of the proposed CUTS in Algorithms 1 and 2, respectively, and of the whole iterative framework in Algorithm 3. A detailed implementation is provided in the supplementary materials and will be uploaded to GitHub soon.

