CAUSAL ATTENTION TO EXPLOIT TRANSIENT EMERGENCE OF CAUSAL EFFECT

Anonymous

Abstract

We propose a causal reasoning mechanism called causal attention that can improve the performance of machine learning models on a class of causal inference tasks by revealing the generation process behind the observed data. We consider the problem of reconstructing causal networks (e.g., biological neural networks) connecting large numbers of variables (e.g., nerve cells), whose evolution is governed by nonlinear dynamics consisting of a weak coupling-drive (i.e., the causal effect) and a strong self-drive (which dominates the evolution). The core difficulty is the sparseness of the causal effect, which emerges (i.e., the coupling force is significant) only momentarily and otherwise remains quiescent in the neural activity sequence. Causal attention is designed to guide the model to make inferences focusing on the critical regions of time series data where causality may manifest. Specifically, attention coefficients are assigned autonomously by a neural network trained to maximise the Attention-extended Transfer Entropy, which is a novel generalization of the iconic transfer entropy metric. Our results show that, without any prior knowledge of the dynamics, causal attention explicitly identifies areas where the strength of the coupling-drive is distinctly greater than zero. This innovation substantially improves reconstruction performance for both synthetic and real causal networks using data generated by neuronal models widely used in neuroscience.

1. INTRODUCTION

In this work, our task is to infer causal relationships between observed variables based on time series data and to reconstruct the causal network connecting large numbers of these variables. Assume the time series $x_{it}$ records the time evolution of variable $i$ governed by coupled nonlinear dynamics, as represented by a general differential equation $\dot{x}_{it} = g(x_{it}) + \sum_{j=1}^{N} B_{ij} f(x_{it}, x_{jt})$, where $g$ and $f$ are the self- and coupling functions respectively. The parent variable influences the dynamic evolution of the child variable via the coupling function $f$. Note that these two functions are hidden and usually unknown for real systems. The asymmetric adjacency matrix $B$ represents the causal, i.e., directional coupling relationship between variables. Hence, the goal is to infer the matrix $B$ from the observed time series $x_{it}$, $i = 1, 2, \ldots, N$, where $N$ is the number of variables in the system. $B_{ij} = 1$ if variable $i$ is a coupling driver (parent variable) of variable $j$; otherwise $B_{ij} = 0$.

The key challenge is that the causal effect in neural dynamics (e.g., biological neural systems observed via neuronal activity sequences) is too weak to be detected, rendering powerless the classic unsupervised techniques of causal inference developed across multiple research communities Granger (1969); Schreiber (2000); Sugihara et al. (2012); Sun et al. (2015); Nauta et al. (2019); Runge et al. (2019); Gerhardus & Runge (2020); Tank et al. (2021); Mastakouri et al. (2021). This difficulty manifests in three aspects. First, the dynamics contain self-drive and coupling-drive. The strength of the coupling $f(x_{it}, \cdot)$ is usually many orders of magnitude smaller than the self-drive $g(x_{it})$, and the latter dominates the evolution. Second, the behavior of the coupling-drive is chaotic, unlike in linear models Shimizu et al. (2006); Xie et al. (2020).
The resulting unpredictability and variability of the system state mean that the coupling force can be significant momentarily and otherwise almost vanish, as illustrated in Figure 3 (gray lines). This dilutes the information in the time series that can be useful for inferring the causal relationship. Third, in the heterogeneous networks common in applications, some variables are hubs coupled with many parent variables, among which it is difficult to distinguish individual causes.

When causal effects are weak, we do not clearly observe the principle of Granger causality, whereby the parent variable can help to explain the future change in its child variable Pfister et al. (2019). Rather, when we train a machine learning model Nauta et al. (2019); Tank et al. (2021) for a prediction task on the neuronal activity sequences, the model only exploits the historical information of the child variable itself, and the information from parent variables is ignored. We posit that the coupling-drive makes a negligible contribution to the dynamic evolution in the majority of samples of time series data. In other words, only in a small fraction of samples is the information of parent variables effective in predicting the evolution of child variables. Taking as an example a gradient algorithm that minimises the regression error over all samples, $\sum_t (x_{it} - \hat{x}_{it})^2$, the adjustment of model parameters contributed by the tiny fraction of samples with significant coupling force is negligible, but these are the only samples which could induce the model to exploit causal effects in reducing the regression error. Similarly, for transfer entropy Schreiber (2000), which measures the reduction in uncertainty which a potential parent variable provides to a potential child variable, there is no significant difference in measured value between ordered pairs of variables with and without causality.

To overcome this difficulty, we propose a causal reasoning mechanism, causal attention, to identify the moments when the causal effect emerges.
We design an objective function, Attention-extended Transfer Entropy (AeTE), comprising a weighted generalisation of transfer entropy. In order to maximize AeTE, the causal attention mechanism trains neural networks to autonomously allocate high attention coefficients $a_t$ at times $t$ where information of parent variables effectively reduces the uncertainty of child variables, and to ignore other positions by setting $a_t$ close to zero. If we consider each value in a time series as a feature, the operation of attention allocation is also equivalent to removing the non-causal features Kusner et al. (2017); Hu et al. (2021). However, noise in empirical samples may also produce high transfer entropy regions, which leads to spurious causal effects even when using causal attention. We therefore add a binary classification model that performs more sophisticated inference under the guidance of causal attention, focusing on these critical regions and recognizing the different patterns of noise and of sparse emergence of causal effect.

We deal with this class of causal inference task by way of small sample supervised learning. Although training and test data have a distribution shift in the setting of small samples, they arise through an identical underlying generation process. Thus, if the model provides an insight into the underlying dynamics (the coupling-drive, for causal inference), then the understanding acquired from small samples can be effectively utilised in the test environment Bareinboim & Pearl (2014); Battaglia et al. (2016); Makhlouf et al. (2020); Pessach & Shmueli (2022). The role of causal attention is to help the classification model gain this insight.

Our contributions are summarized as follows:

1. We introduce causal attention, a causal reasoning mechanism to identify the positions of time series at which the causal effect emerges and to guide a classification model to infer causality focusing on these critical positions. Without any prior knowledge of the dynamics, the mechanism determines the areas where the coupling force is substantially different from zero.
2. By formulating Transfer Entropy as the difference between two types of mutual information, and based on the dual representation of the Kullback-Leibler (KL) divergence, we design a differentiable metric, Attention-extended Transfer Entropy, as the objective function of the proposed causal attention mechanism.
3. Our method significantly improves performance on synthetic and real causal networks using the data generated by five well-known neural dynamic models, and the number of labels required is very small compared to the size of the causal networks.

Our methodology has limitations (i.e., cases for which the performance improvement is smaller):

1. Dense networks, where a variable is coupled with many driving variables such that their causal effects overlap and are harder to distinguish.
2. Intense noise, which makes the causal attention mechanism falsely identify high transfer entropy regions. The downstream classifier then extracts non-causal features, which reduces its generalization.
3. Strongly coupled systems, which are dominated by synchronization phenomena in which the dynamic behaviors of all variables are similar.

2.1. DEFINITION OF TRANSFER ENTROPY

The transfer entropy, an information-theoretic causality measure, is able to detect information flow between time series $X$ and $Y$. Transfer entropy measures the degree of non-symmetric dependence of $Y$ on $X$, defined as Schreiber (2000):

$$TE(X \to Y) = \sum p\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right) \log \frac{p\left(y_{t+1} \mid y_t^{(k)}, x_t^{(l)}\right)}{p\left(y_{t+1} \mid y_t^{(k)}\right)}, \tag{1}$$

where $x_t^{(l)} = (x_t, \ldots, x_{t-l+1})$ and $y_t^{(k)} = (y_t, \ldots, y_{t-k+1})$, and $k$, $l$ are the history lengths. For an uncoupled system ($X$ and $Y$ independent) that can be approximated by a Markov process of order $k$, the conditional probability to find $Y$ in state $y_{t+1}$ at time $t+1$ satisfies $p\left(y_{t+1} \mid y_t^{(k)}, x_t^{(l)}\right) = p\left(y_{t+1} \mid y_t^{(k)}\right)$.
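As a concrete illustration of Eq. 1, the following is a minimal sketch of a plug-in (counting) estimate for discrete-valued series with $k = l = 1$. The function name `transfer_entropy` and the toy binary series are assumptions made for this example only; this is not the neural estimator used in this work.

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in estimate of TE(X -> Y) in nats for discrete series, with k = l = 1."""
    x, y = np.asarray(x), np.asarray(y)
    triples = list(zip(y[1:], y[:-1], x[:-1]))            # (y_{t+1}, y_t, x_t)
    n = len(triples)
    c_full = Counter(triples)                             # counts of (y_{t+1}, y_t, x_t)
    c_yx = Counter((yt, xt) for _, yt, xt in triples)     # counts of (y_t, x_t)
    c_yy = Counter((yn, yt) for yn, yt, _ in triples)     # counts of (y_{t+1}, y_t)
    c_y = Counter(yt for _, yt, _ in triples)             # counts of y_t
    te = 0.0
    for (yn, yt, xt), c in c_full.items():
        p_given_both = c / c_yx[(yt, xt)]                 # p(y_{t+1} | y_t, x_t)
        p_given_self = c_yy[(yn, yt)] / c_y[yt]           # p(y_{t+1} | y_t)
        te += (c / n) * np.log(p_given_both / p_given_self)
    return te

# Toy example (hypothetical data): y copies x with a one-step delay,
# and the copied bit is flipped 10% of the time.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20000)
flip = rng.random(20000) < 0.1
y = np.empty_like(x)
y[0] = 0
y[1:] = np.where(flip[1:], 1 - x[:-1], x[:-1])

te_xy = transfer_entropy(x, y)   # information flows from x to y
te_yx = transfer_entropy(y, x)   # no flow in the reverse direction
```

In this toy setting the estimate is clearly positive in the direction $X \to Y$ and close to zero in the reverse direction, which is the asymmetry that Eq. 1 is meant to capture.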

2.2. MUTUAL INFORMATION NEURAL ESTIMATION

The mutual information is equivalent to the KL divergence between the joint distribution $P_{XY}$ and the product of the marginal distributions $P_X \otimes P_Y$ Nowozin et al. (2016); Hjelm et al. (2018). The KL divergence $D_{KL}$ admits the neural dual representation Donsker & Varadhan (1983); Belghazi et al. (2018):

$$MI(X, Y) = D_{KL}\left(P_{XY} \,\|\, P_X \otimes P_Y\right) \ge \sup_{\theta \in \Theta} \mathbb{E}_{P_{XY}}\left[f_\theta\right] - \log \mathbb{E}_{P_X \otimes P_Y}\left[e^{f_\theta}\right], \tag{2}$$

where the supremum is taken over the parameter space $\Theta$ and $f_\theta$ is the family of functions parameterized by the neural network with parameters $\theta \in \Theta$. The mutual information neural estimator is strongly consistent and can approximate the actual value with arbitrary accuracy Belghazi et al. (2018).

3. METHOD

By the conditional Bayes formula and adding a marginal distribution of $Y$, we derive the transfer entropy as the difference between two types of mutual information. An intuitive description is provided in Figure 1, and the derivation is placed in Appendix A.

$$TE(X \to Y) = \sum p\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right) \log \frac{p\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right)}{p\left(y_{t+1}\right) p\left(y_t^{(k)}, x_t^{(l)}\right)} - \sum p\left(y_{t+1}, y_t^{(k)}\right) \log \frac{p\left(y_{t+1}, y_t^{(k)}\right)}{p\left(y_{t+1}\right) p\left(y_t^{(k)}\right)} \tag{3}$$

$$= MI\left(Y_{t+1}, Y_t^{(k)}, X_t^{(l)}\right) - MI\left(Y_{t+1}, Y_t^{(k)}\right). \tag{4}$$

By connecting Eq. 4 and Eq. 2, we define a differentiable estimator of transfer entropy as:

$$\widehat{TE}(X \to Y) = \sup_{\theta \in \Theta} \left( \mathbb{E}_{P\left(Y_{t+1}, Y_t^{(k)}, X_t^{(l)}\right)}\left[f_\theta\right] - \log \mathbb{E}_{P\left(Y_{t+1}\right) \otimes P\left(Y_t^{(k)}, X_t^{(l)}\right)}\left[e^{f_\theta}\right] \right) - \sup_{\phi \in \Phi} \left( \mathbb{E}_{P\left(Y_{t+1}, Y_t^{(k)}\right)}\left[f_\phi\right] - \log \mathbb{E}_{P\left(Y_{t+1}\right) \otimes P\left(Y_t^{(k)}\right)}\left[e^{f_\phi}\right] \right). \tag{5}$$

Transfer entropy, and even mutual information, is difficult to compute Paninski (2003), especially for high-dimensional or noisy data. In Appendix B, we offer a theoretical proof of the consistency and convergence properties of Transfer Entropy Neural Estimation, and examine its bias on a linear dynamic system where the true values of transfer entropy can be determined analytically.
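The dual representation in Eq. 2 can be checked numerically in a case where the answer is known in closed form. The sketch below is an illustration under stated assumptions rather than the estimator trained in this work: it uses a standard bivariate Gaussian with correlation `rho`, for which the mutual information is $-\frac{1}{2}\log(1-\rho^2)$, and it plugs in the analytically optimal critic (the log density ratio) instead of a learned network $f_\theta$. The names `dv_bound` and `optimal_critic` are hypothetical.

```python
import numpy as np

def dv_bound(f_joint, f_product):
    """Donsker-Varadhan bound: E_joint[f] - log E_product[exp(f)]."""
    return np.mean(f_joint) - np.log(np.mean(np.exp(f_product)))

rho = 0.8
rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)   # corr(x, y) = rho
y_shuffled = rng.permutation(y)    # breaks the pairing: approximates P_X (x) P_Y

def optimal_critic(x, y):
    # log p(x, y) / (p(x) p(y)) for a standard bivariate Gaussian
    quad = rho**2 * x**2 - 2 * rho * x * y + rho**2 * y**2
    return -0.5 * np.log(1 - rho**2) - quad / (2 * (1 - rho**2))

true_mi = -0.5 * np.log(1 - rho**2)
estimate = dv_bound(optimal_critic(x, y), optimal_critic(x, y_shuffled))
trivial = dv_bound(np.zeros(n), np.zeros(n))   # a constant critic gives a bound of 0
```

With the optimal critic the bound is tight up to sampling error, while an uninformative critic only certifies a mutual information of zero. A trained network $f_\theta$ would sit between these two extremes.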

3.2. ATTENTION-EXTENDED TRANSFER ENTROPY

The main difficulty in our task is that the causal effect in certain nonlinear dynamical systems is too weak to be recognized by classic techniques. The iconic transfer entropy works well only when the three true distributions, i.e., one joint distribution and two conditional distributions in Eq. 1, can be estimated perfectly. However, sparse causal effects are easily masked if the estimated probability density deviates even slightly from the real distribution. These momentary sources of evidence of coupling-drive are like outliers in the total distribution of a time series dominated by self-drive. In order to make the transfer entropy provide a clear distinction between causal and non-causal pairs, we need to highlight the positions where $p\left(y_{t+1} \mid y_t^{(k)}, x_t^{(l)}\right) > p\left(y_{t+1} \mid y_t^{(k)}\right)$ and filter out other times by adjusting $a_t$ in Eq. 6, all while avoiding the problem of distribution approximation. We do so by defining AeTE as:

$$AeTE(X \to Y) = \sum a_t \cdot p\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right) \log \frac{p\left(y_{t+1} \mid y_t^{(k)}, x_t^{(l)}\right)}{p\left(y_{t+1} \mid y_t^{(k)}\right)} \tag{6}$$

$$= MI\left(Y_{t+1}, Y_t^{(k)}, X_t^{(l)} \mid A\right) - MI\left(Y_{t+1}, Y_t^{(k)} \mid A\right). \tag{7}$$

In this expression, $a_t \in [0, 1]$ is the attention coefficient at time step $t$ and the collection $A$ of attention coefficients is the attention series. Comparison of Eq. 1 and Eq. 6 reveals that the transfer entropy can be viewed as a simplified version of AeTE in which attention coefficients are uniformly set to one: $\forall t,\ a_t = 1$. Because each position then contributes equally to the estimate, the value of transfer entropy is dominated by the majority of positions where the causal effect is negligible, i.e., where $p\left(y_{t+1} \mid y_t^{(k)}, x_t^{(l)}\right) \approx p\left(y_{t+1} \mid y_t^{(k)}\right)$. Similarly to transfer entropy, AeTE is derived as the difference of two mutual informations, but AeTE incorporates the scheme of attention assignment. By connecting Eq. 7 and Eq. 2, we define a differentiable estimator of AeTE as:

$$\widehat{AeTE}(X \to Y) = \sup_{\theta \in \Theta} \left[ \frac{1}{L} \sum_t a_t \cdot f_\theta\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right) - \log \frac{1}{L} \sum_t a_t \cdot e^{f_\theta\left(y_r, y_t^{(k)}, x_t^{(l)}\right)} \right] - \sup_{\phi \in \Phi} \left[ \frac{1}{L} \sum_t a_t \cdot f_\phi\left(y_{t+1}, y_t^{(k)}\right) - \log \frac{1}{L} \sum_t a_t \cdot e^{f_\phi\left(y_r, y_t^{(k)}\right)} \right], \tag{8}$$

where $T$ is the total number of time steps, $L = T - \max(k, l)$, and $y_r$ is sampled randomly from the time series $Y$, independently of the time step $t$. The expectation over the distribution of variables is replaced by the mean over the time series.
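To make the structure of the attention-weighted estimator concrete, the sketch below evaluates the two bracketed terms of the AeTE estimator for given critic outputs and attention coefficients. It assumes the critic outputs are already available as arrays (here they are random placeholders, not outputs of trained networks), so it computes the value of the objective at fixed $\theta$, $\phi$ rather than the supremum. The names `weighted_dv` and `aete_value` are hypothetical.

```python
import numpy as np

def weighted_dv(a, f_joint, f_shuffled):
    """(1/L) sum_t a_t f_joint[t] - log((1/L) sum_t a_t exp(f_shuffled[t]))."""
    return np.mean(a * f_joint) - np.log(np.mean(a * np.exp(f_shuffled)))

def aete_value(a, f_theta_joint, f_theta_shuf, f_phi_joint, f_phi_shuf):
    """Difference of the two attention-weighted terms at fixed critics."""
    return (weighted_dv(a, f_theta_joint, f_theta_shuf)
            - weighted_dv(a, f_phi_joint, f_phi_shuf))

# Placeholder critic outputs (hypothetical values, for shape illustration only).
rng = np.random.default_rng(0)
L = 1000
f_theta_joint, f_theta_shuf = rng.normal(size=L), rng.normal(size=L)
f_phi_joint, f_phi_shuf = rng.normal(size=L), rng.normal(size=L)

# With a_t = 1 for all t, the weighted objective reduces to the unweighted one.
uniform = np.ones(L)
unweighted = ((f_theta_joint.mean() - np.log(np.exp(f_theta_shuf).mean()))
              - (f_phi_joint.mean() - np.log(np.exp(f_phi_shuf).mean())))
with_uniform_attention = aete_value(uniform, f_theta_joint, f_theta_shuf,
                                    f_phi_joint, f_phi_shuf)
```

The equality of the two quantities under uniform attention mirrors the statement that transfer entropy is the special case of AeTE with $a_t = 1$ for all $t$.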

3.3. CAUSAL ATTENTION MECHANISM

The overall framework of our model is presented in Figure 2. In addition to the two neural networks $f_\theta$ and $f_\phi$ for mutual information estimation, we employ another neural network $g_\alpha$ for causal attention assignment. Rather than approximating distributions, the neural network $g_\alpha$ learns to maximize AeTE given by Eq. 8 via gradient ascent. However, high transfer entropy regions may also appear due to noise in empirical samples. For more sophisticated inference, we augment our method with a binary classifier $h_\eta$ guided by causal attention to focus on high transfer entropy regions and to recognize the different patterns of noise and of sparse emergence of causal effect. The classifier takes causality and non-causality as class labels. The training process is then divided into two independent stages: causal attention learning and classification learning. The objectives in the first stage are:

$$\theta, \phi \leftarrow \arg\max_{\theta, \phi \mid \alpha} L_1 + L_2 \tag{9}$$

$$\alpha \leftarrow \arg\max_{\alpha \mid \theta, \phi} L_1 - L_2, \tag{10}$$

where $L_1$ and $L_2$ are the expectations of the first and second sup terms of Eq. 8 on the training set, respectively. We update $(f_\theta, f_\phi)$ and $g_\alpha$ alternately. A small learning rate is required to maintain training stability; otherwise $g_\alpha$ may fall into a trivial solution where attention is almost zero throughout the time series. The objective in the second stage is:

$$\eta \leftarrow \arg\min_{\eta \mid \alpha} L_3, \tag{11}$$

where $L_3$ is the binary cross entropy and the notation $\eta \mid \alpha$ indicates that causal attention remains fixed during the second stage of training. The downstream classifier is sensitive to the upstream scheme of attention assignment. In experiments, there exists an optimal loss interval for training the attention model $g_\alpha$. We stop the first stage of training when the loss value of objective Eq. 10 is stable in this interval, and then the downstream classifier $h_\eta$ usually obtains the best generalization performance. The optimal intervals differ between dynamics, and we find them empirically.
Details on the implementation of causal attention mechanism are provided in Appendix C.
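The alternating first-stage updates of Eq. 9 and Eq. 10 can be sketched on a toy problem. The code below is only an illustration under strong simplifying assumptions that are not part of the method described here: the critics are linear in a few hand-picked bounded features (instead of neural networks $f_\theta$, $f_\phi$), the attention is parameterized by one free logit per time step (instead of a network $g_\alpha$), gradients are written out by hand, and the data, learning rates and function names (`dv_term`, `dv_grads`, `train_stage_one`) are hypothetical.

```python
import numpy as np

def dv_term(a, f_joint, f_shuf):
    """Attention-weighted term: mean(a * f_joint) - log(mean(a * exp(f_shuf)))."""
    return np.mean(a * f_joint) - np.log(np.mean(a * np.exp(f_shuf)))

def dv_grads(a, Z, Zr, w):
    """Gradients of dv_term(a, Z @ w, Zr @ w) with respect to w and to a."""
    e = np.exp(Zr @ w)
    norm = np.sum(a * e)
    grad_w = Z.T @ a / len(a) - Zr.T @ (a * e) / norm
    grad_a = (Z @ w) / len(a) - e / norm
    return grad_w, grad_a

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_stage_one(Z1, Z1r, Z2, Z2r, steps=100, lr=0.05, lr_att=100.0):
    """Alternate ascent on L1 + L2 (critics) and on L1 - L2 (attention)."""
    theta = np.zeros(Z1.shape[1])
    phi = np.zeros(Z2.shape[1])
    s = np.zeros(Z1.shape[0])                         # logits, a_t = sigmoid(s_t)
    for _ in range(steps):
        a = sigmoid(s)
        theta += lr * dv_grads(a, Z1, Z1r, theta)[0]  # critic step on L1
        phi += lr * dv_grads(a, Z2, Z2r, phi)[0]      # critic step on L2
        ga1 = dv_grads(a, Z1, Z1r, theta)[1]          # recompute with new critics
        ga2 = dv_grads(a, Z2, Z2r, phi)[1]
        s += lr_att * (ga1 - ga2) * a * (1 - a)       # attention step on L1 - L2
    a = sigmoid(s)
    value = dv_term(a, Z1 @ theta, Z1r @ theta) - dv_term(a, Z2 @ phi, Z2r @ phi)
    return theta, phi, a, value

# Toy data (hypothetical): x drives y only during a short burst.
rng = np.random.default_rng(0)
T = 2000
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(T - 1):
    coupling = 1.0 if 500 <= t < 600 else 0.0
    y[t + 1] = 0.5 * y[t] + coupling * x[t] + 0.5 * rng.standard_normal()

y_next, y_past, x_past = np.tanh(y[1:]), np.tanh(y[:-1]), np.tanh(x[:-1])
y_rand = rng.permutation(y_next)                      # y_r drawn independently of t
Z1 = np.column_stack([y_next * y_past, y_next * x_past])   # (y_{t+1}, y_t, x_t)
Z1r = np.column_stack([y_rand * y_past, y_rand * x_past])  # (y_r, y_t, x_t)
Z2, Z2r = Z1[:, :1], Z1r[:, :1]                             # (y_{t+1}, y_t), (y_r, y_t)

theta, phi, attention, aete = train_stage_one(Z1, Z1r, Z2, Z2r)
```

The sketch only shows the order of the updates and the signs of the gradients; it makes no claim about reproducing the attention patterns reported in the experiments.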

4. EXPERIMENT

We describe our experiment setup and extensively evaluate the causal attention mechanism on neuronal dynamics coupled on both synthetic and real causal networks.

4.1. SETUP

Causal networks. For synthetic networks, we generate ten groups of Erdős-Rényi (ER) and scale-free (SF) directed networks with one hundred nodes (i.e., variables) and with mean degree varying uniformly from 5 to 41, by adjusting the probability for edge creation in ER and the total number of edges in SF. Symmetric links (both $x_i \to x_j$ and $x_j \to x_i$) can exist. For each set of network parameters we consider five independently generated instances. For real networks, we select five neurological connectivity datasets.

Table 2: $B_{ij} = 1$ if variable $i$ is the parent of variable $j$, otherwise $B_{ij} = 0$. In these expressions, $\Gamma$ describes the coupling-drive, while other terms represent self-drive. The detailed configuration of dynamical parameters is provided in Appendix D.

| Dynamics | Equations |
|---|---|
| Hindmarsh-Rose | $\dot{p}_i = q_i - a p_i^3 + b p_i^2 - n_i + I_{ext} + \Gamma$, $\dot{q}_i = c - d p_i^2 - q_i$, $\dot{n}_i = r\left[s\left(p_i - p_0\right) - n_i\right]$, $\Gamma = g_c \left(V_{syn} - p_i\right) \sum_{j=1}^{N} B_{ij} / \left(1 + \exp\left(-\lambda\left(p_j - \Theta_{syn}\right)\right)\right)$ |
| Morris-Lecar | $C\dot{V} = I - g_L\left(V - V_L\right) - g_{Ca} m_\infty(V)\left(V - V_{Ca}\right) - g_K n\left(V - V_K\right) + \Gamma$, $\dot{n} = \lambda(V)\left(n_\infty(V) - n\right)$, $\Gamma = g_c \sum_{j=1}^{N} B_{ij}\left(n_j - n_i\right)$ |
| Izhikevich | $\dot{v}_i = 0.04 v_i^2 + 5 v_i + 140 - u_i + I + \Gamma$, $\dot{u}_i = a\left(b v_i - u_i\right)$, $\Gamma = g_c \sum_{j=1}^{N} B_{ij} u_j$ |
| Rulkov | $F_1(u_i, w_i) = \frac{\beta}{1 + u_i^2} + w_i + \Gamma(u_j)$, $F_2(u_i, w_i) = w_i - \nu u_i - \sigma$, $\Gamma(u_j) = g_c \sum_{j=1}^{N} B_{ij} / \left(1 + \exp\left(\lambda\left(u_j - \Theta_s\right)\right)\right)$ |
| FitzHugh-Nagumo | $\dot{v} = a + b v + c v^2 + d v^3 - u + \Gamma$, $\dot{u} = \varepsilon\left(e v - u\right)$, $\Gamma = -g_c \sum_{j=1}^{N} B_{ij}\left(v_j - v_i\right)$ |

Dynamic models. We use five dynamic models for neural activity simulation that are widely used in the field of neuroscience: Hindmarsh-Rose (HR), Morris-Lecar (Morris), Izhikevich (Izh), Rulkov and FitzHugh-Nagumo (FHN). Dynamic equations are provided in Table 2, and segments of generated time series are represented in Figure 3.

Evaluation metrics. We measure the following metrics: (1) the area under the receiver operating characteristic curve (AUROC); and (2) the area under the precision-recall curve (AUPRC).

Baselines. We compared our method with seven baselines: (1) Granger causality test (Granger) Granger (1969) ...

Training details. We employ a convolutional neural network for the models $g_\alpha$ and $h_\eta$, and a fully-connected neural network for the models $f_\theta$ and $f_\phi$. We use the ADAM Kingma & Ba (2014) optimizer with an initial learning rate of $10^{-4}$ for the classifier $h_\eta$ and $10^{-5}$ for the others. The batch size is 10. For synthetic networks, we randomly select twenty ordered pairs of variables as a training/validation set and four hundred ordered pairs as a test set. For real networks, the sample set scheme is provided in Table 3. All sets are composed of equal numbers of samples with causality and without causality. The total number of time steps $T$ of each time series is 50,000. Gaussian measurement noise is added with mean zero and standard deviation 1%/10% that of the original time series for synthetic/real networks respectively. We run all experiments in this work on a local machine with two NVIDIA Tesla V100 32GB GPUs.
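Parts of the setup described above can be sketched in a few lines. The code below is a hedged illustration, not the pipeline used for the experiments: it generates a directed ER adjacency matrix, samples a class-balanced set of ordered pairs, and adds relative Gaussian measurement noise. It assumes that "mean degree" refers to the mean out-degree and that self-loops are excluded; the function names `er_directed`, `balanced_pairs` and `add_measurement_noise` are hypothetical.

```python
import numpy as np

def er_directed(n, mean_degree, rng):
    """Directed Erdős-Rényi adjacency matrix without self-loops."""
    p = mean_degree / (n - 1)              # edge probability for the target mean out-degree
    B = (rng.random((n, n)) < p).astype(int)
    np.fill_diagonal(B, 0)
    return B

def balanced_pairs(B, n_pairs, rng):
    """Sample ordered pairs (i, j), half with B[i, j] = 1 and half with B[i, j] = 0."""
    off_diag = ~np.eye(B.shape[0], dtype=bool)
    pos = np.argwhere((B == 1) & off_diag)
    neg = np.argwhere((B == 0) & off_diag)
    half = n_pairs // 2
    pos = pos[rng.choice(len(pos), half, replace=False)]
    neg = neg[rng.choice(len(neg), half, replace=False)]
    pairs = np.vstack([pos, neg])
    labels = np.array([1] * half + [0] * half)
    return pairs, labels

def add_measurement_noise(series, rel_std, rng):
    """Add zero-mean Gaussian noise with std equal to rel_std times each series' std."""
    scale = rel_std * series.std(axis=-1, keepdims=True)
    return series + scale * rng.standard_normal(series.shape)

rng = np.random.default_rng(0)
B = er_directed(100, 5, rng)
train_pairs, train_labels = balanced_pairs(B, 20, rng)       # twenty labelled ordered pairs
clean = rng.standard_normal((100, 1000))                     # placeholder for simulated series
noisy = add_measurement_noise(clean, 0.01, rng)              # 1% relative noise
```

Only the ER case is shown; the scale-free generator and the real connectomes are outside the scope of this sketch.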

4.2. RESULT

Insight into the coupling-drive of underlying dynamics. The gray lines in Figure 3 represent the change of coupling force from parent to child variable over time, and are generated by the coupling term $\Gamma$ in Table 2. The absolute value of the coupling force rises (the gray lines spike) at occasional moments when the behavior of a parent variable substantially influences the evolution of its child variable, and remains almost zero at other times. The orange lines representing the causal attention keep in step with the gray lines, indicating that the causal attention mechanism recognizes the effect of the coupling force in reducing the uncertainty of the child variable and pays attention to the areas where the coupling force is significant. In Figure 3(d), the causal attention focuses on two separated regions where the coupling forces have concentrated bursts. In contrast, the light pink lines representing the traditional attention remain close to their maximum value, indicating that it is insensitive to changes in coupling force. This leads its classifier to extract features throughout the whole time series (instead of focusing on causal features). Traditional attention is not designed for causal reasoning and cannot accommodate the selection of features that correspond to the causal information.

Performance on test sets. Compared with the baselines, our method usually substantially improves reconstruction performance on both synthetic and real causal networks, as shown in Figure 4 and Table 3. In contrast, the classifier with the traditional attention mechanism (TA) obtains low losses on training sets but generalizes poorly on test sets, highlighting that relying on mere statistical correlation for causal inference is unstable and can be spurious Cui & Athey (2022). The performance of classical unsupervised methods, for which all positions in the time series are treated equally, is also limited by the paucity of causal effects.
These patterns demonstrate the importance of identifying and focusing on critical regions, which we achieve via the causal attention mechanism. In conclusion, our method slightly increases cost, due to the need for label collection, but obtains a substantial boost in performance compared with those unsupervised methods in this class of causal network reconstruction tasks. The performance of all methods tends to decrease as the average network degree grows. Networks with larger average degree are more likely to exhibit synchronization of variables which makes it harder to distinguish cause and effect. Furthermore, a single variable in these denser networks can have many parent variables, and substantial coupling forces can emerge from distinct parents at overlapping times, making individual drivers harder to distinguish. In this circumstance, a slight variance in the scheme of causal attention assignment may cause fluctuations in the performance of the downstream classifier. Robustness of the proposed method to measurement noise and sequence length is presented in Appendix F. 

6. CONCLUSION

The problem of reconstructing causal networks from observational data is fundamental in multiple disciplines of science, including neuroscience, since it is a prerequisite for research on structure analysis and behavior control in causal networks. In particular, several countries have recently launched grand brain projects, one important goal of which is to map the connectomes (i.e., directed links between neurons) of different species. We proposed a novel mechanism, causal attention, to guide machine learning models to infer causal relationships while focusing on the specific areas where the causal effect may emerge. We showed that this mechanism identifies weak causal effects ignored by classical techniques, and helps machine learning models gain insight into the coupling dynamics underlying time series data. Our method needs a small set of samples (i.e., a small number of known causal links), and thus raises an open problem worthy of future pursuit: for large complex systems, how to select the small number of ordered pairs of nodes that offer a general pattern for the identification of sparse causal effects.

A DERIVATION

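Here we present a derivation showing that the transfer entropy equals the difference between two types of mutual information:

$$TE(X \to Y) = \sum p\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right) \log \frac{p\left(y_{t+1} \mid y_t^{(k)}, x_t^{(l)}\right)}{p\left(y_{t+1} \mid y_t^{(k)}\right)}. \tag{12}$$

Applying the conditional Bayes formula $p(y \mid x) = \frac{p(y, x)}{p(x)}$ to the numerator and denominator in the log term of equation 12:

$$TE(X \to Y) = \sum p\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right) \log \frac{p\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right) / p\left(y_t^{(k)}, x_t^{(l)}\right)}{p\left(y_{t+1}, y_t^{(k)}\right) / p\left(y_t^{(k)}\right)}.$$

Adding the marginal distribution of time series $Y$ to the numerator and denominator simultaneously:

$$TE(X \to Y) = \sum p\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right) \log \frac{p\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right)}{p\left(y_{t+1}\right) p\left(y_t^{(k)}, x_t^{(l)}\right)} - \sum p\left(y_{t+1}, y_t^{(k)}\right) \log \frac{p\left(y_{t+1}, y_t^{(k)}\right)}{p\left(y_{t+1}\right) p\left(y_t^{(k)}\right)}$$

$$= MI\left(Y_{t+1}, Y_t^{(k)}, X_t^{(l)}\right) - MI\left(Y_{t+1}, Y_t^{(k)}\right).$$

In these expressions, $y_r$ is sampled from time series $Y$ randomly at each time step and independently of the time step $t$.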

B NEURAL ESTIMATOR FOR TRANSFER ENTROPY

B.1 CONSISTENCY

Definition. A neural estimator $\widehat{S(X, Y)}_n$ which uses $n$ samples from the data distribution to estimate a statistic $S(X, Y)$ on variables $X, Y$ is strongly consistent if for any $\epsilon > 0$, there exists a positive integer $N$ and a choice of neural network such that:

$$\forall n \ge N, \quad \left| S(X, Y) - \widehat{S(X, Y)}_n \right| \le \epsilon, \quad \text{almost everywhere (a.e.)}$$

The Mutual Information Neural Estimator (MINE) depends on a choice of a neural network and the number of samples $n$ from the data distribution Belghazi et al. (2018). Let $f_\theta$ be the family of functions parameterized by the neural network with parameters $\theta \in \Theta$. MINE is defined as:

$$\widehat{MI(X, Y)}_n = \sup_{\theta \in \Theta} \mathbb{E}_{P_{XY}^{(n)}}\left[f_\theta\right] - \log \mathbb{E}_{P_X^{(n)} \otimes P_Y^{(n)}}\left[e^{f_\theta}\right].$$

Theorem 1 Belghazi et al. (2018). MINE is strongly consistent.

The Transfer Entropy Neural Estimator (TENE) consists of two independent MINE and depends on the choice of neural network and the sample number $n$. TENE is defined as:

$$\widehat{TE(X \to Y)}_n = \widehat{MI\left(Y_{t+1}, Y_t^{(k)}, X_t^{(l)}\right)}_n - \widehat{MI\left(Y_{t+1}, Y_t^{(k)}\right)}_n.$$

We use $MI^{[1]}$, $MI^{[2]}$, $\widehat{MI}^{[1]}_n$, $\widehat{MI}^{[2]}_n$ as abbreviations of $MI\left(Y_{t+1}, Y_t^{(k)}, X_t^{(l)}\right)$, $MI\left(Y_{t+1}, Y_t^{(k)}\right)$, $\widehat{MI\left(Y_{t+1}, Y_t^{(k)}, X_t^{(l)}\right)}_n$ and $\widehat{MI\left(Y_{t+1}, Y_t^{(k)}\right)}_n$ respectively. We will prove the following:

Theorem 2. TENE is strongly consistent.

Proof. Let $\epsilon > 0$. By Theorem 1, we can choose neural networks and integers $N_1$, $N_2$ such that

$$\forall n \ge N_1, \quad \left| MI^{[1]} - \widehat{MI}^{[1]}_n \right| \le \epsilon / 2, \quad \text{a.e.}$$

$$\forall n \ge N_2, \quad \left| MI^{[2]} - \widehat{MI}^{[2]}_n \right| \le \epsilon / 2, \quad \text{a.e.}$$

Letting $N = \max\{N_1, N_2\}$, for $n \ge N$ and for some neural network we have, a.e.,

$$\left| TE(X \to Y) - \widehat{TE(X \to Y)}_n \right| = \left| \left(MI^{[1]} - MI^{[2]}\right) - \left(\widehat{MI}^{[1]}_n - \widehat{MI}^{[2]}_n\right) \right| \tag{22}$$

$$= \left| \left(MI^{[1]} - \widehat{MI}^{[1]}_n\right) - \left(MI^{[2]} - \widehat{MI}^{[2]}_n\right) \right| \tag{23}$$

$$\le \left| MI^{[1]} - \widehat{MI}^{[1]}_n \right| + \left| MI^{[2]} - \widehat{MI}^{[2]}_n \right| \tag{24}$$

$$\le \epsilon / 2 + \epsilon / 2 = \epsilon. \tag{25}$$

The proof is complete.

B.2 VARIATION OF BIAS WITH DIMENSION AND NOISE

We examine the performance of TENE for the considered class of neural networks on a linear dynamic system, consisting of variables $X$ and $Y$ defined as:

$$x_{t+1} = \alpha x_t + \varepsilon_x \tag{26}$$

$$y_{t+1} = \beta y_t + g_c x_t + \varepsilon_y \tag{27}$$

We set $\alpha = \beta = 0.5$ and $\varepsilon_x, \varepsilon_y \sim N(0, \sigma^2)$. The true values of transfer entropy $TE(X \to Y)$ in this simple coupled system can be determined analytically Kaiser & Schreiber (2002). We can increase the dimension of the system by considering multiple independent copies of variables $X$ and $Y$, in which case the mutual information and transfer entropy scale linearly with the dimension of the system. For each considered dimension, standard deviation $\sigma$, and coupling strength $g_c$ in an interval from -0.4 to 0.4, we generate a time series of length 50,000. We also consider an alternative non-parametric estimator of mutual information, the Kraskov estimator Kraskov et al. (2004) ... c,d) shows that the amplitude of the driving Gaussian noise has little influence on estimates. Interestingly, as the coupling strength $g_c$ grows small, i.e., as $X$ and $Y$ become more independent, the Kraskov estimator can suggest a negative value of the transfer entropy, i.e., we estimate that $\widehat{MI}_n(Y_{t+1}, Y_t, X_t) < \widehat{MI}_n(Y_{t+1}, Y_t)$. We deduce that irrelevant information about the nearly independent variable $X_t$ interferes with the estimation of the mutual information by the Kraskov estimator.
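The linear system of Eqs. 26 and 27 is easy to simulate, and for jointly Gaussian variables the transfer entropy with $k = l = 1$ equals half the log ratio of the residual variances of two nested linear regressions. The sketch below uses that Gaussian shortcut as a reference value; it is not the TENE or Kraskov estimator discussed above, and the function names `simulate_linear`, `residual_var` and `gaussian_te` as well as $\sigma = 1$ are assumptions for this example.

```python
import numpy as np

def simulate_linear(alpha, beta, g_c, sigma, T, rng):
    """Simulate x_{t+1} = alpha x_t + eps_x, y_{t+1} = beta y_t + g_c x_t + eps_y."""
    x, y = np.zeros(T), np.zeros(T)
    eps_x = rng.normal(0.0, sigma, T)
    eps_y = rng.normal(0.0, sigma, T)
    for t in range(T - 1):
        x[t + 1] = alpha * x[t] + eps_x[t]
        y[t + 1] = beta * y[t] + g_c * x[t] + eps_y[t]
    return x, y

def residual_var(target, predictors):
    """Residual variance of an ordinary least-squares fit with intercept."""
    A = np.column_stack(predictors + [np.ones_like(target)])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.var(target - A @ coef)

def gaussian_te(src, dst):
    """TE(src -> dst) in nats for k = l = 1 under a joint-Gaussian assumption."""
    v_self = residual_var(dst[1:], [dst[:-1]])               # predict from own past
    v_full = residual_var(dst[1:], [dst[:-1], src[:-1]])     # add the source's past
    return 0.5 * np.log(v_self / v_full)

rng = np.random.default_rng(0)
x, y = simulate_linear(alpha=0.5, beta=0.5, g_c=0.4, sigma=1.0, T=50_000, rng=rng)
te_xy = gaussian_te(x, y)    # coupled direction
te_yx = gaussian_te(y, x)    # uncoupled direction, expected to be near zero
```

Because the second regression nests the first, this estimate is never negative, in contrast with the difference of two independently estimated mutual informations.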

C ALGORITHM

Details on the implementation of the causal attention mechanism are provided in Algorithm 1.

Update parameters $\theta, \phi \leftarrow \theta + \nabla_\theta L_1,\ \phi + \nabla_\phi L_2$
Recompute $L_1, L_2$ on $S$
Update parameters $\alpha \leftarrow \alpha + \nabla_\alpha (L_1 - L_2)$
repeat until $L_3$ convergence
Compute $L_3$ on $S$
Update parameters $\eta \leftarrow \eta - \nabla_\eta L_3$

D MODEL BRAIN DYNAMICS

Here we give detailed information about the five neuronal dynamics applied to modeling membrane potential and relevant quantities in biological connectomes. We input to each causal discovery algorithm the coordinate corresponding to the neuron membrane voltage potential, because this variable is most likely to be experimentally accessible.

D.1 HINDMARSH-ROSE DYNAMICS

The spikes of activity in neurons are considered an important part of the brain's information processing Borges et al. (2018); Rabinovich et al. (2006). Hindmarsh and Rose Hindmarsh & Rose (1984) (HR) proposed a phenomenological neuron model that is a simplification of the Hodgkin-Huxley model Hodgkin & Huxley (1952). The HR model is described by

$$\dot{p} = q - a p^3 + b p^2 - n + I_{ext}, \qquad \dot{q} = c - d p^2 - q, \qquad \dot{n} = r\left[s\left(p - p_0\right) - n\right],$$

where $p(t)$ is the action potential of the membrane, $q(t)$ is related to the fast current and $n(t)$ is associated with the slow current. Presynaptic neurons with action potential $p_j$ are coupled by chemical synapses to neuron $i$, modifying its action potential $p_i$ according to

$$\dot{p}_i = q_i - a p_i^3 + b p_i^2 - n_i + I_{ext} + \Gamma, \qquad \Gamma = g_c \left(V_{syn} - p_i\right) \sum_{j=1}^{N} \frac{B_{ij}}{1 + \exp\left(-\lambda\left(p_j - \Theta_{syn}\right)\right)},$$

where $i, j = 1, \ldots, N$, $N$ is the number of neurons, $g_c$ is the chemical coupling strength and $B_{ij}$ describes the neurons' chemical connections. The chemical synapse function is modeled by the above sigmoidal function, with $\Theta_{syn} = 1.0$. We use parameters $a = 1$, $b = 3$, $c = 1$, $d = 5$, $s = 4$, $r = 0.005$, $p_0 = -1.60$, coupling strength $g_c = 0.1$, $V_{syn} = 2$, $\lambda = 10$, and external current $I_{ext} = 3.24$, for which HR neurons exhibit chaotic burst behavior.

D.2 MORRIS-LECAR DYNAMICS

Morris and Lecar Morris & Lecar (1981) suggested a simple two-variable model to describe oscillations in a barnacle's giant muscle fiber.
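The Morris-Lecar model has become quite popular in the computational neuroscience community due to its biophysically meaningful and measurable parameters. It consists of a membrane potential $V$ receiving an instantaneously activated Ca current and a more slowly activated K current $n$, evolving according to:

$$C\dot{V} = I - g_L\left(V - V_L\right) - g_{Ca} m_\infty(V)\left(V - V_{Ca}\right) - g_K n\left(V - V_K\right) + \Gamma(V), \qquad \dot{n} = \lambda(V)\left(n_\infty(V) - n\right),$$

where

$$m_\infty(V) = \frac{1}{2}\left[1 + \tanh\left(\frac{V - V_1}{V_2}\right)\right], \qquad n_\infty(V) = \frac{1}{2}\left[1 + \tanh\left(\frac{V - V_3}{V_4}\right)\right], \qquad \lambda(V) = \lambda \cosh\left(\frac{V - V_3}{2 V_4}\right),$$

with the coupling term $\Gamma(V_i) = g_c \sum_{j=1}^{N} B_{ij}\left(n_j - n_i\right)$, with parameters $C = 20\ \mu\mathrm{F/cm^2}$, $g_L = 2\ \mathrm{mmho/cm^2}$, $V_L = -50\ \mathrm{mV}$, $g_{Ca} = 4\ \mathrm{mmho/cm^2}$, $V_{Ca} = 100\ \mathrm{mV}$, $g_K = 8\ \mathrm{mmho/cm^2}$, $V_K = -70\ \mathrm{mV}$, $V_1 = 0\ \mathrm{mV}$, $V_2 = 15\ \mathrm{mV}$, $V_3 = 10\ \mathrm{mV}$, $V_4 = 10\ \mathrm{mV}$, $\lambda = 0.1\ \mathrm{s^{-1}}$, and applied current $I = 34\ \mu\mathrm{A/cm^2}$.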

D.3 IZHIKEVICH DYNAMICS

Izhikevich dynamics reproduce spiking and bursting behavior of known types of cortical neurons, and combine the biological plausibility of Hodgkin-Huxley-type dynamics and the computational efficiency of integrate-and-fire neurons Izhikevich (2003). The equations governing Izhikevich spike dynamics are:

$$\dot{v}_i = 0.04 v_i^2 + 5 v_i + 140 - u_i + I + g_c \sum_{j=1}^{N} B_{ij} u_j, \qquad \dot{u}_i = a\left(b v_i - u_i\right),$$

with the auxiliary after-spike resetting: if $v \ge +30$ mV, then $v \leftarrow c$ and $u \leftarrow u + d$. Here, variable $v$ represents the membrane potential of the neuron and $u$ represents a membrane recovery variable, which accounts for the activation of K$^+$ ionic currents and inactivation of Na$^+$ ionic currents, and it provides negative feedback to $v$. Here, we use the parameters $a = 0.2$, $b = 2$, $c = -56$, $d = -16$, $I = -99$. After the spike reaches its apex (+30 mV), the membrane voltage and the recovery variable are reset. If $v$ skips over 30, then it is first reset to 30, and then to $c$, so that all spikes have equal magnitudes.
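The reset rule above can be illustrated with a minimal forward-Euler sketch of a single, uncoupled Izhikevich neuron ($g_c = 0$). The parameter values are those quoted above and the step $h = 0.05$ follows Appendix D.6; the initial condition, the function name `simulate_izhikevich` and the choice to record the clipped apex value are assumptions made for this example.

```python
import numpy as np

def simulate_izhikevich(T, h=0.05, a=0.2, b=2.0, c=-56.0, d=-16.0, I=-99.0):
    """Forward-Euler simulation of one uncoupled Izhikevich neuron."""
    v = np.empty(T)
    u = np.empty(T)
    v_now, u_now = c, b * c              # assumed initial condition
    for t in range(T):
        dv = 0.04 * v_now**2 + 5.0 * v_now + 140.0 - u_now + I
        du = a * (b * v_now - u_now)
        v_now += h * dv
        u_now += h * du
        if v_now >= 30.0:
            v[t] = 30.0                  # record the apex, clipped to +30 mV
            u[t] = u_now
            v_now = c                    # after-spike reset: v <- c
            u_now += d                   # after-spike reset: u <- u + d
        else:
            v[t] = v_now
            u[t] = u_now
    return v, u

v, u = simulate_izhikevich(20_000)
```

Clipping the recorded value to 30 before resetting implements the remark that all spikes should have equal magnitudes even when a discrete step overshoots the apex.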

D.4 RULKOV DYNAMICS

The Rulkov model is a map-based neuron model with a surprising abundance of features, such as periodic and chaotic spiking, and bursting. The Rulkov map is an abstract mathematical model, although it shares some specific features with other neuron models closer to experimental observations. We use synthetic time series where each neuron is simulated using the Rulkov model Eroglu et al. (2020), which has two variables, $u$ and $w$, evolving at different timescales as described by

$$x(t+1) = \left(u(t+1), w(t+1)\right) = F(x(t)) = \left(F_1(u(t), w(t)), F_2(u(t), w(t))\right),$$

with $F_1(u, w) = \frac{\beta}{1 + u^2} + w + \Gamma(u)$ and $F_2(u, w) = w - \nu u - \sigma$. The two variables reflect the two important time scales of a neuron model. The variable $u$ represents the fast dynamics of the system and usually models the membrane voltage of the neuron, whereas $w$ is the slow variable and represents the variations of the ionic recovery currents. Different combinations of the parameters $\sigma$ and $\beta$ give rise to different dynamical states of the neuron, such as resting, tonic spiking, and chaotic bursts. As for the coupling, we consider chemical synaptic coupling, that is, $H(x_i, x_j) = \left(h(u_i, u_j), 0\right)$ with $h(u_i, u_j) = (u_i - V_s)\,\Gamma(u_j)$, where

$$\Gamma(u_j) = \frac{1}{1 + \exp\left\{\lambda\left(u_j - \Theta_s\right)\right\}},$$

and electrical synaptic coupling, $H(x_i, x_j) = \left(h(u_i, u_j), 0\right)$, with $h(u_i, u_j) = u_j - u_i$. In the chemical coupling, $V_s$ is a parameter called the reverse potential. Here, we use the parameters $\beta = 4.4$, $\sigma = \nu = 0.001$, $V_s = 20$, $\Theta_s = -0.25$ and $\lambda = 10$.
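Since the Rulkov model is a map, simulating it only requires iterating $F_1$ and $F_2$. The sketch below iterates a single uncoupled neuron ($\Gamma = 0$) with the parameter values quoted above; the initial condition and the function name `simulate_rulkov` are assumptions for this example, and the coupling terms are deliberately left out.

```python
import numpy as np

def simulate_rulkov(T, beta=4.4, sigma=0.001, nu=0.001, u0=-1.0, w0=-3.0):
    """Iterate the uncoupled Rulkov map for T steps (unit sample interval)."""
    u = np.empty(T)
    w = np.empty(T)
    u[0], w[0] = u0, w0                                # assumed initial condition
    for t in range(T - 1):
        u[t + 1] = beta / (1.0 + u[t] ** 2) + w[t]     # fast variable, F_1 with Gamma = 0
        w[t + 1] = w[t] - nu * u[t] - sigma            # slow variable, F_2
    return u, w

u, w = simulate_rulkov(5000)
```

The slow variable changes by at most a few thousandths per step, which is the timescale separation mentioned in the text.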

D.5 FITZHUGH-NAGUMO DYNAMICS

A FitzHugh-Nagumo neuron comprises a two-dimensional system of smooth ODEs, so it cannot exhibit autonomous chaotic dynamics or bursting. Adding noise allows for stochastic bursting FitzHugh (1961). The equations governing the FitzHugh-Nagumo neuronal network dynamics are

$$\dot{v} = a + bv + cv^2 + dv^3 - u + \Gamma, \qquad \dot{u} = \varepsilon(ev - u),$$

with the coupling term

$$\Gamma(v_i) = -g_c \sum_{j=1}^{N} B_{ij}(v_j - v_i).$$

The FitzHugh-Nagumo dynamics capture the firing behavior of neurons with two components. The first component $v$ represents the membrane potential, which contains self- and interaction dynamics, and the second component $u$ represents a recovery variable. To resolve the shape of each spike, the time step in the model must be relatively small, e.g., $\tau = 0.25$ ms. Here, we use the parameters $a = 0.28$, $b = 1$, $c = 0$, $d = -1$, $\varepsilon = 0.04$, $e = 12.5$. Moreover, the parameters in the FitzHugh-Nagumo model can be tuned so that the model describes the spiking dynamics of many resonator neurons.
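To make the role of the coupling term concrete, the following sketch evaluates the right-hand side of the network equations for a given state. It is an illustration under stated assumptions: the function name `fhn_rhs` and the value `g_c = 0.1` used in the example are hypothetical, and the result would still need to be passed to an ODE integrator to produce a time series.

```python
def fhn_rhs(v, u, B, g_c, a=0.28, b=1.0, c=0.0, d=-1.0, eps=0.04, e=12.5):
    """Right-hand side of the coupled FitzHugh-Nagumo equations.

    v, u : lists with the membrane potential and recovery variable of each neuron
    B    : adjacency matrix as nested lists, B[i][j] multiplies (v_j - v_i)
    g_c  : coupling strength (its value is not given in this section)
    Returns the lists (dv/dt, du/dt).
    """
    n = len(v)
    dv, du = [], []
    for i in range(n):
        # Coupling term Gamma(v_i) = -g_c * sum_j B_ij * (v_j - v_i)
        gamma_i = -g_c * sum(B[i][j] * (v[j] - v[i]) for j in range(n))
        dv.append(a + b * v[i] + c * v[i] ** 2 + d * v[i] ** 3 - u[i] + gamma_i)
        du.append(eps * (e * v[i] - u[i]))
    return dv, du


# Two neurons: only neuron 0 has a nonzero row in B, so only it feels the coupling.
dv, du = fhn_rhs(v=[0.0, 1.0], u=[0.0, 0.0], B=[[0, 1], [0, 0]], g_c=0.1)
```

In this example the coupling contributes `-0.1` to the derivative of neuron 0 and nothing to neuron 1, which shows how small the coupling-drive is relative to the self-drive even for a moderate `g_c`.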

D.6 TIME SERIES GENERATION

To obtain the time series from the above neural dynamics, we use a variable-step Runge-Kutta method to solve the ordinary differential equations of the Hindmarsh-Rose and Morris-Lecar dynamics with sample interval τ = 0.1. Izhikevich dynamics are solved by the Euler method with time step h = 0.05. For the Rulkov map we use a unit sample interval. The total number of time steps is T = 50,000 for both synthetic and real networks.

E REAL BRAIN CONNECTOMES INFORMATION

E.1 CAT CONNECTOME

The cat connectivity dataset comprises a description of cortical connections in the cat brain Scannell et al. (1995), a connectivity set resulting from a comprehensive literature search of anatomical tracing studies in the cat cortex. Detailed information on the delineated regions, including the parcellation scheme used, abbreviations and possible overlap with other parcellation schemes, as well as the physiological characteristics of these regions, is given in the appendix of the original study Ref. Scannell et al. (1995). The connectivity dataset incorporates data from one hemisphere, including 65 regions and 1139 interregional macroscopic axonal projections de Reus & van den Heuvel (2013).

E.2 MACAQUE CONNECTOME

The macaque connectivity dataset used in this study comprises anatomical data from 410 tract-tracing studies collated in the online neuroinformatics database CoCoMac (http://cocomac.org), first analyzed and made publicly available in Ref. Modha & Singh (2010). Harriger et al. (2012) focused primarily on the connectivity among regions of the cerebral cortex. The cortical connection matrix was extracted from the primary connection data by removing all subcortical (thalamus, basal ganglia, brainstem) regions. In addition, regions that did not maintain at least one incoming and one outgoing connection were also removed to ensure that the network was strongly connected. The remaining connection dataset used in this study consists of 242 regions and 4090 directed projections represented in binary format (connection present = 1, connection absent = 0) Harriger et al. (2012).
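The degree-based filtering step described above can be sketched as follows. This is not the procedure used by the original data curators: the function name `prune_unconnected` is hypothetical, and repeating the filter until no further region is dropped is an assumption made here, since removing one region can leave a neighbor without incoming or outgoing connections.

```python
def prune_unconnected(adj):
    """Drop regions lacking an incoming or an outgoing connection (illustrative).

    adj[i][j] == 1 means a directed projection from region i to region j.
    Returns (kept_indices, pruned_matrix).
    """
    keep = list(range(len(adj)))
    changed = True
    while changed:                       # assumed: repeat until the set is stable
        changed = False
        for i in list(keep):
            out_deg = sum(adj[i][j] for j in keep if j != i)
            in_deg = sum(adj[j][i] for j in keep if j != i)
            if out_deg == 0 or in_deg == 0:
                keep.remove(i)
                changed = True
    pruned = [[adj[i][j] for j in keep] for i in keep]
    return keep, pruned


# Toy network: a directed cycle 0 -> 1 -> 2 -> 0 plus a dead-end region 3.
adj = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
kept, pruned = prune_unconnected(adj)
```

Having at least one incoming and one outgoing connection per region is a necessary condition for strong connectivity, so a separate connectivity check may still be useful on real data.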

E.3 MOUSE CONNECTOME

The Allen Mouse Brain Connectivity Atlas uses enhanced green fluorescent protein (EGFP)-expressing adeno-associated viral vectors to trace axonal projections from defined regions and cell types, and high-throughput serial two-photon tomography to image the EGFP-labelled axons throughout the brain. This systematic and standardized approach allows spatial registration of individual experiments into a common three-dimensional (3D) reference space, resulting in a whole-brain connectivity matrix. The Allen Mouse Brain Connectivity Atlas is a freely available, foundational resource for structural and functional investigations into the neural circuits that support behavioural and cognitive processes in health and disease Oh et al. (2014).

E.4 WORM CONNECTOME

The connectome of the posterior nervous system of the C. elegans adult male, comprising all chemical and gap junction synapses, was identified by serial section electron microscopy Jarrell et al. (2012). The feasibility of comprehensive synapse-level nervous system reconstruction by this method was a primary reason for the initial selection of C. elegans as an experimental model. Jarrell et al. (2012) developed a PC-based software platform to facilitate assembly of a connectome from electron micrographic images. The connectome is of a single adult animal and was produced from a series of 5000 serial thin sections of 70 to 90 nm encompassing the posterior half of the body.

E.5 RAT CONNECTOME

Because resliceable 3D brain models for relating different parcellation schemes systematically and topographically are still in the first phases of development, it is necessary to rely on qualitative comparisons between regions and tracts that are either inserted directly by neuroanatomists or trained annotators, or are extracted or inferred by collators from the available literature. To address these challenges, Ref. Bota et al. (2012) developed a publicly available neuroinformatics system, the Brain Architecture Knowledge Management System (BAMS), including an exemplar for constructing interrelated connectomes at different levels of the mammalian central nervous system organization, and presented the latest version of the BAMS rat macroconnectome.

Information about the above datasets is summarized in Table 4.

Taking the Cat and Mouse connectomes as examples, we show the results of our method trained on time series data of different lengths, i.e., total numbers of time steps, in Figure 6. The sizes of the training and test sets follow the same scheme as in main text Table 4. Overall, the AUROC/AUPRC scores tend to increase with the length, but this tendency is not significant. This indicates that the causal attention mechanism extracts sufficient causal features for causal inference even from short time series.

F.3 ROBUSTNESS OF CAUSAL ATTENTION MECHANISM AGAINST THE INTENSITY OF NOISE

We show the results of our method trained on time series data with different intensities of added noise (measurement noise rather than intrinsic noise of the dynamics) in Figure 7. Except for the Izh and FHN dynamics on the Cat connectome, the AUROC/AUPRC scores are stable when the noise standard deviation ranges from 2% to 10% of that of the original time series. This implies that the causal attention mechanism is robust to the level of measurement noise.

The causal attention mechanism helps the classifier reveal the generation process underlying the data, i.e., the coupling-drive in the problem of causal inference, and alleviates the problem of distribution shift in the small-sample setting. The causal attention mechanism refines the content of samples (critical regions) and thus reduces the distribution dimension of the entire dataset. To quantify this shortened distance, we ask: how many additional training samples does traditional machine learning need to achieve the same level of generalization as our method? Taking the Cat connectome as an example, we train the classifier with the traditional attention mechanism while gradually expanding the size of the training set, and provide the results in Figure 8. The green horizontal lines represent the AUROC value of our method using ten ordered pairs with causality (0.8% of edges in the Cat connectome), while, to achieve the same performance, the traditional classifier needs approximately 10%/15%/20%/40% of samples for HR/Izh/Morris/Rulkov dynamics respectively (where the blue lines cross the green lines). This also indicates that our method provides a substantial saving in label collection, which matters because the procedure for checking connections in organisms is cumbersome in practice.



An alternative design, which we have not yet implemented, would join the first and second stages. The attention model $g_\alpha$ would be trained not only to maximize AeTE but also to respond to feedback from the classifier, and would find the balance between Eq. 10 and Eq. 11 automatically.



Figure 1: Visual interpretation of transfer entropy and its attention-extended version. The Transfer Entropy is derived as the difference of two types of mutual information: $MI(Y_{t+1}, (Y_t, X_t))$ (blue area) quantifies the reduction in uncertainty of the future state $y_{t+1}$ from knowing the current states $(y_t, x_t)$, and $MI(Y_{t+1}, Y_t)$ (green area) is the same quantity when only $y_t$ is known. The attention coefficients $a_t$ (yellow area) are assigned to each position of the time series by the causal attention mechanism to maximize the Attention-extended Transfer Entropy. For brevity, $k = l = 1$ here.
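The difference of mutual informations in the caption can be illustrated with a plug-in (histogram) estimator on discrete data, for $k = l = 1$. This sketch covers only the classic transfer entropy, not the attention-extended version, and it substitutes simple counting for the neural estimators used in the paper; the function names and the toy data are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(pairs):
    """Plug-in mutual information (in bits) from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return sum(
        (cnt / n) * math.log2(cnt * n / (count_a[a] * count_b[b]))
        for (a, b), cnt in joint.items()
    )


def transfer_entropy(x, y):
    """TE from x to y: MI(Y_{t+1}, (Y_t, X_t)) - MI(Y_{t+1}, Y_t), with k = l = 1."""
    future = y[1:]
    past_y = y[:-1]
    past_x = x[:-1]
    mi_joint = mutual_information(list(zip(future, zip(past_y, past_x))))
    mi_self = mutual_information(list(zip(future, past_y)))
    return mi_joint - mi_self


# Toy example: y copies x with a one-step delay, so x drives y but not vice versa.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(5000)]
y = [0] + x[:-1]
te_xy = transfer_entropy(x, y)   # expected to be close to 1 bit
te_yx = transfer_entropy(y, x)   # expected to be close to 0 bits
```

Counting-based estimators of this kind degrade quickly for continuous or high-dimensional variables, which is the setting where learned estimators become attractive.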

Figure 2: Graphical illustration of the causal attention mechanism framework. An input sample is the time series of an ordered pair of variables with shape [2, L]. Stage 1: the neural network $g_\alpha$ assigns the attention coefficients $\{a_i\}_{i=1}^{L}$. The neural networks $f_\theta$ and $f_\phi$ forming the transfer entropy estimator estimate the mutual information $MI_1$ and $MI_2$ (first and second terms in Eq. 7). Stage 2: the inferred probability of causality is Sigmoid( 1 L

); (2) Transfer Entropy (TE), as in Eq. 5; (3) Convergent cross mapping (CCM) Sugihara et al. (2012); (4) Latent convergent cross mapping (Latent CCM) De Brouwer et al. (2020); (5) PCMCI Runge et al. (2019) and (6) PCMCI+ Runge (2020) using partial correlation to quantify causal strength; (7) Classification model with Convolutional Block Attention Module (TA) Woo et al. (2018).


Figure 3: Insight into the coupling drive of dynamics. Top panel in each subplot: segment of time series from an ordered pair of variables with a causal relationship (blue is parent and green is child). Bottom panel in each subplot: the absolute value of the coupling force (gray line), traditional attention (light pink line), and causal attention (orange line). (a) Hindmarsh-Rose; (b) Izhikevich; (c) FitzHugh-Nagumo; (d) Rulkov.

Figure 4: Comparison of classifiers on synthetic causal networks. CA: Causal Attention. Examples of dynamical model: (a) HR on ER; (b) Izh on ER; (c) HR on SF; (d) Izh on SF.

with k = 5 nearest neighbours. In Fig 5 we compare the results of MINE with the analytic formula and the Kraskov estimator. MINE shows marked improvement over the Kraskov estimator, especially when variables are high-dimensional. Comparing Fig 5(a,b) or (

Figure 5: True and estimated transfer entropy versus coupling strength $g_c$. The dimension and standard deviation (std) $\sigma$ of the system noise are indicated in the titles of the subplots.

Figure 6: AUROC/AUPRC scores of the causal attention mechanism trained on time series data of different lengths. The blue bar is AUPRC, the green bar is AUROC. The x-axis indicates the scores and the y-axis represents the total number of time steps of the time series.

Figure 7: AUROC/AUPRC scores of the causal attention mechanism trained on time series data with different intensities of added noise. The blue bar is AUPRC, the green bar is AUROC, and the y-axis indicates the standard deviation of the Gaussian measurement noise as a percentage of that of the original time series.


Figure 8: Size of the training set that the traditional classifier requires to achieve the same level of generalization as our method. The x-coordinate indicates the number of ordered pairs with causality in the training set as a percentage of the total edges in the causal network. Examples of dynamical model: (a) HR; (b) Izh; (c) Morris; (d) Rulkov.

In these expressions, y r is sampled from Y randomly and independently of the time step t. The first term M I Y t+1 , Y

Table 1, each from a different species: Cat, Macaque, Mouse, Worm and Rat. Equations of the five dynamical models considered. B is the asymmetrical adjacency matrix of the causal network, recording causal relationships between nodes.

Performance comparison on real causal networks. Values in the first column of each dataset are AUROC, and values in the second column are AUPRC. Each point contains the mean and standard deviation computed in five experiments with randomly sampled training/validation/test set in Cat (top, 10/10/500) and Mouse (bottom, 10/10/90) connectomes. Results on other three connectomes are shown in Appendix F.

the design of condition-selection strategies or choice of conditional independence test. Granger Causality Granger (1969) is extended to nonlinear dynamics by using neural networks to represent nonlinear causal relationships Tank et al. (2021); Nauta et al. (2019). Many methods of causal discovery assume that the causal network is a directed acyclic graph. However, directed cyclic graphs are common in real systems. To address this non-separability issue, Convergent cross mapping Sugihara et al. (2012) and its variations Clark et al. (2015); De Brouwer et al. (2020) measure the extent to which the historical record of the child can reliably estimate states of the parent in reconstructed state space. However, sparse causal effect in neuronal dynamics, particularly in the presence of noise, may lead parent and child time series to appear statistically independent, so that their contribution to state estimation is hard to recognize.

as well as sample size, and is also trainable and strongly consistent. They also discussed another version of MINE based on the f-divergence representation Nguyen et al. (2010); Nowozin et al. (2016). Using the technique of Noise-Contrastive Estimation (NCE) Gutmann & Hyvärinen (2010), based on comparing target and randomly chosen negative samples, Van den Oord et al. (2018) proposed the InfoNCE loss, minimization of which maximizes a mutual information lower bound. An important application of this contrastive learning approach has been extracting high-level representations of different data modalities Chen et al. (2020); Woo et al.

Statistical information of six real networks: dataset name, type of network, number of nodes, number of edges, mean degree ⟨k⟩, and data acquisition method.

Rat connectome are provided in Table 5/Table 6/Table 7 as the supplement of main text Table 3. Classic methods have limited performance across various neural dynamics unfolding on real causal networks, especially in a noisy environment: we add Gaussian measurement noise with mean zero and standard deviation 10% of that of the original time series. As we discuss in main text Sec. 3.1, sparse causal effects are easily masked in the iconic transfer entropy metric when noise causes estimated probability densities to deviate even slightly from the true distributions.
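The measurement-noise setting mentioned above (zero-mean Gaussian noise whose standard deviation is a fixed fraction of that of the clean series) can be sketched as follows. The function name `add_measurement_noise`, the use of the population standard deviation, and the sawtooth test signal are assumptions made for illustration; the noise is added to the recorded series only and does not enter the dynamics.

```python
import random
import statistics


def add_measurement_noise(series, fraction=0.10, seed=0):
    """Return a copy of `series` with additive zero-mean Gaussian noise.

    The noise standard deviation is `fraction` times the standard deviation
    of the clean series (population std is an assumption made here).
    """
    rng = random.Random(seed)
    sigma = fraction * statistics.pstdev(series)
    return [value + rng.gauss(0.0, sigma) for value in series]


# Hypothetical clean signal: a sawtooth standing in for a recorded time series.
clean = [(t % 100) / 100.0 for t in range(5000)]
noisy = add_measurement_noise(clean, fraction=0.10, seed=1)

# Empirical ratio of noise std to signal std, which should be near 0.10.
residual = [n - c for n, c in zip(noisy, clean)]
ratio = statistics.pstdev(residual) / statistics.pstdev(clean)
```

Scaling the noise by the standard deviation of each series keeps the noise level comparable across dynamical models whose variables have very different amplitudes.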

Performance comparison on Macaque connectome. The sample number in train/validation/test set is 50/50/500.

Performance comparison on C. elegans connectome. The sample number in train/validation/test set is 50/50/500.

Performance comparison on Rat connectome. The sample number in train/validation/test set is 100/100/500.

