CAUSAL ATTENTION TO EXPLOIT TRANSIENT EMERGENCE OF CAUSAL EFFECT

Anonymous

Abstract

We propose a causal reasoning mechanism called causal attention that improves the performance of machine learning models on a class of causal inference tasks by revealing the generation process behind the observed data. We consider the problem of reconstructing causal networks (e.g., biological neural networks) connecting large numbers of variables (e.g., nerve cells), whose evolution is governed by nonlinear dynamics consisting of a weak coupling-drive (i.e., the causal effect) and a strong self-drive that dominates the evolution. The core difficulty is the sparseness of the causal effect, which emerges (i.e., the coupling force is significant) only momentarily and otherwise remains quiescent in the neural activity sequence. Causal attention is designed to guide the model to focus its inference on the critical regions of the time series where causality may manifest. Specifically, attention coefficients are assigned autonomously by a neural network trained to maximize the Attention-extended Transfer Entropy, a novel generalization of the iconic transfer entropy metric. Our results show that, without any prior knowledge of the dynamics, causal attention explicitly identifies the regions where the strength of the coupling-drive is distinctly greater than zero. This innovation substantially improves reconstruction performance on both synthetic and real causal networks using data generated by neuronal models widely used in neuroscience.

1. INTRODUCTION

In this work, our task is to infer causal relationships between observed variables from time series data and to reconstruct the causal network connecting large numbers of these variables. Assume the time series x_it records the time evolution of variable i governed by coupled nonlinear dynamics, as represented by the general differential equation ẋ_it = g(x_it) + Σ_j B_ij f(x_it, x_jt), where g and f are the self- and coupling functions, respectively. The parent variable influences the dynamic evolution of its child variable via the coupling function f. Note that both functions are hidden and usually unknown for real systems. The asymmetric adjacency matrix B represents the causal, i.e., directional, coupling relationships between variables. Hence, the goal is to infer the matrix B from the observed time series x_it, i = 1, 2, . . . , N, where N is the number of variables in the system. If B_ij = 1, variable i is a coupling driver (parent variable) of variable j; otherwise the entry is zero. The key challenge is that the causal effect in neural dynamics (e.g., biological neural systems observed via neuronal activity sequences) is too weak to be detected, rendering powerless classic unsupervised techniques of causal inference developed across multiple research communities Granger (1969); Schreiber (2000); Sugihara et al. (2012); Sun et al. (2015); Nauta et al. (2019); Runge et al. (2019); Gerhardus & Runge (2020); Tank et al. (2021); Mastakouri et al. (2021). This difficulty manifests in three aspects. First, the dynamics contains both self-drive and coupling-drive. The strength of the coupling f(x_it, ·) is usually many orders of magnitude smaller than that of the self-drive g(x_it), and the latter dominates the evolution. Second, the behavior of the coupling-drive is chaotic, unlike in linear models Shimizu et al. (2006); Xie et al. (2020). The resulting unpredictability and variability of the system state mean that the coupling force can be significant momentarily and otherwise almost vanish, as illustrated in Figure 3 (gray lines). This dilutes the information in the time series that is useful for inferring causal relationships.
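To make this setting concrete, the following minimal sketch Euler-integrates such coupled dynamics. The choices of g, f, the coupling strength eps, and the noise level are illustrative assumptions (in the paper both functions are hidden and unknown); the convention B[i, j] = 1, meaning variable i drives variable j, follows the definition above.

```python
import numpy as np

def simulate(B, T=2000, dt=0.01, eps=0.05, seed=0):
    # Euler integration of  dx_i/dt = g(x_i) + eps * sum_j B[j, i] * f(x_i, x_j)
    # with toy choices g(x) = x - x**3 (strong self-drive) and
    # f(x_i, x_j) = tanh(x_j - x_i) (weak coupling-drive).
    rng = np.random.default_rng(seed)
    N = B.shape[0]
    x = rng.uniform(-1.0, 1.0, size=N)
    traj = np.empty((T, N))
    for t in range(T):
        M = np.tanh(x[:, None] - x[None, :])   # M[j, i] = f(x_i, x_j)
        coupling = eps * (B * M).sum(axis=0)   # sum over parents j of each child i
        x = x + dt * ((x - x**3) + coupling) + np.sqrt(dt) * 0.01 * rng.standard_normal(N)
        traj[t] = x
    return traj

# 3-variable chain 0 -> 1 -> 2 (B[i, j] = 1 means i drives j)
B = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
series = simulate(B)
print(series.shape)  # (2000, 3)
```

Because eps is small, the coupling term is dwarfed by the self-drive in almost every sample, which is precisely the regime the paper addresses.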
Third, in the heterogeneous networks common in applications, some variables are hubs coupled with many parent variables, among which it is difficult to distinguish individual causes. When causal effects are weak, we do not clearly observe the principle of Granger Causality, whereby the parent variable helps to explain the future change in its child variable Pfister et al. (2019). Rather, when we train a machine learning model Nauta et al. (2019); Tank et al. (2021) for a prediction task on the neuronal activity sequences, the model exploits only the historical information of the child variable itself, while that from parent variables is ignored. We posit that the coupling-drive makes a negligible contribution to dynamic evolution in the majority of samples of the time series data. In other words, only in a small fraction of samples is the information of parent variables effective in predicting the evolution of child variables. Taking as an example a gradient algorithm that minimizes the regression error over all samples, Σ_t (x_it − x̂_it)², the adjustment of model parameters from the tiny set of samples corresponding to significant coupling force is negligible, yet these are the only samples that could induce the model to exploit causal effects in reducing the regression error. Similarly, for transfer entropy Schreiber (2000), which measures the reduction in uncertainty that a potential parent variable provides about a potential child variable, there is no significant difference in measured value between ordered pairs of variables with and without causality. To overcome this difficulty, we propose a causal reasoning mechanism, causal attention, to identify the moments when the causal effect emerges. We design an objective function, Attention-extended Transfer Entropy (AeTE), a weighted generalization of transfer entropy.
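The transfer entropy argument can be illustrated with a plain plug-in (histogram) estimator. This is a minimal sketch of Schreiber's measure with history length 1, not the paper's estimator; the toy pair below has a strong, persistent coupling, so the directed asymmetry is easy to detect, unlike the weak transient coupling the paper targets.

```python
import numpy as np

def transfer_entropy(x, y, bins=8):
    # Plug-in estimate of TE(X -> Y) with history length 1:
    # TE = sum p(y', y, x) * log[ p(y'|y, x) / p(y'|y) ]
    yf, yp, xp = y[1:], y[:-1], x[:-1]           # Y future, Y past, X past
    joint, _ = np.histogramdd(np.column_stack([yf, yp, xp]), bins=bins)
    p = joint / joint.sum()                       # p(y_{t+1}, y_t, x_t)
    p_ypxp = p.sum(axis=0)                        # p(y_t, x_t)
    p_yfyp = p.sum(axis=2)                        # p(y_{t+1}, y_t)
    p_yp = p.sum(axis=(0, 2))                     # p(y_t)
    te = 0.0
    for i, j, k in zip(*np.nonzero(p)):
        te += p[i, j, k] * np.log(p[i, j, k] * p_yp[j] / (p_ypxp[j, k] * p_yfyp[i, j]))
    return te

# Toy pair where X drives Y but not vice versa.
rng = np.random.default_rng(1)
n = 5000
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(n - 1):
    y[t + 1] = 0.5 * y[t] + 0.8 * x[t] + 0.1 * rng.standard_normal()
print(transfer_entropy(x, y), transfer_entropy(y, x))
```

In the paper's regime, where the coupling is significant only momentarily, this unweighted average over all time positions washes out the directed signal; AeTE counteracts this by re-weighting the positions with attention coefficients.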
In order to maximize AeTE, the causal attention mechanism trains a neural network to autonomously allocate high attention coefficients a_t at times t where the information of parent variables effectively reduces the uncertainty of child variables, and to ignore other positions by setting a_t close to zero. If we regard each value in a time series as a feature, this attention allocation is equivalent to removing the non-causal features Kusner et al. (2017); Hu et al. (2021). However, noise in empirical samples may also produce high transfer entropy regions, which leads to spurious causal effects even when using causal attention. We therefore add a binary classification model that, guided by causal attention, performs more sophisticated inference focused on these critical regions and distinguishes noisy fluctuations from the sparse emergence of genuine causal effect. We approach this class of causal inference tasks via small-sample supervised learning. Although the training and test data exhibit a distribution shift in the small-sample setting, they arise from an identical underlying generation process. Thus, if the model gains insight into the underlying dynamics, namely the coupling-drive relevant to causal inference, the understanding acquired from small samples can be effectively utilized in the test environment Bareinboim & Pearl (2014); Battaglia et al. (2016); Makhlouf et al. (2020); Pessach & Shmueli (2022). The role of causal attention is to help the classification model gain this insight. Our contributions are summarized as follows:

1. We introduce causal attention, a causal reasoning mechanism that identifies the positions of a time series at which the causal effect emerges and guides a classification model to infer causality focusing on these critical positions. Without any prior knowledge of the dynamics, the mechanism determines the areas where the coupling force is substantially different from zero.

2. By formulating Transfer Entropy as the difference between two mutual informations, and based on the dual representation of the Kullback-Leibler (KL) divergence, we design a differentiable metric, Attention-extended Transfer Entropy, as the objective function of the proposed causal attention mechanism.

3. Our method significantly improves reconstruction performance on synthetic and real causal networks using data generated by five well-known neural dynamic models, and the number of labels required is very small compared to the size of the causal networks.

Our methodology has limitations, i.e., cases in which the performance improvement is smaller:

1. Dense networks, where a variable is coupled with many driving variables, so that their causal effects overlap and are harder to distinguish.

2. Intense noise, which leads the causal attention mechanism to falsely identify high transfer entropy regions; the downstream classifier then extracts non-causal features, reducing its generalization.

3. Strongly coupled systems, which are dominated by synchronization phenomena in which the dynamic behaviors of all variables are similar.
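The decomposition behind the differentiable metric can be written out. With history length 1 for brevity (the notation here is illustrative and need not match the paper's exact formulation), transfer entropy is a difference of two mutual informations, each of which is a KL divergence and therefore admits the Donsker-Varadhan dual representation that a neural network can maximize:

```latex
% Transfer entropy as a difference of mutual informations:
\mathrm{TE}_{X \to Y} = I\big(Y_{t+1};\,(Y_t, X_t)\big) - I\big(Y_{t+1};\,Y_t\big)

% Each mutual information is a KL divergence between the joint and the product of marginals:
I(A;B) = D_{\mathrm{KL}}\big(P_{AB}\,\big\|\,P_A \otimes P_B\big)

% Donsker--Varadhan dual representation, with T_\theta a trainable network:
D_{\mathrm{KL}}(P\,\|\,Q) = \sup_{T_\theta}\;\mathbb{E}_P[T_\theta] - \log \mathbb{E}_Q\big[e^{T_\theta}\big]
```

Weighting the time positions in these expectations by the attention coefficients a_t is what, in spirit, turns transfer entropy into the attention-extended objective described above.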

2.1. DEFINITION OF TRANSFER ENTROPY

Transfer entropy, an information-theoretic causality measure, detects information flow between two time series X and Y. It measures the degree of non-symmetric dependence




