AMORTIZED CAUSAL DISCOVERY: LEARNING TO INFER CAUSAL GRAPHS FROM TIME-SERIES DATA

Anonymous

Abstract

Standard causal discovery methods must fit a new model whenever they encounter samples from a new underlying causal graph. However, these samples often share relevant information (for instance, the dynamics describing the effects of causal relations) which is lost when following this approach. We propose Amortized Causal Discovery, a novel framework that leverages such shared dynamics to learn to infer causal relations from time-series data. This enables us to train a single, amortized model that infers causal relations across samples with different underlying causal graphs, and thus makes use of the information that is shared. We demonstrate experimentally that this approach, implemented as a variational model, leads to significant improvements in causal discovery performance, and show how it can be extended to perform well under hidden confounding.

1. INTRODUCTION

Inferring causal relations in observational time-series is central to many fields of scientific inquiry (Berzuini et al., 2012; Spirtes et al., 2000). Suppose you want to analyze fMRI data, which measures the activity of different brain regions over time: how can you infer the (causal) influence of one brain region on another? This question is addressed by the field of causal discovery (Glymour et al., 2019). Methods within this field allow us to infer causal relations from observational data when interventions (e.g. randomized trials) are infeasible, unethical, or too expensive. In time-series, the assumption that causes temporally precede their effects enables us to discover causal relations in observational data (Peters et al., 2017), with approaches relying on conditional independence tests (Entner and Hoyer, 2010), scoring functions (Chickering, 2002), or deep learning (Tank et al., 2018).

All of these methods assume that samples share a single underlying causal graph, and refit a new model whenever this assumption does not hold. However, samples with different underlying causal graphs may share relevant information, such as the dynamics describing the effects of causal relations. fMRI test subjects may have varying brain connectivity but the same underlying neurochemistry; social networks may have differing structure but comparable interpersonal relationships; different stocks may relate differently to one another but obey similar market forces. Despite a range of relevant applications, inferring causal relations across samples with different underlying causal graphs remains largely unexplored.

In this paper, we propose a novel causal discovery framework for time-series that embraces this aspect: Amortized Causal Discovery (Fig. 1). In this framework, we learn to infer causal relations across samples with different underlying causal graphs but shared dynamics.
We achieve this by separating the prediction of causal relations from the modeling of their dynamics: an amortized encoder predicts the edges in the causal graph, and a decoder models the dynamics of the system under the predicted causal relations. This setup allows us to pool statistical strength across samples and to achieve significant improvements in performance with additional training data. It also allows us to infer causal relations in previously unseen samples without refitting our model. Additionally, we show that Amortized Causal Discovery allows us to improve robustness under hidden confounding by modeling the unobserved variables with the amortized encoder.

Our contributions are as follows:

• We formalize Amortized Causal Discovery (ACD), a novel framework for causal discovery in time-series, in which we learn to infer causal relations from samples with different underlying causal graphs but shared dynamics.



¹ Instantaneous connections are connections between two variables at the same time step.


Figure 1: Amortized Causal Discovery. We propose to train a single model that infers causal relations across samples with different underlying causal graphs but shared dynamics. This allows us to generalize across samples and to improve our performance with additional training data. In contrast, previous approaches (Section 2) fit a new model for every sample with a different underlying causal graph.

• We propose a variational model for ACD, applicable to multi-variate, non-linear data.
• We present experiments demonstrating the effectiveness of this model on a range of causal discovery datasets, both in the fully observed setting and under hidden confounding.
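To make the encoder-decoder separation concrete, here is a minimal numpy sketch of the setup described in Section 1. It is illustrative only: the correlation-based encoder, the linear decoder, and all names (`amortized_encoder`, `shared_decoder`, `phi`, `theta`) are simplified stand-ins for the learned neural components of the actual variational model.

```python
import numpy as np

def amortized_encoder(sample, phi):
    """q_phi(z | x): predict a probability for every directed edge i -> j
    from a full time-series sample. Here: a sigmoid read-out of lagged
    co-activations, standing in for a learned graph neural network."""
    lagged = sample[:-1].T @ sample[1:]           # (N, N) lagged co-activations
    probs = 1.0 / (1.0 + np.exp(-phi * lagged))   # sigmoid -> edge probabilities
    np.fill_diagonal(probs, 0.0)                  # exclude self-edges
    return probs

def shared_decoder(x_t, edge_probs, theta):
    """p_theta(x^{t+1} | x^t, z): one dynamics model reused across all samples,
    propagating messages only along the predicted edges."""
    return x_t + theta * (edge_probs.T @ x_t)

rng = np.random.default_rng(0)
phi, theta = 0.1, 0.05                            # stand-ins for learned parameters

# Samples with *different* underlying graphs are handled by the *same*
# parameters (phi, theta): no per-sample refitting.
for _ in range(2):
    sample = rng.standard_normal((50, 3))         # (time-steps, variables)
    edge_probs = amortized_encoder(sample, phi)   # amortized causal graph estimate
    x_next = shared_decoder(sample[-1], edge_probs, theta)
```

The key design point this sketch illustrates is that only `edge_probs` varies per sample, while the dynamics parameters are shared, which is what lets additional training samples improve the model.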

2. BACKGROUND: GRANGER CAUSALITY

Granger causality (Granger, 1969) is one of the most commonly used approaches to infer causal relations from observational time-series data. Its central assumption is that causes precede their effects: if the prediction of the future of time-series Y can be improved by knowing past elements of time-series X, then X "Granger causes" Y. Originally, Granger causality was defined for linear relations; we follow the more recent definition of Tank et al. (2018) for non-linear Granger causality:

Definition 2.1 (Non-Linear Granger Causality). Given $N$ stationary time-series $x = \{x_1, \ldots, x_N\}$ across time-steps $t = \{1, \ldots, T\}$ and a non-linear autoregressive function $g_j$, such that

$$x_j^{t+1} = g_j(x_1^{\le t}, \ldots, x_N^{\le t}) + \varepsilon_j^{t+1},$$

where $x_j^{\le t} = (\ldots, x_j^{t-1}, x_j^t)$ denotes the present and past of series $j$ and $\varepsilon_j^{t+1}$ represents independent noise. In this setup, time-series $i$ Granger causes $j$ if $g_j$ is not invariant to $x_i^{\le t}$, i.e. if there exists $x_i'^{\le t} \neq x_i^{\le t}$ such that $g_j(x_1^{\le t}, \ldots, x_i'^{\le t}, \ldots, x_N^{\le t}) \neq g_j(x_1^{\le t}, \ldots, x_i^{\le t}, \ldots, x_N^{\le t})$.

Granger causal relations are equivalent to causal relations in the underlying directed acyclic graph if all relevant variables are observed and no instantaneous¹ connections exist (Peters et al., 2013; 2017, Theorem 10.1).

Many methods for Granger causal discovery, including vector autoregressive models (Hyvärinen et al., 2010) and more recent deep learning-based approaches (Khanna and Tan, 2020; Tank et al., 2018; Wu et al., 2020), can be encapsulated by a particular framework:

1. Define a function $f_\theta$ with parameters $\theta$ that predicts the next time-step of the input $x$.
2. Fit $f_\theta$ to $x$ by minimizing some loss $L$: $\hat{\theta} = \operatorname{argmin}_\theta L(x, f_\theta)$.
3. Apply some fixed function $h$ (e.g. thresholding) to the learned parameters to produce the Granger causal graph estimate for $x$: $\hat{G}_x = h(\hat{\theta})$.

For instance, Tank et al. (2018) infer the Granger causal relations through examination of the fitted weights $\hat{\theta}$: if all outgoing weights $w_{ij}$ between time-series $i$ and $j$ are zero, then $i$ does not Granger-cause $j$.
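As a concrete instance of this three-step framework, the following sketch (illustrative, not taken from any of the cited methods) applies it in the simplest linear case: fit a first-order vector autoregressive model by least squares, then threshold its weights to read off the Granger causal graph. The simulated data, the lag of one, and the threshold of 0.2 are all arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two time-series where series 0 Granger-causes series 1, not vice versa.
T = 500
x = np.zeros((T, 2))
for t in range(1, T):
    x[t, 0] = 0.5 * x[t - 1, 0] + 0.1 * rng.standard_normal()
    x[t, 1] = 0.8 * x[t - 1, 0] + 0.5 * x[t - 1, 1] + 0.1 * rng.standard_normal()

# Step 1: a linear autoregressive model f_theta that predicts x^t from x^{t-1}.
past, future = x[:-1], x[1:]

# Step 2: fit theta by minimizing squared error (ordinary least squares).
theta, *_ = np.linalg.lstsq(past, future, rcond=None)  # theta[i, j]: weight of series i on series j

# Step 3: apply a fixed thresholding function h to obtain the graph estimate.
G_hat = (np.abs(theta) > 0.2).astype(int)  # edge i -> j iff G_hat[i, j] == 1
print(G_hat)
```

Note that this pipeline ties the fitted parameters to one sample: data generated under a different causal graph would require refitting `theta` from scratch, which is exactly the limitation that amortization addresses.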

