CONTINUAL LIFELONG CAUSAL EFFECT INFERENCE WITH REAL-WORLD EVIDENCE

Anonymous authors
Paper under double-blind review

Abstract

The era of real-world evidence has witnessed an increasing availability of observational data, which greatly facilitates the development of causal effect inference. Although significant advances have been made to overcome the challenges in causal effect estimation, such as missing counterfactual outcomes and selection bias, existing methods only focus on source-specific and stationary observational data. In this paper, we investigate a new research problem of causal effect inference from incrementally available observational data, and present three new evaluation criteria accordingly: extensibility, adaptability, and accessibility. We propose a Continual Causal Effect Representation Learning method for estimating causal effects with observational data that are incrementally available from non-stationary data distributions. Instead of requiring access to all previously seen observational data, our method stores only a limited subset of feature representations learned from previous data. By combining selective and balanced representation learning, feature representation distillation, and feature transformation, our method achieves continual causal effect estimation for new data without compromising the estimation capability for the original data. Extensive experiments demonstrate the significance of continual causal effect inference and the effectiveness of our method.

1 INTRODUCTION

Causal effect inference is a critical research topic across many domains, such as statistics, computer science, public policy, and economics. Randomized controlled trials (RCTs), which randomly assign participants to a treatment or control group, are usually considered the gold standard for causal effect inference. When an RCT is conducted, the only expected difference between the treatment and control groups is the outcome variable being studied. In reality, however, randomized controlled trials are often time-consuming and expensive, so a study cannot involve many subjects, who in turn may not be representative of the real-world population the intervention would eventually target. Nowadays, estimating causal effects from observational data has become an appealing research direction owing to the large amount of available data and the low budget requirements compared with RCTs (Yao et al., 2020). Researchers have developed various strategies for causal effect inference with observational data, such as tree-based methods (Chipman et al., 2010; Wager & Athey, 2018), representation learning methods (Johansson et al., 2016; Li & Fu, 2017; Shalit et al., 2017; Chu et al., 2020), adapted Bayesian algorithms (Alaa & van der Schaar, 2017), generative adversarial nets (Yoon et al., 2018), variational autoencoders (Louizos et al., 2017), and so on. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods only focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and come from a single source. This assumption is unsubstantiated in practice for two reasons.
The first reason stems from the characteristics of observational data, which are incrementally available from non-stationary data distributions. For instance, the number of electronic medical records in one hospital grows every day, and the electronic medical records for one disease may come from different hospitals or even different countries. This characteristic implies that one cannot access all observational data at a single time point and from a single source. The second reason is the realistic concern of accessibility. For example, when new observational data become available and we want to refine the model previously trained on the original data, the original training data may no longer be accessible for a variety of reasons, e.g., legacy data may be unrecorded, proprietary, too large to store, or subject to privacy constraints (Zhang et al., 2020). This practical concern of accessibility is ubiquitous in various academic and industrial applications. In short, in the era of big data, we face new challenges in causal inference with observational data: the extensibility for incrementally available observational data, the adaptability for the additional domain adaptation problem beyond the imbalance between treatment and control groups within one source, and the accessibility for a huge amount of data. Existing causal effect inference methods, however, are unable to deal with these new challenges, i.e., extensibility, adaptability, and accessibility. Although it is possible to adapt existing causal inference methods to address them, the adapted methods still have inevitable defects. Three straightforward adaptation strategies are described as follows.
(1) If we directly apply the model previously trained on the original data to new observational data, the performance on the new task will be very poor due to domain shift among different data sources. (2) If we use the newly available data to re-train the previously learned model, adapting to changes in the data distribution, old knowledge will be completely or partially overwritten by the new knowledge, which can result in severe performance degradation on old tasks. This is the well-known catastrophic forgetting problem (McCloskey & Cohen, 1989; French, 1999). (3) To overcome the catastrophic forgetting problem, we may rely on storing the old data, combining the old and new data, and re-training the model from scratch. However, this strategy is memory-inefficient and time-consuming, and it raises practical concerns such as copyright or privacy issues when storing data for a long time (Samet et al., 2013). Our empirical evaluations in Section 4 demonstrate that each of these three strategies, in combination with existing causal effect inference methods, is deficient. To address the above issues, we propose a Continual Causal Effect Representation Learning method (CERL) for estimating causal effects with incrementally available observational data. Instead of requiring access to all previous observational data, we store only a limited subset of feature representations learned from previous data. By combining selective and balanced representation learning, feature representation distillation, and feature transformation, our method preserves the knowledge learned from previous data and updates it by leveraging new data, so that it achieves continual causal effect estimation for new data without compromising the estimation capability for previous data.
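To make the rehearsal-plus-distillation idea concrete, the following is a minimal numpy sketch (not the paper's actual architecture): a linear map stands in for the learned representation network, a small subset of old inputs and their old-model representations is stored instead of the full dataset, and a distillation penalty discourages the updated encoder from drifting on those stored representations. All names (`W_old`, `W_new`, `distillation_loss`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "encoders": the old model's weights and a candidate update
# obtained after training on new data.
W_old = rng.normal(size=(5, 3))
W_new = W_old + 0.1 * rng.normal(size=(5, 3))

# Instead of keeping all old data, store a small subset of inputs together
# with the feature representations the old encoder produced for them.
stored_x = rng.normal(size=(8, 5))
stored_reps = stored_x @ W_old

def distillation_loss(W, stored_x, stored_reps):
    """Mean squared drift of an encoder W from the stored old representations."""
    new_reps = stored_x @ W
    return float(np.mean((new_reps - stored_reps) ** 2))

# Added to the training objective, this term penalizes forgetting:
loss = distillation_loss(W_new, stored_x, stored_reps)
```

In an actual deep model the linear map would be a neural encoder and the penalty would be one term in the overall loss, but the bookkeeping, storing representations rather than raw data, is the same.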
To summarize, our main contributions include:
• Our work is the first to introduce the continual lifelong causal effect inference problem for incrementally available observational data, together with three corresponding evaluation criteria, i.e., extensibility, adaptability, and accessibility.
• We propose a new framework for continual lifelong causal effect inference based on deep representation learning and continual learning.
• Extensive experiments demonstrate the deficiency of existing methods when facing incrementally available observational data, as well as our model's outstanding performance.

2 BACKGROUND AND PROBLEM STATEMENT

Suppose that the observational data contain n units collected from d different domains, and the d-th dataset D_d contains the data {(x, y, t) | x ∈ X, y ∈ Y, t ∈ T} collected from the d-th domain, which comprises n_d units. Let X denote all observed variables, Y denote the outcomes in the observational data, and T denote the binary treatment variable. Let D_{1:d} = {D_1, D_2, ..., D_d} be the combination of the d datasets, separately collected from the d different domains. The d datasets {D_1, D_2, ..., D_d} share the same observed variables, but because they are collected from different domains, each dataset has a different distribution with respect to X, Y, and T.

In this paper, we follow the potential outcome framework for estimating treatment effects (Rubin, 1974; Splawa-Neyman et al., 1990). Each unit in the observational data received one of two treatments. Let t_i denote the treatment assignment for unit i, i = 1, ..., n. For binary treatments, t_i = 1 indicates the treatment group and t_i = 0 the control group. The outcome for unit i is denoted by y_t^i when treatment t is applied to unit i; that is, y_1^i is the potential outcome of unit i in the treatment group and y_0^i is the potential outcome of unit i in the control group. For observational data, only one of the potential outcomes is observed. The observed outcome is called the factual outcome, and the remaining unobserved potential outcome is called the counterfactual outcome. The individual treatment effect (ITE) for unit i is the difference between the two potential outcomes, ITE_i = y_1^i − y_0^i.
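The potential-outcome quantities above can be illustrated on synthetic data, where, unlike in real observational data, both potential outcomes are known for every unit. The sketch below (an illustration, not the paper's benchmark) generates both outcomes, derives per-unit effects and their average, and shows how a treatment assignment splits outcomes into factual and counterfactual parts; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic setting: both potential outcomes are generated, which is never
# possible with real observational data (only the factual outcome is seen).
x = rng.normal(size=(n, 3))
y0 = x @ np.array([1.0, 0.5, -0.2]) + rng.normal(scale=0.1, size=n)
y1 = y0 + 2.0 + 0.5 * x[:, 0]      # ground-truth effect: 2 + 0.5 * x_0
t = rng.integers(0, 2, size=n)     # binary treatment assignment t_i

ite = y1 - y0                      # individual treatment effect per unit
ate = ite.mean()                   # average over units

# Factual / counterfactual split: only y_f would be observed in practice;
# estimating y_cf is the core difficulty of causal effect inference.
y_f = np.where(t == 1, y1, y0)
y_cf = np.where(t == 1, y0, y1)
```

Because the effect here is 2 + 0.5·x_0 with x_0 centered at zero, the average effect is close to 2; a causal inference method only ever sees (x, t, y_f) and must recover such quantities without y_cf.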

