SYNCTWIN: TRANSPARENT TREATMENT EFFECT ESTIMATION UNDER TEMPORAL CONFOUNDING

Abstract

Estimating causal treatment effects from observational data is a problem with few solutions when the confounders have a temporal structure, e.g. the history of disease progression might impact both treatment decisions and clinical outcomes. For such a challenging problem, it is desirable for the method to be transparent: able to pinpoint the small subset of data points that contribute most to an estimate, and to clearly indicate whether the estimate is reliable. This paper develops a new method, SyncTwin, to overcome temporal confounding in a transparent way. SyncTwin estimates the treatment effect of a target individual by comparing its outcome with that of its synthetic twin, which is constructed to closely match the target in a representation of the temporal confounders. SyncTwin achieves transparency by constraining the synthetic twin to depend only on a weighted combination of a few other individuals in the dataset. Moreover, the quality of the synthetic twin can be assessed by a performance metric, which also indicates the reliability of the estimated treatment effect. Experiments demonstrate that SyncTwin outperforms the benchmarks in clinical observational studies while remaining transparent.

1. INTRODUCTION

Estimating the causal individual treatment effect (ITE) on patient outcomes using observational data (observational studies) has become a promising alternative to clinical trials as large-scale electronic health records become increasingly available (Booth & Tannock, 2014). Figure 1 illustrates a common setting in medicine, which is the focus of this work (DiPietro, 2010): an individual may start the treatment at some observed time (black dashed line), and we want to estimate the ITE on the outcomes over time after the treatment starts (shaded area). The key limitation of observational studies is that treatment allocation is not randomised but typically influenced by prior measurable static covariates (e.g. gender, ethnicity) and temporal covariates (e.g. all historical medical diagnoses and conditions, squares in Figure 1). When these covariates also modulate the patient outcomes, they introduce confounding bias into the direct estimation of the ITE (Psaty et al., 1999). Although a plethora of methods overcome confounding bias by adjusting for static covariates (Yoon et al., 2018; Yao et al., 2018; Louizos et al., 2017; Shalit et al., 2017; Li & Fu, 2017; Alaa & van der Schaar, 2017; Johansson et al., 2016), few existing works take advantage of temporal covariates that are measured irregularly over time (Figure 1) (Bica et al., 2020; Lim et al., 2018; Schulam & Saria, 2017; Roy et al., 2017). Overcoming confounding bias due to temporal covariates is especially important for medical research because clinical treatment decisions are often based on the temporal progression of a disease. Transparency is highly desirable in such a challenging problem. Although transparency is a general concept, we focus on two specific aspects (Arrieta et al., 2020).
(1) Explainability: the method should estimate the ITE of any given individual (the target individual) based on a small subset of other individuals (contributors) whose amount of contribution can be quantified (e.g. using a weight between 0 and 1). Although the estimates for different target individuals may depend on different contributors, the method can always shortlist the few contributors so that an expert can understand the rationale for each estimate. (2) Trustworthiness: the method should identify the target individuals whose ITE cannot be reliably estimated due to violated assumptions, lack of data, or other failure modes. Being transparent about what the method cannot do improves overall trustworthiness because it guides the experts to use the method only when it is deemed reliable. Inspired by the well-established Synthetic Control method in Statistics and Econometrics (Abadie et al., 2010; Abadie, 2019), we propose SyncTwin, a transparent ITE estimation method that deals with temporal confounding. Figure 2 A illustrates the schematics of SyncTwin. SyncTwin starts by encoding the irregularly-measured temporal covariates as representation vectors. For each treated target individual, SyncTwin selects and weights a few contributors from the control group based on their representation vectors and a sparsity constraint. SyncTwin then constructs a synthetic twin whose representation vector and outcomes are the weighted averages of the contributors'. Finally, the ITE is estimated as the difference between the target individual's and the synthetic twin's outcomes after treatment. The difference in their outcomes before treatment indicates the quality of the synthetic twin and whether the model assumptions hold. If the target individual and the synthetic twin do not match in pre-treatment outcomes, the estimated ITE should not be considered trustworthy.
Transparency of SyncTwin. SyncTwin achieves explainability by selecting only a few contributors for each target individual. It achieves trustworthiness by quantifying the confidence one should place in the estimated ITE via the difference between the target's and the synthetic twin's pre-treatment outcomes.

2. PROBLEM SETTING

We consider a clinical observational study with N individuals indexed by i ∈ [N] = {1, ..., N}. Let a_i ∈ {0, 1} be the treatment indicator, with a_i = 1 if i started to receive the treatment at some time and a_i = 0 if i never initiated the treatment. We realign the time steps such that all treatments were initiated at time t = 0. Let I_1 = {i ∈ [N] | a_i = 1} and I_0 = {i ∈ [N] | a_i = 0} be the sets of the treated and the control respectively, with sizes N_1 = |I_1| and N_0 = |I_0|. The time t = 0 is of special significance because it marks the initiation of the treatment (black dashed line in Figure 1). We call the period t < 0 the pre-treatment period and the period t ≥ 0 the treatment period (shaded area in Figure 1). Temporal covariates are observed during the pre-treatment period only and may influence the treatment decision and the outcome. Let X_i = [x_{is}]_{s∈[S_i]} be the sequence of covariates x_{is} ∈ R^D, which includes S_i ∈ N observations taken at times t ∈ T_i = {t_{is}}_{s∈[S_i]}, where all t_{is} ∈ R and t_{is} < 0. Note that x_{is} may also include static covariates whose values are constant over time. To allow the covariates to be sampled at different frequencies, let m_{is} ∈ {0, 1}^D be the masking vector, with m_{isd} = 1 indicating that the d-th element of x_{is} is sampled. The outcome of interest is observed both before and after the treatment. In many cases, researchers are interested in outcomes measured at regular time intervals (e.g. the monthly average blood pressure). Hence, let T⁻ = {−M, ..., −1} and T⁺ = {0, ..., H − 1} be the observation times before and after treatment initiation. In this work, we focus on real-valued outcomes y_{it} ∈ R observed at t ∈ T⁻ ∪ T⁺. We arrange the outcomes after treatment into an H-dimensional vector y_i = [y_{it}]_{t∈T⁺} ∈ R^H, and similarly define the pre-treatment outcome vector y⁻_i = [y_{it}]_{t∈T⁻} ∈ R^M.
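To make the notation concrete, the quantities above can be collected in a small container; this is an illustrative sketch (the class, field names and toy values are ours, not part of the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Individual:
    """One individual i in the notation of Section 2 (hypothetical container)."""
    a: int                # treatment indicator a_i in {0, 1}
    x: np.ndarray         # covariates X_i, shape (S_i, D), observed at times t < 0
    m: np.ndarray         # masking vectors m_is, shape (S_i, D), 1 = sampled
    t: np.ndarray         # irregular observation times t_is < 0, shape (S_i,)
    y_pre: np.ndarray     # pre-treatment outcomes y^-_i, shape (M,)
    y_post: np.ndarray    # post-treatment outcomes y_i, shape (H,)

# a treated individual with S_i = 3 irregular covariate observations, D = 2
ind = Individual(
    a=1,
    x=np.array([[1.0, 0.2], [1.1, 0.0], [0.9, 0.3]]),
    m=np.array([[1, 1], [1, 0], [1, 1]]),     # second visit missed covariate d = 2
    t=np.array([-2.5, -1.2, -0.3]),
    y_pre=np.array([3.1, 3.0, 2.9]),          # M = 3 regular pre-treatment steps
    y_post=np.array([2.0, 1.9, 1.8, 1.7]),    # H = 4 regular post-treatment steps
)
assert ind.x.shape[0] == ind.t.shape[0]       # S_i observations at S_i times
```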
Using the potential outcome framework (Rubin, 2005), let y_{it}(a_i) ∈ R denote the potential outcome at time t in a world where i received the treatment as indicated by a_i. Let y_i(1) = [y_{it}(1)]_{t∈T⁺} ∈ R^H and y⁻_i(1) = [y_{it}(1)]_{t∈T⁻} ∈ R^M, and similarly for y_i(0) and y⁻_i(0). The individual treatment effect (ITE) is defined as τ_i = y_i(1) − y_i(0) ∈ R^H. Under the consistency assumption (discussed later in detail), the factual outcome is observed, y_i(a_i) = y_i, which means that for any i ∈ [N] only the unobserved counterfactual outcome y_i(1 − a_i) needs to be estimated in order to estimate the ITE. To simplify the notation, we focus on estimating the ITE for the treated, i.e. τ̂_i = y_i(1) − ŷ_i(0) for i ∈ I_1, though the same approach applies to the control i ∈ I_0 and new units i ∉ [N] without loss of generality (A.5). SyncTwin relies on the following assumptions. (1) Consistency, also known as the Stable Unit Treatment Value Assumption (Rubin, 1980): y_{it}(a_i) = y_{it}, ∀i ∈ [N], t ∈ T⁻ ∪ T⁺. (2) No anticipation, also known as causal systems (Abbring & Van den Berg, 2003; Dash, 2005): y_{it} = y_{it}(1) = y_{it}(0), ∀t ∈ T⁻, i ∈ [N]. (3) Data generating model: the assumed directed acyclic graph is visualized in Figure 2 B (Pearl, 2009), where we introduce two variables c_i ∈ R^K and v_i ∈ R^U in addition to the previously defined ones. The latent variable c_i is the common cause of y_{it}(0) and x_{is}, and it indirectly influences a_i through x_{is}. As we show later, SyncTwin tries to learn c_i and to construct a synthetic twin that has the same c_i as the target. The variable v_i is an unobserved confounder. Although SyncTwin, like all other ITE methods, works better without unobserved confounders (i.e. v_i = 0, ∀i ∈ [N]), we develop a unique checking procedure in Equation (4) to flag cases where some v_i ≠ 0 affects the outcomes. We also demonstrate that, under certain favourable conditions, SyncTwin can overcome the impact of v_i.
To establish the theoretical results, we further assume y_{it}(0) follows a latent factor model with c_i, v_i as the latent "factors" (Bai & Ng, 2008): y_{it}(0) = q_t⊤c_i + u_t⊤v_i + ξ_{it}, ∀t ∈ T⁻ ∪ T⁺, (1) where q_t ∈ R^K, u_t ∈ R^U are weight vectors and ξ_{it} is white noise. We require the weight vectors to have unit norm, ||q_t|| = 1, ∀t ∈ T⁻ ∪ T⁺ (Xu, 2017), which does not reduce the expressiveness of the model. We further require the dimensionality of the latent factor to be smaller than the number of time steps before or after treatment, i.e. K < min(M, H). Furthermore, let Q⁻ = [q_t]_{t∈T⁻} ∈ R^{M×K} and Q = [q_t]_{t∈T⁺} ∈ R^{H×K} denote the matrices that stack the weight vectors q_t before and after treatment as rows, respectively. The latent factor model assumption may seem restrictive, but as we show in Appendix A.4 it is applicable to many scenarios. In the simulation study (5.1) we further show that SyncTwin performs well even when the data is generated not by model (1) but by a set of differential equations. We compare our assumptions with those used in related works in Appendix A.3.
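As a sanity check on the latent factor model, one can simulate y_it(0) = q_t⊤c_i + ξ_it directly (here with no unobserved confounder, v_i = 0, unit-norm rows for Q⁻ and Q, and made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, H, N = 2, 5, 4, 100                 # K < min(M, H) as required by the model

# weight vectors q_t with unit norm, stacked as rows of Q^- (M x K) and Q (H x K)
Q_pre = rng.normal(size=(M, K))
Q_pre /= np.linalg.norm(Q_pre, axis=1, keepdims=True)
Q_post = rng.normal(size=(H, K))
Q_post /= np.linalg.norm(Q_post, axis=1, keepdims=True)

c = rng.normal(size=(N, K))               # latent factors c_i
noise = 0.01 * rng.normal(size=(N, M + H))  # white noise xi_it

# y_it(0) = q_t' c_i + xi_it, stacked over t in T^- then T^+
y0 = np.concatenate([c @ Q_pre.T, c @ Q_post.T], axis=1) + noise
assert y0.shape == (N, M + H)
```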

3.1. SYNTHETIC CONTROL

Similar to SyncTwin, Synthetic Control (SC) (Abadie, 2019) and its extensions (Athey et al., 2018; Amjad et al., 2018) estimate the ITE based on synthetic control outcomes. However, when applied to temporal confounding, SC will flatten the temporal covariates [x_{is}]_{s∈[S_i]} into a fixed-sized (high-dimensional) vector x_i and use it to construct the twin. As a result, SC does not allow the covariates to be variable-length or sampled at different frequencies (otherwise x_i's dimensionality would vary across individuals). In contrast, SyncTwin can gracefully handle these irregularities because it constructs the twin using the representation vectors. Moreover, the covariates x_i may contain observation noise and other sources of randomness that relate to neither the outcome nor the treatment. Enforcing the target and the twin to have similar x_i will inject this irrelevant noise into the twin, a situation we call over-match (because it resembles over-fit). Over-match undermines ITE estimation, as we show in the simulation study in Section 5.1. Finally, SC assumes y_{it}(0) = q_t⊤x_i + u_t⊤v_i + ξ_{it}, i.e. that the flattened covariates x_i linearly predict y_{it}(0), which is a special case of our assumption (1) and unlikely to hold in many medical applications.

3.2. COVARIATE ADJUSTMENT WITH DEEP LEARNING

In the static setting, covariate adjustment methods fit two functions (deep neural networks) to predict the outcomes with and without treatment, i.e. ŷ_i(0) = f_0(x_i) and ŷ_i(1) = f_1(x_i) (Johansson et al., 2016; Shalit et al., 2017; Yao et al., 2018; Yoon et al., 2018). The ITE is then estimated as τ̂_i = f_1(x_i) − f_0(x_i). Under this framework, various methods have been proposed to address temporal confounding (Lim et al., 2018; Bica et al., 2020). However, these methods generally lack transparency because the black-box neural networks cannot easily pinpoint the contributors to each prediction. Moreover, the prediction accuracy before treatment cannot directly measure the confidence or trustworthiness of the predictions after treatment because the network is highly nonlinear and non-stationary. Lastly, Bica et al. (2020) and Lim et al. (2018) are applicable to a more general setting where the treatment can be turned on and off over time, whereas SyncTwin assumes the outcomes continue to be influenced by the treatment after it starts. Works with similar terminology. Several works in the literature use similar terms such as "twin", although most of them are not related to SyncTwin. We discuss these works in Appendix A.6.
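The two-model recipe can be sketched with ordinary least squares standing in for the deep networks f_0 and f_1; the data, coefficients, and the constant treatment effect of 1.5 below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 400, 4
X = rng.normal(size=(N, D))
a = rng.integers(0, 2, size=N)                       # treatment indicator a_i
# synthetic outcomes with a constant additive treatment effect of 1.5
y = X @ np.array([1.0, 0.0, 2.0, 0.0]) + 1.5 * a + 0.1 * rng.normal(size=N)

X1 = np.column_stack([np.ones(N), X])                # add an intercept column
w0 = np.linalg.lstsq(X1[a == 0], y[a == 0], rcond=None)[0]  # f_0 fit on controls
w1 = np.linalg.lstsq(X1[a == 1], y[a == 1], rcond=None)[0]  # f_1 fit on treated
tau_hat = X1 @ w1 - X1 @ w0                          # tau_i = f_1(x_i) - f_0(x_i)
```

With linear models, τ̂ is transparent; the point of Section 3.2 is that deep f_0, f_1 lose this property.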

4. TRANSPARENT ITE ESTIMATION VIA SYNCTWIN

To explain when and why SyncTwin gives a valid ITE estimate, let us assume that we have learned a representation c̃_i that approximates the latent variable c_i, ∀i ∈ [N] in Equation (1). For a target individual i ∈ I_1, let b_i = [b_{ij}]_{j∈I_0} ∈ R^{N_0} be a vector of weights, each associated with a control individual. A synthetic twin can be generated using b_i as
ĉ_i = Σ_{j∈I_0} b_{ij} c̃_j,   ŷ_{it}(0) = Σ_{j∈I_0} b_{ij} y_{jt}(0) = Σ_{j∈I_0} b_{ij} y_{jt}, ∀t ∈ T⁻ ∪ T⁺,   (2)
where ĉ_i is the synthetic representation and ŷ_{it}(0) is the synthetic outcome under no treatment. The last equality follows from the consistency assumption. Let ŷ_i(0) = [ŷ_{it}(0)]_{t∈T⁺} be the post-treatment synthetic outcome vector, and similarly ŷ⁻_i(0) = [ŷ_{it}(0)]_{t∈T⁻}. The ITE of i can be estimated as
τ̂_i = y_i(1) − ŷ_i(0) = y_i − Σ_{j∈I_0} b_{ij} y_j,   (3)
where again the last equality follows from the consistency assumption. We highlight that y_i and y_j, ∀j ∈ I_0 in the equation above are the observed outcomes. Hence, b_i is the only free parameter that influences the ITE estimator τ̂_i. The following two distances are central to the training and inference procedure:
d^c_i = ||ĉ_i − c̃_i||,   d^y_i = ||ŷ⁻_i(0) − y⁻_i||_1,   (4)
where ||·|| is the vector ℓ2-norm and ||·||_1 is the vector ℓ1-norm.
Minimizing d^c_i to construct synthetic twins. d^c_i indicates how well the synthetic twin matches the target individual in representation space. Intuitively, we should seek to construct a twin that closely matches the target by minimizing d^c_i. This intuition is verified in Proposition 1 (proved in A.1.1).
Proposition 1 (Bias bound on ITE with no unobserved confounders). Suppose that v_i = 0, ∀i ∈ [N] and d^c_i = 0 for some i ∈ I_1 (v_i and d^c_i are defined in Equations 1 and 4 respectively). Then the absolute value of the expected difference between the true and estimated ITE of i is bounded by:
|E[τ_i] − E[τ̂_i]| ≤ |T⁺| ||Σ_{j∈I_0} b_{ij} c_j − c_i|| ≤ |T⁺| (Σ_{j∈I_0} ||c_j − c̃_j|| + ||c_i − c̃_i||).   (5)
Here we show that when d^c_i is minimized to zero and there is no unobserved confounder, the bias of the ITE estimate only depends on how close the learned representations c̃ are to the true latent variables c. We use representation learning to uncover the latent variable c in the next section.
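Equations (2)-(3) reduce to a few lines of linear algebra once the weights b_i are given; this sketch uses hand-picked sparse weights and random outcomes rather than weights fitted by SyncTwin:

```python
import numpy as np

rng = np.random.default_rng(1)
H, N0 = 4, 50
y_control = rng.normal(size=(N0, H))    # observed outcomes y_j of controls j in I_0
y_target = rng.normal(size=H)           # observed treated outcome y_i = y_i(1)

# sparse simplex weights b_i: non-negative, sum to one, only a few non-zeros
b = np.zeros(N0)
b[[3, 17, 42]] = [0.5, 0.3, 0.2]
assert b.min() >= 0 and np.isclose(b.sum(), 1.0)

y_twin = b @ y_control                  # synthetic outcome \hat{y}_i(0), Eq. (2)
ite_hat = y_target - y_twin             # \hat{tau}_i = y_i(1) - \hat{y}_i(0), Eq. (3)
contributors = np.flatnonzero(b)        # the few controls that explain the estimate
```

The non-zero entries of `b` are exactly the shortlist of contributors that makes the estimate explainable.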

Using d^y_i to measure trustworthiness. By definition, d^y_i indicates how well the synthetic pre-treatment outcomes match the target individual's outcomes. Intuitively, matching the outcomes before treatment is a prerequisite for a good estimate of the ITE after treatment (Equations 2 and 3). We formalize this intuition in Proposition 2, which is proved in Appendix A.1.1.
Proposition 2 (Trustworthiness of SyncTwin under no hidden confounders). Suppose that all the outcomes are generated by the model in Equation (1) with the unobserved confounders equal to zero, i.e. v_i = 0, ∀i ∈ [N], and that we reject the estimate τ̂_i if the pre-treatment error d^y_i on T⁻ is larger than δ|T⁻|/|T⁺|. Then the post-treatment ITE estimation error on T⁺ is below δ.
Here we show that if we would like to ensure the ITE error falls below a pre-specified threshold δ, we should reject the estimate τ̂_i when the distance d^y_i > δ|T⁻|/|T⁺|, assuming no unobserved confounder. In other words, d^y_i can be used as an evaluation metric to assess whether the estimated ITE is trustworthy.
Situation with unobserved confounders. In the presence of unobserved confounders v_i ≠ 0, SyncTwin cannot guarantee to correctly estimate the ITE. However, d^y_i can still indicate whether v_i has a significant impact on the pre-treatment outcomes, i.e. the unobserved confounders may exist but only weakly influence the outcomes before treatment. We discuss unobserved confounders in detail in Appendix A.1.2.
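Proposition 2 suggests a simple acceptance rule; a minimal sketch (the function names and toy numbers are ours):

```python
import numpy as np

def d_y(y_pre_twin, y_pre_target):
    """Pre-treatment discrepancy d^y_i = ||y^-_twin - y^-_target||_1 (Equation 4)."""
    return float(np.abs(np.asarray(y_pre_twin) - np.asarray(y_pre_target)).sum())

def accept_estimate(y_pre_twin, y_pre_target, delta, M, H):
    """Proposition 2 acceptance rule: keep the ITE estimate only when
    d^y_i <= delta * |T^-| / |T^+| = delta * M / H."""
    return d_y(y_pre_twin, y_pre_target) <= delta * M / H

# toy example: M = 3 pre-treatment steps, H = 2 post-treatment steps
twin = [3.0, 2.9, 2.8]
target = [3.1, 2.9, 2.7]
assert np.isclose(d_y(twin, target), 0.2)
assert accept_estimate(twin, target, delta=0.5, M=3, H=2)      # 0.2 <= 0.75
assert not accept_estimate(twin, target, delta=0.1, M=3, H=2)  # 0.2 >  0.15
```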

4.1. LEARNING TO REPRESENT TEMPORAL COVARIATES

In this section, we show how SyncTwin learns the representation c̃_i as a proxy for the latent variable c_i using a sequence-to-sequence architecture, as depicted in Figure 3 (A) and discussed below. Architecture. SyncTwin is agnostic to the exact choice of architecture as long as the network translates the covariates into a single representation vector (encode) and reconstructs the covariates from that representation (decode). For this reason, we use the well-proven sequence-to-sequence architecture (Seq2Seq) (Sutskever et al., 2014) with an encoder similar to the one proposed in Bahdanau et al. (2015) and a standard LSTM decoder (Hochreiter & Schmidhuber, 1997). The encoder first obtains a representation at each time step using a recurrent neural network. Instead of the bi-directional LSTM of Bahdanau et al. (2015), we use a GRU-D network because it is designed to encode irregularly-sampled temporal observations (Che et al., 2018). This gives us the sequence h_{is} = GRU-D(h_{i,s−1}, x_{is}, m_{is}, t_{is}), ∀s ∈ [S_i]. Since our goal is to obtain a single representation vector rather than a sequence of representations, we aggregate the h_{is} using the same attentive pooling method as in Bahdanau et al. (2015). The final representation vector is obtained as c̃_i = Σ_{s∈[S_i]} α_{is} h_{is}, where α_{is} = r⊤h_{is}/√K is the attention weight and r ∈ R^K is the attention parameter (Vaswani et al., 2017). The decoder uses the representation c̃_i to reconstruct x_{is} at time t_{is}, ∀s ∈ [S_i]. Since the timing information t_{is} may be lost in c̃_i due to aggregation, we reintroduce it to the decoder by first obtaining a sequence of time representations o_{is} = k_0 + w_0 t_{is}, where o_{is}, k_0, w_0 ∈ R^K, and then concatenating each with c̃_i to obtain e_{is} = c̃_i ⊕ o_{is} ∈ R^{2K}. Reintroducing timing information during decoding is standard practice in Seq2Seq models for irregular time series (Rubanova et al., 2019; Li & Marlin, 2020).
Furthermore, using the time representation o_{is} instead of the raw time value t_{is} is inspired by the success of positional encoding in the self-attention architecture (Vaswani et al., 2017; Gehring et al., 2017). The decoder then applies an LSTM autoregressively to the time-aware representations e_{is} to decode g_{is} = LSTM(g_{i,s−1}, e_{is}), ∀s ∈ [S_i], where g_{is} ∈ R^K. Finally, it uses a linear layer to obtain the reconstructions x̃_{is} = k_1 + W_1 g_{is}, where k_1 ∈ R^D, W_1 ∈ R^{D×K}. Loss functions. We train the networks with a weighted sum of the supervised loss L_s and the reconstruction loss L_r (Figure 3 A):
L_s(D_0) = Σ_{i∈D_0} ||Q̃c̃_i − y_i(0)||,   L_r(D_0, D_1) = Σ_{i∈D_0∪D_1} Σ_{s∈[S_i]} ||(x̃_{is} − x_{is}) ⊙ m_{is}||,   (6)
where D_0 ⊆ I_0, D_1 ⊆ I_1, m_{is} is the masking vector (Section 2), ⊙ denotes the element-wise product, Q̃ ∈ R^{H×K} is a trainable parameter, and ||·|| is the ℓ2-norm. Intuitively, the supervised loss L_s ensures that the learned representation c̃_i is a linear predictor of the outcomes under no treatment y_i(0). A linear function ỹ_i(0) := Q̃c̃_i is used here to be consistent with the data generating model (1). Using a nonlinear function might lead to a smaller L_s, but it would not uncover the latent variable c_i as desired. We justify the supervised loss in Proposition 3 below and present the proof and detailed discussion in Appendix A.1.1.
Proposition 3 (Error bound on the learned representations). Suppose that v_i = 0, ∀i ∈ [N] (v_i is defined in Equation 1). Then the total error on the learned representations for the control, i.e. the first term in the upper bound on the absolute expected difference between the true and estimated ITE (R.H.S. of Equation 5), is bounded as follows: Σ_{j∈I_0} ||c_j − c̃_j|| ≤ β (L_s + Σ_{j∈I_0} ||ξ_j||), where L_s is the supervised loss in Equation 6, ξ_j is the white noise in Equation 1, and β is a constant.
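The attentive pooling step of the encoder can be sketched in isolation; note that the softmax normalization of the attention weights is our assumption, since the text only states the scaled scores r⊤h_is/√K:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attentive_pool(h, r):
    """Collapse per-step representations h_is (S x K) into a single vector
    c_i = sum_s alpha_is h_is, with scores r' h_is / sqrt(K).
    Softmax normalization of alpha is assumed, not stated in the paper."""
    S, K = h.shape
    alpha = softmax(h @ r / np.sqrt(K))   # attention weights over the S steps
    return alpha @ h                      # pooled representation in R^K

rng = np.random.default_rng(2)
h = rng.normal(size=(6, 3))               # S_i = 6 recurrent states, K = 3
r = rng.normal(size=3)                    # attention parameter r
c_tilde = attentive_pool(h, r)
assert c_tilde.shape == (3,)
```

In SyncTwin the `h` sequence would come from a GRU-D over the irregular covariates; any encoder producing per-step states can be pooled this way.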

4.2. CONSTRUCTING SYNTHETIC TWINS

Constraints. We require the weights b_i in Equation (2) to satisfy two constraints: (1) positivity, b_{ij} ≥ 0, ∀i ∈ [N], j ∈ I_0; and (2) unit ℓ1-norm, ||b_i||_1 = Σ_{j∈I_0} b_{ij} = 1. Regularizing b_i is vital because its dimensionality N_0 can easily exceed ten thousand in observational studies. The constraints encourage the solution to be sparse (Tibshirani, 1996); better sparsity leads to fewer contributors and better transparency. Finally, the constraints ensure that the synthetic twin in Equation (2) is a weighted average of the contributors. Therefore the weight b_{ij} directly translates into the "contribution" or "importance" of j to i, further improving transparency. Matching loss. The matching loss finds weights b_i such that the synthetic twin and the target individual match in representations, as depicted in Figure 3 (B): L_m(D_0, D_1) = Σ_{i∈D_1} ||c̃_i − Σ_{j∈D_0} b_{ij} c̃_j||²₂, where again D_0 ⊆ I_0 and D_1 ⊆ I_1. We optimize L_m under the constraints using the Gumbel-Softmax reparameterization detailed in Appendix A.9 (Jang et al., 2016; Maddison et al., 2016).
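A simplified version of the constrained matching can be written by parameterizing b_i = softmax(θ), which enforces positivity and unit ℓ1-norm by construction; the paper instead uses a Gumbel-Softmax reparameterization (Appendix A.9), so treat this plain-softmax gradient descent as a stand-in:

```python
import numpy as np

def fit_weights(C0, c_target, steps=5000, lr=0.1):
    """Minimize ||c_target - b' C0||^2 over the simplex by writing
    b = softmax(theta). Simplifications relative to the paper: plain softmax
    instead of Gumbel-Softmax, full-batch gradient descent, no sparsity temperature.

    C0: (N0, K) learned control representations; c_target: (K,) target representation.
    """
    N0, K = C0.shape
    theta = np.zeros(N0)                      # uniform weights at initialization
    for _ in range(steps):
        e = np.exp(theta - theta.max())
        b = e / e.sum()                       # b on the simplex by construction
        resid = b @ C0 - c_target             # matching-loss residual in R^K
        grad_b = 2.0 * C0 @ resid             # dL/db
        # backprop through softmax: J = diag(b) - b b'
        grad_theta = b * (grad_b - b @ grad_b)
        theta -= lr * grad_theta
    e = np.exp(theta - theta.max())
    return e / e.sum()
```

Usage: `b = fit_weights(C0, c_target)` returns non-negative weights summing to one, whose larger entries identify the contributors.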

4.3. TRAINING, VALIDATION AND INFERENCE

As is standard in machine learning, we perform model training, validation and inference (testing) on three disjoint datasets. At a high level, we train the encoder and decoder on the training data using the loss functions described in Section 4.1. The validation data is then used to validate and tune the hyper-parameters of the encoder and decoder. Finally, we fix the encoder and optimize the matching loss L_m on the testing data to find the weights b_i, which leads to the ITE estimate via Equation (3). The detailed procedure is described in A.8. The hyperparameter sensitivity is studied in A.13.

5.1. SIMULATION STUDY

In this simulation study, we evaluate SyncTwin on the task of estimating the LDL cholesterol-lowering effect of statins, a drug commonly prescribed to hypercholesterolaemic patients. We simulate the ground-truth ITE using the widely adopted Pharmacokinetic-Pharmacodynamic model from the literature (Faltaos et al., 2006; Yokote et al., 2008; Kim et al., 2011):
dp_t/dt = k^in_t − k·p_t;   dd_t/dt = a_t − h·d_t;   dy_t/dt = k·p_t − (d_t/(d_t + d_50))·k·y_t,   (9)
where y_t is the LDL cholesterol level (outcome) and a_t is the indicator of statin treatment. The interpretation of all other variables involved is presented in Appendix A.10. Data generation. Following our convention, the individuals are enrolled at t = 0, the covariates are observed in T = [−S, 0), where S ∈ {15, 25, 45}, and the ITE is to be estimated over the period T⁺ = [0, 4]. We start by generating k^in_t for each individual from the following mixture distribution:
k^in_{it} = g_i⊤f_t;   g_i = δ_i e_{i1} + (1 − δ_i) e_{i2};   δ_i ~iid Bern(p);   e_{in} ~iid N(µ_n, Σ_n), n = 1, 2,   (10)
where f_t ∈ R^6 are the Chebyshev polynomials, Bern(p) is the Bernoulli distribution with success probability p, and N(µ_n, Σ_n) is the Gaussian distribution. To introduce confounding, we vary p between the treated and the control: p = p_0, ∀i ∈ I_0 and p = 1, ∀i ∈ I_1, where p_0 controls the degree of confounding bias. After that, the variables p_t, d_t, y_t are obtained by solving Equation (9) using scipy (Virtanen et al., 2020) and adding independent white noise ~ N(0, 0.1) to the solution. The temporal variables defined above give the covariates x_t = {k^in_t, y_t, p_t, d_t}. Finally, we introduce irregular sampling by creating masks m_{it} ~ Bern(m), where the probability m ∈ {0.3, 0.5, 0.7, 1}. Benchmarks. From the Synthetic Control literature, we considered the original Synthetic Control method (SC) (Abadie et al., 2010), Robust Synthetic Control (RSC) (Amjad et al., 2018) and MC-NNM (Athey et al., 2018).
From the deep learning literature, we compared against the Counterfactual Recurrent Network (CRN) (Bica et al., 2020) and the Recurrent Marginal Structural Network (RMSN) (Lim et al., 2018), which are the state-of-the-art methods for estimating the ITE under temporal confounding. In addition, we included a modified version of CFRNet, which was originally developed for the static setting (Shalit et al., 2017). To allow CFRNet to handle temporal covariates, we replaced its fully-connected encoder with the encoder architecture used by SyncTwin (Section 4.1). We also included the Counterfactual Gaussian Process (CGP) (Schulam & Saria, 2017) and One-Nearest-Neighbour Matching (1NN) (Stuart, 2010) as baselines. The implementation details of all benchmarks are available in Appendix A.7. We also included two ablated versions of SyncTwin: SyncTwin-L_r is trained only with the reconstruction loss and SyncTwin-L_s only with the supervised loss. Main results. We evaluate the mean absolute error (MAE) of ITE estimation: (1/N_1) Σ_{i∈I_1} ||τ_i − τ̂_i||_1. In Table 6, the parameter p_0 controls the level of confounding bias (smaller p_0, larger bias). Additional results for different sequence lengths S and sampling irregularities m are shown in Appendix A.11. SyncTwin achieves the best or equal-best performance in all cases. The full SyncTwin with both loss functions also consistently outperforms the versions trained only with L_r or L_s. As discussed in Section 4 (and Appendix A.1.1), training with only the reconstruction loss L_r leads to significant performance degradation. It is worth highlighting that the data generating model used in this simulation (9) is not the same as SyncTwin's assumed latent factor model (1). This implies that SyncTwin may still achieve good performance when the assumed model (1) does not exactly hold. SC, RSC and MC-NNM underperform because their assumption that the flattened covariates x_i linearly predict the outcome is violated (Section 3).
Furthermore, Table 2 shows that the synthetic twin created by SC matches the target covariates x_i consistently better than SyncTwin's, yet produces worse ITE estimates. This suggests that matching covariates better may not lead to better ITE estimates because the covariates are noisy and the method might over-match (Section 3). In addition, Figure 4 visualizes the weights b_i of SyncTwin and SC in a heatmap. We can clearly see that SyncTwin produces sparser weights, whereas SC needs more contributors to construct a twin that (over-)matches x_i. A quantitative evaluation of the sparsity is provided in Appendix A.12. These findings support our belief that constructing twins in the representation space (SyncTwin) rather than in the high-dimensional observation space (SC) leads to better performance and transparency.
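For intuition, the PK/PD system in Equation (9) can be integrated with a simple forward-Euler scheme; the constants, initial conditions and step size below are made up (the paper solves the ODEs with scipy using the parameters of Appendix A.10):

```python
import numpy as np

def simulate(k_in, a, k=0.5, h=0.3, d50=1.0, dt=0.05, steps=400):
    """Forward-Euler integration of Equation (9):
       dp/dt = k_in(t) - k*p;  dd/dt = a(t) - h*d;
       dy/dt = k*p - (d / (d + d50)) * k * y.
    All parameter values here are illustrative, not the paper's."""
    p, d, y = 0.0, 0.0, 3.0        # arbitrary initial LDL level y_0 = 3.0
    ys = []
    for step in range(steps):
        t = step * dt
        p += dt * (k_in(t) - k * p)
        d += dt * (a(t) - h * d)
        y += dt * (k * p - (d / (d + d50)) * k * y)
        ys.append(y)
    return np.array(ys)

# statins started at t = 10: the d_t / (d_t + d50) term then suppresses y_t
y_treated = simulate(k_in=lambda t: 0.5, a=lambda t: float(t >= 10.0))
y_control = simulate(k_in=lambda t: 0.5, a=lambda t: 0.0)
```

The pointwise difference `y_treated - y_control` over the post-treatment window plays the role of the ground-truth ITE in the simulation.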

5.2. EXPERIMENT ON REAL DATA

Purpose of study. We present a clinical observational study using SyncTwin to estimate the LDL cholesterol-lowering effect of statins in the first year after treatment (Ridker & Cook, 2013). Data source. We used medical records from English National Health Service general practices that contributed anonymised primary care electronic health records to the Clinical Practice Research Datalink (CPRD), covering approximately 6.9 percent of the UK population (Herrett et al., 2015). CPRD was linked to secondary care admissions from Hospital Episode Statistics and national mortality records from the Office for National Statistics. We defined treatment initiation as the date of the first CPRD prescription, and the outcome of interest was measured LDL cholesterol (LDL). Known risk factors for LDL were selected as temporal covariates measured before treatment initiation: HDL cholesterol, systolic blood pressure, diastolic blood pressure, body mass index, pulse, creatinine, triglycerides and smoking status. Our analysis is based on a subset of 125,784 individuals (Appendix A.15), which was split into three equally-sized subsets for training, validation and inference, each with 17,371 treated individuals and 24,557 controls. Evaluation. We evaluate our models using the average treatment effect on the treated (i.e. ATT = E[τ_i | a_i = 1]) to directly correspond to the treatment effects reported in randomised clinical trials; e.g. the Heart Protection Study reported a change of −1.26 mmol/L (SD = 0.06) in LDL cholesterol for participants randomised to statins versus placebo (Group et al., 2007; 2002). We use the sample average on the testing set to estimate the ATT as Σ_{i∈D^te_1} τ̂_i / |D^te_1|, where D^te_1 is the set of individuals in the testing set who received the treatment. SyncTwin estimates the ATT to be −1.25 mmol/L (SD 0.01), which is very close to the result from the clinical trial.
In comparison, CRN and RMSN estimate the ATT to be −0.72 mmol/L (SD 0.01) and −0.83 mmol/L (SD 0.01) respectively. The other benchmark methods either cannot handle irregularly-measured covariates or do not scale to the size of the dataset. Our result suggests that SyncTwin is able to overcome the confounding bias in complex real-world datasets. Transparent ITE estimation. For each individual, we can visualize the outcomes before and after the treatment and compare them with the synthetic twin in order to sense-check the estimate. The individual shown in Figure 5 (top) has a sensible ITE estimate because the synthetic twin matches its pre-treatment outcomes closely over time. In addition to visualization, we can calculate the distance d^y (Equation 4) to quantify the difference between the pre-treatment outcomes. From Figure 5 (bottom left) we can see that in most cases the distance is small, with a median of 0.24 mmol/L (compared to the population average distance of 0.76 mmol/L). This means that if the expert can only tolerate an error of 0.24 mmol/L in ITE estimation, half of the estimates (those with d^y ≤ 0.24 mmol/L) can be accepted (Section 4). The estimates are also explainable due to the sparsity of SyncTwin. As shown in Figure 5 (bottom right), on average only 15 (out of 24,557) individuals contribute to the synthetic twin.

REFERENCES

Koutaro Yokote, Hideaki Bujo, Hideki Hanaoka, Masaki Shinomiya, Keiji Mikami, Yoh Miyashita, Tetsuo Nishikawa, Tatsuhiko Kodama, Norio Tada, and Yasushi Saito. Multicenter collaborative randomized parallel group comparative study of pitavastatin and atorvastatin in Japanese hypercholesterolemic patients: Collaborative study on hypercholesterolemia drug intervention and their benefits for atherosclerosis prevention (CHIBA study). Atherosclerosis, 201(2):345-352, 2008.

Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations, 2018.

A APPENDIX

A.1 THEORETICAL RESULTS

A.1.1 SITUATION WITH NO UNOBSERVED CONFOUNDERS

Proposition 1 (Bias bound on ITE with no unobserved confounders). Suppose that v_i = 0, ∀i ∈ [N] and d^c_i = 0 for some i ∈ I_1 (v_i and d^c_i are defined in Equations 1 and 4 respectively). Then the absolute value of the expected difference between the true and estimated ITE of i is bounded by:
|E[τ_i] − E[τ̂_i]| ≤ |T⁺| ||Σ_{j∈I_0} b_{ij} c_j − c_i|| ≤ |T⁺| (Σ_{j∈I_0} ||c_j − c̃_j|| + ||c_i − c̃_i||).   (11)
Proof. We start the proof by observing
|E[τ_i] − E[τ̂_i]| = Σ_{t∈T⁺} |E[ŷ_{it}(0)] − E[y_{it}(0)]| = Σ_{t∈T⁺} |q_t⊤(Σ_{j∈I_0} b_{ij} c_j − c_i)| ≤ Σ_{t∈T⁺} ||q_t|| · ||Σ_{j∈I_0} b_{ij} c_j − c_i|| = |T⁺| ||Σ_{j∈I_0} b_{ij} c_j − c_i||,   (12)
and, since d^c_i = 0 implies Σ_{j∈I_0} b_{ij} c̃_j = c̃_i,
||Σ_{j∈I_0} b_{ij} c_j − c_i|| = ||Σ_{j∈I_0} b_{ij} (c_j − c̃_j) − (c_i − c̃_i)|| ≤ Σ_{j∈I_0} b_{ij} ||c_j − c̃_j|| + ||c_i − c̃_i|| ≤ Σ_{j∈I_0} ||c_j − c̃_j|| + ||c_i − c̃_i||,   (13)
where the second line follows from the triangle inequality and the third line relies on Σ_{j∈I_0} b_{ij} = 1 and b_{ij} ≥ 0, ∀j ∈ I_0. Combining inequalities (12) and (13) proves the inequalities in Equation (11).
Justification for the matching loss and d^c_i. Proposition 1 presents a justification for minimizing d^c_i (or the matching loss L_m). Essentially, when the synthetic representation is matched with the target's (d^c_i = 0), the bias of the ITE estimate is controlled by how close the learned representations c̃ are to the true latent variables c, up to an arbitrary linear transformation Λ. An important implication is that the learned representation c̃ does not need to equal c; the learning algorithm only needs to identify c up to a linear transformation. Of course, Proposition 1 also implies that |E[τ_i] − E[τ̂_i]| ≤ |T⁺| (Σ_{j∈I_0} ||c_j − c̃_j|| + ||c_i − c̃_i||) when Λ is taken to be the identity matrix instead of the minimizer.
Proposition 2 (Trustworthiness of SyncTwin under no hidden confounders). Suppose that all the outcomes are generated by the model in Equation (1) with the unobserved confounders equal to zero s.t.
$v_i = 0, \forall i \in [N]$, and that we reject the estimate $\hat{\tau}_i$ if the pre-treatment error $d^y_i$ on $T^-$ is larger than $\delta |T^-| / |T^+|$. Then the post-treatment ITE estimation error on $T^+$ is below $\delta$.

Proof. As a reminder, $Q^- = [q_t]_{t \in T^-}$ and $Q = [q_t]_{t \in T^+}$ denote the matrices that stack the weight vectors $q_t$ before and after treatment as rows respectively, where each $q_t$ satisfies $\|q_t\| = 1$ in Equation 1. The error $d^y_i$ in Equation 4 can be decomposed into a representation error and a white-noise error:
$$d^y_i = \|\hat{y}^-_i - y^-_i\|_1 = \Big\| \sum_{j \in I_0} b_{ij} y^-_j - y^-_i \Big\|_1 = \Big\| \sum_{j \in I_0} b_{ij} (Q^- c_j + \xi_j) - (Q^- c_i + \xi_i) \Big\|_1 \le \Big\| Q^- \Big( \sum_{j \in I_0} b_{ij} c_j - c_i \Big) \Big\|_1 + \Big\| \sum_{j \in I_0} b_{ij} \xi_j - \xi_i \Big\|_1 \le |T^-| \Big\| \sum_{j \in I_0} b_{ij} c_j - c_i \Big\| + \Big\| \sum_{j \in I_0} b_{ij} \xi_j - \xi_i \Big\|_1. \quad (14)$$
We cannot separate the representation error from the white-noise error on the last line of Equation 14. Conservatively, we assume the representation error alone accounts for $d^y_i$, i.e. $|T^-| \|\sum_{j \in I_0} b_{ij} c_j - c_i\| \ge d^y_i$, and hence
$$\Big\| \sum_{j \in I_0} b_{ij} c_j - c_i \Big\| \ge d^y_i / |T^-|. \quad (15)$$
The post-treatment error is upper bounded as follows:
$$|E[\hat{\tau}_i] - E[\tau_i]| = \Big| \sum_{t \in T^+} \big( E[\hat{y}_{it}(0)] - E[y_{it}(0)] \big) \Big| \le \sum_{t \in T^+} \Big| q_t^\top \Big( \sum_{j \in I_0} b_{ij} c_j - c_i \Big) \Big| \le |T^+| \Big\| \sum_{j \in I_0} b_{ij} c_j - c_i \Big\| =: \sup_{\hat{\tau}_i} |E[\hat{\tau}_i] - E[\tau_i]|.$$
Using Equation 15, we have $\sup_{\hat{\tau}_i} |E[\hat{\tau}_i] - E[\tau_i]| \ge d^y_i |T^+| / |T^-|$. Conservatively, we reject the estimate $\hat{\tau}_i$ if $\sup_{\hat{\tau}_i} |E[\hat{\tau}_i] - E[\tau_i]|$ is larger than $\delta$, that is, when $d^y_i > \delta |T^-| / |T^+|$.

Why does $d^y_i$ indicate the trustworthiness of the estimate? Proposition 2 shows that we can control the estimation error to be below a threshold $\delta$ by rejecting any estimate whose error $d^y_i$ during the pre-treatment period is larger than $\delta |T^-| / |T^+|$. Alternatively, we can rank the trustworthiness of the estimates across individuals based on $d^y_i$ alone. This is helpful when the user is willing to accept the percentage of estimates that are deemed most trustworthy.
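The rejection rule in Proposition 2 is simple to apply in practice: given the pre-treatment errors $d^y_i$, an error tolerance $\delta$, and the lengths of the two periods, accept only the estimates below the threshold. A minimal sketch (the function name and the example numbers are illustrative, not from the paper):

```python
import numpy as np

def screen_estimates(d_y, T_pre, T_post, delta):
    """Accept tau_hat_i only when d_y_i <= delta * |T-| / |T+| (Proposition 2)."""
    threshold = delta * T_pre / T_post
    return d_y <= threshold

# e.g. 25 pre-treatment steps, 5 post-treatment steps, tolerance delta = 0.1
d_y = np.array([0.10, 0.24, 0.50, 0.90])
accepted = screen_estimates(d_y, T_pre=25, T_post=5, delta=0.1)
# threshold = 0.1 * 25 / 5 = 0.5, so only the last estimate is rejected
```

Ranking individuals by `d_y` directly, as suggested above, corresponds to sweeping `delta` until the desired acceptance rate is reached.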
We note that this proposition only holds under the assumption that the outcomes over time are generated by the model stated in Equation 1. The outcomes generated by such a model can be nonlinear and complicated due to the representation. However, the model assumes that the outcomes over time are linear functions of the same representation. This is why the pre-treatment error can be used to assess the post-treatment error. We parameterize our neural network model according to Equation 1; if it is not a good fit to the data, the model should have a large estimation error before treatment. Users should also use their domain knowledge to check whether the model holds for their data, e.g. whether any factor begins to affect the outcomes halfway through, causing the representation to change over time.

Proposition 3 (Error bound on the learned representations). Suppose that $v_i = 0, \forall i \in [N]$ ($v_i$ is defined in Equation 1). Then the total error on the learned representations for the controls, i.e. the first term in the upper bound on the absolute expected difference between the true and estimated ITE (R.H.S. of Equation 11), is bounded as follows:
$$\sum_{j \in I_0} \|c_j - \hat{c}_j\| \le \beta L_s + \beta \sum_{j \in I_0} \|\xi_j\|,$$
where $L_s$ is the supervised loss in Equation 6 and $\xi_j$ is the white noise in Equation 1.

Proof. We start the proof from the definition of the supervised loss:
$$L_s = \sum_{j \in I_0} \|Q \hat{c}_j - y_j(0)\| = \sum_{j \in I_0} \|Q \hat{c}_j - (Q c_j + \xi_j)\| \ge \sum_{j \in I_0} \Big( \sum_{t \in T^-} (\hat{c}_j - c_j)^\top q_t q_t^\top (\hat{c}_j - c_j) \Big)^{1/2} - \sum_{j \in I_0} \|\xi_j\| \ge \beta' \sqrt{|T^-|} \sum_{j \in I_0} \|\hat{c}_j - c_j\| - \sum_{j \in I_0} \|\xi_j\|, \quad (17)$$
where $\beta'$ denotes the square root of the element of the matrices $q_t q_t^\top, \forall t \in T^-$, with the smallest absolute value; the first inequality uses the reverse triangle inequality. The first and second equalities follow from Equations 6 and 1. Let $\beta$ denote the constant $1/(\beta' \sqrt{|T^-|})$. Rearranging the terms in inequality 17 proves Proposition 3.

Justification for the supervised loss. Proposition 3 provides a justification for the supervised loss $L_s$.
By optimizing the supervised loss, SyncTwin learns a representation $\hat{c}_i$ that is close to the latent variable $c_i$, which also reduces the bias bound on ITE in Proposition 1. Rationale for the reconstruction loss. Although the bias bounds we developed so far do not include the reconstruction loss $L_r$, we believe it is useful in real applications. Our reasoning follows from the fact that unsupervised or semi-supervised losses often improve the performance of deep neural networks (Erhan et al., 2009; 2010; Hendrycks et al., 2019). In addition, the reconstruction loss ensures the representation $\hat{c}$ retains the information from the temporal covariates, as required by the DAG (Figure 2). In our simulations (Section 5.1), we found that ablating the reconstruction loss leads to consistently worse performance (though the magnitude is somewhat marginal). Can we estimate the ITE as $\hat{\tau}_i = y_i(1) - Q \hat{c}_i$? No, because $L_s$ is based on the factual outcome $y_i(0)$ of the control group $i \in I_0$ only. For treated individuals $i \in I_1$, the predictor $Q \hat{c}_i$ can be biased for their counterfactual outcomes $y_i(0)$. Hence, $L_s$ is only used to learn a good representation $\hat{c}_i$ for downstream procedures, not to directly predict counterfactual outcomes.

A.1.2 SITUATION WITH UNOBSERVED CONFOUNDERS

In general, unobserved confounders make it hard to provide good estimates of the ITE. The matching on pre-enrollment outcomes $d^y_i$ (Equation 4) validates whether the unobserved confounders $v_i$ create significant error in the pre-treatment period. Using the same derivation as in Proposition 2, we can see that:
$$d^y_i = \Big\| Q^- \Big( \sum_{j \in I_0} b_{ij} c_j - c_i \Big) + U^- \Big( \sum_{j \in I_0} b_{ij} v_j - v_i \Big) + \xi \Big\|, \quad (18)$$
where $Q^-$ and $U^-$ are unknown but fixed matrices relating to the data generating process and $\xi$ is a term depending only on the white noise. As shown in Proposition 1, the matching on representations encourages the first term, involving $c_i$, to be small. Hence, a large value of $d^y_i$ implies that the remaining term, involving the unmeasured confounders $v_i$, is big, which leads to a large estimation error. It is worth pointing out that a small value of $d^y_i$ does not guarantee the absence of unobserved confounders, a hypothesis we cannot test empirically. For instance, consider the weights $U^- = 0$. It follows that the second term in Equation 18 will always be zero even if $v_i \ne 0$: there exist unobserved confounders, but they do not impact the outcomes before treatment (Equation 1). In summary, $d^y_i$ neither proves nor disproves the existence of unobserved confounders; it only indicates their impact on the pre-treatment outcomes. Our assumption is a small relaxation of the standard no-unmeasured-confounders assumption in that it allows a linear effect from some unmeasured confounders. More conservatively, we can assume there are no unmeasured confounders by setting $v_i = 0, \forall i \in [N]$ in Equation 1.

A.2 COMPARISON OF THE TEMPORAL COVARIATES ALLOWED IN THE RELATED WORKS

As introduced in Section 2, SyncTwin is able to handle temporal covariates sampled at different frequencies, i.e. the set of observation times $T_i$ and the mask $m_{it}$ can differ across individuals. In comparison, Synthetic Control (Abadie et al., 2010), robust Synthetic Control (Amjad et al., 2018), and MC-NNM (Athey et al., 2018) are only able to handle regularly-sampled covariates, i.e. $T_i = \{-1, -2, \ldots, -L\}$ $\forall i \in [N]$, and $m_{it} = 1$ $\forall i \in [N], t \in T_i$. In other words, the temporal covariates $[x_{is}]_{s \in [S_i]} = X_i \in \mathbb{R}^{D \times S_i}$ have a matrix form. The deep learning methods, including CRN (Bica et al., 2020) and RMSN (Lim et al., 2018), have the potential to handle irregularly-measured, variable-length covariates when a suitable architecture is used. However, the architectures proposed in the original papers only apply to the regularly-sampled case, and no simulation or real-data experiments were conducted for the more general irregular case.

A.3 COMPARISON OF THE CAUSAL ASSUMPTIONS IN THE RELATED WORKS

A.3.1 SYNTHETIC CONTROL

As shown in Table 3, Synthetic Control (Abadie et al., 2010; Abadie, 2019) and its variants (Athey et al., 2018; Amjad et al., 2018) rely on two causal assumptions: (1) consistency: $y_{it}(a_{it}) = y_{it}$; and (2) a data generating assumption (linear factor model):
$$y_{it}(0) = q_t^\top x_i + u_t^\top v_i + \xi_{it}, \quad \forall i \in [N],\ t \in T^- \cup T^+, \quad (19)$$
where $x_i = \mathrm{vec}(X_i) \in \mathbb{R}^{DL}$ ($\mathrm{vec}$ is the vectorization operation); $u_t \in \mathbb{R}^U$ and $q_t \in \mathbb{R}^{DL}$ are time-varying variables and $v_i \in \mathbb{R}^U$ is a latent variable. $\xi_{it}$ is an error term with mean zero that satisfies $\xi_{it} \perp\!\!\!\perp a_{rs}, x_r, u_s, v_r$ for all $r, s, t$. It is worth highlighting that the data generating assumption of Synthetic Control is a special case of the more general assumption of SyncTwin in Equation 1. To see this, let $c_i = x_i = \mathrm{vec}(X_i)$ in Equation 1, i.e. use the flattened temporal covariates directly as the representation. Further let $\phi_\theta(c_i, t_{is}) = c_i[Ds : D(s + 1)]$ and $\varepsilon_{is} = 0$, where $c[a : b]$ takes the slice of vector $c$ between indices $a$ and $b$. The result is exactly Equation 19.

Why does Synthetic Control tend to over-match? Both SyncTwin and Synthetic Control estimate treatment effects using a weighted combination of control outcomes (Equation 3). However, Synthetic Control finds the weights $b_{ij}$ in a different way, by directly minimizing $L_x = \|x_i - \sum_j b_{ij} x_j\|$. Since $x_i$ contains observation noise and other random components unrelated to the outcomes, the weights $b_{ij}$ that minimize $L_x$ tend to over-match, i.e. they capture the irrelevant randomness in $x_i$. In contrast, SyncTwin finds $b_{ij}$ based on the learned representations $\hat{c}_i$ rather than $x_i$ ($L_m$, Equation 8). Since $\hat{c}_i$ has much lower dimensionality than $x_i$, the reconstruction loss $L_r$ encourages the Seq2Seq network to learn a $\hat{c}_i$ that retains only the signal in $x_i$ and not the noise. Meanwhile, the supervised loss encourages $\hat{c}_i$ to retain only the information that predicts the outcomes. As a consequence, we expect the weights based on $\hat{c}_i$ to be less prone to over-matching.
Moreover, since the relationship between $\hat{c}_i$ and $x_i$ is nonlinear (as captured by the decoder network), the weights $b_{ij}$ that minimize $L_m$ will generally not minimize the Synthetic Control objective $L_x$, thereby avoiding over-matching.
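The over-matching argument can be illustrated with a toy experiment: fit simplex weights once against noisy covariates $x$ and once against low-dimensional representations, then compare how well each recovers the target in representation space. This is a self-contained sketch, not the paper's algorithm (the exponentiated-gradient solver and all dimensions are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N0, D, K = 50, 40, 2
C = rng.normal(size=(N0, K))                 # control representations c_j
A = rng.normal(size=(K, D))                  # lifting map: representation -> covariates
X = C @ A + 2.0 * rng.normal(size=(N0, D))   # noisy high-dimensional covariates x_j
c_i = rng.normal(size=K)                     # target representation
x_i = c_i @ A + 2.0 * rng.normal(size=D)     # target covariates

def simplex_weights(target, basis, steps=3000, lr=0.01):
    """Minimize ||target - b @ basis|| over the simplex via exponentiated gradient."""
    b = np.full(basis.shape[0], 1.0 / basis.shape[0])
    for _ in range(steps):
        grad = basis @ (b @ basis - target)  # gradient of 0.5 * ||b @ basis - target||^2
        b = b * np.exp(-lr * grad)
        b /= b.sum()                         # renormalize back onto the simplex
    return b

b_x = simplex_weights(x_i, X)                # Synthetic-Control-style: match raw covariates
b_c = simplex_weights(c_i, C)                # SyncTwin-style: match representations
err_x = np.linalg.norm(b_x @ C - c_i)        # representation error of the x-based twin
err_c = np.linalg.norm(b_c @ C - c_i)        # representation error of the c-based twin
```

On this toy data we expect `err_x > err_c`: the covariate-matched weights chase the noise in `x` and land farther from the target in representation space, which is the over-matching effect described above.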

A.3.2 COUNTERFACTUAL RECURRENT NEURAL NETWORKS

As shown in Table 3, CRN (Bica et al., 2020) and RMSN (Lim et al., 2018) make the following three causal assumptions: (1) consistency: $y_{it}(a_{it}) = y_{it}$; (2) sequential overlap (aka positivity): $Pr(a_{it} = 1 | a_{i,t-1}, x_{it}) > 0$ whenever $Pr(a_{i,t-1}, x_{it}) \ne 0$; and (3) no unobserved confounders: $y_{it}(0), y_{it}(1) \perp\!\!\!\perp a_{it} \mid x_{it}, a_{i,t-1}$. In summary, CRN makes the same consistency assumption as SyncTwin. However, SyncTwin does not assume sequential overlap or no unobserved confounders, while CRN does not make assumptions on the data generating model. The sequential overlap assumption means that the individuals should have a non-zero probability of changing treatment status at any time $t \ge 0$ given the history. This assumption is violated in the clinical observational study setting we outlined in Section 1, where the treatment group continues to be treated and cannot switch to the control group after enrollment (and similarly for the control group). While the sequential overlap assumption allows these methods to handle more general situations where treatment switching does occur, their performance is negatively impacted in the "special" (yet still widely applicable) setting we consider in this work. While CRN makes a strict no-unobserved-confounders assumption, SyncTwin allows certain types of unobserved confounders. In particular, the latent factor $v_i$ in Equation 1 can be an unobserved confounder. Being less reliant on the no-unobserved-confounders assumption is important for medical applications because it is hard to assume the dataset captures all aspects of a patient's health status. SyncTwin's ability to handle unobserved confounders $v_i$ relies on the validity of its data generating assumption, which we discuss next. Why does SyncTwin not explicitly require overlap? The overlap assumption is commonly made by treatment effect estimation methods. We first give a very brief review of why two important classes of methods need overlap.
(1) For methods that rely on propensity scores, overlap ensures that the propensity scores are not zero, which enables various forms of propensity weighting. (2) For methods that rely on covariate adjustment, overlap ensures that the conditional expectation $E[y_i | X_i, a_i]$ is well-defined, i.e. the conditioning variables $(X_i, a_i)$ have non-zero probability. In comparison, SyncTwin relies on neither propensity scores nor explicit covariate adjustment, and hence it does not make an overlap assumption explicitly. However, as discussed in Proposition 1, SyncTwin requires the synthetic twin to match the representations, $d^c_i \approx 0$, which implies $\hat{c}_i \approx \sum_{j \in I_0} b_{ij} \hat{c}_j$ for some $b_{ij}$: the target individual should be in or close to the convex hull formed by the controls in the representation space. This condition has a similar spirit to overlap (though it is very different mathematically). When overlap is satisfied, there tend to be control individuals in the neighbourhood of the treated individual, making it easier to construct matching twins. Conversely, if overlap is violated, the controls will tend to be far away from the treated individual, making it harder to construct a good twin.

A.4 THE GENERALITY OF THE ASSUMED DATA GENERATING MODEL

SyncTwin assumes that the outcomes are generated by a latent factor model (Teräsvirta et al., 2010) with latent factors $c_i$ learnable from the covariates $X_i$ and latent factors $v_i$ that are unobserved confounders. We assume the dimensionality of $c_i$ and $v_i$ to be low compared with the number of time steps. Despite its seemingly simple form, the assumed latent factor model is very flexible because the factors are latent variables. The latent factor model is widely studied in econometrics. In many real applications, temporally observed variables naturally have a low-rank structure and can thus be described by a latent factor model (Abadie & Gardeazabal, 2003; Abadie et al., 2010).

The weights $b_k$ are obtained via the Gumbel softmax function with temperature hyper-parameter $\tau$ (Jang et al., 2016). It is straightforward to verify that $b_k$ satisfies the three constraints while the loss $L_m$ remains differentiable with respect to $z_k$. We use the Gumbel softmax function instead of the standard softmax function because the Gumbel softmax tends to produce a sparse vector $b_k$, which is highly desirable, as discussed in Section 4.

Table 5 shows the results under irregularly-measured covariates with a varying degree of irregularity $m$ (smaller $m$: more irregular, fewer covariates observed). For methods that are unable to deal with irregular covariates, we first impute the unobserved values using Probabilistic PCA before applying the algorithms (Hegde et al., 2019). SyncTwin achieves the best performance in all cases. Furthermore, SyncTwin's performance deteriorates more slowly than the benchmarks when sampling becomes more irregular (smaller $m$). This suggests that the encoder network in SyncTwin is able to learn good representations even from highly irregularly-measured sequences. Table 6 shows the results under various lengths of the observed covariates $S$ (smaller $S$: shorter observed sequences). Again, SyncTwin achieves the best performance in all cases.
As expected, SyncTwin makes smaller errors when the observed sequence is longer. Note that this is not the case for CRN and RMSN: their performance deteriorates when the observed sequence is longer. This might indicate that these two methods are less able to learn good balancing representations (or balancing weights) for longer sequences.

The memory footprint to directly optimize $L_m$ is $O((|D_0| + |D_1|) \times |D_0|)$.

A.12 SPARSITY COMPARED WITH SYNTHETIC CONTROL

In Figure 4 we have shown visually that SyncTwin produces sparser solutions than SC. To quantify the differences, we report the Gini index ($\sum_{ij} b_{ij}(1 - b_{ij}) / N_1$), the entropy ($\sum_{ij} -b_{ij} \log(b_{ij}) / N_1$), and the number of contributors used to construct the twin ($\sum_{ij} \mathbb{1}\{b_{ij} > 0\} / N_1$) in the simulation study. All three metrics reflect the sparsity of the learned weight vector (smaller values mean sparser). Table 7 shows that SyncTwin achieves sparser results than SC on all metrics considered. The full and ablated versions of SyncTwin have similar sparsity because sparsity is regulated by the matching loss, which all versions share. It is worth pointing out that RSC and MC-NNM do not produce sparse weights, and their weights need not be positive or sum to one (Amjad et al., 2018; Athey et al., 2018).

A.13 SENSITIVITY OF HYPER-PARAMETERS

It is beneficial to understand the network's sensitivity to each hyper-parameter so as to optimize them effectively during validation. In addition to the standard hyper-parameters in deep learning (e.g. learning rate, batch size), SyncTwin includes the following specific hyper-parameters: (1) $\tau$, the temperature of the Gumbel-softmax function (Appendix A.9); (2) $\lambda_p$ in the training loss $L_{tr}$ (since only the ratio between $\lambda_p$ and $\lambda_r$ matters, we keep $\lambda_r = 1$ and search over different values of $\lambda_p$); and (3) $H$, the dimension of the representation $\hat{c}_i$. Here we present a sensitivity analysis on the hyper-parameters $H$, $\lambda_p$ and $\tau$ using the simulation framework detailed in Section 5.1, with $N_0 = 2000$ and $S = 15$, although the conclusions generalize to all the simulation settings we considered. The results are presented in Figure 6, from which we can derive two insights. Firstly, the hyper-parameter $\tau$ is very important to the performance and needs to be tuned carefully during validation.
This is understandable because $\tau$ is the temperature parameter of the Gumbel softmax function and it directly controls the sparsity of the matrix $B$. In comparison, the hyper-parameters $H$ and $\lambda_p$ do not impact the performance in a significant way. We therefore recommend using $H = 40$ and $\lambda_p = 1$ as defaults. Secondly, we observe that the validation loss $L_{va}$ closely tracks the error on ITE estimation (which is not directly observable in reality). These results support the use of $L_{va}$ to validate models and perform hyper-parameter optimization.
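The three sparsity metrics reported in Appendix A.12, which the temperature $\tau$ directly influences, are straightforward to compute from a weight matrix $B$. A minimal sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def sparsity_metrics(B):
    """Gini index, entropy, and average number of contributors of a weight
    matrix B whose N1 rows are simplex weight vectors (Appendix A.12).
    Smaller Gini/entropy values mean sparser weights."""
    N1 = B.shape[0]
    gini = np.sum(B * (1.0 - B)) / N1
    log_B = np.log(np.where(B > 0, B, 1.0))   # define 0 * log(0) = 0
    entropy = -np.sum(B * log_B) / N1
    contributors = np.sum(B > 0) / N1
    return gini, entropy, contributors

B_sparse = np.array([[0.0, 1.0, 0.0]])        # one-hot row: a single contributor
B_dense = np.full((1, 3), 1.0 / 3.0)          # uniform row: maximally dense
```

A one-hot row scores zero Gini and zero entropy with one contributor, while the uniform row scores 2/3 Gini and log 3 entropy with three contributors.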



CONCLUSION

In this work, we present SyncTwin, a transparent ITE estimation method that deals with temporal confounding and has a broad range of applications in clinical observational studies and beyond. Combining the Synthetic Control method with deep representation learning, SyncTwin achieves transparency and strong performance in both simulated and real data experiments.



Figure 1: Illustration of a treated individual. Yellow dots represent the outcomes under no treatment.

Figure 2: A: Illustration of SyncTwin (shaded area: the time points after the treatment starts). 1. Temporal covariates are encoded as representation vectors. 2. The synthetic twin of a treated target individual is constructed as the weighted average of the few contributors from the control group. 3. The difference between the observed outcome and the synthetic twin outcome estimates ITE. B: the DAG of the data generating model (Sec. 2).

Figure 3: Illustration of the loss functions. (A) The representation networks are trained using Ls and Lr in Equation 6. Note that the supervised loss Ls only applies to the controls. (B) Validation and inference involve optimizing the matching loss Lm in Equation 8. Note that the encoder must be kept fixed during this optimization.

and (2) sum-to-one: $\sum_{j \in I_0} b_{ij} = 1, \forall i \in [N]$. The constraints are needed for three reasons. (1) The constraints reduce the solution space of $b_i$ and serve as a regularizer.

Figure 4: Heatmap of the weights bi learned by SyncTwin and SC. Each row represents one bi.

Mean absolute error on ITE under different levels of confounding bias p0. m = 1 and S = 25 are used. Estimated standard deviations are shown in the parentheses. The best performer is in bold.

Figure 5: Illustration of the transparency of SyncTwin. Top: the outcomes (LDL) before and after treatment of a target individual and its synthetic twin. Bottom left: histogram of distance d y (Equation 4). Bottom right: histogram of number of contributors used to construct the synthetic twin.


Mean absolute error between the observed covariates x

Comparison of the causal assumptions in the related works. The definitions of Consistency, Sequential overlap, and No unobserved confounders are given in A.3 in bold. The data generating model (D.G.M.) in Equation 1 contains the one in Equation 19 as a special case.

The memory footprint can be further reduced to $O(|D_B| \times |D_0|)$ if we use stochastic gradient descent with a mini-batch $D_B \subseteq D_0 \cup D_1$.

A.10 THE SIMULATION MODEL

In Equation 9, $R_t$ is the LDL cholesterol level (outcome) and $I_t$ is the dosage of statins. For each individual in the treatment group, one dose of statins (10 mg) is administered daily after the treatment starts, which gives dosage $I_t = 0$ if $t \le t_0$ and $I_t = 1$ otherwise. $K$, $H$ and $D_{50}$ are constants fixed to the values reported in Faltaos et al. (2006). $K_{in,t} \in \mathbb{R}$ is an individual-specific, time-varying variable that summarizes an individual's physiological status, including serum creatinine, uric acid, serum creatine phosphokinase (CPK), and glycaemia. $P_t$ and $D_t$ are two intermediate temporal variables, both affecting $R_t$.

Mean absolute error on ITE with varying irregularity m. S = 25 and p0 = 0.5 are used in all cases. Estimated standard deviations are shown in the parentheses. The best performer is in bold. * did not finish within 48h.

Mean absolute error on ITE under different lengths of the temporal covariates S. m = 1 and p0 = 0.5 are used in all cases. Estimated standard deviations are shown in the parentheses. The best performer is in bold. * did not finish within 48 hours.

Sparsity metrics of the learned bi. Estimated standard deviations are shown in the parentheses. Here p0 = 0.5, m = 1, S = 25. The worst performer is italicized.


The latent factor model also captures many well-studied scenarios as special cases (Finkel, 1995), such as the conventional additive unit and time fixed effects ($y_{it}(0) = q_t + c_i$). Last but not least, it has also been shown that low-rank latent factor models can approximate many nonlinear latent variable models well (Udell & Townsend, 2017). Latent factor models in the static setting are very familiar in the deep learning literature. Consider a deep feed-forward neural network that uses a linear output layer to predict some real-valued outcomes $y \in \mathbb{R}^D$ in the static setting (the notation used in this example is not related to that used in the rest of the paper). Denote the last layer of the neural network as $h_{-1} \in \mathbb{R}^K$; it is easy to see that the neural network corresponds to a latent factor model, i.e. $y = A h_{-1} + b$, where $h_{-1}$ is the latent factor. Note that this holds for arbitrarily complicated feed-forward networks as long as the output layer is linear.
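The remark about feed-forward networks can be checked in a few lines: with a linear output layer, the prediction is exactly linear in the last hidden activation, i.e. a latent factor model with factor $h_{-1}$. A toy sketch with arbitrary weights (all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# a toy two-layer network with a linear output layer
W1, b1 = rng.normal(size=(8, 5)), rng.normal(size=8)
A, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

def network(x):
    h_last = np.tanh(W1 @ x + b1)   # arbitrary nonlinear body
    return A @ h_last + b2          # linear output layer

x = rng.normal(size=5)
h_last = np.tanh(W1 @ x + b1)       # the "latent factor" of this model
y = network(x)
# the network's prediction is exactly a linear factor model in h_last
assert np.allclose(y, A @ h_last + b2)
```

The equivalence is trivial by construction; the point is that the nonlinear body only determines the factor, while the outcome is linear in it, mirroring Equation 1.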

A.5 ESTIMATING ITE FOR CONTROL AND NEW INDIVIDUALS

We have been focusing on estimating the ITE for a treated individual i ∈ I 1. The same approach can estimate the ITE for a control individual without loss of generality. After obtaining the representation ĉi for i ∈ I 0, SyncTwin can use the treatment group j ∈ I 1 to construct the synthetic twin by optimizing the matching loss in Equation 8. The checking and estimation procedures remain the same.

SyncTwin can also estimate the treatment effect for a new individual.

The same idea still applies, but this time we need to construct two synthetic twins: one from the control group and one from the treatment group. The ITE estimate can be obtained as the difference between the two twins. SyncTwin also generalizes easily to the situation where there are A > 1 treatment groups, each receiving a different treatment. In this case, the treatment indicator a i ∈ {0, 1, . . . , A}. For a target individual in any of the treatment groups, SyncTwin can construct its twin using the control group I 0. The remaining steps are the same as in the single-treatment-group case.
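The two-twin construction for a new individual can be sketched as follows, assuming the matching weights against the control and treatment groups have already been obtained by optimizing the matching loss (the function name and the example numbers are illustrative):

```python
import numpy as np

def two_twin_ite(b_ctrl, Y_ctrl, b_trt, Y_trt):
    """ITE for a new individual from two synthetic twins (Appendix A.5):
    one built from control outcomes Y_ctrl and one from treated outcomes Y_trt.
    b_ctrl / b_trt are matching weights (non-negative, summing to one)."""
    y0_hat = b_ctrl @ Y_ctrl     # synthetic untreated outcome, shape (|T+|,)
    y1_hat = b_trt @ Y_trt       # synthetic treated outcome
    return y1_hat - y0_hat

# two controls and two treated individuals, two post-treatment time steps
Y_ctrl = np.array([[1.0, 1.2], [0.8, 1.0]])
Y_trt = np.array([[0.4, 0.5], [0.6, 0.7]])
b_ctrl = np.array([0.5, 0.5])
b_trt = np.array([0.25, 0.75])
tau_hat = two_twin_ite(b_ctrl, Y_ctrl, b_trt, Y_trt)   # -> [-0.35, -0.45]
```

For a treated target, `b_trt` degenerates to the individual's own observed outcome and the expression reduces to Equation 3.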

A.6 UNRELATED WORKS WITH SIMILAR TERMINOLOGY

Several recent works in the deep learning ITE literature employ similar terminology, such as "matching" (Johansson et al., 2018; Kallus, 2018). However, they are fundamentally different from SyncTwin because they only work for static covariates, and they try to match the overall distributions of the treated and control groups rather than constructing a synthetic twin that matches one particular treated individual. The Virtual Twin method (Foster et al., 2011) is designed for randomized clinical trials, where there is no confounding (temporal or static). As a result, it cannot overcome the confounding bias when the problem is to estimate causal treatment effects from observational data.

A.7 IMPLEMENTATION DETAILS OF THE BENCHMARK ALGORITHMS

Synthetic Control. We used the implementation of Synthetic Control in the R package Synth (1.1-5). The package is available at https://CRAN.R-project.org/package=Synth.

Robust Synthetic Control. We used the implementation accompanying the original paper (Amjad et al., 2018) at https://github.com/SucreRouge/synth_control. We optimized the hyper-parameters on the validation set using the method described in Section 3.4.3 of Amjad et al. (2018). The best hyper-parameter setting was then applied to the test set.

MC-NNM. We used the implementation in the R package SoftImpute (1.4), available at https://CRAN.R-project.org/package=softImpute. The regularization strength λ was tuned on the validation set using grid search before being applied to the testing data.

Counterfactual Recurrent Network and Recurrent Marginal Structural Network. We used the implementations by the authors (Bica et al., 2020; Lim et al., 2018) at https://bitbucket.org/mvdschaar/mlforhealthlabpub/src/master/. The networks were trained on the training dataset. We experimented with different hyper-parameter settings on the validation dataset and applied the best setting to the testing data. We also found that the results are not sensitive to the hyper-parameters.

Counterfactual Gaussian Process. We used the implementation with GPy (GPy, since 2012), which is able to automatically optimize the hyper-parameters, such as the kernel width, using the validation data.

One-nearest neighbour. We used our own implementation. Since no parameters need to be learned or tuned, the algorithm was directly applied to the testing dataset.

Search range of hyper-parameters. 1. Synthetic Control: hyper-parameters are optimized by Synth directly. 2. Robust Synthetic Control: num_sc ∈ {1, 2, 3, 4, 5}.

Training. On the training dataset $D^{tr}_0$, we learn the representation networks by optimizing $L_{tr} = \lambda_r L_r + \lambda_p L_s$, where $L_r$ and $L_s$ are the loss functions defined in Equation 6.
The hyper-parameters $\lambda_r$ and $\lambda_p$ control the relative importance of the two losses. We provide an ablation study in Section 5.1 and a detailed analysis of hyper-parameter importance in Appendix A.13. The objective $L_{tr}$ can be optimized using stochastic gradient descent; in particular, we used the ADAM algorithm with learning rate 0.001 (Kingma & Ba, 2014).

Validation. Since we never observe the true ITE, we cannot evaluate the error of ITE estimation, $\|\tau_i - \hat{\tau}_i\|^2_2$. As is standard practice (Bica et al., 2020), we rely on the factual loss on the observed outcomes, where $\hat{y}_i(0)$ is defined as in Equation 2 and obtained as follows. We obtain $\hat{c}_i$ for all $i \in D^{va}$ and then optimize the matching loss $L_m(D^{va}_0, D^{va}_1)$ to find the weights $b^{va}_i$. It is important to keep the encoder fixed throughout this optimization; otherwise it might overfit to $D^{va}$. Finally, $\hat{y}_i(0) = \sum_{j \in D^{va}_0} b^{va}_{ij} y_j(0)$.

Inference. The first steps of the inference procedure are the same as in validation. We start by obtaining the representation $\hat{c}_i$ for all $i \in D^{te}$ and then obtain the weights $b^{te}_i$ by optimizing the matching loss $L_m(D^{te}_0, D^{te}_1)$ while keeping the encoder fixed. Using the weights $b^{te}_i$, the ITE for any $i \in D^{te}_1$ can be estimated as $\hat{\tau}_i = y_i(1) - \sum_{j \in D^{te}_0} b^{te}_{ij} y_j(0)$ according to Equation 3. Similarly, we obtain $\hat{c}_i$ and $\hat{y}_{it}(0)$ according to Equation 2. The expert can check $d^y_i$ to evaluate the trustworthiness of $\hat{\tau}_i$.

Here we present a way to optimize the matching loss $L_m$ in Equation 8. To ensure the three constraints discussed in Section 4.2 while also allowing a gradient-based learning algorithm, we reparameterize the weights (Equation 8). The optimization then proceeds as follows: calculate the gradient of $L_m(D^{te}_0, D_1)$ via back-propagation; update $B$ using the optimizer while keeping the encoder fixed; and use the weight matrix $B$ to obtain $\hat{\tau}_i, \forall i \in D^{te}_1$ using Equation 3.
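The reparameterization step can be sketched with a standard Gumbel-softmax transform: free scores $z$ are mapped to simplex weights $b$, with the temperature $\tau$ controlling how concentrated (sparse) the result is. The exact form of Equation 8 is not reproduced in this excerpt, so treat this as a generic illustration:

```python
import numpy as np

def gumbel_softmax_weights(z, tau, rng):
    """Map free scores z to weights on the simplex via Gumbel-softmax with
    temperature tau: low tau gives near-one-hot (sparse) weights."""
    gumbel = -np.log(-np.log(rng.uniform(size=z.shape)))  # Gumbel(0, 1) noise
    logits = (z + gumbel) / tau
    logits -= logits.max()                                # numerical stability
    b = np.exp(logits)
    return b / b.sum()                                    # non-negative, sums to one

rng = np.random.default_rng(0)
z = rng.normal(size=10)
b_sharp = gumbel_softmax_weights(z, tau=0.1, rng=rng)     # near one-hot
b_soft = gumbel_softmax_weights(z, tau=10.0, rng=rng)     # near uniform
```

Both vectors satisfy the simplex constraints by construction, and the low-temperature sample concentrates almost all mass on a single control, which is the sparsity property exploited in Appendix A.12.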
A.14 COMPUTATION TIME

In Figure 7 we present the wall-clock computation time (in seconds) of SyncTwin under various simulation conditions: control group sizes $N_0 = (200, 1000, 2000)$ and pre-enrollment period lengths $S = (15, 25, 45)$. The simulations were performed on a server with an Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz and an Nvidia(R) GeForce(TM) RTX 2080 Ti GPU. All simulations finished within 10 minutes. As expected, the computation time increases with $N_0$ and $S$ as more data need to be processed. However, a 10-fold increase in $N_0$ only approximately doubles the computation time, suggesting that SyncTwin scales well with sample size. In comparison, $S$ affects the computation time more, because the encoder and decoder need to be trained on longer sequences.

A.15 ADDITIONAL RESULTS IN THE CPRD STUDY

The treatment and control groups in the CPRD experiment were selected based on the selection criteria in Figure 8. We followed all the guidelines listed in Dickerman et al. (2019) to ensure the selection process does not increase the confounding bias. The summary statistics of the treatment and control groups are listed below. We can clearly see a selection bias: the treatment group contains a much higher proportion of males and of people with previous cardiovascular or renal diseases.

