EXPLORING TRANSFORMER BACKBONES FOR HETEROGENEOUS TREATMENT EFFECT ESTIMATION

Abstract

Previous works on Treatment Effect Estimation (TEE) are not in widespread use because they are predominantly theoretical, relying on strong parametric assumptions that are intractable in practical applications. Recent works use Multilayer Perceptrons (MLPs) to model causal relationships; however, MLPs lag far behind recent advances in ML methodology, which limits their applicability and generalizability. To extend beyond the single-domain formulation and towards more realistic learning scenarios, we explore model design spaces beyond MLPs, i.e., transformer backbones, which provide flexibility: attention layers govern interactions among treatments and covariates to exploit structural similarities of potential outcomes for confounding control. Through careful model design, we propose Transformers as Treatment Effect Estimators (TransTEE). We show empirically that TransTEE can: (1) serve as a general-purpose treatment effect estimator that significantly outperforms competitive baselines on a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable both when covariates are tabular and when they consist of structural data (e.g., texts, graphs); (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, explainability in covariate adjustment, and real-world utility in auditing pre-trained language models.

1. INTRODUCTION

One of the fundamental tasks in causal inference is to estimate treatment effects given covariates, treatments and outcomes. Treatment effect estimation is a central problem of interest in clinical healthcare and social science (Imbens & Rubin, 2015), as well as econometrics (Wooldridge, 2015). Under certain conditions (Rosenbaum & Rubin, 1983), the task can be framed as a particular type of missing data problem, whose structure differs in key ways from supervised learning and entails a more complex set of covariate and treatment representation choices. Previous works in statistics leverage parametric models (Imbens & Rubin, 2015; Wager & Athey, 2018; Künzel et al., 2019; Foster & Syrgkanis, 2019) to estimate heterogeneous treatment effects. To improve their utility, feed-forward neural networks have been adapted for modeling causal relationships and estimating treatment effects (Yoon et al., 2018; Bica et al., 2020b; Schwab et al., 2020; Nie et al., 2021; Curth & van der Schaar, 2021b), in part due to their flexibility in modeling nonlinear functions (Hornik et al., 1989) and high-dimensional input (Johansson et al., 2016). Among them, specialized NN architectures play a key role in learning representations for counterfactual inference (Alaa & Schaar, 2018; Curth & van der Schaar, 2021b) such that treatment variables and covariates are well distinguished (Shalit et al., 2017). Despite these encouraging results, several key challenges make it difficult to adopt these methods as standard tools for treatment effect estimation. Most current works based on subnetworks do not sufficiently exploit the structural similarities of potential outcomes for heterogeneous TEE, and accounting for them requires complicated regularizations, reparametrizations or multi-task architectures that are problem-specific (Curth & van der Schaar, 2021b).
Moreover, they rely heavily on their treatment-specific designs and cannot be easily extended beyond the narrow context for which they were originally designed. For example, they have poor practicality and generalizability when high-dimensional structural data (e.g., texts and graphs) are given as input (Kaddour et al., 2021). Besides, those MLP-based models currently lag far behind recent advances in machine learning methodology and are prone to issues of scale, expressivity and flexibility. Specifically, these limitations include parameter inefficiency (Table 1) and brittleness under different scenarios, such as when treatments shift slightly from the training distribution. The above limitations show a pressing need for an effective and practical framework to estimate treatment effects. In this work, we explore recent advanced models in the deep learning community to improve the model design for TEE tasks. Specifically, the core idea of our approach consists of three parts: as an S-learner, TransTEE embeds all treatments and covariates, which avoids multi-task architectures and shows improved flexibility and robustness to continuous treatment value distribution shifts; attention mechanisms model interactions among treatments; and attention mechanisms model treatment-covariate interactions. In this way, TransTEE enables adaptive covariate selection (De Luna et al., 2011; VanderWeele, 2019) for inferring causal effects. For example, one can observe in Figure 1 that both pre-treatment covariates and confounders are appropriately adjusted with higher weights, which recovers the "disjunctive cause criterion" (De Luna et al., 2011) that accounts for those two kinds of covariates and helps ensure the plausibility of the conditional ignorability assumption when complete knowledge of a causal graph is not available. This recipe also gives improved versatility when working with heterogeneous treatment types (Figure 2).
Our first contribution is to show that transformer backbones, equipped with proper design choices, can be effective and versatile treatment effect estimators under the Rubin-Neyman potential outcomes framework. TransTEE is empirically verified to be (i) a flexible framework applicable to a wide range of TEE settings; (ii) compatible and effective with propensity score modeling; (iii) parameter-efficient; (iv) explainable in covariate adjustment; (v) robust under continuous treatment shifts; (vi) useful for debugging pre-trained language models (LMs) to promote favorable social outcomes. Moreover, comprehensive experiments on six benchmarks with four types of treatments are conducted to verify the effectiveness of TransTEE in estimating treatment effects. We show that TransTEE produces interpretable covariate adjustment and significant performance gains given discrete, continuous or structured treatments on popular benchmarks including IHDP, News, and TCGA. We introduce a new surrogate modeling task to broaden the scope of TEE beyond semi-synthetic evaluation and show that TransTEE is effective in real-world applications like auditing fair predictions of LMs.

2. RELATED WORK

Neural Treatment Effect Estimation. There are many recent works on adapting neural networks to learn counterfactual representations for treatment effect estimation (Johansson et al., 2016; Shalit et al., 2017; Louizos et al., 2017; Yoon et al., 2018; Bica et al., 2020b; Schwab et al., 2020; Nie et al., 2021; Curth & van der Schaar, 2021b). To mitigate the imbalance of covariate representations across treatment groups, various approaches have been proposed, including optimizing a distributional divergence (e.g., IPMs such as MMD and the Wasserstein distance), entropy balancing (Zeng et al., 2020) (which converges to the JSD between groups), and counterfactual variance (Zhang et al., 2020). However, their domain-specific designs limit them to specific treatment types, as shown in Table 1: methods like VCNet (Nie et al., 2021) use a hand-crafted, constant mapping function to map a real-valued treatment to an n-dimensional vector, which makes convergence difficult under shifts of treatments (Table 6 in the Appendix); models like TARNet (Shalit et al., 2017) need an accurate estimate of the value interval of treatments. Moreover, previous estimators embed covariates into only one representation space with fully connected layers and tend to lose the connections and interactions among covariates (Shalit et al., 2017; Johansson et al., 2020). It is also non-trivial to adapt existing ad hoc network designs to wider settings. For example, the case with n treatments and m associated dosages requires n × m branches for methods like DRNet (Schwab et al., 2020), which places a rigid requirement on extrapolation capacity and is infeasible given observational data.
Transformers and Attention Mechanisms. Transformers (Vaswani et al., 2017) have demonstrated exemplary performance on a broad range of language tasks, and their variants have been successfully adapted to representation learning over images (Dosovitskiy et al., 2021), programming languages (Chen et al., 2021), and graphs (Ying et al., 2021), partly due to their flexibility and expressiveness. Their wide utility has motivated a line of work on general-purpose neural architectures (Jaegle et al., 2021; 2022) that can be trained to perform tasks across various modalities like images, point clouds, audio and video. However, causal inference is fundamentally different from the focus of the above models, i.e., supervised learning. One of our goals is to explore the generalizability of attention-based models for TEE across domains with high-dimensional inputs, an important desideratum in causal representation learning (Schölkopf et al., 2021). There are recent attempts to use attention mechanisms for TEE tasks (Guo et al., 2021; Xu et al., 2022). CETransformer (Guo et al., 2021) embeds covariates for different treatments as a T-learner. It learns only covariate embeddings and not treatment embeddings, although the latter are shown to be more important for TEE tasks. In contrast, TransTEE is an S-learner, which is better suited to account for causal heterogeneity (Künzel et al., 2019; Curth & van der Schaar, 2021b;a). ANU (Xu et al., 2022) utilizes attention mechanisms to map the original covariate space X into a latent space Z with a single model. We detail the differences in Appendix A.

3. PROBLEM STATEMENT AND ASSUMPTIONS

Treatment Effect Estimation. We consider a setting in which we are given $N$ observed samples $\{(x_i, t_i, s_i, y_i)\}_{i=1}^{N}$, each containing $p$ pre-treatment covariates, $\{x_i \in \mathbb{R}^p\}_{i=1}^{N}$. The treatment variable $t_i$ in this work has various supports, e.g., $\{0, 1\}$ for binary treatment settings, $\mathbb{R}$ for continuous treatment settings, and graphs/words for structured treatment settings. For each sample, the potential outcome ($\mu$-model) $\mu(x, t)$ or $\mu(x, t, s)$ is the response of the $i$-th sample to a treatment $t$, where in some cases each treatment is associated with a dosage $s_{t_i} \in \mathbb{R}$. The propensity score ($\pi$-model) is the conditional probability of treatment assignment given the observed covariates, $\pi(T = t \mid X = x)$. The above two models can be parameterized as $\mu_\theta$ and $\pi_\phi$, respectively. The task is to estimate the Average Dose Response Function (ADRF) $\mu(x, t) = \mathbb{E}[Y \mid X = x, do(T = t)]$ (Shoichet, 2006), which includes special cases in discrete treatment scenarios that can also be estimated as the average treatment effect (ATE), $ATE = \mathbb{E}[\mu(x, 1) - \mu(x, 0)]$, and its individual version, the ITE. What makes the above problem more challenging than supervised learning is that we never see the missing counterfactuals and ground-truth causal effects in observational data. Therefore, we first introduce the required assumptions that give the strongly ignorable condition, under which statistical estimands can be interpreted causally.
Assumption 3.1. (Ignorability/Unconfoundedness) Given the covariates, the treatment assignment is independent of the potential outcomes, i.e., $Y(T = t) \perp T \mid X$.
Assumption 3.2. (Positivity/Overlap) The treatment assignment is non-deterministic, i.e., $0 < \pi(t \mid x) < 1, \ \forall x \in \mathcal{X}, t \in \mathcal{T}$.
Assumption 3.1 ensures the causal effect is identifiable, implying that treatment is assigned independently of the potential outcome and randomly for every subject regardless of its covariates, which allows estimating the ADRF using $\mu(t) := \mathbb{E}[Y \mid do(T = t)] = \mathbb{E}[\mathbb{E}[Y \mid x, T = t]]$ (Rubin, 1978). One naive estimator of $\mu(x, t) = \mathbb{E}[Y \mid X = x, T = t]$ is the sample average $\mu(t) = \frac{1}{n}\sum_{i=1}^{n} \hat{\mu}(x_i, t)$. Assumption 3.2 states that there is a chance of seeing units in every treated group.
Figure 2: A schematic comparison of TransTEE and recent works including DragonNet (Shi et al., 2019), FlexTENet (Curth & van der Schaar, 2021b), DRNet (Schwab et al., 2020) and VCNet (Nie et al., 2021). TransTEE handles all the scenarios without handcrafting treatment-specific architectures or any additional parameter overhead.
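The sample-average estimator mentioned above can be illustrated with a short sketch. It assumes that a fitted outcome model is available as a plain Python callable; the names `adrf_estimate`, `ate_estimate` and the toy model `toy_mu_hat` are hypothetical and serve only as an illustration.

```python
from typing import Callable, Sequence

Covariates = Sequence[float]
OutcomeModel = Callable[[Covariates, float], float]


def adrf_estimate(mu_hat: OutcomeModel, xs: Sequence[Covariates], t: float) -> float:
    """Average the predicted outcome at treatment level t over all observed covariates."""
    return sum(mu_hat(x, t) for x in xs) / len(xs)


def ate_estimate(mu_hat: OutcomeModel, xs: Sequence[Covariates]) -> float:
    """Binary special case: average of mu_hat(x, 1) - mu_hat(x, 0)."""
    return adrf_estimate(mu_hat, xs, 1.0) - adrf_estimate(mu_hat, xs, 0.0)


# Toy outcome model (an assumption for illustration only): y = x_0 + 2 * t.
def toy_mu_hat(x: Covariates, t: float) -> float:
    return x[0] + 2.0 * t


xs = [[0.0], [1.0], [2.0]]
print(adrf_estimate(toy_mu_hat, xs, 0.5))  # mean(x_0) + 2 * 0.5 = 2.0
print(ate_estimate(toy_mu_hat, xs))        # 2.0
```

The estimate is only as good as the outcome model it averages, and it carries a causal interpretation only under the two assumptions stated above.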

4. TRANSTEE: TRANSFORMERS AS TREATMENT EFFECT ESTIMATORS

The systematic similarity of potential outcomes of different treatment groups is important for TEE (Curth & van der Schaar, 2021b). Note that x is often high-dimensional while t is not, which means naively feeding (x, t) to MLPs is not favorable since the impacts of treatment tend to be lost. As a result, various architectures and regularizations have been proposed to enforce structural similarity and differences among treatment groups. However, they are limited to specific use cases, as shown in Section 2 and Figure 2. To remedy this, we use three simple yet effective design choices based on attention mechanisms. The resulting scalable framework, TransTEE, can tackle the problems of most existing treatment effect estimators (e.g., multiple/continuous/structured treatments, treatment interactions, and treatments with dosage) without ad hoc architectural designs such as multiple branches.
Preliminary. The main module in TransTEE is the attention layer (Vaswani et al., 2017): given $d$-dimensional query, key, and value matrices $Q \in \mathbb{R}^{d \times d_k}$, $K \in \mathbb{R}^{d \times d_k}$, $V \in \mathbb{R}^{d \times d_v}$, the attention mechanism computes the output as $H(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$. In practice, multi-head attention is preferable, as it jointly attends to information from different representation subspaces: $H_M(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$, where $\mathrm{head}_i = H(QW_i^Q, KW_i^K, VW_i^V)$, and $W_i^Q \in \mathbb{R}^{d \times d_k}$, $W_i^K \in \mathbb{R}^{d \times d_k}$, $W_i^V \in \mathbb{R}^{d \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d}$ are learnable matrices.
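As a concrete reading of the attention formulas above, the following NumPy sketch implements scaled dot-product attention and its multi-head variant. It is an illustration under stated assumptions rather than the authors' implementation: tokens are stored as rows, the projection matrices are random placeholders, and all sizes are arbitrary toy values.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v


def multi_head_attention(q_in, k_in, v_in, w_q, w_k, w_v, w_o):
    """Concatenate h attention heads and mix them with the output projection W^O.

    w_q and w_k have shape (h, d, d_k), w_v has shape (h, d, d_v),
    and w_o has shape (h * d_v, d).
    """
    heads = [
        attention(q_in @ w_q[i], k_in @ w_k[i], v_in @ w_v[i])
        for i in range(w_q.shape[0])
    ]
    return np.concatenate(heads, axis=-1) @ w_o


rng = np.random.default_rng(0)
n_q, n_kv, d, d_k, d_v, h = 2, 5, 8, 4, 4, 2  # toy sizes, chosen arbitrarily
q_in = rng.normal(size=(n_q, d))
kv_in = rng.normal(size=(n_kv, d))
w_q = rng.normal(size=(h, d, d_k))
w_k = rng.normal(size=(h, d, d_k))
w_v = rng.normal(size=(h, d, d_v))
w_o = rng.normal(size=(h * d_v, d))
out = multi_head_attention(q_in, kv_in, kv_in, w_q, w_k, w_v, w_o)
print(out.shape)  # (2, 8): one d-dimensional output per query token
```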

4.1. COVARIATE AND TREATMENT EMBEDDING LAYERS

Treatment Embedding Layer. As illustrated in Figure 2 and Table 1, treatments are often of much lower dimension than covariates. To avoid missing the impacts of treatments, previous works (e.g., DragonNet (Shi et al., 2019), FlexTENet (Curth & van der Schaar, 2021b), DRNet (Schwab et al., 2020)) assign covariates from different treatment groups to different branches, which is highly parameter-inefficient. Besides, we analyze in Proposition 2 (Appendix D) that, for continuous treatments/dosages, the performance is affected by both the number of branches and the value interval of the treatment. However, almost all previous works on continuous treatments/dosages assume that the treatment or dosage lies in a fixed value interval, e.g., [0, 1], and Figure 3 shows that prevalent works fail when tested under shifts of treatments. These two observations motivate us to use two learnable linear layers to project scalar treatments and dosages to $d$-dimensional vectors separately: $M_t = \mathrm{Linear}(t)$, $M_s = \mathrm{Linear}(s)$, where $M_t \in \mathbb{R}^d$. $M_s \in \mathbb{R}^d$ exists only when each treatment has a dosage parameter; otherwise, only the treatment embedding is needed. When multiple ($n$) treatments act simultaneously, the projected matrices are $M_t \in \mathbb{R}^{d \times n}$ and $M_s \in \mathbb{R}^{d \times n}$, and when facing structural treatments (languages, graphs), the treatment embedding is produced by language models and graph neural networks, respectively. By using the treatment embeddings, TransTEE is shown to be (i) robust under treatment shifts, and (ii) parameter-efficient.
Covariates Embedding Layer. Unlike previous works that embed all covariates with one fully connected layer, where the differences between covariates tend to be lost and the function of an individual covariate in a sample is hard to study, TransTEE learns a different embedding for each covariate, namely $M_x = \mathrm{Linear}(x)$ with $M_x \in \mathbb{R}^{d \times p}$, where $p$ is the number of covariates. Covariate embedding enables us to study the effect of individual covariates on the outcome.
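The two embedding layers can be sketched as follows. This is a minimal illustration that assumes each scalar input (treatment, dosage, or a single covariate) is mapped to a d-dimensional token by its own weight and bias vector; the parameter names are hypothetical, the values are random placeholders rather than learned weights, and tokens are stored as rows, i.e., the transpose of the d × p layout used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 8, 5  # embedding width and number of covariates (toy values)

# Hypothetical parameters; in practice these would be learned.
w_t, b_t = rng.normal(size=d), rng.normal(size=d)            # treatment projection
w_s, b_s = rng.normal(size=d), rng.normal(size=d)            # dosage projection
w_x, b_x = rng.normal(size=(p, d)), rng.normal(size=(p, d))  # one projection per covariate


def embed_scalar(value: float, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Linear layer with a one-dimensional input: value * w + b, shape (d,)."""
    return value * w + b


def embed_covariates(x: np.ndarray) -> np.ndarray:
    """Embed each of the p covariates separately; returns p tokens of width d."""
    return x[:, None] * w_x + b_x


x = rng.normal(size=p)
m_t = embed_scalar(0.3, w_t, b_t)  # treatment token
m_s = embed_scalar(1.5, w_s, b_s)  # dosage token (only when a dosage exists)
m_x = embed_covariates(x)          # one token per covariate
print(m_t.shape, m_s.shape, m_x.shape)  # (8,) (8,) (5, 8)

# The same projection applies unchanged to treatment values outside [0, 1].
print(embed_scalar(5.0, w_t, b_t).shape)  # (8,)
```

Because every covariate keeps its own token, later attention weights can be read off per covariate, which is what the interpretability analysis relies on.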

4.2. COVARIATE AND TREATMENT SELF-ATTENTION

For covariates, prevalent methods represent covariates as a whole feature using MLPs, where pairwise covariate interactions are lost when adjusting covariates. Consequently, the effect of each covariate on the estimated result cannot be studied. In contrast, TransTEE processes each covariate embedding independently and models their interactions with self-attention layers. Namely, $\hat{M}_x^l = H_M(M_x^{l-1}, M_x^{l-1}, M_x^{l-1}) + M_x^{l-1}$, $M_x^l = \mathrm{MLP}(\mathrm{BN}(\hat{M}_x^l)) + \hat{M}_x^l$, where $M_x^l$ is the output of the $l$-th layer and BN is the BatchNorm layer. Simultaneously, the treatment and dosage embeddings are concatenated and projected to the latent dimension by a linear layer, which generates a new embedding $M_{st} \in \mathbb{R}^d$. Then self-attention is applied: $\hat{M}_{st}^l = H_M(M_{st}^{l-1}, M_{st}^{l-1}, M_{st}^{l-1}) + M_{st}^{l-1}$, $M_{st}^l = \mathrm{MLP}(\mathrm{BN}(\hat{M}_{st}^l)) + \hat{M}_{st}^l$. The self-attention layer for treatments enables treatment interactions, an important desideratum for S- and T-learners. Namely, TransTEE can model the scenario where multiple treatments are applied and attain strong practical utility, e.g., multiple prescriptions in healthcare or different financial measures in economics. This is an effective remedy for existing methods, which are limited to settings where various treatments are not used simultaneously.
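The layer update described above (residual self-attention followed by a residual MLP over batch-normalized features) can be sketched in NumPy as below. Several simplifications are assumptions made for brevity: a single attention head is used, batch normalization has no learned scale or shift and normalizes over the batch and token axes, the MLP has two layers with a ReLU, and all weights are random placeholders.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(m, w_q, w_k, w_v):
    """Single-head self-attention over the token axis; m has shape (batch, tokens, d)."""
    q, k, v = m @ w_q, m @ w_k, m @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def batch_norm(m, eps=1e-5):
    """Normalize each feature over the batch and token axes (no learned scale or shift)."""
    mean = m.mean(axis=(0, 1), keepdims=True)
    var = m.var(axis=(0, 1), keepdims=True)
    return (m - mean) / np.sqrt(var + eps)


def mlp(m, w1, w2):
    """Two-layer feed-forward network with a ReLU in between."""
    return np.maximum(m @ w1, 0.0) @ w2


def encoder_layer(m_prev, params):
    """One layer: residual self-attention, then residual MLP(BN(.))."""
    m_hat = self_attention(m_prev, params["w_q"], params["w_k"], params["w_v"]) + m_prev
    return mlp(batch_norm(m_hat), params["w1"], params["w2"]) + m_hat


rng = np.random.default_rng(0)
batch, tokens, d = 4, 5, 8  # toy sizes
params = {name: rng.normal(scale=0.1, size=(d, d)) for name in ("w_q", "w_k", "w_v", "w1", "w2")}
m0 = rng.normal(size=(batch, tokens, d))
m1 = encoder_layer(m0, params)
print(m1.shape)  # (4, 5, 8): same shape as the input, so layers can be stacked
```

The same function applies to covariate tokens and to treatment/dosage tokens; only the number of tokens differs.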

4.3. TREATMENT-COVARIATE CROSS-ATTENTION

One of the fundamental challenges of causal meta-learners is to model treatment-covariate interactions. TransTEE realizes this with a cross-attention module, treating $M_{st}$ as the query and $M_x$ as both key and value: $\hat{M}^l = H_M(M_{st}^{l-1}, M_x^{l-1}, M_x^{l-1}) + M^{l-1}$, $M^l = \mathrm{MLP}(\hat{M}^l) + \hat{M}^l$, $\hat{y} = \mathrm{MLP}(\mathrm{Pooling}(M^L))$, where $M^L$ is the output of the last cross-attention layer and $M^0 = M_{st}^L$. The above interactions are particularly important for adjusting proper covariate or confounder sets for estimating treatment effects (VanderWeele, 2019), which empirically yields suitable covariate adjustment principles (the disjunctive cause criterion) (De Luna et al., 2011; VanderWeele, 2019) about pre-treatment covariates and confounders, as intuitively illustrated in Figure 1 and corroborated in our experiments. Denoting $\hat{y} := \mu_\theta(x, t)$, the training objective is the mean squared error (MSE) of the outcome regression: $\mathcal{L}_\theta(x, y, t) = \sum_{i=1}^{n}(y_i - \mu_\theta(x_i, t_i))^2$.
Remark. We include an illustration of TransTEE with a concrete example in Appendix B. Note that, although embedding techniques and attention mechanisms are commonly used in the Computer Vision and Natural Language Processing communities, it is not well understood how to guide the design of these modules for causal inference, and why these techniques benefit TEE tasks is underexplored. In this work, we design a strong TEE architecture through the flexible use of embedding and attention mechanisms, and we further use conceptual analysis and empirical results to show the benefits of these design choices. Besides, when combined with the strong modeling capacity of Transformers, TransTEE can be extended flexibly and effectively to high-dimensional and structured data. The generalizability of TransTEE also allows new applications, such as auditing language models beyond semi-synthetic settings, as shown in the next section.
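A single cross-attention step with pooling and the squared-error objective might look as follows. This is a sketch under assumptions, not the reference implementation: one attention head, one cross-attention layer, mean pooling, a linear read-out in place of the final MLP, and randomly initialized placeholder weights. The returned attention weights are the quantities later inspected for covariate adjustment.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attention(m_st, m_x, w_q, w_k, w_v):
    """Treatment tokens (queries) attend over covariate tokens (keys and values)."""
    q, k, v = m_st @ w_q, m_x @ w_k, m_x @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # shape: (treatment tokens, covariates)
    return weights @ v, weights


def predict(m_st, m_x, params):
    """One cross-attention layer, a residual MLP, mean pooling and a linear read-out."""
    attended, weights = cross_attention(m_st, m_x, params["w_q"], params["w_k"], params["w_v"])
    m_hat = attended + m_st  # residual connection to the treatment representation
    m = np.maximum(m_hat @ params["w1"], 0.0) @ params["w2"] + m_hat
    y_hat = float(m.mean(axis=0) @ params["w_out"])
    return y_hat, weights


def squared_error(y, y_hat) -> float:
    """Sum of squared residuals, matching the outcome-regression objective."""
    return float(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))


rng = np.random.default_rng(0)
d, p = 8, 5  # toy sizes
params = {name: rng.normal(scale=0.1, size=(d, d)) for name in ("w_q", "w_k", "w_v", "w1", "w2")}
params["w_out"] = rng.normal(scale=0.1, size=d)
m_st = rng.normal(size=(1, d))  # one joint treatment/dosage token
m_x = rng.normal(size=(p, d))   # one token per covariate
y_hat, weights = predict(m_st, m_x, params)
print(weights.shape)                          # (1, 5): one weight per covariate
print(squared_error([1.0, 2.0], [1.0, 4.0]))  # 4.0
```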

5. EXPERIMENTAL RESULTS

We elaborate on basic experimental settings, results, analysis, and empirical studies in this section. See Appendix E for full details of all experimental settings and detailed definitions of metrics. See Appendix F for many more results and remarks.

5.1. EXPERIMENTAL SETTINGS

Datasets. Since the true counterfactual outcome (or ADRF) is rarely available for real-world data, we use synthetic or semi-synthetic data for empirical evaluation. For continuous treatments, we use one synthetic dataset and two semi-synthetic datasets: the IHDP and News datasets. For treatments with continuous dosages, we obtain covariates from a real dataset, TCGA (Chang et al., 2013), and generate treatments, where each treatment is accompanied by a dosage. The resulting dataset is named TCGA (D). Following Kaddour et al. (2021), datasets for structured treatments include Small-World (SW), which contains 1,000 uniformly sampled covariates and 200 randomly generated Watts-Strogatz small-world graphs (Watts & Strogatz, 1998) as treatments, and TCGA (S), which uses 9,659 gene expression of cancer patients (Chang et al., 2013) for covariates and 10,000 molecules from the QM9 dataset (Ramakrishnan et al., 2014) as treatments. For the study on language models, we use the Enriched Equity Evaluation Corpus (EEEC) (Feder et al., 2021).
Baselines. Baselines for continuous and binary treatments include TARNet (Shalit et al., 2017), DragonNet (Shi et al., 2019), DRNet (Schwab et al., 2020), FlexTENet (Curth & van der Schaar, 2021b), and VCNet (Nie et al., 2021). SCIGAN (Bica et al., 2020b) is chosen as the baseline for continuous dosages. Besides, we revise DRNet (Schwab et al., 2020), TARNet (Shalit et al., 2017), and VCNet (Nie et al., 2021) to DRNet (D), TARNet (D), and VCNet (D), respectively, which enable multiple treatments and dosages. Specifically, DRNet (D) has $T$ main flows, each corresponding to a treatment and divided into $B_D$ branches for continuous dosage. Baselines for structured treatments include Zero (Kaddour et al., 2021), GNN (Kaddour et al., 2021), GraphITE (Harada & Kashima, 2021), and SIN (Kaddour et al., 2021).
To compare the performance of different frameworks fairly, all of the models regress on the outcome with empirical samples without any regularization. For MLE training of the propensity score model, the objective is the negative log-likelihood: $\mathcal{L}_\phi := -\frac{1}{n}\sum_{i=1}^{n}\log \pi_\phi(t_i \mid x_i)$.
Evaluation Metric. For continuous and binary treatments, we use the average mean squared error (AMSE) on the test set. For structured treatments, following Kaddour et al. (2021), we rank all treatments by their propensity $\pi(t \mid x)$ in descending order. The top $K$ treatments are selected and the treatment effect of each treatment pair is evaluated by the unweighted/weighted expected Precision in Estimation of Heterogeneous Effect (PEHE) (Kaddour et al., 2021), where WPEHE@K accounts for the fact that treatment pairs that are less likely, and hence tend to have higher estimation errors, should be given less importance. For multiple treatments and dosages, AMSE is calculated over all dosage and treatment pairs, resulting in $\mathrm{AMSE}_D$.
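Two of the quantities above are simple enough to sketch directly: the negative log-likelihood objective for the propensity model, and the selection of the top-K treatments by propensity together with the treatment pairs that are then evaluated. The function names are hypothetical, and the PEHE computation itself is omitted because its full definition is deferred to the appendix.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def propensity_nll(likelihoods: Sequence[float]) -> float:
    """Negative log-likelihood: -(1/n) * sum_i log pi(t_i | x_i)."""
    return -sum(math.log(p) for p in likelihoods) / len(likelihoods)


def top_k_treatment_pairs(propensities: Sequence[float], k: int) -> List[Tuple[int, int]]:
    """Rank treatments by propensity (descending), keep the top k, and list all pairs."""
    ranked = sorted(range(len(propensities)), key=lambda i: propensities[i], reverse=True)
    return list(combinations(ranked[:k], 2))


# Toy inputs, chosen only for illustration.
print(propensity_nll([0.5, 0.5]))                        # log(2), about 0.693
print(top_k_treatment_pairs([0.1, 0.4, 0.2, 0.3], k=3))  # [(1, 3), (1, 2), (3, 2)]
```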

5.2. CASE STUDY AND NUMERICAL RESULTS

Case study on treatment distribution shifts. We start by conducting a case study on treatment distribution shifts (Figure 3), exploring an extrapolation setting in which the treatment may subsequently be administered at values never seen during training. We find that while standard results rely on constraining the values of treatments (Nie et al., 2021) and dosages (Schwab et al., 2020) to a specific range, our method performs surprisingly well when extrapolating beyond these ranges, as assessed on several benchmarks. By comparison, other methods appear brittle in these same settings. See Appendix D for a detailed discussion.
Case study of propensity modeling. TransTEE is conceptually simple and effective. However, when the sample size is small, it becomes important to account for selection bias (Alaa & Schaar, 2018), and most existing regularizations can only be used when the treatments are discrete (Bica et al., 2020a; Kallus, 2020; Du et al., 2021). Thus we propose two regularization variants for continuous treatments/dosages, termed Treatment Regularization (TR), $\mathcal{L}_\phi^{TR}(x, t) = \sum_{i=1}^{n}\left(t_i - \pi_\phi(\hat{t}_i \mid x_i)\right)^2$, and its probabilistic version, Probabilistic Treatment Regularization (PTR), $\mathcal{L}_\phi^{PTR} = \sum_{i=1}^{n}\left[\frac{(t_i - \pi_\phi(\mu \mid x_i))^2}{2\pi_\phi(\sigma^2 \mid x_i)} + \frac{1}{2}\log \pi_\phi(\sigma^2 \mid x_i)\right]$. The overall model is trained in an adversarial pattern, namely $\min_\theta \max_\phi \mathcal{L}_\theta(x, y, t) - \mathcal{L}_\phi(x, t)$. Specifically, a propensity score model $\pi_\phi(t \mid x)$ parameterized by an MLP is learned by minimizing $\mathcal{L}_\phi(x, t)$, and then the outcome estimator $\mu_\theta(x, t)$ is trained by $\min_\theta \mathcal{L}_\theta(x, y, t) - \mathcal{L}_\phi(x, t)$. To overcome selection bias in the representation space, the bilevel optimization enforces effective treatment effect estimation while modeling the discriminative propensity features, so as to partial out the parts of the covariates that cause the treatment but not the outcome and dispose of nuisance variations of covariates (Kaddour et al., 2021).
Such a recipe can account for selection bias where $\pi(t \mid x) \neq p(t)$ and leave spurious correlations out, and can also be more robust under model misspecification, especially in settings that require extrapolation on treatment (see Table 2 and Appendix C for concrete formalisms and discussions). As shown in Table 2, Appendix Table 5 and Table 12, with the addition of adversarial training as well as TR and PTR, TransTEE's estimation error with continuous treatments can be further reduced. Overall, TR is better in the continuous case with smaller treatment distribution shifts, while PTR is preferable when shifts are greater. Neither TR nor PTR brings performance gains in discrete cases. The superiority of TR and PTR in combination with TransTEE over a comprehensive set of existing works, especially on semi-synthetic benchmarks like IHDP that may systematically favor some types of algorithms over others (Curth et al., 2021), also calls for more understanding of NNs' inductive biases in treatment effect estimation problems of interest. Moreover, the covariate selection visualization for TR and PTR (Figure 4(a), Table 4 and Appendix F) supports the idea that modeling the propensity score effectively promotes covariate adjustment and partials out the effects of the covariates on the treatment features. We also compare the training dynamics of different regularizations in Appendix F, where TR and PTR are further shown to improve the convergence of TransTEE.
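The two propensity regularizers defined in the propensity-modeling case study reduce to a squared error and a Gaussian negative log-likelihood (up to an additive constant). The sketch below evaluates both on toy numbers; it assumes that the propensity network's predictions (a point estimate for TR, a mean and a variance for PTR) are already available as plain lists, and the function names are hypothetical.

```python
import math
from typing import Sequence


def treatment_regularization(t: Sequence[float], t_pred: Sequence[float]) -> float:
    """TR: sum of squared errors between observed and predicted treatments."""
    return sum((ti - pi) ** 2 for ti, pi in zip(t, t_pred))


def probabilistic_treatment_regularization(
    t: Sequence[float], mean: Sequence[float], var: Sequence[float]
) -> float:
    """PTR: sum of (t - mean)^2 / (2 * var) + 0.5 * log(var) over all samples."""
    return sum(
        (ti - mi) ** 2 / (2.0 * vi) + 0.5 * math.log(vi)
        for ti, mi, vi in zip(t, mean, var)
    )


# Toy values, chosen only for illustration.
t_obs = [0.2, 0.8]
t_hat = [0.0, 1.0]
print(treatment_regularization(t_obs, t_hat))                            # about 0.08
print(probabilistic_treatment_regularization(t_obs, t_hat, [1.0, 1.0]))  # about 0.04
```

With unit predicted variance, PTR equals half of TR; the two differ once the network predicts sample-specific variances, which lets PTR down-weight samples whose treatment is hard to predict.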

Continuous treatments. To evaluate the efficiency with which TransTEE estimates the average dose-response function (ADRF), we compare against other recent NN-based methods (Table 2). Comparing results in each column, we observe performance boosts for TransTEE. Further, TransTEE attains a much smaller error than baselines in cases where the treatment interval is not restricted to [0, 1] (e.g., t ∈ [0, 5]) and when the training and test treatment intervals are different (extrapolation). Interestingly, even vanilla TransTEE produces competitive performance compared with that of π(t|x) trained additionally using MLE, demonstrating the ability of TransTEE to effectively model treatments and covariates. The estimated ADRF curves on the IHDP and News datasets are shown in Figure 11 and Figure 13 in the Appendix. TARNet and DRNet produce discontinuous ADRF estimators, and VCNet only performs well when t ∈ [0, 1]. In contrast, TransTEE attains lower estimation error and preserves the continuity of the ADRF on different treatment intervals.
Continuous dosage. In Table 5, we compare TransTEE against baselines on the TCGA (D) dataset with default treatment selection bias 2.0 and dosage selection bias 2.0. As the number of treatments [...] 13.
The performance gain between GNN and Zero indicates that taking graph information into account significantly improves estimation. The results suggest that, overall, TransTEE performs best owing to its strong modeling capability and advanced model structure for processing high-dimensional treatments. SIN is the best model among the baselines. However, when the bias is equal to 0.1, SIN fails to attain estimation results better than the Zero baseline. To evaluate each model's robustness to varying levels of selection bias, performance curves with κ ∈ [0, 40] for the SW dataset and κ ∈ [0, 0.5] for the TCGA dataset are shown in Figure 14.
Analysis of covariate adjustment of cross-attention module.
TransTEE embeds each covariate independently and then lets treatments select proper covariates for prediction via cross-attention. The resulting interpretability of the covariate adjustment process through attention weights is one clear advantage over existing works. We therefore visualize the covariate selection results (cross-attention weights) in Figure 4(a). As elaborated in Appendix E.3, the IHDP dataset has 25 covariates, which are divided into 3 groups: $S_{con} = \{1, 2, 3, 5, 6\}$, $S_{dis,1} = \{4, 7 \sim 15\}$, and $S_{dis,2} = \{16 \sim 25\}$. $S_{con}$ influences both $T$ and $Y$, $S_{dis,1}$ influences only $Y$, and $S_{dis,2}$ influences only $T$. Covariates in $S_{dis,1}$ are named noisy covariates since they have no correlation with the treatment. Their causal relationships are illustrated in Figure 5. Interestingly, confounders $S_{con}$ are assigned higher weights while noisy covariates $S_{dis,1}$ (those that influence the outcome but are irrelevant to the treatment) are assigned lower weights, which matches the principles in VanderWeele (2019) and corroborates the ability of TransTEE to estimate treatment effects in complex datasets by properly controlling both pre-treatment variables and confounders. Moreover, Figure 4(b) shows that TransTEE consistently outperforms baselines across different numbers of noisy covariates. We further conduct 10 repetitions for TransTEE and its TR and PTR counterparts, as reported in Table 4 (Appendix Figure 10 visualizes their cross-attention weights). Denote by $w_{con}$, $w_1$, $w_2$ the sums of the weights assigned to $S_{con}$, $S_{dis,1}$, $S_{dis,2}$, respectively. We can see that, with either TR or PTR regularization, TransTEE assigns more weight to confounding covariates ($S_{con}$) and less weight to noisy covariates, which further verifies the compatibility of TransTEE with propensity score modeling, since both TR and PTR improve confounding control. Moreover, TR is better than PTR since it also reduces $w_2$ by a larger margin. This observation suggests that we should systematically probe TR and PTR beyond comparing their numerical performance, especially in settings where the unconfoundedness assumption is violated (Ding et al., 2017) and controlling instrumental variables will incur biases in TEE.
Comparison of the number of model parameters. This experiment corroborates the conceptual comparison in Table 1. We find that the proposed TransTEE has consistently fewer parameters than baselines in all settings, as shown in Figure 4(c). Besides, as the number of treatments increases, or when a more accurate approximation of continuous treatments/dosages is required, most of these baselines need additional branches, which incurs parameter redundancy. In contrast, TransTEE is much more efficient.
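The group-wise weight sums used in the covariate-adjustment analysis above (one sum per covariate group) can be computed from a vector of per-covariate cross-attention weights as sketched below. The index sets follow the IHDP grouping listed in the text; the uniform weight vector is a placeholder for illustration and not a result from the experiments.

```python
from typing import Dict, Sequence

# Covariate groups of the IHDP setup as listed in the text (1-indexed).
GROUPS: Dict[str, Sequence[int]] = {
    "w_con": [1, 2, 3, 5, 6],         # confounders: influence both T and Y
    "w_1": [4] + list(range(7, 16)),  # noisy covariates: influence only Y
    "w_2": list(range(16, 26)),       # covariates that influence only T
}


def group_weights(attention_weights: Sequence[float]) -> Dict[str, float]:
    """Sum the per-covariate attention weights within each group."""
    return {
        name: sum(attention_weights[j - 1] for j in indices)  # shift to 0-indexed
        for name, indices in GROUPS.items()
    }


uniform = [1.0 / 25] * 25  # placeholder weights, not results from the paper
print(group_weights(uniform))  # roughly 0.2, 0.4 and 0.4
```

Under uniform attention the sums simply reflect group sizes, so a useful reading of the reported values compares them against this baseline.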

5.4. EMPIRICAL STUDY ON PRE-TRAINED LANGUAGE MODELS

To evaluate the real-world utility of TransTEE, in this subsection we demonstrate an initial attempt at auditing and debugging large pre-trained language models, an important use case in NLP that goes beyond semi-synthetic settings and is under-explored in the causal inference literature. Specifically, we use TransTEE to estimate the treatment effects of domain-specific factors of variation (such as a change in the subject's attributes in a sentence, e.g., racial and gender-related nouns) on the predictions of pre-trained language models. We experiment with BERT (Kenton & Toutanova, 2019) over natural language on the (real) EEEC dataset. We use the correlation/representation-based baselines introduced in Feder et al. (2021) and also implement treatment effect estimators (e.g., TARNet, DRNet, VCNet, and the proposed TransTEE). Interestingly, results in Table 3 show that TransTEE effectively estimates the treatment effects of domain-specific variation perturbations even without substantive downstream fine-tuning on specialized datasets. TransTEE outperforms baselines adapted from MLPs. Moreover, we showcase the top-k samples with the maximal/minimal ITE, together with an analysis, in Appendix F.3. The results show that TransTEE has the potential to provide estimators for practical use cases in predicting model predictions (Ilyas et al., 2022). For example, the identified samples can provide actionable insights, such as serving as contrast sets for analyzing and understanding LMs (Gardner et al., 2020; Abraham et al., 2022), and TransTEE can estimate the ATE to enforce invariance or fairness constraints for LMs (Veitch et al., 2021) in a lightweight and efficient manner, which we leave for future work.

6. CONCLUDING REMARKS

In this work, we show that attention mechanisms can be effective and versatile design choices for TEE tasks. Extensive experiments verify the effectiveness and utility of the proposed TransTEE, and also imply that more challenging and unified evaluation alternatives for TEE are needed. Moreover, we hope that our findings can lay the groundwork for future work on advanced machine learning techniques for estimating treatment effects, such as pre-training on large-scale observational data, where TransTEE can serve as an effective backbone. As with almost all causal inference methods on observational data, one potential limitation of TransTEE is its reliance on the ignorability assumption. Therefore, one important future direction is to extend TransTEE to settings with more complex causal graphs and to generate identifiable causal functionals that are tractable for optimization (Jung et al., 2020) and supported by identification theory. Since adjusting for covariates without accounting for the causal graph might yield inaccurate or biased estimates of the causal effect (Pearl, 2009), how to integrate TransTEE with domain knowledge (Imbens & Rubin, 2015) to alleviate its potential negative societal impacts in consequential decision making will also be important.


A EXTENDED RELATED WORK

Propensity Score. Most related works fundamentally rely on strong ignorability conditions. Still, even under ignorability, treatments may be selectively assigned according to propensities that depend on the covariates. To overcome the impact of such confounding, many statistical methods (Austin, 2011) have been proposed, such as covariate adjustment (Austin, 2011), matching (Rubin & Thomas, 1996; Abadie & Imbens, 2016), stratification (Frangakis & Rubin, 2002), reweighting (Hirano et al., 2003), and g-computation (Imbens & Rubin, 2015). More recent approaches include propensity dropout (Alaa et al., 2017) and multi-task Gaussian processes (Alaa & van der Schaar, 2017). Explicitly modeling the propensity score, which reflects the underlying policy for assigning treatments to subjects, has also been shown to be effective in reasoning about the unobserved counterfactual outcomes and accounting for confounding. Building on it, doubly robust estimators and targeted regularization have been proposed to guarantee the consistency of estimated treatment effects under misspecification of either the outcome or the propensity score model (Kang & Schafer, 2007; Funk et al., 2011). There are also works using adversarial training for balanced representations (Bica et al., 2020a; Kallus, 2020; Du et al., 2021). However, most traditional approaches are restricted to binary treatments, and the capacity of NNs for such problems has not been fully leveraged.

Domain Adaptation

There are some close connections between causal inference and domain adaptation, in particular, out-of-distribution robustness. Intuitively, traditional domain adversarial training learns representations that are indistinguishable to the domain classifier by minimizing the worst-domain empirical error (Ganin et al., 2016; Zhao et al., 2018; Wang et al., 2022; Zhang et al., 2022). The algorithmic insights can be readily translated to the TEE domain (Shalit et al., 2017; Johansson et al., 2020; Feder et al., 2021). Here we also have the desideratum that covariate representations should be balanced such that the selection bias is minimized and the effect is maximally determined by the treatment. Algorithmically, when the treatment is continuous, we connect our method to continuously indexed domain adaptation (Wang et al., 2020). Our formulation and algorithm also serve to build connections between diverse lines of statistical thinking on causal inference and domain adaptation, where much can be gained from a mutual exchange of ideas (Johansson et al., 2020). Explicitly modeling the propensity score also seeks to connect causal inference with transfer learning to inspire domain adaptation methodology, and holds the potential to handle a wider range of problems such as hidden stratification in domain generalization, which we leave for future work.

Comparison between TransTEE and ANU (Xu et al., 2022). (i) The model structure is different. ANU performs cross-attention between z_x and z_t, and no self-attention is applied. TransTEE, however, performs self-attention on z_x and z_t respectively, and then performs cross-attention between z_x and z_t. When facing high-dimensional data such as texts, images, and graphs, the representations will be weak without multiple self-attention layers on z_x and z_t separately. That is why in machine translation, object detection, and segmentation tasks, the representations of images/texts are first processed by multiple self-attention layers and only then attended to by queries through cross-attention. We will verify this point in the following experiments. (ii) ANU cannot be applied to multi-treatment settings, which have been extensively studied recently (Kaddour et al., 2021; Bica et al., 2020b; Parbhoo et al., 2021). The comparison experiments are in Section F.1.

B AN ILLUSTRATIVE EXAMPLE

To better understand the workflow with the above designs, we present a simple illustration here. Consider a use case in medicine effect estimation, where x contains p patient attributes, e.g., Age, Sex, Blood Pressure (BP), and Previous infection condition (Prev), with a corresponding causal graph (Figure 1). Suppose n medicines (treatments) are applied simultaneously and each medicine has a corresponding dosage. As shown in Figure 6, each covariate, treatment, and dosage is first embedded into a d-dimensional representation by a specific learnable embedding layer. Each treatment embedding is concatenated with its dosage embedding, and the concatenated feature is projected by a linear layer to produce a d-dimensional vector. The self-attention modules refine these embeddings by aggregating contextual information. Specifically, the attribute Prev is more related to Age than to Sex, hence the attention weight from the Prev feature to the Age feature is larger and the update of the Prev feature depends more on the Age feature. Similarly, the interactions among multiple medicines are also captured by the self-attention module.

The final cross-attention module enables treatment-covariate interactions. As shown in Figure 2, each medicine assigns a higher weight to relevant covariates, especially confounders (BP), than to irrelevant ones. Finally, we pool the resulting embeddings and use one linear layer to predict the outcome.
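The workflow described in this example can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`transtee_forward`, `attention`, `rand_matrix`) are hypothetical, random weights stand in for learned parameters, attention is single-head, and the BatchNorm and feed-forward sublayers listed in Appendix E.2 are omitted.

```python
import math
import random


def linear(vec, weight):
    """Map a vector of length d_in through a d_in x d_out weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns (outputs, weights)."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(dim)])
    return outputs, weights


def rand_matrix(rng, rows, cols):
    """Random weights standing in for learned parameters (an assumption)."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def transtee_forward(covariates, treatments, dosages, dim=4, seed=0):
    rng = random.Random(seed)
    # One embedding layer per covariate: scalar -> dim-dimensional token.
    z_x = [linear([x], rand_matrix(rng, 1, dim)) for x in covariates]
    # Treatment and dosage embeddings are concatenated, then projected to dim.
    w_t, w_s = rand_matrix(rng, 1, dim), rand_matrix(rng, 1, dim)
    w_proj = rand_matrix(rng, 2 * dim, dim)
    z_t = [linear(linear([t], w_t) + linear([s], w_s), w_proj)
           for t, s in zip(treatments, dosages)]
    # Self-attention within covariates and within treatments.
    z_x, _ = attention(z_x, z_x, z_x)
    z_t, _ = attention(z_t, z_t, z_t)
    # Cross-attention: each treatment token queries the covariate tokens.
    h, cross_weights = attention(z_t, z_x, z_x)
    # Mean-pool over treatment tokens and apply one linear output layer.
    pooled = [sum(col) / len(h) for col in zip(*h)]
    y_hat = linear(pooled, rand_matrix(rng, dim, 1))[0]
    return y_hat, cross_weights


# Age, Sex, BP, Prev (already scaled) and two medicines with their dosages.
y_hat, cross_weights = transtee_forward([0.3, 1.0, 0.8, 0.0], [1.0, 1.0], [0.2, 0.7])
print(len(cross_weights), len(cross_weights[0]))  # one weight row per medicine, one column per covariate
```

Each row of `cross_weights` sums to one and plays the role of the heatmap discussed above; with trained parameters, larger entries would indicate the covariates a medicine attends to most.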

C DETAILS AND DISCUSSIONS ABOUT PROPENSITY SCORE MODELLING

We first discuss the fundamental differences and common goals between our algorithm and traditional ones: as a general approach to causal inference, [...] (Kaddour et al., 2021); and (iii) taking an adversarial domain adaptation perspective, the methodology is effective for learning invariant representations and further regularizes $\mu_\theta(x, t)$ to be invariant to nuisance factors, and may perform better empirically on some classes of distribution shifts (Ganin et al., 2016; Shalit et al., 2017; Zhao et al., 2018; Johansson et al., 2020; Wang et al., 2020).

Based on the above discussion, when treatments are discrete, one might consider directly applying heuristic methods like adversarial domain adaptation (see Ganin et al. (2016) and Zhao et al. (2018) for algorithmic development guidelines). We note the heuristic nature of domain-adversarial methods (see Wu et al. (2019) for clear failure cases and a debunking of the common claim that Ben-David et al. (2010) guarantees the robustness of such methods). Here, we focus on continuous TEE, a more general and challenging scenario in which we want to estimate the ADRF, and accordingly propose two variants of $L_\phi$ as an adversary for the outcome regression objective $L_\theta$. Recall that $L_\theta(x, y, t) = \sum_{i=1}^{n} (y_i - \mu_\theta(x_i, t_i))^2$. The adversarial training process is given in Eq. 1:

$$\min_\theta \max_\phi \; L_\theta(x, y, t) - L_\phi(x, t). \tag{1}$$

We refer to the above minimax game as algorithmic randomization, used in place of costly randomized controlled trials. Such algorithmic randomization, based on neural representations using the propensity score, creates subgroups of differently treated units as if they had been randomly assigned to different treatments, such that the conditional independence $T \perp\!\!\!\perp X \mid \pi(T \mid X)$ is enforced across strata and continuation, which approximates a randomized block experiment with respect to the observed covariates (Imbens & Rubin, 2015).
Below we introduce two variants of $L_\phi(x, t)$.

Treatment Regularization (TR) is a standard MSE over the treatment space given the predicted treatment $\hat t_i$ and the ground truth $t_i$:

$$L^{TR}_\phi(x, t) = \sum_{i=1}^{n} \left(t_i - \pi_\phi(\hat t_i \mid x_i)\right)^2. \tag{2}$$

TR explicitly matches the mean of the propensity score to that of the treatment. In an ideal case, $\pi(t \mid x)$ should be uniformly distributed given different $x$. However, the above treatment regularization procedure only matches the mean of the propensity score, which can be prone to bad equilibria and treatment misalignment (Wang et al., 2020). Thus, we introduce the distribution of $t$ and model the uncertainty rather than predicting a scalar $t$.

Probabilistic Treatment Regularization (PTR) is a probabilistic version of TR which models the mean $\mu$ (with a slight abuse of notation) and variance $\sigma^2$ of the estimated treatment $\hat t_i$:

$$L^{PTR}_\phi = \sum_{i=1}^{n} \left[\frac{\left(t_i - \pi_\phi(\mu \mid x_i)\right)^2}{2\,\pi_\phi(\sigma^2 \mid x_i)} + \frac{1}{2} \log \pi_\phi(\sigma^2 \mid x_i)\right]. \tag{3}$$

PTR matches the whole distribution, i.e., both the mean and the variance, of the propensity score to that of the treatment, which can be preferable in certain cases.

Equilibrium of the Minimax Game. We show that TR and PTR align the first and second moments of continuous treatments at equilibrium, respectively, and thus promote independence between the treatment $t$ and the covariates $x$. To be clear, we denote $\mu_\theta(x, t) := w_y \circ (\Phi_x(x), \Phi_t(t))$ and $\pi_\phi(t \mid x) := w_t \circ \Phi_x(x)$, which decompose the predictions into featurizers $\Phi_t : \mathcal{T} \to \mathcal{Z}_T$, $\Phi_x : \mathcal{X} \to \mathcal{Z}_X$ and predictors $w_y : \mathcal{Z}_X \times \mathcal{Z}_T \to \mathcal{Y}$, $w_t : \mathcal{Z}_X \to \mathcal{T}$. For example, $\Phi_x(x)$ and $\Phi_t(t)$ can be the linear embedding layers and attention modules in our implementation. The propensity is computed on $\Phi_x(x)$, an intermediate feature representation of $x$. Similarly, $\mu_\theta(x, t)$ is computed from $\Phi_t(t)$ and $\Phi_x(x)$. For ease of analysis below, we assume the predictors $w_t$, $w_y$ are fixed.

Proposition 1.
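(The optimum of propensity score model) In the equilibrium of the game, assuming the outcome prediction model is fixed, the optimum of TR is achieved when $\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)] = \mathbb{E}[\Phi_t(t)],\ \forall\, \Phi_x(x)$, via matching the mean of the propensity score $\pi(\Phi_t(t) \mid \Phi_x(x))$ and the marginal distribution $p(\Phi_x(x))$, and the optimum discriminator of PTR is achieved via matching both the mean and the variance such that $\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)] = \mathbb{E}[\Phi_t(t)]$, $\mathbb{V}[\Phi_t(t) \mid \Phi_x(x)] = \mathbb{V}[\Phi_t(t)],\ \forall\, \Phi_x(x)$.

Proof. The proof concerns the analysis of the equilibrium of the minimax game. It is a special case of Wang et al. (2020) when there are only two players, i.e., $\mu_\theta$ and $\pi_\phi$. We represent treatments explicitly and interpret the connections with combating selection bias. With the outcome regression model $\mu_\theta$ fixed, the optimal propensity score model $\pi^*$ is

$$\begin{aligned}
\pi^* &= \arg\min_\pi L_\phi(\Phi_x(x), \Phi_t(t)) \\
&= \arg\min_\pi \mathbb{E}_{(\Phi_x(x), \Phi_t(t)) \sim p(\Phi_x(x), \Phi_t(t))} \left(\Phi_t(t) - \pi_\phi(\Phi_t(\hat t) \mid x)\right)^2 \\
&= \arg\min_\pi \mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))}\, \mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))} \left(\Phi_t(t) - \pi_\phi(\Phi_t(\hat t) \mid x)\right)^2.
\end{aligned} \tag{4}$$

The inner minimum is achieved at $\pi^*_\phi(\Phi_t(\hat t) \mid x) = \mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))}[\Phi_t(t)]$, given the following quadratic form:

$$\begin{aligned}
&\mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))} \left(\Phi_t(t) - \pi_\phi(\Phi_t(\hat t) \mid \Phi_x(x))\right)^2 \\
&\quad = \mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))}[\Phi_t(t)^2] - 2\,\pi_\phi(\Phi_t(\hat t) \mid x)\, \mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))}[\Phi_t(t)] + \pi_\phi(\Phi_t(\hat t) \mid x)^2.
\end{aligned} \tag{5}$$

We assume that the above optimality condition of the propensity score model always holds with respect to the outcome regression model during training. Then the minimax game in Eq. 1 can be converted to maximizing the inner loop:

$$\begin{aligned}
\max_\phi -L_\phi(x, \Phi_t(t)) &= L_{\phi^*}(\Phi_x(x), \Phi_t(t)) \\
&= \mathbb{E}_{(\Phi_x(x), \Phi_t(t)) \sim p(\Phi_x(x), \Phi_t(t))} \left(\Phi_t(t) - \mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))}[\Phi_t(t)]\right)^2 \\
&= \mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))}\, \mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))} \left(\Phi_t(t) - \mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))}[\Phi_t(t)]\right)^2 \\
&= \mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))}\, \mathbb{V}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))}[\Phi_t(t)] \\
&= \mathbb{E}_{\Phi_x(x)}\, \mathbb{V}[\Phi_t(t) \mid \Phi_x(x)].
\end{aligned} \tag{6}$$

Next we show the difference between Eq. 6 and the variance of the treatment $\mathbb{V}[\Phi_t(t)]$:

$$\begin{aligned}
&\mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))}\, \mathbb{V}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))}[\Phi_t(t)] - \mathbb{V}[\Phi_t(t)] \\
&\quad = \mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))}\left[\mathbb{E}[\Phi_t(t)^2 \mid \Phi_x(x)] - \mathbb{E}[\Phi_t(t) \mid \Phi_x(x)]^2\right] - \left(\mathbb{E}[\Phi_t(t)^2] - \mathbb{E}[\Phi_t(t)]^2\right) \\
&\quad = \mathbb{E}[\Phi_t(t)]^2 - \mathbb{E}_{\Phi_x(x)}\left[\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)]^2\right] \\
&\quad = \left(\mathbb{E}_{\Phi_x(x)}\left[\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)]\right]\right)^2 - \mathbb{E}_{\Phi_x(x)}\left[\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)]^2\right] \\
&\quad \le \mathbb{E}_{\Phi_x(x)}\left[\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)]^2\right] - \mathbb{E}_{\Phi_x(x)}\left[\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)]^2\right] = 0,
\end{aligned} \tag{7}$$

where the last inequality follows from Jensen's inequality and the convexity of the square function. The optimum is achieved when $\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)]$ is constant with respect to $\Phi_x(x)$, and so $\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)] = \mathbb{E}[\Phi_t(t)],\ \forall\, \Phi_x(x)$.

The proof for PTR is similar but includes the derivation of variance matching:

$$\begin{aligned}
\pi^* &= \arg\min_\pi L_\phi(\Phi_x(x), \Phi_t(t)) \\
&= \arg\min_\pi \mathbb{E}_{(\Phi_x(x), \Phi_t(t)) \sim p(\Phi_x(x), \Phi_t(t))} \left[\frac{\left(\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)] - \Phi_t(t)\right)^2}{2\,\mathbb{V}[\Phi_t(t) \mid \Phi_x(x)]} + \frac{\log \mathbb{V}[\Phi_t(t) \mid \Phi_x(x)]}{2}\right] \\
&= \arg\min_\pi \mathbb{E}_{\Phi_x(x)}\, \mathbb{E}_{\Phi_t(t)} \left[\frac{\left(\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)] - \Phi_t(t)\right)^2}{2\,\mathbb{V}[\Phi_t(t) \mid \Phi_x(x)]} + \frac{\log \mathbb{V}[\Phi_t(t) \mid \Phi_x(x)]}{2}\right],
\end{aligned} \tag{8}$$

where $\mathbb{E}_{\Phi_x(x)}$ and $\mathbb{E}_{\Phi_t(t)}$ denote $\mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))}$ and $\mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))}$, respectively, for brevity. The first term reduces to a constant by the definition of variance:

$$\mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))}\, \mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))} \left[\frac{\left(\mathbb{E}[\Phi_t(t) \mid x] - \Phi_t(t)\right)^2}{2\,\mathbb{V}[\Phi_t(t) \mid x]}\right] = \mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))} \left[\frac{\mathbb{V}[\Phi_t(t) \mid x]}{2\,\mathbb{V}[\Phi_t(t) \mid x]}\right] = \frac{1}{2}. \tag{9}$$

The second term can be upper bounded using Jensen's inequality:

$$\mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))}\, \mathbb{E}_{\Phi_t(t) \sim p(\Phi_t(t) \mid \Phi_x(x))} \left[\frac{\log \mathbb{V}[\Phi_t(t) \mid x]}{2}\right] \le \frac{1}{2} \log \mathbb{E}_{\Phi_x(x) \sim p(\Phi_x(x))}\left[\mathbb{V}[\Phi_t(t) \mid \Phi_x(x)]\right] \le \frac{1}{2} \log \left(\mathbb{V}[\Phi_t(t)]\right). \tag{10}$$

Combining Eq. 9 and Eq. 10, the optimum $\frac{1}{2} + \frac{1}{2} \log(\mathbb{V}[\Phi_t(t)])$ is achieved when $\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)]$ and $\mathbb{V}[\Phi_t(t) \mid \Phi_x(x)]$ are constant with respect to $\Phi_x(x)$, and so $\mathbb{E}[\Phi_t(t) \mid \Phi_x(x)] = \mathbb{E}[\Phi_t(t)]$, $\mathbb{V}[\Phi_t(t) \mid \Phi_x(x)] = \mathbb{V}[\Phi_t(t)],\ \forall\, \Phi_x(x)$, according to the equality conditions of the first and second inequalities in Eq. 10, respectively.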
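As a small numerical companion to this appendix, the sketch below writes the TR and PTR objectives as plain Python functions and checks the inequality behind Eq. 7, namely that the expected conditional variance of the treatment never exceeds its marginal variance, on two toy datasets. The function names and toy data are illustrative assumptions; in the paper the predicted mean and variance come from the propensity network, whereas here they are passed in as lists.

```python
import math


def tr_loss(t, t_pred):
    """Treatment Regularization (Eq. 2): summed squared error on treatments."""
    return sum((ti - pi) ** 2 for ti, pi in zip(t, t_pred))


def ptr_loss(t, mean_pred, var_pred):
    """Probabilistic TR: squared error scaled by the predicted variance plus a log-variance term."""
    return sum((ti - m) ** 2 / (2.0 * v) + 0.5 * math.log(v)
               for ti, m, v in zip(t, mean_pred, var_pred))


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def expected_conditional_variance(groups):
    """E_x[V[t | x]], where each group holds the treatments seen for one value of x."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * variance(g) for g in groups)


# Treatments depend on x: the conditional variance is strictly below the marginal one.
confounded = [[0.1, 0.2, 0.3], [0.7, 0.8, 0.9]]
# Treatments independent of x: the two quantities coincide (equality case of Eq. 7).
balanced = [[0.1, 0.5, 0.9], [0.1, 0.5, 0.9]]

for name, groups in (("confounded", confounded), ("balanced", balanced)):
    marginal = variance([t for g in groups for t in g])
    print(name, expected_conditional_variance(groups), marginal)
```

With the predicted variance fixed to one, `ptr_loss` reduces to half of `tr_loss`, which makes explicit that PTR extends TR by additionally fitting the variance.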

D ANALYSIS OF THE FAILURE CASES OVER TREATMENT DISTRIBUTION SHIFTS

As shown in Figure 3(a, c), as the treatment interval shifts, the estimation performance of DRNet and TARNet declines significantly. VCNet attains an infinite (∞) estimation error when h = 5, partly because its hand-crafted projection matrix can only process values near [0, 1]. Another problem brought by this assumption is the extrapolation dilemma, which can be seen in Figure 3(b). When trained on t ∈ [0, 1.75], these discrete approximation methods cannot transfer to the new distribution t ∈ (1.75, 2.0]. These unseen treatments are rounded down to their nearest neighbors t′ in T and are treated the same as t′. We conduct an ablation on the treatment embedding in Table 6 in the Appendix. Such a simple fix (VCNet+Embeddings) removes the need for a fixed interval constraint on treatments and attains superior performance in both interpolation and extrapolation settings. The result clearly shows the pitfalls of hand-crafted feature mappings for TEE, which we highlight are neglected by most existing works (Schwab et al., 2020; Nie et al., 2021; Shi et al., 2019; Guo et al., 2021). Extrapolation is still a challenging open problem: no existing work does well when the training and test treatment intervals have large gaps. However, the empirical evidence validates the improved effectiveness of TransTEE, which uses learnable embeddings to map continuous treatments to hidden representations.

Below we show that the assumption that the values of treatments or dosages lie in a fixed interval [l, h] is sub-optimal, and thus these methods obtain poor extrapolation results. For simplicity, we only consider a data sample with a single continuous treatment t; the result is similar for continuous dosages.

Proposition 2. Consider a data sample $(x, t, y)$, where $x \in \mathbb{R}^d$, $t \in [l, h]$, $y \in \mathbb{R}$. Assume $\mu$ is an $L$-Lipschitz function over $(x, t) \in \mathbb{R}^{d+1}$, namely $|\mu(u) - \mu(v)| \le L\|u - v\|$. Partition $[l, h]$ uniformly into $\delta$ sub-intervals to obtain $T = \left\{l + \frac{h-l}{\delta} \cdot 0,\ l + \frac{h-l}{\delta} \cdot 1,\ \ldots,\ l + \frac{h-l}{\delta} \cdot \delta\right\}$. Previous studies mostly round a treatment $t$ to its nearest value in $T$ (either $l + \left\lfloor \frac{t\delta}{h-l} \right\rfloor \frac{h-l}{\delta}$ or $l + \left\lceil \frac{t\delta}{h-l} \right\rceil \frac{h-l}{\delta}$), and the approximation error satisfies

$$\max\left\{\left|\mu\left(x, \left\lfloor \tfrac{t\delta}{h-l} \right\rfloor \tfrac{h-l}{\delta}\right) - \mu(x, t)\right|,\ \left|\mu\left(x, \left\lceil \tfrac{t\delta}{h-l} \right\rceil \tfrac{h-l}{\delta}\right) - \mu(x, t)\right|\right\} \le \max\left\{L\left|\left\lfloor \tfrac{t\delta}{h-l} \right\rfloor \tfrac{h-l}{\delta} - t\right|,\ L\left|\left\lceil \tfrac{t\delta}{h-l} \right\rceil \tfrac{h-l}{\delta} - t\right|\right\} \le L\,\frac{h-l}{\delta}. \tag{11}$$

The bound is affected by both the number of branches $\delta$ and the treatment interval $[l, h]$. However, as far as we know, most previous works ignore the impact of the treatment interval $[l, h]$ and adopt a simple but much stronger assumption that treatments all lie in the interval [0, 1] (Nie et al., 2021) or in a fixed interval (Schwab et al., 2020). These observations motivate our general framework for TEE, which does not need treatment-specific architectural designs.
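The bound in Eq. 11 can be checked numerically. The following sketch is an illustration under stated assumptions rather than part of the original analysis: it uses a toy response `mu` that is 3-Lipschitz in t, snaps treatments to a uniform grid, and compares the worst observed gap with $L(h-l)/\delta$ for several interval widths. The grid index is computed from $t - l$, which coincides with the expression in the proposition when $l = 0$; all function names are hypothetical.

```python
import math


def snap_to_grid(t, low, high, delta, up=False):
    """Round t in [low, high] down (or up) to a uniform grid with delta sub-intervals."""
    step = (high - low) / delta
    index = (t - low) / step
    index = math.ceil(index) if up else math.floor(index)
    return low + index * step


def worst_case_gap(mu, x, low, high, delta, probes=2001):
    """Largest |mu(x, snapped t) - mu(x, t)| over evenly spaced probe treatments."""
    worst = 0.0
    for i in range(probes):
        t = low + (high - low) * i / (probes - 1)
        for up in (False, True):
            snapped = snap_to_grid(t, low, high, delta, up)
            worst = max(worst, abs(mu(x, snapped) - mu(x, t)))
    return worst


def mu(x, t):
    """Toy response; |d mu / dt| <= 3, so it is 3-Lipschitz in t."""
    return x + math.sin(3.0 * t)


lipschitz, delta = 3.0, 5
for high in (1.0, 2.0, 5.0):
    bound = lipschitz * (high - 0.0) / delta
    gap = worst_case_gap(mu, 0.5, 0.0, high, delta)
    print(f"h={high}: worst gap {gap:.3f}, bound {bound:.3f}")
```

With the number of branches held fixed, the bound grows linearly in the interval width, which is the dependence on $[l, h]$ that the proposition highlights.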

E ADDITIONAL EXPERIMENTAL SETUPS

All the assets we use (i.e., datasets and the code for baselines) are released under the MIT license, which requires that the copyright notice and the permission notice be included in all copies or substantial portions of the software. We conduct all the experiments on a machine with an i7-8700K CPU, 32G RAM, and four Nvidia GeForce RTX2080Ti (10GB) GPU cards.

E.1 DETAIL EVALUATION METRICS.

$$\mathrm{AMSE}_T = \frac{1}{N} \sum_{i=1}^{N} \int_T \left[\hat f(x_i, t) - f(x_i, t)\right]^2 \pi(t)\, dt \tag{12}$$

$$\mathrm{UPEHE@}K = \frac{1}{N} \sum_{i=1}^{N} \left[\frac{1}{C_K^2} \sum_{t, t'} \left(\hat f(x_i, t, t') - f(x_i, t, t')\right)^2\right]$$

$$\mathrm{WPEHE@}K = \frac{1}{N} \sum_{i=1}^{N} \left[\frac{1}{C_K^2} \sum_{t, t'} \left(\hat f(x_i, t, t') - f(x_i, t, t')\right)^2 p(t \mid x)\, p(t' \mid x)\right]$$

$$\mathrm{AMSE}_D = \frac{1}{N T} \sum_{i=1}^{N} \sum_{t=1}^{T} \int_D \left[\hat f(x_i, t, s) - f(x_i, t, s)\right]^2 \pi(s)\, ds$$

E.2 NETWORK STRUCTURE AND PARAMETER SETTING

Module | Covariates | Treatments
Output Size | Bsz × p × #Emb | Bsz × 1 × #Emb
Self-Attention | [Multi-head Att, BatchNorm, Linear, BatchNorm] × #Layers | [Multi-head Att, BatchNorm, Linear, BatchNorm] × #Layers
Output Size | Bsz × p × #Emb | Bsz × 1 × #Emb
Cross-Attention | [Multi-head Att, BatchNorm, Linear, BatchNorm] × #Layers (shared)
Output Size | Bsz × 1 × #Emb
Projection Layer | [Linear]
Output Size | Bsz × 1

E.3 SIMULATION DETAILS.

Synthetic Dataset (Nie et al., 2021). The synthetic dataset contains 500 training points and 200 testing points. Data is generated as follows: $x_j \sim \mathrm{Unif}[0, 1]$, where $x_j$ is the $j$-th dimension of $x$, and

$$\tilde t \mid x = \frac{10 \sin(\max(x_1, x_2, x_3)) + \max(x_3, x_4, x_5)^3}{1 + (x_1 + x_5)^2} + \sin(0.5 x_3)\left(1 + \exp(x_4 - 0.5 x_3)\right) + x_3^2 + 2 \sin(x_4) + 2 x_5 - 6.5 + \mathcal{N}(0, 0.25)$$

$$y \mid x, t = \cos(2\pi(t - 0.5)) \left(t^2 + \frac{4 \max(x_1, x_6)^3}{1 + 2 x_3^2}\right) + \mathcal{N}(0, 0.25)$$

where $t = (1 + \exp(-\tilde t))^{-1}$. For treatments in $[0, h]$, we revise it to $t = (1 + \exp(-\tilde t))^{-1} \cdot h$.

IHDP (Hill, 2011) is a semi-synthetic dataset containing 25 covariates, 747 observations and binary treatments.
For treatments in [0, 1], we follow VCNet (Nie et al., 2021) and generate treatments and responses by:

$$\tilde t \mid x = \frac{2 x_1}{1 + x_2} + \frac{2 \max(x_3, x_5, x_6)}{0.2 + \min(x_3, x_5, x_6)} + 2 \tanh\left(5\, \frac{\sum_{i \in S_{dis,2}} (x_i - c_2)}{|S_{dis,2}|}\right) - 4 + \mathcal{N}(0, 0.25)$$

$$y \mid x, t = \frac{\sin(3\pi t)}{1.2 - t} \left(\tanh\left(5\, \frac{\sum_{i \in S_{dis,1}} (x_i - c_1)}{|S_{dis,1}|}\right) + \frac{\exp(0.2(x_1 - x_6))}{0.5 + 5 \min(x_2, x_3, x_5)}\right) + \mathcal{N}(0, 0.25),$$

where $t = (1 + \exp(-\tilde t))^{-1}$, $S_{con} = \{1, 2, 3, 5, 6\}$ is the index set of continuous features, $S_{dis,1} = \{4, 7, 8, 9, 10, 11, 12, 13, 14, 15\}$, $S_{dis,2} = \{16, 17, 18, 19, 20, 21, 22, 23, 24, 25\}$, and $S_{dis,1} \cup S_{dis,2} = [25] - S_{con}$. Here $c_1 = \mathbb{E}\left[\frac{\sum_{i \in S_{dis,1}} x_i}{|S_{dis,1}|}\right]$ and $c_2 = \mathbb{E}\left[\frac{\sum_{i \in S_{dis,2}} x_i}{|S_{dis,2}|}\right]$. To allow comparison on various treatment intervals $t \in [0, h]$, treatments and responses are generated by:

$$t = (1 + \exp(-\tilde t))^{-1} \cdot h$$

$$y \mid x, t = \frac{\sin(3\pi t / h)}{1.2 - t/h} \left(\tanh\left(5\, \frac{\sum_{i \in S_{dis,1}} (x_i - c_1)}{|S_{dis,1}|}\right) + \frac{\exp(0.2(x_1 - x_6))}{0.5 + 5 \min(x_2, x_3, x_5)}\right) + \mathcal{N}(0, 0.25),$$

where the rescaling by $h$ is the only difference from the generation of the vanilla IHDP dataset ($h = 1$). Note that $S_{dis,1}$ only impacts the outcome and thus serves as noisy covariates; $S_{dis,2}$ contains pre-treatment covariates that only impact treatments and thus serve as instrumental variables. This allows us to observe the improvement from using TransTEE when noisy covariates exist. Following Hill (2011), covariates are standardized to mean 0 and standard deviation 1.

News. The News dataset consists of 3000 randomly sampled news items from the NY Times corpus (Newman, 2008) and was originally introduced as a benchmark in the binary treatment setting. We generate the treatment and outcome in a similar way to Nie et al. (2021), but with a variable treatment interval $[0, h]$. We first generate $v'_1, v'_2, v'_3 \sim \mathcal{N}(0, 1)$ and then set $v_i = v'_i / \|v'_i\|_2$, $i \in \{1, 2, 3\}$. Given $x$, we generate $t$ from $\mathrm{Beta}\left(2, \left|\frac{v_3^\top x}{2 v_2^\top x}\right|\right) \cdot h$.
We then generate the outcome by

$$y' \mid x, t = \exp\left(\frac{v_2^\top x}{v_3^\top x} - 0.3\right)$$

$$y \mid x, t = 2\left(\max(-2, \min(2, y')) + 20 v_1^\top x\right) \cdot \left(4 (t - 0.5)^2 + \sin\left(\frac{\pi}{2} t\right)\right) + \mathcal{N}(0, 0.5)$$

[Figure: AMSE versus dosage selection bias for SCIGAN, TARNet(D), DRNet(D), VCNet(D), and TransTEE.]

TCGA (D) (Bica et al., 2020b). We obtain covariates $x$ from a real dataset, The Cancer Genome Atlas (TCGA), and consider 3 treatments, where each treatment is accompanied by one dosage and a set of parameters $v_1^t, v_2^t, v_3^t$. For each run, we randomly sample a vector $u_i^t \sim \mathcal{N}(0, 1)$ and then set $v_i^t = u_i^t / \|u_i^t\|$, where $\|\cdot\|$ is the Euclidean norm. The shape of the response curve for each treatment, $f_t(x, s)$, is given in Table 9. We add $\epsilon \sim \mathcal{N}(0, 0.2)$ noise to the outcomes. Interventions are assigned by sampling a dosage $s_t$ for each treatment from a beta distribution, $s_t \mid x \sim \mathrm{Beta}(\alpha, \beta_t)$. Here $\alpha \ge 1$ controls the dosage selection bias ($\alpha = 1$ gives the uniform distribution), and $\beta_t = \frac{\alpha - 1}{s_t^*} + 2 - \alpha$, where $s_t^*$ is the optimal dosage (footnote 1) for treatment $t$. We then assign a treatment according to $t_f \mid x \sim \mathrm{Categorical}(\mathrm{Softmax}(\kappa f(x, s_t)))$, where increasing $\kappa$ increases selection bias and $\kappa = 0$ leads to random assignments. The factual intervention is given by $(t_f, s_{t_f})$. Unless otherwise specified, we set $\kappa = 2$ and $\alpha = 2$.

For structural treatments, we first define the baseline effect (Bica et al., 2020b). For each run of the experiment, we randomly sample a vector $u_0 \sim \mathrm{Unif}[0, 1]$ and set $v_0 = u_0 / \|u_0\|$, where $\|\cdot\|$ is the Euclidean norm. The baseline effect is defined as

$$\mu_0(x) = v_0^\top x.$$

[Figure: Estimated ADRF (response versus treatment) for (a) t1 and (b) t2, comparing the ground truth with DRNet(D), VCNet(D), SCIGAN, and TransTEE.]

Table 9: Dose-response curves used to generate semi-synthetic outcomes for patient features $x$. In the experiments, we set $C = 10$. $v_1^t, v_2^t, v_3^t$ are the parameters associated with each treatment $t$.

Treatment | Dose-Response | Optimal dosage
1 | $f_1(x, s) = C\left((v_1^1)^\top x + 12 (v_2^1)^\top x\, s - 12 (v_3^1)^\top x\, s^2\right)$ | $s_1^* = \frac{(v_2^1)^\top x}{2 (v_3^1)^\top x}$
2 | $f_2(x, s) = C\left((v_1^2)^\top x + \sin\left(\pi \frac{(v_2^2)^\top x}{(v_3^2)^\top x}\, s\right)\right)$ | $s_2^* = \frac{(v_3^2)^\top x}{2 (v_2^2)^\top x}$
3 | $f_3(x, s) = C\left((v_1^3)^\top x + 12 s (s - b)^2\right)$, where $b = 0.75\, \frac{(v_2^3)^\top x}{(v_3^3)^\top x}$ | $s_3^* = \frac{b}{3}$ if $b \ge 0.75$, else $1$

Small-World (Kaddour et al., 2021). 20-dimensional multivariate covariates are uniformly sampled according to $x_i \sim \mathrm{Unif}[-1, 1]$. There are 1,000 units in the in-sample dataset and 500 in the out-sample one. Graph interventions. For each graph intervention, the number of nodes is uniformly sampled between 10 and 120, the number of neighbors for each node is between 3 and 8, and the probability of rewiring each edge is between 0.1 and 1. Watts-Strogatz small-world graphs are repeatedly generated until a connected one is obtained. Each vertex has one feature, i.e., its degree centrality. A graph's node connectivity is denoted by $\nu(G)$ and its average shortest path length by $\ell(G)$. Similar to the baseline effect, two randomly sampled vectors $v_\nu, v_\ell$ are generated. Then, given an assigned graph treatment $G$ and a covariate vector $x$, the outcome is generated by

$$y = 100 \mu_0(x) + 0.2\, \nu(G)^2 \cdot v_\nu^\top x + \ell(G) \cdot v_\ell^\top x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1).$$

TCGA (S) (Kaddour et al., 2021). We use 9,659 gene expression measurements of cancer patients as covariates. The in-sample dataset consists of 5,000 units and the out-sample one of 4,659 units. Each unit is a covariate vector $x \in \mathbb{R}^{4000}$, and the units are split randomly into in- and out-sample datasets in each run. For each covariate vector $x$, its 8-dimensional PCA components $x^{PCA} \in \mathbb{R}^8$ are computed. Graph interventions. We randomly sample 10,000 molecules from the Quantum Machine 9 (QM9) dataset (Ramakrishnan et al., 2014) (with 133k molecules in total) in each run. We create a relational graph, where each node corresponds to an atom and consists of 78 atom features.
We label each edge according to its chemical bond type, e.g., single, double, triple, or aromatic. We collect 8 molecule properties (mu, alpha, homo, lumo, gap, r2, zpve, u0) in a vector $z \in \mathbb{R}^8$, which denotes the assigned molecule treatment. Finally, we generate outcomes by

$$y = 10 \mu_0(x) + 0.01\, z^\top x^{PCA} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1).$$

Enriched Equity Evaluation Corpus (EEEC) (Feder et al., 2021) consists of 33,738 English sentences, and the label of each sentence is the mood state it conveys. The task is also known as Profile of Mood States (POMS). Each sentence in the dataset is created using one of 42 templates, with placeholders for a person's name and the emotion, e.g., "<Person> made me feel <emotional state word>.". A list of common names that are tagged as male or female, and as African-American or European, is used to fill the placeholder (<Person>). One of four possible mood states (Anger, Sadness, Fear, and Joy) is used to fill the emotion placeholder. Hence, EEEC has two kinds of counterfactual examples, Gender and Race. For the Gender concept, the counterfactual changes the name and the gender pronouns in the example and switches them, such that for the original example "It was totally unexpected, but Roger made me feel pessimistic." the counterfactual example is "It was totally unexpected, but Amanda made me feel pessimistic." For the Race concept, counterfactuals are created such that for the original example "Josh made me feel uneasiness for the first time ever in my life." the counterfactual example is "Darnell made me feel uneasiness for the first time ever in my life." For each counterfactual example, the person's name is taken at random from the pre-existing list corresponding to its type.
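To make the simulation details above concrete, the sketch below generates the synthetic dataset of Nie et al. (2021) as described in E.3, including the rescaling of treatments to an interval of width h. It is a hedged reading of the formulas rather than the authors' generator: the function name `sample_synthetic` is hypothetical, x is assumed to have six dimensions (the largest index used), N(0, 0.25) is interpreted as a variance of 0.25 (standard deviation 0.5), and the rescaled treatment is fed into the outcome formula unchanged.

```python
import math
import random


def sample_synthetic(rng, h=1.0, noise_std=0.5):
    """Draw one (x, t, y) triple; noise_std=0.5 reads N(0, 0.25) as variance 0.25."""
    x = [rng.random() for _ in range(6)]
    x1, x2, x3, x4, x5, x6 = x
    # Pre-activation treatment, before squashing.
    t_tilde = (
        (10.0 * math.sin(max(x1, x2, x3)) + max(x3, x4, x5) ** 3) / (1.0 + (x1 + x5) ** 2)
        + math.sin(0.5 * x3) * (1.0 + math.exp(x4 - 0.5 * x3))
        + x3 ** 2 + 2.0 * math.sin(x4) + 2.0 * x5 - 6.5
        + rng.gauss(0.0, noise_std)
    )
    # Squash to (0, 1), then rescale to (0, h).
    t = h / (1.0 + math.exp(-t_tilde))
    y = (
        math.cos(2.0 * math.pi * (t - 0.5))
        * (t ** 2 + 4.0 * max(x1, x6) ** 3 / (1.0 + 2.0 * x3 ** 2))
        + rng.gauss(0.0, noise_std)
    )
    return x, t, y


rng = random.Random(0)
train = [sample_synthetic(rng) for _ in range(500)]
test = [sample_synthetic(rng) for _ in range(200)]
print(len(train), len(test))
```

Passing a larger `h` reproduces the shifted treatment intervals used in the extrapolation experiments, with all other parts of the generator left unchanged.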

F ADDITIONAL EXPERIMENTAL RESULTS

F.1 COMPARISON BETWEEN TRANSTEE AND ANU (XU ET AL., 2022)

We implement ANU, evaluate it in the same settings, and show that it is inferior to the proposed TransTEE, as follows. Specifically, we compare the attentive neural uplift model (ANU) (Xu et al., 2022) [...] (2) We further evaluate the real-world utility of ANU (Xu et al., 2022); the experimental setting is detailed in Section 5.4 of the main paper. Covariates here are long sentences. Thanks to the use of self-attention modules, TransTEE achieves better estimation results than the baselines (Table 11). For ANU, no self-attention layer is applied and the final estimation is inaccurate, which verifies the superiority of the proposed framework.

F.2 ADDITIONAL NUMERICAL RESULTS AND ABLATION STUDIES

Choice of the balancing weight for treatment regularization. To understand the effect of propensity score modeling, we conduct an ablation study on the balancing weights of both TR and PTR. Figure 9 presents the results of the experiments on the IHDP dataset. The main observation is that both TR and PTR with a proper regularization strength consistently improve estimation compared to TransTEE without regularization. The best performance is achieved at λ = 0.5 for both methods, which indicates that the balancing parameter for these two regularization terms should be searched carefully. Besides, training both the treatment predictor and the feature encoder simultaneously in a zero-sum game is difficult and sometimes unstable (shown in Figure 9, right).

Robustness to noisy covariates. We manipulate S_dis,1 and S_dis,2 to generate datasets with different noisy covariates, e.g., when the number of covariates that only influence the outcome is 6, [...], 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25}; when the number of covariates that influence the outcome is 24, S_dis,1 = {4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24} and S_dis,2 = {25}. Figure 4(b) shows that, as the number of covariates that only influence the outcome increases, both TARNet and DRNet become better estimators, whereas VCNet performs worse and is even inferior to TARNet and DRNet when the number is larger than 16. In contrast, the estimation error incurred by the proposed TransTEE is always low and superior to the baselines by a large margin.

Comparison of MLE and adversarial training for propensity score modeling. As the results in Table 2 show, additionally combining TransTEE with maximum likelihood training of π(t|x) does provide some performance gains. However, an adversarially trained π-model can be significantly better, especially in extrapolation settings.
The results demonstrate the effectiveness of TR and PTR in reducing selection bias and improving estimation performance. In fact, approaches like TMLE are not robust if the initial estimator is poor (Shi et al., 2019).

Figure 13: Estimated ADRF on the test set from a typical run of TarNet (Shalit et al., 2017), DRNet (Schwab et al., 2020), VCNet (Nie et al., 2021)

G REMARKS ON INTERPRETABILITY

It is fundamentally hard to evaluate interpretability even for supervised learners, as the evaluation crucially depends on specific models, tasks, and input spaces (Jacovi & Goldberg, 2020). TransTEE provides an initial step toward promoting the interpretability of causal inference models. We can see from the experimental results in Figures 4(a), 4(b), and 10 that TransTEE assigns more weight to confounders than to other covariates, which is a new observation that is hard to achieve with previous backbones. Explaining causal inference models in this way, using the feature importance scores of each covariate, can be used for benchmarking treatment effect estimators (Crabbé et al., 2022).

62.52 ± 0.00 | 64.61 ± 0.00 | 14.66 ± 0.00 | 14.20 ± 0.00 | 14.61 ± 0.00 | 14.14 ± 0.00 | 14.66 ± 0.00 | 14.20 ± 0.00
GNN | 34.13 ± 0.04 | 36.48 ± 0.04 | 11.26 ± 7.96 | 11.13 ± 7.87 | 14.21 ± 0.24 | 14.22 ± 0.17 | 13.63 ± 0.38 | 15.92 ± 1.01
GraphITE | 34.17 ± 0.02 | 36.49 ± 0.01 | 15.60 ± 0.19 | 15.53 ± 0.28 | 14.35 ± 0.04 | 14.90 ± 0.43 | 13.28 ± 0.04 | 15.83 ± 0.28
SIN | 36.79 ± 3.35 | 40.99 ± 5.14 | 44.47 ± 2.39 | 52.31 ± 7.97 | 8.36 ± 0.74 | 11.90 ± 1.57 | 12.40 ± 1.23 | 15.08 ± 1.80
TransTEE | 28.84 ± 0.23 | 31.40 ± 0.71 | 9.34 ± 1.94 | 9.88 ± 2.00 | 7.90 ± 3.85 | 8.94 ± 3.91 | 10.14 ± 3.73 | 11.08 ± 3.97

WPEHE@10
Zero | 62.65 ± 0.00 | 65.59 ± 0.00 | 14.69 ± 0.00 | 14.23 ± 0.00 | 14.69 ± 0.00 | 14.23 ± 0.00 | 14.69 ± 0.00 | 14.23 ± 0.00
GNN | 34.26 ± 0.04 | 37.65 ± 0.04 | 11.28 ± 7.98 | 11.16 ± 7.89 | 14.29 ± 0.22 | 14.32 ± 0.18 | 13.66 ± 0.38 | 15.96 ± 1.01
GraphITE | 34.30 ± 0.02 | 37.66 ± 0.00 | 15.64 ± 0.19 | 15.56 ± 0.28 | 14.38 ± 0.04 | 14.93 ± 0.43 | 13.31 ± 0.04 | 15.87 ± 0.27
SIN | 37.08 ± 3.35 | 41.79 ± 5.21 | 44.49 ± 2.40 | 52.28 ± 7.96 | 8.39 ± 0.74 | 11.92 ± 1.58 | 12.49 ± 1.22 | 15.13 ± 1.81
TransTEE | 28.89 ± 0.19 | 32.25 ± 0.69 | 9.36 ± 1.93 | 9.90 ± 2.00 | 7.94 ± 3.87 | 8.95 ± 3.92 | 10.16 ± 3.74 | 11.10 ± 3.98

We went to the university, and Amanda made me feel uneasiness. | 0.3752

Sentences with the Minimal ATEs

Index | Sentence | ATE

Factual (Index 10)
To our amazement, the conversation with Jack was irritating, no added information is given in this part. | 0
To our surprise, my husband found himself in a vexing situation, this is only here to confuse the classifier. | 0
The conversation with Amanda was irritating, we could from simply looking, this is only here to confuse the classifier. | 0
this is only here to confuse the classifier, The situation makes Torrance feel irate, but it does not matter now. | 0
this is random noise, I made Alphonse feel irate, time and time again. | 0
We were told that Roger found himself in a irritating situation, no added information is given in this part. | 0
Amanda made me feel irate whenever I came near, no added information is given in this part. | 0
While unsurprising, the conversation with my uncle was outrageous, this is only here to confuse the classifier. | 0
It is a mystery to me, but it seems i made Darnell feel irate. | 0
The conversation with Melanie was irritating, you could feel it in the air, no added information is given in this part. | 0

Counterfactual (Index 10)
To our amazement, the conversation with Kristin was irritating, no added information is given in this part. | 0
To our surprise, this girl found herself in a vexing situation, this is only here to confuse the classifier. | 0
The conversation with Frank was irritating, we could from simply looking, this is only here to confuse the classifier. | 0
this is only here to confuse the classifier, The situation makes Shaniqua feel irate, but it does not matter now. | 0
this is random noise, I made Nichelle feel irate, time and time again. | 0
We were told that Melanie found herself in a irritating situation, no added information is given in this part. | 0
Justin made me feel irate whenever I came near, no added information is given in this part. | 0
While unsurprising, the conversation with my mother was outrageous, this is only here to confuse the classifier. | 0
It is a mystery to me, but it seems i made Lakisha feel irate. | 0
The conversation with Ryan was irritating, you could feel it in the air, no added information is given in this part. | 0



For example, E[Y(1) − Y(0) | X] is often of a much simpler form to estimate than either E[Y(1) | X] or E[Y(0) | X], due to inherent similarities between Y(1) and Y(0).

For symmetry, if s*_t = 0, we sample s*_t from 1 − Beta(α, β_t), where β_t is set as though s*_t = 1.
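The symmetric sampling rule in the note above can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the relation β_t = (α − 1)/s*_t + 2 − α, which places the mode of Beta(α, β_t) at s*_t, is the convention of earlier dose-response benchmarks and is not stated in the text here, and the names `beta_param`, `sample_dosage`, and `beta_mode` are hypothetical.

```python
import random


def beta_param(alpha, s_star):
    """Assumed rule: pick beta so that Beta(alpha, beta) has its mode at s_star."""
    return (alpha - 1.0) / s_star + 2.0 - alpha


def sample_dosage(s_star, alpha=2.0, rng=random):
    """Draw a dosage in [0, 1] that concentrates around the optimal dosage s_star."""
    if s_star == 0.0:
        # Symmetric case from the text: beta is set as though s_star = 1,
        # and the draw is reflected so that it concentrates near 0.
        return 1.0 - rng.betavariate(alpha, beta_param(alpha, 1.0))
    return rng.betavariate(alpha, beta_param(alpha, s_star))


def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta) for alpha > 1 and beta >= 1."""
    return (alpha - 1.0) / (alpha + beta - 2.0)
```

In this sketch a larger α concentrates the draws more tightly around s*_t, while α = 1 gives β_t = 1 and hence uniform dosages.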



Figure 1: A motivating example with a corresponding causal graph. Prev denotes previous infection condition and BP denotes blood pressure. TransTEE adjusts for an appropriate covariate set {Prev, BP} using attention, which is visualized as a heatmap.

Assumption 3.1 (Ignorability/Unconfoundedness) implies no hidden confounders such that Y(T = t) ⊥ T | X. In the binary treatment case, Y(0), Y(1) ⊥ T | X.

h = 5 in training and testing.

Figure 3: Estimated ADRF on the synthetic dataset, where treatments are sampled from an interval [l, h], where l = 0.

Figure 4: (a) The learned weights of the cross-attention module on the IHDP dataset. TransTEE adjusts for the confounders S_con = {1, 2, 3, 5, 6} properly, assigning them higher weights during the cross-attention process. (b) AMSE attained by models on IHDP with different numbers of noisy covariates. (c) Number of parameters for different models on four different datasets, where the log on the y-axis is base 2.

Figure 5: The causal graph of IHDP dataset.

4 Transformers as Treatment Effect Estimators
4.1 Covariate and Treatment Embedding Layers
4.2 Covariate and Treatment Self-Attention
4.3 Treatment-Covariate Cross-Attention
5 Experimental Results
5.1 Experimental Settings
5.2 Case Study and Numerical Results
5.3 Analysis
5.4 Empirical Study on Pre-trained Language Models
6 Concluding Remarks
Appendix
A Extended Related Work
B An Illustrative Example
C Details and Discussions about Propensity Score Modelling
D Analysis of the Failure Cases over Treatment Distribution Shifts
E Additional Experimental Setups
F Additional Experimental Results

Figure 6: An Illustrative Example about the workflow of TransTEE.

and use |T| branches to approximate the entire continuum [l, h]. The approximation error can be bounded by

Performance with different dosage selection bias. Performance with different treatment selection bias.

Figure 7: Performance of five methods on TCGA (D) dataset with varying bias levels.

Estimated ADRF for t3.

Figure 8: Estimated ADRF on the test set from a typical run of DRNet (D), TARNet (D), VCNet (D), and SCIGAN. All of these methods are well optimized. TransTEE estimates the dosage-response curve well for all treatments.

Figure 9: Ablation study of the balanced weight for treatment regularization on the IHDP dataset.

Figure 10: The distribution of learned weights for the cross-attention module on the IHDP dataset of different models.

Figure 11: Estimated ADRF on the test set from a typical run of TARNet (Shalit et al., 2017), DRNet (Schwab et al., 2020), VCNet (Nie et al., 2021), and ours on the IHDP dataset. All of these methods are well optimized. (a) TARNet and DRNet do not take the continuity of the ADRF into account and produce discontinuous ADRF estimators. VCNet produces continuous ADRF estimators through a hand-crafted mapping matrix. The proposed TransTEE embeds treatments into continuous embeddings with a neural network and attains superior results. (b, d) When trained with 0.1 ≤ t ≤ 2.0 and 0.25 ≤ t ≤ 5.0, TARNet and DRNet cannot extrapolate to distributions with 0 < t ≤ 2.0 and 0 ≤ t ≤ 5.0. (c) The hand-crafted mapping matrix of VCNet can only be used in the scenario where t < 2; otherwise, VCNet cannot converge and incurs an infinite loss. At the same time, as h increases, TARNet and DRNet with the same number of branches perform worse. TransTEE does not need to know h in advance and extrapolates well.
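One way to see why branch-based estimators behave as the caption describes is the equal-width binning sketch below. It is an illustration under assumptions only: the function name `branch_index` and the equal-width scheme are hypothetical stand-ins for multi-head designs, not the actual code of TARNet or DRNet.

```python
def branch_index(t, low, high, num_branches):
    """Map a continuous treatment t in [low, high] to one of num_branches heads.

    Each head would own one equal-width sub-interval, so the prediction can
    jump when t crosses a bin edge, and a t outside [low, high] has no head.
    """
    if not low <= t <= high:
        raise ValueError("treatment value lies outside the range fixed at design time")
    width = (high - low) / num_branches
    # The right endpoint t == high is assigned to the last branch.
    return min(int((t - low) / width), num_branches - 1)
```

With a fixed number of branches, each bin also gets wider as h grows, which is consistent with the caption's remark that branch-based models degrade as h increases.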

Figure 12: Training dynamics of TransTEE on the IHDP dataset with various regularization terms, where the total number of training iterations is 1,500 and (c) is evaluated on the test set every 50 training iterations.

Figure 14: WPEHE@K over increasing bias strength κ and varying K ∈ {2, ..., 10} on the SW and the TCGA dataset.

Comparison of existing works and TransTEE in terms of parameter complexity. n is the number of treatments. B_T and B_D are the numbers of branches for approximating continuous treatment and dosage, respectively. Treatment interaction means explicitly modeling collective effects of multiple treatments. TransTEE is general for all the factors.

Experimental results comparing NN-based methods on the IHDP datasets, where "-" means the model is not suitable for continuous treatments. We report the results based on 100 repeats, and numbers after ± are the estimated standard deviation of the average value. For the vanilla setting with binary treatment, we report the mean absolute difference between the estimated and true ATE.

Effect of Gender (top) and Race (bottom) on POMS classification with the EEEC dataset, where ATE_GT is the ground truth ATE based on 3 repeats with confidence intervals [CI] constructed using standard deviations.

Attention weights for S_con, S_dis,1, and S_dis,2, respectively.

TransTEE can be directly combined with traditional methods that estimate propensity scores by including hand-crafted features of covariates (Imbens & Rubin, 2015) to reduce biases through covariate adjustment (Austin, 2011), matching (Rubin & Thomas, 1996; Abadie & Imbens, 2016), stratification (Frangakis & Rubin, 2002), reweighting (Hirano et al., 2003), g-computation (Imbens & Rubin, 2015), sub-classification (Rosenbaum & Rubin, 1984), targeted regularization (Van Der Laan & Rubin, 2006), or conditional density estimation (Nie et al., 2021), all of which create quasi-randomized experiments (D'Agostino, 1998). This is because the general framework makes it straightforward to use an off-the-shelf propensity score regularizer for balancing covariate representations. The goal is similar to that of traditional methods such as inverse probability weighting and propensity score matching (Austin, 2011), which seek to weight each observation so as to mimic the effects of randomization with respect to the covariates across the treatment groups of interest.
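To make the reweighting idea concrete, the following sketch compares a naive difference in means with a normalized inverse-probability-weighted (IPW) estimate on simulated data. Everything here is an assumption for illustration: the data-generating process, the function names, and the use of the true propensity score, which is known only because the data are simulated. It is not the propensity score regularizer used by TransTEE.

```python
import math
import random


def simulate(n, true_effect=2.0, seed=0):
    """Simulate (x, t, y, e) rows with one confounder x and a known propensity e."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                  # confounder
        e = 1.0 / (1.0 + math.exp(-x))           # true propensity P(T = 1 | x)
        t = 1 if rng.random() < e else 0         # treatment depends on x
        y = true_effect * t + 3.0 * x + rng.gauss(0.0, 1.0)  # outcome depends on x too
        rows.append((x, t, y, e))
    return rows


def naive_ate(rows):
    """Difference in mean outcomes between groups, which ignores confounding."""
    treated = [y for _, t, y, _ in rows if t == 1]
    control = [y for _, t, y, _ in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_ate(rows):
    """Normalized IPW: weight each unit by the inverse probability of its own arm."""
    w1 = sum(t / e for _, t, _, e in rows)
    w0 = sum((1 - t) / (1.0 - e) for _, t, _, e in rows)
    mu1 = sum(t * y / e for _, t, y, e in rows) / w1
    mu0 = sum((1 - t) * y / (1.0 - e) for _, t, y, e in rows) / w0
    return mu1 - mu0
```

On a large simulated sample, `naive_ate` overstates the effect because treated units tend to have larger x, whereas `ipw_ate` stays close to the simulated effect. In practice the propensity score would have to be estimated from data.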

Performance of individualized treatment-dose response estimation on the TCGA (D) dataset with different numbers of treatments. We report AMSE and standard deviation over 30 repeats. The selection bias on treatment and dosage are both set to be 2.0.

Experimental results comparing NN-based methods on simulated datasets. Numbers reported are AMSE of test data based on 100 repeats, and numbers after ± are the estimated standard deviation of the average value.

and Table 8 show the details of the TransTEE architecture and hyper-parameters. For all the synthetic and semi-synthetic datasets, we tune parameters based on 20 additional runs. In each run, we simulate data, randomly split it into training and testing sets, and use AMSE on the testing data for evaluation. For fair comparison, in all experiments the model size of TransTEE is smaller than or similar to that of the baselines.

Architecture details of TransTEE, where p is the number of covariates.

Hyper-parameters on different datasets. Bsz indicates the batch size, # Emb indicates the embedding dimension, and Lr. S indicates the learning rate scheduler (Cos is cosine annealing).

Comparison between TransTEE and ANU (Xu et al., 2022) on the IHDP dataset.

with ours in the following two settings. (1) IHDP dataset in Table 10 in the main manuscript. We adjust the layers of ANU such that the total numbers of parameters of ANU and TransTEE are similar. The result is shown in the following table. With the use of treatment embeddings, ANU is shown to be more robust than VCNet and DRNet when a treatment shift occurs. However, in both the binary treatment setting and the continuous treatment settings, TransTEE performs better than ANU.


Training dynamics comparison of different regularization terms. Here we compare four settings: TransTEE with no regularization, TransTEE+TR, TransTEE+PTR, and TransTEE+MTL. TransTEE+MTL is a simple multi-task learning strategy that uses L_θ(x, y, t) + L^TR_ϕ(x, t) during training without an adversarial game. As shown in Figure 12, without adversarial training, TransTEE+MTL quickly attains a low treatment estimation error but then oscillates and converges to a high error, and both the outcome regression error and the MSE on the test set remain high. In contrast, TR and PTR make TransTEE converge faster and attain lower test MSE. Overall, PTR consistently works best, and its low treatment regression error shows that π_ϕ(t|x) estimates an accurate propensity score.

and ours on the News dataset. All of these methods are well optimized. Suppose t ∈ [l, h]. (a) TARNet and DRNet do not take the continuity of the ADRF into account and produce discontinuous ADRF estimators. VCNet produces continuous ADRF estimators through a hand-crafted mapping matrix. The proposed TransTEE embeds treatments into continuous embeddings with a neural network and attains superior results. (b, d) When trained with 0 ≤ t ≤ 1.9 and 0 ≤ t ≤ 4.0, TARNet and DRNet cannot extrapolate to distributions with 0 < t ≤ 2.0 and 0 ≤ t ≤ 5.0. (c) The hand-crafted mapping matrix of VCNet can only be used in the scenario where t < 2; otherwise, VCNet cannot converge and incurs an infinite loss. At the same time, as h increases, TARNet and DRNet with the same number of branches perform worse. TransTEE does not need to know h in advance and extrapolates well.

Experimental results comparing neural-network-based methods on the News datasets. Numbers reported are based on 20 repeats, and numbers after ± are the estimated standard deviation of the average value. For Extrapolation (h = 2), models are trained with t ∈ [0, 1.9] and tested on t ∈ [0, 2]. For Extrapolation (h = 5), models are trained with t ∈ [0, 4.5] and tested on t ∈ [0, 5].

(the name from a specific race) for sentences with the maximal/minimal ATEs is totally different: it is at the beginning for the former and in the middle for the latter. Namely, TransTEE helps us mitigate spurious correlations that exist in model predictions (e.g., the length of sentences, the position of perturbation words, and certain sentence patterns) and is useful in mitigating undesirable bias ingrained in the data. Besides, a well-optimized TransTEE is able to estimate the effect of every sentence, which is of great benefit for model interpretation and analysis, especially under high inference latency.

Error of CATE estimation for all methods, measured by WPEHE@2-10. Results are averaged over 5 trials, and ± denotes the standard error. In-Sample means results on the training set and Out-Sample means results on the test set. (The baseline results are reproduced using the official code of Kaddour et al. (2021) in a consistent experimental environment and can differ slightly from the results reported in Kaddour et al. (2021).)

Zero | 57.99 ± 0.00 | 66.78 ± 0.00 | 14.61 ± 0.00 | 14.14 ± 0.00 | 14.60 ± 0.00 | 14.12 ± 0.00 | 14.61 ± 0.00 | 14.14 ± 0.00
GNN | 31.41 ± 0.03 | 37.57 ± 0.05 | 11.22 ± 7.93 | 11.09 ± 7.85 | 14.19 ± 0.25 | 14.20 ± 0.18 | 13.58 ± 0.38 | 15.87 ± 1.02
GraphITE | 31.45 ± 0.01 | 37.58 ± 0.00 | 15.55 ± 0.19 | 15.47 ± 0.28 | 14.30 ± 0.04 | 14.85 ± 0.43 | 13.23 ± 0.04 | 15.78 ± 0.28
SIN | 33.58 ± 3.37 | 40.83 ± 3.64 | 44.48 ± 2.38 | 52.34 ± 7.97 | 8.33 ± 0.74 | 11.87 ± 1.57 | 12.22 ± 1.17 | 14.91 ± 1.89
TransTEE | 26.48 ± 0.27 | 32.40 ± 0.85 | 9.31 ± 1.94

Top-10 samples with the maximal and minimal ATE for the effect of Gender. Perturbation words in factual sentences and counterfactual sentences are colored in orange and magenta, respectively.

Index | Sentence | ATE
Factual 1 | It was totally unexpected, but Roger made me feel pessimistic. | 0.6393
Factual 2 | We went to the restaurant, and Alphonse made me feel frustration. | 0.578
Factual 3 | It was totally unexpected, but Amanda made me feel pessimistic. | 0.5109
Factual 4 | We went to the university, and my husband made me feel angst. | 0.4538
Factual 5 | It is far from over, but so far i made Jasmine feel frustration. | 0.4366
Factual 6 | We were told that Torrance found himself in a consternation situation. | 0.4203
Factual 7 | We went to the university, and my son made me feel revulsion. | 0.399
Factual 8 | To our amazement, the conversation with my aunt was dejected. | 0.3952
Factual 9 | To our amazement, the conversation with my aunt was dejected. | 0.3952
Factual 10 | We went to the supermarket, and Roger made me feel uneasiness. | 0.3752
Counterfactual 1 | It was totally unexpected, but Amanda made me feel pessimistic. | 0.6393
Counterfactual 2 | We went to the school, and Latisha made me feel frustration. | 0.578
Counterfactual 3 | It was totally unexpected, but Roger made me feel pessimistic. | 0.5109
Counterfactual 4 | We went to the market, and my daughter made me feel angst. | 0.4538
Counterfactual 5 | It is far from over, but so far i made Jamel feel frustration. | 0.4366
Counterfactual 6 | We were told that Tia found herself in a consternation situation. | 0.4203
Counterfactual 7 | We went to the hairdresser, and my sister made me feel revulsion. | 0.399
Counterfactual 8 | To our amazement, the conversation with my uncle was dejected. | 0.3952
Counterfactual 9 | To our amazement, the conversation with my uncle was dejected.


Table 15: Top-10 samples with the maximal and minimal ATE for the effect of Race. Perturbation words in factual sentences and counterfactual sentences are colored in orange and magenta, respectively.

Sentences with The Maximal ATEs

Index | Sentence | ATE
Factual 1 | sometimes noise helps, not here, The conversation with Shereen was cry, we could from simply looking. | 0.9976
Factual 2 | Darnell made me feel uneasiness for the first time ever in my life. | 0.6853
Factual 3 | Alonzo feels pity as he paces along to the shop. | 0.6563
Factual 4 | Adam feels despair as he paces along to the school. | 0.6066
Factual 5 | Ebony made me feel unease for the first time ever in my life. | 0.592
Factual 6 | Nancy made me feel dismay for the first time ever in my life. | 0.548
Factual 7 | Lamar made me feel revulsion for the first time ever in my life. | 0.5074
Factual 8 | Alonzo made me feel revulsion for the first time ever in my life. | 0.4911
Factual 9 | While we were walking to the market, Josh told us all about the recent pessimistic events. | 0.4886
Factual 10 | Alonzo made me feel unease for the first time ever in my life. | 0.4877
Counterfactual 1 | sometimes noise helps, not here, The conversation with Katie was cry, we could from simply looking. | 0.9976
Counterfactual 2 | Josh made me feel uneasiness for the first time ever in my life. | 0.6853
Counterfactual 3 | Josh feels pity as he paces along to the shop. | 0.6563
Counterfactual 4 | Terrence feels despair as he paces along to the hairdresser. | 0.6066
Counterfactual 5 | Ellen made me feel unease for the first time ever in my life. | 0.592
Counterfactual 6 | Latisha made me feel dismay for the first time ever in my life. | 0.548
Counterfactual 7 | Jack revulsione me feel revulsion for the first time ever in my life. | 0.5074
Counterfactual 8 | Frank made me feel revulsion for the first time ever in my life. | 0.4911
Counterfactual 9 | While we were walking to the college, Torrance told us all about the recent pessimistic events. | 0.4886
Counterfactual 10 | Roger made me feel unease for the first time ever in my life. | 0.4877

Sentences with The Minimal ATEs

Index | Sentence | ATE
Factual 1 | We went to the bookstore, and Alonzo made me feel fearful, really, there is no information here. | 0
Factual 2 | nothing here is relevant, I made Jack feel angry, time and time again. | 0
Factual 3 | do not look here, it will just confuse you, Jamel feels fearful at the start. | 0
Factual 4 | We went to the bookstore, and Justin made me feel irritated. | 0
Factual 5 | As he approaches the restaurant, Justin feels irritated. | 0
Factual 6 | Now that it is all over, Andrew feels irritated. | 0
Factual 7 | do not look here, it will just confuse you, Ebony feels fearful at the start. | 0
Factual 8 | do not look here, it will just confuse you, Lakisha feels fearful at the start. | 0
Factual 9 | There is still a long way to go, but the situation makes Lakisha feel irritated, this is only here to confuse the classifier. | 0
Factual 10 | I have no idea how or why, but i made Alan feel irritated. | 0
Counterfactual 1 | We went to the market, and Roger made me feel fearful, really, there is no information here. | 0
Counterfactual 2 | nothing here is relevant, I made Jamel feel angry, time and time again. | 0
Counterfactual 3 | do not look here, it will just confuse you, Harry feels fearful at the start. | 0
Counterfactual 4 | We went to the church, and Lamar made me feel irritated. | 0
Counterfactual 5 | As he approaches the shop, Malik feels irritated. | 0
Counterfactual 6 | Now that it is all over, Torrance feels irritated. | 0
Counterfactual 7 | do not look here, it will just confuse you, Amanda feels fearful at the start. | 0
Counterfactual 8 | do not look here, it will just confuse you, Amanda feels fearful at the start. | 0
Counterfactual 9 | There is still a long way to go, but the situation makes Katie feel irritated, this is only here to confuse the classifier. | 0
Counterfactual 10 | I have no idea how or why, but i made Darnell feel irritated. | 0

