A NEURAL MEAN EMBEDDING APPROACH FOR BACK-DOOR AND FRONT-DOOR ADJUSTMENT

Abstract

We consider the estimation of average and counterfactual treatment effects under two settings: back-door adjustment and front-door adjustment. The goal in both cases is to recover the treatment effect without access to a hidden confounder. This objective is attained by first estimating the conditional mean of the desired outcome variable given relevant covariates (the "first stage" regression), and then taking the (conditional) expectation of this function as a "second stage" procedure. We propose to compute these conditional expectations directly, using a regression on the learned input features of the first stage, thus avoiding the need for sampling or density estimation. All functions and features (and in particular, the output features in the second stage) are neural networks learned adaptively from data, with the sole requirement that the final layer of the first stage be linear. The proposed method is shown to converge to the true causal parameter, and it outperforms recent state-of-the-art methods on challenging causal benchmarks, including settings involving high-dimensional image data.

1. INTRODUCTION

The goal of causal inference from observational data is to predict the effect of our actions, or treatments, on an outcome without performing interventions. Questions of interest include average effects, such as "what is the effect of smoking on life expectancy?", and counterfactual questions, such as "given the observed health outcome for a smoker, how long would they have lived had they quit smoking?". Answering these questions becomes challenging when a confounder exists that affects both the treatment and the outcome, thereby biasing the estimate. Causal estimation requires us to correct for this confounding bias. A popular assumption in causal inference is the no unmeasured confounder requirement, meaning that we observe all the confounders that bias the estimate. Although a number of causal inference methods have been proposed under this assumption (Hill, 2011; Shalit et al., 2017; Shi et al., 2019; Schwab et al., 2020), it rarely holds in practice. In the smoking example, the confounder can be one's genetic characteristics or social status, which are difficult to measure for both technical and ethical reasons. To address this issue, Pearl (1995) proposed back-door adjustment and front-door adjustment, which recover the causal effect in the presence of hidden confounders using a back-door variable or front-door variable, respectively. The back-door variable is a covariate that blocks all causal effects directed from the confounder to the treatment. In health care, patients may have underlying predispositions to illness due to genetic or social factors (hidden), which cause measurable symptoms. The symptoms can be used as the back-door variable if the treatment is chosen based on them. By contrast, a front-door variable blocks the path from treatment to outcome.
In perhaps the best-known example, the amount of tar in a smoker's lungs serves as a front-door variable, since it is increased by smoking, shortens life expectancy, and has no direct link to underlying (hidden) sociological traits. Pearl (1995) showed that causal quantities can be obtained by taking the (conditional) expectation of the conditional average outcome. While Pearl (1995) only considered the discrete case, this framework was extended to the continuous case by Singh et al. (2020), using two-stage regression (a review of this and other recent approaches for the continuous case is given in Section 5). In the first stage, the approach regresses the outcome of interest on the relevant covariates, expressing the function as a linear combination of non-linear feature maps. Then, in the second stage, the causal parameters are estimated by learning the (conditional) expectation of the non-linear feature map used in the first stage. Unlike competing methods (Colangelo & Lee, 2020; Kennedy et al., 2017), two-stage regression avoids fitting probability densities, which is challenging in high-dimensional settings (Wasserman, 2006, Section 6.5). The method of Singh et al. (2020) is shown to converge to the true causal parameters, and exhibits better empirical performance than competing methods. One limitation of the methods in Singh et al. (2020) is that they use fixed pre-specified feature maps from reproducing kernel Hilbert spaces, which have limited expressive capacity when data are complex (images, text, audio). To overcome this, we propose a neural mean embedding approach that learns task-specific adaptive feature dictionaries. At a high level, we first employ a neural network with a linear final layer in the first stage. For the second stage, we learn the (conditional) mean of the stage 1 features in the penultimate layer, again with a neural net. The approach develops the technique of Xu et al.
(2021a; b) and enables the model to capture complex causal relationships for high-dimensional covariates and treatments. Neural network feature means are also used to represent (conditional) probabilities in other machine learning settings, such as representation learning (Zaheer et al., 2017) and approximate Bayesian inference (Xu et al., 2022). We derive the consistency of the method based on Rademacher complexity; this result is of independent interest, and may be relevant in establishing consistency for broader categories of neural mean embedding approaches, including those of Xu et al. (2021a; b). We empirically show that the proposed method performs better than other state-of-the-art neural causal inference methods, including those using kernel feature dictionaries. This paper is structured as follows. In Section 2, we introduce the causal parameters of interest, and we give a detailed description of the proposed method in Section 3. The theoretical analysis is presented in Section 4, followed by a review of related work in Section 5. We demonstrate the empirical performance of the proposed method in Section 6, covering two settings: a classical back-door adjustment problem with a binary treatment, and a challenging back-door and front-door setting where the treatment consists of high-dimensional image data.

2. PROBLEM SETTING

In this section, we introduce the causal parameters of interest and the methods used to estimate them, namely back-door adjustment and front-door adjustment. Throughout the paper, we denote a random variable by a capital letter (e.g. A), the realization of this random variable in lowercase (e.g. a), and the set in which a random variable takes values by a calligraphic letter (e.g. A). We assume data is generated from a distribution P.

Causal Parameters

We introduce the target causal parameters using the potential outcome framework (Rubin, 2005). Let the treatment and the observed outcome be A ∈ A and Y ∈ Y ⊆ [-R, R]. We denote the potential outcome given treatment a as Y(a) ∈ Y. Here, we assume no interference, which means that we observe Y = Y(a) when A = a. We denote the hidden confounder as U ∈ U and assume conditional exchangeability, ∀a ∈ A, Y(a) ⊥⊥ A | U, which means that given U, the treatment assignment is independent of the potential outcomes. A typical causal graph is shown in Figure 1a. We may additionally consider an observable confounder O ∈ O, which is discussed in Appendix C. A first goal of causal inference is to estimate the Average Treatment Effect (ATE) θ_ATE(a) = E[Y(a)], which is the average potential outcome under treatment A = a. We also consider the Average Treatment Effect on the Treated (ATT) θ_ATT(a; a′) = E[Y(a) | A = a′], which is the expected potential outcome under treatment A = a for those who received the treatment A = a′. Given the no-interference and conditional exchangeability assumptions, these causal parameters can be written in the following form.

Proposition 1 (Rosenbaum & Rubin, 1983; Robins, 1986). Given an unobserved confounder U satisfying no interference and conditional exchangeability, we have

θ_ATE(a) = E_U[E[Y | A = a, U]],   θ_ATT(a; a′) = E_U[E[Y | A = a, U] | A = a′].

The proof is given in Appendix C. Note that since the confounder U is not observed, we cannot recover these causal parameters from (A, Y) alone.

Back-door Adjustment

In back-door adjustment, we assume access to a back-door variable X ∈ X, which blocks all causal paths from the unobserved confounder U to the treatment A. See Figure 1b for a typical causal graph. Given the back-door variable, the causal parameters can be written using only the observable variables (A, Y, X), as follows.

Proposition 2 (Pearl, 1995, Theorem 1). Given the back-door variable X, we have

θ_ATE(a) = E_X[g(a, X)],   θ_ATT(a; a′) = E_X[g(a, X) | A = a′],

where g(a, x) = E[Y | A = a, X = x].

Comparing Proposition 2 to Proposition 1, we can see that the causal parameters can be learned by treating the back-door variable X as the only "confounder", despite the presence of the additional hidden confounder U. Hence, we may apply any method based on the "no unobserved confounder" assumption to back-door adjustment.

Front-door Adjustment

Another adjustment for causal estimation is front-door adjustment, which uses the causal mechanism to determine the causal effect. Assume we observe a front-door variable M ∈ M, which blocks all causal paths from treatment A to outcome Y, as in Figure 1c. Then, we can recover the causal parameters as follows.

Proposition 3 (Pearl, 1995, Theorem 2). Given the front-door variable M, we have

θ_ATE(a) = E_{A′}[E_M[g(A′, M) | A = a]],   θ_ATT(a; a′) = E_M[g(a′, M) | A = a],

where g(a, m) = E[Y | A = a, M = m] and A′ ∈ A is a random variable that follows the same distribution as the treatment A.

Unlike the case of back-door adjustment, we cannot naively apply methods based on the "no unmeasured confounder" assumption here, since Proposition 3 takes a different form to Proposition 1.
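As a concrete sanity check, the identity in Proposition 2 can be verified by direct enumeration on a small discrete model. The toy distribution below uses our own illustrative numbers (not from the paper), with structure U → X → A, U → Y, A → Y, so that X blocks the back-door path from U to A: the adjusted quantity E_X[g(a, X)] recovers θ_ATE(a) = E_U[E[Y | A = a, U]] exactly, while the naive regression E[Y | A = a] does not.

```python
# Toy discrete model (illustrative numbers, not from the paper):
# U -> X -> A, U -> Y, A -> Y.  X blocks the back-door path A <- X <- U -> Y.
p_u = {0: 0.5, 1: 0.5}
p_x_u = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # P(X=x | U=u)
p_a_x = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}   # P(A=a | X=x)

def mean_y(a, u):                     # structural E[Y | A=a, U=u]
    return 2.0 * a + 3.0 * u

def p_joint(u, x, a):                 # P(U=u, X=x, A=a)
    return p_u[u] * p_x_u[u][x] * p_a_x[x][a]

def ate_true(a):                      # Proposition 1: E_U[ E[Y | A=a, U] ]
    return sum(p_u[u] * mean_y(a, u) for u in (0, 1))

def g(a, x):                          # observable regression E[Y | A=a, X=x]
    num = sum(p_joint(u, x, a) * mean_y(a, u) for u in (0, 1))
    den = sum(p_joint(u, x, a) for u in (0, 1))
    return num / den

def ate_backdoor(a):                  # Proposition 2: E_X[ g(a, X) ]
    p_x = {x: sum(p_joint(u, x, b) for u in (0, 1) for b in (0, 1))
           for x in (0, 1)}
    return sum(p_x[x] * g(a, x) for x in (0, 1))

def ate_naive(a):                     # confounded E[Y | A=a]
    num = sum(p_joint(u, x, a) * mean_y(a, u) for u in (0, 1) for x in (0, 1))
    den = sum(p_joint(u, x, a) for u in (0, 1) for x in (0, 1))
    return num / den
```

The exact equality holds here because A ⊥⊥ U | X in this graph; with finite data, g and P(X) are replaced by the two-stage estimates of Section 3.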

3. ALGORITHMS

In this section, we present our proposed methods. We first present the case with back-door adjustment and then move to front-door adjustment. The algorithm is summarized in Appendix A.

Back-door adjustment

The algorithm consists of two stages. In the first stage, we learn the conditional expectation g(a, x) = E[Y | A = a, X = x] with a specific form. We then compute the causal parameter by estimating the expectation of the input features to g. The conditional expectation g(a, x) is learned by regressing Y on (A, X). Here, we consider a specific model g(a, x) = w⊤(ϕ_A(a) ⊗ ϕ_X(x)), where ϕ_A : A → R^{d₁}, ϕ_X : X → R^{d₂} are feature maps represented by neural networks, w ∈ R^{d₁d₂} is a trainable weight vector, and ⊗ denotes the tensor product a ⊗ b = vec(ab⊤). This tensor form of g(a, x) explicitly separates the treatment of the features of X and of A; in the event that X is of much higher dimension than A, concatenating both as a single input tends to downplay the information in A. In addition, we can take advantage of linearity and focus on estimating the relevant (conditional) expectation, as discussed below. Given a dataset {(a_i, y_i, x_i)}_{i=1}^n of size n drawn from P, the feature maps ϕ_A, ϕ_X and the weight w can be trained by minimizing the following empirical loss:

L̂^X_1(w, ϕ_A, ϕ_X) = (1/n) Σ_{i=1}^n (y_i − w⊤(ϕ_A(a_i) ⊗ ϕ_X(x_i)))².

We may add a regularization term to this loss, such as weight decay λ∥w∥². Let the minimizer of the loss L̂^X_1 be (ŵ, φ̂_A, φ̂_X) = arg min L̂^X_1, and let the learned regression function be ĝ(a, x) = ŵ⊤(φ̂_A(a) ⊗ φ̂_X(x)). Then, by substituting ĝ for g in Proposition 2, we have

θ_ATE(a) ≈ ŵ⊤(φ̂_A(a) ⊗ E[φ̂_X(X)]),   θ_ATT(a; a′) ≈ ŵ⊤(φ̂_A(a) ⊗ E[φ̂_X(X) | A = a′]).

This is the advantage of assuming the specific form g(a, x) = w⊤(ϕ_A(a) ⊗ ϕ_X(x)): by linearity, we can recover the causal parameters by estimating E[φ̂_X(X)] and E[φ̂_X(X) | A = a′]. Such (conditional) expectations of features are called (conditional) mean embeddings, and thus we name our method "neural (conditional) mean embedding". We can estimate the marginal expectation E[φ̂_X(X)] by a simple empirical average, E[φ̂_X(X)] ≈ (1/n) Σ_{i=1}^n φ̂_X(x_i).
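The first stage and the marginal mean embedding can be sketched in a minimal NumPy example. For simplicity, fixed random cosine features stand in for the learned neural feature maps ϕ_A, ϕ_X (which the paper trains jointly with w); the tensor-product design, the ridge-regularized weight w, and the empirical mean embedding follow the construction above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 2000, 8, 8

# Toy data: scalar treatment A, scalar back-door variable X,
# outcome Y = A*X + X + noise, so the true theta_ATE(a) = (a + 1) E[X] = 0.
A = rng.uniform(-1, 1, n)
X = rng.uniform(-1, 1, n)
Y = A * X + X + 0.1 * rng.standard_normal(n)

# Fixed random cosine features as stand-ins for the learned neural feature
# maps phi_A, phi_X (a sketch of the structure, not the method itself).
wa, ba = rng.standard_normal(d1), rng.uniform(0, 2 * np.pi, d1)
wx, bx = rng.standard_normal(d2), rng.uniform(0, 2 * np.pi, d2)

def phi_A(a):
    return np.cos(np.outer(a, wa) + ba)          # (m, d1)

def phi_X(x):
    return np.cos(np.outer(x, wx) + bx)          # (m, d2)

# First stage: tensor-product design Phi[i] = phi_A(a_i) (x) phi_X(x_i),
# then ridge regression for the weight vector w.
Phi = np.einsum('ni,nj->nij', phi_A(A), phi_X(X)).reshape(n, d1 * d2)
lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(d1 * d2), Phi.T @ Y / n)

# Second stage (marginal case): empirical mean embedding of phi_X, so that
# theta_ATE(a) ~= w^T ( phi_A(a) (x) mean_i phi_X(x_i) ) by linearity.
mean_phi_X = phi_X(X).mean(axis=0)

def ate_hat(a):
    return w @ np.kron(phi_A(np.array([a]))[0], mean_phi_X)
```

By linearity, `ate_hat(a)` coincides exactly with averaging the fitted regression, (1/n) Σ_i ĝ(a, x_i), as noted in the text.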
The conditional mean embedding E[φ̂_X(X) | A = a′] requires more care, however. It can be learned by a technique proposed in Xu et al. (2021a), in which we train another regression function from the treatment A to the back-door feature φ̂_X(X). Specifically, we estimate E[φ̂_X(X) | A = a′] by f̂_{φ̂_X}(a′), where the regression function f̂_{φ̂_X} : A → R^{d₂} is given by

f̂_{φ̂_X} = arg min_{f : A → R^{d₂}} L̂^X_2(f; φ̂_X),   L̂^X_2(f; ϕ_X) = (1/n) Σ_{i=1}^n ∥ϕ_X(x_i) − f(a_i)∥².

Here, ∥·∥ denotes the Euclidean norm. The loss L̂^X_2 may include an additional regularization term, such as weight decay on the parameters of f. We then have

θ̂_ATE(a) = ŵ⊤(φ̂_A(a) ⊗ (1/n) Σ_{i=1}^n φ̂_X(x_i)),   θ̂_ATT(a; a′) = ŵ⊤(φ̂_A(a) ⊗ f̂_{φ̂_X}(a′))

as the final estimators for the back-door adjustment. The estimator θ̂_ATE reduces to the average of the predictions, θ̂_ATE(a) = (1/n) Σ_{i=1}^n ĝ(a, x_i). This coincides with other neural network causal methods (Shalit et al., 2017; Chernozhukov et al., 2022b), which do not assume g(a, x) = w⊤(ϕ_A(a) ⊗ ϕ_X(x)). As we have seen, however, the tensor product formulation is essential for estimating the ATT by back-door adjustment. It will also be necessary for the front-door adjustment, as we will see next.

Front-door adjustment

We can obtain the estimator for front-door adjustment by following almost the same procedure as for the back-door adjustment. Given data {(a_i, y_i, m_i)}_{i=1}^n, we again fit the regression model ĝ(a, m) = ŵ⊤(φ̂_A(a) ⊗ φ̂_M(m)) by minimizing

L̂^M_1(w, ϕ_A, ϕ_M) = (1/n) Σ_{i=1}^n (y_i − w⊤(ϕ_A(a_i) ⊗ ϕ_M(m_i)))²,

where ϕ_M : M → R^{d₂} is a feature map represented by a neural network. From Proposition 3, for f_{φ̂_M}(a) = E[φ̂_M(M) | A = a], we have θ_ATE(a) ≈ ŵ⊤(E[φ̂_A(A)] ⊗ f_{φ̂_M}(a)) and θ_ATT(a; a′) ≈ ŵ⊤(φ̂_A(a′) ⊗ f_{φ̂_M}(a)). Again, we estimate the feature embedding E[φ̂_A(A)] by an empirical average, and f_{φ̂_M}(a) by solving another regression problem.
The final estimators for front-door adjustment are given as

θ̂_ATE(a) = ŵ⊤((1/n) Σ_{i=1}^n φ̂_A(a_i) ⊗ f̂_{φ̂_M}(a)),   θ̂_ATT(a; a′) = ŵ⊤(φ̂_A(a′) ⊗ f̂_{φ̂_M}(a)),

where f̂_{φ̂_M} is obtained by minimizing the loss L̂^M_2(f; φ̂_M) = (1/n) Σ_{i=1}^n ∥φ̂_M(m_i) − f(a_i)∥² (with an additional regularization term).
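The second-stage regression, common to the ATT estimator of the back-door case and the front-door estimators above, can likewise be sketched in NumPy. Below, a linear model on polynomial features of A stands in for the neural network f (an illustrative simplification), and fixed random cosine features play the role of the learned first-stage features; the point is the structure of the vector-valued least-squares problem L̂_2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d2 = 2000, 6

# Toy data in which A is correlated with X, so the conditional mean
# embedding E[phi_X(X) | A = a] genuinely depends on a.
A = rng.uniform(-1, 1, n)
X = A + 0.3 * rng.standard_normal(n)

# Stand-in back-door features (in the paper, phi_X is the learned
# first-stage neural feature map).
wx, bx = rng.standard_normal(d2), rng.uniform(0, 2 * np.pi, d2)
phi_X = np.cos(np.outer(X, wx) + bx)             # (n, d2) regression targets

# Second-stage model f : A -> R^{d2}. A linear model on polynomial features
# of A stands in for the neural network f of the paper.
def feats(a):
    return np.stack([np.ones_like(a), a, a ** 2, a ** 3], axis=1)

F = feats(A)
lam = 1e-6
# Vector-valued ridge regression: one shared design matrix, d2 output columns.
B = np.linalg.solve(F.T @ F / n + lam * np.eye(F.shape[1]), F.T @ phi_X / n)

def f_hat(a):                                    # approximates E[phi_X(X) | A = a]
    return feats(np.atleast_1d(a)) @ B

# Conditioning on A should predict phi_X(X) better than the marginal mean.
cond_mse = np.mean((phi_X - F @ B) ** 2)
marg_mse = np.mean((phi_X - phi_X.mean(axis=0)) ** 2)
```

With the stage-1 quantities ŵ, φ̂_A in hand, the back-door ATT estimate is ŵ⊤(φ̂_A(a) ⊗ f_hat(a′)); the front-door second stage is identical, with the mediator features φ̂_M(m_i) as regression targets.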

4. THEORETICAL ANALYSIS

In this section, we prove the consistency of the proposed method. We focus on the back-door adjustment case, since the consistency of front-door adjustment can be derived identically. The proposed method consists of two successive regression problems: in the first stage, we learn the conditional expectation g, and in the second stage, we estimate the feature embeddings. We first show each stage's consistency, and then present the overall convergence rate to the causal parameter.

Consistency for the first stage: We consider the hypothesis space of g given by

H_g = {w⊤(ϕ_A(a) ⊗ ϕ_X(x)) | w ∈ R^{d₁d₂}, ϕ_A(a) ∈ R^{d₁}, ϕ_X(x) ∈ R^{d₂}, ∥w∥₁ ≤ R, max_{a∈A} ∥ϕ_A(a)∥_∞ ≤ 1, max_{x∈X} ∥ϕ_X(x)∥_∞ ≤ 1}.

Here, we denote the ℓ₁-norm and infinity norm of a vector b ∈ R^d as ∥b∥₁ = Σ_{i=1}^d |b_i| and ∥b∥_∞ = max_{i∈[d]} |b_i|. Note that from the inequality ∥ϕ_A(a) ⊗ ϕ_X(x)∥_∞ ≤ ∥ϕ_A(a)∥_∞ ∥ϕ_X(x)∥_∞ and Hölder's inequality, we can show that h(a, x) ∈ [-R, R] for all h ∈ H_g. First, we establish the richness of this hypothesis space with the following theorem.

Theorem 1. Let A, X ⊂ R^d be compact. Given sufficiently large R, d₁, d₂, for any continuous function f : A × X → R and constant ε > 0, there exists h ∈ H_g which satisfies sup_{a,x} |f(a, x) − h(a, x)| ≤ ε.

The proof uses a modified version of the universal approximation theorem (Cybenko, 1989) for neural networks, and is given in Appendix B.1. Theorem 1 tells us that we can approximate any continuous function f with arbitrary accuracy, which establishes the richness of our function class. Given this hypothesis space, the following lemma bounds the deviation of the estimated conditional expectation ĝ from the true one.

Lemma 1. Given data S = {(a_i, y_i, x_i)}_{i=1}^n, let the minimizer of the loss L̂^X_1 be ĝ = arg min L̂^X_1. If the true conditional expectation g is in the hypothesis space, g ∈ H_g, then w.p.
at least 1 − 2δ, we have ∥g − ĝ∥²_{P(A,X)} ≤ 16R R̂_S(H_g) + 8R² √(log(2/δ)/(2n)), where R̂_S(H_g) is the empirical Rademacher complexity of H_g given data S. The proof is given in Appendix B.3. Next, we present the empirical Rademacher complexity when a feed-forward neural network is used for the features.

Lemma 2. The empirical Rademacher complexity R̂_S(H_g) scales as R̂_S(H_g) ≤ O(C^L/√n) for some constant C if we use a specific L-layer neural network for the features ϕ_A, ϕ_X. See Lemma 7 in Appendix B.3 for the detailed expression of the upper bound.

Note that this may be of independent interest, since a similar hypothesis class is considered in Xu et al. (2021a; b), and no explicit upper bound on the empirical Rademacher complexity is provided in that work.

Consistency for the second stage: Next, we consider the second-stage regression. In back-door adjustment, we estimate the feature embedding E[φ̂_X(X)] and the conditional feature embedding E[φ̂_X(X) | A = a′]. We first state the consistency of the estimate of the marginal expectation, which follows from Hoeffding's inequality.

Lemma 3. Given data {x_i}_{i=1}^n and the feature map φ̂_X, w.p. at least 1 − δ, we have ∥E[φ̂_X(X)] − (1/n) Σ_{i=1}^n φ̂_X(x_i)∥_∞ ≤ √(2 log(2d₂/δ)/n).

For the conditional feature embedding E[φ̂_X(X) | A = a′], we solve the regression problem f̂_{φ̂_X} = arg min_f L̂^X_2(f; φ̂_X), the consistency of which is stated as follows.

Lemma 4. Let the hypothesis space H_f be H_f = {a ∈ A ↦ (f₁(a), …, f_{d₂}(a))⊤ ∈ [-1, 1]^{d₂} | f₁, …, f_{d₂} ∈ H_f⁰}, where H_f⁰ is some hypothesis space of functions f : A → [-1, 1]. Let the true function be f_{φ̂_X}(a) = E[φ̂_X(X) | A = a], and assume f_{φ̂_X} ∈ H_f. Let f̂_{φ̂_X} = arg min_{f∈H_f} L̂^X_2(f; φ̂_X), given data S = {(a_i, x_i)}. Then, w.p. at least 1 − 2δ, we have ∥f̂_{φ̂_X}(A) − f_{φ̂_X}(A)∥_{P(A),∞} ≤ 16 R̂_S(H_f⁰) + 8 √(log(2d₂/δ)/(2n)), where ∥f(A)∥_{P(A),∞} = max_i ∥f_i∥_{P(A)} and R̂_S(H_f⁰) is the empirical Rademacher complexity of H_f⁰ given data S.

The proof is identical to that of Lemma 1. We use a neural network hypothesis class for H_f⁰, whose empirical Rademacher complexity is bounded by O(1/√n), as discussed in Proposition 5 in Appendix B.3.

Consistency of the causal estimator: Finally, we show that if these two estimators converge uniformly, we can recover the true causal parameters.
To derive the consistency of the causal parameter, we put the following assumption on the hypothesis spaces, in order to guarantee that convergence in ℓ₂-norm leads to uniform convergence.

Assumption 1. For functions h₁, h₂ ∈ H_g, there exist constants c > 0 and β such that sup_{a∈A, x∈X} |h₁(a, x) − h₂(a, x)| ≤ (1/c) ∥h₁(A, X) − h₂(A, X)∥^{1/β}_{P(A,X)}.

Intuitively, this ensures that we have a non-zero probability of observing all elements in A × X. We can see that Assumption 1 is satisfied with β = 1 and c = min_{(a,x)∈A×X} P(A = a, X = x) when the treatment and back-door variables are discrete. A similar intuition holds for the continuous case; in Appendix B.2, we show that Assumption 1 holds with β = (2d+2)/2 when A, X are d-dimensional intervals, provided the density function of P(A, X) is bounded away from zero and all functions in H_g are Lipschitz continuous.

Theorem 2. Under the conditions of Lemmas 1 to 3 and Assumption 1, w.p. at least 1 − 4δ, we have sup_{a∈A} |θ_ATE(a) − θ̂_ATE(a)| ≤ O(n^{−1/(4β)}). If we furthermore assume that for all f, f′, sup_{a∈A} ∥f(a) − f′(a)∥_∞ ≤ (1/c′) (max_{i∈[d₂]} ∥f_i(A) − f′_i(A)∥_{P(A)})^{1/β′}, then, w.p. at least 1 − 4δ, we have sup_{a,a′∈A} |θ_ATT(a; a′) − θ̂_ATT(a; a′)| ≤ O(n^{−1/(4β)} + n^{−1/(4β′)}).

The proof is given in Appendix B.3. This rate is slow compared to existing work (Singh et al., 2020), which can be as fast as O(n^{−1/4}). However, Singh et al. (2020) assume that the correct regression function g lies in a certain reproducing kernel Hilbert space (RKHS), which is a stronger assumption than ours, since we only assume a Lipschitz hypothesis space. Deriving matching minimax rates under the Lipschitz assumption remains a topic for future work.

5. RELATED WORK

While machine learning approaches to back-door adjustment have been extensively explored in recent work, including tree models (Hill, 2011; Athey et al., 2019), kernel models (Singh et al., 2020) and neural networks (Shi et al., 2019; Chernozhukov et al., 2022b; Shalit et al., 2017), most of the literature considers binary treatment cases, and few methods can be applied to continuous treatments. Schwab et al. (2020) proposed to discretize the continuous treatments, and Kennedy et al. (2017); Colangelo & Lee (2020) conducted density estimation of P(X) and P(X|A). These are simple to implement but suffer from the curse of dimensionality (Wasserman, 2006, Section 6.5). Recently, the automatic debiased machine learner (Auto-DML) approach (Chernozhukov et al., 2022a) has gained increasing attention, and can handle continuous treatments in back-door adjustment. Consider a functional m that maps g to the causal parameter θ = E[m(g, (A, X))]. For the ATE case, we have m(g, (A, X)) = g(a, X), since θ_ATE(a) = E[g(a, X)]. We may estimate both g and the Riesz representer α that satisfies E[m(g, (A, X))] = E[α(A, X) g(A, X)] by least-squares regression to obtain the causal estimator. Although Auto-DML can learn a complex causal relationship with a neural network model (Chernozhukov et al., 2022b), it requires a considerable amount of computation when the treatment is continuous, since we have to learn a different Riesz representer α for each treatment a. Furthermore, as discussed in Appendix B.4, the error bound on α can grow exponentially with respect to the dimension of the probability space, which may harm performance in high-dimensional settings. Singh et al. (2020) proposed a feature embedding approach, in which the feature maps are specified as fixed feature maps in a reproducing kernel Hilbert space (RKHS). Although this strategy can be applied to a number of different causal parameters, the flexibility of the model is limited, since it uses pre-specified features.
Our main contribution is to generalize this feature embedding approach to adaptive features, which enables us to capture more complex causal relationships. Similar techniques are used in other causal inference settings, such as the deep feature instrumental variable method (Xu et al., 2021a) and deep proxy causal learning (Xu et al., 2021b). In contrast to the back-door case, there is little literature that discusses non-linear front-door adjustment. The idea was originally introduced for the discrete treatment setting (Pearl, 1995) and was later discussed using the linear causal model (Pearl, 2009). To the best of our knowledge, Singh et al. (2020) is the only work that considers non-linear front-door adjustment, where fixed kernel feature dictionaries are used. We generalize this approach using adaptive neural feature dictionaries and obtain promising performance.

6. EXPERIMENTS

In this section, we evaluate the performance of the proposed method in two scenarios. One considers back-door adjustment with a binary treatment, based on the IHDP dataset (Gross, 1993) and the ACIC dataset (Shimoni et al., 2018). The other tests the performance on a high-dimensional treatment, based on the dSprite image dataset (Matthey et al., 2017). We first describe the training procedure we apply for our proposed method, and then report the results of each benchmark. The details of the hyperparameters used in the experiments are summarized in Appendix D.

6.1. TRAINING PROCEDURE

During training, we use the learning procedure proposed by Xu et al. (2021a). Consider the first-stage regression in back-door adjustment, with the loss L̂^X_1 including weight decay regularization:

L̂^X_1(w, ϕ_A, ϕ_X) = (1/n) Σ_{i=1}^n (y_i − w⊤(ϕ_A(a_i) ⊗ ϕ_X(x_i)))² + λ∥w∥².

To minimize L̂^X_1 with respect to (w, ϕ_A, ϕ_X), we can use the closed-form solution for the weight w. If we fix the features ϕ_A, ϕ_X, the minimizing w can be written as

ŵ(ϕ_A, ϕ_X) = ((1/n) Σ_{i=1}^n ϕ_{A,X}(a_i, x_i) ϕ_{A,X}(a_i, x_i)⊤ + λI)^{−1} (1/n) Σ_{i=1}^n y_i ϕ_{A,X}(a_i, x_i),

where ϕ_{A,X}(a, x) = ϕ_A(a) ⊗ ϕ_X(x). We then optimize the features as (φ̂_A, φ̂_X) = arg min_{ϕ_A, ϕ_X} L̂^X_1(ŵ(ϕ_A, ϕ_X), ϕ_A, ϕ_X) using Adam (Kingma & Ba, 2015). We empirically found that this stabilizes learning and improves the performance of the proposed method.
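The closed-form expression for ŵ(ϕ_A, ϕ_X) is ordinary ridge regression on the stacked tensor-product features. The snippet below, with a random matrix standing in for the feature matrix, checks the formula against an equivalent augmented least-squares formulation of the same ridge problem:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 10
Phi = rng.standard_normal((n, d))   # stand-in for stacked features phi_A (x) phi_X
y = rng.standard_normal(n)
lam = 0.1

# Closed-form minimizer of (1/n) sum_i (y_i - w^T phi_i)^2 + lam * ||w||^2
w_closed = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(d), Phi.T @ y / n)

# Cross-check: the same ridge problem written as an augmented ordinary
# least-squares problem must give the same solution.
Phi_aug = np.vstack([Phi / np.sqrt(n), np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y / np.sqrt(n), np.zeros(d)])
w_lstsq, *_ = np.linalg.lstsq(Phi_aug, y_aug, rcond=None)
```

In the actual procedure, this solve is evaluated inside the feature-learning loop, so that Adam updates (ϕ_A, ϕ_X) with w already profiled out in closed form.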

6.2. BINARY TREATMENT SCENARIO

In this section, we report the performance on two classical causal datasets: the IHDP dataset and the ACIC dataset. The IHDP dataset is widely used to evaluate the performance of estimators of the ATE (Shi et al., 2019; Chernozhukov et al., 2022b; Athey et al., 2019). This is a semi-synthetic dataset based on the Infant Health and Development Program (IHDP) (Gross, 1993). Following existing work, we generate 1000 sets of 747 observations of outcomes and binary treatments based on the 25-dimensional observable confounder in the original data. The ACIC dataset was introduced in Shimoni et al. (2018), and is based on linked birth and infant death data (LBIDD) (Mathews & MacDorman, 2006). This is considered a more challenging benchmark than IHDP, since it contains data points with extreme propensity scores (i.e. P(A = 1|X) can be very close to 0 or 1). We select 101 datasets following Shi et al. (2019), and remove outliers in each dataset using the procedure described in Appendix D. We compare our method to competing causal methods: DragonNet (Shi et al., 2019), RieszNet (Chernozhukov et al., 2022b), and RKHS Embedding (Singh et al., 2020). DragonNet is a neural causal inference method specifically designed for binary treatments, which applies targeted regularization (van der Laan & Rubin, 2006) to ATE estimation. RieszNet implements Auto-DML with a neural network, which learns the conditional expectation g and the Riesz representer α jointly, while sharing the intermediate features. Given the estimates ĝ, α̂, it proposes three ways to calculate the causal parameter:

Direct: E[m(ĝ, (A, X))],   IPW: E[Y α̂(A, X)],   DR: E[m(ĝ, (A, X)) + α̂(A, X)(Y − ĝ(A, X))],

where the functional m maps g to the causal parameter (see Section 5 for an example of the functional m). We report the performance of each estimator in RieszNet. RKHS Embedding employs the feature embedding approach with a fixed kernel feature dictionary. The results are summarized in Table 1.
Although the RieszNet(IPW) estimator performs promisingly on IHDP, its performance degrades on the ACIC dataset, which suggests that RieszNet(IPW) is sensitive to extreme propensity scores. This is not surprising, since the true Riesz representer in this case is α(A, X) = A/P(A = 1|X) − (1 − A)/P(A = 0|X), which can be very large if P(A = 1|X) is close to 0 or 1. This also harms the performance of RieszNet(DR). We can see that the proposed method outperforms all competing methods besides RieszNet(DR) on the IHDP dataset, for which the performance is comparable (0.117 ± 0.002 vs. 0.110 ± 0.003).
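The sensitivity of the IPW form to extreme propensities can be seen directly from the Riesz representer identity. In the population-level toy calculation below (illustrative numbers, not the benchmark data), the Direct and IPW forms agree exactly for the ATE contrast E[g(1, X) − g(0, X)]; but since α(1, x) = 1/e(x), a propensity near 0 or 1 inflates the representer, so a finite-sample IPW average inherits a large variance even though the identity still holds.

```python
# Population-level check of the Riesz-representer identity for the binary-
# treatment ATE contrast theta = E[g(1, X) - g(0, X)] (illustrative numbers).
p_x = {0: 0.4, 1: 0.6}                        # P(X = x)
e = {0: 0.3, 1: 0.7}                          # propensity P(A = 1 | X = x)
g = {(0, 0): 1.0, (1, 0): 2.5,
     (0, 1): -0.5, (1, 1): 3.0}               # E[Y | A = a, X = x]

def alpha(a, x):                              # Riesz representer for this contrast
    return a / e[x] - (1 - a) / (1 - e[x])

def p_a(a, x):
    return e[x] if a == 1 else 1 - e[x]

# Direct form: E[g(1, X) - g(0, X)]
direct = sum(p_x[x] * (g[1, x] - g[0, x]) for x in (0, 1))

# IPW form: E[Y * alpha(A, X)] = E[g(A, X) * alpha(A, X)] at the population level
ipw = sum(p_x[x] * p_a(a, x) * g[a, x] * alpha(a, x)
          for x in (0, 1) for a in (0, 1))
```

An estimated propensity of 0.01 already inflates α to 100, which is the instability observed on ACIC.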

6.3. HIGH-DIMENSIONAL TREATMENT SCENARIO

To test the performance of our method in a more complex setting, we used the dSprite dataset (Matthey et al., 2017), which is also used as a benchmark for other high-dimensional causal inference methods (Xu et al., 2021a; b). The dSprite dataset consists of images that are 64 × 64 = 4096-dimensional, described by five latent parameters (shape, scale, rotation, posX and posY). Throughout this paper, we fix (shape, scale, rotation) and use posX ∈ [0, 1] and posY ∈ [0, 1] as the latent parameters. Based on this dataset, we propose two experiments: one is ATE estimation based on back-door adjustment, and the other is ATT estimation based on front-door adjustment.

Back-door Adjustment

In our back-door adjustment experiment, we consider the case where the image is the treatment. We sample the hidden confounder U ∼ Unif(0, 1), and consider the back-door variable (X₁, X₂) = (U cos θ + ε₁, U sin θ + ε₂), where ε₁, ε₂ ∼ N(0, 0.09) and θ ∼ Unif(0, 2π). We define the treatment A as the image whose parameters are set as posX = (X₁ + 1.5)/3, posY = (X₂ + 1.5)/3. We add Gaussian noise N(0, 0.01) to each pixel of the images. The outcome is given as

Y = h²(A)/100 + 4(U − 0.5) + ε_Y,   h(A) = Σ_{i,j=1}^{64} ((i − 1)/64) ((j − 1)/64) A_{[ij]},

where A_{[ij]} denotes the value of the pixel at (i, j) and ε_Y is a noise variable sampled from ε_Y ∼ N(0, 0.25). Each dataset consists of 5000 samples of (Y, A, X₁, X₂), and we consider the problem of estimating θ_ATE(a) = h²(a)/100. We compare the proposed method to RieszNet and RKHS Embedding, since DragonNet is designed for binary treatments and is not applicable here. We can see that the proposed method performs best in this setting, which shows the power of the method for complex high-dimensional inputs. The RKHS Embedding method suffers from the limited flexibility of the model in the case of a complex high-dimensional treatment, and performs worse than all neural methods besides RieszNet(IPW).
This suggests that it is difficult to estimate the Riesz representer α in a high-dimensional scenario, as also indicated by the exponential growth of the error bound with dimension discussed in Appendix B.4. We conjecture that this also harms the performance of RieszNet(Direct) and RieszNet(DR), since the models for the conditional expectation ĝ and the Riesz representer α̂ share intermediate features in the network and are jointly trained in RieszNet.

Front-door Adjustment

We use the dSprite dataset to consider front-door adjustment. Again, we sample hidden confounders U₁, U₂ ∼ Unif(−1.5, 1.5), and we set the image to be the treatment, where the parameters are set as posX = (U₁ + 1.5)/3, posY = (U₂ + 1.5)/3. We add Gaussian noise N(0, 0.01) to each pixel of the images. We use M = h(A) + ε_M as the front-door variable M, where ε_M ∼ N(0, 0.04). The outcome is given as

Y = M²/100 + 5(U₁ + U₂) + ε_Y,   ε_Y ∼ N(0, 0.25).

We consider the problem of estimating θ_ATT(a; a′), and obtain the average squared error over 121 values of a, while fixing a′ to the image with posX = 0.6, posY = 0.6. We compare against RKHS Embedding; the results are given in Figure 3. Note that RieszNet has not been developed for this setting. Again, the RKHS Embedding method suffers from the limited flexibility of the model, whereas our proposed model successfully captures the complex causal relationships.
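For reference, the data-generating process of the back-door dSprite experiment can be sketched as follows. A bright patch on a noisy 64 × 64 grid stands in for the rendered dSprite image (rendering the real sprites requires the dataset itself), and n is reduced from the paper's 5000 draws for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500                                   # reduced from the paper's 5000 draws

# Structural variables of the back-door dSprite experiment.
U = rng.uniform(0, 1, n)                  # hidden confounder
theta = rng.uniform(0, 2 * np.pi, n)
X1 = U * np.cos(theta) + rng.normal(0, 0.3, n)   # noise variance 0.09
X2 = U * np.sin(theta) + rng.normal(0, 0.3, n)
posX = np.clip((X1 + 1.5) / 3, 0, 1)      # latent image parameters
posY = np.clip((X2 + 1.5) / 3, 0, 1)

# Stand-in "image" treatment: a 3x3 bright patch at (posX, posY) on a noisy
# 64x64 grid, in place of the actual rendered dSprite sprite.
A = rng.normal(0, 0.1, (n, 64, 64))       # per-pixel noise N(0, 0.01)
cx = np.clip((posX * 63).astype(int), 1, 62)
cy = np.clip((posY * 63).astype(int), 1, 62)
for k in range(n):
    A[k, cy[k] - 1:cy[k] + 2, cx[k] - 1:cx[k] + 2] += 1.0

# h(A) = sum_{i,j} ((i-1)/64) ((j-1)/64) A[ij]  (1-indexed in the text)
idx = np.arange(64)
W = np.outer(idx / 64, idx / 64)
h = np.einsum('nij,ij->n', A, W)

# Outcome and the target causal parameter for each sampled image
eps_Y = rng.normal(0, 0.5, n)             # noise variance 0.25
Y = h ** 2 / 100 + 4 * (U - 0.5) + eps_Y
theta_ATE = h ** 2 / 100
```

The front-door variant replaces the back-door variables by confounders U₁, U₂ driving the image directly, with mediator M = h(A) + ε_M, following the equations above.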

7. CONCLUSION

We have proposed a novel method for back-door and front-door adjustment, based on neural mean embeddings. We established the consistency of the proposed method via a Rademacher complexity argument, which includes a new analysis of the hypothesis space with tensor product features. Our empirical evaluation shows that the proposed method outperforms existing estimators, especially when high-dimensional image observations are involved. As future work, it would be promising to apply a similar adaptive feature embedding approach to other causal parameters, such as the marginal average effect ∇_a θ_ATE(a) (Imbens & Newey, 2009). Furthermore, it would be interesting to consider sequential treatments, as in dynamic treatment effect estimation, in which the treatment may depend on past covariates, treatments and outcomes. Recently, a kernel feature embedding approach (Singh et al., 2021) has been developed to estimate the dynamic treatment effect, and we expect that applying neural mean embeddings there would benefit the performance.

A ALGORITHM SUMMARY

Here, we provide a summary of the algorithms.

Algorithm 1: Back-door Adjustment
Data: Back-door adjustment data {(a_i, y_i, x_i)}_{i=1}^n
1. Learn weights and features (ŵ, φ̂_A, φ̂_X) = arg min L̂^X_1, where L̂^X_1 = (1/n) Σ_{i=1}^n (y_i − w⊤(ϕ_A(a_i) ⊗ ϕ_X(x_i)))².
2. Learn the conditional embedding f̂_{φ̂_X} = arg min_{f : A → R^{d₂}} L̂^X_2(f; φ̂_X), where L̂^X_2(f; ϕ_X) = (1/n) Σ_{i=1}^n ∥ϕ_X(x_i) − f(a_i)∥².
3. Compute the causal parameters as θ̂_ATE(a) = ŵ⊤(φ̂_A(a) ⊗ (1/n) Σ_{i=1}^n φ̂_X(x_i)), θ̂_ATT(a; a′) = ŵ⊤(φ̂_A(a) ⊗ f̂_{φ̂_X}(a′)).

Algorithm 2: Front-door Adjustment
Data: Front-door adjustment data {(a_i, y_i, m_i)}_{i=1}^n
1. Learn weights and features (ŵ, φ̂_A, φ̂_M) = arg min L̂^M_1, where L̂^M_1 = (1/n) Σ_{i=1}^n (y_i − w⊤(ϕ_A(a_i) ⊗ ϕ_M(m_i)))².
2. Learn the conditional embedding f̂_{φ̂_M} = arg min_{f : A → R^{d₂}} L̂^M_2(f; φ̂_M), where L̂^M_2(f; ϕ_M) = (1/n) Σ_{i=1}^n ∥ϕ_M(m_i) − f(a_i)∥².
3. Compute the causal parameters as θ̂_ATE(a) = ŵ⊤((1/n) Σ_{i=1}^n φ̂_A(a_i) ⊗ f̂_{φ̂_M}(a)), θ̂_ATT(a; a′) = ŵ⊤(φ̂_A(a′) ⊗ f̂_{φ̂_M}(a)).

B TECHNICAL DETAILS

B.1 UNIVERSAL APPROXIMATION THEORY

In this section, we provide the proof of Theorem 1. Recall that our hypothesis space is
H_g = { w^⊤(ϕ_A(a) ⊗ ϕ_X(x)) | w ∈ R^{d_1 d_2}, ϕ_A(a) ∈ R^{d_1}, ϕ_X(x) ∈ R^{d_2}, ∥w∥_1 ≤ R, max_{a∈A} ∥ϕ_A(a)∥_∞ ≤ 1, max_{x∈X} ∥ϕ_X(x)∥_∞ ≤ 1 }.
Consider the features
ϕ_A(a) = [σ(s_1^⊤ a + α_1), …, σ(s_{d_1}^⊤ a + α_{d_1})]^⊤, ϕ_X(x) = [σ(t_1^⊤ x + β_1), …, σ(t_{d_2}^⊤ x + β_{d_2})]^⊤,
where σ is the sigmoid function and s_i, t_i ∈ R^D, α_i, β_i ∈ R are parameters. By considering the case d_1 = d_2 and setting the "off-diagonal" elements of w to zero, we see that
g(a, x) = Σ_{i=1}^{d_1} w_i σ(s_i^⊤ a + α_i) σ(t_i^⊤ x + β_i)
is a member of H_g. Next, we present the following lemma.

Lemma 5. Let μ be a finite signed regular Borel measure on A × X. Suppose σ satisfies the following:
[∀s, t ∈ R^D, ∀α, β ∈ R: ∫_{A×X} σ(s^⊤ a + α) σ(t^⊤ x + β) dμ(a, x) = 0] ⇔ μ = 0.  (3)
Then, given any continuous function f : A × X → R and ε > 0, there is a finite sum
g(a, x) = Σ_{i=1}^n w_i σ(s_i^⊤ a + α_i) σ(t_i^⊤ x + β_i)
which satisfies max_{(a,x)∈A×X} |f(a, x) − g(a, x)| ≤ ε.

The proof is identical to that of Theorem 1 in Cybenko (1989). Hence, all we have to prove is that the sigmoid function σ satisfies (3); this follows by a discussion similar to Lemma 1 in Cybenko (1989).

Proof of Theorem 1. Assume that
∀s, t ∈ R^D, ∀α, β ∈ R: ∫_{A×X} σ(s^⊤ a + α) σ(t^⊤ x + β) dμ(a, x) = 0.
Then, for all γ, δ ∈ R, we have
0 = lim_{λ_1→∞} lim_{λ_2→∞} ∫_{A×X} σ(λ_1(s^⊤ a + α) + γ) σ(λ_2(t^⊤ x + β) + δ) dμ(a, x)
= ∫_{A×X} lim_{λ_1→∞} lim_{λ_2→∞} σ(λ_1(s^⊤ a + α) + γ) σ(λ_2(t^⊤ x + β) + δ) dμ(a, x)
= ∫_{A×X} ξ_A(a) ξ_X(x) dμ(a, x),
where
ξ_A(a) = 0 if s^⊤ a + α < 0, 1 if s^⊤ a + α > 0, and σ(γ) if s^⊤ a + α = 0;
ξ_X(x) = 0 if t^⊤ x + β < 0, 1 if t^⊤ x + β > 0, and σ(δ) if t^⊤ x + β = 0.
We used the Lebesgue bounded convergence theorem in the second equality.
By definition, we have
0 = ∫_{A×X} ξ_A(a) ξ_X(x) dμ(a, x) = σ(γ)σ(δ) μ(Π^A_{s,α} × Π^X_{t,β}) + σ(γ) μ(Π^A_{s,α} × H^X_{t,β}) + σ(δ) μ(H^A_{s,α} × Π^X_{t,β}) + μ(H^A_{s,α} × H^X_{t,β}),
where
Π^A_{s,α} = {a ∈ A | s^⊤ a + α = 0}, Π^X_{t,β} = {x ∈ X | t^⊤ x + β = 0},
H^A_{s,α} = {a ∈ A | s^⊤ a + α > 0}, H^X_{t,β} = {x ∈ X | t^⊤ x + β > 0}.
Since this holds for all γ, δ, we have, for all s, α, t, β,
μ(Π^A_{s,α} × Π^X_{t,β}) = μ(Π^A_{s,α} × H^X_{t,β}) = μ(H^A_{s,α} × Π^X_{t,β}) = μ(H^A_{s,α} × H^X_{t,β}) = 0.
Based on this, we show μ = 0. Fix s, t and consider the functional F defined by
F(h) = ∫_{A×X} h(s^⊤ a, t^⊤ x) dμ(a, x),
where h(u, v) : [u̲, ū] × [v̲, v̄] → R is a bounded measurable function, with ū = max_{a∈A} s^⊤ a, u̲ = min_{a∈A} s^⊤ a, v̄ = max_{x∈X} t^⊤ x, v̲ = min_{x∈X} t^⊤ x. Let the indicator function I_{[b,c)×[d,e)} be defined as
I_{[b,c)×[d,e)}(u, v) = 1 if u ∈ [b, c) and v ∈ [d, e), and 0 otherwise.
Then, we have
F(I_{[b,∞)×[c,∞)}) = μ((Π^A_{s,−b} ∪ H^A_{s,−b}) × (Π^X_{t,−c} ∪ H^X_{t,−c})) = 0.
Since
I_{[b,c)×[d,e)} = I_{[b,∞)×[d,∞)} − I_{[c,∞)×[d,∞)} − I_{[b,∞)×[e,∞)} + I_{[c,∞)×[e,∞)},
we have F(I_{[b,c)×[d,e)}) = 0 for all b, c, d, e ∈ R. By linearity, F(Σ_{i=1}^N η_i I_{[b_i,c_i)×[d_i,e_i)}) = 0. Note that sums of the form Σ_{i=1}^N η_i I_{[b_i,c_i)×[d_i,e_i)} converge uniformly to any bounded measurable function h : [u̲, ū] × [v̲, v̄] → R; hence F(h) = 0. In particular, h(u, v) = cos(u + v) and h(u, v) = sin(u + v) are bounded measurable functions, and thus
∫_{A×X} exp(i(s^⊤ a + t^⊤ x)) dμ(a, x) = ∫_{A×X} [cos(s^⊤ a + t^⊤ x) + i sin(s^⊤ a + t^⊤ x)] dμ(a, x) = F(cos(u + v)) + i F(sin(u + v)) = 0.
Thus the Fourier transform of μ is 0, and so μ must be zero as well. From Lemma 5, Theorem 1 follows. ∎
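The indicator decomposition used in the proof can be checked numerically. The sketch below evaluates both sides of I_{[b,c)×[d,e)} = I_{[b,∞)×[d,∞)} − I_{[c,∞)×[d,∞)} − I_{[b,∞)×[e,∞)} + I_{[c,∞)×[e,∞)} on random points; the box corners are arbitrary choices with b < c and d < e.

```python
import numpy as np

def quadrant(u, v, b, d):
    # Indicator of the quadrant [b, inf) x [d, inf).
    return ((u >= b) & (v >= d)).astype(float)

def box(u, v, b, c, d, e):
    # Indicator of the half-open box [b, c) x [d, e).
    return ((u >= b) & (u < c) & (v >= d) & (v < e)).astype(float)

rng = np.random.default_rng(1)
u = rng.uniform(-2, 2, 10_000)
v = rng.uniform(-2, 2, 10_000)
b, c, d, e = -0.5, 0.7, -1.0, 0.3   # arbitrary corners with b < c, d < e

lhs = box(u, v, b, c, d, e)
rhs = (quadrant(u, v, b, d) - quadrant(u, v, c, d)
       - quadrant(u, v, b, e) + quadrant(u, v, c, e))
assert np.array_equal(lhs, rhs)
```

The identity is the two-dimensional inclusion-exclusion (1{u ≥ b} − 1{u ≥ c})(1{v ≥ d} − 1{v ≥ e}), so the check holds pointwise.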

B.2 IMPLICATION OF ASSUMPTION 1

In this section, we discuss the implications of Assumption 1, especially when the back-door and treatment variables are continuous. First, we bound the sup norm of a Lipschitz function by its L² norm.

Lemma 6. Let Z ∈ Z be a random variable with law P(Z), where Z ⊂ [0, 1]^d, and suppose the density f(z) is bounded away from zero, f(z) ≥ c > 0. Then, for every L-Lipschitz function h with values in [−R, R], we have
max_{z∈Z} |h(z)| ≤ (4/c)^{1/(d+2)} (2R + 2√d L)^{d/(d+2)} ∥h∥_{P(Z)}^{2/(d+2)}.

Proof. Since Z is compact, there exists z* such that |h(z*)| = max_{z∈Z} |h(z)|. Let M = |h(z*)| and consider the rectangle
B = { z ∈ Z | ∀i ∈ [d]: max(0, z*_[i] − M/(2R + 2√d L)) ≤ z_[i] ≤ min(1, z*_[i] + M/(2R + 2√d L)) },
where z_[i] denotes the i-th element of z. From Lipschitz continuity, for all z ∈ B, we have
|h(z)| ≥ |h(z*)| − L∥z* − z∥_2 = M − L √(Σ_{i=1}^d |z*_[i] − z_[i]|²) ≥ M − L √(Σ_{i=1}^d (M/(2R + 2√d L))²) ≥ M − L √(Σ_{i=1}^d (M/(2√d L))²) = M/2.
Now, consider the volume of B. Since
M/(2R + 2√d L) ≤ R/(2R + 2√d L) ≤ R/(2R) = 1/2,
the events 0 ≥ z*_[i] − M/(2R + 2√d L) and 1 ≤ z*_[i] + M/(2R + 2√d L) cannot occur simultaneously. Therefore, we have
min(1, z*_[i] + M/(2R + 2√d L)) − max(0, z*_[i] − M/(2R + 2√d L)) ≥ M/(2R + 2√d L),
and
∥h∥²_{P(Z)} = ∫_Z |h(z)|² f(z) dz ≥ ∫_B |h(z)|² f(z) dz ≥ c (M/(2R + 2√d L))^d M²/4.
Since M = max_{z∈Z} |h(z)|, rearranging gives
max_{z∈Z} |h(z)| ≤ (4/c)^{1/(d+2)} (2R + 2√d L)^{d/(d+2)} ∥h∥_{P(Z)}^{2/(d+2)}. ∎

From this, Assumption 1 follows for unit-cube probability spaces.

Corollary 1. If A = [0, 1]^{d_A}, X = [0, 1]^{d_X}, and all functions h ∈ H_g are L-Lipschitz continuous, we have
max_{(a,x)∈A×X} |h_1(a, x) − h_2(a, x)| ≤ C ∥h_1 − h_2∥_{P(A,X)}^{2/(d_A + d_X + 2)},
where C = (4/c)^{1/(d_A + d_X + 2)} (4R + 4√(d_A + d_X) L)^{(d_A + d_X)/(d_A + d_X + 2)}.

Note that the assumption on the hypothesis space is easy to satisfy, since every neural network is a Lipschitz function if we use the ReLU activation and regularize the operator norm of the weight in each layer.
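Lemma 6 can be sanity-checked numerically. The sketch below uses the illustrative choice h(z) = R sin(2πz) on [0, 1] with the uniform density, so the density lower bound is c = 1 and the Lipschitz constant is L = 2πR, and verifies that the sup norm is dominated by the bound.

```python
import numpy as np

# Check of Lemma 6 for d = 1 with h(z) = R*sin(2*pi*z) on Z = [0, 1]
# under the uniform density (density lower bound c = 1).
R = 1.0
L = 2 * np.pi * R                       # Lipschitz constant of h
h = lambda z: R * np.sin(2 * np.pi * z)

z = np.linspace(0.0, 1.0, 100_001)
sup_norm = np.max(np.abs(h(z)))
l2_norm = np.sqrt(np.mean(h(z) ** 2))   # ||h||_{P(Z)} under the uniform law

d, c = 1, 1.0
bound = ((4 / c) ** (1 / (d + 2))
         * (2 * R + 2 * np.sqrt(d) * L) ** (d / (d + 2))
         * l2_norm ** (2 / (d + 2)))
assert sup_norm <= bound
print(sup_norm, bound)
```

Here sup_norm is 1 while the bound evaluates to roughly three times that, consistent with the lemma being a (loose) upper bound.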

B.3 CONSISTENCY RESULTS

Proof of Lemma 1. We use the following Rademacher bound to prove consistency (Mohri et al., 2012).

Proposition 4 (Mohri et al., 2012, Theorem 11.3). Let X be a measurable space and H be a family of functions mapping from X to Y ⊆ [−R, R]. Given a fixed dataset S = ((y_1, x_1), (y_2, x_2), …, (y_n, x_n)) ∈ (X × Y)^n, the empirical Rademacher complexity is given by
R̂_S(H) = E_σ[(1/n) sup_{h∈H} Σ_{i=1}^n σ_i h(x_i)],
where σ = (σ_1, …, σ_n), with σ_i independent random variables taking values in {−1, +1} with equal probability. Then, for any δ > 0, with probability at least 1 − δ over the draw of an i.i.d. sample S of size n, each of the following holds for all h ∈ H:
E[(Y − h(X))²] ≤ (1/n) Σ_{i=1}^n (y_i − h(x_i))² + 8R R̂_S(H) + 4R² √(log(2/δ)/(2n)),
(1/n) Σ_{i=1}^n (y_i − h(x_i))² ≤ E[(Y − h(X))²] + 8R R̂_S(H) + 4R² √(log(2/δ)/(2n)).

Given Proposition 4, we can prove the consistency of the conditional expectation estimate.

Proof of Lemma 1. Since ĝ, g ∈ H_g, Proposition 4 implies that, with probability at least 1 − 2δ,
E[(Y − ĝ(A, X))²] ≤ (1/n) Σ_{i=1}^n (y_i − ĝ(a_i, x_i))² + 8R R̂_S(H_g) + 4R² √(log(2/δ)/(2n)),
(1/n) Σ_{i=1}^n (y_i − g(a_i, x_i))² ≤ E[(Y − g(A, X))²] + 8R R̂_S(H_g) + 4R² √(log(2/δ)/(2n)).
From the minimality of ĝ = argmin L_X1, we have
E[(Y − ĝ(A, X))²] ≤ E[(Y − g(A, X))²] + 16R R̂_S(H_g) + 8R² √(log(2/δ)/(2n)),
which is equivalent to
E[(g(A, X) − ĝ(A, X))²] ≤ 16R R̂_S(H_g) + 8R² √(log(2/δ)/(2n)).
Taking the square root of both sides completes the proof. ∎

Empirical Rademacher complexity of H_g. We now discuss the empirical Rademacher complexity of H_g when feed-forward neural networks are used for the features ϕ_A, ϕ_X. The discussion is based on the "peeling" argument proposed in Neyshabur et al. (2015).

Proposition 5 (Neyshabur et al., 2015, Theorem 1). Let the hypothesis space of L-layer neural networks be
H_NN = { f : R^D → R | f(s) = W^{(L)} σ(W^{(L−1)} σ(⋯ σ(W^{(1)} s))), Π_{i=1}^L ∥W^{(i)}∥_{p,q} ≤ γ },
where σ is the ReLU function and W^{(1)} ∈ R^{D×H}, W^{(2)}, …, W^{(L−1)} ∈ R^{H×H}, W^{(L)} ∈ R^{1×H} are the weights. The norm ∥·∥_{p,q} is the matrix L_{p,q}-norm sup_{x≠0} ∥Wx∥_q/∥x∥_p. Then, for any L, q ≥ 1, any 1 ≤ p ≤ ∞, and any set S = {s_1, …, s_n}, the empirical Rademacher complexity is bounded as
R̂_S(H_NN) ≤ (1/n) γ 2(2H^{[1/p* − 1/q]_+})^{2(L−1)} √(min{p*, 4 log(2D)}) max_i ∥s_i∥_{p*}
for p* = 1/(1 − 1/p) and [x]_+ = max{0, x}.

Given this, we can bound the empirical Rademacher complexity of H_g when each coordinate of the features is a truncated member of H_NN.

Lemma 7. Let A, X ⊂ R^D and define the hypothesis set
H_NNFeat(d) = { ϕ : R^D → R^d | ϕ(s) = (σ̄(f_1(s)), σ̄(f_2(s)), …, σ̄(f_d(s)))^⊤, f_1, …, f_d ∈ H_NN },
where σ̄ is the ramp function σ̄(x) = min(1, max(0, x)). Consider H_g given by
H_g = { w^⊤(ϕ_A(a) ⊗ ϕ_X(x)) | w ∈ R^{d_1 d_2}, ∥w∥_1 ≤ R, ϕ_A ∈ H_NNFeat(d_1), ϕ_X ∈ H_NNFeat(d_2) }.
Given a dataset S = {(a_1, x_1), …, (a_n, x_n)}, we have
R̂_S(H_g) ≤ 6R (1/n) γ 2(2H^{[1/p* − 1/q]_+})^{2(L−1)} √(min{p*, 4 log(2D)}) (max_i ∥a_i∥_{p*} + max_i ∥x_i∥_{p*}).
Note that max_{a∈A} ∥ϕ_A(a)∥_∞ ≤ 1 and max_{x∈X} ∥ϕ_X(x)∥_∞ ≤ 1 hold automatically, since σ̄ is applied in the features.

The proof is given as follows.

Proof. Define the hypothesis spaces
H̄_NN = {σ̄ ∘ f | f ∈ H_NN}, H̄²_NN = {f̄_1(a) f̄_2(x) | f̄_1, f̄_2 ∈ H̄_NN}.
Then, by definition,
H_g ⊂ { Σ_{i=1}^{d_1} Σ_{j=1}^{d_2} w_ij h_ij(a, x) | Σ_{i,j} |w_ij| ≤ R, ∀i, j: h_ij ∈ H̄²_NN }.
Since the maximum of a linear function of w over the constraint ∥w∥_1 ≤ R is attained at ∥w∥_1 = R, we have
R̂_S(H_g) ≤ R̂_S({Σ_{i,j} w_ij h_ij | Σ_{i,j} |w_ij| ≤ R}) = R̂_S({Σ_{i,j} w_ij h_ij | Σ_{i,j} |w_ij| = R}) ≤ R R̂_S({Σ_{i,j} w_ij h_ij | Σ_{i,j} |w_ij| = 1}).
Let H̄²_NN − H̄²_NN be the function space
H̄²_NN − H̄²_NN = { h_1(a, x) − h_2(a, x) | h_1, h_2 ∈ H̄²_NN }.
Since H̄²_NN contains the zero function, the final hypothesis space above is a subset of the convex hull of H̄²_NN − H̄²_NN, because
Σ_{i,j} w_ij h_ij(a, x) = Σ_{w_ij ≥ 0} w_ij (h_ij(a, x) − 0) + Σ_{w_ij < 0} |w_ij| (0 − h_ij(a, x)).
Therefore, we have
R̂_S(H_g) ≤ R R̂_S(H̄²_NN − H̄²_NN) ≤ 2R R̂_S(H̄²_NN).
Now, we can bound R̂_S(H̄²_NN) as
R̂_S(H̄²_NN) = R̂_S({f̄_1(a) f̄_2(x) | f̄_1, f̄_2 ∈ H̄_NN})
= R̂_S({ (1/2)[(f̄_1(a) + f̄_2(x))² − (f̄_1(a))² − (f̄_2(x))²] | f̄_1, f̄_2 ∈ H̄_NN })
≤ (1/2) R̂_S({(f̄_1(a) + f̄_2(x))² | f̄_1, f̄_2 ∈ H̄_NN}) + (1/2) R̂_S({(f̄_1(a))² | f̄_1 ∈ H̄_NN}) + (1/2) R̂_S({(f̄_2(x))² | f̄_2 ∈ H̄_NN})
≤ 2 R̂_S({f̄_1(a) + f̄_2(x) | f̄_1, f̄_2 ∈ H̄_NN}) + R̂_{S_A}(H̄_NN) + R̂_{S_X}(H̄_NN)
= 3 R̂_{S_A}(H̄_NN) + 3 R̂_{S_X}(H̄_NN),
where S_A = {a_i} and S_X = {x_i}; we used Talagrand's contraction lemma (Mohri et al., 2012, Lemma 5.11) in the inequality. Again from Talagrand's contraction lemma, since σ̄ is a 1-Lipschitz function, we have
R̂_{S_A}(H̄_NN) ≤ R̂_{S_A}(H_NN), R̂_{S_X}(H̄_NN) ≤ R̂_{S_X}(H_NN).
Combining these, we have R̂_S(H_g) ≤ 6R (R̂_{S_A}(H_NN) + R̂_{S_X}(H_NN)). This and Proposition 5 complete the proof. ∎

Now, we derive the final theorem showing the consistency of the method.

Proof of Theorem 2. From the triangle inequality, we have
|θ_ATE(a) − θ̂_ATE(a)| ≤ |θ_ATE(a) − E[ĝ(a, X)]| + |θ̂_ATE(a) − E[ĝ(a, X)]|.
For the first term on the r.h.s., we have
|θ_ATE(a) − E[ĝ(a, X)]| = |E[g(a, X) − ĝ(a, X)]| ≤ E[|g(a, X) − ĝ(a, X)|] ≤ sup_{a∈A, x∈X} |g(a, x) − ĝ(a, x)|.
For the second term, we have
|θ̂_ATE(a) − E[ĝ(a, X)]| = |ŵ^⊤(φ̂_A(a) ⊗ (1/n) Σ_i φ̂_X(x_i) − φ̂_A(a) ⊗ E[φ̂_X(X)])|
≤ ∥ŵ∥_1 ∥φ̂_A(a) ⊗ (1/n) Σ_i φ̂_X(x_i) − φ̂_A(a) ⊗ E[φ̂_X(X)]∥_∞
≤ ∥ŵ∥_1 ∥φ̂_A(a)∥_∞ ∥(1/n) Σ_i φ̂_X(x_i) − E[φ̂_X(X)]∥_∞
≤ R ∥(1/n) Σ_i φ̂_X(x_i) − E[φ̂_X(X)]∥_∞.
Therefore, we have
|θ_ATE(a) − θ̂_ATE(a)| ≤ sup_{a,x} |g(a, x) − ĝ(a, x)| + R ∥(1/n) Σ_i φ̂_X(x_i) − E[φ̂_X(X)]∥_∞.
Using Lemmas 1 and 3 and Assumption 1, we have
sup_{a,x} |g(a, x) − ĝ(a, x)| ≤ (1/c) (16R R̂_S(H_g) + 8R² √(log(2/δ)/(2n)))^{1/(2β)},
∥E[φ̂_X(X)] − (1/n) Σ_i φ̂_X(x_i)∥_∞ ≤ √(2 log(2d_2/δ)/n),
with probability at least 1 − 4δ.
Combining these bounds and applying Lemma 7 completes the proof of the ATE bound. For the ATT, the same argument yields
|θ_ATT(a; a′) − θ̂_ATT(a; a′)| ≤ sup_{a,x} |g(a, x) − ĝ(a, x)| + R sup_{a′∈A} ∥f̂_{φ̂_X}(a′) − E[φ̂_X(X) | A = a′]∥_∞.
Using Lemma 4 and the assumption made in Theorem 2, we have
∥f̂_{φ̂_X}(a′) − E[φ̂_X(X) | A = a′]∥_∞ ≤ (1/c′) (16 R̂_S(H_f) + 8 √(log(2d_2/δ)/(2n)))^{1/(2β′)}.
If we use the neural network hypothesis space H_f considered in Proposition 5, the ATT bound follows. ∎
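The empirical Rademacher complexity defined in Proposition 4 can be estimated by Monte Carlo for simple classes. The sketch below uses a toy one-dimensional linear class (not the neural classes of Lemma 7, where the supremum is intractable), for which the supremum over hypotheses has a closed form, and checks the estimate against the elementary bound √(Σ_i x_i²)/n obtained from Jensen's inequality.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity
#   R_S(H) = E_sigma[(1/n) sup_{h in H} sum_i sigma_i h(x_i)]
# for the toy class H = {x -> w*x : |w| <= 1}. For this class the sup is
# attained at w = sign(sum_i sigma_i x_i), so it equals |sum_i sigma_i x_i|.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
n = len(x)

trials = 20_000
sigma = rng.choice([-1.0, 1.0], size=(trials, n))   # Rademacher signs
rad = np.mean(np.abs(sigma @ x)) / n

# Jensen's inequality: E|sum sigma_i x_i| <= sqrt(sum x_i^2), so
# R_S(H) <= sqrt(sum_i x_i^2) / n for this class.
bound = np.sqrt(np.sum(x ** 2)) / n
assert rad <= bound
print(rad, bound)
```

The Monte Carlo estimate typically lands around 80% of the Jensen bound, the √(2/π) factor predicted by the central limit theorem.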

B.4 LIMITATION OF SMOOTHNESS ASSUMPTION ON RIESZ REPRESENTER

Following Chernozhukov et al. (2022b), we consider a functional m such that the causal parameter θ can be written as θ = E[m(g, (A, X))], where g is the conditional expectation g(a, x) = E[Y | A = a, X = x]. Then, a Riesz representer α, which satisfies
E[m(g, (A, X))] = E[α(A, X) g(A, X)],
exists as long as
E[m²(α, (A, X))] ≤ M ∥α∥²_{P(A,X)}
for all α ∈ H_α and a smoothness parameter M. When we consider the ATE θ_ATE(a), the corresponding functional is m(α, (A, X)) = α(a, X). Chernozhukov et al. (2021, Theorem 1) show that the deviation of the estimated Riesz representer α̂ from the true one α_0 scales linearly in the smoothness parameter M,
∥α̂ − α_0∥²_{P(A,X)} ≤ O(M δ_n + n^{−1/2}),
where δ_n is the critical radius, which scales as δ_n = O(√(log n / n)) when we consider fully connected neural networks.

Now, we show that the smoothness parameter M can have an exponential dependency on the dimension of the space, even for simple α. Consider A = [−1, 1]^d, some compact space X, and the uniform distribution for P(A, X). Consider the following α:
α(a, x) = max(1 − Σ_{i=1}^d 2|a_[i]|, 0),
where a_[i] denotes the i-th element of a; here α does not depend on x. Say we are interested in estimating θ_ATE(a) at a = 0 = [0, …, 0]^⊤, for which
E[(m(α, (A, X)))²] = E[(α(0, X))²] = 1.
Now consider B given by
B = { a ∈ A | ∀i ∈ [d]: −1/2 ≤ a_[i] ≤ 1/2 }.
Since α(a, x) = 0 for all a ∉ B, we have
∥α∥²_{P(A,X)} = ∫_{A×X} |α(a, x)|² dP(a, x) ≤ ∫_{B×X} dP(a, x) = 1/2^d,
where the last equality uses the assumption that P(A, X) is the uniform distribution. Hence, if α ∈ H_α, the smoothness parameter must have the exponential dependency M ≥ 2^d.
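The exponential lower bound on M can be checked by Monte Carlo. The sketch below estimates ∥α∥²_{P(A,X)} for the tent function with d = 3 and verifies that it does not exceed 2^{−d}.

```python
import numpy as np

# Monte Carlo check of the example above: for the tent function
#   alpha(a) = max(1 - sum_i 2|a_i|, 0),  A ~ Uniform([-1, 1]^d),
# we have E[(m(alpha))^2] = alpha(0)^2 = 1 while ||alpha||^2 <= 2^{-d},
# so the smoothness parameter must satisfy M >= 2^d.
rng = np.random.default_rng(0)
d, n = 3, 1_000_000
a = rng.uniform(-1, 1, size=(n, d))
alpha = np.maximum(1.0 - 2.0 * np.abs(a).sum(axis=1), 0.0)

norm_sq = float(np.mean(alpha ** 2))    # Monte Carlo estimate of ||alpha||^2
assert norm_sq <= 2.0 ** (-d)
print(norm_sq, 2.0 ** (-d), 1.0 / norm_sq)   # the last value lower-bounds M
```

In fact the estimate is far below 2^{−d}, since α vanishes outside the ℓ1-ball of radius 1/2 and is at most 1 inside it.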

C OBSERVABLE CONFOUNDER

In this section, we consider the case where we have an additional observable confounder O, with the causal graph given in Figure 4. Given the causal graph in Figure 4, the ATE and ATT are defined analogously to the main text. These causal parameters can be recovered when a back-door or front-door variable is provided, as follows.

Back-door adjustment: First, we present the proposition stating that these causal parameters can be recovered given the back-door variable X. Proposition 6 (Pearl, 1995) applies to the back-door variable X in Figure 4b, and identifies the causal parameters through g(a, o, x) = E[Y | A = a, O = o, X = x]. We learn the conditional expectation ĝ(a, o, x) = ŵ^⊤(φ̂_A(a) ⊗ φ̂_O(o) ⊗ φ̂_X(x)) by minimizing
(1/n) Σ_i (y_i − w^⊤(ϕ_A(a_i) ⊗ ϕ_O(o_i) ⊗ ϕ_X(x_i)))²
given data (y_i, a_i, o_i, x_i). Here, w is the weight and ϕ_A, ϕ_O, ϕ_X are the feature maps. From Proposition 6, we obtain the causal parameters by plugging in conditional embeddings f̂_{φ̂_O ⊗ φ̂_X} and f̂_{φ̂_X}, learned from
f̂_{φ̂_O ⊗ φ̂_X} = argmin_f (1/n) Σ_i ∥φ̂_O(o_i) ⊗ φ̂_X(x_i) − f(a_i)∥²,
f̂_{φ̂_X} = argmin_f (1/n) Σ_i ∥φ̂_X(x_i) − f(o_i)∥².

Front-door adjustment: Given the front-door variable M, the causal parameters can be identified as in Proposition 7 (Pearl, 1995), stated for Figure 4c. In this case, we learn ĝ by minimizing
(1/n) Σ_i (y_i − w^⊤(ϕ_A(a_i) ⊗ ϕ_O(o_i) ⊗ ϕ_M(m_i)))².

Table 5: Network structures of RieszNet in the dSprite back-door adjustment experiment. For the fully-connected layers (FC), we provide the input and output dimensions. SN denotes Spectral Normalization (Miyato et al., 2018).
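The first-stage regression with an additional observable confounder (a three-way tensor product of features) can be sketched as follows. This is a minimal sketch, not the paper's implementation: the features are fixed random sigmoid maps standing in for the learned networks, and the additive data-generating process is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def feats(z, d, scale=3.0):
    # Fixed random sigmoid features in place of the learned maps.
    w = rng.normal(scale=scale, size=d)
    b = rng.normal(scale=scale, size=d)
    return 1.0 / (1.0 + np.exp(-(np.outer(z, w) + b)))

# Hypothetical data: additive structural model y = a + o + x + noise.
n = 1000
a, o, x = (rng.uniform(0, 1, n) for _ in range(3))
y = a + o + x + 0.1 * rng.normal(size=n)

# Stage 1 with the three-way tensor product phi_A(a) ⊗ phi_O(o) ⊗ phi_X(x).
FA, FO, FX = feats(a, 8), feats(o, 8), feats(x, 8)
F = np.einsum("ni,nj,nk->nijk", FA, FO, FX).reshape(n, -1)  # (n, 8*8*8)
w_hat = np.linalg.solve(F.T @ F + 1e-3 * np.eye(F.shape[1]), F.T @ y)

mse = float(np.mean((F @ w_hat - y) ** 2))
print(w_hat.shape, mse)
```

The second-stage regressions f̂_{φ̂_O ⊗ φ̂_X} and f̂_{φ̂_X} are analogous least-squares problems with the feature (tensor-product) values as regression targets.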



In the binary treatment case A = {0, 1}, the ATE is typically defined as the expected difference of potential outcomes E[Y(1) − Y(0)]. However, we define the ATE as the expectation of the potential outcome E[Y(a)], which is the primary target of interest in the continuous treatment case, where it is also known as the dose-response curve. The same applies to the ATT.
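The dose-response θ_ATE(a) can be estimated with the two-stage procedure of Algorithm 1. The following is a minimal NumPy sketch, not the paper's implementation: the features phi_A, phi_X are fixed random sigmoid maps standing in for the learned networks, the data-generating process y = 2a + x + noise is invented for the example (its true dose-response is θ_ATE(a) = 2a + E[X]), and stage 1 is solved by ridge-regularized least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sigmoid_features(d, scale=3.0):
    # Fixed random sigmoid features standing in for a learned network layer.
    # (In the paper these features are trained jointly with w.)
    w = rng.normal(scale=scale, size=d)
    b = rng.normal(scale=scale, size=d)
    return lambda z: 1.0 / (1.0 + np.exp(-(np.outer(z, w) + b)))

# Hypothetical back-door data with structural model y = 2a + x + noise,
# so the true dose-response is theta_ATE(a) = 2a + E[X] = 2a + 0.5.
n = 2000
a = rng.uniform(0, 1, n)
x = rng.uniform(0, 1, n)
y = 2 * a + x + 0.1 * rng.normal(size=n)

phi_A, phi_X = make_sigmoid_features(20), make_sigmoid_features(20)

# Stage 1: ridge regression of y on the tensor-product features phi_A ⊗ phi_X.
FA, FX = phi_A(a), phi_X(x)                          # (n, d1), (n, d2)
F = np.einsum("ni,nj->nij", FA, FX).reshape(n, -1)   # row-wise outer products
w_hat = np.linalg.solve(F.T @ F + 1e-3 * np.eye(F.shape[1]), F.T @ y)

# Stage 3 for the ATE: replace phi_X(X) by its empirical mean embedding.
def theta_ate(a_query):
    feat = np.outer(phi_A(np.atleast_1d(a_query))[0], FX.mean(axis=0)).ravel()
    return float(w_hat @ feat)

print(theta_ate(0.5))   # should land near 1.5 for this synthetic model
```

Averaging the learned feature of X plays the role of the mean embedding; no density estimation or sampling is needed.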



Figure 1: Causal graphs we consider. A dotted circle denotes an unobservable variable.

Figure 2: ATE experiment based on dSprite data

Figure 4: Causal graph with an observable confounder. The bidirectional arrows mean that we allow either direction, or even a common ancestor variable.


θ_ATE(a) = E_{U,O}[E[Y | U, O, A = a]], θ_ATT(a; a′) = E_{U,O}[E[Y | U, O, A = a] | A = a′]. Furthermore, we can consider another causal parameter called the conditional average treatment effect (CATE), which is the conditional average of the potential outcome given O = o: θ_CATE(a; o) = E[Y(a) | O = o]. Given the exchangeability and no-interference assumptions, we have θ_CATE(a; o) = E_{U|O=o}[E[Y | U, O = o, A = a]].

, we have
θ_ATE(a) = E_{X,O}[g(a, O, X)], θ_ATT(a; a′) = E_{X,O}[g(a, O, X) | A = a′], θ_CATE(a; o) = E_X[g(a, o, X) | O = o],
where g(a, o, x) = E[Y | A = a, O = o, X = x]. Now, we present the deep adaptive feature embedding approach for this setting. We first learn the conditional expectation ĝ as ĝ(a, o, x) = ŵ^⊤(φ̂_A(a) ⊗ φ̂_O(o) ⊗ φ̂_X(x)), where ŵ, φ̂_A, φ̂_O, φ̂_X = argmin (1/n) Σ_{i=1}^n

θ_ATE(a) ≃ ŵ^⊤(φ̂_A(a) ⊗ E_{X,O}[φ̂_O(O) ⊗ φ̂_X(X)]),
θ_ATT(a; a′) ≃ ŵ^⊤(φ̂_A(a) ⊗ E_{X,O}[φ̂_O(O) ⊗ φ̂_X(X) | A = a′]),
θ_CATE(a; o) ≃ ŵ^⊤(φ̂_A(a) ⊗ φ̂_O(o) ⊗ E[φ̂_X(X) | O = o]).
Therefore, by estimating the feature embeddings, we have
θ̂_ATE(a) = ŵ^⊤(φ̂_A(a) ⊗ (1/n) Σ_i φ̂_O(o_i) ⊗ φ̂_X(x_i)),
θ̂_ATT(a; a′) = ŵ^⊤(φ̂_A(a) ⊗ f̂_{φ̂_O ⊗ φ̂_X}(a′)),
θ̂_CATE(a; o) = ŵ^⊤(φ̂_A(a) ⊗ φ̂_O(o) ⊗ f̂_{φ̂_X}(o)).

, we have
θ_ATE(a) = E_{A′}[E_O[E_{M|O,A=a}[g(A′, O, M)]]], θ_ATT(a; a′) = E_O[E_{M|O,A=a}[g(a′, O, M)]], θ_CATE(a; o) = E_{A′}[E_{M|O=o,A=a}[g(A′, o, M)]],
where g(a, o, m) = E[Y | A = a, O = o, M = m] and A′ follows the same distribution as A. For the front-door adjustment, we learn the conditional expectation ĝ as ĝ(a, o, m) = ŵ^⊤(φ̂_A(a) ⊗ φ̂_O(o) ⊗ φ̂_M(m)), where ŵ, φ̂_A, φ̂_O, φ̂_M = argmin (1/n) Σ_{i=1}^n

Mean and standard error of the ATE prediction error.

ACKNOWLEDGEMENT

This work was supported by the Gatsby Charitable Foundation.


Then, from Proposition 7, we obtain the plug-in expressions for the causal parameters. The conditional expectation E_{M|O=o,A=a}[φ̂_M(M)] is estimated as f̂_{φ̂_M}(o, a), where f̂_{φ̂_M} is learned by regressing φ̂_M(m_i) on (o_i, a_i). Then, replacing the marginal expectation with the empirical average yields the final estimators.

D EXPERIMENT DETAILS

Here, we describe the network architectures and hyper-parameters of all experiments. Unless otherwise specified, we used Adam with learning rate 0.001, β_1 = 0.9, β_2 = 0.999, and ε = 10^{-8}. For RKHS Embedding, we used the Gaussian kernel for continuous variables, with the bandwidth determined by the median trick.
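For reference, one common variant of the median trick can be implemented as follows; the bandwidth is set to the median pairwise Euclidean distance between data points. (Implementations differ in whether the median is taken over distances or squared distances; this sketch uses distances.)

```python
import numpy as np

def median_bandwidth(z):
    # Median trick: bandwidth = median pairwise distance between data points.
    # z is expected to have shape (n, dim).
    z = np.asarray(z, dtype=float)
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return float(np.sqrt(np.median(d2[np.triu_indices(len(z), k=1)])))

def gaussian_kernel(z1, z2, bw):
    # Gaussian (RBF) kernel matrix with bandwidth bw.
    z1, z2 = np.atleast_2d(z1), np.atleast_2d(z2)
    d2 = np.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
bw = median_bandwidth(z)
K = gaussian_kernel(z, z, bw)
assert K.shape == (200, 200) and np.allclose(np.diag(K), 1.0)
```

The heuristic gives a data-dependent scale without tuning, which is why it is a standard default for RKHS baselines.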

D.1 BINARY TREATMENT SCENARIO

In this scenario, all treatments are binary, A ∈ {0, 1}. In RKHS Embedding and Neural Embedding, we used the same treatment feature ϕ_A in both the IHDP and ACIC settings, which is equivalent to learning two separate models, one per treatment arm.

IHDP Dataset. We used the 1000 datasets used in Chernozhukov et al. (2022b), which are publicly available on the GitHub page of that paper. The network structure for the back-door feature ϕ_X(X) is shown in Table 2. Note that it is a much smaller network than Dragonnet or RieszNet, but increasing the network size did not affect the results much.

ACIC Dataset. We used the 101 datasets used in Shi et al. (2019), which satisfy the overlap assumption (i.e., no data point has an extreme propensity score P(A = 1|X)). We noticed that some datasets contain outliers, and we only considered data points whose outcome Y lies within a range determined by the 25% and 75% quantiles of the outcome. We ran the Dragonnet and RieszNet estimators with the same network architecture as for the IHDP dataset. The network structure for the back-door feature ϕ_X(X) is shown in Table 3. Note that the same structure is used in Dragonnet and RieszNet to predict the conditional expectation E[Y|X, A].

D.2 DSPRITE EXPERIMENTS

We generated all datasets ourselves from the original dSprite dataset (Matthey et al., 2017).

Back-door ATE estimation. The network features for the proposed method are summarized in Table 4, and the network structures for RieszNet are given in Table 5. Note that they share a similar feature extractor for images.

Table 4: Network structures of the neural embedding method in the dSprite back-door adjustment experiment. For the input layer, we provide the input variable. For the fully-connected layers (FC), we provide the input and output dimensions. SN denotes Spectral Normalization (Miyato et al., 2018).

Front-door ATT estimation. Here, we used the same network architecture as in the back-door adjustment, summarized in Table 4.
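Returning to the binary-treatment feature discussed at the start of this scenario: the claimed equivalence to learning two models can be illustrated with a one-hot treatment feature ϕ_A(a) = [1 − a, a]. This particular encoding is an assumption for the illustration; with it, the tensor-product model reduces to two separate linear models in ϕ_X.

```python
import numpy as np

# With a binary treatment, phi_A(a) = [1 - a, a] makes the tensor-product
# model select one of two outcome models:
#   w^T (phi_A(a) ⊗ phi_X(x)) = w0^T phi_X(x) if a = 0, w1^T phi_X(x) if a = 1.
def phi_A(a):
    a = float(a)
    return np.array([1.0 - a, a])

rng = np.random.default_rng(0)
d = 4
w = rng.normal(size=2 * d)      # concatenation [w0, w1]
w0, w1 = w[:d], w[d:]
phi_x = rng.normal(size=d)      # stand-in for phi_X(x) at one data point

for a in (0, 1):
    full = w @ np.kron(phi_A(a), phi_x)     # tensor-product prediction
    split = (w1 if a else w0) @ phi_x       # the corresponding single model
    assert np.isclose(full, split)
```

This is why fitting the tensor-product model with a binary one-hot feature behaves like fitting per-arm outcome models.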

