A NEURAL MEAN EMBEDDING APPROACH FOR BACK-DOOR AND FRONT-DOOR ADJUSTMENT

Abstract

We consider the estimation of average and counterfactual treatment effects under two settings: back-door adjustment and front-door adjustment. The goal in both cases is to recover the treatment effect without having access to a hidden confounder. This objective is attained by first estimating the conditional mean of the desired outcome variable given relevant covariates (the "first stage" regression), and then taking the (conditional) expectation of this function as a "second stage" procedure. We propose to compute these conditional expectations directly, using a regression onto the learned input features of the first stage, thus avoiding the need for sampling or density estimation. All functions and features (and in particular, the output features in the second stage) are neural networks learned adaptively from data, with the sole requirement that the final layer of the first stage be linear. The proposed method is shown to converge to the true causal parameter, and it outperforms recent state-of-the-art methods on challenging causal benchmarks, including settings involving high-dimensional image data.

1. INTRODUCTION

The goal of causal inference from observational data is to predict the effect of our actions, or treatments, on an outcome without performing interventions. Questions of interest include "what is the effect of smoking on life expectancy?" or counterfactual questions, such as "given the observed health outcome for a smoker, how long would they have lived had they quit smoking?" Answering these questions becomes challenging when a confounder exists, which affects both the treatment and the outcome and biases the estimation. Causal estimation requires us to correct for this confounding bias.

A popular assumption in causal inference is the no unmeasured confounder requirement, meaning that we observe all the confounders that bias the estimation. Although a number of causal inference methods have been proposed under this assumption (Hill, 2011; Shalit et al., 2017; Shi et al., 2019; Schwab et al., 2020), it rarely holds in practice. In the smoking example, the confounder can be one's genetic characteristics or social status, which are difficult to measure for both technical and ethical reasons. To address this issue, Pearl (1995) proposed back-door adjustment and front-door adjustment, which recover the causal effect in the presence of hidden confounders using a back-door variable or front-door variable, respectively. The back-door variable is a covariate that blocks all causal paths directed from the confounder to the treatment. In health care, patients may have underlying predispositions to illness due to genetic or social factors (hidden), which cause measurable symptoms. The symptoms can be used as the back-door variable if the treatment is chosen based on them. By contrast, a front-door variable blocks all directed paths from the treatment to the outcome.
In perhaps the best-known example, the amount of tar in a smoker's lungs serves as a front-door variable, since it is increased by smoking, shortens life expectancy, and has no direct link to underlying (hidden) sociological traits. Pearl (1995) showed that causal quantities can be obtained by taking the (conditional) expectation of the conditional average outcome. While Pearl (1995) only considered the discrete case, this framework was extended to the continuous case by Singh et al. (2020), using two-stage regression (a review of this and other recent approaches for the continuous case is given in Section 5). In the first stage, the approach regresses from the relevant covariates to the outcome of interest, expressing the function as a linear combination of non-linear feature maps. Then, in the second stage, the causal parameters are estimated by learning the (conditional) expectation of the non-linear feature map used in the first stage. Unlike competing methods (Colangelo & Lee, 2020; Kennedy et al., 2017), two-stage regression avoids fitting probability densities, which is challenging in high-dimensional settings (Wasserman, 2006, Section 6.5). The method of Singh et al. (2020) is shown to converge to the true causal parameters and exhibits better empirical performance than competing methods.

One limitation of the methods in Singh et al. (2020) is that they use fixed, pre-specified feature maps from reproducing kernel Hilbert spaces, which have limited expressive capacity when data are complex (images, text, audio). To overcome this, we propose a neural mean embedding approach that learns task-specific, adaptive feature dictionaries. At a high level, we first employ a neural network with a linear final layer in the first stage. For the second stage, we learn the (conditional) mean of the stage 1 features in the penultimate layer, again with a neural net. The approach develops the technique of Xu et al. (2021a;b) and enables the model to capture complex causal relationships for high-dimensional covariates and treatments. Neural network feature means are also used to represent (conditional) probabilities in other machine learning settings, such as representation learning (Zaheer et al., 2017) and approximate Bayesian inference (Xu et al., 2022). We derive the consistency of the method based on Rademacher complexity, a result which is of independent interest and may be relevant in establishing consistency for broader categories of neural mean embedding approaches, including Xu et al. (2021a;b). We empirically show that the proposed method performs better than other state-of-the-art neural causal inference methods, including those using kernel feature dictionaries.

This paper is structured as follows. In Section 2, we introduce the causal parameters of interest; we give a detailed description of the proposed method in Section 3. The theoretical analysis is presented in Section 4, followed by a review of related work in Section 5. We demonstrate the empirical performance of the proposed method in Section 6, covering two settings: a classical back-door adjustment problem with a binary treatment, and a challenging back-door and front-door setting where the treatment consists of high-dimensional image data.
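To make the two-stage idea concrete, the following is a minimal numerical sketch on a simulated back-door problem. Random tanh features stand in for the learned penultimate layer of a stage 1 network (a simplification for brevity; the proposed method learns these features adaptively), and the back-door variable X is observed, so the second stage for the ATE reduces to an empirical mean of the stage 1 features with the treatment fixed. All variable names and simulation values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulate: X is an observed back-door covariate confounding A -> Y.
X = rng.normal(0.0, 1.0, n)
A = 0.3 * X + rng.normal(0.0, 1.0, n)
Y = 2.0 * A + 1.5 * X + rng.normal(0.0, 0.1, n)

# "Stage 1": regress Y on (A, X). Random tanh features stand in for a
# learned network's penultimate layer; only the final layer is linear,
# as the method requires.
W = rng.normal(0.0, 0.5, (2, 200))
b = rng.normal(0.0, 1.0, 200)

def phi(z):
    return np.tanh(z @ W + b)

F = phi(np.column_stack([A, X]))
w_hat = np.linalg.solve(F.T @ F + 1e-3 * np.eye(200), F.T @ Y)  # ridge head

# "Stage 2": for the ATE, no second regression is needed -- we estimate
# E_X[phi(a, X)] by the empirical mean of the stage 1 features with A
# fixed at a, then apply the linear head.
def theta_ate(a):
    Fa = phi(np.column_stack([np.full(n, a), X]))
    return Fa.mean(axis=0) @ w_hat

# The true effect of raising A by one unit is 2.0; a naive regression of
# Y on A alone would give a confounded slope of roughly 2.4 here.
print(theta_ate(1.0) - theta_ate(0.0))
```

A kernel two-stage method would replace `phi` with a fixed RKHS feature map; the point of the adaptive approach is that these features are trained, which matters when the inputs are images or text rather than scalars.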

2. PROBLEM SETTING

In this section, we introduce the causal parameters and the methods to estimate them, namely back-door adjustment and front-door adjustment. Throughout the paper, we denote a random variable with a capital letter (e.g. A), the realization of this random variable in lowercase (e.g. a), and the set in which a random variable takes values with a calligraphic letter (e.g. A). We assume data is generated from a distribution P.

Causal Parameters. We introduce the target causal parameters using the potential outcome framework (Rubin, 2005). Let the treatment and the observed outcome be A ∈ A and Y ∈ Y ⊆ [-R, R]. We denote the potential outcome given treatment a as Y(a) ∈ Y. Here, we assume no interference, which means that we observe Y = Y(a) when A = a. We denote the hidden confounder as U ∈ U and assume conditional exchangeability, ∀a ∈ A, Y(a) ⊥⊥ A | U, which means that given the confounder U, the treatment assignment is independent of the potential outcomes. A typical causal graph is shown in Figure 1a. We may additionally consider an observable confounder O ∈ O, which is discussed in Appendix C.

A first goal of causal inference is to estimate the Average Treatment Effect (ATE), θ_ATE(a) = E[Y(a)], which is the average potential outcome under treatment A = a. We also consider the Average Treatment Effect on the Treated (ATT), θ_ATT(a; a′) = E[Y(a) | A = a′], which is the expected potential outcome of treatment A = a for those who received the treatment A = a′. Given the no interference and conditional exchangeability assumptions, these causal parameters can be written in the following form.

Proposition 1 (Rosenbaum & Rubin, 1983; Robins, 1986). Given an unobserved confounder U which satisfies no interference and conditional exchangeability, we have



θ_ATE(a) = E_U[E[Y | A = a, U]],    θ_ATT(a; a′) = E_U[E[Y | A = a, U] | A = a′].
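When U is discrete, the expressions in Proposition 1 reduce to weighted sums: the ATE averages E[Y | A = a, U] under the marginal P(U), while the ATT averages it under the conditional P(U | A = a′). A small numerical illustration with a binary confounder (all probabilities and conditional means below are hypothetical values chosen for illustration):

```python
# Hypothetical binary confounder U and binary treatment A.
p_u = {0: 0.7, 1: 0.3}                  # marginal P(U = u)
ey = {(0, 0): 1.0, (0, 1): 3.0,         # E[Y | A = 0, U = u]
      (1, 0): 2.0, (1, 1): 5.0}         # E[Y | A = 1, U = u]

# theta_ATE(a) = E_U[E[Y | A = a, U]]: outer average uses the marginal P(U).
theta_ate = {a: sum(p_u[u] * ey[(a, u)] for u in (0, 1)) for a in (0, 1)}

# theta_ATT(a; a') = E_U[E[Y | A = a, U] | A = a']: outer average uses
# P(U | A = a'). Suppose treated units are more often U = 1, say
# P(U = 1 | A = 1) = 0.6 (again hypothetical).
p_u_given_a1 = {0: 0.4, 1: 0.6}
theta_att = sum(p_u_given_a1[u] * ey[(1, u)] for u in (0, 1))

print(theta_ate[1] - theta_ate[0])  # ATE of A=1 vs A=0: 2.9 - 1.6 = 1.3
print(theta_att)                    # theta_ATT(1; 1) = 0.4*2 + 0.6*5 = 3.8
```

The two estimands differ precisely because the outer expectations use different distributions over U; the two-stage methods in this paper estimate the inner conditional mean first and the outer (conditional) expectation second.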

