ESTIMATING TREATMENT EFFECTS USING NEUROSYMBOLIC PROGRAM SYNTHESIS

Anonymous

Abstract

Estimating treatment effects from observational data is a central problem in causal inference. Existing methods exploit inductive biases and heuristics from causal inference to design multi-head neural network architectures and regularizers. In this work, we propose to use neurosymbolic program synthesis, a data-efficient and interpretable technique, to solve the treatment effect estimation problem. We theoretically show that neurosymbolic programming can solve the treatment effect estimation problem. By designing a Domain-Specific Language (DSL) for the treatment effect estimation problem based on the inductive biases used in the literature, we argue that neurosymbolic programming is a better alternative for treatment effect estimation than traditional methods. Our empirical study reveals that our method, which implicitly encodes inductive biases in a DSL, achieves better performance on benchmark datasets than the state-of-the-art methods.

1. INTRODUCTION

Treatment effect (also referred to as causal effect) estimation estimates the effect of a treatment variable on an outcome variable (e.g., the effect of a drug on recovery). Randomized Controlled Trials (RCTs) are widely considered the gold standard approach for treatment effect estimation (Chalmers et al., 1981; Pearl, 2009). In RCTs, individuals are randomly split into a treated group and a control (untreated) group. This random split removes the spurious correlation between the treatment and outcome variables before the experiment, so that the estimated treatment effect is unbiased. However, RCTs are often: (i) unethical (e.g., in a study to find the effect of smoking on lung disease, a randomly chosen person cannot be forced to smoke), and/or (ii) impossible or infeasible (e.g., in finding the effect of blood pressure on the risk of an adverse cardiac event, it is impossible to intervene on the same patient with and without high blood pressure while keeping all other parameters the same) (Sanson-Fisher et al., 2007; Carey & Stiles, 2016; Pearl et al., 2016). These limitations leave us with observational data to compute treatment effects. Observational data, similar to RCTs, suffers from the fundamental problem of causal inference (Pearl, 2009), which states that for any individual, we cannot observe all potential outcomes at the same time (e.g., once we record a person's medical condition after taking a medicinal drug, we cannot observe the same person's medical condition under an alternate placebo). Observational data also suffers from selection bias (e.g., certain age groups are more likely to take certain kinds of medication than other age groups) (Collier & Mahoney, 1996). For these reasons, estimating unbiased treatment effects from observational data can be challenging (Hernan & Robins, 2019; Farajtabar et al., 2020).
Nevertheless, because of its many real-world use cases, estimating treatment effects from observational data is one of the long-standing central problems in causal inference (Rosenbaum & Rubin, 1983; 1985; Brady et al., 2008; Morgan & Winship, 2014; Shalit et al., 2017; Yoon et al., 2018; Shi et al., 2019; Yao et al., 2018; Zhang et al., 2021). Earlier methods that estimate treatment effects from observational data are based on matching techniques, which compare data points from the treatment and control groups that are similar w.r.t. a metric (e.g., Euclidean distance in nearest-neighbor matching, or propensity score in propensity score matching) (Brady et al., 2008; Morgan & Winship, 2014). Recent methods exploit inductive biases and heuristics from causal inference to design multi-head neural network (NN) models and regularizers (Hill, 2011; Farajtabar et al., 2020; Shi et al., 2019; Schwab et al., 2020; Chu et al., 2020; Shalit et al., 2017; Alaa & van der Schaar, 2017; Yoon et al., 2018; Bica et al., 2020; Künzel et al., 2019; Chernozhukov et al., 2018; Yao et al., 2018; Zhang et al., 2021). Multi-head NN models are typically used when treatment variables are single-dimensional and categorical (Shi et al., 2019; Shalit et al., 2017; Farajtabar et al., 2020; Schwab et al., 2020), and the regularizers therein enforce constraints such as controlling for the propensity score instead of pre-treatment covariates, i.e., covariates that are not caused by the treatment variable in the underlying causal data generating graph (Shi et al., 2019; Rosenbaum & Rubin, 1983). However, each such model is well-suited to a certain kind of causal graph, and may not apply to all causal data generating processes. For example, as shown in Fig 1, a regularizer that suits one causal model can lead to an incorrect treatment effect under another. In practice, one may not be aware of the underlying causal model, which makes the problem more challenging.
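The matching idea mentioned above can be illustrated with a minimal sketch. The toy data, the outcome rule y = x + 2t, and the function names below are all hypothetical and chosen only for illustration; they are not taken from the methods cited. The sketch contrasts a naive difference of group means, which is distorted because treated units have larger x, with 1-nearest-neighbor matching on x.

```python
from statistics import mean

# Hypothetical toy data: (covariate x, treatment t, outcome y) with y = x + 2*t,
# so the true effect of treatment is 2 for every unit.
# Treated units only appear at larger x, which mimics selection bias.
data = [(x, 0, x) for x in range(6)] + [(x, 1, x + 2) for x in (3, 4, 5)]

def naive_difference(rows):
    """Difference of mean outcomes between the treated and control groups."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

def nearest_neighbor_att(rows):
    """Average effect on the treated via 1-nearest-neighbor matching on x."""
    controls = [(x, y) for x, t, y in rows if t == 0]
    diffs = []
    for x, t, y in rows:
        if t == 1:
            # Closest control unit in covariate space (absolute distance on x).
            _, y_match = min(controls, key=lambda c: abs(c[0] - x))
            diffs.append(y - y_match)
    return mean(diffs)

naive = naive_difference(data)
att = nearest_neighbor_att(data)
print(naive, att)
```

On this toy data the naive contrast overstates the effect, while matching recovers the value built into the data, because every treated unit has an exact control match on x.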
In this work, we instead propose to use a neurosymbolic program synthesis technique to compute treatment effects, which does not require such explicit regularizers or an architecture redesign for each causal model. Such a technique learns to automatically synthesize differentiable programs that satisfy a given set of input-output examples (Shah et al., 2020; Parisotto et al., 2016), and can hence learn the sequence of operations needed to estimate the treatment effect for this set. We call our method the NEuroSymbolic Treatment Effect EstimatoR (NESTER). Neurosymbolic program synthesis is known to have the flexibility to synthesize different programs for different data distributions to optimize a performance criterion, while still abiding by the inductive biases studied in the treatment effect estimation literature (see Sec 4.1 for more details). To illustrate, one could view CFRNet/TARNet as implementing a single ifthenelse program primitive with its two-headed NN architecture. NESTER instead automatically synthesizes the sequence of program primitives (from a domain-specific language of primitives) for a given set of observations from a causal model, and can thus generalize to different distributions. Program synthesis methods, in general, enumerate a set of programs and select (from the enumeration) a set of feasible programs that satisfy the given input-output examples, so that the synthesized programs generalize well to unseen inputs (see Appendix for an example) (Biermann, 1978; Gulwani, 2011; Parisotto et al., 2016; Valkov et al., 2018; Shah et al., 2020). Usually, a Domain-Specific Language (DSL) (e.g., a specific context-free grammar) is used to synthesize relevant programs for a given domain and task. Recently, various NN-based techniques have been proposed to perform neurosymbolic program synthesis (Parisotto et al., 2016; Valkov et al., 2018; Gaunt et al., 2017; Bošnjak et al., 2017).
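The view of a two-headed architecture as one ifthenelse primitive can be sketched as follows. This is only an illustration under stated assumptions: the two linear "heads", the sigmoid gate, and the `sharpness` parameter are hypothetical stand-ins, not the architecture of CFRNet/TARNet or the primitives of NESTER. The soft version shows one common way a branching primitive could be relaxed so that it is differentiable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def head_control(x):
    # Hypothetical outcome head for untreated units (t = 0).
    return 1.0 + 0.5 * x

def head_treated(x):
    # Hypothetical outcome head for treated units (t = 1).
    return 3.0 + 0.5 * x

def ifthenelse_hard(t, x):
    """Two-headed prediction as a program: if t then treated head else control head."""
    return head_treated(x) if t == 1 else head_control(x)

def ifthenelse_soft(t, x, sharpness=20.0):
    """Assumed smooth relaxation: a sigmoid gate blends the two branches,
    so the output varies smoothly with the gate input."""
    gate = sigmoid(sharpness * (t - 0.5))
    return gate * head_treated(x) + (1.0 - gate) * head_control(x)

x = 2.0
# Estimated individual effect: predicted outcome under t = 1 minus under t = 0.
effect_hard = ifthenelse_hard(1, x) - ifthenelse_hard(0, x)
effect_soft = ifthenelse_soft(1, x) - ifthenelse_soft(0, x)
print(effect_hard, effect_soft)
```

With a sharp gate, the soft primitive closely tracks the hard branch on binary treatments, which is why such a relaxation could serve as a drop-in differentiable module.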
We use the neurosymbolic program synthesis paradigm where each program primitive (e.g., ifthenelse, α₁ + α₂) is a differentiable module (Parisotto et al., 2016; Shah et al., 2020). Such differentiable programs simultaneously optimize program primitive parameters while learning the overall program structure and flow. The set of possible programs that can be synthesized using a DSL is often large (Parisotto et al., 2016). Many methods have been proposed to search this vast space of programs efficiently (Gulwani et al., 2012; Parisotto et al., 2016; Valkov et al., 2018; Shah et al., 2020). In this work, we use Neural Admissible Relaxation (Shah et al., 2020), which uses neural networks as relaxations of partial programs while searching the program space with informed search algorithms such as A* (Hart et al., 1968). The final program is obtained by training with gradient descent algorithms. Our key contributions are:
• We study the use of neurosymbolic program synthesis as a practical approach for solving treatment effect estimation problems. To the best of our knowledge, this is the first effort that applies neurosymbolic program synthesis to estimate treatment effects.
• We propose a Domain-Specific Language (DSL) for treatment effect estimation, where each program primitive is motivated by the basic building blocks of models for treatment effect estimation in the literature.



Figure 1: IPM regularization in CFRNet (note that CFRNet is a combination of a simple two-head TARNet with IPM regularization) controls for covariates x, which may lead to an incorrect treatment effect w.r.t. causal model B. NESTER, in contrast, learns to synthesize different estimators for the two causal models.

In Fig 1, CFRNet (Shalit et al., 2017), a popular NN-based treatment effect estimation model, controls for pre-treatment covariates using a regularizer based on an Integral Probability Metric (IPM). The regularizer requires the representations of non-treatment covariates (denoted as x in Fig 1) with and without treatment to be similar. This is appropriate for causal model A in the figure, but does not work for causal model B, where the non-treatment covariates are caused by the treatment t and hence could vary for different values of t. One would ideally need a different regularizer or architecture to address causal model B (the same observation holds for TARNet (Shalit et al., 2017) too).
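A small worked example may help show why controlling for covariates that are caused by the treatment can give an incorrect effect. The structural equations below (x = t + u, y = t + x, with u an independent noise bit) are a hypothetical instance of a graph in which t causes x; they are assumptions for illustration and are not the causal model B used in the paper. Stratifying on x here stands in for "controlling for x" in general, not for the IPM regularizer itself.

```python
from itertools import product
from statistics import mean

# Hypothetical model: t -> x, t -> y, x -> y, with u an independent noise bit.
# x = t + u and y = t + x, so the total effect of t on y is 2
# (1 through the direct path plus 1 through x).
rows = []
for t, u in product((0, 1), (0, 1)):
    x = t + u
    y = t + x
    rows.append({"t": t, "x": x, "y": y})

def mean_y(data, t):
    """Mean outcome among units with treatment value t."""
    return mean(r["y"] for r in data if r["t"] == t)

# Unadjusted contrast: valid here because t is assigned independently of u.
total_effect = mean_y(rows, 1) - mean_y(rows, 0)

# Contrast after conditioning on x: only strata containing both arms contribute.
stratum_effects = []
for x_val in sorted({r["x"] for r in rows}):
    stratum = [r for r in rows if r["x"] == x_val]
    if {r["t"] for r in stratum} == {0, 1}:
        stratum_effects.append(mean_y(stratum, 1) - mean_y(stratum, 0))
adjusted_effect = mean(stratum_effects)

print(total_effect, adjusted_effect)
```

In this toy model, conditioning on x blocks the part of the effect that flows through x, so the adjusted contrast recovers only the direct path and understates the total effect.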

