CAUSAL ESTIMATION FOR TEXT DATA WITH (APPARENT) OVERLAP VIOLATIONS

Abstract

Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome; e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is leftover randomness, so that every unit could have received (or not received) treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and satisfies overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-absolute-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline. Code, demo data, and a tutorial are available at https://github.com/gl-ybnbxb/TI-estimator.

1. INTRODUCTION

We consider the problem of estimating the causal effect of an attribute of a passage of text on some downstream outcome. For example, what is the effect of writing a polite or rude email on the amount of time it takes to get a response? In principle, we might hope to answer such questions with a randomized experiment. However, this can be difficult in practice, e.g., if poor outcomes are costly or take a long time to gather. Accordingly, in this paper, we are interested in estimating such effects using observational data.

There are three steps to estimating causal effects from observational data (see Murphy (2023), Chapter 36). First, we must specify a concrete causal quantity as our estimand; that is, give a formal target of estimation corresponding to the high-level question of interest. The next step is causal identification: we must show that this causal estimand can, in principle, be estimated using only observational data. The standard approach to identification relies on adjusting for confounding variables that affect both the treatment and the outcome. For identification to hold, the adjustment variables must satisfy two conditions: unconfoundedness and overlap. The former requires that the adjustment variables contain sufficient information on all common causes. The latter requires that the adjustment variables do not contain enough information about treatment assignment to perfectly predict it. Intuitively, to disentangle the effect of treatment from the effect of confounding, we must observe each treatment state at all levels of confounding. The final step is estimation from a finite data sample. Here, overlap again turns out to be critically important as a major determinant of the best possible accuracy (asymptotic variance) of the estimator (Chernozhukov et al., 2016). Since the treatment is a linguistic property, it is often reasonable to assume that the text contains information about all common causes of the treatment and the outcome.
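The overlap condition can be made concrete with a small simulation. With a genuine confounder there is leftover randomness in treatment at every confounder level, whereas a treatment that is a deterministic function of the adjustment variable (as when the text itself is the adjustment variable) has propensities of exactly 0 or 1. The setup below is purely illustrative, with a binary confounder standing in for the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: binary confounder Z, binary treatment T.
z = rng.integers(0, 2, size=10_000)
# Overlap holds: treatment probability lies strictly inside (0, 1) for each z.
t_ok = rng.binomial(1, np.where(z == 1, 0.8, 0.3))
# Apparent overlap violation: T is a deterministic function of z.
t_bad = z.copy()

def propensities(z, t):
    """Empirical P(T = 1 | Z = z) for each level of z."""
    return {int(v): t[z == v].mean() for v in np.unique(z)}

def overlap_holds(z, t):
    return all(0.0 < p < 1.0 for p in propensities(z, t).values())

print(propensities(z, t_ok))   # roughly {0: 0.3, 1: 0.8}
print(overlap_holds(z, t_ok))  # True
print(overlap_holds(z, t_bad)) # False: propensities are exactly 0 and 1
```

The same diagnostic applied to text would use an estimated propensity model in place of the empirical frequencies, but the degeneracy is identical: a rich enough model of the full text predicts the treatment perfectly.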
Thus, we may aim to satisfy unconfoundedness in the text setting by adjusting for all of the text as the confounding part. However, doing so violates overlap: since the treatment is a linguistic property determined by the text, the probability of treatment given the text is either 0 or 1. The polite or rude tone is determined by the text itself, so overlap fails if we naively adjust for all of the text. This problem is the main subject of this paper. More precisely, our goal is to find a causal estimand, causal identification conditions, and a robust estimation procedure that together allow us to estimate causal effects even in the presence of such (apparent) overlap violations.

There is an obvious first approach: use a standard plug-in estimation procedure that relies only on modeling the outcome from the text and treatment variables, making no explicit use of the propensity score (the probability that each unit is treated). Pryzant et al. (2020) use an approach of this kind and show it is reasonable in some situations. Indeed, we will see in Sections 3 and 4 that this procedure can be interpreted as a point estimator of a controlled causal effect. Even once we understand what the implied causal estimand is, this approach has a major drawback: the estimator is only accurate when the text-outcome model converges at a very fast rate. This is a particular issue in the text setting, where we would like to use large, flexible, deep learning models for this relationship. In practice, we find that this procedure works poorly: the estimator has significant absolute bias, and (the natural approach to) uncertainty quantification almost never covers the true value of the estimand; see Section 5. The contribution of this paper is a method for robustly estimating causal effects in text.
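The outcome-model-only plug-in procedure can be sketched on synthetic data. Here a discrete covariate stands in for the text, and the fitted outcome model Q(t, x) is a simple per-cell mean rather than a transformer; everything below is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Stand-in for text: a discrete covariate X that confounds T and Y.
x = rng.integers(0, 3, size=n)
t = rng.binomial(1, np.array([0.2, 0.5, 0.8])[x])
y = 1.0 * t + 0.5 * x + rng.normal(size=n)   # true treatment effect = 1.0

# "Outcome model": empirical mean of Y within each (t, x) cell,
# standing in for a fitted regression Q(t, x).
def q_hat(t_val, x_val):
    mask = (t == t_val) & (x == x_val)
    return y[mask].mean()

q1 = np.array([q_hat(1, v) for v in range(3)])[x]
q0 = np.array([q_hat(0, v) for v in range(3)])[x]

# Plug-in estimate: average of Q(1, x) - Q(0, x); no propensity score used.
ate_plugin = (q1 - q0).mean()
print(round(ate_plugin, 2))  # close to the true effect 1.0
```

The estimate here is accurate only because the per-cell means converge quickly; with a flexible deep model of raw text, the slow convergence of Q is exactly what produces the bias and poor coverage described above.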
The main idea is to break estimation into a two-stage procedure: in the first stage, we learn a representation of the text that preserves enough information to account for confounding, but throws away enough information to avoid overlap issues. Then, we use this representation as the adjustment variable in a standard double machine-learning estimation procedure (Chernozhukov et al., 2016; 2017a). To establish this method, the contributions of this paper are:

1. We give a formal causal estimand corresponding to the text-attribute question. We show this estimand is causally identified under weak conditions, even in the presence of apparent overlap issues.
2. We show how to efficiently estimate this quantity using the adapted double-ML technique just described. We show that this estimator admits a central limit theorem at a fast (√n) rate under weak conditions on the rate at which the ML model learns the text-outcome relationship (namely, convergence at an n^{1/4} rate). This implies that absolute bias decreases rapidly, and yields an (asymptotically) valid procedure for uncertainty quantification.
3. We test the performance of this procedure empirically, finding significant improvements in bias and uncertainty quantification relative to the outcome-model-only baseline.

Related work. The most closely related literature is on causal inference with text variables. This includes work treating text as treatment (Pryzant et al., 2020; Wood-Doughty et al., 2018; Egami et al., 2018; Fong & Grimmer, 2016; Wang & Culotta, 2019; Tan et al., 2014), as outcome (Egami et al., 2018; Sridhar & Getoor, 2019), as confounder (Veitch et al., 2019; Roberts et al., 2020; Mozer et al., 2020; Keith et al., 2020), and on discovering or predicting causality from text (del Prado Martin & Brendel, 2016; Tabari et al., 2018; Balashankar et al., 2019; Mani & Cooper, 2000). There are also numerous applications using text to adjust for confounding (e.g., Olteanu et al., 2017; Hall, 2017; Kiciman et al., 2018; Sridhar et al., 2018; Sridhar & Getoor, 2019; Saha et al., 2019; Karell & Freedman, 2019; Zhang et al., 2020). Of these, Pryzant et al. (2020) also address non-parametric estimation
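The second stage of the two-stage procedure can be sketched on synthetic data. In the sketch below, a low-dimensional discrete variable stands in for the learned text representation, the nuisance models (propensity and outcome) are simple empirical frequencies and means rather than fitted ML models, and the cross-fitting used in full double-ML is omitted for brevity; this is an illustration of the standard AIPW/double-ML score, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Stage 1 stand-in: assume a learned representation Z of the text that
# retains confounding information but not treatment-identifying information.
z = rng.integers(0, 3, size=n)
t = rng.binomial(1, np.array([0.2, 0.5, 0.8])[z])
y = 1.0 * t + 0.5 * z + rng.normal(size=n)   # true treatment effect = 1.0

# Nuisance estimates on Z: propensity g(z) and outcome model Q(t, z),
# here simple empirical frequencies/means instead of fitted ML models.
g = np.array([t[z == v].mean() for v in range(3)])[z]
q1 = np.array([y[(t == 1) & (z == v)].mean() for v in range(3)])[z]
q0 = np.array([y[(t == 0) & (z == v)].mean() for v in range(3)])[z]

# Double-ML / AIPW score: outcome-model term plus propensity-weighted
# residual corrections, giving robustness to nuisance misestimation.
psi = (q1 - q0) + t * (y - q1) / g - (1 - t) * (y - q0) / (1 - g)
est = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)
print(f"estimate = {est:.2f}, "
      f"95% CI = ({est - 1.96 * se:.2f}, {est + 1.96 * se:.2f})")
```

Because Z (unlike the raw text) has propensities strictly inside (0, 1), the score is well defined, and the central limit theorem for the mean of psi is what yields the √n-rate confidence intervals discussed above.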

