UNTANGLING EFFECT AND SIDE EFFECT: CONSISTENT CAUSAL INFERENCE IN NON-TARGETED TRIALS

Abstract

A treatment is usually appropriate for some group (the "sick" group) on whom it has an effect, but it can also have a side-effect when given to subjects from another group (the "healthy" group). In a non-targeted trial both sick and healthy subjects may be treated, producing heterogeneous effects within the treated group. Inferring the correct treatment effect on the sick population is then difficult, because the effect and side-effect are tangled. We propose an efficient nonparametric approach to untangling the effect and side-effect, called PCM (pre-cluster and merge). We prove its asymptotic consistency in a general setting and show, on synthetic data, more than a 10x improvement in accuracy over the existing state of the art.

1. INTRODUCTION

A standard approach to causal effect estimation is the targeted randomized controlled trial (RCT), see (8; 13; 15; 17; 23). To test a treatment's effect on a sick population, subjects are recruited and admitted into the trial based on eligibility criteria designed to identify sick subjects. The trial subjects are then randomly split into a treated group that receives the treatment and a control group that receives the best alternative treatment (or a placebo). "Targeted" means that only sick individuals are admitted into the trial via the eligibility criteria, with the implicit assumption that only a single treatment effect is to be estimated. This ignores the possibility of treated subgroups among the sick population with heterogeneous effects.

Further, one often does not have the luxury of a targeted RCT. For example, the eligibility criteria for admittance to the trial may not unambiguously identify sick subjects, or one may not be able to control who gets into the trial. When the treatment is not exclusively applied to sick subjects, we say the trial is non-targeted, and new methods are needed to extract the treatment effect on the sick, (25). Non-targeted trials are the norm whenever subjects self-select into an intervention, which is often the case in domains ranging from healthcare to advertising.

We propose a nonparametric approach to causal inference in non-targeted trials, based on a pre-cluster and merge strategy. Assume a population is broken into ℓ groups, with a different expected treatment effect in each group. Identify each group with the level of its treatment effect, so there are effect levels c = 0, 1, . . . , ℓ − 1. For example, a population's subjects can be healthy, c = 0, or sick, c = 1. We use the Rubin-Neyman potential outcome framework, (19).
A subject is a tuple s = (x, c, t, y) sampled from a distribution D, where x ∈ [0, 1]^d is a feature vector such as [age, weight], c indicates the subject's level, t indicates the subject's treatment cohort, and y is the observed outcome. The observed outcome is one of two potential outcomes: v if treated, or v̄ if not treated. We consider strongly ignorable trials: given x, the propensity to treat is strictly between 0 and 1, and the potential outcomes {v, v̄} depend only on x, independent of t. In a strongly ignorable trial, one can use the features to identify counterfactual controls for estimating the effect.

The level c is central to the scope of our work. Mathematically, c is a hidden effect modifier which determines the distribution of the potential outcomes (c is an unknown and possibly complex function of x). The level c dichotomizes the feature space into subpopulations with different effects. One tries to design the eligibility criteria for the trial to ensure that the propensity to treat is non-zero only for subjects in one level. What to do when the eligibility criteria allow more than one level into the trial is exactly the problem we address. Though our work applies to a general number of levels, all the main ideas can be illustrated with just two levels, c ∈ {0, 1}. For the sake of concreteness, we call these two levels healthy and sick.

A trial samples n subjects, s_1, . . . , s_n. If subject i is treated, t_i = 1 and the observed outcome is y_i = v_i; otherwise t_i = 0 and the observed outcome is y_i = v̄_i (consistency). The treated group is T = {i | t_i = 1}, the control group is C = {i | t_i = 0}, and the sick group is S = {i | c_i = 1}. Our task is to determine whether the treatment works on the sick, and whether there is any side-effect on the healthy. We wish to estimate the effect and side-effect, defined as

    EFF = E_D[v − v̄ | c = 1],    SIDE-EFF = E_D[v − v̄ | c = 0].    (1)
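To make the setup concrete, the following minimal simulation samples subjects (x, c, t, y) from an illustrative two-level distribution D. The distribution, the propensity to treat, and the effect sizes are all hypothetical choices for illustration, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trial(n, p_sick=0.2, p_treat=0.5, eff=2.0, side_eff=-0.5):
    """Sample n subjects (x, c, t, y) from an illustrative two-level
    distribution: c = 1 (sick) with probability p_sick, uniform propensity
    to treat p_treat, and a level-dependent treatment effect."""
    x = rng.uniform(0.0, 1.0, size=(n, 2))           # features, e.g. [age, weight]
    c = (rng.uniform(size=n) < p_sick).astype(int)   # hidden effect level
    t = (rng.uniform(size=n) < p_treat).astype(int)  # treatment indicator
    v_bar = x.sum(axis=1)                            # untreated potential outcome
    v = v_bar + np.where(c == 1, eff, side_eff)      # treated potential outcome
    y = np.where(t == 1, v, v_bar)                   # observed outcome (consistency)
    return x, c, t, y

x, c, t, y = sample_trial(100_000)
# By construction, EFF = E[v - v_bar | c=1] = 2.0 and
# SIDE-EFF = E[v - v_bar | c=0] = -0.5, matching the definitions in (1).
```

With the treatment randomized, conditioning on the (here, observed) level c recovers both quantities from observed outcomes alone; the difficulty addressed in this paper is that c is hidden.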
Most prior work estimates EFF using the average treatment effect for the treated, the ATT, (1),

    ATT = average_{i∈T}(v_i) − average_{i∈T}(v̄_i),    (2)

which assumes all treated subjects are sick. There are several complications with this approach. (i) Suppose a subject is treated with probability p(x, c), the propensity to treat. For a non-uniform propensity to treat, the treated group has a selection bias, and ATT is a biased estimate of EFF. Ways to address this bias include inverse propensity weighting, (18), matched controls, (1), and learning the outcome function y(x, t), see for example (2; 3; 10; 12; 22; 23). Alternatively, one can simply ignore this bias and accept that ATT is estimating E[v − v̄ | t = 1]. (ii) The second term on the RHS in (2) can't be computed because we don't know the counterfactual v̄ for treated subjects. Much of causal inference deals with accurate unbiased estimation of average_{i∈T}(v̄_i), (4; 9). Our goal is not to improve counterfactual estimation. Hence, in our experiments, we use off-the-shelf counterfactual estimators. (iii) (Focus of our work) The trial is non-targeted and some (often most) treated subjects are healthy.

To highlight the challenge in (iii) above, consider a simple case with uniform propensity to treat, p(x, c) = p. Conditioning on at least one treated subject,

    E[ATT] = P[sick] × EFF + P[healthy] × SIDE-EFF.

The ATT is a mix of effect and side-effect and is therefore biased when the treatment effect is heterogeneous across levels. In many settings, for example healthcare, P[sick] ≪ P[healthy] and the bias is extreme, rendering ATT useless. Increasing the number of subjects won't resolve this bias. State-of-the-art causal inference packages provide methods to compute ATT, specifically aimed at accurate estimates of the counterfactual average_{i∈T}(v̄_i), (5; 21). These packages suffer from the mixing bias above. We propose a fix which can be used as an add-on to these packages.

Our Contribution.
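The mixing bias can be checked numerically. The sketch below uses oracle counterfactuals, so the ATT of (2) is computed exactly with no estimation error, under assumed values P[sick] = 0.1, EFF = 2.0, SIDE-EFF = −0.5 (hypothetical numbers, chosen only for illustration). The ATT concentrates around the mixture 0.1 × 2.0 + 0.9 × (−0.5) = −0.25, far from EFF = 2.0, no matter how large n is.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p_sick = 200_000, 0.1
eff, side_eff = 2.0, -0.5

c = (rng.uniform(size=n) < p_sick).astype(int)  # hidden level (mostly healthy)
t = (rng.uniform(size=n) < 0.5).astype(int)     # uniform propensity p(x, c) = 0.5
v_bar = rng.normal(size=n)                      # untreated potential outcome
v = v_bar + np.where(c == 1, eff, side_eff)     # treated potential outcome

# Oracle ATT as in (2): both terms known exactly, no counterfactual estimation.
att = v[t == 1].mean() - v_bar[t == 1].mean()

# att is close to P[sick]*EFF + P[healthy]*SIDE-EFF = -0.25, not to EFF = 2.0:
# the mixing bias persists even with perfect counterfactuals and large n.
```

This isolates complication (iii): even a perfect counterfactual estimator cannot rescue the ATT when the treated group mixes levels.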
Our main result, Theorem 1, is an asymptotically consistent, distribution-independent algorithm to extract the correct effect levels and associated subpopulations in non-targeted trials, when the number of effect levels is unknown. Assume a non-targeted trial has a treated group with n subjects sampled from an unknown distribution D. There is an algorithm which identifies ℓ̂ effect levels with estimated expected effect μ̂_c in level c, and assigns each subject s_i to a level ĉ_i which, under mild technical conditions, satisfies:

Theorem 1. All of the following hold with probability 1 − o(1):
(1) ℓ̂ = ℓ, i.e., the correct number of effect levels ℓ is identified.
(2) μ̂_c = E[v − v̄ | c] + o(1), i.e., the effect at each level is estimated accurately.
(3) The fraction of subjects assigned the correct effect level is 1 − o(1).

The effect level ĉ_i is correct if μ̂_{ĉ_i} matches, to within o(1), the expected treatment effect for the subject. For the formal assumptions, see Section 3. Parts (1) and (2) say the algorithm extracts the correct number of levels and their expected effects. Part (3) says the correct subpopulations for each level are extracted. Knowing the correct subpopulations is useful for post-processing, for example to understand the effects in terms of the features.

Our algorithm satisfying Theorem 1 is given in Section 2. The algorithm uses an unsupervised pre-cluster and merge strategy which reduces the task of estimating the effect levels to a 1-dimensional optimal clustering problem that provably extracts the correct levels asymptotically as n → ∞. Our algorithm assumes an unbiased estimator of counterfactuals, for example some established method (5; 21). In practice, this means one can control for confounders. If unbiased counterfactual estimation is not possible, then any form of causal effect analysis is doomed.
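The flavor of a pre-cluster and merge pass on 1-dimensional per-subject effect estimates can be illustrated as follows. This is a simplified stand-in, not the paper's PCM algorithm: it pre-clusters the sorted effect estimates into many small quantile bins, then greedily merges adjacent bins whose means are closer than a tolerance; both the bin count and the tolerance are hypothetical tuning choices.

```python
import numpy as np

def precluster_and_merge(tau, n_pre=50, merge_tol=0.5):
    """Simplified sketch of a pre-cluster-and-merge pass on 1-D per-subject
    effect estimates tau (NOT the paper's exact PCM algorithm).
    Returns (level means, per-subject level labels)."""
    order = np.argsort(tau)
    bins = np.array_split(order, n_pre)        # pre-clusters, in sorted order
    means = [tau[b].mean() for b in bins]
    merged = [list(bins[0])]                   # index sets of merged clusters
    merged_means = [means[0]]
    for b, m in zip(bins[1:], means[1:]):
        if m - merged_means[-1] < merge_tol:   # close enough: same effect level
            merged[-1].extend(b)
            merged_means[-1] = tau[merged[-1]].mean()
        else:                                  # large gap: start a new level
            merged.append(list(b))
            merged_means.append(m)
    labels = np.empty(len(tau), dtype=int)
    for level, idx in enumerate(merged):
        labels[idx] = level
    return np.array(merged_means), labels

# Two well-separated effect levels: 90% "healthy" near -0.5, 10% "sick" near 2.0.
rng = np.random.default_rng(2)
tau = np.concatenate([rng.normal(-0.5, 0.1, 900), rng.normal(2.0, 0.1, 100)])
mu, labels = precluster_and_merge(tau)
```

When the levels are well separated relative to the estimation noise, the merge step recovers both the number of levels and their means, which is the 1-dimensional clustering problem Theorem 1 reasons about.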
Our primary goal is untangling the heterogeneous effect levels, hence we use an off-the-shelf gradient boosting algorithm to get counterfactuals in our experiments (5).
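As an illustration of this pipeline stage, the sketch below fits a gradient boosting outcome model on the control group and uses it to predict the counterfactual untreated outcome v̄_i for each treated subject. scikit-learn's GradientBoostingRegressor and the noiseless linear outcome are convenience assumptions here; the paper's own estimator (5) may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)

# Illustrative strongly ignorable trial: outcomes depend only on x and t.
n = 5000
x = rng.uniform(size=(n, 2))
t = (rng.uniform(size=n) < 0.5).astype(int)
v_bar = 3.0 * x[:, 0] + x[:, 1]             # untreated potential outcome
y = np.where(t == 1, v_bar + 1.0, v_bar)    # uniform treatment effect of 1.0

# Fit an outcome model on controls only, then predict the counterfactual
# untreated outcome for each treated subject.
model = GradientBoostingRegressor().fit(x[t == 0], y[t == 0])
v_bar_hat = model.predict(x[t == 1])

# Per-subject effect estimates for the treated group: these 1-D values are
# what a pre-cluster-and-merge pass would subsequently operate on.
tau_hat = y[t == 1] - v_bar_hat
```

Any counterfactual estimator producing unbiased v̄_i predictions could be substituted here; the downstream untangling step only consumes the resulting per-subject effect estimates.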

