UNTANGLING EFFECT AND SIDE EFFECT: CONSISTENT CAUSAL INFERENCE IN NON-TARGETED TRIALS

Abstract

A treatment is usually appropriate for some group (the "sick" group) on whom it has an effect, but it can also have a side-effect when given to subjects from another group (the "healthy" group). In a non-targeted trial both sick and healthy subjects may be treated, producing heterogeneous effects within the treated group. Inferring the correct treatment effect on the sick population is then difficult, because the effect and side-effect are tangled. We propose an efficient nonparametric approach to untangling the effect and side-effect, called PCM (pre-cluster and merge). We prove its asymptotic consistency in a general setting and show, on synthetic data, more than a 10x improvement in accuracy over existing state-of-the-art.

1. INTRODUCTION

A standard approach to causal effect estimation is the targeted randomized controlled trial (RCT), see (8; 13; 15; 17; 23). To test a treatment's effect on a sick population, subjects are recruited and admitted into the trial based on eligibility criteria designed to identify sick subjects. The trial subjects are then randomly split into a treated group that receives the treatment and a control group that receives the best alternative treatment (or a placebo). "Targeted" means only sick individuals are admitted into the trial via the eligibility criteria, with the implicit assumption that only a single treatment effect is to be estimated. This ignores the possibility of treated subgroups among the sick population with heterogeneous effects. Further, one often does not have the luxury of a targeted RCT. For example, eligibility criteria for admittance to the trial may not unambiguously identify sick subjects, or one may not be able to control who gets into the trial. When the treatment is not exclusively applied to sick subjects, we say the trial is non-targeted, and new methods are needed to extract the treatment effect on the sick (25). Non-targeted trials are the norm whenever subjects self-select into an intervention, which is often the case across domains stretching from healthcare to advertising.

We propose a nonparametric approach to causal inference in non-targeted trials, based on a pre-cluster and merge strategy. Assume a population is broken into $\ell$ groups with different expected treatment effects in each group. Identify each group with the level of its treatment effect, so there are effect levels $c = 0, 1, \ldots, \ell-1$. For example, a population's subjects can be healthy, $c = 0$, or sick, $c = 1$. We use the Rubin-Neyman potential outcome framework (19).
A subject is a tuple $s = (x, c, t, y)$ sampled from a distribution $D$, where $x \in [0,1]^d$ is a feature vector such as [age, weight], $c$ indicates the subject's level, $t$ indicates the subject's treatment cohort, and $y$ is the observed outcome. The observed outcome is one of two potential outcomes, $v$ if treated or $\bar{v}$ if not treated. We consider strongly ignorable trials: given $x$, the propensity to treat is strictly between 0 and 1 and the potential outcomes $\{v, \bar{v}\}$ depend only on $x$, independent of $t$. In a strongly ignorable trial, one can use the features to identify counterfactual controls for estimating effect.

The level $c$ is central to the scope of our work. Mathematically, $c$ is a hidden effect modifier which determines the distribution of the potential outcomes ($c$ is an unknown and possibly complex function of $x$). The level $c$ partitions the feature space into subpopulations with different effects. One tries to design the eligibility criteria for the trial to ensure that the propensity to treat is non-zero only for subjects in one level. What to do when the eligibility criteria allow more than one level into the trial is exactly the problem we address. Though our work applies to a general number of levels, all the main ideas can be illustrated with just two levels, $c \in \{0, 1\}$. For the sake of concreteness, we call these two levels healthy and sick.

A trial samples $n$ subjects, $s_1, \ldots, s_n$. If subject $i$ is treated, $t_i = 1$ and the observed outcome is $y_i = v_i$; otherwise $t_i = 0$ and the observed outcome is $y_i = \bar{v}_i$ (consistency). The treated group is $T = \{i \mid t_i = 1\}$, the control group is $C = \{i \mid t_i = 0\}$, and the sick group is $S = \{i \mid c_i = 1\}$. Our task is to determine if the treatment works on the sick, and if there is any side-effect on the healthy. We wish to estimate the effect and side-effect, defined as

$$\mathrm{EFF} = E_D[v - \bar{v} \mid c = 1], \qquad \text{SIDE-EFF} = E_D[v - \bar{v} \mid c = 0].$$
Most prior work estimates EFF using the average treatment effect for the treated, the ATT (1),

$$\mathrm{ATT} = \operatorname{average}_{i \in T}(v_i) - \operatorname{average}_{i \in T}(\bar{v}_i), \qquad (2)$$

which assumes all treated subjects are sick. There are several complications with this approach.

(i) Suppose a subject is treated with probability $p(x, c)$, the propensity to treat. For a non-uniform propensity to treat, the treated group has a selection bias, and ATT is a biased estimate of EFF. Ways to address this bias include inverse propensity weighting (18), matched controls (1), and learning the outcome function $y(x, t)$, see for example (2; 3; 10; 12; 22; 23). Alternatively, one can simply ignore this bias and accept that ATT is estimating $E[v - \bar{v} \mid t = 1]$.

(ii) The second term on the right-hand side of (2) can't be computed because we don't know the counterfactual $\bar{v}$ for treated subjects. Much of causal inference deals with accurate unbiased estimation of $\operatorname{average}_{i \in T}(\bar{v}_i)$ (4; 9). Our goal is not to improve counterfactual estimation. Hence, in our experiments, we use off-the-shelf counterfactual estimators.

(iii) (Focus of our work) The trial is non-targeted and some (often most) treated subjects are healthy.

To highlight the challenge in (iii) above, consider a simple case with uniform propensity to treat, $p(x, c) = p$. Conditioning on at least one treated subject,

$$E[\mathrm{ATT}] = P[\text{sick}] \times \mathrm{EFF} + P[\text{healthy}] \times \text{SIDE-EFF}.$$

The ATT is a mix of effect and side effect and is therefore biased when the treatment effect is heterogeneous across levels. In many settings, for example healthcare, $P[\text{sick}] \ll P[\text{healthy}]$ and the bias is extreme, rendering ATT useless. Increasing the number of subjects won't resolve this bias. State-of-the-art causal inference packages provide methods to compute ATT, specifically aimed at accurate estimates of the counterfactual $\operatorname{average}_{i \in T}(\bar{v}_i)$ (5; 21). These packages suffer from the mixing bias above. We propose a fix which can be used as an add-on to these packages.

Our Contribution.
Our main result is an asymptotically consistent, distribution-independent algorithm to extract the correct effect levels and associated subpopulations in non-targeted trials, when the number of effect-levels is unknown.

Theorem 1. Assume a non-targeted trial has a treated group with $n$ subjects sampled from an unknown distribution $D$. There is an algorithm which identifies $\hat{\ell}$ effect-levels with estimated expected effect $\hat{\mu}_c$ in level $c$, and assigns each subject $s_i$ to a level $\hat{c}_i$, such that, under mild technical conditions, all of the following hold with probability $1 - o(1)$:
(1) $\hat{\ell} = \ell$, i.e., the correct number of effect levels $\ell$ is identified.
(2) $\hat{\mu}_c = E[v - \bar{v} \mid c] + o(1)$, i.e., the effect at each level is estimated accurately.
(3) The fraction of subjects assigned the correct effect level is $1 - o(1)$. The effect level $\hat{c}_i$ is correct if $\hat{\mu}_{\hat{c}_i}$ matches, to within $o(1)$, the expected treatment effect for the subject.

For the formal assumptions, see Section 3. Parts (1) and (2) say the algorithm extracts the correct number of levels and their expected effects. Part (3) says the correct subpopulations for each level are extracted. Knowing the correct subpopulations is useful for post-processing, for example to understand the effects in terms of the features. Our algorithm satisfying Theorem 1 is given in Section 2. The algorithm uses an unsupervised pre-cluster and merge strategy which reduces the task of estimating the effect-levels to a 1-dimensional optimal clustering problem that provably extracts the correct levels asymptotically as $n \to \infty$. Our algorithm assumes an unbiased estimator of counterfactuals, for example some established method (5; 21). In practice, this means one can control for confounders. If unbiased counterfactual estimation is not possible, then any form of causal effect analysis is doomed.
Our primary goal is untangling the heterogeneous effect levels, hence we use an off-the-shelf gradient boosting algorithm to get counterfactuals in our experiments (5). We demonstrate that our algorithm's performance on synthetic data matches the theory. Subpopulation effect-analysis is a special case of heterogeneous treatment effects (HTE) (12; 20; 23). Hence, we also compare with X-Learner, a state-of-the-art algorithm for HTE (12), and with Bayes optimal prediction of effect-level. In comparison to X-Learner, our algorithm extracts visually better subpopulations, and has an accuracy that is more than 10× better for estimating per-subject expected effects. Note, HTE algorithms do not extract subpopulations with effect-levels. They predict effect given the features $x$. One can, however, try to infer subpopulations from predicted effects. Our algorithm also significantly outperforms Bayes optimal based on individual effects, which suggests that some form of pre-cluster and merge strategy is necessary. This need for some form of clustering has been independently observed in (11, chapter 4), which studies a variety of clustering approaches in a non-distribution-independent setting with a known number of levels.
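The mixing identity for the uniform-propensity case, E[ATT] = P[sick] × EFF + P[healthy] × SIDE-EFF, can be checked numerically. The following is a minimal sketch and not part of the original experiments: the values of `p_sick`, `eff`, `side_eff`, the propensity of 0.5, and the unit-variance Gaussian outcomes are all hypothetical, and the counterfactuals are taken from an oracle rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p_sick, eff, side_eff = 0.1, 2.0, -0.5   # hypothetical values, not from the paper

sick = rng.random(n) < p_sick             # hidden level c (True = sick)
treated = rng.random(n) < 0.5             # uniform propensity p(x, c) = 0.5
v_bar = rng.normal(0.0, 1.0, n)           # potential outcome if not treated
# potential outcome if treated: level-dependent effect plus noise
v = v_bar + np.where(sick, eff, side_eff) + rng.normal(0.0, 1.0, n)

# ATT with oracle counterfactuals: average of v - v_bar over the treated group
att = (v[treated] - v_bar[treated]).mean()
# The mixture P[sick]*EFF + P[healthy]*SIDE-EFF that the ATT converges to
mix = p_sick * eff + (1 - p_sick) * side_eff

print(f"ATT = {att:.3f}, mixture = {mix:.3f}, EFF = {eff:.3f}")
```

Under these assumptions the ATT tracks the mixture rather than EFF, and increasing `n` only tightens it around the mixture.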

2. ALGORITHM: PRE-CLUSTER AND MERGE FOR SUBPOPULATION EFFECTS (PCM)

Our algorithm uses a nonparametric pre-cluster and merge strategy that achieves asymptotic consistency without any user-specified hyperparameters. The inputs are the $n$ subjects $s_1, \ldots, s_n$, where $\{s_i\}_{i=1}^n = \{(x_i, t_i, y_i, \bar{y}_i)\}_{i=1}^n$. Note, both the factual $y_i$ and counterfactual $\bar{y}_i$ are inputs to the algorithm. To use the algorithm in practice, of course, the counterfactual must be estimated, and for our demonstrations we use an out-of-the-box gradient boosting regression algorithm from (7; 16) to estimate counterfactuals. Inaccuracy in counterfactual estimation will be accommodated in our analysis. The need to estimate counterfactuals does impact the algorithm in practice, due to an asymmetry in most trials: the treated population is much smaller than the controls. Hence, one might be able to estimate counterfactuals for the treated population but not for the controls, due to lack of coverage by the (small) treated population. In this case, our algorithm is only run on the treated population. It is convenient to define individual treatment effects $\mathrm{ITE}_i = (y_i - \bar{y}_i)(2t_i - 1)$, where $y_i$ is the observed factual and $\bar{y}_i$ the counterfactual ($2t_i - 1 = \pm 1$ ensures that the effect computed is for treatment versus no treatment).

There are five main steps.
1: [PRE-CLUSTER] Cluster the $x_i$ into $K \in O(\sqrt{n})$ clusters $Z_1, \ldots, Z_K$.
2: Compute ATT for each cluster $Z_j$, $\mathrm{ATT}_j = \operatorname{average}_{x_i \in Z_j} \mathrm{ITE}_i$.
3: [MERGE] Group the $\{\mathrm{ATT}_j\}_{j=1}^K$ into $\hat{\ell}$ effect-levels, merging the clusters at each level to get subpopulations $X_0, X_1, \ldots, X_{\hat{\ell}-1}$. ($X_c$ is the union of all clusters at level $c$.)
4: Compute subpopulation effects $\hat{\mu}_c = \operatorname{average}_{x_i \in X_c} \mathrm{ITE}_i$, for $c = 0, \ldots, \hat{\ell}-1$.
5: Assign subjects to effect levels, update the populations $X_c$ and expected effects $\hat{\mu}_c$.

We now elaborate on the intuition and details for each step in the algorithm.

Step 1. The clusters in the pre-clustering step play two roles.
The first is to denoise individual effects using in-cluster averaging. The second is to group like with like, that is, clusters should be homogeneous, containing only subjects from one effect-level. This means each cluster-ATT will accurately estimate a single level's effect (we do not know which). We allow for any clustering algorithm. However, our theoretical analysis (for simplicity) uses a specific algorithm, box-clustering, based on an $\varepsilon$-net of the feature space. One could also use a standard clustering algorithm such as K-means. We compare box-clustering with K-means in the appendix.

Step 2. This step denoises the individual effects using in-cluster averaging. Assuming clusters are homogeneous, each cluster ATT will approximate some level's effect.

Step 3. Assuming the effects in different levels are well separated, this separation gets emphasized in the cluster-ATTs, provided clusters are homogeneous. Hence, we can identify effect-levels from the clusters with similar effects, and merge those clusters into subpopulations. Two tasks must be solved: finding the number of subpopulations $\hat{\ell}$, and then optimally grouping the clusters into $\hat{\ell}$ subpopulations. To find the subpopulations, we use $\hat{\ell}$-means with squared 1-dim clustering error. Our algorithm sets $\hat{\ell}$ to be the smallest number of levels that achieves an $\hat{\ell}$-means error of at most $\log n / n^{1/(2d)}$. So,

$$\text{optimal 1-dim clustering error}(\hat{\ell}-1) > \log n / n^{1/(2d)}, \qquad \text{optimal 1-dim clustering error}(\hat{\ell}) \le \log n / n^{1/(2d)}.$$

Simultaneously finding $\hat{\ell}$ and optimally partitioning the clusters into $\hat{\ell}$ groups can be solved using a standard dynamic programming algorithm in $O(K^2 \hat{\ell})$ time using $O(K)$ space (24). Note, our algorithm will identify the number of effect levels provided such distinct subpopulations exist in the data. If it is known that only two subpopulations exist, sick and healthy, then $\hat{\ell}$ can be hard-coded to 2.

Step 4.
Assuming each cluster is homogeneous and clusters with similar effects found in step 3 are from the same effect-level, the subpopulations formed by merging the clusters with similar effects will be nearly homogeneous. Hence, the subpopulation-ATTs will be accurate estimates of the effects at each level.

Step 5. Each subject $x_i$ is implicitly assigned a level $\hat{c}_i$ based on the subpopulation $X_c$ to which it belongs. However, we can do better. By considering the $\sqrt{n}$ nearest neighbors to $x_i$, we can obtain a smoothed effect for $x_i$. We use this smoothed effect to place $x_i$ into the subpopulation whose effect matches best, hence placing $x_i$ into a level. Unfortunately, running this algorithm for all $n$ subjects is costly, needing sophisticated data structures to reduce the expected run time below $O(n^2)$. As an alternative, we center a $(1/n^{1/(2d)})$-hypercube on $x_i$ and smooth $x_i$'s effect using the average effect over points in this hypercube. This approach requires $O(n\sqrt{n})$ run time to obtain the effect-level for all subjects, significantly better than $O(n^2)$ when $n$ is large. Once the effect-levels for all subjects are obtained, one can update the subpopulations $X_c$ and the corresponding effect-estimates $\hat{\mu}_c$.

The run time of the algorithm is $O(n\ell + n\sqrt{n})$ (expected and with high probability) and the output is nearly homogeneous subpopulations which can now be post-processed. An example of useful post-processing is a feature-based explanation of the subpopulation memberships. Note that we still do not know which subpopulation(s) are the sick ones, hence we cannot say which is the effect and which is the side effect. A post-processing oracle would make this determination. For example, a doctor in a medical trial would identify the sick groups from subpopulation demographics.

Note. The optimal 1-d clustering can be done directly on the smoothed ITEs from the $(1/n^{1/(2d)})$-hypercubes centered on each $x_i$, using the same thresholds as in step 3.
One still gets asymptotic consistency; however, the price is an increased run time of $O(n^2 \ell)$. This is prohibitive for large $n$.
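Steps 1 to 4 above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`box_precluster`, `optimal_1d_clustering`, `pcm`) are hypothetical, the grid side is obtained by rounding $n^{1/(2d)}$, the dynamic program is the plain $O(K^2 k)$-time version with $O(Kk)$ memory rather than the $O(K)$-space variant cited in the text, and the demo data (three vertical stripes with effects 0, 3, 6 and unit-variance ITE noise) are invented so that the levels are well separated relative to the threshold $\log n / n^{1/(2d)}$ at this small $n$.

```python
import numpy as np

def box_precluster(x, side):
    """Step 1: cluster id of each point = its cell in a grid with `side` cells per axis."""
    cells = np.minimum((x * side).astype(int), side - 1)
    return np.ravel_multi_index(tuple(cells.T), (side,) * x.shape[1])

def optimal_1d_clustering(values, k):
    """Optimal k-means in 1-D by dynamic programming; returns (total SSE, labels)."""
    order = np.argsort(values)
    v = values[order]
    m = len(v)
    s1 = np.concatenate([[0.0], np.cumsum(v)])
    s2 = np.concatenate([[0.0], np.cumsum(v * v)])

    def cost(i, j):  # squared error of the sorted segment v[i:j] around its mean
        tot = s1[j] - s1[i]
        return s2[j] - s2[i] - tot * tot / (j - i)

    best = np.full((k + 1, m + 1), np.inf)
    back = np.zeros((k + 1, m + 1), dtype=int)
    best[0, 0] = 0.0
    for q in range(1, k + 1):
        for j in range(q, m + 1):
            for i in range(q - 1, j):
                c = best[q - 1, i] + cost(i, j)
                if c < best[q, j]:
                    best[q, j], back[q, j] = c, i
    labels = np.empty(m, dtype=int)
    j = m
    for q in range(k, 0, -1):  # backtrack; label 0 is the lowest-effect group
        i = back[q, j]
        labels[order[i:j]] = q - 1
        j = i
    return best[k, m], labels

def pcm(x, ite, max_levels=10):
    """Pre-cluster and merge, steps 1-4 (step 5, the smoothing update, is omitted)."""
    n, d = x.shape
    side = max(1, int(round(n ** (1.0 / (2 * d)))))
    cid = box_precluster(x, side)
    _, inv = np.unique(cid, return_inverse=True)
    att = np.bincount(inv, weights=ite) / np.bincount(inv)   # step 2: cluster ATTs
    tau = np.log(n) / n ** (1.0 / (2 * d))                   # error threshold from the text
    for k in range(1, max_levels + 1):                       # step 3: smallest k under tau
        sse, lab = optimal_1d_clustering(att, k)
        if sse / len(att) <= tau:
            break
    level = lab[inv]                                         # merge clusters into levels
    mu = np.bincount(level, weights=ite, minlength=k) / np.bincount(level, minlength=k)
    return k, mu, level                                      # step 4: level effects

# Hypothetical demo: levels are vertical stripes, effects 0, 3, 6, ITE noise sd 1.
rng = np.random.default_rng(0)
n_demo = 10_000
x_demo = rng.random((n_demo, 2))
c_demo = (x_demo[:, 0] >= 1 / 3).astype(int) + (x_demo[:, 0] >= 2 / 3).astype(int)
ite_demo = 3.0 * c_demo + rng.normal(0.0, 1.0, n_demo)
k_hat, mu_hat, level_hat = pcm(x_demo, ite_demo)
print(k_hat, np.round(mu_hat, 2), (level_hat == c_demo).mean())
```

Passing ITEs in directly keeps the sketch independent of how the counterfactuals were estimated. The misassigned points in the demo are those in grid cells that straddle a stripe boundary, which is the impure-cluster effect analyzed in Section 3.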

3. ASYMPTOTIC CONSISTENCY: PROOF OF THEOREM 1

To prove consistency, we must make our assumptions precise. In some cases the assumptions are stronger than needed, for simplicity of exposition.

A1. The feature space $X$ is $[0,1]^d$ and the marginal feature-distribution is uniform, $D(x) = 1$. More generally, $X$ is compact and $D(x)$ is bounded, $0 < \delta \le D(x) \le \Delta$ (can be relaxed).

A2. The level $c$ is an unknown function of the feature $x$, $c = h(x)$. Potential effects depend only on $c$. Conditioning on $c$, effects are well separated. Let $\mu_c = E_D[v - \bar{v} \mid c]$. Then $|\mu_c - \mu_{c'}| \ge \kappa$ for $c \ne c'$.

A3. Define the subpopulation for level $c$ as $X_c = h^{-1}(c)$. Each subpopulation has positive measure, $P[x \in X_c] = \beta_c \ge \beta > 0$.

A4. For a treated subject $x_i$ with outcome $y_i$, it is possible to produce an unbiased estimate of the counterfactual outcome $\bar{y}_i$. Effectively, we are assuming an unbiased estimate of the individual treatment effect $\mathrm{ITE}_i = y_i - \bar{y}_i$ is available. Any causality analysis requires some estimate of counterfactuals and, in practice, one typically gets counterfactuals from the untreated subjects after controlling for confounders (5; 21).

A5. Sample averages concentrate. Essentially, the estimated ITEs are independent. This is true in practice because the subjects are independent and the counterfactual estimates use a predictor learned from the independent control population. For $m$ i.i.d. subjects, let the average of the estimated ITEs be $\hat{\nu}$ and the expectation of this average be $\nu$. Then $P[|\hat{\nu} - \nu| > \epsilon] \le e^{-\gamma m \epsilon^2}$. The parameter $\gamma > 0$ is related to distributional properties of the estimated ITEs. Higher-variance ITE estimates result in $\gamma$ being smaller. Concentration is a mild technical assumption requiring the estimated effects to be unbiased, well-behaved random variables to which a central limit theorem applies. Bounded effects or normally distributed effects suffice for concentration.

A6. The boundary between the subpopulations has small measure.
Essentially we require that two subjects that have very similar features will belong to the same level with high probability (the function $c = h(x)$ is not a "random" function). Again, this is a mild technical assumption which is taken for granted in practice. Let us make the assumption more precise. Define an $\varepsilon$-net to be a subdivision of $X$ into $(1/\varepsilon)^d$ disjoint hypercubes of side $\varepsilon$. A hypercube of an $\varepsilon$-net is impure if it contains points from multiple subpopulations. Let $N_{\text{impure}}$ be the number of impure hypercubes in an $\varepsilon$-net. Then $\varepsilon^d N_{\text{impure}} \le \alpha \varepsilon^{\rho}$, where $\rho > 0$ and $\alpha$ is a constant. Note, $d - \rho$ is the box-counting dimension of the boundary. In most problems, $\rho = 1$.

A7. We use box-clustering for the first step in the algorithm. Given $n$, define $\varepsilon(n) = 1/\lfloor n^{1/(2d)} \rfloor$. All points in a hypercube of an $\varepsilon(n)$-net form a cluster. Note that the number of clusters is approximately $\sqrt{n}$. The expected number of points in a cluster is $n\,\varepsilon(n)^d \approx \sqrt{n}$.

We prove Theorem 1 via a sequence of lemmas. The feature space $X = [0,1]^d$ is partitioned into levels $X_0, \ldots, X_{\ell-1}$, where $X_c = h^{-1}(c)$ is the set of points whose level is $c$. Define an $\varepsilon$-net that partitions $X$ into $N_\varepsilon = \varepsilon^{-d}$ hypercubes of equal volume $\varepsilon^d$, where $\varepsilon$ is the side-length of the hypercube. Set $\varepsilon = 1/\lfloor n^{1/(2d)} \rfloor$. Then $N_\varepsilon = \sqrt{n}\,(1 - O(d/n^{1/(2d)})) \sim \sqrt{n}$. Each hypercube in the $\varepsilon$-net defines a cluster for the pre-clustering stage. There are about $\sqrt{n}$ clusters and, since $D(x)$ is uniform, there are about $\sqrt{n}$ points in each cluster. Index the clusters in the $\varepsilon$-net by $j \in \{1, \ldots, N_\varepsilon\}$ and define $n_j$ as the number of points in cluster $j$. Formally, we have:

Lemma 1. Suppose $D(x) \ge \delta > 0$. Then $P[\min_j n_j \ge \tfrac{1}{2}\delta\sqrt{n}] > 1 - \sqrt{n}\exp(-\delta\sqrt{n}/8)$.

Proof. Fix a hypercube in the $\varepsilon$-net. Its volume is $\varepsilon^d \ge (1/n^{1/(2d)})^d = 1/\sqrt{n}$. A point lands in this hypercube with probability at least $\delta/\sqrt{n}$. Let $Y$ be the number of points in the hypercube, a sum of $n$ independent Bernoulli random variables with $E[Y] \ge \delta\sqrt{n}$. By a Chernoff bound,

$$P[Y < \delta\sqrt{n}/2] \le P[Y < E[Y]/2] < \exp(-E[Y]/8) \le \exp(-\delta\sqrt{n}/8).$$

By a union bound over the $N_\varepsilon$ clusters,

$$P[\text{some cluster has fewer than } \delta\sqrt{n}/2 \text{ points}] < N_\varepsilon \exp(-\delta\sqrt{n}/8) \le \sqrt{n}\exp(-\delta\sqrt{n}/8).$$
The lemma follows by taking the complement event. For uniform $D(x)$, $\delta = 1$ and every cluster has at least $\tfrac{1}{2}\sqrt{n}$ points with high probability. We can now condition on this high-probability event that every cluster is large. This means that a cluster's ATT is an average of many ITEs, which by A5 concentrates at the expected effect for the hypercube. Recall that the expected effect in level $c$ is defined as $\mu_c = E_D[v - \bar{v} \mid c]$. We can assume, w.l.o.g., that $\mu_0 < \mu_1 < \cdots < \mu_{\ell-1}$. Define $\nu_j$ as the expected average effect for points in hypercube $j$ and $\mathrm{ATT}_j$ as the average ITE for points in cluster $j$. Since every cluster is large, every cluster's $\mathrm{ATT}_j$ will be close to its expected average effect $\nu_j$. More formally:

Lemma 2. $P\big[\max_j |\mathrm{ATT}_j - \nu_j| \le 2\sqrt{\log n/(\gamma\delta\sqrt{n})}\big] \ge 1 - n^{-3/2} - \sqrt{n}\exp(-\delta\sqrt{n}/8)$.

Proof. Conditioning on $\min_j n_j \ge \tfrac{1}{2}\delta\sqrt{n}$ and using A5, we have

$$P\big[\,|\mathrm{ATT}_j - \nu_j| > 2\sqrt{\log n/(\gamma\delta\sqrt{n})} \;\big|\; \min_j n_j \ge \tfrac{1}{2}\delta\sqrt{n}\,\big] \le \exp(-2\log n) = 1/n^2.$$

By a union bound,

$$P\big[\max_j |\mathrm{ATT}_j - \nu_j| > 2\sqrt{\log n/(\gamma\delta\sqrt{n})} \;\big|\; \min_j n_j \ge \tfrac{1}{2}\delta\sqrt{n}\,\big] \le N_\varepsilon/n^2.$$

For any events $A$, $B$, by total probability, $P[A] \le P[A \mid B] + P[\bar{B}]$. Therefore,

$$P\big[\max_j |\mathrm{ATT}_j - \nu_j| > 2\sqrt{\log n/(\gamma\delta\sqrt{n})}\big] \le N_\varepsilon/n^2 + P[\min_j n_j < \tfrac{1}{2}\delta\sqrt{n}].$$

To conclude the proof, use $N_\varepsilon \le \sqrt{n}$ and Lemma 1.

A hypercube in the $\varepsilon$-net is homogeneous if it only contains points of one level (the hypercube does not intersect the boundary between levels). Let $N_c$ be the number of homogeneous hypercubes for level $c$ and $N_{\text{impure}}$ be the number of hypercubes that are not homogeneous, i.e., impure.

Lemma 3. $N_{\text{impure}} \le \alpha\varepsilon^{\rho} N_\varepsilon$ and $N_c \ge N_\varepsilon(\beta/\Delta - \alpha\varepsilon^{\rho})$.

Proof. A6 directly implies $N_{\text{impure}} \le \alpha\varepsilon^{\rho} N_\varepsilon$. Only the pure level-$c$ hypercubes or the impure hypercubes can contain points in level $c$. Using A3 and $\varepsilon^d = 1/N_\varepsilon$, we have

$$\beta \le P[x \in X_c] \le (N_c + N_{\text{impure}})\Delta\varepsilon^d \le (N_c + \alpha\varepsilon^{\rho} N_\varepsilon)\Delta/N_\varepsilon.$$

The result follows after rearranging the above inequality.

The main tools we need are Lemmas 2 and 3. Let us recap what we have. The cluster ATTs are close to the expected average effect in every hypercube. The number of impure hypercubes is an asymptotically negligible fraction of the hypercubes since $\varepsilon \in O(1/n^{1/(2d)})$. Each level has an asymptotically constant fraction of homogeneous hypercubes. This means that almost all cluster ATTs will be close to a level's expected effect, and every level will be well represented. Hence, if we optimally cluster the ATTs with fewer than $\ell$ clusters, we won't be able to get clustering error close to zero. With at least $\ell$ clusters, we will be able to get clustering error approaching zero. This is the content of the next lemma, which justifies step 3 in the algorithm. An optimal $k$-clustering of the cluster ATTs produces $k$ centers $\theta_1, \ldots, \theta_k$ and assigns each cluster $\mathrm{ATT}_j$ to a center $\theta(\mathrm{ATT}_j)$ so that the average clustering error

$$\mathrm{err}(k) = \sum_j (\mathrm{ATT}_j - \theta(\mathrm{ATT}_j))^2 / N_\varepsilon$$

is minimized. Given $k$, one can find an optimal $k$-clustering in $O(N_\varepsilon^2 k)$ time using $O(N_\varepsilon)$ space.

Lemma 4.
With probability at least $1 - n^{-3/2} - \sqrt{n}\exp(-\delta\sqrt{n}/8)$, optimal clustering of the ATTs with $\ell-1$ and $\ell$ clusters produces clustering errors which satisfy

$$\mathrm{err}(\ell-1) \ge (\beta/\Delta - \alpha\varepsilon^{\rho})\Big(\kappa/2 - 2\sqrt{\log n/(\gamma\delta\sqrt{n})}\Big)^2 \quad \text{for } \frac{\log n}{\sqrt{n}} < \frac{\kappa^2\gamma\delta}{16},$$

$$\mathrm{err}(\ell) \le \tfrac{1}{4}\alpha\varepsilon^{\rho}(\mu_{\ell-1} - \mu_0)^2 + 4\log n\,(1 + \alpha\varepsilon^{\rho})/(\gamma\delta\sqrt{n}).$$

Proof. With the stated probability, by Lemma 2, all ATTs are within $2\sqrt{\log n/(\gamma\delta\sqrt{n})}$ of the expected effect for their respective hypercube. This, together with Lemma 3, is enough to prove the bounds.

First, the upper bound on $\mathrm{err}(\ell)$. Choose cluster centers $\mu_0, \ldots, \mu_{\ell-1}$, the expected effect for each level. This may not be optimal, so it gives an upper bound on the cluster error. Each homogeneous hypercube has an expected effect which is one of these levels, and its ATT is within $2\sqrt{\log n/(\gamma\delta\sqrt{n})}$ of the corresponding $\mu$. Assign each ATT for a homogeneous hypercube to its corresponding $\mu$. The homogeneous hypercubes have total clustering error at most $4\log n\,(N_\varepsilon - N_{\text{impure}})/(\gamma\delta\sqrt{n})$. For an impure hypercube, the expected average effect is a convex combination of $\mu_0, \ldots, \mu_{\ell-1}$. Assign these ATTs to either $\mu_0$ or $\mu_{\ell-1}$, with an error at most $\big(2\sqrt{\log n/(\gamma\delta\sqrt{n})} + \tfrac{1}{2}(\mu_{\ell-1} - \mu_0)\big)^2$. Thus,

$$N_\varepsilon\,\mathrm{err}(\ell) \le \frac{4\log n\,(N_\varepsilon - N_{\text{impure}})}{\gamma\delta\sqrt{n}} + N_{\text{impure}}\Big(2\sqrt{\log n/(\gamma\delta\sqrt{n})} + \tfrac{1}{2}(\mu_{\ell-1} - \mu_0)\Big)^2 \le \frac{4\log n\,(N_\varepsilon + N_{\text{impure}})}{\gamma\delta\sqrt{n}} + N_{\text{impure}}(\mu_{\ell-1} - \mu_0)^2.$$

The upper bound follows after dividing by $N_\varepsilon$ and using $N_{\text{impure}} \le \alpha\varepsilon^{\rho} N_\varepsilon$.

Now, the lower bound on $\mathrm{err}(\ell-1)$. Consider any $(\ell-1)$-clustering of the ATTs with centers $\theta_0, \ldots, \theta_{\ell-2}$. At least $N_c \ge N_\varepsilon(\beta/\Delta - \alpha\varepsilon^{\rho})$ of the ATTs are within $2\sqrt{\log n/(\gamma\delta\sqrt{n})}$ of $\mu_c$. We also know that $\mu_{c+1} - \mu_c \ge \kappa$. Consider the $\ell$ disjoint intervals $[\mu_c - \kappa/2, \mu_c + \kappa/2]$. By the pigeonhole principle, at least one of these intervals, $[\mu_{c^*} - \kappa/2, \mu_{c^*} + \kappa/2]$, does not contain a center. Therefore all the ATTs associated to $\mu_{c^*}$ will incur an error at least $\kappa/2 - 2\sqrt{\log n/(\gamma\delta\sqrt{n})}$ when $\kappa/2 > 2\sqrt{\log n/(\gamma\delta\sqrt{n})}$.
The total error is

$$N_\varepsilon\,\mathrm{err}(\ell-1) \ge N_{c^*}\Big(\kappa/2 - 2\sqrt{\log n/(\gamma\delta\sqrt{n})}\Big)^2.$$

Using $N_{c^*} \ge N_\varepsilon(\beta/\Delta - \alpha\varepsilon^{\rho})$ and dividing by $N_\varepsilon$ concludes the proof.

Lemma 4 is crucial to estimating the number of levels. The error is at least $(\beta\kappa^2/4\Delta)(1 + o(1))$ for fewer than $\ell$ clusters and at most $\tfrac{1}{4}\alpha\varepsilon^{\rho}(\mu_{\ell-1} - \mu_0)^2(1 + o(1))$ for $\ell$ or more clusters. Any function $\tau(n)$ that asymptotically separates these two errors can serve as an error threshold. The function should be agnostic to the parameters $\alpha, \beta, \kappa, \Delta, \rho, \ldots$. In practice, $\rho = 1$ and, since $\varepsilon \sim 1/n^{1/(2d)}$, we have chosen $\tau(n) = \log n / n^{\rho/(2d)}$. Since $\mathrm{err}(\ell-1)$ is asymptotically constant, $\ell-1$ clusters can't achieve error $\tau(n)$ (asymptotically). Since $\mathrm{err}(\ell) \in O(\varepsilon^{\rho})$, $\ell$ clusters can achieve error $\tau(n)$ (asymptotically). Hence, choosing $\hat{\ell}$ as the minimum number of clusters that achieves error $\tau(n)$ will asymptotically output the correct number of clusters $\ell$, with high probability, proving part (1) of Theorem 1.

We now prove parts (2) and (3) of Theorem 1, which follow from the accuracy of steps 4 and 5 in the algorithm. We know the algorithm asymptotically selects the correct number of levels with high probability. We show that each level is populated mostly by the homogeneous clusters of that level.

Lemma 5. With probability at least $1 - n^{-3/2} - \sqrt{n}\exp(-\delta\sqrt{n}/8)$, asymptotically in $n$, all the $N_c$ ATTs from the homogeneous hypercubes of level $c$ are assigned to the same cluster in the optimal clustering, and no ATT from a different level's homogeneous hypercubes is assigned to this cluster.

Proof. Similar to the proof of Lemma 4, consider the $\ell$ disjoint intervals $[\mu_c - \kappa/4, \mu_c + \kappa/4]$. One center $\theta_c$ must be placed in each such interval, otherwise the clustering error is asymptotically constant, which is not optimal. All the ATTs for level $c$ are (as $n$ gets large) more than $\kappa/2$ away from any other center, and at most $\kappa/2$ away from $\theta_c$, which means all these ATTs get assigned to $\theta_c$.
Similar to Lemma 1, we can get a high-probability upper bound of $a\sqrt{n}$ on the maximum number of points in a cluster. Asymptotically, the number of points in the impure clusters is $n_{\text{impure}} \in O(\varepsilon^{\rho}\sqrt{n}\,N_\varepsilon)$. Suppose these impure points have expected average effect $\mu$ (a convex combination of the $\mu_c$'s). The number of points in level-$c$ homogeneous clusters is $n_c \in \Omega(\sqrt{n}\,N_\varepsilon)$. Even if all impure points are added to level $c$, the expected average effect for the points in level $c$ is

$$E[\mathrm{ITE} \mid \text{assigned to level } c] = \frac{n_{\text{impure}}\,\mu + n_c\,\mu_c}{n_{\text{impure}} + n_c} = \mu_c + O(\varepsilon^{\rho}). \qquad (3)$$

Part (2) of Theorem 1 follows from the next lemma after setting $\varepsilon \sim 1/n^{1/(2d)}$ and $\rho = 1$.

Lemma 6. Estimate $\hat{\mu}_c$ as the average ITE for all points assigned to level $c$ (the $c$th order statistic of the optimal centers $\theta_0, \ldots, \theta_{\hat{\ell}-1}$). Then $\hat{\mu}_c = \mu_c + O(\varepsilon^{\rho} + \sqrt{\log n/n})$ with probability $1 - o(1)$.

Proof. Apply a Chernoff bound. We are taking an average over a number of points proportional to $n$, with expectation given in (3). This average will approximate the expectation to within $\sqrt{\log n/n}$ with probability $1 - o(1)$. The details are very similar to the proof of Lemma 2, so we omit them.

Part (3) of Theorem 1 now follows because all but the $O(\varepsilon^{\rho})$ fraction of points in the impure clusters are assigned a correct expected effect.

An additional fine-tuning leads to as much as a 2× improvement in experiments. For each point, consider the $\varepsilon$-hypercube centered on that point. By a Chernoff bound, each of these $n$ hypercubes has $\Theta(\sqrt{n})$ points, as in Lemma 1. All but a fraction $O(\varepsilon^{\rho})$ of these hypercubes are pure. Assign each point to the center $\theta_c$ that best matches its hypercube-"smoothed" ITE, giving new subpopulations $X_c$ and corresponding subpopulation-effects $\hat{\mu}_c$. This EM-style update can be iterated. Our simulations show the results for one EM-style update.
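The fine-tuning update described above (smooth each point's ITE over the $\varepsilon$-hypercube centered on it, then reassign the point to the best-matching center) can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the neighbor search is a brute-force chunked $O(n^2)$ scan rather than an efficient data structure, and the demo uses invented stripe-shaped levels with effects 0, 1, 2, unit-variance ITE noise, and centers assumed to have been produced already by the merge step.

```python
import numpy as np

def smoothed_ite(x, ite, eps, chunk=500):
    """Average ITE over the points inside the eps-hypercube centred on each point."""
    n = len(x)
    out = np.empty(n)
    for s in range(0, n, chunk):
        # inside[i, j] is True when point j lies in the hypercube centred on point s+i
        inside = (np.abs(x[s:s + chunk, None, :] - x[None, :, :]) <= eps / 2).all(axis=2)
        out[s:s + chunk] = (inside @ ite) / inside.sum(axis=1)
    return out

def reassign(x, ite, centers):
    """One EM-style update: match smoothed ITEs to centers, then recompute level effects."""
    n, d = x.shape
    eps = n ** (-1.0 / (2 * d))
    sm = smoothed_ite(x, ite, eps)
    level = np.abs(sm[:, None] - centers[None, :]).argmin(axis=1)
    # assumes every level receives at least one point
    mu = np.array([ite[level == c].mean() for c in range(len(centers))])
    return level, mu

# Hypothetical demo: stripe-shaped levels, effects 0, 1, 2, ITE noise sd 1.
rng = np.random.default_rng(0)
n_pts = 2_000
x_pts = rng.random((n_pts, 2))
c_true = (x_pts[:, 0] >= 1 / 3).astype(int) + (x_pts[:, 0] >= 2 / 3).astype(int)
ite_pts = c_true + rng.normal(0.0, 1.0, n_pts)
centers = np.array([0.0, 1.0, 2.0])       # assumed output of the merge step

level_raw = np.abs(ite_pts[:, None] - centers[None, :]).argmin(axis=1)
level_smooth, mu_smooth = reassign(x_pts, ite_pts, centers)
acc_raw = (level_raw == c_true).mean()
acc_smooth = (level_smooth == c_true).mean()
print(f"raw ITE accuracy {acc_raw:.2f}, smoothed accuracy {acc_smooth:.2f}")
```

The comparison with `level_raw` shows why the smoothing matters: matching unsmoothed ITEs to the same centers is dominated by per-subject noise.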

4. DEMONSTRATION ON SYNTHETIC DATA

We use a 2-dimensional synthetic experiment with three levels to demonstrate our pre-cluster and merge algorithm (PCM). Alternatives to pre-clustering include state-of-the-art methods that directly predict the effect, such as meta-learners, and the Bayes optimal classifier based on ITEs. All methods used a base gradient boosting forest with 400 trees to estimate counterfactuals. The subpopulations in our experiment are shown in Figure 1, where black is effect-level 0, gray is level 1 and white is level 2. We present detailed results with n = 200K. Extensive results can be found in the appendix.

The treatment $t$ is distributed randomly between the subjects. The outcome $y$, conditioned on $c$ and $t$, is Gaussian with std. dev. 5: $y(t, c) \sim N(\mu_{(t,c)}, 5)$. The three subpopulations have treatment effects of 0, 1, 2. The expected potential outcomes for treatment and level $(t, c)$ are:

$$\mu_{(0,0)} = 0,\ \mu_{(1,0)} = 0, \qquad \mu_{(0,1)} = 0,\ \mu_{(1,1)} = 1, \qquad \mu_{(0,2)} = 0,\ \mu_{(1,2)} = 2.$$

Let us briefly describe the two existing benchmarks we will compare against.

X-learner (12) is a meta-learner that estimates heterogeneous treatment effects directly from ITEs. For the outcome and effect models of X-Learner we use a base gradient boosting learner with 400 estimators (6) implemented in scikit-learn (16). For the propensity model we use logistic regression.

Bayes Optimal uses the ITEs to reconstruct the subpopulations, given the number of levels and the ground-truth outcome distribution $y(t, c)$ from Figure 1. The Bayes optimal classifier is: $c_{\text{Bayes}} = 0$ if $\mathrm{ITE} \le 0.5$, $c_{\text{Bayes}} = 1$ if $0.5 < \mathrm{ITE} \le 1.5$, $c_{\text{Bayes}} = 2$ if $1.5 < \mathrm{ITE}$. We also use these thresholds to reconstruct subpopulations for X-learner's predicted ITEs. Note: neither the thresholds nor the number of levels are available in practice.
We compare the benchmark subpopulations reconstructed with these thresholds to further showcase the power of our algorithm's subpopulations, which outperform the competition without access to the forbidden information. Let $c_i$ be the level of subject $i$ and $\mathrm{ITE}_i$ the estimated ITE. Our algorithm is about 10× better than existing benchmarks even though we do not use the forbidden information (number of levels and optimal thresholds). It is also clear that X-learner is significantly better than Bayes optimal with just the raw ITEs. The next table shows subpopulation effects; again, red indicates the use of forbidden information on the number of levels and optimal thresholds. The ground truth effects are $\mu_0 = 0$, $\mu_1 = 1$, $\mu_2 = 2$. Note that $\hat{\mu}_1$ for X-learner and Bayes optimal are accurate, an artefact of knowing the optimal thresholds (not realizable in practice).

A detailed comparison of our algorithm (PCM) with X-Learner and Bayes optimal subpopulations is shown in Figure 2. PCM clearly extracts the correct subpopulations. X-Learner and Bayes optimal, even given the number of levels and optimal thresholds, do not come visually close to PCM. Note, X-learner does display some structure, but Bayes optimal on just the ITEs is a disaster. This is further illustrated in the ITE-histograms in the second row. PCM clearly shows three levels, whereas X-learner ITEs and the raw ITEs suggest just one high-variance level. The third row shows the confusion matrices for subpopulation assignment. The red indicates use of information forbidden in practice; however, we include it for comparison. The confusion matrix for PCM without forbidden information clearly dominates the other methods which use forbidden information. The high noise in the outcomes undermines the other methods, while PCM is robust. In high-noise settings, direct use of the ITEs without some form of pre-clustering fails.

Summary of experiments with synthetic data.
Our algorithm accurately extracts subpopulations at different effect-levels. Analysis of individual treatment effects fails when there is noise. Our experiments show that practice follows the theory (more detailed experiments, including how cluster homogeneity converges to 1, are shown in the appendix). We note that there is a curse of dimensionality, namely the convergence is at a rate $O(n^{-1/(2d)})$.
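To see why thresholding raw ITEs struggles at this noise level, the following sketch generates data with the outcome model stated above (Gaussian outcomes with std. dev. 5 and effects 0, 1, 2) and applies the thresholds 0.5 and 1.5 to per-subject ITEs. It is an illustration under assumptions and does not reproduce the paper's experiment: the stripe-shaped subpopulations stand in for the unspecified geometry of Figure 1, equal-sized levels are assumed, and the counterfactual is an independent oracle draw from the outcome model rather than a gradient boosting estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.random((n, 2))
# Hypothetical geometry: three equal vertical stripes (Figure 1's shapes are not reproduced)
c = (x[:, 0] >= 1 / 3).astype(int) + (x[:, 0] >= 2 / 3).astype(int)
t = rng.integers(0, 2, n)                      # treatment assigned at random

mu = np.array([[0.0, 0.0, 0.0],                # mu[t, c] for t = 0 (not treated)
               [0.0, 1.0, 2.0]])               # mu[t, c] for t = 1 (treated)
y = rng.normal(mu[t, c], 5.0)                  # observed (factual) outcome
y_cf = rng.normal(mu[1 - t, c], 5.0)           # oracle counterfactual draw (assumption)
ite = (y - y_cf) * (2 * t - 1)                 # ITE_i = (y_i - ybar_i)(2 t_i - 1)

# Thresholds from the text: level 0 if ITE <= 0.5, level 1 if 0.5 < ITE <= 1.5, else level 2
c_thresh = np.digitize(ite, [0.5, 1.5], right=True)
acc_thresh = (c_thresh == c).mean()
print(f"per-subject ITE std = {ite.std():.2f}, threshold accuracy = {acc_thresh:.3f}")
```

With per-subject noise several times larger than the gap between levels, the thresholded labels are close to uninformative in this setup, which is consistent with the single high-variance ITE histogram described for the raw ITEs.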

5. CONCLUSION

Our work extends the realm of causal analysis to non-targeted trials where the treated population can consist of large subpopulations with different effects. Our algorithm uses a plug-and-play pre-cluster and merge strategy that provably untangles the different effects. Experiments on synthetic data show a 10× or more improvement over existing HTE benchmarks. In our analysis, we did not attempt to optimize the rate of convergence. Optimizing this rate could lead to improved algorithms. Our work allows causal-effect analysis to be used in settings such as health interventions, where wide deployment over a mostly healthy population would mask the effect on the sick population. Our methods can seamlessly untangle the effects without knowledge of what sick and healthy mean. This line of algorithms can also help in identifying inequities between the subpopulations. One significant contribution is to reduce the untangling of subpopulation effects to a one-dimensional clustering problem, which we solve efficiently. This approach may be of independent interest beyond causal-effect analysis. The effect is just a function that takes on ℓ levels. Our approach can be used to learn any function that takes on a finite number of levels. It could also be used to learn a piecewise approximation to an arbitrary continuous function on a compact set.

D CLUSTER HOMOGENEITY

To further show how practice reflects the theory, we plot average cluster homogeneity versus n. The cluster homogeneity is the fraction of points in a cluster that are from its majority level. Our entire methodology relies on the pre-clustering step producing a vast majority of homogeneous clusters. The rapid convergence to homogeneous clusters enables us to identify the correct subpopulations and the corresponding effects via pre-cluster and merge. 
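The homogeneity measure defined above can be computed as in the following minimal sketch. The function names are hypothetical, and the averaging is assumed to be an unweighted mean over non-empty clusters; the text does not say whether clusters are weighted by size.

```python
from collections import Counter


def cluster_homogeneity(levels):
    """Fraction of points in one cluster that belong to its majority level."""
    if not levels:
        raise ValueError("cluster must contain at least one point")
    majority_count = Counter(levels).most_common(1)[0][1]
    return majority_count / len(levels)


def average_homogeneity(cluster_ids, levels):
    """Unweighted mean of per-cluster homogeneity (assumed averaging rule)."""
    by_cluster = {}
    for cid, lvl in zip(cluster_ids, levels):
        by_cluster.setdefault(cid, []).append(lvl)
    scores = [cluster_homogeneity(v) for v in by_cluster.values()]
    return sum(scores) / len(scores)


# Toy example: cluster 0 mixes two levels, cluster 1 is pure.
example_clusters = [0, 0, 0, 1, 1, 1, 1]
example_levels = [0, 0, 1, 2, 2, 2, 2]
example_average = average_homogeneity(example_clusters, example_levels)
```

In the toy example, cluster 0 has homogeneity 2/3 and cluster 1 has homogeneity 1, so the unweighted average is 5/6.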



Figure 1: Subpopulations for synthetic data.



Figure 2: Top row. PCM reconstructs superior subpopulations without access to the forbidden information used by X-learner and Bayes optimal (number of levels and optimal thresholds). Middle row. The ITE histogram for PCM clearly shows 3 distinct effects, while the other methods suggest a single high-variance effect. Bottom row. Subpopulation confusion matrices show that PCM extracts the correct subpopulations. The other methods fail even with the forbidden information.

ITE HISTOGRAMS

We show the ITE histograms for n ∈ {20K, 200K, 2M}.

DIFFERENT PRE-CLUSTERING METHODS

We show the reconstructed subpopulations and effect errors for different pre-clustering methods. Box-clustering without any E-M step is also provably consistent. Our algorithm PCM uses box-clustering followed by an E-M step to improve the subpopulations using smoothed ITEs. We also show K-means pre-clustering, for which we did not prove any theoretical guarantees.
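The idea of box pre-clustering with smoothed ITEs can be illustrated with the rough sketch below. It is not the paper's implementation: the features are assumed to lie in [0, 1]^d, the grid resolution `bins` and the threshold `gap` are hypothetical parameters, and the merge rule (start a new level wherever sorted box means jump by more than `gap`) is an illustrative stand-in for the merge and E-M steps.

```python
from collections import defaultdict


def box_precluster(points, bins):
    """Assign each point in [0, 1]^d to a grid box with `bins` cells per axis."""
    return [tuple(min(int(x * bins), bins - 1) for x in p) for p in points]


def smoothed_ites(box_ids, raw_ites):
    """Average the noisy raw ITEs inside each box."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for box, ite in zip(box_ids, raw_ites):
        sums[box] += ite
        counts[box] += 1
    return {box: sums[box] / counts[box] for box in sums}


def merge_boxes(box_means, gap):
    """1-D merge: sort box means and start a new level at each jump > gap."""
    ordered = sorted(box_means.items(), key=lambda kv: kv[1])
    level_of_box = {}
    level = 0
    for i, (box, mean) in enumerate(ordered):
        if i > 0 and mean - ordered[i - 1][1] > gap:
            level += 1
        level_of_box[box] = level
    return level_of_box


# Toy 1-D example: two low-effect boxes and two high-effect boxes.
points = [(0.05,), (0.15,), (0.30,), (0.40,), (0.60,), (0.90,)]
raw_ites = [-0.2, 0.2, 0.1, -0.1, 1.2, 0.8]
box_ids = box_precluster(points, bins=4)
levels = merge_boxes(smoothed_ites(box_ids, raw_ites), gap=0.5)
```

Averaging within boxes is what reduces the noise: the individual raw ITEs in the toy example overlap heavily, while the box means separate into two groups that a one-dimensional merge can distinguish.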

The error for subject i is |µ_{c_i} − ITE_i|, and we report the mean absolute error in the table below. Our algorithm predicts a level ĉ_i and uses its associated effect µ̂_{ĉ_i} as ITE_i. The other methods predict the ITE directly, and we compute the mean absolute error of those predictions. As mentioned above, we also show the error for the optimally reconstructed subpopulations, which is not possible in practice but is included for comparison (red emphasizes that the information is not available in practice).
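The error metric described above can be sketched as follows. The function names and the toy numbers are hypothetical; only the ground-truth effects 0, 1, 2 come from the synthetic setup described in the text.

```python
def ites_from_levels(predicted_levels, estimated_level_effects):
    """Level-based prediction: each subject gets the effect of its predicted level."""
    return [estimated_level_effects[c] for c in predicted_levels]


def mean_absolute_effect_error(true_levels, estimated_ites, true_level_effects):
    """Mean over subjects of |mu_{c_i} - ITE_i|."""
    errors = [
        abs(true_level_effects[c] - ite)
        for c, ite in zip(true_levels, estimated_ites)
    ]
    return sum(errors) / len(errors)


# Ground-truth effects of the three levels in the synthetic experiment.
true_effects = {0: 0.0, 1: 1.0, 2: 2.0}

# Hypothetical toy predictions: the last subject is assigned the wrong level.
true_levels = [0, 1, 2, 1]
predicted_levels = [0, 1, 2, 2]
estimated_effects = {0: 0.1, 1: 0.9, 2: 2.0}

ites = ites_from_levels(predicted_levels, estimated_effects)
mae = mean_absolute_effect_error(true_levels, ites, true_effects)
```

For a method that predicts ITEs directly, the same `mean_absolute_effect_error` call applies with its raw predictions passed as `estimated_ites`.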

A APPENDIX

We provide more detailed experimental results, specifically results for different n (20K, 200K and 2M) and a comparison of different clustering methods in the pre-clustering phase: box-only, PCM (box plus 1 step of E-M improvement) and K-means. To calculate the counterfactual for treated subjects, we train a gradient boosted forest on the control population. Even with just 20K points in this very noisy setting, PCM is able to extract some meaningful subpopulation structure, while none of the other methods can.

