GENERALIZATION BOUNDS AND ALGORITHMS FOR ESTIMATING THE EFFECT OF MULTIPLE TREATMENTS AND DOSAGE

Anonymous authors
Paper under double-blind review

Abstract

Estimating conditional treatment effects has been a longstanding challenge in fields such as epidemiology and economics, which require a treatment-dosage pair to make decisions but may be unable to run the randomized trials needed to precisely quantify their effect, whether due to financial restrictions or ethical considerations. In the context of representation learning, there is an extensive literature relating model architectures and regularization techniques for solving this problem with observational data. However, theoretically motivated loss functions and bounds on generalization errors exist only in selected circumstances, such as in the presence of binary treatments. In this paper, we introduce new bounds on the counterfactual generalization error in the context of multiple treatments and continuous dosage parameters, which subsume existing results. This result, in a principled manner, guides the definition of new learning objectives that can be used to train representation learning algorithms. We empirically demonstrate new state-of-the-art performance across several benchmark datasets for this problem, including in comparison with doubly-robust estimation methods.

1. INTRODUCTION

Treatment effect estimation is the problem of predicting the effect of an intervention (e.g., a treatment-dosage pair) on an outcome of interest to guide decision-making. The challenge for prediction models is to learn this map from observational data, which is formally generated from a different structural causal model in which treatment assignment varies according to an individual's covariates, instead of being fixed by the decision-maker. Counterfactuals define the outcome that would have been observed had the assigned treatment been different. For concreteness, consider designing a policy for the administration of chemotherapy regimens: not all cancer patients in the available data are equally likely to be offered the same type and dosage, with various factors, e.g., age and wealth, involved in the decision-making process. A new treatment combination for a given patient is thus a data point that is invariably under-represented in the empirical distribution of the data. Treatment effect estimation is studied under a wide range of assumptions, including experimental designs that feature ignorability (Imbens, 2000; Imai & Van Dyk, 2004), multiple treatments, sequential decision-making problems, and different generative models encoded in general causal graphs (Pearl, 2009). There is a growing literature on several parts of this problem in the field of machine learning that attempts to define loss functions conducive to learning representations of covariates predictive of both observed and counterfactual outcomes. Existing methods can be broadly categorized by the theoretical guarantees that inspire their training objectives, driven either by bounds on the generalization error or by doubly-robust guarantees. In the first line of research, Shalit et al. (2017); Johansson et al.
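As a toy illustration of the confounded generative model described above (all variable names and functional forms here are hypothetical, chosen only to make the setting concrete), observational data can be simulated so that both treatment type and dosage depend on a covariate, while the outcome depends on all three:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(20, 80, size=n)  # covariate influencing assignment

# Confounding: older patients are more likely to receive treatment 1,
# and tend to receive lower dosages.
p_treat1 = 1.0 / (1.0 + np.exp(-(age - 50) / 10))
t = rng.binomial(1, p_treat1)                          # treatment type
d = np.clip(rng.normal(1.0 - age / 100, 0.1), 0.0, 1.0)  # dosage in [0, 1]

# Outcome depends on covariate, treatment, and dosage.
y = 0.5 * t * d + 0.01 * age + rng.normal(scale=0.05, size=n)
```

In such data, naively regressing `y` on `(t, d)` conflates the effect of the treatment-dosage pair with that of the covariate, which is exactly the bias that the methods discussed below aim to remove.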
(2020) showed in the binary treatment setting that the counterfactual error, which is not computable from data by design, can instead be bounded by the in-sample error plus a term that quantifies the difference in distributions between treated and untreated populations, leading to a differentiable loss function that can be used to train expressive neural networks. Several papers used this insight to investigate different neural network architectures for this problem: for example, Johansson et al. (2016) proposed separate feed-forward prediction heads on top of a common representation, Zhang et al. (2022) use transformers, and De Brouwer et al. (2022); Seedat et al. (2022) use neural differential equations. In turn, doubly-robust estimators combine expressive function approximators and inverse probability weighting, leveraging the non-parametric asymptotic statistical guarantees of both estimators (Funk et al., 2011; Kennedy, 2016; 2020). In particular, when the direct estimate of the outcome is biased, such as when using nonparametric or high-dimensional regression, the doubly-robust estimator weights the model residuals by inverse propensity weights in order to remove the bias. Its convergence and consistency for treatment effect estimation require only that one of the two estimators is consistent. In principle, any consistent function approximator could be used, which in the context of neural networks has led to several adaptations of loss functions and architectures. For example, Shi et al. (2019) adapted the architecture of Johansson et al. (2016) for this purpose by introducing targeted regularization, and Nie et al. (2020) proposed varying coefficient networks in the context of continuously-valued dosage parameters. In both cases, however, the authors provide guarantees for population average treatment effect estimation, in contrast with conditional average treatment effect estimation. Despite the generality of these results, no guarantees and no theoretically motivated loss functions exist for learning representations for counterfactual estimation in the general setting of multiple treatment types and/or continuous treatment values or dosages.
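To make the doubly-robust construction concrete, the following is a minimal sketch of an AIPW (augmented inverse probability weighting) estimate of the average treatment effect for a binary treatment. It is an illustration of the general technique, not the estimator of any of the cited works; the choice of linear and logistic models is an assumption made purely for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(X, t, y):
    """Doubly-robust (AIPW) estimate of the average treatment effect
    for a binary treatment t. The estimate remains consistent if either
    the outcome models or the propensity model is correctly specified."""
    # Outcome models fitted separately on treated and control units.
    mu1 = LinearRegression().fit(X[t == 1], y[t == 1])
    mu0 = LinearRegression().fit(X[t == 0], y[t == 0])
    # Propensity model e(x) = P(T = 1 | X = x).
    e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
    m1, m0 = mu1.predict(X), mu0.predict(X)
    # Direct estimates, corrected by inverse-propensity-weighted residuals:
    # the correction removes the bias of a misspecified outcome model.
    psi1 = m1 + t * (y - m1) / e
    psi0 = m0 + (1 - t) * (y - m0) / (1 - e)
    return np.mean(psi1 - psi0)
```

Note that the correction term vanishes in expectation when the outcome models are exact, so the weighting only intervenes to the extent that the direct estimates leave residual bias.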
The challenge in the context of representation learning is that there is no notion of a treatment group, as each individual may be assigned a different and potentially unique treatment value. Lack of overlap in finite samples, and the consequently large estimation variance for counterfactual predictions, is exacerbated in this setting to the extreme that adjustments for distributional differences are, in principle, not applicable. In particular, the intuition for reducing variance by regularization deviates from previous proposals (which regularize representations of covariates to match distributions among groups with different treatment types (Shalit et al., 2017)), as a potentially infinite set of counterfactuals for each individual must be considered. Even the analysis of multiple categorical treatments is currently an open question: while pairwise comparisons between treatment-specific distributions could be implemented in principle, it is not computationally tractable to do so in practice. At present, only heuristic neural network architectures have been proposed for this problem, including Dose Response networks that consist of multi-task layers for dosage sub-intervals defined on top of a common representation (Schwab et al., 2020), variants of generative adversarial networks (Bica et al., 2020), and varying coefficient networks (Nie et al., 2020). In this paper, we investigate the design of representation learning-based algorithms for predicting (conditional average) treatment effects in the context of multiple treatments and continuous dosage parameters. Our analysis starts by extending definitions of loss and generalization error to this broader setting, over all possible treatment-dosage pairs.
We then show, using the definition of integral probability metrics, that the generalization error can be bounded by a term that is computable from data, involving the factual error plus a term that quantifies the statistical dependence between the treatment-dosage pair of random variables and the observed confounders. In principle, any treatment space on which we can define a probability measure is consistently accounted for, which gives well-defined bounds on the generalization error for treatments with multiple types and continuous values; in particular, our bound includes as a special case existing guarantees for the binary treatment setting (Shalit et al., 2017). This bound suggests new training objectives for learning representations conducive to counterfactual estimation. Moreover, such objectives are tractable: they avoid both a combinatorial number of pairwise comparisons and the binning of dosage values into sub-intervals. A further contribution is the design of extensive numerical comparisons between methods driven by bounds on the generalization error (which typically target conditional average treatment effects) and methods driven by doubly-robust guarantees (which typically target average treatment effects). Moreover, we do so independently of the adopted neural architecture, which provides the first analysis of different objectives for the problem of treatment effect estimation with multiple, continuously-valued treatments. We hope these results give insight into the trade-offs of different approaches to this problem and demonstrate the ability of representation learning techniques to tackle wider-ranging scenarios within treatment effect estimation.
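A minimal sketch of what such a training objective could look like, under assumptions of our own choosing (a kernel MMD as the integral probability metric, and a dependence penalty that compares the joint sample of representations and treatment-dosage pairs against a permuted sample approximating the product of marginals; the function names `rbf_mmd2` and `counterfactual_objective` are hypothetical):

```python
import numpy as np

def rbf_mmd2(a, b, sigma=1.0):
    """Biased estimate of the squared MMD between samples a and b
    under an RBF kernel; nonnegative by construction."""
    def k(x, y):
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2.0 * k(a, b).mean()

def counterfactual_objective(phi_x, td, y_pred, y, alpha=1.0, rng=None):
    """Factual squared error plus an IPM term that penalizes statistical
    dependence between the representation phi(x) and the treatment-dosage
    pair (t, d): MMD between the observed joint sample and a sample from
    the product of marginals, obtained by permuting the (t, d) rows."""
    rng = rng or np.random.default_rng()
    factual = np.mean((y_pred - y) ** 2)
    joint = np.concatenate([phi_x, td], axis=1)
    indep = np.concatenate([phi_x, td[rng.permutation(len(td))]], axis=1)
    return factual + alpha * rbf_mmd2(joint, indep)
```

The key design point mirrored from the text: the penalty is a single comparison between one joint distribution and one product of marginals, so its cost does not grow with the number of treatment types and requires no discretization of the dosage.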

2. BACKGROUND

We start by introducing the notation and definitions used throughout the paper. In particular, we use capital letters for random variables (X), small letters for their values (x), bold letters for sets of variables (X) and their values (x), and Ω for the spaces where they are defined (Ω_X) if not explicitly stated. To simplify notation, we consistently use the shorthand P(x) to represent probabilities or


