VCNET AND FUNCTIONAL TARGETED REGULARIZATION FOR LEARNING CAUSAL EFFECTS OF CONTINUOUS TREATMENTS

Abstract

Motivated by the rising abundance of observational data with continuous treatments, we investigate the problem of estimating the average dose-response function (ADRF). Available parametric methods are limited in their model space, and previous attempts to leverage neural networks for greater expressiveness relied on partitioning the continuous treatment into blocks and using a separate head for each block; in practice, however, this produces discontinuous ADRFs. The question of how to adapt the structure and training of neural networks to estimate ADRFs therefore remains open. This paper makes two important contributions. First, we propose a novel varying coefficient neural network (VCNet) that improves model expressiveness while preserving the continuity of the estimated ADRF. Second, to improve finite-sample performance, we generalize targeted regularization to obtain a doubly robust estimator of the whole ADRF curve.

1. INTRODUCTION

Continuous treatments arise in many fields, including healthcare, public policy, and economics. With the widespread accumulation of observational data, estimating the average dose-response function (ADRF) while correcting for confounders has become an important problem (Hirano & Imbens, 2004; Imai & Van Dyk, 2004; Kennedy et al., 2017; Fong et al., 2018). Recently, papers in causal inference (Johansson et al., 2016; Alaa & van der Schaar, 2017; Shalit et al., 2017; Schwab et al., 2019; Farrell et al., 2018; Shi et al., 2019) have utilized feed-forward neural networks for modeling. The success of neural network models lies in the fact that, unlike traditional parametric models, neural networks are very flexible in modeling complex causal relationships, as shown by the universal approximation theorem (Csáji et al., 2001). Also, unlike traditional non-parametric models, neural networks have been shown to be powerful when dealing with high-dimensional input (e.g., Masci et al. (2011); Johansson et al. (2016)), which implies their potential for dealing with high-dimensional confounders. A successful application of neural networks to causal inference requires a specially designed network structure that distinguishes the treatment variable from the other covariates, since otherwise the treatment information might be lost in the high-dimensional latent representation (Shalit et al., 2017). However, most existing network structures are designed for binary treatments and are difficult to generalize to treatments taking value in a continuum. For example, Shalit et al. (2017); Louizos et al. (2017); Schwab et al. (2019); Shi et al. (2019) used separate prediction heads for the two treatment options, and this structure is not directly applicable to continuous treatments, as there is an infinite number of treatment levels. To deal with a continuous treatment, recent work (Schwab et al., 2019) proposed a modification called DRNet.
DRNet partitions a continuous treatment into blocks and, for each block, trains a separate head in which the treatment is concatenated into each hidden layer (see Figure 2). Despite the improvements made by this building block, the structure does not take the continuity of the ADRF (Prichard & Gillam, 1971; Schneider et al., 1993; Threlfall & English, 1999) into account, and it produces discontinuous ADRF estimators in practice (see Figure 1). We propose a new network building block that strengthens the influence of the treatment while preserving the continuity of the ADRF. Under binary treatment, previous neural network models for treatment effect estimation use separate prediction heads to model the mapping from covariates to (expected) outcome under different treatment levels. In the continuous treatment case, by the continuity of the ADRF, this mapping should change continuously with respect to the treatment. To achieve this, motivated by the varying coefficient model (Hastie & Tibshirani (1993), Fan et al. (1999), Chiang et al. (2001)), one can allow the weights of the prediction head to be continuous functions of the treatment. This is the first contribution of this paper, called the Varying Coefficient Network (VCNet). In VCNet, as long as the activation function is continuous, the mapping defined by the network not only automatically produces continuous ADRF estimators, as shown in Figure 1, but also prevents the treatment information from being lost in the high-dimensional latent representation. The second contribution of this paper is to generalize targeted regularization (Shi et al., 2019) to obtain a doubly robust estimator of the whole ADRF curve, which improves finite-sample performance. Targeted regularization was previously used for estimating a scalar quantity (Shi et al., 2019) and associates an extra perturbation parameter with the scalar quantity of interest.
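The varying-coefficient idea behind VCNet can be sketched in a few lines of numpy. The paper's implementation uses B-spline bases; the polynomial basis and all class/function names below are simplifications for illustration only, not the authors' code. The key point is that the weight matrix is itself a continuous function of the treatment $t$, so the induced prediction (and hence the estimated ADRF) is continuous in $t$ whenever the activation is continuous.

```python
import numpy as np

def poly_basis(t, degree=2):
    # Basis functions in the treatment t. The paper uses B-splines; a
    # polynomial basis is used here purely for illustration.
    return np.array([t ** k for k in range(degree + 1)])

class VaryingCoefficientLayer:
    # A linear layer whose weight matrix varies continuously with the
    # treatment t: W(t) = sum_k basis_k(t) * A_k, where the A_k are the
    # learnable coefficient matrices (hypothetical names, not from the paper).
    def __init__(self, in_dim, out_dim, degree=2, seed=0):
        rng = np.random.default_rng(seed)
        self.degree = degree
        # one coefficient matrix A_k per basis function
        self.A = rng.normal(scale=0.1, size=(degree + 1, out_dim, in_dim))

    def __call__(self, z, t):
        # contract the basis vector against the stack of matrices: (out, in)
        W_t = np.tensordot(poly_basis(t, self.degree), self.A, axes=1)
        return W_t @ z  # output depends continuously on t
```

Because `W_t` is a smooth function of `t`, nearby treatment levels yield nearby outputs, in contrast to a piecewise multi-head design, whose outputs can jump at block boundaries.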
While adapting it to a finite-dimensional vector is not difficult, generalizing it to a curve is far less straightforward. Difficulties arise from the fact that the ADRF cannot be regularized at each treatment level independently, because the number of possible levels is infinite and thus the model complexity cannot be controlled when infinitely many extra perturbation parameters are introduced. Utilizing the continuity (and smoothness) of the ADRF (Schneider et al., 1993; Threlfall & English, 1999), we introduce smoothing to control model complexity; the model size increases in a specific manner to balance model complexity and regularization strength. Moreover, the original targeted regularization in Shi et al. (2019) is not guaranteed to yield a doubly robust estimator. By allowing the regularization strength to depend on the sample size, we obtain a consistent and doubly robust estimator under mild assumptions. Noticing the connection between targeted regularization and TMLE (Van Der Laan & Rubin, 2006), a by-product of this work is the first (to the best of our knowledge) generalization of TMLE to estimating a function. We run experiments on both synthetic and semi-synthetic datasets, finding that VCNet and targeted regularization boost performance independently; using them jointly consistently achieves state-of-the-art performance.

Notation. We denote the Dirac delta function by $\delta(\cdot)$. We use $\mathbb{E}$ to denote expectation, $P$ to denote the population probability measure, and we write $P(f) = \int f(z)\,dP(z)$. Similarly, we denote by $P_n$ the empirical measure and write $P_n(f) = \int f(z)\,dP_n(z)$. We denote by $\lceil n \rceil$ the least integer greater than or equal to $n$, and by $\lfloor n \rfloor$ the greatest integer less than or equal to $n$. We use $\tau$ to denote Rademacher random variables. We denote the Rademacher complexity of a function class $\mathcal{F}: \mathcal{X} \to \mathbb{R}$ by $\mathrm{Rad}_n(\mathcal{F}) = \mathbb{E}\big[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \tau_i f(X_i)\big]$. Given two functions $f_1, f_2 : \mathcal{X} \to \mathbb{R}$, we define $\|f_1 - f_2\|_\infty = \sup_{x \in \mathcal{X}} |f_1(x) - f_2(x)|$ and $\|f_1 - f_2\|_{L_2} = \big( \int_{x \in \mathcal{X}} (f_1(x) - f_2(x))^2 \, dx \big)^{1/2}$. For a function class $\mathcal{F}$, we define $\|\mathcal{F}\|_\infty = \sup_{f \in \mathcal{F}} \|f\|_\infty$. We denote stochastic boundedness by $O_p$.
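The functional generalization of targeted regularization described above can be sketched as follows. This is an illustrative numpy fragment, not the paper's exact objective: the scalar perturbation of Shi et al. (2019) is replaced by a smooth function $\varepsilon(t)$ expanded in a spline basis, and the regularization strength may depend on the sample size. The truncated-linear basis and all names (`targeted_reg_term`, `eps_coef`, etc.) are hypothetical simplifications.

```python
import numpy as np

def spline_basis(t, knots):
    # Hypothetical truncated-linear spline basis for the perturbation eps(t);
    # a B-spline basis would be the more standard choice in practice.
    return np.concatenate(([1.0, t], np.maximum(t - knots, 0.0)))

def targeted_reg_term(y, t, q_pred, pi_pred, eps_coef, knots, beta):
    # Illustrative functional targeted regularization term.
    # y, t       : outcomes and treatments, shape (n,)
    # q_pred     : outcome-model predictions Q(t_i, x_i)
    # pi_pred    : conditional density estimates pi(t_i | x_i)
    # eps_coef   : spline coefficients of the perturbation function eps(t)
    # beta       : regularization strength (may depend on the sample size n)
    eps_t = np.array([spline_basis(ti, knots) @ eps_coef for ti in t])
    resid = y - (q_pred + eps_t / pi_pred)  # fluctuated outcome model
    return beta * np.mean(resid ** 2)
```

The term penalizes the residual of the fluctuated prediction $Q(t, x) + \varepsilon(t)/\pi(t \mid x)$; restricting $\varepsilon$ to a finite spline basis is what keeps the model complexity controlled while still perturbing the whole curve.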




Figure 1: Estimated ADRF on the testing set from a typical run of VCNet and DRNet. From left to right, the panels show results on the simulation, IHDP, and News datasets. Both VCNet and DRNet are well optimized. Blue points denote the DRNet estimate and red points the VCNet estimate; the truth is shown as a solid yellow line.

