VCNET AND FUNCTIONAL TARGETED REGULARIZATION FOR LEARNING CAUSAL EFFECTS OF CONTINUOUS TREATMENTS

Abstract

Motivated by the rising abundance of observational data with continuous treatments, we investigate the problem of estimating the average dose-response curve (ADRF). Available parametric methods are limited in their model space, and previous attempts to leverage neural networks for greater model expressiveness relied on partitioning the continuous treatment into blocks and using separate heads for each block; in practice, however, this produces discontinuous ADRFs. Therefore, the question of how to adapt the structure and training of neural networks to estimate ADRFs remains open. This paper makes two important contributions. First, we propose a novel varying coefficient neural network (VCNet) that improves model expressiveness while preserving the continuity of the estimated ADRF. Second, to improve finite sample performance, we generalize targeted regularization to obtain a doubly robust estimator of the whole ADRF curve.

1. INTRODUCTION

Continuous treatments arise in many fields, including healthcare, public policy, and economics. With the widespread accumulation of observational data, estimating the average dose-response function (ADRF) while correcting for confounders has become an important problem (Hirano & Imbens, 2004; Imai & Van Dyk, 2004; Kennedy et al., 2017; Fong et al., 2018). Recently, papers in causal inference (Johansson et al., 2016; Alaa & van der Schaar, 2017; Shalit et al., 2017; Schwab et al., 2019; Farrell et al., 2018; Shi et al., 2019) have utilized feed-forward neural networks for modeling. The success of neural network models lies in the fact that, unlike traditional parametric models, neural networks are very flexible in modeling complex causal relationships, as shown by the universal approximation theorem (Csáji et al., 2001). Also, unlike traditional non-parametric models, neural networks have been shown to be powerful when dealing with high-dimensional input (e.g., Masci et al. (2011); Johansson et al. (2016)), which implies their potential for dealing with high-dimensional confounders. A successful application of neural networks to causal inference requires a specially designed network structure that distinguishes the treatment variable from the other covariates, since otherwise the treatment information might be lost in the high-dimensional latent representation (Shalit et al., 2017). However, most existing network structures are designed for binary treatments and are difficult to generalize to treatments taking values in a continuum. For example, Shalit et al. (2017); Louizos et al. (2017); Schwab et al. (2019); Shi et al. (2019) used separate prediction heads for the two treatment options; this structure is not directly applicable to continuous treatments, as there is an infinite number of treatment levels. To deal with a continuous treatment, recent work (Schwab et al., 2019) proposed a modification called DRNet.
DRNet partitions a continuous treatment into blocks and, for each block, trains a separate head in which the treatment is concatenated into each hidden layer (see Figure 2). Despite the improvements made by this building block, the structure does not take the continuity of the ADRF (Prichard & Gillam, 1971; Schneider et al., 1993; Threlfall & English, 1999) into account, and it produces discontinuous ADRF estimators in practice (see Figure 1). We propose a new network building block that strengthens the influence of the treatment while preserving the continuity of the ADRF. Under binary treatment, previous neural network models for treatment effect estimation use separate prediction heads to model the mapping from covariates to (expected) outcome under different treatment levels. In the continuous treatment case, by the continuity of the ADRF, this mapping should change continuously with respect to the treatment. To achieve this, motivated by the varying coefficient model (Hastie & Tibshirani (1993), Fan et al. (1999), Chiang et al. (2001)), one can allow the weights of the prediction head to be continuous functions of the treatment. This is the first contribution here, called the Varying Coefficient Network (VCNet). In VCNet, once the activation function is continuous, the mapping defined by the network automatically produces continuous ADRF estimators, as shown in Figure 1, while also preventing the treatment information from being lost in the high-dimensional latent representation. The second contribution of this paper is to generalize targeted regularization (Shi et al., 2019) to obtain a doubly robust estimator of the whole ADRF curve, which improves finite sample performance. Targeted regularization was previously used for estimating a scalar quantity (Shi et al., 2019), and it associates an extra perturbation parameter with the scalar quantity of interest.
While adapting it to a finite-dimensional vector is not difficult, generalization to a curve is far less straightforward. Difficulties arise from the fact that the ADRF cannot be regularized at each treatment level independently: the number of possible levels is infinite, and thus the model complexity cannot be controlled if infinitely many extra perturbation parameters are introduced. Utilizing the continuity (and smoothness) of the ADRF (Schneider et al., 1993; Threlfall & English, 1999), we introduce smoothing to control model complexity. Its model size increases in a specific manner to balance model complexity and regularization strength. Moreover, the original targeted regularization in Shi et al. (2019) is not guaranteed to yield a doubly robust estimator. By allowing the regularization strength to depend on the sample size, we obtain a consistent and doubly robust estimator under mild assumptions. Noticing the connection between targeted regularization and TMLE (Van Der Laan & Rubin, 2006), a by-product of this work is that we give the first (to the best of our knowledge) generalization of TMLE to estimating a function. We conduct experiments on both synthetic and semi-synthetic datasets, finding that VCNet and targeted regularization boost performance independently. Using them jointly consistently achieves state-of-the-art performance.

Notation. We denote the Dirac delta function by δ(·). We use E to denote expectation, P to denote the population probability measure, and we write P(f) = ∫ f(z) dP(z). Similarly, we denote P_n as the empirical measure and write P_n(f) = ∫ f(z) dP_n(z). We denote ⌈n⌉ as the least integer greater than or equal to n, and ⌊n⌋ as the greatest integer less than or equal to n. We use τ to denote Rademacher random variables, and we define the Rademacher complexity of a function class F of mappings X → R as Rad_n(F) = E[sup_{f∈F} (1/n) Σ_{i=1}^n τ_i f(X_i)]. We denote convergence in probability with o_p and stochastic boundedness with O_p. Given two random variables X_1 and X_2, X_1 ⊥ X_2 denotes that X_1 and X_2 are independent. We use a_n ≍ b_n to denote that both a_n/b_n and b_n/a_n are bounded. Given two functions f_1, f_2 : X → R, we define ∥f_1 − f_2∥_∞ = sup_{x∈X} |f_1(x) − f_2(x)| and ∥f_1 − f_2∥_{L²} = (∫_{x∈X} (f_1(x) − f_2(x))² dx)^{1/2}. For a function class F, we define ∥F∥_∞ = sup_{f∈F} ∥f∥_∞.

Figure 2: Network structures of DRNet (separate prediction heads µ_NN(t, X) for treatment blocks t ∈ [t_a, t_b], (t_b, t_c], (t_c, t_d], with t concatenated into each hidden layer) and VCNet (a conditional density head π_NN(t | X) and a prediction head µ_NN(t, X) whose weights θ(t) vary with t).

2. PROBLEM STATEMENT AND BACKGROUND

Suppose we observe an i.i.d. sample {(y_i, x_i, t_i)}_{i=1}^n, where (y_i, x_i, t_i) is a realization of the random vector (Y, X, T) with support Y × X × T. Here X is a vector of covariates, T is a continuous treatment, and Y is the outcome. Without loss of generality, we assume T = [0, 1]. We want to estimate the Average Dose-Response Function (ADRF) ψ(t) := E(Y | do(T = t)), which is the expected potential outcome that would have been observed under treatment level t. Suppose the conditional density of T given X is π(T | X). Throughout this paper, we make the following assumptions: Assumption 1. (a) There exists some constant c > 0 such that π(t | x) ≥ c for all x ∈ X and t ∈ T. (b) The measured covariates X block all backdoor paths between the treatment and the outcome. Remark 1. Assumption 1(a) implies that treatment is assigned in a way that every subject has some chance of receiving every treatment level regardless of covariates, which is a standard assumption for establishing doubly robust estimators. Assumption 1(b) implies that the causal effect is identifiable, i.e., can be estimated using observational data.

3. VCNET: VARYING COEFFICIENT NETWORK STRUCTURE

Under Assumption 1, we have ψ(t) = E[E(Y | X, T = t)]. Thus, a naive estimator for ψ is to obtain an estimator μ of µ and use ψ(t) = (1/n) Σ_{i=1}^n μ(t, x_i). Here µ(t, x) := E(Y | X = x, T = t) and μ is its estimator. Following Shi et al. (2019), we utilize the sufficiency of the generalized propensity score π(t | X) for estimating ψ (Hirano & Imbens, 2004): ψ(t) = E[E(Y | π(t | X), T = t)]. This indicates that learning π(t | X) helps remove noise and distill the information in X that is useful for estimating ψ. Similar to Shi et al. (2019), we add a separate head for estimating π(t | X), and use the feature z extracted by it for the downstream estimation of µ(t, x) (see Figure 2). Our contribution here is the varying coefficient structure of the prediction head for µ(t, x), which addresses the difficulties confronting continuous treatments discussed in the following paragraph.
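As a concrete sketch, the naive plug-in estimator averages the fitted outcome model over the empirical covariate distribution at each treatment level. The following minimal Python sketch shows the computation; the function name and the toy outcome model are illustrative, not the paper's implementation:

```python
import numpy as np

def plug_in_adrf(mu_hat, t_grid, X):
    """Naive plug-in ADRF estimate: psi_hat(t) = (1/n) sum_i mu_hat(t, x_i).

    mu_hat: callable (t, x) -> predicted outcome; a stand-in for the trained
    outcome model, which in the paper is a neural network."""
    return np.array([np.mean([mu_hat(t, x) for x in X]) for t in t_grid])

# Toy check with a known model mu(t, x) = t + x, so psi(t) = t + mean(X).
X = np.array([0.0, 1.0, 2.0])
psi = plug_in_adrf(lambda t, x: t + x, np.array([0.0, 0.5, 1.0]), X)
```

As discussed in Section 4, this estimator is only correct when the outcome model itself is correct, which motivates the doubly robust construction later in the paper.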

3.1. THE VARYING COEFFICIENT PREDICTION HEAD

Our aim is to predict µ(t, x) = E(Y | T = t, X = x). A naive method is to train a neural network that takes (t, x) as input in the first layer and outputs µ(t, x) in the last layer. However, the role of the treatment t is different from that of x, and the influence of t might be lost in the high-dimensional hidden features (Shalit et al., 2017). Aware of this problem, previous work (Schwab et al., 2019) divides the range of the treatment into blocks and then uses a separate prediction head for each block (see Figure 2). To further strengthen the influence of the treatment t, Schwab et al. (2019) append t to each hidden layer. One problem with this structure is that it destroys the continuity of µ by using different prediction heads for different blocks of treatment levels. In practice, DRNet indeed produces discontinuous curves (see Figure 1). In order to simultaneously emphasize the influence of the treatment and preserve the continuity of the ADRF, we propose a varying coefficient neural network (VCNet). In VCNet, the prediction head for µ is defined as µ_NN(t, x) = f_θ(t)(z), where the input z is the feature extracted by the conditional density estimator, and f_θ(t) is a (deep) neural network with parameter θ(t) instead of a fixed θ. This means that the nonlinear function defined by the neural network depends on the varying treatment level t, and thus we call this structure the varying coefficient structure (Hastie & Tibshirani, 1993; Fan et al., 1999; Chiang et al., 2001). For example, if f is a one-hidden-layer ReLU network, we have f_θ(t)(z) = Σ_{i=1}^D a_i(t) ReLU(b_i(t)^T z), where a_i(t) and b_i(t) are the weights of the neural network and θ(t) = [(a_1(t), b_1(t)), ..., (a_D(t), b_D(t))]. Here we use splines to model θ(t). Suppose θ(t) = [θ_1(t), ..., θ_{d_θ}(t)] ∈ R^{d_θ}, where d_θ is the dimension of θ(t). We model each coordinate as θ_i(t) = Σ_{l=1}^L a_{i,l} ϕ^NN_l(t), where {ϕ^NN_l}_{l=1}^L is the spline basis and the a_{i,l} are coefficients.
Thus, we have θ(t) = AΦ(t), where A ∈ R^{d_θ × L} is the coefficient matrix with (i, l)-th entry a_{i,l}, and Φ(t) = [ϕ^NN_1(t), ..., ϕ^NN_L(t)]^T. It is worth mentioning that by choosing spline basis functions of the form I(t_0 ≤ t < t_1) with different t_0, t_1, we recover the structure in Schwab et al. (2019), which has a separate prediction head for each block. This indicates that DRNet can be viewed as a special case of VCNet (under a suboptimal choice of basis functions). In VCNet, the influence of the treatment t on the outcome enters directly through the parameters θ(t) of the neural network, which distinguishes the treatment from the other covariates and prevents the treatment information from being lost. Under typical choices of spline basis such as the B-spline, once the activation function is continuous, VCNet automatically produces continuous ADRF estimators.
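A minimal numerical sketch of the varying-coefficient head, assuming a one-hidden-layer ReLU f and the truncated polynomial basis of degree 2 with knots at 1/3 and 2/3 used later in the experiments (array shapes and names are illustrative):

```python
import numpy as np

def trunc_poly_basis(t, knots=(1 / 3, 2 / 3)):
    """Truncated polynomial basis of degree 2 with two interior knots,
    giving L = 5 basis functions: [1, t, t^2, (t-k1)_+^2, (t-k2)_+^2]."""
    k1, k2 = knots
    return np.array([1.0, t, t ** 2, max(t - k1, 0.0) ** 2, max(t - k2, 0.0) ** 2])

def vcnet_head(t, z, A_a, A_b):
    """One-hidden-layer ReLU prediction head with varying coefficients.

    theta(t) = A @ Phi(t): every weight of the head is a linear combination
    of spline basis functions of the treatment t. A_a has shape (D, L) and
    yields the output weights a_i(t); A_b has shape (D, dim(z), L) and
    yields the input weights b_i(t). Shapes are illustrative."""
    phi = trunc_poly_basis(t)          # Phi(t), shape (L,)
    a = A_a @ phi                      # a_i(t) for each hidden unit, shape (D,)
    b = A_b @ phi                      # b_i(t), shape (D, dim(z))
    hidden = np.maximum(b @ z, 0.0)    # ReLU(b_i(t)^T z)
    return float(a @ hidden)           # sum_i a_i(t) * ReLU(b_i(t)^T z)
```

Because Φ(t) is continuous in t, the head's output (and hence the estimated ADRF) is automatically continuous in t, in contrast to the piecewise heads of DRNet.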

3.2. CONDITIONAL DENSITY ESTIMATOR

Recall that the input feature z to µ_NN is extracted by the conditional density estimator for π(t | x). Here we propose a simple network to estimate π, which is a direct generalization of the conditional probability estimating head in Shi et al. (2019). Notice that t ∈ [0, 1] and the conditional density π(t | x) is continuous with respect to the treatment t for any given x. A continuous function can be effectively approximated by piecewise linear functions. Thus, we divide [0, 1] equally into B grids, estimate the conditional density π(· | x) on the (B + 1) grid points, and calculate the conditional density for other t's via linear interpolation. To be more specific, we define the network π^grid_NN as π^grid_NN(x) = softmax(ω_2 z) ∈ R^{B+1}, where z = f_{ω_1}(x). Here z ∈ R^h is the hidden feature extracted by the network, ω_1 is the parameter of the nonlinear mapping f_{ω_1}, ω_2 ∈ R^{(B+1)×h}, π^grid_NN(x) = [π^grid_{0,NN}(x), ..., π^grid_{B,NN}(x)], and π^grid_{i,NN}(x) is the estimated conditional density of T = i/B given X = x. This structure is analogous to a classification network with the last layer (which outputs the class with the highest softmax score) removed. The estimated conditional density at other t's given X = x is obtained via linear interpolation: π_NN(t | x) = π^grid_{t_1,NN}(x) + B (π^grid_{t_2,NN}(x) − π^grid_{t_1,NN}(x)) (t − t_1/B), where t_1 = ⌊Bt⌋ and t_2 = ⌈Bt⌉. This estimator π_NN is continuous with respect to t for any given x, and we finally rescale it to yield a valid density, i.e., π_NN(t | x) ≥ 0 for all t, x and ∫_0^1 π_NN(t | x) dt = 1 for all x. There are other options for estimating the conditional density. Popular methods include the mixture density network (Bishop (1994)), the kernel mixture network (Ambrogioni et al. (2017)), and normalizing flows (Rezende & Mohamed (2015), Dinh et al. (2016), Trippe & Turner (2018)). Here the treatment levels are bounded, and thus Gaussian mixtures are not applicable.
In the current estimator, the linear interpolation for estimating the conditional density at non-grid points can be replaced by kernel smoothing, which is more computationally intensive due to the calculation of the normalizing constant. Other techniques, including smoothness regularization and data normalization (Rothfuss et al. (2019)), can be implemented to further enhance performance. However, density estimation is not the main focus of this paper. Thus, for simplicity, in all experiments we use the aforementioned method without other techniques, and it works quite well on the datasets we tried.
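The interpolation step alone can be sketched as follows; `pi_grid` stands in for the softmax output π^grid_NN(x), and the function reproduces the formula with indices t_1 = ⌊Bt⌋ and t_2 = ⌈Bt⌉ (the final rescaling to a valid density is omitted):

```python
import numpy as np

def interpolate_density(pi_grid, t):
    """Linear interpolation of the gridded conditional density at t in [0, 1].

    pi_grid: length-(B+1) vector of estimated densities at grid points i/B
    (in the paper these come from a softmax head on the shared features z)."""
    B = len(pi_grid) - 1
    lo = int(np.floor(B * t))   # index t_1 = floor(B t)
    hi = int(np.ceil(B * t))    # index t_2 = ceil(B t)
    if lo == hi:                # t falls exactly on a grid point
        return pi_grid[lo]
    # pi(t | x) = pi_{t1} + B * (pi_{t2} - pi_{t1}) * (t - t1/B)
    return pi_grid[lo] + B * (pi_grid[hi] - pi_grid[lo]) * (t - lo / B)
```

The resulting function of t is piecewise linear and continuous, matching the continuity requirement on π_NN(t | x).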

3.3. TRAINING

Notice that our model requires π_NN to extract good latent features z as the input for µ_NN to predict from. This can be achieved by training π_NN to estimate the conditional density, which motivates us to train π_NN and µ_NN simultaneously by minimizing the following loss:

L[µ_NN, π_NN] = (1/n) Σ_{i=1}^n (y_i − µ_NN(t_i, x_i))² − (α/n) Σ_{i=1}^n log π_NN(t_i | x_i).    (1)

In loss (1), the first term measures the prediction loss of µ_NN. The second term is the negative log-likelihood and measures the loss of π_NN. The weight α controls the relative importance of the two losses. Denote μ, π as the optimal solution of the empirical risk minimization problem (1). After obtaining μ, one can estimate ψ(·) by ψ(·) = (1/n) Σ_{i=1}^n μ(·, x_i). The correctness of this naive estimator relies on whether the truth µ lies in the function space defined by the neural network model. However, we can plug μ and π into the non-parametric estimating equation (to be introduced later) to obtain a doubly robust estimator of ψ(t). In this way, we are able to produce an (asymptotically) correct estimator if either µ or π lies in the model space of the neural network. The next section discusses the challenges in obtaining such a doubly robust estimator and provides our solution.
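Loss (1) is straightforward to compute from the two heads' outputs; a minimal sketch (array names are illustrative):

```python
import numpy as np

def joint_loss(y, mu_pred, pi_pred, alpha=0.5):
    """Loss (1): prediction MSE for mu_NN plus alpha times the negative
    log-likelihood for pi_NN. mu_pred[i] ~ mu_NN(t_i, x_i) and
    pi_pred[i] ~ pi_NN(t_i | x_i); alpha trades off the two terms."""
    mse = np.mean((y - mu_pred) ** 2)
    nll = -np.mean(np.log(pi_pred))
    return mse + alpha * nll
```

In a deep learning framework, the same expression would be minimized jointly over the parameters of both heads by gradient descent.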

4. FUNCTIONAL TARGETED REGULARIZATION

In this section we improve upon the previous method by utilizing semiparametric theory on doubly robust estimators. A doubly robust estimator is built upon π(t | x) and μ(t, x), and it yields a consistent estimator for ψ even if one of them is inconsistent. When both π(t | x) and μ(t, x) are consistent, a doubly robust estimator attains a faster rate of convergence. Here our task is to estimate the whole ADRF curve, which comes with additional challenges. First, we need to find a doubly robust estimator of ψ(t_0) for any t_0 ∈ [0, 1].

4.1. DOUBLY ROBUST ESTIMATOR

Before we proceed, we define the following quantity:

ζ_{t_0}(Y, X, T; π, µ, ψ) = q_{t_0}(Y, X, T; µ, π) + µ(t_0, X) − ψ(t_0), where q_{t_0}(Y, X, T; µ, π) = δ(T − t_0) (Y − µ(T, X)) / π(T | X).

The following theorem serves as the basis for our subsequent estimators: Theorem 1. Under Assumption 1, and assuming that π(t | x) ≥ c > 0 for all x ∈ X and t ∈ T: for any t_0 ∈ T, ζ_{t_0} is the efficient influence function for ψ(t_0). Moreover, ζ_{t_0} is doubly robust in the sense that Pζ_{t_0}(Y, X, T; π, μ, ψ) = 0 if either π = π or μ = µ. Further, if ∥π − π∥_∞ = O_p(r_1(n)) and ∥μ − µ∥_∞ = O_p(r_2(n)), we have sup_{t_0∈T} |Pζ_{t_0}(Y, X, T; π, μ, ψ)| = O_p(r_1(n) r_2(n)). Theorem 1 shows that for any t_0 ∈ T, P(q_{t_0}(Y, X, T; μ, π) + μ(t_0, X)) is a doubly robust estimator of ψ(t_0). Under some mild assumptions, one way to obtain a doubly robust estimator of ψ is to utilize the two-stage procedure of Kennedy et al. (2017) by regressing the pseudo-outcome

(Y − μ(T, X)) / π(T | X) ∫_X π(T | x) dP_n(x) + ∫_X μ(T, x) dP_n(x)    (2)

on T using any nonparametric regression method such as kernel regression or spline regression.
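The first stage of this two-stage procedure can be sketched as follows, with `mu_hat` and `pi_hat` standing in for the fitted nuisance models (a naive O(n²) evaluation, written for clarity rather than speed):

```python
import numpy as np

def pseudo_outcomes(y, t, X, mu_hat, pi_hat):
    """Pseudo-outcome (2) for each observation i:

        xi_i = (y_i - mu(t_i, x_i)) / pi(t_i | x_i) * (1/n) sum_j pi(t_i | x_j)
               + (1/n) sum_j mu(t_i, x_j)

    Regressing xi on t with any nonparametric smoother then gives a doubly
    robust ADRF estimate (Kennedy et al., 2017)."""
    n = len(y)
    xi = np.empty(n)
    for i in range(n):
        pi_marg = np.mean([pi_hat(t[i], x) for x in X])  # int pi(t_i|x) dPn(x)
        mu_marg = np.mean([mu_hat(t[i], x) for x in X])  # int mu(t_i,x) dPn(x)
        xi[i] = (y[i] - mu_hat(t[i], X[i])) / pi_hat(t[i], X[i]) * pi_marg + mu_marg
    return xi
```

Note that when the outcome model is exactly correct, the residual term vanishes and each pseudo-outcome reduces to the plug-in average, illustrating the double robustness.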

4.2. TARGETED REGULARIZATION FOR INFERRING A FINITE DIMENSIONAL QUANTITY

However, as discussed in Shi et al. (2019), when estimating the term P(q_{t_0}(Y, X, T; μ, π)), the π(T | X) in the denominator might make the finite sample estimator unstable, especially in cases where Assumption 1(a) is nearly violated. Targeted regularization was proposed by Shi et al. (2019) to solve this issue. The key intuition of targeted regularization is to learn μ and π such that P(q_{t_0}(Y, X, T; μ, π)) ≈ 0, so that the estimation of this term is no longer needed. In the binary treatment case where T = {0, 1}, if we want to estimate ψ(1), which is a single quantity, targeted regularization simultaneously optimizes over µ_NN, π_NN, and an extra scalar perturbation parameter ε using the following loss:

L_TR[µ_NN, π_NN, ε] = L[µ_NN, π_NN] + β R_TR[µ_NN, π_NN, ε], where R_TR[µ_NN, π_NN, ε] = (1/n) Σ_{i=1}^n (y_i − µ_NN(t_i, x_i) − ε t_i / π_NN(1 | x_i))²,    (3)

with L[µ_NN, π_NN] defined in (1). Assuming the complexity of the function space of µ_NN and π_NN is finite, and since the complexity of the function space of the introduced perturbation is also finite, we have

P[q_1(Y, X, T; μ_TR, π)] = P[q_1(Y, X, T; μ_TR, π)] + (1/2) ∂/∂ε R_TR[μ, π, ε] |_{ε=ε̂} = (P − P_n)[q_1(Y, X, T; μ, π) − ε̂ δ(T − 1)/π²(T | X)] = o_p(1),

where μ_TR(t, x) := μ(t, x) + ε̂ t / π(t | x) and (μ, π, ε̂) is the minimizer of (3). Notice that the first equality holds because, at the convergence of the optimization, ∂/∂ε R_TR[μ, π, ε] |_{ε=ε̂} = 0. The last equality follows from a uniform concentration inequality. This implies that (1/n) Σ_{i=1}^n μ_TR(1, x_i) is a doubly robust estimator of ψ(1); for this estimator, no conditional density estimator appears in a denominator, and thus it has more stable finite sample performance.
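Since R_TR is quadratic in ε, its minimizer for fixed μ, π has a closed form: a least-squares fit of the residual on the "clever covariate" t / π_NN(1 | x). The following sketch, with illustrative array names, shows both the regularizer and that minimizer:

```python
import numpy as np

def tr_regularizer(y, t, mu_pred, pi1_pred, eps):
    """Targeted-regularization term R_TR for the binary case:
    mean of (y_i - mu(t_i, x_i) - eps * t_i / pi(1 | x_i))^2."""
    return np.mean((y - mu_pred - eps * t / pi1_pred) ** 2)

def best_eps(y, t, mu_pred, pi1_pred):
    """Closed-form minimizer of R_TR in eps (least-squares fit of the
    residual y - mu on the clever covariate h = t / pi(1 | x)). At this
    optimum the empirical analogue of P[q_1] vanishes, so the perturbed
    estimator mu + eps * t / pi is doubly robust."""
    h = t / pi1_pred
    return np.sum((y - mu_pred) * h) / np.sum(h * h)
```

In practice ε is optimized jointly with the network parameters by gradient descent rather than in closed form, but the first-order condition it satisfies at convergence is the same.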

4.3. FUNCTIONAL TARGETED REGULARIZATION FOR INFERRING THE WHOLE ADRF

Notice that in loss (3), the scalar ε is associated with a scalar quantity of interest. One can generalize targeted regularization to estimate a d-dimensional vector by using d separate ε's (see Theorem 3 in the Appendix). Generalizing to a curve, however, is more challenging. We need to optimize over a function ε : T → R, where ε(·) is the perturbation associated with ψ(·). Optimizing over the function space of all mappings from T to R is not feasible in practice, and its high complexity would lead to overfitting. Our solution is to utilize the smoothness of µ and π (Prichard & Gillam, 1971; Schneider et al., 1993; Threlfall & English, 1999), which allows us to use splines {ϕ_k}_{k=1}^{K_n} with K_n basis functions to approximate ε(·). Here the subscript n in K_n denotes that the number of basis functions may change with the sample size n. Define ε_n(·) = Σ_{k=1}^{K_n} α_k ϕ_k(·). We use the following loss with Functional Targeted Regularization (FTR):

L_FTR[µ_NN, π_NN, ε_n] = L[µ_NN, π_NN] + β_n R_FTR[µ_NN, π_NN, ε_n], where R_FTR[µ_NN, π_NN, ε_n] = (1/n) Σ_{i=1}^n (y_i − µ_NN(t_i, x_i) − ε_n(t_i) / π_NN(t_i | x_i))².    (4)

Here R_FTR denotes the FTR term and β_n → 0 as n → ∞. Remark 2 (On β_n). The targeted regularization proposed by Shi et al. (2019) uses a fixed β. However, using a fixed β might render the estimator constructed by targeted regularization inconsistent when µ_NN is mis-specified, which means the estimator is no longer doubly robust. To overcome this issue, we make a slight change by allowing β to depend on n. Specifically, we find that once β_n = o(1), targeted regularization is guaranteed to give a doubly robust estimator. See the discussion at Remark 4 and Appendix A.1 for more details. Demonstrating the asymptotic correctness of FTR is more challenging than analyzing traditional targeted regularization. One reason is that we no longer have ∂/∂ε R_TR[μ, π, ε] |_{ε=ε̂} = 0.
With some additional effort, we establish the convergence rate of our estimator using loss (4) in Theorem 2. Before we proceed, let us pause and introduce some definitions that will be used in the main theorem. Denote μ, π, and ε̂_n as the minimizer of (4). We use π and μ to denote the fixed functions to which π and μ converge, in the sense that ∥π − π∥_∞ = o_p(1) and ∥μ − μ∥_∞ = o_p(1). We define g_t : X → R, x ↦ µ_NN(x, t). We denote G, Q, U as the function spaces in which g_t, µ_NN, π_NN lie. We denote B_{K_n} as the closed linear span of the basis ϕ_{K_n} = {ϕ_k}_{k=1}^{K_n}. The key intuition behind the asymptotic correctness of FTR is the following: once π_NN and π are uniformly upper/lower bounded and some other weak regularity conditions hold, we can show that ∥ε̂_n(·) − ε*(·)∥_{L²} = o_p(1), where ε*(·) := E[(Y − μ)/π | T = ·] / E[π^{−2} | T = ·]. Thus, letting μ_FTR := μ + ε̂_n/π, we have

P[q_{t_0}(Y, X, T; μ_FTR, π)] = E[(Y − μ_FTR)/π | T = t_0] p_T(t_0) ≈ E[(Y − μ − ε*/π)/π | T = t_0] p_T(t_0) = 0,

where p_T denotes the marginal density of T.

Assumption 2. We consider the following assumptions: (i) There exists a constant c > 0 such that for any t ∈ T, x ∈ X, and π_NN ∈ U, we have 1/c ≤ π_NN(t | x) ≤ c, 1/c ≤ π(t | x) ≤ c, ∥Q∥_∞ ≤ c, and ∥µ∥_∞ ≤ c. (ii) Y = µ(X, T) + V, where EV = 0, V ⊥ X, V ⊥ T. (v) B_{K_n} equals the closed linear span of B-splines with equally spaced knots, fixed degree, and dimension K_n ≍ n^{1/6}.

Theorem 2. Under Assumptions 1 and 2, let ψ(·) := (1/n) Σ_{i=1}^n [μ(x_i, ·) + ε̂_n(·)/π(· | x_i)]. Then ∥ψ − ψ∥_{L²} = O_p(n^{−1/3} log n + r_1(n) r_2(n)), where ∥π − π∥_∞ = O_p(r_1(n)) and ∥μ − µ∥_∞ = O_p(r_2(n)).

Remark 3. In Theorem 2, assumptions (i), (iii), and the first half of (v) are weak and standard conditions for establishing convergence rates of spline estimators (Huang et al., 2003; 2004). Assumption (ii) bounds the tail behavior of V. The second half of (v) restricts the growth rate of K_n, which is a typical assumption (Huang et al., 2003; 2004) but with a different rate in order to obtain a uniform bound. The first half of assumption (iv) states that at least one of μ, π should be consistent. The second half of assumption (iv) concerns the complexity of the model space and is a common assumption for problems with nuisance functions (Kennedy et al., 2017).

Remark 4. We point out that adding targeted regularization does not affect the limits of μ and π in large sample asymptotics. That is, the limits of μ and π using loss (4) are the same as using loss (1). We refer the reader to Appendix A.1 for a more detailed discussion and proof. Notice that our proof of Theorem 2 can also be adapted to analyze a modified one-step TMLE (Van Der Laan & Rubin, 2006). Under very similar assumptions, we obtain double robustness and the same consistency rate for the TMLE estimator. Theorem 2 guarantees that if we appropriately control the model complexity, then under mild assumptions the estimator ψ from targeted regularization is doubly robust, and when both π and μ are consistent, the rate of convergence of ψ to the truth is faster than the individual convergence rates of π and μ. Thus, using targeted regularization theoretically helps us obtain a better estimator of ψ.
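For fixed μ and π, minimizing R_FTR over the spline coefficients α of ε_n is a weighted least-squares problem; a minimal sketch (the basis and all names are illustrative), consistent with the closed form α = (B_n^T Π_n^{−2} B_n)^{−1} B_n^T Z_n that appears in the appendix:

```python
import numpy as np

def fit_ftr_eps(y, t, mu_pred, pi_pred, basis):
    """Least-squares fit of the FTR perturbation eps_n(t) = sum_k a_k phi_k(t),
    minimizing (1/n) sum_i (y_i - mu_i - eps_n(t_i) / pi_i)^2 over the
    coefficients a. `basis` maps a treatment t to the vector of K_n spline
    basis values phi_k(t)."""
    B = np.array([basis(ti) for ti in t])   # n x K_n design of phi_k(t_i)
    F = B / pi_pred[:, None]                # regressors phi_k(t_i) / pi_i
    a, *_ = np.linalg.lstsq(F, y - mu_pred, rcond=None)
    return a
```

In the actual method ε_n is optimized jointly with the networks by gradient descent, but this closed form is what the fitted coefficients satisfy when μ and π are held fixed.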

5. RELATED WORK

The Varying Coefficient Structure. The varying coefficient (linear) model was first proposed as an extension of the linear model (Hastie & Tibshirani, 1993; Fan et al., 1999) and is usually used for modeling longitudinal data (Huang et al., 2004; Zhang & Wang, 2015; Li et al., 2017; Ye et al., 2019). The key motivation of the varying coefficient model is dimension reduction: it avoids the curse of dimensionality in statistical estimation. Different from existing models, our varying coefficient structure is applied to a complex neural network model with a different motivation: enhancing the expressiveness of the treatment effect. Besides, building a hierarchical structure on the network parameters is also explored by HyperNetworks (Stanley et al., 2009; Ha et al., 2016). Hypernetworks provide an abstraction that mimics a biological structure: the relationship between a genotype (the hypernetwork) and a phenotype (the main network). The weights of the main network are a function of a latent embedding variable, which is learned with end-to-end training. A hypernetwork trains a much smaller network to generate the weights of a larger main network in order to reduce the search space, while our network directly trains the main network, whose weights are linear combinations of spline functions of the treatment. Moreover, our network is proposed in order to appropriately incorporate the treatment into modeling, which is not touched upon by HyperNetworks. Neural Network Structures for Treatment Effect Estimation. We refer readers to the introduction for the connections and comparisons with previous work using feed-forward neural networks for treatment effect estimation. In addition to feed-forward neural networks, previous results also utilize other networks to learn treatment effects. For example,

6. EXPERIMENTS

Dataset. Since the true ADRF is rarely available for real-world data, previous methods for treatment effect estimation often use synthetic/semi-synthetic data for empirical evaluation. Following this convention, we consider one synthetic and two semi-synthetic datasets, IHDP (Hill, 2011) and News (Newman, 2008), with the detailed generating schemes included in the Appendix. IHDP contains binary treatment with 747 observations on 25 covariates, and News consists of 3000 randomly sampled news items from the NY Times corpus (Newman, 2008). Both IHDP and News are widely used benchmark datasets for binary treatment effect estimation; since we focus on continuous treatments, we generate the continuous treatment as well as the outcome ourselves. For IHDP and News, we randomly split the data into a training set (67%) and a testing set (33%). Baselines and Settings. For neural network baselines, we compare against Dragonnet (Shi et al., 2019) and DRNet (Schwab et al., 2019). We improve upon the original Dragonnet and DRNet by (a) using separate heads for T in different blocks for Dragonnet, and (b) adding a conditional density estimation head to DRNet, since Shi et al. (2019) suggest that adding a conditional density estimation head improves performance. For non-neural-network baselines, we consider causal forest (Wager & Athey, 2018), Bayesian Additive Regression Trees (BART) (Chipman et al., 2010), and GPS (Imbens, 2000). For VCNet, we use a truncated polynomial basis of degree 2 with two knots at {1/3, 2/3} (thus 5 basis functions altogether). Dragonnet and DRNet use 5 blocks, so the model complexity of the neural-network models is the same. In practice we may vary the degree and the number of knots in VCNet; here the choice is made simply for a fair comparison against Dragonnet and DRNet, ensuring that the number of parameters of the compared models is the same.
The other hyper-parameters of each method on each dataset are tuned on 20 separate tuning sets. Due to space limits, we refer readers to Appendix A.4 for more details on the experimental settings. Estimator and Metrics. To evaluate the effectiveness of targeted regularization, for all neural-network methods we implement four versions: a naive version (with conditional density estimator head, trained using loss (1)), a doubly robust version (Kennedy et al., 2017) obtained by regressing (2) on the treatment with μ, π trained using loss (1), a TMLE (Van Der Laan & Rubin, 2006) version with the initial estimator trained using loss (1), and a TR version trained using loss (4). For non-neural-network models, we use the usual estimator. For the evaluation metric, following Schwab et al. (2019), we use the average mean squared error (AMSE) on the test set, where AMSE = (1/S) Σ_{s=1}^S ∫_T [ψ_s(t) − ψ(t)]² π(t) dt and ψ_s(t) is the estimate of ψ(t) in the s-th simulation. Results. Table 1 compares the neural network based methods. Comparing results in each column, we observe a performance boost from the varying coefficient structure. Comparing results in each row, we find that the naive versions consistently perform the worst, targeted regularization often achieves the best performance, and the performance of the doubly robust estimator and TMLE varies across datasets. In Table 2, we compare our approach with traditional statistical models. We observe that in the simulation and on IHDP, VCNet + targeted regularization outperforms the baselines by a large margin. On News, its performance is close to the best. The implementation can be found in an open source repository.
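The AMSE metric can be computed on a treatment grid as follows (a sketch; the grid, density, and names are illustrative):

```python
import numpy as np

def amse(psi_hats, psi_true, t_grid, pi_t):
    """AMSE = (1/S) sum_s integral_T [psi_hat_s(t) - psi(t)]^2 pi(t) dt,
    with the integral approximated by the trapezoid rule on t_grid.

    psi_hats: (S, G) array of estimated curves over S simulations;
    psi_true, pi_t: true ADRF and marginal treatment density on the grid."""
    sq = (psi_hats - psi_true[None, :]) ** 2 * pi_t[None, :]
    # Trapezoid rule along the treatment grid, one integral per simulation.
    integrals = np.sum((sq[:, 1:] + sq[:, :-1]) / 2 * np.diff(t_grid)[None, :], axis=1)
    return float(np.mean(integrals))
```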

7. CONCLUSION

This work proposes a novel varying coefficient network and generalizes targeted regularization to a continuous curve. We provide theorems showing consistency and double robustness. Experiments show that the VCNet structure and targeted regularization boost performance independently, and when used together they improve over existing methods by a large margin.

A APPENDIX

A.1 ON THE CONSISTENCY OF μ AND π

We show that adding targeted regularization does not affect the limits of μ and π in large sample asymptotics. That is, the limits of μ and π using loss (4) are the same as using loss (1). Denote

P(µ_NN, π_NN) = P[(y − µ_NN(t, x))² − α log π_NN(t | x)], P_n(µ_NN, π_NN) = (1/n) Σ_{i=1}^n [(y_i − µ_NN(t_i, x_i))² − α log π_NN(t_i | x_i)].

Lemma 1. Suppose that (µ*, π*) is the minimizer of the loss P(µ_NN, π_NN) and (μ, π, ε̂_n) is the minimizer of L_FTR. Then P(μ, π) − P(µ*, π*) = o(1) + O_p(n^{−1/2}).

Proof. We have

P(μ, π) − P(µ*, π*) ≤ P_n(μ, π) − P_n(µ*, π*) + |(P − P_n)(μ, π)| + |(P − P_n)(µ*, π*)|
(a) ≤ P_n(μ, π) − P_n(µ*, π*) + O_p(n^{−1/2})
= (P_n(μ, π) + β_n R_FTR[μ, π, ε̂_n]) − (P_n(µ*, π*) + β_n R_FTR[µ*, π*, 0]) + O_p(n^{−1/2}) + β_n (R_FTR[µ*, π*, 0] − R_FTR[μ, π, ε̂_n])
(b) ≤ β_n (R_FTR[µ*, π*, 0] − R_FTR[μ, π, ε̂_n]) + O_p(n^{−1/2})
≤ β_n R_FTR[µ*, π*, 0] + O_p(n^{−1/2})
(c) = o(1) + O_p(n^{−1/2}),

where (a) follows from a uniform concentration inequality, using the facts that µ_NN, π_NN are uniformly bounded, Rad_n(Q), Rad_n(U) = O(n^{−1/2}), and the Lipschitz constant of log(x) is bounded when x ∈ [1/c, c] for some finite c; (b) follows from the fact that (μ, π, ε̂_n) is the minimizer of the empirical risk (with FTR); and (c) follows from β_n = o(1) together with R_FTR[µ*, π*, 0] = (1/n) Σ_{i=1}^n (y_i − µ*)² = E(y − µ*)² + O_p(n^{−1/2}) = E(V²) + E(µ − µ*)² + O_p(n^{−1/2}) = O(1).

Now we prove that

∥μ − µ*∥_{L²} + ∥π − π*∥_{L²} = o_p(1).    (5)

For simplicity, we ignore the unidentifiability of the neural network parameterization and assume (µ*, π*) is the unique minimizer, in the sense that for any ε > 0 there exists η(ε) > 0 such that

inf_{∥µ_NN − µ*∥_{L²} + ∥π_NN − π*∥_{L²} > ε} P(µ_NN, π_NN) − P(µ*, π*) > η(ε).    (6)

If Equation (5) were not true, there would exist s > 0 such that for any N > 0, there exists n > N with ∥μ − µ*∥_{L²} + ∥π − π*∥_{L²} ≥ s. From Lemma 1, we know that P(μ, π) − P(µ*, π*) ≤ η(s) when n is sufficiently large. Thus, there exists n_0 such that ∥μ − µ*∥_{L²} + ∥π − π*∥_{L²} ≥ s while P(μ, π) − P(µ*, π*) ≤ η(s), which contradicts (6).

Published as a conference paper at ICLR 2021

A.2 TECHNICAL PROOFS

In this section, we prove the two main theorems (Theorem 1 and Theorem 2) and give some additional results which are mentioned briefly in the main text.

A.2.1 NOTATIONS AND DEFINITIONS

We denote $\delta_{t_0}$ the Dirac measure centered on $t_0$, and recall that $\delta(\cdot)$ denotes the Dirac delta function. We use $a_n\lesssim b_n$ to denote that $a_n\le Cb_n$ for some $C>0$ and all sufficiently large $n$. We denote $\mathbf 1_n=(1,1,\cdots,1)^T\in\mathbb R^n$. For any function $f$ and function spaces $\mathcal F_1,\mathcal F_2$, we write
$$\mathcal F_1+\mathcal F_2=\{f_1+f_2:f_1\in\mathcal F_1,f_2\in\mathcal F_2\},\quad
\mathcal F_1\mathcal F_2=\{f_1f_2:f_1\in\mathcal F_1,f_2\in\mathcal F_2\},\quad
f\mathcal F=\{fh:h\in\mathcal F\},\quad
f\circ\mathcal F=\{f\circ h:h\in\mathcal F\},\quad
\mathcal F^a=\{f^a:f\in\mathcal F\},\ \forall a\in\mathbb R.$$
We define
$$\check\varepsilon_n(\cdot)=\frac{\mathbb P\big[(Y-\hat\mu_n)/\hat\pi_n\mid T=\cdot\big]}{\mathbb P\big[\hat\pi_n^{-2}\mid T=\cdot\big]},$$
where $(\hat\mu_n,\hat\pi_n,\hat\varepsilon_n)$ is the minimizer of loss (4). We denote $\hat\varepsilon_n(\cdot)=\sum_{k=1}^{K_n}\hat\alpha_k\varphi_k(\cdot)$ the spline regression estimator of $\varepsilon(\cdot)$. With some slight abuse of notation, we denote
$$\varphi^{K_n}(t)=(\varphi_1(t),\varphi_2(t),\cdots,\varphi_{K_n}(t))^T\in\mathbb R^{K_n},\qquad
B_n=\big(\varphi^{K_n}(t_1),\cdots,\varphi^{K_n}(t_n)\big)^T\in\mathbb R^{n\times K_n}.$$
We define
$$\Pi_n=\mathrm{diag}\big(\hat\pi_n(t_1\mid x_1),\hat\pi_n(t_2\mid x_2),\cdots,\hat\pi_n(t_n\mid x_n)\big),\qquad
\bar\Pi_n=\mathrm{diag}\Big(\mathbb P\big[\hat\pi_n^{-2}(T\mid X)\mid T=t_1\big]^{-1/2},\cdots,\mathbb P\big[\hat\pi_n^{-2}(T\mid X)\mid T=t_n\big]^{-1/2}\Big).$$
We define $Z_n=(z_1,z_2,\cdots,z_n)^T\in\mathbb R^n$ where
$$z_i=\frac{y_i-\hat\mu_n(t_i,x_i)}{\hat\pi_n(t_i\mid x_i)},$$
and $\bar Z_n=(\bar z_1,\bar z_2,\cdots,\bar z_n)^T\in\mathbb R^n$ where
$$\bar z_i=\mathbb P\Big[\frac{Y-\hat\mu_n(T,X)}{\hat\pi_n(T\mid X)}\,\Big|\,T=t_i\Big].$$
Notice that $\hat\alpha=(B_n^T\Pi_n^{-2}B_n)^{-1}B_n^TZ_n$, and we denote $\bar\alpha=(B_n^T\Pi_n^{-2}B_n)^{-1}B_n^T\Pi_n^{-2}\bar\Pi_n^2\bar Z_n$.
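The closed form $\hat\alpha=(B_n^T\Pi_n^{-2}B_n)^{-1}B_n^TZ_n$ is an ordinary weighted least-squares spline fit and can be computed directly. Below is a minimal numpy/scipy sketch (the function names and basis construction are ours, not the authors' released code), using the degree-2 B-spline basis with knots $\{1/3,2/3\}$ described in A.4.1:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(t, interior_knots=(1/3, 2/3), degree=2):
    """Evaluate the clamped B-spline basis on [0, 1] at points t.
    With degree 2 and knots {1/3, 2/3} this yields 5 basis functions,
    matching the basis size described in A.4.1."""
    kv = np.concatenate([np.zeros(degree + 1),
                         np.asarray(interior_knots, dtype=float),
                         np.ones(degree + 1)])
    K = len(kv) - degree - 1
    B = np.empty((len(t), K))
    for k in range(K):
        coef = np.zeros(K)
        coef[k] = 1.0
        B[:, k] = BSpline(kv, coef, degree, extrapolate=False)(t)
    return np.nan_to_num(B)   # zero outside [0, 1]

def fit_epsilon(t, z, pi_hat, interior_knots=(1/3, 2/3), degree=2):
    """Weighted least squares alpha = (B' Pi^{-2} B)^{-1} B' z, where
    z_i = (y_i - mu_hat_i) / pi_hat_i is formed by the caller."""
    B = bspline_basis(t, interior_knots, degree)
    w = 1.0 / pi_hat**2                        # diagonal of Pi^{-2}
    alpha = np.linalg.solve(B.T @ (w[:, None] * B), B.T @ z)
    return alpha, B
```

With `pi_hat` identically 1, this reduces to an ordinary spline regression of the pseudo-outcome `z` on `t`.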

A.2.2 USEFUL LEMMAS

This section gives the lemmas used in our proofs of the main theorems. Recall that Theorem 1 consists of two parts: the efficient influence function of $\psi(t_0)$ and its double robustness. Notice that $\psi(t_0)$ can be written as a special case of the more general parameter
$$\Gamma=\int_{\mathcal T}\gamma(t;\mathbb P_T)\,\psi(t)\,d\mathbb P_T(t)$$
(see Remark 5). Here $\gamma(t;\mathbb P_T)$ is a function of $t$ which depends on the probability measure $\mathbb P_T$. For brevity we also write $\gamma(t):=\gamma(t;\mathbb P_T)$ when the corresponding $\mathbb P_T$ is the true probability measure of the treatment $T$. The following lemma gives the efficient influence function of $\Gamma$.

Lemma 2. The efficient influence function for $\Gamma$ is
$$\zeta(Y,X,T,\pi,\mu,\Gamma)=\gamma(T)\,\xi(Y,X,T,\pi,\mu)-\Gamma+\int_{\mathcal T}\mu(t,X)\gamma(t)\,d\mathbb P_T(t)
+\frac{\gamma_\varepsilon'(T)}{l_\varepsilon'(T;0)}\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)
-\int_{\mathcal T}\int_{\mathcal X}\gamma(t)\mu(t,x)\,d\mathbb P(x)\,d\mathbb P_T(t)
-\mathbb E_T\Big[\frac{\gamma_\varepsilon'(T)}{l_\varepsilon'(T;0)}\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)\Big],$$
where
$$\xi(Y,X,T,\pi,\mu)=\frac{Y-\mu(T,X)}{\pi(T\mid X)}\int_{\mathcal X}\pi(T\mid x)\,d\mathbb P(x)+\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x),$$
$\gamma_\varepsilon'(t)=\frac{d\gamma(t;\mathbb P_{T,\varepsilon})}{d\varepsilon}\big|_{\varepsilon=0}$, $l_\varepsilon'(t;0)=\frac{\partial\log p_{T,\varepsilon}(t)}{\partial\varepsilon}\big|_{\varepsilon=0}$, and $\mathbb P_{T,\varepsilon}$ is a parametric submodel with parameter $\varepsilon\in\mathbb R$ and $\mathbb P_{T,0}(\cdot)=\mathbb P_T(\cdot)$.

Remark 5. Setting $\gamma(\cdot)=\frac{d\delta_{t_0}(\cdot)}{d\mathbb P_T(\cdot)}$, we get $\Gamma=\psi(t_0)$. Setting $\gamma(\cdot)=1$, we get $\Gamma=\int_{\mathcal T}\psi(t)\,d\mathbb P_T(t)$, which is the average outcome under a randomized trial and a quantity of interest in its own right (Kennedy et al., 2017). Setting $\gamma(\cdot)=\frac{d\delta_1(\cdot)}{d\mathbb P_T(\cdot)}-\frac{d\delta_0(\cdot)}{d\mathbb P_T(\cdot)}$, we get $\Gamma=\psi(1)-\psi(0)$, the average treatment effect in the binary treatment setting.

Recall that in Theorem 2, $\hat\varepsilon_n$ is the spline regression estimator of $\varepsilon$. In order to establish the convergence rate of our final estimator $\hat\psi$, we first need the convergence rate of $\hat\varepsilon_n$.

Lemma 3. Under the assumptions of Theorem 2, we have $\|\hat\varepsilon_n-\check\varepsilon_n\|_{L_2}=O_p\big(n^{-1/3}\sqrt{\log n}\big)$.

Remark 6. For a fixed objective function, the convergence rate of the B-spline estimator to the truth is a standard result (Huang et al., 2004; 2003): it is $O_p(n^{-2/5})$ when choosing $K_n\asymp n^{1/5}$. However, Lemma 3 gives a uniform bound over a class of functions, needed because $\hat\pi_n$ and $\hat\mu_n$ are NOT fixed but depend on the observations. The bound is thus of the larger order $n^{-1/3}\sqrt{\log n}$ when choosing the optimal $K_n\asymp n^{1/6}$.

Lemma 4. Under the assumptions of Theorem 2, there exist positive constants $M_1$ and $M_2$ such that, except on an event whose probability tends to zero, all eigenvalues of $(K_n/n)\,B_n^T\Pi_n^{-2}B_n$ fall between $M_1$ and $M_2$; consequently, $(K_n/n)\,B_n^T\Pi_n^{-2}B_n$ is invertible.

Lemma 5. Assume $\|\mathcal F_1\|_\infty<\infty$ and $\|\mathcal F_2\|_\infty<\infty$. Then
$$\mathrm{Rad}_n(\mathcal F_1\mathcal F_2)\le\tfrac12\big(\mathrm{Rad}_n(\mathcal F_1)+\mathrm{Rad}_n(\mathcal F_2)\big)\big(\|\mathcal F_1\|_\infty+\|\mathcal F_2\|_\infty\big).$$
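Remark 5 notes that the binary-treatment average treatment effect is the special case $\gamma(\cdot)=d\delta_1/d\mathbb P_T-d\delta_0/d\mathbb P_T$, where the influence function reduces to the familiar AIPW form. A small simulated sanity check of double robustness (our illustration, not from the paper): the outcome model is deliberately misspecified, yet with the true propensity the estimate stays near the true effect of 2.

```python
import numpy as np

# Simulated binary-treatment example; true ATE = 2.
rng = np.random.default_rng(0)
n = 40000
x = rng.uniform(-1, 1, n)
e = 1.0 / (1.0 + np.exp(-x))       # true propensity, bounded away from 0 and 1
t = rng.binomial(1, e)
y = 1.0 + 2.0 * t + x + rng.normal(0.0, 0.5, n)

# Deliberately misspecified outcome model: mu_hat(t, x) = 0 in both arms.
mu1_hat = np.zeros(n)
mu0_hat = np.zeros(n)

# AIPW (efficient-influence-function) estimate of the ATE
aipw = np.mean(t * (y - mu1_hat) / e
               - (1 - t) * (y - mu0_hat) / (1 - e)
               + (mu1_hat - mu0_hat))
```

Swapping the roles (correct outcome model, misspecified propensity) leaves the estimate consistent as well, mirroring condition (iv).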

A.2.3 PROOF OF THEOREM 1

Proof. Denote $\gamma(t;\mathbb P_T)=\frac{d\delta_{t_0}(t)}{d\mathbb P_T(t)}$. The efficient influence function $\zeta_{t_0}$ of $\psi(t_0)$ is obtained by plugging this definition of $\gamma(t;\mathbb P_T)$ into Lemma 2 and simplifying using $\gamma_\varepsilon'(t;\mathbb P_T)/l_\varepsilon'(t;0)=-\gamma(t;\mathbb P_T)$. So here we only need to prove that the efficient influence function is doubly robust. We have
$$\mathbb P\,\zeta_{t_0}(Y,X,T,\bar\pi,\bar\mu,\Gamma)
=\mathbb P\Big[\delta(T-t_0)\,\frac{Y-\bar\mu(T,X)}{\bar\pi(T\mid X)}+\bar\mu(t_0,X)\Big]-\psi(t_0)$$
$$\overset{(a)}{=}\int\int\pi(t\mid x)\,\delta(t-t_0)\,\frac{\mu(t,x)-\bar\mu(t,x)}{\bar\pi(t\mid x)}\,d\mathbb P(x)\,dt
+\int_{\mathcal X}\bar\mu(t_0,x)\,d\mathbb P(x)-\int_{\mathcal X}\mu(t_0,x)\,d\mathbb P(x)$$
$$\overset{(b)}{=}\int_{\mathcal X}\Big(\frac{\pi(t_0\mid x)}{\bar\pi(t_0\mid x)}-1\Big)\big(\mu(t_0,x)-\bar\mu(t_0,x)\big)\,d\mathbb P(x), \tag{8}$$
where (a) follows from iterated expectation, and (b) follows from $\pi(x\mid t)=\pi(t\mid x)\pi(x)/\pi(t)$. From the last line of Equation (8) it is obvious that the desired conclusions hold: the expression vanishes if either $\bar\pi=\pi$ or $\bar\mu=\mu$.

A.2.4 PROOF OF THEOREM 2

Proof. First, from condition (i), we have
$$\Big\|\hat\varepsilon_n(\cdot)\int_{\mathcal X}\frac{1}{\hat\pi(\cdot\mid x)}\,d\mathbb P_n(x)-\mathbb P\Big[\delta(T-\cdot)\,\frac{Y-\hat\mu(T,X)}{\hat\pi(T\mid X)}\Big]\Big\|_{L_2}
=\Big\|\hat\varepsilon_n(\cdot)\int_{\mathcal X}\frac{1}{\hat\pi(\cdot\mid x)}\,d\mathbb P_n(x)-p_T(\cdot)\,\mathbb P\Big[\frac{Y-\hat\mu_n(T,X)}{\hat\pi_n(T\mid X)}\,\Big|\,T=\cdot\Big]\Big\|_{L_2}$$
$$\le\Big\|(\hat\varepsilon_n-\check\varepsilon_n)(\cdot)\int_{\mathcal X}\frac{1}{\hat\pi(\cdot\mid x)}\,d\mathbb P_n(x)\Big\|_{L_2}
+\Big\|\check\varepsilon_n(\cdot)\Big(\int_{\mathcal X}\frac{1}{\hat\pi(\cdot\mid x)}\,d\mathbb P_n(x)-p_T(\cdot)\,\mathbb P\Big[\frac{1}{\hat\pi^2(\cdot\mid X)}\,\Big|\,T=\cdot\Big]\Big)\Big\|_{L_2}$$
$$\lesssim\|\hat\varepsilon_n-\check\varepsilon_n\|_{L_2}
+\Big\|\check\varepsilon_n(\cdot)\int_{\mathcal X}\frac{1}{\hat\pi(\cdot\mid x)}\,d(\mathbb P_n-\mathbb P)(x)\Big\|_{L_2}
+\Big\|\mathbb P\Big[\frac{\mu(X,T)-\hat\mu_n(X,T)}{\hat\pi_n(T\mid X)}\,\Big|\,T=\cdot\Big]\int_{\mathcal X}\frac{\pi(\cdot\mid x)-\hat\pi(\cdot\mid x)}{\hat\pi^2(\cdot\mid x)}\,d\mathbb P(x)\Big\|_{L_2}$$
$$\overset{(a)}{=}O_p\big(n^{-1/3}\sqrt{\log n}+r_1(n)r_2(n)\big), \tag{9}$$
where (a) follows from Lemma 3, which says $\|\hat\varepsilon_n-\check\varepsilon_n\|_{L_2}=O_p(n^{-1/3}\sqrt{\log n})$; the empirical-process term is of smaller order, and the last term is $O_p(r_1(n)r_2(n))$.

From the generalization bound and condition (iv), we know that
$$\sup_{t_0\in[0,1]}\Big|\frac1n\sum_{i=1}^n\hat\mu_n(x_i,t_0)-\mathbb P\hat\mu_n(X,t_0)\Big|
=\sup_{t_0\in[0,1]}\big|\mathbb P_n\hat g_{t_0}(X)-\mathbb P\hat g_{t_0}(X)\big|=O_p(n^{-1/2}).$$
Thus,
$$\Big\|\frac1n\sum_{i=1}^n\hat\mu_n(x_i,\cdot)-\mathbb P\hat\mu_n(X,\cdot)\Big\|_{L_2}=O_p(n^{-1/2}). \tag{10}$$

Recall that Theorem 1 says that if
$$\sup_{t\in[0,1]}\sup_{X\in\mathcal X}|\hat\pi_n(t\mid X)-\pi(t\mid X)|=O_p(r_1(n)),\qquad
\sup_{t\in[0,1]}\sup_{X\in\mathcal X}|\hat\mu_n(t,X)-Q(t,X)|=O_p(r_2(n)),$$
then
$$\sup_{t_0\in[0,1]}\Big|\mathbb P\Big[\delta(T-t_0)\,\frac{Y-\hat\mu_n(T,X)}{\hat\pi_n(T\mid X)}+\hat\mu_n(t_0,X)\Big]-\psi(t_0)\Big|=O_p(r_1(n)r_2(n)). \tag{11}$$

Combining Equations (9), (10), and (11) with the triangle inequality, we have
$$\Big\|\hat\varepsilon_n(\cdot)\int_{\mathcal X}\frac{1}{\hat\pi(\cdot\mid x)}\,d\mathbb P_n(x)+\frac1n\sum_{i=1}^n\hat\mu_n(x_i,\cdot)-\psi(\cdot)\Big\|_{L_2}
=O_p\big(n^{-1/3}\sqrt{\log n}+r_1(n)r_2(n)\big).$$
So if we set
$$\hat\psi(t_0)=\hat\varepsilon_n(t_0)\int_{\mathcal X}\frac{1}{\hat\pi(t_0\mid x)}\,d\mathbb P_n(x)+\frac1n\sum_{i=1}^n\hat\mu_n(x_i,t_0)
=\frac1n\sum_{i=1}^n\Big[\hat\mu_n(x_i,t_0)+\frac{\hat\varepsilon_n(t_0)}{\hat\pi(t_0\mid x_i)}\Big],$$
we have $\|\hat\psi-\psi\|_{L_2}=O_p\big(n^{-1/3}\sqrt{\log n}+r_1(n)r_2(n)\big)$.

A.2.5 PROOF OF LEMMA 2

Proof. The proof follows Kennedy et al. (2017). Denote
$$\Gamma(\varepsilon)=\int_{\mathcal T}\gamma_\varepsilon(t)\int_{\mathcal X}\int_{\mathcal Y}y\,p(y\mid x,t;\varepsilon)\,p(x;\varepsilon)\,p(t;\varepsilon)\,dy\,dx\,dt,$$
where we write $\gamma_\varepsilon(\cdot):=\gamma(\cdot\,;\mathbb P_{T,\varepsilon})$ for brevity.
Then, by definition, the efficient influence function for $\Gamma$ is the unique function $\zeta(Y,X,T)$ such that
$$\Gamma_\varepsilon'(0)=\mathbb E\big[\zeta(Y,X,T)\,l'(Y,X,T;0)\big], \tag{12}$$
where $l'(y,x,t;0)=\frac{d\log P_{Y,X,T,\varepsilon}(y,x,t)}{d\varepsilon}\big|_{\varepsilon=0}$, $P_{Y,X,T,\varepsilon}(y,x,t)$ is a parametric submodel with parameter $\varepsilon\in\mathbb R$ and $P_{Y,X,T,0}(y,x,t)=P_{Y,X,T}(y,x,t)$, and $\Gamma_\varepsilon'(0)=d\Gamma(\varepsilon)/d\varepsilon\big|_{\varepsilon=0}$. We have
$$\Gamma_\varepsilon'(0)=\int_{\mathcal T}\gamma(t)\int_{\mathcal X}\int_{\mathcal Y}y\,\big[p_\varepsilon'(y\mid x,t;0)p(x)+p(y\mid x,t)p_\varepsilon'(x;0)\big]\,p(t)\,dy\,dx\,dt
+\int_{\mathcal T}\gamma(t)\int_{\mathcal X}\int_{\mathcal Y}y\,p(y\mid x,t)\,p(x)\,p_\varepsilon'(t;0)\,dy\,dx\,dt
+\int_{\mathcal T}\gamma_\varepsilon'(t)\int_{\mathcal X}\int_{\mathcal Y}y\,p(y\mid x,t)\,p(x)\,p(t)\,dy\,dx\,dt
=:I_1+I_2+I_3.$$
From $l'(y\mid x,t;0)=p_\varepsilon'(y\mid x,t;0)/p(y\mid x,t)$ and the definition of $\psi(t)$, we have
$$I_1=\int_{\mathcal T}\gamma(t)\Big[\mathbb E_X\mathbb E_{Y\mid X,T}\big(y\,l'(y\mid x,t;0)\big)+\mathbb E_X\big(\mu(x,t)\,l'(x;0)\big)\Big]p(t)\,dt,$$
$$I_2=\int_{\mathcal T}\gamma(t)\psi(t)\,p_\varepsilon'(t;0)\,dt=\int_{\mathcal T}\gamma(t)\psi(t)\,l'(t;0)\,p(t)\,dt,\qquad
I_3=\int_{\mathcal T}\gamma_\varepsilon'(t)\psi(t)\,p(t)\,dt.$$
Thus, we have
$$\Gamma_\varepsilon'(0)=\int_{\mathcal T}\Big[\gamma(t)\Big(\mathbb E_X\mathbb E_{Y\mid X,T}\big(y\,l'(y\mid x,t;0)\big)+\mathbb E_X\big(\mu(x,t)\,l'(x;0)\big)\Big)+\gamma(t)\psi(t)\,l'(t;0)+\gamma_\varepsilon'(t)\psi(t)\Big]\,p(t)\,dt. \tag{13}$$
Meanwhile, for the right-hand-side term $\mathbb E_{X,T,Y}\big[\zeta(Y,X,T)\,l'(Y,X,T;0)\big]$ in Equation (12), from $l'(Y,X,T;0)=l'(Y\mid X,T;0)+l'(X,T;0)$, where $l'(Y\mid X,T;0)$ and $l'(X,T;0)$ are defined analogously to $l'(Y,X,T;0)$, we know that
$$\mathbb E_{X,T,Y}\big[\zeta\,l'(Y,X,T;0)\big]=\mathbb E_{X,T,Y}\big[\zeta\,l'(Y\mid X,T;0)\big]+\mathbb E_{X,T,Y}\big[\zeta\,l'(X,T;0)\big]. \tag{14}$$
Now we treat each term in Equation (14) separately. Recall that
$$\zeta(Y,X,T,\pi,\mu)=\gamma(T)\,\frac{Y-\mu(T,X)}{\pi(T\mid X)}\int_{\mathcal X}\pi(T\mid x)\,d\mathbb P(x)+\gamma(T)\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)-\Gamma
+\int_{\mathcal T}\mu(t,X)\gamma(t)\,d\mathbb P_T(t)
+\frac{\gamma_\varepsilon'(T)}{l_\varepsilon'(T;0)}\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)
-\int_{\mathcal T}\int_{\mathcal X}\gamma(t)\mu(t,x)\,d\mathbb P(x)\,d\mathbb P_T(t)
-\mathbb E_T\Big[\frac{\gamma_\varepsilon'(T)}{l_\varepsilon'(T;0)}\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)\Big].$$
Thus, for the first term in Equation (14), we have
$$\mathbb E_{X,T,Y}\big[\zeta\,l'(Y\mid X,T;0)\big]=\mathbb E_{X,T}\mathbb E_{Y\mid X,T}\big[\zeta\,l'(Y\mid X,T;0)\big]
\overset{(a)}{=}\mathbb E_{X,T}\Big[\gamma(T)\,\frac{\mathbb E_{Y\mid X,T}\big[Y\,l'(Y\mid X,T;0)\big]}{\pi(T\mid X)}\int_{\mathcal X}\pi(T\mid x)\,d\mathbb P(x)\Big]$$
$$\overset{(b)}{=}\int_{\mathcal T\times\mathcal X}\gamma(t)\,\frac{\mathbb E_{Y\mid x,t}\big[Y\,l'(Y\mid x,t;0)\big]}{\pi(t\mid x)}\,\Big(\int_{\mathcal X}\pi(t\mid x')\,d\mathbb P(x')\Big)\,p(x)\,\pi(t\mid x)\,dt\,dx
=\int_{\mathcal T\times\mathcal X}\gamma(t)\,\mathbb E_{Y\mid x,t}\big[Y\,l'(Y\mid x,t;0)\big]\,p(t)\,p(x)\,dt\,dx
=\int_{\mathcal T}\gamma(t)\,\mathbb E_x\mathbb E_{Y\mid x,t}\big[Y\,l'(Y\mid x,t;0)\big]\,p(t)\,dt, \tag{15}$$
where (a) follows from the fact that $\mathbb E_{Y\mid X,T}\big[l'(Y\mid X,T;0)\big]=0$ (so only the term of $\zeta$ involving $Y$ survives), and (b) uses the law of iterated expectations.

For the second term in Equation (14), we have
$$\mathbb E_{X,T,Y}\big[\zeta\,l'(X,T;0)\big]
\overset{(a)}{=}\mathbb E_{X,T,Y}\Big[\gamma(T)\,\frac{Y-\mu(T,X)}{\pi(T\mid X)}\int_{\mathcal X}\pi(T\mid x)\,d\mathbb P(x)\,l'(X,T;0)\Big]
+\mathbb E_{X,T}\Big[\Big(\gamma(T)\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)+\frac{\gamma_\varepsilon'(T)}{l_\varepsilon'(T;0)}\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)\Big)\big(l'(X\mid T;0)+l'(T;0)\big)\Big]$$
$$+\mathbb E_{X,T}\Big[\int_{\mathcal T}\mu(t,X)\gamma(t)\,d\mathbb P_T(t)\,\big(l'(T\mid X;0)+l'(X;0)\big)\Big]
+\mathbb E_{X,T}\Big[\Big(-\int_{\mathcal T}\gamma(t)\psi(t)\,d\mathbb P_T(t)-\mathbb E_T\Big[\frac{\gamma_\varepsilon'(T)}{l_\varepsilon'(T;0)}\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)\Big]-\Gamma\Big)\,l'(T,X;0)\Big]$$
$$\overset{(b)}{=}\mathbb E_T\Big[\gamma(T)\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)\,l'(T;0)\Big]
+\mathbb E_T\Big[\frac{\gamma_\varepsilon'(T)}{l_\varepsilon'(T;0)}\int_{\mathcal X}\mu(T,x)\,d\mathbb P(x)\,l'(T;0)\Big]
+\mathbb E_X\Big[\int_{\mathcal T}\mu(t,X)\gamma(t)\,d\mathbb P_T(t)\,l'(X;0)\Big]$$
$$\overset{(c)}{=}\int_{\mathcal T}\gamma(t)\psi(t)\,l'(t;0)\,p(t)\,dt
+\int_{\mathcal T}\mathbb E_X\big[\mu(t,X)\,l'(X;0)\big]\gamma(t)\,d\mathbb P_T(t)
+\int_{\mathcal T}\gamma_\varepsilon'(t)\psi(t)\,p(t)\,dt, \tag{16}$$
where (a) follows from the fact that $l'(X,T;0)=l'(X\mid T;0)+l'(T;0)=l'(T\mid X;0)+l'(X;0)$; (b) follows from $\mathbb E_{Y\mid X,T}Y=\mu(T,X)$, $\mathbb E_{T\mid X}\,l'(T\mid X;0)=\mathbb E_{X\mid T}\,l'(X\mid T;0)=\mathbb E_{X,T}\,l'(X,T;0)=0$, and the law of iterated expectations; and (c) follows from the definition $\psi(t)=\int_{\mathcal X}\mu(t,x)\,d\mathbb P(x)$.

Comparing Equations (15) and (16) against Equation (13), we immediately get $\Gamma_\varepsilon'(0)=\mathbb E_{X,T,Y}\big[\zeta(Y,X,T)\,l'(Y,X,T;0)\big]$, which implies that $\zeta(Y,X,T)$ is indeed the efficient influence function of $\Gamma$.

A.2.6 PROOF OF LEMMA 3

Proof. This proof follows Huang et al. (2004). Let us start with the decomposition
$$\|\hat\varepsilon_n-\check\varepsilon_n\|_{L_2}\le\|\check\varepsilon_n-\tilde\varepsilon_n\|_{L_2}+\|\hat\varepsilon_n-\tilde\varepsilon_n\|_{L_2},$$
where $\tilde\varepsilon_n=\varphi^{K_n}(t)^T\bar\alpha$. The first term $\|\check\varepsilon_n-\tilde\varepsilon_n\|_{L_2}$ is the bias and the second term $\|\hat\varepsilon_n-\tilde\varepsilon_n\|_{L_2}$ is the variance.

Bound on bias term

Let $\alpha^*\in\mathbb R^{K_n}$ be such that $\|(\alpha^*)^T\varphi^{K_n}-\check\varepsilon_n\|_\infty=\inf_{f\in\mathcal B^{K_n}}\|f-\check\varepsilon_n\|_\infty$. Then we have
$$\|\check\varepsilon_n-\tilde\varepsilon_n\|_{L_2}=\|\check\varepsilon_n-(\alpha^*)^T\varphi^{K_n}+(\alpha^*)^T\varphi^{K_n}-\tilde\varepsilon_n\|_{L_2}
\le\|\check\varepsilon_n-(\alpha^*)^T\varphi^{K_n}\|_{L_2}+\|(\alpha^*)^T\varphi^{K_n}-\tilde\varepsilon_n\|_{L_2}.$$
By the definition of $\alpha^*$ and properties of the B-spline space, we have a bound on the first term:
$$\|\check\varepsilon_n-(\alpha^*)^T\varphi^{K_n}\|_{L_2}=O_p(\rho_n).$$

Control of the first term in Equation (18): denote $\delta=(\delta_1,\cdots,\delta_n)^T:=Z_n-\bar Z_n$. Then we have
$$\big\|(B_n^T\Pi_n^{-2}B_n)^{-1}B_n^T(Z_n-\bar Z_n)\big\|_2^2
=(Z_n-\bar Z_n)^TB_n(B_n^T\Pi_n^{-2}B_n)^{-2}B_n^T(Z_n-\bar Z_n)
\overset{(a)}{\lesssim}\frac{K_n^2}{n^2}\,\delta^TB_nB_n^T\delta
=\frac{K_n^2}{n^2}\Big\|\sum_{i=1}^n\varphi^{K_n}(t_i)\delta_i\Big\|_2^2
=\frac{K_n^2}{n^2}\sum_{k=1}^{K_n}\Big(\sum_{i=1}^n\varphi_k(t_i)\delta_i\Big)^2
\le K_n^2\sum_{k=1}^{K_n}\sup_{\hat\pi,\hat\mu}\Big(\frac1n\sum_{i=1}^n\varphi_k(t_i)\delta_i\Big)^2, \tag{19}$$
where (a) is from Lemma 4. By definition we know that
$$\delta_i=\frac{y_i-\hat\mu_n(x_i,t_i)}{\hat\pi(t_i\mid x_i)}-\mathbb P\Big[\frac{Y-\hat\mu_n(X,T)}{\hat\pi(T\mid X)}\,\Big|\,T=t_i\Big]
=\frac{\mu(x_i,t_i)-\hat\mu_n(x_i,t_i)}{\hat\pi(t_i\mid x_i)}-\mathbb P\Big[\frac{\mu(X,T)-\hat\mu_n(X,T)}{\hat\pi(T\mid X)}\,\Big|\,T=t_i\Big]+\frac{v_i}{\hat\pi(t_i\mid x_i)}
=u_i+\tilde v_i,$$
where $u_i=\frac{\mu(x_i,t_i)-\hat\mu_n(x_i,t_i)}{\hat\pi(t_i\mid x_i)}-\mathbb P\big[\frac{\mu(X,T)-\hat\mu_n(X,T)}{\hat\pi(T\mid X)}\mid T=t_i\big]$ with $\mathbb E(u_i\mid t_i)=0$, and $\tilde v_i=v_i/\hat\pi(t_i\mid x_i)$ with $\mathbb E(\tilde v_i\mid t_i,x_i)=0$. Thus, from the union bound, we have
$$\mathrm{Prob}\Big(\sum_{k=1}^{K_n}\sup_{\hat\pi,\hat\mu}\Big(\frac1n\sum_{i=1}^n\varphi_k(t_i)\delta_i\Big)^2>a\Big)
\le\sum_{k=1}^{K_n}\mathrm{Prob}\Big(\sup_{\hat\pi,\hat\mu}\Big|\frac1n\sum_{i=1}^n\varphi_k(t_i)u_i\Big|>\frac12\sqrt{\frac a{K_n}}\Big)
+\sum_{k=1}^{K_n}\mathrm{Prob}\Big(\sup_{\hat\pi,\hat\mu}\Big|\frac1n\sum_{i=1}^n\varphi_k(t_i)\tilde v_i\Big|>\frac12\sqrt{\frac a{K_n}}\Big). \tag{20}$$
From Lemma 5, we know that
$$\mathrm{Rad}_n\big((\mathcal Q+\mu)\mathcal U^{-1}\big)\le\tfrac12\big(\|\mathcal Q\|_\infty+\|\mathcal U^{-1}\|_\infty\big)\big(\mathrm{Rad}_n(\mathcal Q)+\mathrm{Rad}_n(\mathcal U^{-1})\big)
\overset{(a)}{\le}\tfrac12\big(\|\mathcal Q\|_\infty+\|\mathcal U^{-1}\|_\infty\big)\Big(\mathrm{Rad}_n(\mathcal Q)+\max\Big\{\frac{c^2}2,\frac2{(c-1/c)^2}\Big\}\mathrm{Rad}_n\Big(\mathcal U-\frac1{2c}\Big)+\frac{2c}n\Big)=O(n^{-1/2}),$$
where (a) follows from plugging $h:x\mapsto\frac1{x-1/(2c)}+2c$ into Theorem 12(4) of Bartlett & Mendelson (2002). Similarly, writing $\mathcal A=(\mathcal Q+\mu)\mathcal U^{-1}$, Lemma 5 gives
$$\mathrm{Rad}_n(\varphi_k\mathcal A)\le\tfrac12\big(\|\varphi_k\|_\infty+\|\mathcal A\|_\infty\big)\big(\mathrm{Rad}_n(\varphi_k)+\mathrm{Rad}_n(\mathcal A)\big)=O(n^{-1/2}).$$
Thus, we bound the first term of (20) using
$$\mathrm{Prob}\Big(\sup_{\hat\pi,\hat\mu}\Big|\frac1n\sum_{i=1}^n\varphi_k(t_i)u_i\Big|>\frac12\sqrt{\frac a{K_n}}\Big)
\overset{(a)}{\lesssim}\frac{\mathbb E\sup_{\hat\pi,\hat\mu}\big|\frac1n\sum_{i=1}^n\varphi_k(t_i)u_i\big|}{\frac12\sqrt{a/K_n}}
\overset{(b)}{\lesssim}\sqrt{\frac{K_n}{an}}, \tag{21}$$
where (a) follows from Markov's inequality, and (b) follows from the definition of Rademacher complexity. We bound the second term of Equation (20) using a truncation argument: for any $M_n>0$,
$$\mathrm{Prob}\Big(\sup_{\hat\pi,\hat\mu}\Big|\frac1n\sum_{i=1}^n\varphi_k(t_i)\tilde v_i\Big|>\frac12\sqrt{\frac a{K_n}}\Big)
\le\mathrm{Prob}\Big(\sup_{\hat\pi,\hat\mu}\Big|\frac1n\sum_{i=1}^n\varphi_k(t_i)\tilde v_i\,\mathbb I(|v_i|>M_n)\Big|>\frac14\sqrt{\frac a{K_n}}\Big)
+\mathrm{Prob}\Big(\sup_{\hat\pi,\hat\mu}\Big|\frac1n\sum_{i=1}^n\varphi_k(t_i)\tilde v_i\,\mathbb I(|v_i|\le M_n)\Big|>\frac14\sqrt{\frac a{K_n}}\Big).$$
We have from Markov's inequality that
$$\mathrm{Prob}\Big(\sup_{\hat\pi,\hat\mu}\Big|\frac1n\sum_{i=1}^n\varphi_k(t_i)\tilde v_i\,\mathbb I(|v_i|\le M_n)\Big|>\frac14\sqrt{\frac a{K_n}}\Big)
\lesssim\frac{\mathbb E\sup_{\hat\pi,\hat\mu}\big|\frac1n\sum_{i=1}^n\frac{\varphi_k(t_i)}{\hat\pi(t_i\mid x_i)}v_i\,\mathbb I(|v_i|\le M_n)\big|}{\sqrt{a/K_n}}
\overset{(a)}{\lesssim}\sqrt{\frac{K_n}{an}}\,M_n, \tag{23}$$
where (a) follows from Lemma 5. Also, we have
$$\mathrm{Prob}\Big(\sup_{\hat\pi,\hat\mu}\Big|\frac1n\sum_{i=1}^n\varphi_k(t_i)\tilde v_i\,\mathbb I(|v_i|>M_n)\Big|>\frac14\sqrt{\frac a{K_n}}\Big)
\lesssim\frac{\mathbb E\sup_{\hat\pi,\hat\mu}\big|\frac1n\sum_{i=1}^n\varphi_k(t_i)\tilde v_i\,\mathbb I(|v_i|>M_n)\big|}{\sqrt{a/K_n}}
\le\frac{\mathbb E\sup_{\hat\pi,\hat\mu}\frac1n\sum_{i=1}^n\frac{\varphi_k(t_i)}{\hat\pi(t_i\mid x_i)}|v_i|\,\mathbb I(|v_i|>M_n)}{\sqrt{a/K_n}}
\lesssim\frac{\mathbb E\big[|v|\,\mathbb I(|v|>M_n)\big]}{\sqrt{a/K_n}}. \tag{24}$$

For the multidimensional extension (Theorem 3 below), the efficient influence function is characterized coordinatewise by
$$\mathbb E\Big[\zeta_i\,\frac{d\,l}{d\varepsilon_j}\Big|_{\varepsilon=0}\Big]=\frac{d\Gamma_i}{d\varepsilon_j}\Big|_{\varepsilon=0},\qquad\forall\,i,j=1,2,\cdots,d.$$
Notice that the efficient influence function $\zeta_i(Y,X,T,\pi,\mu,\Gamma_i)$ for $\Gamma_i$ does not depend on $\varepsilon$. Thus, for each $i,j=1,2,\cdots,d$, the above equation can be proved using the same arguments as in Section A.2.5.

A.4.1 NETWORK STRUCTURE

For all methods, we implement the conditional density estimator as a neural network with two fully connected hidden layers, each consisting of 50 hidden units with ReLU activations. The hidden feature z is the latent representation extracted after the second ReLU activation. We set the number of grid points to B = 10. The estimate of π(t | x) is computed as described in Section 3. Following Schwab et al. (2019), we use 5 blocks for Dragonnet and DRNet. The structure of the prediction head for each block is the same as the prediction head µ of VCNet, except that Dragonnet and DRNet do not use treatment-dependent weights. In VCNet, the prediction head for µ(t, x) is a neural network with two fully connected hidden layers stacked on the hidden feature z; each hidden layer consists of 50 hidden units with ReLU activation. We use a B-spline of degree two with two knots placed at {1/3, 2/3} (5 basis functions in total). In this way, all methods have the same model complexity, i.e., the same number of parameters. We also tried different structures and found the relative performance of the methods to be similar, so all results reported below are based on this structure. All networks are trained for 800 epochs.
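The varying-coefficient idea in this head can be sketched in a few lines: each layer's weight matrix is a spline function of the treatment, $W(t)=\sum_k\alpha_k\varphi_k(t)$. The snippet below is a schematic numpy illustration, not the released code; it uses a truncated-power basis spanning the same 5-dimensional spline space as the B-spline basis above, and the names `tp_basis` and `VCLinear` are ours:

```python
import numpy as np

def tp_basis(t, knots=(1/3, 2/3)):
    """Degree-2 truncated-power basis: spans the same 5-dimensional
    space as the B-spline basis described above (a simplification)."""
    cols = [np.ones_like(t), t, t**2]
    cols += [np.clip(t - k, 0.0, None)**2 for k in knots]
    return np.stack(cols, axis=-1)                    # (n, 5)

class VCLinear:
    """Linear layer whose weight matrix varies with treatment t:
    W(t) = sum_k alpha_k * phi_k(t). A schematic sketch only."""
    def __init__(self, d_in, d_out, n_basis=5, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.1, (n_basis, d_in, d_out))   # the alpha_k
    def __call__(self, z, t):
        phi = tp_basis(t)                             # (n, 5)
        W = np.einsum('nk,kio->nio', phi, self.A)     # per-sample W(t_i)
        return np.einsum('ni,nio->no', z, W)

# two-layer prediction head over hidden features z, as in A.4.1
head1, head2 = VCLinear(50, 50), VCLinear(50, 1)
rng = np.random.default_rng(1)
z = rng.normal(size=(8, 50))
t = np.linspace(0.1, 0.9, 8)
mu_hat = head2(np.maximum(head1(z, t), 0.0), t)       # ReLU between layers
```

Because each $\varphi_k$ is continuous in t, the fitted $\hat\mu(t,x)$ is continuous in t, which is the design goal that separate per-block heads cannot achieve.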

A.4.2 PARAMETER SETTING

For each dataset we tune parameters based on 20 runs. In each run we simulate data, randomly split it into training and testing sets, and use the AMSE on the testing data for evaluation. We tune the following parameters. For all methods: network learning rate lr ∈ {0.05, 0.005, 0.001, 0.0005, 0.0001} and α ∈ {1, 0.5}. For TR: learning rate for ε(t): lr ∈ {0.001, 0.0001}, and β ∈ {20, 10, 5} × n^{−1/2}. We found that performance is not sensitive to α. In the estimator of ε(t), we use a B-spline of degree 2 and tune the number of knots over {5, 10, 20} (all equally spaced on [0, 1]). For TMLE and the doubly robust estimator, we tune the B-spline parameters in the same way as for the TR version. During tuning, all networks are trained for 800 epochs.
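Enumerating this grid is straightforward. In the sketch below, `train_and_eval` is a hypothetical stand-in for one simulate/split/train/evaluate run returning test AMSE; it is not part of the released code:

```python
from itertools import product

def tune(train_and_eval, n, n_runs=20):
    """Pick the config minimizing mean test AMSE over n_runs simulations.
    `train_and_eval(lr, alpha, beta, run)` is a hypothetical stand-in for
    one simulate / split / train / evaluate run."""
    grid = list(product(
        [0.05, 0.005, 0.001, 0.0005, 0.0001],   # network learning rate
        [1, 0.5],                               # alpha
        [b * n ** -0.5 for b in (20, 10, 5)],   # beta, scaled by n^{-1/2}
    ))
    def mean_amse(cfg):
        return sum(train_and_eval(*cfg, run) for run in range(n_runs)) / n_runs
    return min(grid, key=mean_amse)
```

The ε(t)-specific learning rate and knot count from the paragraph above would be tuned the same way, as extra grid axes.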



https://github.com/lushleaf/varying-coefficient-net-with-functional-tr



Figure 1: Estimated ADRF on the testing set from a typical run of VCNet and DRNet. Panels from left to right show results on the simulation, IHDP, and News datasets. Both VCNet and DRNet are well optimized. Blue points denote the DRNet estimate, red points the VCNet estimate, and the truth is shown as a yellow solid line.

Figure 2: Comparison of network structure between DRNet and VCNet.

and $V$ follows a sub-Gaussian distribution. (iii) $\pi$, $\mu$, $\pi^{NN}$ and $\mu^{NN}$ have bounded second derivatives, for any $\pi^{NN}\in\mathcal Q$ and $\mu^{NN}\in\mathcal U$. (iv) Either $\bar\pi=\pi$ or $\bar\mu=\mu$; and $\mathrm{Rad}_n(\mathcal G),\mathrm{Rad}_n(\mathcal Q),\mathrm{Rad}_n(\mathcal U)=O(n^{-1/2})$.

$$\lesssim\frac{\int_0^\infty\mathrm{Prob}\big(|v|\ge\max(M_n,w)\big)\,dw}{\sqrt{a/K_n}}
\overset{(b)}{\lesssim}\frac{\int_0^\infty e^{-\sigma[\max(M_n,w)]^2}\,dw}{\sqrt{a/K_n}},$$
where we set $W=|v|\,\mathbb I(|v|>M_n)$, (b) utilizes the fact that $v$ follows a sub-Gaussian distribution, and (c) uses the Mills ratio. Plugging Equations (21), (23), and (24) into (20), taking $M_n\asymp\sqrt{\log n}$ and $a\asymp K_n\log n/n$, and plugging back into Equation (19) gives the desired bound on $\|(B_n^T\Pi_n^{-2}B_n)^{-1}B_n^T(Z_n-\bar Z_n)\|_2$.

A.3 EFFICIENT INFLUENCE FUNCTION FOR A MULTIDIMENSIONAL PARAMETER

This section states and proves the efficient influence function of a multidimensional vector, which is briefly mentioned in the main text. Suppose $\Gamma=(\Gamma_1,\Gamma_2,\cdots,\Gamma_d)^T\in\mathbb R^d$, where
$$\Gamma_j=\int_{\mathcal T}\gamma^{(j)}(t;\mathbb P_T)\int_{\mathcal X}\int_{\mathcal Y}y\,p(y\mid x,t)\,p(x)\,p(t)\,dy\,dx\,dt.$$
Then we have the following theorem.

Theorem 3. The efficient influence function for the $d$-dimensional vector $\Gamma$ is
$$\zeta(Y,X,T,\pi,\mu,\Gamma)=\big(\zeta_1(Y,X,T,\pi,\mu,\Gamma_1),\zeta_2(Y,X,T,\pi,\mu,\Gamma_2),\cdots,\zeta_d(Y,X,T,\pi,\mu,\Gamma_d)\big)^T\in\mathbb R^d,$$
where $\zeta_j(Y,X,T,\pi,\mu,\Gamma_j)$ is the efficient influence function for $\Gamma_j$, $j=1,2,\cdots,d$.

Proof. Define $\Gamma(\varepsilon):=(\Gamma_1(\varepsilon),\Gamma_2(\varepsilon),\cdots,\Gamma_d(\varepsilon))^T\in\mathbb R^d$, where for $i=1,2,\cdots,d$,
$$\Gamma_i(\varepsilon)=\int_{\mathcal T}\gamma^{(i)}(t;\mathbb P_{T,\varepsilon})\int_{\mathcal X}\int_{\mathcal Y}y\,p(y\mid x,t;\varepsilon)\,p(x;\varepsilon)\,p(t;\varepsilon)\,dy\,dx\,dt.$$
Define $l(Y,X,T;\varepsilon)=\log P_{Y,X,T;\varepsilon}$, where $P_{Y,X,T;\varepsilon}$ is a parametric submodel with parameter $\varepsilon\in\mathbb R^d$ and $P_{Y,X,T;0}=P_{Y,X,T}$. Then the efficient influence function $\zeta$ is defined as the unique function such that the coordinatewise derivative condition $\mathbb E\big[\zeta_i\,\frac{d\,l}{d\varepsilon_j}\big|_{\varepsilon=0}\big]=\frac{d\Gamma_i}{d\varepsilon_j}\big|_{\varepsilon=0}$ holds for all $i,j$; since each $\zeta_i$ does not depend on $\varepsilon$, the claim follows coordinatewise from the arguments in Section A.2.5.

± 0.00094          0.026 ± 0.0012   0.037 ± 0.00086  0.028 ± 0.00088
Drnet    0.042 ± 0.00090   0.023 ± 0.0011   0.035 ± 0.00083  0.027 ± 0.00086
Vcnet    0.018 ± 0.00098   0.022 ± 0.0013   0.016 ± 0.00082  0.014 ± 0.00091

Table: Experiment results comparing neural-network-based methods. TR refers to targeted regularization. Numbers reported are the AMSE on testing data, based on 100 repeats for Simulation and IHDP and 20 repeats for News; numbers after ± are the estimated standard deviation of the average value.

. The synthetic dataset contains 500 training points and 200 testing points, with

Table: Comparison of VCNet against non-neural-network baselines. Reported AMSE values are averaged over 100 experiments for simulation and IHDP, and 20 experiments for News. Numbers after ± are the estimated standard deviation of the average AMSE.


where $\rho_n=\inf_{f\in\mathrm{Span}\{\varphi^{K_n}\}}\sup_{t\in[0,1]}|\check\varepsilon_n(t)-f(t)|$. Notice that the second term $\|(\alpha^*)^T\varphi^{K_n}-\tilde\varepsilon_n\|_{L_2}$ can also be bounded, where (a) follows from properties of the B-spline basis functions, (b) follows from Lemma 4, (c) follows from properties of the B-spline space together with $\bar\alpha=(B_n^T\Pi_n^{-2}B_n)^{-1}B_n^T\Pi_n^{-2}\bar\Pi_n^2\bar Z_n$, and (d) is from the upper and lower boundedness of $\hat\pi_n$. Following the proof of Lemma A.6 of Huang et al. (2004), for any $a>0$ the relevant tail probability can be controlled, where (a) uses the union bound and (b) follows from Hoeffding's inequality for bounded random variables. Since $\mathbb E\varphi_k(T)$ is bounded, we can thus bound the bias term.

Bound on variance term. From the definitions of $\hat\alpha$ and $\bar\alpha$, we have the decomposition
$$\hat\alpha-\bar\alpha=(B_n^T\Pi_n^{-2}B_n)^{-1}B_n^T(Z_n-\bar Z_n)+(B_n^T\Pi_n^{-2}B_n)^{-1}B_n^T(\bar Z_n-\Pi_n^{-2}\bar\Pi_n^2\bar Z_n). \tag{18}$$
Control of the second term in Equation (18): notice that each coordinate of $\bar Z_n-\Pi_n^{-2}\bar\Pi_n^2\bar Z_n$ is bounded; thus, using arguments similar to those for Equation (21), we obtain the corresponding bound. Combining Equations (26) and (27) into (18) bounds $\|\hat\alpha-\bar\alpha\|_2$, and thus the variance term. Combining the rates for the bias and variance terms, we get the stated rate, where (a) follows from assumption (iii), giving $\|\hat\varepsilon_n-\check\varepsilon_n\|_{L_2}=O_p(n^{-1/3}\sqrt{\log n})$ when taking $K_n\asymp n^{1/6}$.

A.2.7 PROOF OF LEMMA 4

Proof. Take the SVD decomposition with singular value matrix $\Lambda$. By Lemma A.3 of Huang et al. (2004), all diagonal elements of $(K_n/n)\,\Lambda^T\Lambda$ fall between some positive constants. Notice that the eigenvalues of $(K_n/n)\,B_n^T\Pi_n^{-2}B_n$ coincide with the diagonal elements of $(K_n/n)\,\Lambda^T\Lambda$. From the upper and lower boundedness of $\hat\pi_n$, we can get the desired conclusion.

A.2.8 PROOF OF LEMMA 5

A.4.3 STATISTICAL BASELINES

We implement causal forest (Wager & Athey, 2018) using the R package 'grf' (Tibshirani et al., 2018), BART using the R package 'bartMachine' (Kapelner et al., 2016), and GPS using the R package 'causaldrf' (Galagate et al., 2015). We tune the parameters of each method on each dataset using 20 separate tuning sets, including the number of trees for BART, the number of trees and minimum node size for causal forest, and the number of knots for GPS. The other hyper-parameters are set to the default values of the R packages.

Synthetic Dataset

We generate data as follows: $x_j\sim\mathrm{Unif}[0,1]$, where $x_j$ is the $j$-th dimension of $x\in\mathbb R^6$; the treatment and outcome generating equations use $\tilde t=(1+\exp(-t))^{-1}$. Notice that $\pi(t\mid x)$ only depends on $x_1,x_2,x_3,x_4,x_5$ while $Q(t,x)$ only depends on $x_1,x_3,x_4,x_6$. As discussed in Shi et al. (2019), this allows us to observe the improvement from VCNet when noise covariates exist. Results are reported in Table 1.

IHDP The original semi-synthetic IHDP dataset from Hill (2011) contains binary treatments with 747 observations on 25 covariates. To allow comparison on continuous treatments, we randomly generate treatments and responses using
$$t\mid x=\cdots+\frac{2\max(x_3,x_5,x_6)}{0.2+\min(x_3,x_5,x_6)}+2\tanh\Big(5\,\frac{\sum_{i\in S_{\mathrm{dis},2}}(x_i-c_2)}{|S_{\mathrm{dis},2}|}\Big)-4+\mathcal N(0,0.25),$$
$$y\mid x,t=\frac{\sin(3\pi t)}{1.2-t}\,\tanh\Big(5\,\frac{\sum_{i\in S_{\mathrm{dis},1}}(x_i-c_1)}{|S_{\mathrm{dis},1}|}\Big)+\frac{\exp(0.2(x_1-x_6))}{0.5+5\min(x_2,x_3,x_5)}+\mathcal N(0,0.25),$$
where $\tilde t=(1+\exp(-t))^{-1}$, $S_{\mathrm{con}}=\{1,2,3,5,6\}$ is the index set of continuous features, $S_{\mathrm{dis},1}=\{4,7,8,9,10,11,12,13,14,15\}$, $S_{\mathrm{dis},2}=\{16,17,18,19,20,21,22,23,24,25\}$, and $S_{\mathrm{dis},1}\cup S_{\mathrm{dis},2}=[25]-S_{\mathrm{con}}$. Notice that all continuous features are useful for both $\pi(t\mid x)$ and $Q(t,x)$, but only $S_{\mathrm{dis},1}$ is useful for $Q$ and only $S_{\mathrm{dis},2}$ is useful for $\pi$. Following Hill (2011), covariates are standardized to mean 0 and standard deviation 1, and the generated treatments are normalized to lie in [0, 1]. Results are summarized in Table 1.

News The News dataset consists of 3000 randomly sampled news items from the NY Times corpus (Newman, 2008), which was originally introduced as a benchmark in the binary treatment setting (Johansson et al., 2016). We generate the treatment and outcome in a similar way as Bica et al. (2020). We first generate $v_1',v_2',v_3'$ from $\mathcal N(0,1)$ and then set $v_i=v_i'/\|v_i'\|_2$ for $i\in\{1,2,3\}$. Given $x$, we generate $t$ from $\mathrm{Beta}(2,$
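All three generation recipes share the same template: sample covariates, map a covariate-dependent index through the logistic function to obtain a treatment in [0, 1], and draw the outcome from a t-dependent response surface. The sketch below is schematic only; the index and response functions are simplified placeholders of our own, NOT the paper's exact forms:

```python
import numpy as np

def simulate(n, d=6, seed=0):
    """Schematic generator: covariates ~ Unif[0,1]^6; the treatment depends
    on x1..x5 and the outcome on x1, x3, x4, x6, as stated above. The index
    and response functions here are simplified placeholders, not the
    paper's exact forms."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, (n, d))
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    # squash a noisy covariate index into a treatment on (0, 1)
    t = sigmoid(x[:, :5].sum(axis=1) - 2.5 + 0.5 * rng.normal(size=n))
    # t-dependent response surface over the outcome-relevant covariates
    y = (np.cos(2.0 * np.pi * (t - 0.5))
         * (x[:, 0] + x[:, 2] + x[:, 3] + x[:, 5])
         + rng.normal(0.0, 0.25, n))
    return x, t, y

x, t, y = simulate(500)   # 500 training points, as in the synthetic setup
```

Because the true dose-response of this template is smooth in t, it exercises exactly the continuity property that separates VCNet from per-block heads.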

