MULTI-TREATMENT EFFECT ESTIMATION WITH PROXY: CONTRASTIVE LEARNING AND RANK WEIGHTING

Abstract

We study treatment effect estimation for continuous and multi-dimensional treatments in the setting where confounders are unobserved but high-dimensional proxy variables for them are available. Existing methods either directly adjust the relationship between observed covariates and treatments or recover the hidden confounders with probabilistic models. However, they either rely on a correctly specified treatment assignment model or require a strong prior on the distribution of the unobserved confounders. To relax these requirements, we propose a Contrastive regularizer (Cr) that learns a proxy representation containing all the relevant information in the unobserved confounders. Based on Cr, we propose a novel Rank weighting method (Rw) to de-bias the treatment assignment. Combining Cr and Rw, we propose a neural network framework named CRNet to estimate the effects of multiple continuous treatments under unobserved confounders, evaluated by the Average Dose-Response Function. Empirically, we demonstrate that CRNet achieves state-of-the-art performance on both synthetic and semi-synthetic datasets.



1. INTRODUCTION

Causal inference is widely applied for explanatory analysis and decision making, e.g., in precision medicine (Raita et al., 2021), advertising (Lada et al., 2019), education (Johansson et al., 2016) and the digital economy (Nazarov, 2020). With accessible observational data, many existing algorithms accurately estimate the effect of a binary treatment by adjusting for the confounders (i.e., the common causes of treatments and outcomes); these algorithms rely on the unconfoundedness assumption that all confounders are observed. However, continuous and multi-dimensional treatments and unmeasured confounders are common in practice. For instance, practitioners seek to develop precision medicine by studying the response of the patient's health state (i.e., the outcome) to multiple drug dosages (i.e., the treatments) (Shi et al., 2020). Besides, due to technical and operational issues, some key variables associated with the treatments and outcomes, such as a patient's immunity, may be missing from the historical data; these are referred to as unmeasured confounders. To detect and adjust for unmeasured confounders, practitioners record proxy variables (noisy views of the unobserved confounders, e.g., antibodies), which have no direct effect on the treatments or the outcome of interest but are spuriously associated with them through shared confounders (Fig. 1(a)).

In the continuous treatment setting, under the unconfoundedness assumption, recent works either discretize the continuous treatment into a multi-valued treatment (Hill, 2011; Wager & Athey, 2018) so that traditional models apply, or develop generalized balancing methods for the continuous scenario (Hirano & Imbens, 2004; Vegetabile et al., 2021; Huling et al., 2021). Among them, state-of-the-art works (Wu & Fukumizu, 2021; Schwab et al., 2020; Nie et al., 2021) learn a low-dimensional representation of the raw data and balance it by minimizing mutual information, which discards the imbalanced part of the raw data and, in practice, loses most of the information needed for the predictive task.
In effect, this technique trades a lower estimator variance for a higher bias. Furthermore, with unobserved confounders, controlling for the proxies rather than the unobserved variables induces additional bias in the effect estimate, which we refer to as recovery bias. To deal with this bias, instead of balancing representations and discarding information to block the relationship between observed covariates and treatments, we propose a novel Contrastive regularizer (Cr) that learns a proxy representation capturing all the relevant information in the unobserved confounders. Cr builds on contrastive learning (He et al., 2020; Chen et al., 2020; Grill et al., 2020), which regularizes the representation space using positive and negative pairs. In Cr, a positive pair consists of the treatments and proxies from the same sample, and a negative pair consists of the treatments from one sample and the proxies from a different sample.

With an ideal representation of the confounders, we could adopt a balancing method, such as the generalized propensity score (Hirano & Imbens, 2004), to eliminate the confounding bias. However, covariate balancing methods rely on correctly specified models. Without any prior on the propensity score model, i.e., the conditional distribution of the treatment given the covariates, the effect estimate would still be biased, especially for high-dimensional data and continuous treatments. Besides, balancing methods still suffer from the extreme value problem. Although recent methods (Fong et al., 2018; Vegetabile et al., 2021; Huling et al., 2021) propose to clip the score values or to optimize the balancing weights directly, they still fail on complex data, especially in the multi-continuous treatment setting. A balancing method that has no extreme values and is adapted to unobserved confounders is therefore needed.
Therefore, to control for the bias from treatment assignment, we propose to rank the weights obtained from the inverse propensity score for more effective balancing. Based on the proxy representation learned above, we sort the propensity-score-based weights in descending order and record their ranks (their positions in the sorted data) as rank weights (Rw), which we argue theoretically are effective and robust weights for treatment effect estimation. Combining the Contrastive regularizer (Cr) and Rank weighting (Rw), we propose a neural network framework, CRNet, to alleviate the outcome approximation bias in estimating the Average Dose-Response Function (ADRF). CRNet estimates the effects of multiple continuous treatments with high-dimensional proxy variables. Empirically, we demonstrate that CRNet achieves state-of-the-art performance on both synthetic and semi-synthetic datasets.

2. RELATED WORK

Causal effect identification with proxy methods. The proxy approach (Guo et al., 2020) assumes that the unobserved confounders can be recovered from the observed covariates. CEVAE (Louizos et al., 2017) and intact-VAE (Wu & Fukumizu, 2021) recover unobserved confounders under a VAE (Kingma et al., 2019) constraint. Negative controls (Lipsitch et al., 2010) assume that there exist two negative control variables: one related to the treatments and confounders, and another related to the outcomes and confounders. DFPV (Xu et al., 2021) introduces neural networks to model the bridge function (Miao et al., 2018) for estimating the causal effect. The setting of this paper is similar to the proxy setting, but our method needs no prior on the data distribution and outperforms the other methods in our experiments.

Estimation methods for continuous treatments. For estimating continuous treatment effects, one branch of methods, including splines (Imai & Van Dyk, 2004), kernel methods (Flores et al., 2012), ensemble methods (Hill, 2011; Wager & Athey, 2018) and representation-based methods (Schwab et al., 2020; Nie et al., 2021; Bica et al., 2020), models the relationship between treatments and outcomes. There is also a branch of methods (Hirano & Imbens, 2004; Imai & Van Dyk, 2004; Robins et al., 2000; Vegetabile et al., 2021; Arbour et al., 2021; Huling et al., 2021)

3. PRELIMINARIES

Notation. We use uppercase letters for random variables (e.g., $A$) and lowercase letters for their realizations (e.g., $a$). Bold characters (e.g., $\mathbf{A}$) denote vectors; other characters denote scalars. Given a variable $A^p_i$, the superscript $p$ represents the dimension of $A$ and the subscript $i$ denotes the $i$-th sample of $A$. $D_A$ refers to the total number of dimensions of $A$, and $N_A$ to the total number of samples of $A$. A real-valued sample $\{X^p_i, T^q_i, U^r_i, Y_i\} \in \mathbb{R}^{p+q+r+1}$ denotes a random sample with observed covariates $X_i \in \mathbb{R}^p$, treatments $T_i \in \mathbb{R}^q$, unobserved confounders $U_i \in \mathbb{R}^r$ and outcome $Y_i \in \mathbb{R}$, and $\{X_i, T_i, U_i, Y_i\}_{i=1}^{n}$ refers to a set of $n$ such samples. $\mathbb{E}$ denotes expectation and $\mathbb{P}$ denotes a probability density function. A calligraphic letter such as $\mathcal{H}$ denotes a hypothesis space.

3.1. PROBLEM SETUP

As shown in Fig. 1(a), we focus on the treatment effect estimation problem for Continuous and multi-dimensional Treatments in the setting where confounders are unobserved but high-dimensional Proxy variables for them are available (the CTP problem for short). Specifically, we formalize a proxy as follows.

Definition 1 (Proxy). A proxy is an observed covariate that is a noisy view of the unobserved confounder. Formally, $X = f^*(U, \epsilon_1)$, where the noise term satisfies $\epsilon_1 \perp \{T, U\}$ and $f^*$ denotes the true function.

Without loss of generality, we focus on estimating the Average Dose-Response Function (ADRF) curve. We denote the true ADRF by $\mathrm{ADRF}^*$.

Definition 2 (ADRF). The ADRF is the potential outcome of continuous treatments over the population:

$$\mathrm{ADRF}^* = \mathbb{E}[Y_i(T_i = t)] = \mathbb{E}[\phi^*(U, do(t))], \tag{1}$$

where $do(t)$ denotes the do-operation on the treatment, so that $do(t) \perp \{U\}$.

3.2. MOTIVATION

To analyse the complex CTP problem, we simplify the ADRF estimation by considering an additive regression model $\phi(X_i \mid t)$ given the observed $t$ with no sample selection bias, following Imai et al. (2008):

$$\phi(X_i \mid t) = \hat{\varphi}(U_i \mid t) + \hat{h}(\epsilon_{1i} \mid t). \tag{2}$$

The estimated $\widehat{\mathrm{ADRF}}$ of $\phi(X_i \mid t)$ can then be formulated as

$$\widehat{\mathrm{ADRF}} = \mathbb{E}[\phi(X_i \mid t)], \tag{3}$$

where $t$ denotes the observed treatment, for which $t \not\perp \{U\}$.

We set $\widehat{\mathrm{ADRF}}$ as the baseline and define the estimation error as

$$\Delta = \mathbb{E}[\phi^*(U, do(t)) - \phi^*(U \mid t)] + \mathbb{E}[\phi^*(U \mid t) - \hat{\varphi}(U_i \mid t)] - \mathbb{E}[\hat{h}(\epsilon_{1i} \mid t)]. \tag{4}$$

Given Eq. (4) (the detailed derivation is in the Appendix), we denote the first error term $\Delta_T = \mathbb{E}[\phi^*(U, do(t)) - \phi^*(U \mid t)]$ as the bias from treatment assignment, the second term $\Delta_Y = \mathbb{E}[\phi^*(U \mid t) - \hat{\varphi}(U_i \mid t)]$ as the bias from outcome approximation, and the third term $\Delta_{\epsilon_1} = -\mathbb{E}[\hat{h}(\epsilon_{1i} \mid t)]$ as the bias from the recovery of $U$. Thus, we decompose the estimation error $\Delta$ into

$$\Delta = \Delta_{\epsilon_1} + \Delta_T + \Delta_Y. \tag{5}$$

We then divide the problem of ADRF estimation with multi-continuous treatments and proxies into three components:

1. Reduce the recovery bias $\Delta_{\epsilon_1}$ in ADRF estimation.
2. Reduce the bias $\Delta_T$ from the treatment assignment of $T$ on $U$.
3. Reduce the bias $\Delta_Y$ from the outcome approximation.

3.3. ASSUMPTIONS

Throughout this paper, we assume that two common assumptions hold: Assumption 1, the Stable Unit Treatment Value Assumption (SUTVA), and Assumption 2, the overlap/positivity assumption (Imbens & Rubin, 2015). Moreover, we make the following assumptions.

Assumption 3 (Latent unconfoundedness). The potential outcome is independent of the treatment assignment given the unobserved confounders. Formally, $Y(t) \perp T \mid U$.

Assumption 4 (Proxy assumption). The proxy is independent of the treatment and the outcome given the unobserved confounder. Formally, $X \perp \{T, Y\} \mid U$.

To eliminate the recovery error $\Delta_{\epsilon_1}$, following Louizos et al. (2017), we treat it as a self-supervised representation learning problem: recovering the latent representation $U$ from $\mathbb{P}(t, X, y)$, that is, estimating $\mathbb{P}(U \mid t, X, y)$. We formulate this problem as

$$\mathbb{P}(y \mid t, X) = \int_U \mathbb{P}(y \mid t, X, U)\,\mathbb{P}(U \mid X, t)\,dU = \int_U \mathbb{P}(y \mid t, U)\,\mathbb{P}(U \mid X, t)\,dU,$$

where the second equality uses Assumption 4 (for the discussion of proxy identification, see the Appendix). We then make the following assumptions.

Assumption 5 (Recoverability). The density $\mathbb{P}(U, t, y)$ of the latent confounders $U$ can be approximately recovered solely from the observations $\{X, t, y\}$.

Assumption 6 (Proxy representation). With proxies $X$, there exists some representation $E(X \mid t)$ (briefly, $E(X)$) such that $E(X) \sim \mathbb{P}(U)$, which means $Y(t) \perp U \mid E(X)$ for a potential outcome with the specific treatments $t$.

4.1. CONTRASTIVE REGULARIZER

Based on Assumption 6, the latent representation $U$ is obtained by approximating the representation $E(X)$. Existing methods (Louizos et al., 2017; Bica et al., 2020) address this problem with VAEs (Kingma et al., 2019), GANs (Goodfellow et al., 2020), etc. They all rely on a strong prior on the density form of $U$. In this paper, inspired by Eq. (5), we propose a novel contrastive learning model to preserve $U$ and eliminate $\Delta_{\epsilon_1}$ from data without an explicit distribution prior.

Contrastive learning. There exist two functions $f \in \mathcal{F}$ and $g \in \mathcal{G}$, which encode the representations $f(X, \epsilon_1)$ and $g(T, \epsilon_2)$, satisfying $s(f(X_i, \epsilon_1), g(T_i, \epsilon_2)) \gg s(f(X_j, \epsilon_1), g(T_i, \epsilon_2))$ for $i \neq j$, where $s(\cdot, \cdot)$ is a function that measures the similarity between representations. Contrastive learning approximates the latent representations by constructing contrastive samples (similar and dissimilar instances): similar instances are pulled closer in the projection space while dissimilar instances are pushed further apart, which maximizes a lower bound on the mutual information.

Under the CTP setting, even if we control the observed proxies $X$, the spurious association derived from $U$ cannot be completely eliminated by traditional representation algorithms. Therefore, we no longer rely on a representation balancing algorithm to cut off the relationship between $X$ and $T$ (even if we did, accurate estimation could not be guaranteed). Instead, we propose to strengthen the association between $X$ and $T$ using contrastive learning (Jaiswal et al., 2020), modeling the proxy representation $E(X)$ with a neural network $E(\cdot)$ to represent the information from $U$.

The essential ingredient of the contrastive approach is the set of contrastive pairs used to model the representations. Inspired by Arbour et al. (2021) and Li et al. (2020), we construct contrastive pairs without discretizing the treatments: $X$ and $T$ from the original sample form positive pairs $\{(X_i, T_i)\}$, and $X$ and $T$ from the permuted sample (obtained by shuffling $X$ and $T$ across the data) form negative pairs $\{(X_j, T_i)\}$. Then, as shown in Fig. 2, we set $E(X \mid t) = s(f(X), g(t))$ and adopt the NLL (negative log-likelihood) loss (Chen et al., 2020) to design a novel contrastive loss to model $\mathbb{P}(U \mid t, X)$:

$$\ell_{Cr}(f(X), g(t)) = -\log \frac{e^{s(f(X_i), g(t))}}{\sum_{j=1}^{N} e^{s(f(X_j), g(t))}}, \tag{6}$$

where $s(f(X), g(t))$ denotes the cosine similarity $\frac{f(X) \cdot g(t)}{\|f(X)\| \, \|g(t)\|}$. In contrastive terms, the representations $\{g(T)\}$ are queries and the representations $\{f(X)\}$ are keys. For a query $g(T_i)$, the positive key is $f(X_i)$ from the same sample $i$, and the cosine similarity $s(f(X_i), g(T_i))$ in the numerator of Eq. (6) is high. In contrast, the representation $f(X_j)$ with $i \neq j$ is a negative key of $g(T_i)$, and the cosine similarity $s(f(X_j), g(T_i))$ in the denominator of Eq. (6) is low.

Figure 2: Contrastive regularizer. The covariates $X$ are transformed to $f(X)$ via MLPs $F$. In practice (Chen et al., 2020), the representation $f(X)$ is not directly constrained by the contrastive loss; $f(X)$ is transformed to $p_X$ through a projection head $P_X$. The treatments $T$ are processed in a similar way. $p_T$ and $p_X$ are constrained by $\ell_{Cr}(f(X), g(T))$. For brevity, we use $f(X)$ and $g(T)$ in the text to represent $p_X$ and $p_T$.

Constrained by $\ell_{Cr}$, we capture the proxy representations $\{f(X), g(T)\}$ from the data $\{X, T\}$. In this section, we propose to strengthen the association between treatments $T$ and covariates $X$ to recover the unmeasured common causes $U$ using $\ell_{Cr}$, which has two responsibilities: (1) strengthening the association between $X$ and $T$ in the same sample, and (2) constraining the representation space using $X$ and $T$ from the permuted samples. Thanks to contrastive learning, these two responsibilities complement each other. Under the contrastive learning constraints, the learned representation $E(X)$ rejects the information of $\epsilon_1$ and retains the information of $U$, which means that we eliminate the error term $\Delta_{\epsilon_1}$ in Eq. (5). Next, we consider the error term $\Delta_T$ in Eq. (5).
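The contrastive loss in Eq. (6) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the CRNet implementation: the function names `cosine` and `contrastive_loss`, the toy vectors, and the choice to average the loss over all queries are assumptions introduced here, and the vectors stand in for the projected representations $p_X$ and $p_T$ rather than outputs of trained networks.

```python
import math

def cosine(u, v):
    """Cosine similarity s(u, v) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(fx, gt):
    """Average of Eq. (6) over all queries g(T_i) (averaging is an assumption).

    fx[i] and gt[i] come from the same sample (positive pair); every
    fx[j] with j != i acts as a negative key for the query gt[i].
    """
    n = len(fx)
    total = 0.0
    for i in range(n):
        sims = [cosine(fx[j], gt[i]) for j in range(n)]
        log_denominator = math.log(sum(math.exp(s) for s in sims))
        total += -(sims[i] - log_denominator)
    return total / n

# Hypothetical toy representations: aligned pairs point in similar directions.
aligned_f = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_g = [[2.0, 0.1], [0.1, 2.0], [-2.0, 0.1]]
# Shuffling the treatments breaks the pairing, as in the permuted sample.
shuffled_g = [aligned_g[1], aligned_g[2], aligned_g[0]]

loss_aligned = contrastive_loss(aligned_f, aligned_g)
loss_shuffled = contrastive_loss(aligned_f, shuffled_g)
```

In this toy setting `loss_aligned` is smaller than `loss_shuffled`, which is the behaviour the regularizer relies on: minimizing the loss pushes representations of $X$ and $T$ from the same sample together.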

4.2.1. MULTIPLE TREATMENT SCORE WEIGHT

To estimate the ADRF given the proxy representation, existing methods usually apply balancing methods (footnote 1) to approximate the density $\mathbb{P}(T \mid U)$ for balancing score weights; that is, they adopt the inverse of the approximated $\mathbb{P}(T \mid U)$ as the sample weights: $W_i = \frac{1}{\mathbb{P}(T_i \mid U_i)}$. However, $\mathbb{P}(T \mid U)$ is sensitive to model misspecification and can hardly be estimated accurately, especially for high-dimensional $U$ and multi-dimensional continuous $T$. To approximate $\mathbb{P}(T \mid U)$ under the CTP setting, we adopt a mixture density network (MDN, which uses a neural network to learn a Gaussian mixture model; Bishop, 1994) to model $\mathbb{P}(T \mid U) = \prod_{q=1}^{Q} \mathbb{P}(t^q \mid U)$. As shown in Fig. 3, we apply the MDN to approximate $\mathbb{P}(T \mid U)$ and obtain the sample weight $W_i$ as

$$W_i = \frac{1}{\mathbb{P}(T_i \mid U_i)}, \qquad \mathbb{P}(T_i \mid U) = \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(T_i \mid \mu_k, \Sigma_k^2). \tag{7}$$

The loss function of Rw is

$$\ell_{Rw} = -\frac{1}{n} \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(T_i \mid \mu_k, \Sigma_k^2), \tag{8}$$

where $K$ is the number of sub-Gaussian models $\mathcal{N}(\cdot)$ in the Gaussian mixture model, $(\mu_k, \Sigma_k)$ are the mean vector and covariance matrix of the $k$-th sub-Gaussian model, and $\alpha_k$ is the probability that the observation belongs to the $k$-th sub-Gaussian model.

Footnote 1: Given that both matching and stratification methods can be considered special cases of weighting methods, and that these two methods require discretization of $T$ when it is continuous, this paper focuses on weighting methods.
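To make Eq. (7) and Eq. (8) concrete, the following sketch evaluates a Gaussian mixture density, the corresponding inverse-density weights, and the negative log-likelihood. It is a simplified illustration under stated assumptions: the treatment is one-dimensional, the mixture parameters are fixed by hand instead of being produced by a network from $U_i$, the second argument of each Gaussian is treated as a standard deviation, and all function names are hypothetical.

```python
import math

def gaussian_pdf(t, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at t."""
    z = (t - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(t, alphas, means, stds):
    """Mixture density sum_k alpha_k N(t | mu_k, sigma_k^2), as in Eq. (7)."""
    return sum(a * gaussian_pdf(t, m, s) for a, m, s in zip(alphas, means, stds))

def mdn_nll(treatments, params):
    """Average negative log-likelihood of Eq. (8).

    params[i] = (alphas, means, stds) holds the mixture parameters for sample i;
    in an MDN these would be network outputs conditioned on U_i.
    """
    total = 0.0
    for t, (alphas, means, stds) in zip(treatments, params):
        total -= math.log(mixture_density(t, alphas, means, stds))
    return total / len(treatments)

def inverse_density_weights(treatments, params):
    """Sample weights W_i = 1 / P(T_i | U_i)."""
    return [1.0 / mixture_density(t, *p) for t, p in zip(treatments, params)]

# Hypothetical two-component mixture shared by all samples.
shared = ([0.5, 0.5], [3.0, 6.0], [1.0, 0.5])
treatments = [3.0, 6.0, 10.0]
params = [shared] * len(treatments)

weights = inverse_density_weights(treatments, params)
nll = mdn_nll(treatments, params)
```

The treatment value 10.0 lies far in the tail of the mixture, so its inverse-density weight is many orders of magnitude larger than the others. This is the kind of extreme value that raw inverse-density weights can produce.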


4.2.2. RANK WEIGHT

However, the sample weight $W$ obtained from $\mathbb{P}(T \mid U)$ still suffers from the extreme value problem. Although recent methods (Fong et al., 2018; Vegetabile et al., 2021; Huling et al., 2021) propose to clip the score values or to optimize the balancing weights directly, they still face a dilemma when the data become more complex, especially in the multi-continuous treatment setting. A balancing method that has no extreme values and is adapted to unobserved confounders is therefore needed. Based on the proxy representation learned above, the core contribution of this paper is the rank weight for more effective balancing.

Motivated by the balancing score problem, we normalize the IPW weights obtained from Eq. (7), $\{W_i\}_{i=1}^{n} \in [0, 1]$, and sort them in descending order. We record their ranks (their positions in the sorted data) as $R_i \in \mathbb{N}$ and the difference between adjacent sorted weights as the stride $\delta \in \mathbb{R}$. We write the IPW as $\xi(R, \delta) = \frac{1}{\mathbb{P}(t \mid U)}$. When the sample size $n$ is limited, a large $\delta$ causes extreme values, whereas $\delta \to 0$ as $n \to \infty$. We therefore define the rank weight in the form $\xi(R)$ and propose the following.

Proposition 1. There exists some rank weight $\xi(R)$ such that, as $n \to \infty$, $\frac{e^{-\xi(R)}}{Z} \to \frac{1}{\mathbb{P}(t \mid U)}$, where $Z = \sum_{R} e^{-\xi(R)}$ is the normalizing constant of $\xi(R)$.

Proposition 1 shows that causal effect estimation with the rank weight approximates the unbiased estimation of the causal effect. We detail the definition and the proposition in the Appendix. When $n$ is limited, to eliminate the stride, we set it to a constant $\delta = \frac{1}{n}$ and obtain the rank weight

$$\xi(R_i) \approx R_i. \tag{9}$$

Eq. (9) shows that the Rw method is adapted to Cr and can be applied to data of any dimension because it depends only on the rank information of the weights. As $n \to \infty$, the rank weight approximates $\frac{1}{\mathbb{P}(t \mid U)}$. With limited data, the rank weight does not rely on a specified model and addresses the extreme value problem.

Fig. 3 shows the process of rank weighting. After training the MDN (footnote 2), we infer $\mathbb{P}(T \mid U) = \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(T \mid \mu_k, \sigma_k^2)$. Then we sort $\mathbb{P}(T \mid U)$ in descending order and obtain the rank weight $Rw_i = \xi(R_i)$. The bias $\Delta_T$ of Eq. (5) can then be eliminated by weighted regression with the rank weight.
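The rank weighting step can be sketched as follows. This is a minimal illustration under assumptions that are not fixed by the text: ranks start at 1 for the highest density, ranks are divided by $n$ so that the weights lie in $(0, 1]$, ties are broken by input order, and the function name `rank_weights` and the toy densities are hypothetical.

```python
def rank_weights(densities):
    """Rank weights from estimated densities P(T_i | U_i).

    Densities are sorted in descending order; the sample with the highest
    density gets rank 1 and the sample with the lowest density gets rank n.
    Ranks are divided by n (an assumption) so the weights lie in (0, 1].
    """
    n = len(densities)
    order = sorted(range(n), key=lambda i: densities[i], reverse=True)
    weights = [0.0] * n
    for rank, index in enumerate(order, start=1):
        weights[index] = rank / n
    return weights

# Hypothetical densities; the third sample lies far in the tail.
densities = [0.20, 0.40, 4.6e-12, 0.05]
ipw = [1.0 / p for p in densities]   # raw inverse-density weights
rw = rank_weights(densities)         # rank weights
```

Both weightings order the samples identically, but the ratio between the largest and smallest raw inverse-density weights here exceeds $10^{10}$, while the rank weights differ by at most a factor of $n = 4$.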

4.3. CRNET

Combining the Contrastive regularizer and the Rank weight, we propose a neural network framework, CRNet, to estimate the ADRF under the CTP setting and to eliminate $\Delta_Y$.

Figure 4: CRNet. For the training procedure, the proxy representations $E(X \mid t) = \{f(X), g(t)\}$, constrained by the contrastive loss $\ell_{Cr}(X, t)$, are concatenated and fed to the MLPs $H$ and the MDN $D$ to obtain the estimated outcome $\hat{Y}$ and the rank weights $Rw$. The final objective is to minimize the weighted loss in Eq. (10). For the inference procedure, the estimated ADRF is obtained by $h(f(X), g(t))$.

As shown in Fig. 4, the overall CRNet architecture contains three components: (1) A contrastive regularizer, which contains two MLP heads that encode the proxies and the observed treatments into representations $\{f(X), g(T)\}$. (2) A sample weight learner named rank weighting. This module optimizes the rank weights on $\ell_{Rw}$ in Eq. (8) using the representations $\{f(X), g(T)\}$. (3) A base MLP encoder $H$ that concatenates $f(X)$ and $g(T)$ and transforms them into the estimate $\hat{Y}$, which approximates the observed $Y$ via the weighted regression loss $\ell_{final}(W, X, T, Y)$. Combining $\ell_{Cr}(X, T)$ and $\ell_{Rw}(W)$, the final loss is defined as

$$\ell_{final} = \sum_{i=1}^{N} Rw_i \cdot (Y_i - \hat{Y}_i)^2 + \alpha \cdot \ell_{Cr}(X, T) + \beta \cdot \ell_{Rw}(W), \tag{10}$$

where $Rw_i$ is the rank weight optimized by $\ell_{Rw}(W)$, and $\alpha$ and $\beta$ are the hyperparameters of Cr and Rw, respectively.
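The way the three terms of the final objective combine can be written out directly. The sketch below only evaluates the scalar loss for given inputs; it does not train anything. The function name `final_loss`, the toy numbers, and the default values of `alpha` and `beta` are assumptions made for illustration, and the two regularization losses are passed in as already-computed scalars.

```python
def final_loss(y_true, y_pred, rank_w, loss_cr, loss_rw, alpha=1.0, beta=1.0):
    """Rank-weighted squared error plus the Cr and Rw loss terms.

    rank_w[i] is the rank weight of sample i; loss_cr and loss_rw are the
    scalar values of the contrastive loss and the MDN likelihood loss.
    """
    weighted_sq_error = sum(
        w * (y - y_hat) ** 2 for w, y, y_hat in zip(rank_w, y_true, y_pred)
    )
    return weighted_sq_error + alpha * loss_cr + beta * loss_rw

# Hypothetical values for three samples.
total = final_loss(
    y_true=[1.0, 2.0, 3.0],
    y_pred=[1.5, 2.0, 2.0],
    rank_w=[0.25, 0.5, 1.0],
    loss_cr=0.8,
    loss_rw=1.2,
    alpha=0.5,
    beta=0.1,
)
```

With these numbers the weighted squared error is 1.0625, and the two regularizers contribute 0.4 and 0.12, so the total is 1.5825. The third sample dominates because it has both the largest residual and the largest rank weight.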

5. EXPERIMENTS

To evaluate the performance of CRNet on the CTP problem, we compare it against 6 statistical methods and 5 deep learning methods as baselines on ten simulation datasets and four semi-synthetic datasets derived from IHDP and News. All experiments are implemented using PyTorch (Paszke et al., 2019) on an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz.

Baselines. We compare our model with the following baselines. For statistical methods, we use (1) [...] (Hainmueller, 2012). (6) DCOWS (Huling et al., 2021), a balancing method based on the distance covariance (Székely & Rizzo, 2009). For representation-based methods, we apply (7) NN, a neural network built from fully connected MLPs. (8) MDN (Bishop, 1994), a mixture density network for modelling the density. (9) DRNet (Schwab et al., 2020), a multi-head deep model stratified according to $T$; we use a modified version (Nie et al., 2021) for estimating the ADRF.

Datasets. We evaluate the performance of CRNet on ten simulation datasets and four semi-synthetic datasets. For the simulation experiments, we design 10 simulation datasets. Five of them are named Data X $D_T$ $D_X$ (e.g., Data X 1 5 denotes a simulation with 1 treatment, 5 covariates and no unobserved confounders); the other five, which have unobserved confounders, are named Data U $D_T$ $D_X$. We also conduct 4 semi-synthetic experiments on 2 real-world datasets: IHDP (footnote 3) and News (footnote 4). IHDP contains 747 observations on 25 covariates. Following Schwab et al. (2020), we sample 5000 samples with 2870 covariates from News. The semi-synthetic experiment on IHDP is named IHDP $D_T$, and the other 3 semi-synthetic experiments on News are named News $D_T$. For detailed descriptions of the datasets, models and hyperparameters, see the Appendix.

Metrics. For all experiments, we perform 30 replications ($E = 30$) and report the average mean squared error (MSE) and the standard deviation (SD) of the average dose-response function estimation. For correlation measurement, we adopt the distance correlation (dCor) to evaluate the quality of the proxy representation $E(X)$:

$$\mathrm{dCor}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}, \qquad \mathrm{dVar}(X) = \mathrm{dCov}(X, X), \qquad \mathrm{dCov}(X, T) := \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{i,j} B_{i,j},$$

where $A_{i,j} := a_{i,j} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}$, $a_{i,j} = \|X_i - X_j\|_2$, $\bar{a}_{i\cdot} = \frac{1}{n} \sum_{j=1}^{n} a_{i,j}$, $\bar{a}_{\cdot j} = \frac{1}{n} \sum_{i=1}^{n} a_{i,j}$ and $\bar{a}_{\cdot\cdot} = \frac{1}{n^2} \sum_{i,j=1}^{n} a_{i,j}$. $B_{i,j}$ is defined analogously from the second argument.
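The dCor metric can be computed directly from its definition with double-centered pairwise distance matrices. The sketch below follows the displayed formula literally, so the value it returns is what some references call the squared sample distance correlation. It uses only the standard library, runs in $O(n^2)$ time and memory, and the function names and toy data are assumptions introduced for illustration.

```python
import math

def _centered_distances(samples):
    """Double-centered Euclidean distance matrix A (or B) for a list of points."""
    n = len(samples)
    dist = [[math.dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in dist]
    col_mean = [sum(dist[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [dist[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]

def dcov(x, y):
    """dCov as defined in the text: (1 / n^2) * sum_ij A_ij * B_ij."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    n = len(x)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)

def dcor(x, y):
    """dCor = dCov(x, y) / sqrt(dVar(x) * dVar(y)); 0.0 if either variance is 0."""
    denom = math.sqrt(dcov(x, x) * dcov(y, y))
    return dcov(x, y) / denom if denom > 0 else 0.0

# Each sample is a tuple, so multi-dimensional points work the same way.
x = [(0.0,), (1.0,), (2.0,), (3.0,)]
y_linear = [(2.0 * v[0] + 1.0,) for v in x]   # exact linear function of x
y_constant = [(5.0,)] * len(x)                # carries no information about x

dcor_linear = dcor(x, y_linear)
dcor_constant = dcor(x, y_constant)
```

For an exact linear relationship the statistic reaches its maximum of 1, and for a constant second argument the guard returns 0.0 because the distance variance vanishes.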

Experimental results on simulation datasets

To assess the performance of CRNet, we conduct simulation experiments with increasing dimensions of treatments and proxies. As shown in Table 1, all methods except GPS perform well on the low-dimensional Data U 1 5. All baselines fail when there are multiple treatments, as in Data U 2 200 and Data U 5 200. On these two datasets, we found that CEVAE, which performs well in low dimensions, fails to converge, as has also been reported by Rissanen & Marttinen (2021). CRNet outperforms the other methods across the different settings. We further verify the effectiveness of the Cr module in CRNet below.

Experimental results on the effectiveness of the Cr block. In the setting with multiple continuous treatments and proxies, we propose Cr to model $E(X)$ so that it retains the information from the unobserved confounders. Representation-based approaches map proxies into a low-dimensional representation space, which loses information predictive of the treatment variable. As shown in Fig. 5(a) (footnote 5), the correlation between $E(X)$ and $T$ obtained by conventional methods remains weak. Because Cr regularizes the proxy representation $E(X)$ by contrastive learning to retain this predictive information, the correlation obtained by CRNet and CRNet(ft) (footnote 6) is strong. In the experiment of Fig. 5(b), we first train a U-to-T prediction network to obtain the representation $T(U)$, which represents the relationship between $T$ and $U$. X T(U) denotes the dCor between $X$ and $T(U)$. DRNet refers to the dCor between $T(U)$ and the representation $f(X)$; the other methods are handled similarly. The results demonstrate that Cr successfully regularizes the representation $E(X)$ between $X$ and $T$, whereas the other methods do not.

Experimental results on the effectiveness of the Rw block. Our downstream block for estimation is Rw. It is reliable if and only if the proxy representation is accurately measured. To evaluate the performance of the Rw block, we conduct 5 simulation experiments with no unobserved confounders; the results are reported in the table for simulation Data X $D_T$ $D_X$.

Experimental results on real-world datasets. We further verify the performance of CRNet on the real-world datasets IHDP and News. As shown in Table 3, the traditional methods Bart and Causal Forest cannot estimate the treatment effect accurately and suffer from the imbalance of the high-dimensional proxies across different treatments. CRNet obtains a high-quality representation $E(X)$ and retains information predictive of the treatments in the representation through the contrastive regularizer, whereas the other deep learning methods fail to capture the rich information between high-dimensional covariates and treatments. Therefore, CRNet shows robust performance and achieves state-of-the-art results in all real-world experiments.

A APPENDIX

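A.1 IDENTIFICATION

Given the latent unconfoundedness and proxy assumptions, the causal effect is not identified by conditioning on $X$: because $\epsilon_1 \not\perp U \mid X$, we have $Y(t) \not\perp T \mid X$. The same problem appears in the ADRF adjustment formula. The true ADRF is identified as

$$\mathrm{ADRF}^* = \mathbb{E}[Y(t)] = \mathbb{E}_U[\mathbb{E}[Y(t) \mid U]] = \mathbb{E}_U[\mathbb{E}[Y(t) \mid T = t, U]] = \mathbb{E}_U[\mathbb{E}[Y \mid T = t, U]].$$

When only the proxy is available, the ADRF is computed as

$$\widehat{\mathrm{ADRF}} = \mathbb{E}[Y(t)] = \mathbb{E}_X[\mathbb{E}[Y(t) \mid X]] = \mathbb{E}_X[\mathbb{E}[Y(t) \mid T = t, X]] = \mathbb{E}_X[\mathbb{E}[Y \mid T = t, X]] \neq \mathbb{E}_U[\mathbb{E}[Y \mid T = t, U]] = \mathrm{ADRF}^*.$$

The adjustment formula of the ADRF based on the proxy clearly differs from that of the true ADRF because $Y(t) \not\perp T \mid X$; this induces the recovery error $\Delta_{\epsilon_1}$ in the ADRF estimation phase if the unmeasured confounder $U$ is not correctly specified. The performance of using proxies to estimate the ADRF depends on the degree to which $U$ is recovered.

A.2 PROOF OF EQUATION (4)

We set $\widehat{\mathrm{ADRF}}$ as the baseline and define the estimation error as

$$\begin{aligned}
\Delta &= \mathrm{ADRF}^* - \widehat{\mathrm{ADRF}} \\
&= \mathbb{E}[Y_i(T_i = t)] - \mathbb{E}[\phi(X_i \mid t)] \\
&= \mathbb{E}[\phi^*(U, do(t))] - \mathbb{E}[\hat{\varphi}(U_i \mid t) + \hat{h}(\epsilon_{1i} \mid t)] \\
&= \mathbb{E}[\phi^*(U, do(t))] - \mathbb{E}[\hat{\varphi}(U_i \mid t)] - \mathbb{E}[\hat{h}(\epsilon_{1i} \mid t)] \\
&= \mathbb{E}[\phi^*(U, do(t))] - \mathbb{E}[\phi^*(U \mid t)] + \mathbb{E}[\phi^*(U \mid t)] - \mathbb{E}[\hat{\varphi}(U_i \mid t)] - \mathbb{E}[\hat{h}(\epsilon_{1i} \mid t)] \\
&= \mathbb{E}[\phi^*(U, do(t)) - \phi^*(U \mid t)] + \mathbb{E}[\phi^*(U \mid t) - \hat{\varphi}(U_i \mid t)] - \mathbb{E}[\hat{h}(\epsilon_{1i} \mid t)].
\end{aligned}$$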
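The identification gap discussed in A.1 can be illustrated with a small simulation. This is a toy linear-Gaussian example with a single scalar treatment, not one of the paper's datasets: the data-generating equations, the seed, and the helper `ols_two_regressors` are all assumptions made for illustration. Adjusting for the true confounder $U$ recovers the treatment slope, while adjusting for the noisy proxy $X$ leaves residual confounding.

```python
import random

def ols_two_regressors(y, x1, x2):
    """Slopes of y on (x1, x2) with an intercept, via centered normal equations."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(0)
n = 20000
u = [rng.gauss(0, 1) for _ in range(n)]                       # unobserved confounder
x = [ui + rng.gauss(0, 1) for ui in u]                        # proxy: noisy view of U
t = [ui + rng.gauss(0, 1) for ui in u]                        # treatment confounded by U
y = [ti + 2 * ui + rng.gauss(0, 1) for ti, ui in zip(t, u)]   # true slope on t is 1

slope_given_u, _ = ols_two_regressors(y, t, u)   # adjust for the true confounder
slope_given_x, _ = ols_two_regressors(y, t, x)   # adjust for the proxy only
```

In this setup the population slope on the treatment is 1 when adjusting for $U$ and $1 + 2/3$ when adjusting for $X$, so `slope_given_u` lands near 1 while `slope_given_x` stays clearly above it. The gap shrinks as the proxy noise shrinks, which matches the statement that performance depends on how well $U$ is recovered.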

A.3 ANALYSIS FOR RANK WEIGHT

Based on the definition of rank, we define the corresponding index function of the rank as $I(R_i) = i$. We record the stride

$$\delta_{R_i} = \begin{cases} W_{I(R_i - 1)} - W_i, & 0 < R_i < n \\ W_i, & R_i = 0 \end{cases} \tag{14}$$

as the difference between two adjacent weights of the sorted data. We can then build a sequence model of $W$:

$$W_i = \begin{cases} W_{I(R_i - 1)} - \delta_{R_i}, & 0 < R_i < n \\ \delta_{R_i}, & R_i = 0. \end{cases} \tag{15}$$

Consider an extreme value example in which the maximum weight is much larger than the others. This happens because many unobserved weights lie between the largest and the second largest weights that we obtained. Combining Eq. (15) and the continuity of the probability density, it is clear that an excessive stride causes the extreme values of the weights. Moreover, sample weights with a large stride are also sensitive to misspecification, because a slight misspecification of a large stride induces significant bias.

Is the rank information alone then enough to achieve covariate balance? To answer this question, we rewrite the model in Eq. (15) as

$$W = \psi(R, I(R), \delta) = \xi(R, \delta), \tag{16}$$

where $W$ is the sample weight, $R$ is the rank, $I$ is the index function of $R$ and $\delta$ is the stride. The second equality holds because $I$ is a deterministic function of $R$. Based on Eq. (14), we notice that for $\{W_i\}_{i=1}^{n} \in [0, 1]$, if $n \to \infty$ then the stride tends to 0 because $W_{R_i} - W_{R_i - 1} \to 0$. This is similar to an unnormalized density. Therefore, we formulate the IPW in the form of a Gibbs distribution:

$$\frac{1}{\mathbb{P}(T \mid U)} = \frac{e^{-W}}{Z} = \frac{e^{-\xi(R, \delta)}}{Z},$$

where $Z = \sum_{R} e^{-\xi(R, \delta)}$. Letting $n \to \infty$, we have $\frac{e^{-\xi(R)}}{Z} \to \frac{e^{-\xi(R, \delta)}}{Z} = \frac{1}{\mathbb{P}(t \mid U)}$. We therefore propose the following.

Proposition. There exists some rank weight $\xi(R)$ such that, as $n \to \infty$, $\frac{e^{-\xi(R)}}{Z} \to \frac{1}{\mathbb{P}(t \mid U)}$, where $Z = \sum_{R} e^{-\xi(R)}$ is the normalizing constant of $\xi(R)$.

The proposition and Eq. (16) show that, as $n \to \infty$, using the rank weight $\xi(R)$ is enough to achieve covariate balance because the weights Rw approximate the IPW of the density $\mathbb{P}(T \mid U)$. That is, as $n \to \infty$, causal effect estimation with Rw approximates the unbiased estimation of the causal effect (for unbiased estimation with the IPW of $\mathbb{P}(T \mid U)$, see Imbens (2000)). When the data size is limited, the rank function $\xi(R)$ can effectively avoid the extreme value problem and alleviate the sensitivity to model misspecification. We therefore directly eliminate the stride by setting $\delta = \frac{1}{n}$ to obtain $\xi(R) \approx R_i$. This operation can be viewed as forcing the distribution of the stride to be uniform, which is biased. However, as $n \to \infty$, the obtained $\xi(R) \approx R_i$ approaches the true IPW.
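The rank, index function and stride used in this analysis can be computed for a small set of weights to see how one large stride corresponds to one extreme weight. The sketch below is illustrative only: ranks start at 0 for the largest weight, as in the case distinction of the stride definition, and the function names and the toy weights are assumptions introduced here.

```python
def ranks_and_strides(weights):
    """Return (order, rank, stride) for weights sorted in descending order.

    order[r] is the sample index at rank r (the index function I(r)),
    rank[i] is the rank R_i of sample i, and stride[i] is the stride of
    sample i: its own weight at rank 0, otherwise the gap to the weight
    ranked just above it.
    """
    n = len(weights)
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    rank = [0] * n
    stride = [0.0] * n
    for r, i in enumerate(order):
        rank[i] = r
        stride[i] = weights[i] if r == 0 else weights[order[r - 1]] - weights[i]
    return order, rank, stride

def rebuild_weights(order, stride):
    """Recover the weights from the strides by walking down the ranks."""
    rebuilt = [0.0] * len(order)
    for r, i in enumerate(order):
        rebuilt[i] = stride[i] if r == 0 else rebuilt[order[r - 1]] - stride[i]
    return rebuilt

# Hypothetical normalized weights with one extreme value.
weights = [0.05, 1.0, 0.02, 0.04]
order, rank, stride = ranks_and_strides(weights)
rebuilt = rebuild_weights(order, stride)
```

Here the stride between the largest and second largest weights is 0.95, while the remaining strides are at most 0.02. Replacing every stride with the same constant removes exactly this gap, which is the motivation for keeping only the rank information.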

A.4 EXPERIMENTAL DETAILS

The rules for defining symbols in this section are the same as in the main body. Note that when the superscript is specified as $A^{p=2}$, it means the dimension of $A$ is 2, and when it is not specified (e.g., $A^2$), it means the power of $A$ is 2.

A.4.1 DETAILS ON DATASETS

The dataset split and the dimension information corresponding to each data name are given in Table 4.

Synthetic datasets. We construct the synthetic datasets following EB (Vegetabile et al., 2021). For all simulation datasets, the true covariates $U^{p=1\cdots200}$ are constructed as $U^{p=1\cdots200} \sim \mathcal{N}(0, 1)$. For the Data X 1 5, Data X 1 50 and Data X 1 200 datasets, $T^{p=1}$ is constructed as

$$T^{p=1} = 0.5\,\mathcal{N}(3, 1) + 0.5\,\mathcal{N}(6, 0.5) + 1.5 * \dots$$

and $Y^{p=1}$ is constructed as

$$Y^{p=1} = \frac{1}{e^{-\sum_{p=1}^{p=2} T^{p}}} + e^{U^{p=1}} \dots$$



Footnote 2: Note that the rank weight is not specific to the MDN; it can be applied to any balancing weights.
Footnote 3: https://www.fredjo.com
Footnote 4: https://paperdatasets.s3.amazonaws.com/news.db
Footnote 5: X T refers to the dCor between the proxies X and the treatments T.
Footnote 6: CRNet(ft) means the correlation between f(X) and g(T).



Figure 1: (a) Causal Structure of Raw Data, i.e., Y ⊥ T | U; (b) Target Relationship from proxy representation, i.e., Y (T) ⊥ U | E(X).

Figure 3: Rank weighting. The proxy representation $E(X)$ is transformed into a Gaussian mixture distribution with $K$ Gaussian sub-models $\mathcal{N}(\alpha_k, \mu_k, \Sigma_k^2)$ via the MLP-based MDN. We then infer the estimated density $\mathbb{P}(t \mid U)$ and sort it in descending order to obtain the rank weight Rw.

(10) VCNet (Nie et al., 2021), a deep model that treats $T$ as a varying coefficient. (11) CEVAE (Louizos et al., 2017), a VAE-based model that constrains the representation of covariates.

Figure 5: Correlation of treatments, unobserved confounders, and covariates. In both figures, the abscissa represents the sample size, and the ordinate represents the value of dCor.

$$\dots \sum_{p=1} T^{p} + e^{X^{p=1}} + 2.1 * X^{p=2} + 2.2 * X^{p=3} + 2.3 * X^{p=4} + X^{p=5} + 4.0 * \sum_{p=151}^{p=200} X^{p} + I(D_T == 5) * (3 * \cos( \dots$$

For the Data U 1 5, Data U 1 50 and Data U 1 200 datasets, $T^{p=1}$ is constructed as

$$T^{p=1} = 0.5\,\mathcal{N}(3, 1) + 0.5\,\mathcal{N}(6, 0.5) + 1.5 * \sum_{p=1}^{p=3} U^{p} + 0.5 * \sum_{p=151}^{p=D_U} U^{p},$$

and $Y^{p=1}$ is constructed as

$$Y^{p=1} = \frac{1}{e^{-\sum_{p=1}^{p=2} T^{p}}} + e^{U^{p=1}} + 2.1 * U^{p=2} + 2.2 * U^{p=3} + 2.3 * U^{p=4} + U^{p=5} + I(D_U > 5),$$

where $D_U$ is the dimension of $U$ and $I$ is the indicator function. For the Data U 2 200 and Data U 5 200 datasets, $T^{p=1\cdots5}$ is constructed as

$$T^{p=2} = \mathcal{N}(4, 1) + 1.5 * U^{p=5}, \qquad T^{p=3\cdots5} = \mathcal{N}(p, 0.5) + \sum_{q=100}^{100+p} U^{q}.$$

$\ldots U_2 + 2.2\,U_3 + 2.3\,U_4 + U_5 + I(D_U > 5)$

The observed covariates $X_{p=1,\dots,200}$ of Data U 1 5, Data U 1 50, Data U 1 200, Data U 2 200 and Data U 5 200 are formulated as

$$X_{p=1,\dots,5} = U_p + \mathrm{linespace}(0, \tfrac{p}{10}, N_X), \qquad X_{p=5,\dots,200} = U_{p=5,\dots,200},$$

where $\mathrm{linespace}(0, \tfrac{p}{10}, N_X)$ denotes $N_X$ samples taken from $[0, \tfrac{p}{10}]$.
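The construction of U, the single-treatment T and the proxies X for the Data U datasets can be sketched as below. This is a hedged illustration rather than the authors' generator, and it rests on several assumptions that the text does not settle: `0.5 N(3, 1) + 0.5 N(6, 0.5)` is read as a weighted sum of two Gaussian draws with the second parameter used as a standard deviation, `linespace` is read as `numpy.linspace` with $N_X$ equal to the number of samples, and the shift is applied to $p = 1, \dots, 5$ where the two rules for X overlap at $p = 5$. The function name `make_data_u` is hypothetical, and the outcome Y is not generated.

```python
import numpy as np

def make_data_u(n, d_u=200, seed=0):
    """Sketch of U, T (one treatment) and proxies X for a Data U style dataset."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, d_u))                 # U_p ~ N(0, 1)

    # Assumption: weighted sum of two Gaussian draws, second argument is a std.
    base = 0.5 * rng.normal(3.0, 1.0, n) + 0.5 * rng.normal(6.0, 0.5, n)
    # 1.5 * sum_{p=1..3} U_p + 0.5 * sum_{p=151..D_U} U_p (empty when D_U < 151)
    T = base + 1.5 * U[:, :3].sum(axis=1) + 0.5 * U[:, 150:d_u].sum(axis=1)

    # Proxies: the first five covariates get a deterministic ramp on [0, p/10].
    X = U.copy()
    for p in range(1, min(5, d_u) + 1):
        X[:, p - 1] += np.linspace(0.0, p / 10.0, n)
    return U, T, X
```

For example, `make_data_u(n=1000, d_u=5)` corresponds to the five-dimensional case, where the second sum in T is empty and contributes zero.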

aim at balancing the covariate shift. Few previous works account for unobserved variables together with continuous treatment assignment bias. In this paper, we propose the contrastive regularizer to enable balancing methods in the presence of unobserved confounders. We also propose a new rank weighting method that produces no extreme values and is less sensitive to model misspecification. Combining Cr and Rw, we design a framework, CRNet, to estimate the effects of continuous treatments with proxies.

Results (MSE±SD) on simulation Data U D T D X

all experiments use the same backbone NN. In all experiments, GPS and MDN, which directly model the density P(T|X), induce excessive bias in ADRF estimation. The direct rank weight without Cr performs well in all experiments, and with Cr our rank weighting method outperforms the other weighting methods. This indicates that CRNet achieves state-of-the-art performance even when there are no unobserved confounders.

Results (MSE±SD) on simulation Data X D T D X

Results (MSE±SD) on semi-simulation Real-Data D T

We formulate the estimation error as three terms, arising from the recovery of the unobserved confounders, the treatment assignment and the approximation of the outcome. We propose the contrastive regularizer to constrain the proxy representation in representation space, which addresses the bias from the recovery of the unobserved confounders. Based on Cr, we propose a rank weighting method that eliminates the extreme-value problem and reduces the sensitivity of the treatment assignment model to misspecification. Combining Cr and Rw, we develop CRNet, adapted to the CTP problem, to reduce the outcome approximation bias. CRNet achieves state-of-the-art performance in estimating the ADRF on both synthetic and semi-synthetic data.

Brian G Vegetabile, Beth Ann Griffin, Donna L Coffman, Matthew Cefalu, Michael W Robbins, and Daniel F McCaffrey. Nonparametric estimation of population average dose-response curves using entropy balancing weights for continuous exposures. Health Services and Outcomes Research Methodology, 21(1):69-110, 2021.

Dataset description. Columns: $N_{train}/N_{test}$, $D_T$, $D_X$, $D_U$, $D_Y$.

annex

IHDP. The generation process of T and Y is formulated as: p=3,5,6 , 1) 2.1 + min(U p=3,5,6 , 1) + N(0, 0.25), Y = sin(3T) U P T = 4 1.2 - T + p=15 p=8 U p + N(0, 0.25). U is standardized to N(0, 1) and T is normalized to [0, 1]. The observed covariates X p=1•••25 are formulated as

News. The generation process of T and Y is formulated as: For News 2, ϵ = 2; for News 5, ϵ = 4. U is standardized to N(0, 1) and T is normalized to [0, 1]. , I{mod((p - 1), 5) ≡ 2} 0.1(10 + abs(U p ) , I{mod((p - 1), 5) ≡ 3} abs(U p ), I{mod((p - 1), 5) ≡ 4} Given f X , the observed covariates X p=1•••2870 are formulated as

A.4.2 DETAILS ON MODELS

We construct CRNet with depth 5. As shown in Fig. 2, F consists of 5 FCs with {256, 128, 128, 128, 128} hidden units. G consists of 5 FCs with {32, 64, 64, 32, 32} hidden units. The MDN module consists of 3 FCs with {20, 20, 20} hidden units. NN consists of 4 FCs with {32, 32, 32, 1} hidden units. We implement GPS, Bart, CF and DRNet following Schwab et al. (2020). We improve on DRNet and implement VCNet following Nie et al. (2021). We implement CEVAE following Louizos et al. (2017). We normalize the simulation data to [0, 1] for the conditional density estimator in DRNet and VCNet.
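To make the layer widths concrete, the two encoders can be sketched as plain fully connected stacks. This is an illustrative NumPy sketch only: the ReLU activation, the He-style initialization and the input dimensions (200 for X, 1 for T) are assumptions not stated in this section, and the helper names are hypothetical. How F, G, the MDN and NN are wired together follows Fig. 2 and is not reproduced here.

```python
import numpy as np

F_HIDDEN = [256, 128, 128, 128, 128]   # encoder F for the proxies X
G_HIDDEN = [32, 64, 64, 32, 32]        # encoder G for the treatments T

def init_mlp(in_dim, widths, rng):
    """Return a list of (weight, bias) pairs for a stack of FC layers."""
    dims = [in_dim] + list(widths)
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Apply every FC layer followed by ReLU (assumed activation)."""
    for W, b in params:
        x = np.maximum(x @ W + b, 0.0)
    return x

def n_params(params):
    """Total number of trainable scalars in the stack."""
    return sum(W.size + b.size for W, b in params)
```

With these widths, `forward(init_mlp(200, F_HIDDEN, rng), X)` maps a batch of 200-dimensional proxies to a 128-dimensional representation, and the G stack maps each treatment to 32 dimensions.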

A.4.3 DETAILS ON HYPERPARAMETERS

For GPS, Bart and CF, we use the default hyperparameters from Schwab et al. (2020). For all representation-based models, we fix the random seed and search for the best performance with SGD or Adam. We also tune the learning rate over {0.1, 0.01, 0.001, 0.0001, 0.00001}. For DRNet and VCNet, we tune the knots over {[0.33, 0.66], [0.2, 0.4, 0.6, 0.8], [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]} and α over {100, 10, 1, 0.1, 0.01, 0.001}. For VCNet+TR, we tune β over {10, 1, 0.1} and the learning rate of TR over {0.1, 0.01, 0.001}. For CRNet, we tune α over {100, 10, 1, 0.1, 0.01, 0.001} and fix β = 1.
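The CRNet search space described above can be enumerated as follows. This is a sketch under the assumption that the search is an exhaustive grid over optimizer, learning rate and α; the text does not say whether every combination was tried. The names `CRNET_GRID` and `iter_configs` are hypothetical.

```python
from itertools import product

CRNET_GRID = {
    "optimizer": ["SGD", "Adam"],
    "lr": [0.1, 0.01, 0.001, 0.0001, 0.00001],
    "alpha": [100, 10, 1, 0.1, 0.01, 0.001],
}
BETA = 1  # held fixed for CRNet

def iter_configs(grid, **fixed):
    """Yield one dict per point of the Cartesian product, plus fixed values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield {**dict(zip(keys, values)), **fixed}

configs = list(iter_configs(CRNET_GRID, beta=BETA))
```

Under the full-grid assumption this gives 2 × 5 × 6 = 60 configurations, each trained with the same fixed random seed.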

