DOUBLE GENERATIVE ADVERSARIAL NETWORKS FOR CONDITIONAL INDEPENDENCE TESTING

Anonymous

Abstract

In this article, we consider the problem of high-dimensional conditional independence testing, which is a key building block in statistics and machine learning. We propose a double generative adversarial networks (GANs)-based inference procedure. We first introduce a double GANs framework to learn two generators, and integrate the two generators to construct a doubly-robust test statistic. We next consider multiple generalized covariance measures, and take their maximum as our test statistic. Finally, we obtain the empirical distribution of our test statistic through multiplier bootstrap. We show that our test controls type-I error, while the power approaches one asymptotically. More importantly, these theoretical guarantees are obtained under much weaker and practically more feasible conditions compared to existing tests. We demonstrate the efficacy of our test through both synthetic and real data examples.

1. INTRODUCTION

Conditional independence (CI) is a fundamental concept in statistics and machine learning. Testing conditional independence is a key building block and plays a central role in a wide variety of statistical learning problems, for instance, causal inference (Pearl, 2009), graphical models (Koller & Friedman, 2009), and dimension reduction (Li, 2018), among others. In this article, we aim to test whether two random variables X and Y are conditionally independent given a set of confounding variables Z. That is, given the observed data of n i.i.d. copies {(X_i, Y_i, Z_i)}_{1≤i≤n} of (X, Y, Z), we test the hypotheses:

H_0: X ⊥⊥ Y | Z versus H_1: otherwise.   (1)

For our problem, X, Y and Z can all be multivariate. However, the main challenge arises when the confounding set of variables Z is high-dimensional. As such, we primarily focus on the scenario with a univariate X and Y, and a multivariate Z; our proposed method can be extended to the multivariate X and Y scenario as well. Another challenge is the limited sample size compared to the dimensionality of Z. As a result, many existing tests are ineffective, with either an inflated type-I error or inadequate power to detect the alternatives. See Section 2 for a detailed review. We propose a double generative adversarial networks (GANs, Goodfellow et al., 2014)-based inference procedure for the CI testing problem (1). Our proposal involves two key components: a double GANs framework to learn two generators that approximate the conditional distributions of X given Z and of Y given Z, and a maximum of generalized covariance measures over multiple combinations of transformation functions of X and Y. We first establish that our test statistic is doubly-robust, which offers additional protection against potential misspecification of the conditional distributions (see Theorems 1 and 2).
Second, we show that the resulting test achieves a valid control of the type-I error asymptotically, and more importantly, under conditions that are much weaker and practically more feasible (see Theorem 3). Finally, we prove that the power of our test approaches one asymptotically (see Theorem 4), and demonstrate empirically that it is more powerful than the competing tests.

2. RELATED WORKS

There has been a growing literature on conditional independence testing in recent years; see Li & Fan (2019) for a review. Broadly speaking, the existing testing methods can be cast into four main categories: the metric-based tests, e.g., (Su & White, 2007; 2014; Wang et al., 2015), the conditional randomization-based tests (Candes et al., 2018; Bellot & van der Schaar, 2019), the kernel-based tests (Fukumizu et al., 2008; Zhang et al., 2011), and the regression-based tests (Hoyer et al., 2009; Zhang et al., 2018; Shah & Peters, 2018). There are other types of tests, e.g., Bergsma (2004); Doran et al. (2014); Sen et al. (2017; 2018); Berrett et al. (2019), to mention a few. The metric-based tests typically employ kernel smoothers to estimate the conditional characteristic function or the distribution function of Y given X and Z. Kernel smoothers, however, are known to suffer from the curse of dimensionality, and as such, these tests are not suitable when the dimension of Z is high. The conditional randomization-based tests require knowledge of the conditional distribution of X|Z (Candes et al., 2018); if unknown, the type-I error rates of these tests rely critically on the quality of the approximation of this conditional distribution. The kernel-based tests are built upon the notion of maximum mean discrepancy (MMD, Gretton et al., 2012), and can have inflated type-I errors. The regression-based tests have valid type-I error control, but may suffer from inadequate power. Next, we discuss in detail the conditional randomization-based tests, in particular the work of Bellot & van der Schaar (2019), as well as the regression-based and the MMD-based tests, since our proposal is closely related to them.

2.1. CONDITIONAL RANDOMIZATION-BASED TESTS

The family of conditional randomization-based tests is built upon the following basis. If the conditional distribution P_{X|Z} of X given Z is known, then one can independently draw X_i^{(1)} ~ P_{X|Z=Z_i} for i = 1, ..., n, and these samples are independent of the observed samples X_i's and Y_i's. Write X = (X_1, ..., X_n)^T, X^{(1)} = (X_1^{(1)}, ..., X_n^{(1)})^T, Y = (Y_1, ..., Y_n)^T, and Z = (Z_1, ..., Z_n)^T. Here we use boldface letters to denote data matrices that consist of n samples. The joint distributions of (X, Y, Z) and (X^{(1)}, Y, Z) are the same under H_0, and any large difference between the two distributions can be interpreted as evidence against H_0. Therefore, one can repeat the process M times, and generate X_i^{(m)} ~ P_{X|Z=Z_i}, i = 1, ..., n, m = 1, ..., M. Write X^{(m)} = (X_1^{(m)}, ..., X_n^{(m)})^T. Then, for any given test statistic ρ = ρ(X, Y, Z), its associated p-value is

p = [1 + Σ_{m=1}^M I{ρ(X^{(m)}, Y, Z) ≥ ρ(X, Y, Z)}] / (1 + M),

where I(·) is the indicator function. Since the triplets (X, Y, Z), (X^{(1)}, Y, Z), ..., (X^{(M)}, Y, Z) are exchangeable under H_0, the p-value is valid: it satisfies Pr(p ≤ α | H_0) ≤ α + o(1) for any 0 < α < 1. In practice, however, P_{X|Z} is rarely known, and Bellot & van der Schaar (2019) proposed to approximate it using GANs. Specifically, they learned a generator G_X(·,·) from the observed data, then took Z_i and a noise variable v_{i,X}^{(m)} as input to obtain a sample X̂_i^{(m)}, which minimizes the divergence between the distributions of (X_i, Z_i) and (X̂_i^{(m)}, Z_i). The p-value is then computed by replacing X^{(m)} with X̂^{(m)} = (X̂_1^{(m)}, ..., X̂_n^{(m)})^T. They called this test GCIT, short for generative conditional independence test.
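The sampling scheme above is straightforward to express in code. Below is a minimal sketch (not the authors' implementation) of the conditional randomization p-value when P_{X|Z} is known exactly; the linear-Gaussian model, the correlation-based statistic, and all parameter values are illustrative assumptions.

```python
import numpy as np

def crt_p_value(X, Y, Z, sample_X_given_Z, rho, M=100, rng=None):
    """Conditional randomization test p-value.

    sample_X_given_Z(Z, rng) draws one pseudo copy of X from the (known)
    conditional law P_{X|Z}; rho(X, Y, Z) is any scalar test statistic.
    """
    rng = np.random.default_rng(rng)
    obs = rho(X, Y, Z)
    exceed = sum(rho(sample_X_given_Z(Z, rng), Y, Z) >= obs for _ in range(M))
    return (1 + exceed) / (1 + M)  # valid by exchangeability under H0

# Toy illustration: X = Z beta + eps with beta and sigma known, and Y
# depending on Z only, so that H0 holds.
rng = np.random.default_rng(0)
n, d = 300, 5
Z = rng.standard_normal((n, d))
beta = np.ones(d)
X = Z @ beta + rng.standard_normal(n)
Y = Z @ beta + rng.standard_normal(n)
sampler = lambda Z, r: Z @ beta + r.standard_normal(len(Z))
stat = lambda X, Y, Z: abs(np.corrcoef(X, Y)[0, 1])
p = crt_p_value(X, Y, Z, sampler, stat, M=100, rng=1)
```

Since the observed data and the pseudo copies are exchangeable under H_0 here, the returned p-value is approximately uniform over {1/101, 2/101, ...}.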
By Theorem 1 of Bellot & van der Schaar (2019), the excess type-I error of this test is upper bounded by

Pr(p ≤ α | H_0) − α ≤ E d_TV(P̂_{X|Z}, P_{X|Z}) = E sup_A |Pr(X ∈ A | Z) − Pr(X̂^{(m)} ∈ A | Z)| ≡ D,   (2)

where d_TV is the total variation distance between two probability distributions, the supremum is taken over all measurable sets A, and the expectations in (2) are taken with respect to Z. By definition, the quantity D on the right-hand side of (2) measures the quality of the conditional distribution approximation. Bellot & van der Schaar (2019) argued that this error term is negligible due to the capacity of deep neural nets in estimating conditional distributions. To the contrary, we find this approximation error is usually not negligible, and consequently, it may inflate the type-I error and invalidate the test. We consider a simple example to further elaborate.

Example 1. Suppose X is one-dimensional, and follows a simple linear regression model, X = Z^T β_0 + ε, where the error ε is independent of Z and ε ~ N(0, σ_0²) for some σ_0² > 0. Suppose we know a priori that the linear regression model holds. We thus estimate β_0 by ordinary least squares, and denote the resulting estimator by β̂. For simplicity, suppose σ_0² is known too. For this simple example, we have the following result regarding the approximation error term D.

Proposition 1. Suppose the linear regression model holds, and the approximated distribution P̂_{X|Z} is N(Z β̂, σ_0² I_n), where I_n is the n × n identity matrix. Then D is not o(1).

To facilitate the understanding of the convergence behavior of D, we sketch a few lines of the proof of Proposition 1; a detailed proof is given in Appendix F.1. Let P̂_{X|Z=Z_i} denote the conditional distribution of X̂_i^{(m)} given Z_i, which is N(Z_i^T β̂, σ_0²) in this example. If D = o(1), then

D̄ ≡ n^{1/2} [E d_TV²(P̂_{X|Z=Z_i}, P_{X|Z=Z_i})]^{1/2} = o(1).   (3)
In other words, the validity of GCIT requires the root mean squared total variation distance in (3) to converge at a faster rate than n^{−1/2}. However, this rate cannot be achieved in general. In our simple Example 1, the root mean squared quantity in (3) is bounded below by some universal constant c > 0, and consequently, D in (2) is not o(1). Proposition 1 thus shows that, even if we know a priori that the linear model holds, D does not decay to zero as n grows to infinity. In practice, we do not have such prior model information, and it would be even more difficult to estimate the conditional distribution P_{X|Z}. Therefore, using GANs to approximate P_{X|Z} guarantees neither a negligible approximation error, nor the validity of the test.
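Proposition 1 can be illustrated numerically. The sketch below (an illustration, not part of the paper's proof) uses the closed-form total variation distance between two equal-variance Gaussians, d_TV{N(µ_1, σ²), N(µ_2, σ²)} = erf(|µ_1 − µ_2| / (2√2 σ)), and an OLS fit under the correctly specified linear model; the dimension d = 10 and the sample sizes are hypothetical choices.

```python
import numpy as np
from math import erf

_erf = np.vectorize(erf)

def tv_gauss_same_var(mu1, mu2, sigma):
    """TV distance between N(mu1, sigma^2) and N(mu2, sigma^2)."""
    return _erf(np.abs(mu1 - mu2) / (2.0 * sigma * np.sqrt(2.0)))

def sqrt_n_rms_tv(n, d=10, sigma=1.0, seed=0):
    """The quantity n^{1/2} [E d_TV^2]^{1/2} of (3), with OLS under the true model."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))
    beta0 = np.ones(d)
    X = Z @ beta0 + sigma * rng.standard_normal(n)
    beta_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)   # OLS estimate of beta0
    tv = tv_gauss_same_var(Z @ beta_hat, Z @ beta0, sigma)
    return np.sqrt(n) * np.sqrt(np.mean(tv ** 2))

# The scaled root-mean-squared TV distance does not shrink as n grows,
# even though the parametric model is known and correctly specified:
vals = [sqrt_n_rms_tv(n) for n in (200, 800, 3200)]
```

In this setup the quantity stabilizes around a positive constant (roughly √d/√(2π) by a delta-method calculation), rather than vanishing.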

2.2. REGRESSION-BASED TESTS

The family of regression-based tests is built upon a key quantity, the generalized covariance measure,

GCM(X, Y) = n^{−1} Σ_{i=1}^n {X_i − Ê(X_i | Z_i)}{Y_i − Ê(Y_i | Z_i)},

where Ê(X|Z) and Ê(Y|Z) are the predicted conditional means of E(X|Z) and E(Y|Z), respectively, obtained from any supervised learner. When the prediction errors of Ê(X|Z) and Ê(Y|Z) satisfy certain convergence rates, Shah & Peters (2018) proved that GCM is asymptotically normal. Under H_0, the asymptotic mean of GCM is zero, and its asymptotic standard deviation can be consistently estimated by some standard error estimator, denoted by ŝ(GCM). Therefore, at level α, we reject H_0 if |GCM|/ŝ(GCM) exceeds the upper α/2th quantile of a standard normal distribution. Such a test is valid. However, it may not have sufficient power to detect H_1, because the asymptotic mean of GCM equals

GCM*(X, Y) = E[{X − E(X|Z)}{Y − E(Y|Z)}].

The regression-based tests require |GCM*| to be nonzero under H_1 to have power, but there is no guarantee of this requirement. We again consider a simple example to elaborate.

Example 2. Suppose X*, Y and Z are independent random variables, X* has mean zero, and X = X* g(Y) for some function g. For this example, we have E(X|Z) = E(X), since both X* and Y are independent of Z, and so is X. Besides, E(X) = E(X*) E{g(Y)} = 0, since X* is independent of Y and E(X*) = 0. As such, GCM*(X, Y) = E[{X − E(X)}{Y − E(Y|Z)}] = 0 for any function g. On the other hand, X and Y are conditionally dependent given Z, as long as g is not a constant function. Therefore, for this example, the regression-based tests fail to discriminate between H_0 and H_1.
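The GCM statistic is simple to compute once a supervised learner is chosen. Below is a minimal sketch (not the implementation of Shah & Peters, 2018), using plain linear regression as the learner; the data-generating models and all dimensions are illustrative assumptions.

```python
import numpy as np

def gcm_test_statistic(X, Y, Z, fit_predict):
    """Studentized generalized covariance measure; approx. N(0,1) under H0.

    fit_predict(Z, t) returns predictions of E(t|Z) from any supervised learner.
    """
    rx = X - fit_predict(Z, X)            # residual of X on Z
    ry = Y - fit_predict(Z, Y)            # residual of Y on Z
    prod = rx * ry
    return prod.mean() / (prod.std(ddof=1) / np.sqrt(len(prod)))

def ols_fit_predict(Z, t):
    """Linear-regression learner, used purely for illustration."""
    D = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(D, t, rcond=None)
    return D @ coef

rng = np.random.default_rng(0)
n, d = 500, 20
Z = rng.standard_normal((n, d))
X = Z[:, 0] + rng.standard_normal(n)
Y0 = Z[:, 0] + rng.standard_normal(n)       # H0: X and Y0 independent given Z
Y1 = X + Z[:, 0] + rng.standard_normal(n)   # H1: Y1 depends on X given Z
t0 = gcm_test_statistic(X, Y0, Z, ols_fit_predict)
t1 = gcm_test_statistic(X, Y1, Z, ols_fit_predict)
```

Under H_0 the studentized statistic behaves like a standard normal, while under this (linear) alternative it diverges; the power failure of Example 2 arises only for alternatives with GCM* = 0.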

2.3. MMD-BASED TESTS

The family of kernel-based tests often involves the notion of maximum mean discrepancy as a measure of independence. For any two probability measures P, Q and a function space F, define

MMD(P, Q | F) = sup_{f∈F} {E f(W_1) − E f(W_2)}, where W_1 ~ P, W_2 ~ Q.

Let H_1, H_2 be some function spaces of X and Y, respectively. Define φ_XY = MMD(P_XY, Q_XY | H_1 ⊗ H_2), where ⊗ is the tensor product, P_XY is the joint distribution of (X, Y), and Q_XY is the conditionally independent distribution with the same X and Y margins as P_XY. Then, following the calculations in Appendix D, we have

φ_XY = sup_{h_1∈H_1, h_2∈H_2} E([h_1(X) − E{h_1(X)|Z}][h_2(Y) − E{h_2(Y)|Z}]).

We see that φ_XY measures the average conditional association between X and Y given Z. Under H_0 it equals zero, and hence an estimator of this measure can be used as a test statistic for H_0.

3. A NEW DOUBLE GANS-BASED TESTING PROCEDURE

We propose a double GANs-based testing procedure for the conditional independence testing problem (1). Conceptually, our test integrates the GCIT, regression-based, and MMD-based tests, while addressing the limitations of each. Unlike GCIT, which only learns the conditional distribution of X|Z, we learn two generators G_X and G_Y to approximate the conditional distributions of both X|Z and Y|Z. We then integrate the two generators to construct a doubly-robust test statistic, and we only require the root mean squared total variation norm to converge at a rate of n^{−κ} for some κ > 1/4. Such a requirement is much weaker and practically more feasible than the condition in (3). The notion of double-robustness comes from the classical semiparametric theory in statistics (Tsiatis, 2007). Specifically, a doubly-robust procedure applies two models simultaneously, and produces a consistent estimate if either of the two models is consistently estimated. Moreover, to improve the power of the test, we consider a set of GCMs, {GCM(h_1(X), h_2(Y)): h_1, h_2}, for multiple combinations of transformation functions h_1(X) and h_2(Y), and take the maximum of all these GCMs as our test statistic. This essentially estimates φ_XY, which connects our statistic to the notion of MMD. To see why the maximum-type statistic can enhance the power, we quickly revisit Example 2. When g is not a constant function, there exists some nonlinear function h_1 such that h_1*(Y) = E{h_1(X)|Y} is not a constant function of Y. Setting h_2 = h_1*, we have

GCM* = E[h_1(X){h_1*(Y) − E h_1*(Y)}] = Var{h_1*(Y)} > 0.

This enables us to discriminate H_1 from H_0. We next detail our testing procedure. A graphical overview is given in Figure 1.
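The Example 2 phenomenon, and how transformations rescue it, can be checked by a quick Monte Carlo experiment. This is an illustration only; the choices g(y) = y, h_1(x) = x² and h_2(y) = |y| are our own hypothetical picks, not the functions used by the test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
Xstar = rng.standard_normal(n)
Y = rng.standard_normal(n)
X = Xstar * Y      # g(y) = y: X depends on Y, yet E(X) = 0 and Cov(X, Y) = 0

# Plain GCM target: since (X, Y) is independent of Z here, the conditional
# means given Z reduce to constants, so GCM* = E[{X - E X}{Y - E Y}] = 0.
gcm_plain = np.mean((X - X.mean()) * (Y - Y.mean()))

# Transformed GCM with h1(x) = x^2 and h2(y) = |y| exposes the dependence:
# E{h1(X)|Y} = Y^2 * E(Xstar^2) is a non-constant function of Y.
h1, h2 = X ** 2, np.abs(Y)
gcm_trans = np.mean((h1 - h1.mean()) * (h2 - h2.mean()))
```

The plain product moment is zero up to Monte Carlo noise, while the transformed one is bounded away from zero (its population value is Cov(Y², |Y|) = √(2/π) ≈ 0.8 in this model).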

3.1. TEST STATISTIC

We begin with two function spaces, H_1 = {h_{1,θ_1}: θ_1 ∈ R^{d_1}} and H_2 = {h_{2,θ_2}: θ_2 ∈ R^{d_2}}, indexed by parameters θ_1 and θ_2, respectively. In our implementation, we set H_1 and H_2 to the classes of neural networks with a single hidden layer, finitely many hidden nodes, and the sigmoid activation function. We then randomly generate B functions h_{1,1}, ..., h_{1,B} ∈ H_1 and h_{2,1}, ..., h_{2,B} ∈ H_2, by independently drawing i.i.d. multivariate normal parameters θ_{1,1}, ..., θ_{1,B} ~ N(0, 2I_{d_1}/d_1) and θ_{2,1}, ..., θ_{2,B} ~ N(0, 2I_{d_2}/d_2), and setting h_{1,b} = h_{1,θ_{1,b}} and h_{2,b} = h_{2,θ_{2,b}}, b = 1, ..., B. Consider the following maximum-type test statistic,

max_{b_1,b_2 ∈ {1,...,B}} | σ̂_{b_1,b_2}^{−1} n^{−1} Σ_{i=1}^n [h_{1,b_1}(X_i) − E{h_{1,b_1}(X_i)|Z_i}][h_{2,b_2}(Y_i) − E{h_{2,b_2}(Y_i)|Z_i}] |,   (4)

where σ̂²_{b_1,b_2} is a consistent estimator of the asymptotic variance of √n GCM(h_{1,b_1}(X), h_{2,b_2}(Y)); see Appendix A for its definition. To compute (4), however, we need to estimate the conditional means E{h_{1,b_1}(X)|Z} and E{h_{2,b_2}(Y)|Z} for b_1, b_2 = 1, ..., B. In theory, B should diverge to infinity to guarantee the power property of the test, and separately applying supervised learning algorithms 2B times to compute these means is computationally very expensive. Instead, we propose to implement this step based on the generators G_X and G_Y estimated using GANs, which is computationally much more efficient. Specifically, for i = 1, ..., n, we randomly generate i.i.d. random vectors {v_{i,X}^{(m)}}_{m=1}^M and {v_{i,Y}^{(m)}}_{m=1}^M, and output the pseudo samples X̂_i^{(m)} = G_X(Z_i, v_{i,X}^{(m)}) and Ŷ_i^{(m)} = G_Y(Z_i, v_{i,Y}^{(m)}), m = 1, ..., M, to approximate the conditional distributions of X_i and Y_i given Z_i. We then compute

Ê{h_{1,b_1}(X_i)|Z_i} = M^{−1} Σ_{m=1}^M h_{1,b_1}(X̂_i^{(m)}) and Ê{h_{2,b_2}(Y_i)|Z_i} = M^{−1} Σ_{m=1}^M h_{2,b_2}(Ŷ_i^{(m)}), for b_1, b_2 = 1, ..., B.
Plugging these estimated means into (4) produces our test statistic, T ≡ max_{b_1,b_2} |n^{−1} Σ_{i=1}^n ψ_{b_1,b_2,i}|, where

ψ_{b_1,b_2,i} = σ̂_{b_1,b_2}^{−1} {h_{1,b_1}(X_i) − M^{−1} Σ_{m=1}^M h_{1,b_1}(X̂_i^{(m)})} {h_{2,b_2}(Y_i) − M^{−1} Σ_{m=1}^M h_{2,b_2}(Ŷ_i^{(m)})}.

Input: number of functions B, number of pseudo samples M, and number of data splits L.
Step 1: Divide {1, ..., n} into L folds I^{(1)}, ..., I^{(L)}. Let I^{(−ℓ)} = {1, ..., n} − I^{(ℓ)}.
Step 2: For ℓ = 1, ..., L, train two generators Ĝ_X^{(ℓ)} and Ĝ_Y^{(ℓ)} based on {(X_i, Z_i)}_{i∈I^{(−ℓ)}} and {(Y_i, Z_i)}_{i∈I^{(−ℓ)}}, to approximate the conditional distributions of X|Z and Y|Z.
Step 3: For ℓ = 1, ..., L and i ∈ I^{(ℓ)}, generate i.i.d. random noises {v_{i,X}^{(m)}}_{m=1}^M and {v_{i,Y}^{(m)}}_{m=1}^M. Set X̂_i^{(m)} = Ĝ_X^{(ℓ)}(Z_i, v_{i,X}^{(m)}) and Ŷ_i^{(m)} = Ĝ_Y^{(ℓ)}(Z_i, v_{i,Y}^{(m)}), m = 1, ..., M.
Step 4: Randomly generate h_{1,1}, ..., h_{1,B} ∈ H_1 and h_{2,1}, ..., h_{2,B} ∈ H_2.
Step 5: Compute the test statistic T.
Algorithm 1: Algorithm for computing the test statistic.

To help reduce the type-I error of our test, we further employ a data splitting and cross-fitting strategy, which is commonly used in statistical testing (Romano & DiCiccio, 2019). That is, we use different subsets of the data samples to learn the GANs and to construct the test statistic. We summarize our procedure for computing the test statistic in Algorithm 1.
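Steps 4–5 of Algorithm 1 (together with the variance estimator defined in Appendix A) can be sketched in numpy. The sketch below is our own illustration, not the paper's code: it assumes the two generators have already produced n × M arrays of pseudo samples for univariate X and Y, and the values of B, M, eps0, and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def make_features(B, d, rng):
    """B random single-layer sigmoid functions with theta_b ~ N(0, 2 I_d / d)."""
    theta = rng.normal(0.0, np.sqrt(2.0 / d), size=(d, B))
    return lambda V: sigmoid(np.reshape(V, (-1, d)) @ theta)   # (len(V), B)

def dgcit_statistic(X, Y, Xp, Yp, B=30, eps0=1e-3, seed=0):
    """Return sqrt(n) * T and the studentized scores psi_{b1,b2,i}.

    Xp, Yp are (n, M) pseudo samples standing in for generator draws that
    approximate P_{X|Z} and P_{Y|Z}.
    """
    n, M = Xp.shape
    rng = np.random.default_rng(seed)
    f1, f2 = make_features(B, 1, rng), make_features(B, 1, rng)
    # h(X_i) minus the Monte Carlo estimate of E{h(X_i)|Z_i} over M draws:
    r1 = f1(X) - f1(Xp.reshape(-1)).reshape(n, M, B).mean(axis=1)
    r2 = f2(Y) - f2(Yp.reshape(-1)).reshape(n, M, B).mean(axis=1)
    psi_raw = np.einsum('ib,ic->bci', r1, r2)                  # (B, B, n)
    gcm = psi_raw.mean(axis=2, keepdims=True)
    var = ((psi_raw - gcm) ** 2).sum(axis=2) / (n - 1)
    sigma = np.sqrt(np.maximum(var, eps0))                     # truncated at eps0
    psi = psi_raw / sigma[..., None]
    return np.sqrt(n) * np.abs(psi.mean(axis=2)).max(), psi

# Toy data under H0, with "oracle" generators (true conditional law known):
rng = np.random.default_rng(1)
n, M = 200, 50
Z = rng.standard_normal(n)
X = Z + rng.standard_normal(n)
Y = Z + rng.standard_normal(n)
Xp = Z[:, None] + rng.standard_normal((n, M))
Yp = Z[:, None] + rng.standard_normal((n, M))
sqrt_n_T, psi = dgcit_statistic(X, Y, Xp, Yp)
```

The scores psi are exactly the inputs needed by the multiplier bootstrap of Section 3.2.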

3.2. BOOTSTRAPPING THE p-VALUE

Next, we propose a multiplier bootstrap method to approximate the distribution of √n T under H_0, and to compute the corresponding p-value. The key observation is that n^{−1/2} Σ_{i=1}^n ψ_{b_1,b_2,i} is asymptotically normal with zero mean under H_0; see the proof of Theorem 3 in Appendix F.3 for details. As such, √n T = max_{b_1,b_2} |n^{−1/2} Σ_{i=1}^n ψ_{b_1,b_2,i}| converges to the maximum of normal variables in absolute value. To approximate this limiting distribution, we first estimate the covariance matrix of the B²-dimensional vector formed by stacking {n^{−1/2} Σ_i ψ_{b_1,b_2,i}}_{b_1,b_2}, using the sample covariance matrix Σ̂. We then generate i.i.d. normal random vectors with covariance matrix Σ̂, and compute the maximum element, in absolute value, of each of these vectors. Finally, we use these maximum absolute values to approximate the null distribution of √n T. We summarize this procedure in Algorithm 2.
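The multiplier bootstrap above admits a compact implementation. The sketch below mirrors the three steps of the procedure, consuming the scores ψ_{b1,b2,i}; it is illustrative only (here the scores are simulated i.i.d. normal), and any positive semi-definite square root of the sample covariance can be used for sampling.

```python
import numpy as np

def bootstrap_p_value(sqrt_n_T, psi, J=1000, seed=0):
    """Gaussian-multiplier bootstrap p-value for the maximum-type statistic.

    psi has shape (B, B, n) and holds the studentized scores psi_{b1,b2,i}.
    """
    B, n = psi.shape[0], psi.shape[2]
    flat = psi.reshape(B * B, n)
    centred = flat - flat.mean(axis=1, keepdims=True)
    Sigma = centred @ centred.T / (n - 1)        # B^2 x B^2 sample covariance
    w, V = np.linalg.eigh(Sigma)
    root = V * np.sqrt(np.clip(w, 0.0, None))    # a p.s.d. square root of Sigma
    rng = np.random.default_rng(seed)
    Tj = np.abs(root @ rng.standard_normal((B * B, J))).max(axis=0)
    return float(np.mean(Tj >= sqrt_n_T))        # share of bootstrap maxima >= observed

# Illustration with synthetic scores drawn under the null:
rng = np.random.default_rng(2)
psi = rng.standard_normal((3, 3, 400))
sqrt_n_T = np.sqrt(400) * np.abs(psi.mean(axis=2)).max()
p = bootstrap_p_value(sqrt_n_T, psi)
```

Each bootstrap draw root @ z has covariance Σ̂, so the max-absolute entries mimic the limiting maximum of correlated normals.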

3.3. APPROXIMATING CONDITIONAL DISTRIBUTION VIA GANS

We adopt the proposal in Genevay et al. (2017) to learn the conditional distributions P_{X|Z} and P_{Y|Z}. Recall that P̂_{X|Z} is the distribution of the pseudo outcome generated by the generator G_X given Z. We estimate P_{X|Z} by solving the minimax problem

min_{G_X} max_c D̃_{c,ε}(P_{X|Z}, P̂_{X|Z}),

where D̃_{c,ε} denotes the Sinkhorn loss between two probability measures, with respect to some cost function c and some regularization parameter ε > 0; a detailed definition of D̃_{c,ε} is given in Appendix B. Intuitively, the closer the two probability measures, the smaller the Sinkhorn loss. As such, maximizing the loss with respect to the cost function learns a discriminator that can better distinguish the samples generated from P_{X|Z} and P̂_{X|Z}, while minimizing the maximum cost with respect to the generator G_X drives P̂_{X|Z} closer to the true distribution P_{X|Z}. In practice, we approximate both the cost and the generator with neural networks, and the integrals in the objective function D̃_{c,ε}(P_{X|Z}, P̂_{X|Z}) are approximated by sample averages. A pseudocode detailing our learning procedure is given in Appendix B. The conditional distribution P_{Y|Z} is estimated similarly. We make two remarks.

Input: number of bootstrap samples J, and {ψ_{b_1,b_2,i}}_{b_1,b_2,i}.
Step 1: Compute the B² × B² matrix Σ̂ whose {b_1 + B(b_2 − 1), b_3 + B(b_4 − 1)}th entry is (n − 1)^{−1} Σ_{i=1}^n (ψ_{b_1,b_2,i} − ψ̄_{b_1,b_2})(ψ_{b_3,b_4,i} − ψ̄_{b_3,b_4}), where ψ̄_{b_1,b_2} = n^{−1} Σ_{i=1}^n ψ_{b_1,b_2,i}.
Step 2: Generate i.i.d. standard normal variables Z_{j,b}, for j = 1, ..., J and b = 1, ..., B². Set Z_j = (Z_{j,1}, ..., Z_{j,B²})^T and T_j = ||Σ̂^{1/2} Z_j||_∞, where Σ̂^{1/2} is a positive semi-definite matrix satisfying Σ̂^{1/2} Σ̂^{1/2} = Σ̂, and ||·||_∞ is the maximum element of a vector in absolute value.
Step 3: Compute the p-value, p = J^{−1} Σ_{j=1}^J I(T_j ≥ √n T).
Algorithm 2: Algorithm for computing the p-value.
First, our proposed framework is general, and any GANs learning procedure can be applied. In our implementation, we choose the Sinkhorn GANs because of their competitive performance. We do not use the original GANs due to their instability and the mode collapse issue. We did implement WGANs, but found the estimated distributions may suffer from a large bias resulting from the gradient penalty enforced in WGANs. Second, it is important to check the goodness-of-fit of the generator. In practice, this can be achieved by comparing the conditional histogram of the generated samples to that of the observed samples; see Appendix C for details. Finally, let T* denote the version of T computed with pseudo samples {X_i^{(m)}, Y_i^{(m)}}_m drawn from the true conditional distributions P_{X|Z} and P_{Y|Z}. We call the resulting T* an "oracle" test statistic.

4. ASYMPTOTIC THEORY

We next establish the double-robustness property of T, which helps explain why our proposed test can relax the requirement in (3). Informally speaking, double-robustness means that T is asymptotically equivalent to T* when either the conditional distribution of X|Z, or that of Y|Z, is well approximated by the GANs. It guarantees that T converges to T* at a faster rate than the estimated conditional distributions converge. In contrast, the GCIT test statistic converges at the same rate as the estimated conditional distribution. As such, our procedure requires a weaker condition.

Theorem 1 (Double-robustness). Suppose M is proportional to n, and B = c_0 n^c for some constants c, c_0 > 0. Then T − T* = o_p(1), when either E[d_TV²{P̂_{X|Z}^{(ℓ)}, P_{X|Z}}]^{1/2} = o(log^{−1/2} n), or E[d_TV²{P̂_{Y|Z}^{(ℓ)}, P_{Y|Z}}]^{1/2} = o(log^{−1/2} n).

The conditions on M and B are mild, as these are user-specified parameters. As we have discussed, when both total variation distances converge to zero, the test statistic T converges at a faster rate than those total variation distances. Therefore, we can greatly relax the condition in (3), and replace it with: for every ℓ = 1, ..., L,

E{d_TV²(P̂_{X|Z}^{(ℓ)}, P_{X|Z})}^{1/2} = O(n^{−κ}) and E{d_TV²(P̂_{Y|Z}^{(ℓ)}, P_{Y|Z})}^{1/2} = O(n^{−κ}),   (5)

for some constant 0 < κ < 1/2, where P̂_{X|Z}^{(ℓ)} and P̂_{Y|Z}^{(ℓ)} denote the conditional distributions approximated via the GANs trained on the ℓth subset of the data. The next theorem summarizes this discussion.

Theorem 2. Suppose the conditions in Theorem 1 and (5) hold. Then T − T* = O_p(n^{−2κ} log n).

Since κ > 0, the convergence rate of T − T* is faster than the rate n^{−κ} in (5). To ensure √n(T − T*) = o_p(1), it suffices to require κ > 1/4. In contrast to (3), this rate is achievable. We consider three examples to illustrate, while noting that the condition holds in a wide range of settings.

Example 3 (Parametric setting). Suppose the parametric forms of P_{X|Z} and P_{Y|Z} are correctly specified. Then the requirement κ > 1/4 holds if k = O(n^{t_0}) for some t_0 < 1/4, where k is the dimension of the parameters defining the parametric model.

Example 4 (Nonparametric setting with binary data). Suppose X and Y are binary variables. Then it suffices to estimate the conditional means of X and Y given Z, and the requirement κ > 1/4 holds if the mean squared prediction errors of both nonparametric estimators are O(n^{−κ_0}) for some κ_0 > 1/2.

Example 5 (Nonparametric setting with general data). Suppose X and Y are continuous variables, and we apply GANs to learn the conditional distributions. Chen et al. (2020) established the statistical properties of GANs, and one can apply their technical tools to verify the convergence rates in (5). In general, the convergence rate depends on the smoothness of the conditional density function: the smoother the conditional density, the faster the rate.

Next, we establish the size and power properties of our proposed test.

Theorem 3. Suppose the conditions in Theorem 1 hold, and (5) holds for some κ > 1/4. Then the p-value from Algorithm 2 satisfies Pr(p ≤ α | H_0) = α + o(1).

Theorem 3 shows that our proposed test controls the type-I error asymptotically.
Next, to derive the power property, we introduce a pair of hypotheses based on the notion of weak CI (Daudin, 1980):

H_0*: E{Cov(f(X), g(Y) | Z)} = 0 for all f ∈ L²_X, g ∈ L²_Y versus H_1*: E{Cov(f(X), g(Y) | Z)} ≠ 0 for some f, g,

where L²_X and L²_Y denote the classes of all square integrable functions of X and Y, respectively. We note that conditional independence implies weak conditional independence, i.e., H_0 implies H_0*; consequently, H_1* implies H_1. The next theorem shows that our proposed test is consistent against the alternatives in H_1*, but not against all alternatives in H_1.

Theorem 4. Suppose the conditions in Theorem 3 hold. Then the p-value from Algorithm 2 satisfies Pr(p ≤ α | H_1*) → 1 as n → ∞.

Finally, we remark that our test is constructed based on φ_XY. Alternatively, one may consider a test based on φ_XYZ = MMD(P_XYZ, Q_XYZ | H_1 ⊗ H_2 ⊗ H_3), where P_XYZ is the joint distribution of (X, Y, Z), Q_XYZ = P_{X|Z} P_{Y|Z} P_Z, and H_3 is a neural network class of functions of Z. This type of test is consistent against all alternatives in H_1. However, in our numerical experiments, we find it less powerful than our test. This agrees with the observation of Li & Fan (2019) that, even though tests based on weak CI cannot fully characterize CI, they potentially benefit from improved power.

5. NUMERICAL STUDIES

We give the implementation details of our testing procedure in Appendix A.

5.1. SYNTHETIC DATA EXAMPLE

Type-I error under H_0. We vary the dimension of Z as d_Z = 50, 100, 150, 200, 250, and consider two generating distributions: we first generate Z from a standard normal distribution, then from a Laplace distribution. We set the significance level at α = 0.05 and 0.1. The top panels of Figure 2 report the empirical size of the tests aggregated over 500 data replications. We make the following observations. First, the type-I error rates of our test and RCIT are close to or below the nominal level in nearly all cases. Second, KCIT and DL-CIT fail, in that their type-I errors are considerably larger than the nominal level in all cases. Third, GCIT and CCIT have inflated type-I errors in some cases. Take GCIT as an example: when Z is normal, d_Z = 250 and α = 0.1, its empirical size is close to 0.15. This is consistent with our discussion in Section 2.1, as GCIT requires a very strong condition to control the type-I error.

Power under H_1. We generate Z from a standard normal distribution with d_Z = 100, 200, and vary the value of b = 0.3, 0.45, 0.6, 0.75, 0.9 that controls the magnitude of the alternative. The bottom panels of Figure 2 report the empirical power of the tests over 500 data replications. We observe that our test is the most powerful, and its empirical power approaches 1 as b increases to 0.9, demonstrating the consistency of the test. Meanwhile, both GCIT and RCIT have no power in all cases. We do not report the power of KCIT and DL-CIT because, as shown above, they cannot control the size, and thus their empirical powers are not meaningful.

5.2. ANTI-CANCER DRUG DATA EXAMPLE

We illustrate our proposed test with an anti-cancer drug dataset from the Cancer Cell Line Encyclopedia (CCLE, Barretina et al., 2012). We concentrate on a subset of the CCLE data that measures the treatment response to the drug PLX4720. It is well known that a patient's response to a cancer drug can be strongly influenced by alterations in the genome (Garnett et al., 2012). This data measures 1638 genetic mutations of n = 472 cell lines, and the goal of our analysis is to determine which genetic mutations are significantly correlated with the drug response after conditioning on all other mutations. The same data were also analyzed in Tansey et al. (2018) and Bellot & van der Schaar (2019). We adopt the same screening procedure as theirs to screen out irrelevant mutations, which leaves a total of 466 potential mutations for our conditional independence testing. The ground truth is unknown for this data; instead, we compare with the variable importance measures obtained from fitting an elastic net (EN) model and a random forest (RF) model, as reported in Barretina et al. (2012). In addition, we compare with the GCIT test of Bellot & van der Schaar (2019). Table 1 reports the corresponding variable importance measures and the p-values, for 10 mutations that were also reported by Bellot & van der Schaar (2019). We see that the p-values of the tests generally agree well with the variable importance measures from the EN and RF models. Meanwhile, the two conditional independence tests agree relatively well, except for two genetic mutations, MAP3K5 and FLT3. GCIT concluded that MAP3K5 is significant (p < 0.001) but FLT3 is not (p = 0.521), whereas our test reaches the opposite conclusion: MAP3K5 is insignificant (p = 0.794) but FLT3 is significant (p = 0). Moreover, both EN and RF rank FLT3 as an important mutation. We then compare our findings with the cancer drug response literature.
To our knowledge, MAP3K5 has not been reported in the literature as being directly linked to the PLX4720 drug response, whereas there is strong evidence connecting the FLT3 mutation with cancer drug response (Tsai et al., 2008; Larrosa-Garcia & Baer, 2017). Combining the existing literature with our theoretical and synthetic results, we have more confidence in the findings of our proposed test.

A ADDITIONAL DETAILS ABOUT THE PROPOSED TESTING PROCEDURE

We first discuss the computation time. In our synthetic data example, it took about 2.5 minutes to compute our test, 2 minutes to compute CCIT, and 20 seconds to compute GCIT and DL-CIT.

We next discuss some implementation details. The number of functions B in Algorithm 1 represents a trade-off: by Theorem 4, B should be as large as possible to guarantee good power, but the computational complexity increases with B. Our numerical studies suggest that a value of B between 30 and 50 achieves a good balance between the power and the computational cost, and we fix B = 30. For the number of pseudo samples M and the number of sample splits L, we find the results are not overly sensitive to their choices, and thus we fix M = 100 and L = 3. For the GANs, we use a single-hidden-layer neural network to approximate both the discriminator and the generator, with 128 nodes in the hidden layer. The dimension of the input noise v_{i,X}^{(m)} and v_{i,Y}^{(m)} is set to 10. The performance of the GANs is largely affected by the regularization parameter ε and the number of Sinkhorn iterations R; in our experiments, we set ε = 0.8 and R = 30. In practice, we suggest tuning these parameters by investigating the goodness-of-fit of the resulting generator, which can be done by comparing the conditional histogram of the generated samples to that of the observed samples; see Appendix C for details. We use stochastic gradient descent (SGD) to update the parameters in the GANs, with batch size N = 64. We find the resulting GANs work reasonably well in our experiments.

Finally, we introduce the variance estimator. Define

σ̂²_{b_1,b_2} = max( (n − 1)^{−1} Σ_{i=1}^n [ {h_{1,b_1}(X_i) − Ê{h_{1,b_1}(X_i)|Z_i}} {h_{2,b_2}(Y_i) − Ê{h_{2,b_2}(Y_i)|Z_i}} − GCM{h_{1,b_1}(X), h_{2,b_2}(Y)} ]², ε_0 ),

for any b_1 and b_2, where ε_0 denotes some sufficiently small positive constant. The constant ε_0 guarantees that the denominator of ψ_{b_1,b_2,i} is strictly greater than zero, so that the proposed test has the desired size and power properties.

B ADDITIONAL DETAILS FOR CONDITIONAL DISTRIBUTION APPROXIMATION USING GANS

We first introduce the notion of optimal transport (OT). Let X and Y be closed subsets of R^N, and let µ and ν be probability measures on X and Y, respectively. The Kantorovich formulation of optimal transport is defined by

D_c(µ, ν) := inf_{π∈Π(µ,ν)} ∫_{X×Y} c(x, y) π(dx, dy),

where D_c is the OT cost of transporting µ into ν with respect to a cost function c(·,·), and Π(µ, ν) is the set of all probability measures π whose marginal distributions on X and Y equal µ and ν, respectively. Notably, when the cost function is the Euclidean distance between the two arguments, D_c is better known as the Wasserstein distance. To facilitate the evaluation of OT losses, Cuturi (2013) and Genevay et al. (2017) suggested adding an entropic regularization to D_c. This yields the objective function

D_{c,ε}(µ, ν) = inf_{π∈Π(µ,ν)} { ∫_{X×Y} c(x, y) π(dx, dy) + ε H(π | µ ⊗ ν) },

where H denotes the Kullback-Leibler divergence and µ ⊗ ν is the product measure of µ and ν. Such an objective function can be efficiently evaluated by the Sinkhorn algorithm (Cuturi, 2013; Genevay et al., 2017). To alleviate the bias resulting from the entropic regularization term, Cuturi (2013) and Genevay et al. (2017) further considered the Sinkhorn loss function, defined as

D̃_{c,ε}(µ, ν) = 2 D_{c,ε}(µ, ν) − D_{c,ε}(µ, µ) − D_{c,ε}(ν, ν).

Next, we present our algorithm for learning the generator G_X^{(ℓ)}.

Input: data {X_i}_{i∈I^{(−ℓ)}}, {Z_i}_{i∈I^{(−ℓ)}}, a probability distribution ζ on the latent space V, initial values θ_0 and ϕ_0, batch size N, regularization parameter ε, number of Sinkhorn iterations R, and learning rate α.

Repeat:

Train discriminator: Sample $\{x_j\}_{j=1}^N$ from $\{(X_i, Z_i)\}_{i\in I^{(\ell)}}$; sample $\{z_j\}_{j=1}^N$ from $\{Z_i\}_{i\in I^{(\ell)}}$; sample $\{v_j\}_{j=1}^N$ from $\zeta$; set $\hat{y}^\theta_j \leftarrow (G_\theta(z_j, v_j), z_j)$ for all $j$; update $\phi \leftarrow \phi + \alpha \nabla_\phi \bar{D}_{c,\epsilon}(\hat{x}_N, \hat{y}^\theta_N)$.

Train generator: Sample $\{x_j\}_{j=1}^N$ from $\{(X_i, Z_i)\}_{i\in I^{(\ell)}}$; sample $\{z_j\}_{j=1}^N$ from $\{Z_i\}_{i\in I^{(\ell)}}$; sample $\{v_j\}_{j=1}^N$ from $\zeta$; set $\hat{y}^\theta_j \leftarrow (G_\theta(z_j, v_j), z_j)$ for all $j$; update $\theta \leftarrow \theta - \alpha \nabla_\theta \bar{D}_{c,\epsilon}(\hat{x}_N, \hat{y}^\theta_N)$.

until convergence.

We next comment on the performance of the WGANs using Bellot & van der Schaar (2019)'s code. Specifically, we apply their code to their simulation setting as well as to our experiments to train the WGAN. We then plot the histograms of the real samples and the WGAN samples, scaled between 0 and 1, in Figure 4. It can be seen that the WGANs perform poorly in both settings.

D ADDITIONAL DETAILS FOR MMD

Let $X'$ and $Y'$ be independent copies of X and Y such that they are conditionally independent given Z. Note that
$$E\{h_1(X')h_2(Y')\} = E[E\{h_1(X)\mid Z\}\,E\{h_2(Y)\mid Z\}].$$
Hence,
$$\begin{aligned}
\phi_{XY} &= \mathrm{MMD}(P_{XY}, Q_{XY} \mid \mathcal{H}_1 \otimes \mathcal{H}_2) = \sup_{h_1\in\mathcal{H}_1, h_2\in\mathcal{H}_2} E\{h_1(X)h_2(Y)\} - E\{h_1(X')h_2(Y')\} \\
&= \sup_{h_1\in\mathcal{H}_1, h_2\in\mathcal{H}_2} E\{h_1(X)h_2(Y)\} - E[E\{h_1(X)\mid Z\}E\{h_2(Y)\mid Z\}] \\
&= \sup_{h_1\in\mathcal{H}_1, h_2\in\mathcal{H}_2} E\{h_1(X)h_2(Y)\} - E[h_1(X)E\{h_2(Y)\mid Z\}] - E[E\{h_1(X)\mid Z\}h_2(Y)] + E[E\{h_1(X)\mid Z\}E\{h_2(Y)\mid Z\}] \\
&= \sup_{h_1\in\mathcal{H}_1, h_2\in\mathcal{H}_2} E\big[\big\{h_1(X) - E\{h_1(X)\mid Z\}\big\}\big\{h_2(Y) - E\{h_2(Y)\mid Z\}\big\}\big].
\end{aligned}$$
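The final identity above (the MMD witness objective equals the covariance of the conditional-mean residuals) can be checked by Monte Carlo. The Gaussian setup below, where the conditional means are available in closed form, is purely our illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
z = rng.normal(size=n)
x = z + rng.normal(size=n)          # E[X|Z] = Z
y = 0.5 * z + rng.normal(size=n)    # E[Y|Z] = 0.5 Z, and X independent of Y given Z

h1, h2 = x, y                        # identity transformations for simplicity
# E[h1(X) h2(Y)] - E[ E{h1(X)|Z} E{h2(Y)|Z} ]
lhs = np.mean(h1 * h2) - np.mean(z * (0.5 * z))
# E[ (h1(X) - E{h1(X)|Z}) (h2(Y) - E{h2(Y)|Z}) ]
rhs = np.mean((h1 - z) * (h2 - 0.5 * z))
```

Both quantities agree up to Monte Carlo error, and both are near zero here because the data are generated under conditional independence.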

E ADDITIONAL NUMERICAL RESULTS

In our simulation settings, we find that DL-CIT, CCIT and KCIT cannot control the type-I error in finite samples. In particular, the type-I errors of DL-CIT are larger than 0.2 in almost all cases. We first use DL-CIT as an example to investigate its type-I error with a larger sample size. We fix $d_Z = 150$, and generate Z from a standard normal distribution. It can be seen that DL-CIT requires a very large sample size, e.g., n = 5000, in order to control the type-I error. We next conduct additional experiments to investigate the performance of the proposed test with a small sample size, i.e., n = 500. We fix the dimension d = 100, generate Z from a standard normal distribution, and set δ to 0.9 under $H_1$. We did not implement DL-CIT and KCIT, because they are no longer valid even when n = 1000. The type-I errors of CCIT at α = 0.05 and 0.1 are 0.08 and 0.13, respectively. The differences 0.08 - 0.05 and 0.13 - 0.1 are statistically significant, as they exceed the Monte Carlo errors $1.96\sqrt{0.05 \times 0.95/500} = 0.019$ and $1.96\sqrt{0.1 \times 0.9/500} = 0.026$. Thus its power is not meaningful for comparison.
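The Monte Carlo error bound used above is the usual normal-approximation half-width $1.96\sqrt{\alpha(1-\alpha)/n_{\mathrm{rep}}}$ for an empirical rejection rate over $n_{\mathrm{rep}}$ replications; a quick check of the two numbers:

```python
import math

def mc_error(alpha, n_rep, z=1.96):
    """95% half-width for an empirical rejection rate when the true level is alpha."""
    return z * math.sqrt(alpha * (1 - alpha) / n_rep)

print(round(mc_error(0.05, 500), 3))  # 0.019
print(round(mc_error(0.10, 500), 3))  # 0.026
```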

F PROOFS

We provide the proofs of Proposition 1 and Theorems 2, 3, and 4. We omit the proof of Theorem 1, since it is similar to that of Theorem 2. Theorems 1-4 are established under our choice of the function classes $\mathcal{H}_1$ and $\mathcal{H}_2$, which are set to classes of neural networks with a single hidden layer, finitely many hidden nodes, and the sigmoid activation function, as used in our implementation. Meanwhile, the results can be extended to more general cases.

F.1 PROOF OF PROPOSITION 1

Note that the total variation distance is bounded by 1. Suppose $E\,d_{TV}(\widehat P_{X|Z}, P_{X|Z}) = o(1)$. Then $d_{TV}(\widehat P_{X|Z}, P_{X|Z}) = o_p(1)$, and by the dominated convergence theorem, $E\,d^2_{TV}(\widehat P_{X|Z}, P_{X|Z}) = o(1)$. By Theorem 1.2 of Devroye et al. (2018), $d_{TV}(\widehat P_{X|Z}, P_{X|Z})$ is proportional to
$$\min\Big(1,\ \sigma_0^{-1}\sqrt{\textstyle\sum_{i=1}^n \{Z_i^\top(\widehat\beta - \beta_0)\}^2}\Big).$$
It follows that $\sigma_0^{-1} E\sqrt{\sum_{i=1}^n \{Z_i^\top(\widehat\beta - \beta_0)\}^2} = o(1)$. Applying Theorem 1.2 of Devroye et al. (2018) again, we obtain that $d_{TV}(\widehat P_{X|Z=Z_i}, P_{X|Z=Z_i})$ is proportional to $\min(1, \sigma_0^{-1}|Z_i^\top(\widehat\beta - \beta_0)|)$. Thus, we obtain $\sum_{i=1}^n E\,d^2_{TV}(\widehat P_{X|Z=Z_i}, P_{X|Z=Z_i}) = o(1)$. Since the data are exchangeable, we have
$$E\,d^2_{TV}(\widehat P_{X|Z=Z_i}, P_{X|Z=Z_i}) = o(n^{-1}). \quad (6)$$
This shows that when the RHS of (2) is o(1), (6) automatically holds.

Next, we show that (6) is violated in the linear regression example. By data exchangeability, it suffices to show that $\sum_{i=1}^n E\,d^2_{TV}\{\widehat P_{X|Z=Z_i}, P_{X|Z=Z_i}\}$ is not o(1). With some calculations, we obtain
$$\sum_{i=1}^n E\min\big(1,\ \sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2\big) = \sum_{i=1}^n E\,\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 I\{\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 \le 1\} + \sum_{i=1}^n E\,I\{\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 > 1\}$$
$$= \sum_{i=1}^n E\,\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 - \sum_{i=1}^n E\big[\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 - 1\big] I\{\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 > 1\}. \quad (7)$$
By the definition of $\widehat\beta$, we have
$$\sum_{i=1}^n E\,\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 = \frac{1}{\sigma_0^2} E(\widehat\beta-\beta_0)^\top Z^\top Z(\widehat\beta-\beta_0) = \frac{1}{\sigma_0^2} E\,\varepsilon^\top Z(Z^\top Z)^{-1}Z^\top\varepsilon,$$
where $\varepsilon = (\varepsilon_1, \cdots, \varepsilon_n)^\top$ consists of i.i.d. copies of $\varepsilon$ defined in Example 1. It follows that
$$\sum_{i=1}^n E\,\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 = \frac{1}{\sigma_0^2}\mathrm{trace}\big\{E\,\varepsilon\varepsilon^\top Z(Z^\top Z)^{-1}Z^\top\big\} = \mathrm{trace}\big\{E\,Z(Z^\top Z)^{-1}Z^\top\big\} = d_Z, \quad (8)$$
where $d_Z$ is the dimension of Z. In the following, we show
$$\sum_{i=1}^n E\,\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 I\{\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 \ge 1\} = o(1). \quad (9)$$
Combining this together with (7) and (8) yields $\sum_{i=1}^n E\min(1, \sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2) \ge d_Z - o(1) \ge 1 - o(1)$, and hence $\sum_{i=1}^n E\,d^2_{TV}\{\widehat P_{X|Z=Z_i}, P_{X|Z=Z_i}\} \ge 1 - o(1)$ up to the proportionality constant. The proof is hence completed.
It remains to show (9), or equivalently, $nE\,\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 I\{\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 \ge 1\} = o(1)$. We have already shown that $nE\,\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 = d_Z$. By the dominated convergence theorem, it suffices to show $n\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 I\{\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 \ge 1\} = o_p(1)$. By definition, it suffices to show $\Pr(\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 \ge 1) \to 0$. This is immediate by Markov's inequality, as $E\,\sigma_0^{-2}|Z_i^\top(\widehat\beta-\beta_0)|^2 = d_Z/n \to 0$. This completes the proof of Proposition 1.

F.2 PROOF OF THEOREM 2

We begin by providing an upper bound for the function classes $\mathcal{H}_1$ and $\mathcal{H}_2$. Recall that $\theta_{1,b}$ and $\theta_{2,b}$ are generated according to $N(0, 2I_{d_1}/d_1)$ and $N(0, 2I_{d_2}/d_2)$, respectively. Since $\mathcal{H}_1$ and $\mathcal{H}_2$ are classes of neural networks with a single hidden layer and finitely many hidden nodes, both $d_1$ and $d_2$ are finite, and the elements $\{\sqrt{d_1}\,\theta_{1,b,j}\}_{b,j}$ and $\{\sqrt{d_2}\,\theta_{2,b,j}\}_{b,j}$ are i.i.d. standard normal. For any standard normal variable Z and any $t \ge 1$, we have
$$\Pr(|Z| > t) \le 2\int_t^{+\infty} \phi(z)\,dz \le 2\int_t^{+\infty} z\phi(z)\,dz \le 2\exp\Big(-\frac{t^2}{2}\Big),$$
where $\phi(\cdot)$ denotes the standard normal density function. Setting $t = c^*\sqrt{2\log n}$ for some constant $c^* \ge 1$, we obtain $\Pr(|Z| > t) \le 2n^{-c^{*2}} \le 2n^{-c^*}$. It follows from Bonferroni's inequality that
$$\Pr\Big(\max_{b,j} |\sqrt{d_1}\,\theta_{1,b,j}| > t \ \text{ or }\ \max_{b,j} |\sqrt{d_2}\,\theta_{2,b,j}| > t\Big) \le B d_1 d_2\, n^{-c^*}. \quad (10)$$
Define the test statistic
$$T^{**} = \max_{b_1,b_2} \widehat\sigma^{-1}_{b_1,b_2} \Big| \frac{1}{n}\sum_{i=1}^n \Big\{h_{1,b_1}(X_i) - \frac{1}{M}\sum_{m=1}^M h_{1,b_1}(X_i^{(m)})\Big\} \Big\{h_{2,b_2}(Y_i) - \frac{1}{M}\sum_{m=1}^M h_{2,b_2}(Y_i^{(m)})\Big\} \Big|,$$
where $\widehat\sigma_{b_1,b_2}$ is constructed based on $\{\widetilde X_i^{(m)}, \widetilde Y_i^{(m)}\}_{i,m}$. (11) Consequently, the difference $|T - T^{**}|$ is upper bounded by $I_1 + I_2 + I_3$, where
$$I_1 = \max_{b_1,b_2} \widehat\sigma^{-1}_{b_1,b_2} \Big| \frac{1}{n}\sum_{i=1}^n \Big[\frac{1}{M}\sum_{m=1}^M \{h_{1,b_1}(X_i^{(m)}) - h_{1,b_1}(\widetilde X_i^{(m)})\}\Big] \Big\{h_{2,b_2}(Y_i) - \frac{1}{M}\sum_{m=1}^M h_{2,b_2}(Y_i^{(m)})\Big\} \Big|,$$
$$I_2 = \max_{b_1,b_2} \widehat\sigma^{-1}_{b_1,b_2} \Big| \frac{1}{n}\sum_{i=1}^n \Big\{h_{1,b_1}(X_i) - \frac{1}{M}\sum_{m=1}^M h_{1,b_1}(X_i^{(m)})\Big\} \Big[\frac{1}{M}\sum_{m=1}^M \{h_{2,b_2}(Y_i^{(m)}) - h_{2,b_2}(\widetilde Y_i^{(m)})\}\Big] \Big|,$$
$$I_3 = \max_{b_1,b_2} \widehat\sigma^{-1}_{b_1,b_2} \Big| \frac{1}{n}\sum_{i=1}^n \Big[\frac{1}{M}\sum_{m=1}^M \{h_{1,b_1}(X_i^{(m)}) - h_{1,b_1}(\widetilde X_i^{(m)})\}\Big] \Big[\frac{1}{M}\sum_{m=1}^M \{h_{2,b_2}(Y_i^{(m)}) - h_{2,b_2}(\widetilde Y_i^{(m)})\}\Big] \Big|.$$
By definition, we have $\min_{b_1,b_2} \widehat\sigma_{b_1,b_2} \ge \sqrt{\epsilon_0}$ for some constant $\epsilon_0 > 0$. It thus suffices to show $I_j^* = O_p(n^{-2\kappa}\log n)$ for $j = 1, 2, 3$, where $I_1^*$, $I_2^*$, $I_3^*$ are defined as $I_1$, $I_2$, $I_3$ with the factors $\widehat\sigma^{-1}_{b_1,b_2}$ removed.
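The Gaussian tail bound used above, $\Pr(|Z| > t) \le 2\exp(-t^2/2)$ for $t \ge 1$, can be checked numerically against the exact two-sided tail $\mathrm{erfc}(t/\sqrt{2})$. This check is our own illustration, not part of the proof.

```python
import math

def exact_two_sided_tail(t):
    """Pr(|Z| > t) for a standard normal Z, via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))

def tail_bound(t):
    """The bound 2 exp(-t^2 / 2) used in the proof (valid for t >= 1)."""
    return 2.0 * math.exp(-t * t / 2.0)

# the bound dominates the exact tail on a grid of t >= 1
for t in [1.0, 1.5, 2.0, 3.0, 5.0]:
    assert exact_two_sided_tail(t) <= tail_bound(t)
```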
The rest of the proof is divided into four steps. In the first three steps, we show $I_j^{(\ell)} = O_p(n^{-2\kappa}\log n)$ for $j = 1, 2, 3$. In the last step, we show $T^* - T^{**} = O_p(n^{-2\kappa}\log n)$.

Step 1. Without loss of generality, suppose functions in $\mathcal{H}_1$ and $\mathcal{H}_2$ are bounded by $\log n$ in absolute value. By Bernstein's inequality, for any b and i,
$$\Pr\Big(\Big|\sum_{m=1}^M h_{1,b}(X_i^{(m)}) - M E\{h_{1,b}(X_i)\mid Z_i\}\Big| \ge t\Big) \le 2\exp\Big(-\frac{t^2}{2(M\log n + t\sqrt{\log n}/3)}\Big).$$
Set $t = \sqrt{3(c+2)M}\,\log n$, where the constant c is defined in the statement of Theorem 1. For sufficiently large n, we have $t\sqrt{\log n}/3 \le M\log n/2$. It follows that
$$\Pr\Big(\Big|\sum_{m=1}^M h_{1,b}(X_i^{(m)}) - M E\{h_{1,b}(X_i)\mid Z_i\}\Big| \ge \sqrt{3(c+2)M}\,\log n\Big) \le \frac{2}{n^{c+2}}.$$
By Bonferroni's inequality, we obtain
$$\Pr\Big(\max_{b\in\{1,\cdots,B\}}\max_{i\in\{1,\cdots,n\}}\Big|\sum_{m=1}^M h_{1,b}(X_i^{(m)}) - M E\{h_{1,b}(X_i)\mid Z_i\}\Big| \ge \sqrt{3(c+2)M}\,\log n\Big) \le \frac{2Bn}{n^{c+2}}.$$
Under the condition $B = O(n^c)$, we obtain with probability $1 - O(n^{-1})$ that
$$\max_{b\in\{1,\cdots,B\}}\max_{i\in\{1,\cdots,n\}} \frac{1}{M}\Big|\sum_{m=1}^M h_{1,b}(X_i^{(m)}) - M E\{h_{1,b}(X_i)\mid Z_i\}\Big| \le O(1)\, n^{-1/2}\log n, \quad (12)$$
as M is proportional to n, where O(1) denotes some positive constant. Similarly, we can show that
$$\max_{b\in\{1,\cdots,B\}}\max_{i\in I^{(\ell)}}\Big|\sum_{m=1}^M h_{1,b}(\widetilde X_i^{(m)}) - M \int_x h_{1,b}(x)\,\widehat P^{(\ell)}_{X|Z=Z_i}(dx)\Big| \le O(1)\sqrt{n}\log n,$$
with probability $1 - O(n^{-1})$, as well. Combining this together with (12), we obtain with probability $1 - O(n^{-1})$ that
$$\max_{b\in\{1,\cdots,B\}}\max_{i\in I^{(\ell)}}\Big|\sum_{m=1}^M \{h_{1,b}(X_i^{(m)}) - h_{1,b}(\widetilde X_i^{(m)})\} - M \int_x h_{1,b}(x)\{P_{X|Z=Z_i}(dx) - \widehat P^{(\ell)}_{X|Z=Z_i}(dx)\}\Big| \le O(1)\sqrt{n}\log n. \quad (13)$$
Conditional on $Z_i$, the expectation of $h_{2,b_2}(Y_i) - M^{-1}\sum_{m=1}^M h_{2,b_2}(Y_i^{(m)})$ equals zero. Under the null hypothesis, the expectation of $M^{-1}\sum_{m=1}^M \{h_{1,b_1}(X_i^{(m)}) - h_{1,b_1}(\widetilde X_i^{(m)})\}\{h_{2,b_2}(Y_i) - M^{-1}\sum_{m=1}^M h_{2,b_2}(Y_i^{(m)})\}$ equals zero as well.
Applying Bernstein's inequality again, we can similarly show that, with probability tending to 1, $I_1^{(\ell)}$ can be upper bounded by $O(1)(\sigma n^{-1/2}\log^{3/2} n + n^{-1}\log^2 n)$, where O(1) denotes some positive constant and
$$\sigma^2 = \max_{b_1,b_2} E\Big[\frac{1}{M}\sum_{m=1}^M \{h_{1,b_1}(X_i^{(m)}) - h_{1,b_1}(\widetilde X_i^{(m)})\}\Big\{h_{2,b_2}(Y_i) - \frac{1}{M}\sum_{m=1}^M h_{2,b_2}(Y_i^{(m)})\Big\}\Big]^2 \le \max_{b_1} E\Big[\frac{1}{M}\sum_{m=1}^M \{h_{1,b_1}(X_i^{(m)}) - h_{1,b_1}(\widetilde X_i^{(m)})\}\Big]^2 \log n.$$
Let $\mathcal{A}$ denote the event in (13). The last term on the second line can be bounded from above by
$$\max_{b_1,i} E\Big[\frac{1}{M}\sum_{m=1}^M \{h_{1,b_1}(X_i^{(m)}) - h_{1,b_1}(\widetilde X_i^{(m)})\}\Big]^2 I(\mathcal{A})\log n \quad (15)$$
$$+ \max_{b_1,i} E\Big[\frac{1}{M}\sum_{m=1}^M \{h_{1,b_1}(X_i^{(m)}) - h_{1,b_1}(\widetilde X_i^{(m)})\}\Big]^2 I(\mathcal{A}^c)\log n. \quad (16)$$
Since M is proportional to n, by (11), the term in (15) is upper bounded by
$$O(1)\Big[n^{-1}\log^2 n + \max_{b\in\{1,\cdots,B\}}\max_{i\in I^{(\ell)}} E\Big\{\int_x h_{1,b}(x)\{\widehat P^{(\ell)}_{X|Z=Z_i}(dx) - P_{X|Z=Z_i}(dx)\}\Big\}^2\Big]\log n.$$
By the boundedness of the function class $\mathcal{H}_1$, it can be further bounded from above by $O(1)\{n^{-1}\log^3 n + E\,d^2_{TV}(\widehat P^{(\ell)}_{X|Z}, P_{X|Z})\log^2 n\}$. Under the current conditions, the above quantity is of the order $O(n^{-2\kappa}\log^2 n)$. Consequently, (15) is of the order $O(n^{-2\kappa}\log^2 n)$. Note that the event $\mathcal{A}$ occurs with probability at least $1 - O(n^{-1})$. By the boundedness of the function class $\mathcal{H}_1$, (16) is of the order $O(n^{-1}\log^2 n)$. To summarize, $\sigma^2$ is of the order $O(n^{-2\kappa}\log^2 n)$. This implies that $I_1^{(\ell)}$ can be bounded from above by $O(n^{-1/2-\kappa}\log^{5/2} n)$, which yields $I_1^{(\ell)} = O_p(n^{-2\kappa}\log n)$.

Step 2. Step 2 can be proven in a similar manner as Step 1 and is thus omitted.

Step 3. Under $H_0$, the expectation of
$$\frac{1}{|I^{(\ell)}|}\sum_{i\in I^{(\ell)}} \Big[\frac{1}{M}\sum_{m=1}^M \{h_{1,b_1}(X_i^{(m)}) - h_{1,b_1}(\widetilde X_i^{(m)})\}\Big]\Big[\frac{1}{M}\sum_{m=1}^M \{h_{2,b_2}(Y_i^{(m)}) - h_{2,b_2}(\widetilde Y_i^{(m)})\}\Big]$$
equals
$$E\Big[\int_x h_{1,b_1}(x)\{\widehat P^{(\ell)}_{X|Z}(dx) - P_{X|Z}(dx)\}\int_y h_{2,b_2}(y)\{\widehat P^{(\ell)}_{Y|Z}(dy) - P_{Y|Z}(dy)\}\Big].$$
By the Cauchy-Schwarz inequality, the absolute value of this expectation is bounded by
$$\frac{1}{2} E\,d^2_{TV}\{\widehat P^{(\ell)}_{X|Z=Z_i}, P_{X|Z}\} + \frac{1}{2} E\,d^2_{TV}\{\widehat P^{(\ell)}_{Y|Z=Z_i}, P_{Y|Z}\} = O(n^{-2\kappa}).$$
This yields
$$\max_{b_1,b_2} \Big| E\Big[\int_x h_{1,b_1}(x)\{\widehat P^{(\ell)}_{X|Z}(dx) - P_{X|Z}(dx)\}\int_y h_{2,b_2}(y)\{\widehat P^{(\ell)}_{Y|Z}(dy) - P_{Y|Z}(dy)\}\Big]\Big| = O(n^{-2\kappa}\log n).$$
Using arguments similar to those in Step 1, we can show that $I_3^{(\ell)}$ differs from the above maximum by $O_p(n^{-2\kappa}\log n)$. Thus, we obtain $I_3^{(\ell)} = O_p(n^{-2\kappa}\log n)$. This completes the proof of Step 3.

Step 4. Denote by $\sigma^{*2}_{b_1,b_2}$ the variance estimator computed with $\{\widetilde X_i^{(m)}, \widetilde Y_i^{(m)}\}_{i,m}$. To show $|T^* - T^{**}| = O_p(n^{-2\kappa})$, it suffices to show $\max_{b_1,b_2} |\widehat\sigma^{-1}_{b_1,b_2} - \sigma^{*-1}_{b_1,b_2}| = O_p(n^{-c})$ for some constant $c > 0$. Since both $\widehat\sigma_{b_1,b_2}$ and $\sigma^{*}_{b_1,b_2}$ are bounded away from zero, it suffices to show $\max_{b_1,b_2} |\widehat\sigma^2_{b_1,b_2} - \sigma^{*2}_{b_1,b_2}| = O_p(n^{-c})$. Using arguments similar to those in Steps 1 and 3, we can show
$$\max_{b_1,b_2}\Big|\widehat\sigma^2_{b_1,b_2} - \frac{n}{n-1}\mathrm{Var}\big([h_{1,b_1}(X) - E\{h_{1,b_1}(X)\mid Z\}][h_{2,b_2}(Y) - E\{h_{2,b_2}(Y)\mid Z\}]\big)\Big| = O_p(n^{-c}),$$
$$\max_{b_1,b_2}\Big|\sigma^{*2}_{b_1,b_2} - \frac{n}{n-1}\mathrm{Var}\big([h_{1,b_1}(X) - E\{h_{1,b_1}(X)\mid Z\}][h_{2,b_2}(Y) - E\{h_{2,b_2}(Y)\mid Z\}]\big)\Big| = O_p(n^{-c}).$$
This completes the proof of Theorem 2.

Define
$$T^{***} = \max_{b_1,b_2} \sigma^{-1}_{b_1,b_2}\Big|n^{-1}\sum_{i=1}^n \Big\{h_{1,b_1}(X_i) - \frac{1}{M}\sum_{m=1}^M h_{1,b_1}(X_i^{(m)})\Big\}\Big\{h_{2,b_2}(Y_i) - \frac{1}{M}\sum_{m=1}^M h_{2,b_2}(Y_i^{(m)})\Big\}\Big|,$$
where $\sigma^2_{b_1,b_2} = \max\big(\frac{n}{n-1}\mathrm{Var}\big([h_{1,b_1}(X) - E\{h_{1,b_1}(X)\mid Z\}][h_{2,b_2}(Y) - E\{h_{2,b_2}(Y)\mid Z\}]\big),\ \epsilon_0\big)$. By (12), using arguments similar to those for $I_1$ in the proof of Theorem 2, we can show $T^{***} - T^{****} = O_p(n^{-2\kappa}\log n)$, where
$$T^{****} = \max_{b_1,b_2} \sigma^{-1}_{b_1,b_2}\Big|n^{-1}\sum_{i=1}^n [h_{1,b_1}(X_i) - E\{h_{1,b_1}(X_i)\mid Z_i\}]\Big\{h_{2,b_2}(Y_i) - \frac{1}{M}\sum_{m=1}^M h_{2,b_2}(Y_i^{(m)})\Big\}\Big|.$$
Similarly, we can show $T^{****} - T_0 = O_p(n^{-2\kappa}\log n)$, where
$$T_0 = \max_{b_1,b_2} \sigma^{-1}_{b_1,b_2}\Big|n^{-1}\sum_{i=1}^n [h_{1,b_1}(X_i) - E\{h_{1,b_1}(X_i)\mid Z_i\}][h_{2,b_2}(Y_i) - E\{h_{2,b_2}(Y_i)\mid Z_i\}]\Big|.$$
To summarize, we have shown $T - T_0 = O_p(n^{-2\kappa}\log n)$. Since we require $\kappa > 1/4$, we obtain $\sqrt{n}(T - T_0) = o_p(\log^{-1/2} n)$.
Define a $B^2 \times B^2$ matrix $\Sigma_0$ whose $\{b_1 + B(b_2-1),\ b_3 + B(b_4-1)\}$th entry is given by
$$\mathrm{Cov}\Big(\sigma^{-1}_{b_1,b_2}[h_{1,b_1}(X_i) - E\{h_{1,b_1}(X_i)\mid Z_i\}][h_{2,b_2}(Y_i) - E\{h_{2,b_2}(Y_i)\mid Z_i\}],\ \sigma^{-1}_{b_3,b_4}[h_{1,b_3}(X_i) - E\{h_{1,b_3}(X_i)\mid Z_i\}][h_{2,b_4}(Y_i) - E\{h_{2,b_4}(Y_i)\mid Z_i\}]\Big).$$
In the following, we show
$$\sup_t \big|\Pr(\sqrt{n}\,T_0 \le t \mid H_0) - \Pr(\|N(0, \Sigma_0)\|_\infty \le t)\big| = o(1). \quad (19)$$
When B is finite, this is implied by classical weak convergence results. When B diverges with n, we require $B = O(n^c)$ for some constant $c > 0$. By the definition of $\sigma_{b_1,b_2}$, the variance of $\sigma^{-1}_{b_1,b_2}[h_{1,b_1}(X_i) - E\{h_{1,b_1}(X_i)\mid Z_i\}][h_{2,b_2}(Y_i) - E\{h_{2,b_2}(Y_i)\mid Z_i\}]$ is bounded from above by $(n-1)/n$. Moreover, combining the boundedness assumption on the function spaces $\mathcal{H}_1$ and $\mathcal{H}_2$ with the definition of $\sigma_{b_1,b_2}$ yields that the variables $\sigma^{-1}_{b_1,b_2}[h_{1,b_1}(X_i) - E\{h_{1,b_1}(X_i)\mid Z_i\}][h_{2,b_2}(Y_i) - E\{h_{2,b_2}(Y_i)\mid Z_i\}]$, $b_1, b_2 \in \{1, \cdots, B\}$, are uniformly bounded for any sufficiently small $\epsilon_0 > 0$. The little-o terms above are uniform in t. As such, the distribution of our test statistic can be well-approximated by that of the bootstrap samples. This completes the proof of Theorem 3.
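The limiting quantity $\|N(0, \Sigma_0)\|_\infty$ above has no closed-form distribution, but its quantiles are straightforward to simulate, which is how a critical value for a max-type statistic is typically obtained. The sketch below is our own illustration with an arbitrary equicorrelated covariance matrix, not the paper's bootstrap procedure.

```python
import numpy as np

def max_norm_quantile(Sigma, alpha=0.05, n_draws=100_000, seed=0):
    """Monte Carlo (1 - alpha)-quantile of ||N(0, Sigma)||_inf."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)                 # Sigma must be positive definite
    draws = rng.normal(size=(n_draws, Sigma.shape[0])) @ L.T
    return float(np.quantile(np.abs(draws).max(axis=1), 1 - alpha))

# illustration: an equicorrelated 4 x 4 covariance matrix
Sigma = 0.3 * np.ones((4, 4)) + 0.7 * np.eye(4)
crit = max_norm_quantile(Sigma, alpha=0.05)
```

The resulting critical value lies between the single-coordinate quantile (about 1.96) and the Bonferroni bound, with positive correlation pulling it toward the lower end.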

F.4 PROOF OF THEOREM 4

According to the universal approximation theorem (Barron, 1993), neural networks with a single hidden layer and the sigmoid activation function are universal approximators. Under $H_1^*$, there exist two neural network functions f(X) and g(Y), with a single hidden layer and the sigmoid activation function, such that
$$E[f(X) - E\{f(X)\mid Z\}][g(Y) - E\{g(Y)\mid Z\}] \ne 0.$$
Note that f (respectively, g) can be represented by linear combinations of functions in $\mathcal{H}_1$ (respectively, $\mathcal{H}_2$). It follows that there exists some constant $c^* > 0$ such that $\mathrm{GCM}^*(h_1^*(X), h_2^*(Y)) > c^*$ for some $h_1^* \in \mathcal{H}_1$ and $h_2^* \in \mathcal{H}_2$. Let $\theta_1^*$ and $\theta_2^*$ be the corresponding parameters such that $h_1^* = h_{1,\theta_1^*}$ and $h_2^* = h_{2,\theta_2^*}$.

We next show that $\mathrm{GCM}^*(h_{1,\theta_1}(X), h_{2,\theta_2}(Y))$ is a Lipschitz continuous function of $(\theta_1, \theta_2)$. Note that $h_{1,\theta_1}(X)$ and $h_{2,\theta_2}(Y)$ are Lipschitz continuous functions of $\theta_1$ and $\theta_2$, respectively. For any $\theta_{1,1}, \theta_{1,2} \in \mathbb{R}^{d_1}$ and $\theta_{2,1}, \theta_{2,2} \in \mathbb{R}^{d_2}$, writing $h_{1,j} = h_{1,\theta_{1,j}}$ and $h_{2,j} = h_{2,\theta_{2,j}}$ for $j = 1, 2$, we have
$$|\mathrm{GCM}^*(h_{1,1}(X), h_{2,1}(Y)) - \mathrm{GCM}^*(h_{1,2}(X), h_{2,2}(Y))| \le |E[h_{1,1}(X) - E\{h_{1,1}(X)\mid Z\} - h_{1,2}(X) + E\{h_{1,2}(X)\mid Z\}][h_{2,1}(Y) - E\{h_{2,1}(Y)\mid Z\}]| \quad (21)$$
$$+ |E[h_{1,2}(X) - E\{h_{1,2}(X)\mid Z\}][h_{2,1}(Y) - E\{h_{2,1}(Y)\mid Z\} - h_{2,2}(Y) + E\{h_{2,2}(Y)\mid Z\}]|. \quad (22)$$
Since $\mathcal{H}_2$ is a bounded function class, the term in (21) is bounded from above by $O(1) E|h_{1,1}(X) - E\{h_{1,1}(X)\mid Z\} - h_{1,2}(X) + E\{h_{1,2}(X)\mid Z\}|\log n$, where O(1) denotes some positive constant. By the triangle and Jensen's inequalities, the above quantity can be further bounded from above by $O(1)\, 2E|h_{1,1}(X) - h_{1,2}(X)|\log n \le L\|\theta_{1,1} - \theta_{1,2}\|_2\log n$ for some constant $L > 0$. Using similar arguments, the term in (22) is bounded from above by $L\|\theta_{2,1} - \theta_{2,2}\|_2\log n$. To summarize, we have shown that
$$|\mathrm{GCM}^*(h_{1,1}(X), h_{2,1}(Y)) - \mathrm{GCM}^*(h_{1,2}(X), h_{2,2}(Y))| \le L(\|\theta_{1,1} - \theta_{1,2}\|_2 + \|\theta_{2,1} - \theta_{2,2}\|_2)\log n.$$
As such, for any sufficiently small $\epsilon > 0$, there exists a neighborhood $\mathcal{N} = \{(\theta_1, \theta_2) : \|\theta_j - \theta_j^*\|_2 \le \delta\log^{-1/2} n,\ j = 1, 2\}$ around $(\theta_1^*, \theta_2^*)$, for some constant $\delta > 0$, such that $\mathrm{GCM}^*(h_{1,\theta_1}(X), h_{2,\theta_2}(Y)) \ge \epsilon$ for any $(\theta_1, \theta_2)$ in this neighborhood. Since we use a multivariate normal distribution to generate $(\theta_{1,b}, \theta_{2,b})$ and the dimensions $d_1$ and $d_2$ are finite, the probability that $(\theta_{1,b}, \theta_{2,b})$ belongs to this neighborhood is at least of the order $\log^{-c_1} n$ for some constant $c_1 > 0$. Since $B = c_0 n^c$, the probability that at least one pair of parameters $(\theta_{1,b_1}, \theta_{2,b_2})$ belongs to this neighborhood approaches one. Consequently, we have $\max_{b_1,b_2} \mathrm{GCM}^*(h_{1,b_1}(X), h_{2,b_2}(Y)) \ge \epsilon$ with probability tending to 1. Using arguments similar to those in the proofs of Theorems 2 and 3, we can show that $|T - \max_{b_1,b_2}\mathrm{GCM}^*(h_{1,b_1}(X), h_{2,b_2}(Y))| = o_p(1)$ and $\widehat T_j = o_p(1)$. Consequently, both $\Pr(T < \epsilon/2)$ and $\Pr(\widehat T_j \ge \epsilon/2)$ converge to zero. As such, the probability that the p-value is greater than α is bounded by $\Pr(T < \epsilon/2) + \Pr(\widehat T_j \ge \epsilon/2)$, and hence converges to zero. This completes the proof of Theorem 4.



Figure 1: Illustration of the conditional independence test with double GANs.

Figure 2: Top panels: the empirical type-I error rate of various tests under H 0 . From left to right: normal Z with α = 0.1, normal Z with α = 0.05, Laplacian Z with α = 0.1, and Laplacian Z with α = 0.05. Bottom panels: the empirical power of various tests under H 1 . From left to right: d Z = 100, α = 0.1, d Z = 100, α = 0.05, d Z = 200, α = 0.1, and d Z = 200, α = 0.05.

In Algorithm 3, $\hat{x}_N$ and $\hat{y}^\theta_N$ denote the empirical distributions of $\{x_j\}_{j=1}^N$ and $\{\hat{y}^\theta_j\}_{j=1}^N$, respectively.

Figure 3: Conditional Histograms. GANs are trained using data generated from the simulation study (see Section 5.1).

Figure 4: The line plots and histograms of real samples and GCIT WGAN samples (scaled between 0 and 1). The top panels are the results under Bellot & van der Schaar (2019)'s setup with three different simulations, and the bottom panels are the results under our setup with three different simulations.

Figure 5: Type-I errors of DL-CIT with larger sample sizes.

Type-I error rate and power of various tests under $H_0$ and $H_1$; n = 500, d = 100, normal Z, and δ = 0.9 under $H_1$.

Since $d_1$ and $d_2$ are finite and $B = O(n^c)$ for some constant $c > 0$, the right-hand side converges to zero as long as $c^* > c$. Consequently, all the weights can be uniformly bounded by $O(\sqrt{\log n})$. Since the sigmoid function is used as the activation function, the function classes are bounded by $O(\sqrt{\log n})$ with probability tending to 1. Without loss of generality, we assume this event holds throughout the proofs of Theorems 2-4. Define a test statistic

Thus, it suffices to show $|T - T^{**}| = O_p(n^{-2\kappa})$ and $|T^* - T^{**}| = O_p(n^{-2\kappa})$. Consider the difference $|T - T^{**}|$. For any sequences $\{a_n\}_n$ and $\{b_n\}_n$, we have $|\max_n |a_n| - \max_n |b_n|| \le \max_n |a_n - b_n|$.

F.3 PROOF OF THEOREM 3

In the proof of Theorem 2, we have shown $T - T^* = O_p(n^{-2\kappa}\log n)$. Using arguments similar to those in Step 4 of the proof of Theorem 2, we can show $T^* - T^{***} = O_p(n^{-2\kappa}\log n)$, where

Using arguments similar to those in Step 4 and Step 5 of the proof of Theorem 2, we can show that $\|\widehat\Sigma - \Sigma_0\|_{\infty,\infty} = O_p(n^{-c})$ for some constant $c > 0$. Using arguments similar to those for (20), together with Lemma 3.1 of Chernozhukov et al. (2015), we have
$$\Pr(\sqrt{n}\,T \le t \mid H_0) \ge \Pr(\|N(0, \widehat\Sigma)\|_\infty \le t - 2\epsilon_0\log^{-1/2} n \mid \widehat\Sigma) - o(1),$$
$$\Pr(\sqrt{n}\,T \le t \mid H_0) \le \Pr(\|N(0, \widehat\Sigma)\|_\infty \le t + 2\epsilon_0\log^{-1/2} n \mid \widehat\Sigma) + o(1),$$
for any sufficiently small $\epsilon_0 > 0$. Since the little-o terms are uniform in $t \in \mathbb{R}$, we obtain
$$\sup_t |\Pr(\sqrt{n}\,T \le t \mid H_0) - \Pr(\|N(0, \widehat\Sigma)\|_\infty \le t \mid \widehat\Sigma)| \le o(1) + \sup_t |\Pr(\|N(0, \widehat\Sigma)\|_\infty \le t + 2\epsilon_0\log^{-1/2} n \mid \widehat\Sigma) - \Pr(\|N(0, \widehat\Sigma)\|_\infty \le t - 2\epsilon_0\log^{-1/2} n \mid \widehat\Sigma)|.$$
The term on the second line can be bounded by $O(1)\epsilon_0\log^{1/2} B\log^{-1/2} n$, where O(1) denotes some positive constant, by Theorem 1 of Chernozhukov et al. (2017). Since $B = O(n^c)$, we have $\log^{1/2} B\log^{-1/2} n = O(1)$. As $\epsilon_0$ goes to zero, this term becomes negligible. Consequently, we obtain $\sup_t |\Pr(\sqrt{n}\,T \le t \mid H_0) - \Pr(\|N(0, \widehat\Sigma)\|_\infty \le t \mid \widehat\Sigma)| = o(1)$.

To derive the theoretical properties of the test statistic T, we first introduce the concept of an "oracle" test statistic $T^*$. If $P_{X|Z}$ and $P_{Y|Z}$ were known a priori, then one can draw {X

The variable importance measures of the elastic net and random forest models, versus the p-values of the GCIT and DGCIT tests, for the anti-cancer drug example.

In this model, the parameter b determines the degree of conditional dependence: when b = 0, $H_0$ holds, and otherwise $H_1$ holds. The sample size is fixed at n = 1000. We call our test DGCIT, short for double GANs-based conditional independence test. We compare it with the GCIT test of Bellot & van der Schaar (2019), the regression-based test (RCIT) of Shah & Peters (2018), the kernel MMD-based test (KCIT) of Zhang et al. (2011), the classifier CI test (CCIT) of Sen et al. (2017), and the deep learning-based CI test (DL-CIT) of Sen et al. (2018).


The quantities $\sigma^{-1}_{b_1,b_2}[h_{1,b_1}(X_i) - E\{h_{1,b_1}(X_i)\mid Z_i\}][h_{2,b_2}(Y_i) - E\{h_{2,b_2}(Y_i)\mid Z_i\}]$, $b_1, b_2 \in \{1, \cdots, B\}$, are uniformly bounded away from infinity. Similar to Corollary 4.1 of Chernozhukov et al. (2014), we can show that (19) holds. Suppose B = 1; then $N(0, \Sigma_0)$ is a single normal variable.


The number of folds L is finite; as such, it suffices to show $I_j^{(\ell)} = O_p(n^{-2\kappa}\log n)$ for each fold $\ell$.

