DOUBLE GENERATIVE ADVERSARIAL NETWORKS FOR CONDITIONAL INDEPENDENCE TESTING

Anonymous authors

Abstract

In this article, we consider the problem of high-dimensional conditional independence testing, which is a key building block in statistics and machine learning. We propose a double generative adversarial networks (GANs)-based inference procedure. We first introduce a double GANs framework to learn two generators, and integrate the two generators to construct a doubly-robust test statistic. We next consider multiple generalized covariance measures, and take their maximum as our test statistic. Finally, we obtain the empirical distribution of our test statistic through the multiplier bootstrap. We show that our test controls the type-I error, while its power approaches one asymptotically. More importantly, these theoretical guarantees are obtained under much weaker and practically more feasible conditions than those required by existing tests. We demonstrate the efficacy of our test through both synthetic and real data examples.

1. INTRODUCTION

Conditional independence (CI) is a fundamental concept in statistics and machine learning. Testing conditional independence is a key building block that plays a central role in a wide variety of statistical learning problems, for instance, causal inference (Pearl, 2009), graphical models (Koller & Friedman, 2009), and dimension reduction (Li, 2018), among others. In this article, we aim to test whether two random variables X and Y are conditionally independent given a set of confounding variables Z. That is, we test the hypotheses

H0 : X ⊥⊥ Y | Z versus H1 : X ⊥̸⊥ Y | Z,   (1)

given the observed data of n i.i.d. copies {(X_i, Y_i, Z_i)}_{1≤i≤n} of (X, Y, Z).

For our problem, X, Y and Z can all be multivariate. However, the main challenge arises when the confounding set of variables Z is high-dimensional. As such, we primarily focus on the scenario with a univariate X and Y, and a multivariate Z; our proposed method can be extended to the multivariate X and Y scenario as well. Another challenge is the limited sample size compared to the dimensionality of Z. As a result, many existing tests are ineffective, with either an inflated type-I error, or insufficient power to detect the alternatives. See Section 2 for a detailed review.

We propose a double generative adversarial networks (GANs, Goodfellow et al., 2014)-based inference procedure for the CI testing problem (1). Our proposal involves two key components: a double GANs framework to learn two generators that approximate the conditional distributions of X given Z and of Y given Z, and a maximum of generalized covariance measures over multiple combinations of transformation functions of X and Y. We first establish that our test statistic is doubly-robust, which offers additional protection against potential misspecification of the conditional distributions (see Theorems 1 and 2). Second, we show the resulting test achieves a valid control of the type-I error asymptotically, and, more importantly, under conditions that are much weaker and practically more feasible (see Theorem 3). Finally, we prove the power of our test approaches one asymptotically (see Theorem 4), and demonstrate empirically that it is more powerful than the competing tests.
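The pipeline outlined above can be made concrete in code. The following is a minimal sketch, not the authors' implementation: the two learned generators appear only as stand-in samplers `sample_x` and `sample_y` (here, oracle samplers in the demo), conditional means of the transformed variables are estimated by averaging generator draws, and the maximum of standardized generalized covariance measures is calibrated by a multiplier bootstrap. All names and design choices below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_gcm_pvalue(X, Y, Z, sample_x, sample_y, funcs, n_boot=1000, n_draws=100):
    """Sketch: max of generalized covariance measures over transformation
    pairs (f, g), calibrated by a Gaussian multiplier bootstrap."""
    n = len(X)
    # Estimate E[f(X)|Z] and E[g(Y)|Z] by averaging draws from the generators.
    draws_x = np.stack([sample_x(Z) for _ in range(n_draws)])   # (n_draws, n)
    draws_y = np.stack([sample_y(Z) for _ in range(n_draws)])
    stats, resid_prods = [], []
    for f in funcs:
        for g in funcs:
            rf = f(X) - f(draws_x).mean(axis=0)   # residual of f(X) given Z
            rg = g(Y) - g(draws_y).mean(axis=0)   # residual of g(Y) given Z
            prod = rf * rg
            sd = prod.std() + 1e-12
            stats.append(np.sqrt(n) * abs(prod.mean()) / sd)
            resid_prods.append((prod - prod.mean()) / sd)
    T = max(stats)                                # max-type test statistic
    R = np.stack(resid_prods)                     # (n_pairs, n)
    # Multiplier bootstrap: perturb residual products by i.i.d. N(0,1) weights.
    e = rng.standard_normal((n_boot, n))
    boot = np.abs(R @ e.T).max(axis=0) / np.sqrt(n)
    return (1 + np.sum(boot >= T)) / (n_boot + 1)

# Demo under the alternative: Y depends on X directly, oracle samplers.
n, p = 500, 5
Z = rng.standard_normal((n, p))
b = rng.standard_normal(p) / np.sqrt(p)
X = Z @ b + rng.standard_normal(n)
Y = X + rng.standard_normal(n)                    # X not independent of Y given Z
sample_x = lambda Zm: Zm @ b + rng.standard_normal(len(Zm))
sample_y = lambda Zm: Zm @ b + np.sqrt(2.0) * rng.standard_normal(len(Zm))
funcs = [lambda x: x, lambda x: x ** 2]
p_alt = max_gcm_pvalue(X, Y, Z, sample_x, sample_y, funcs)
```

Under this alternative the residual products have a clearly nonzero mean, so the max statistic exceeds the bootstrap quantiles and the p-value is small.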

2. RELATED WORKS

There has been a growing literature on conditional independence testing in recent years; see Li & Fan (2019) for a review. Broadly speaking, the existing testing methods can be cast into four main categories: the metric-based tests, e.g., Su & White (2007; 2014) and Wang et al. (2015); the conditional randomization-based tests (Candes et al., 2018; Bellot & van der Schaar, 2019); the kernel-based tests (Fukumizu et al., 2008; Zhang et al., 2011); and the regression-based tests (Hoyer et al., 2009; Zhang et al., 2018; Shah & Peters, 2018). There are other types of tests as well, e.g., Bergsma (2004); Doran et al. (2014); Sen et al. (2017; 2018); Berrett et al. (2019), to mention a few.

The metric-based tests typically employ kernel smoothers to estimate the conditional characteristic function or the distribution function of Y given X and Z. Kernel smoothers, however, are known to suffer from the curse of dimensionality, and as such, these tests are not suitable when the dimension of Z is high. The conditional randomization-based tests require the knowledge of the conditional distribution of X given Z (Candes et al., 2018); when this distribution is unknown, the type-I error rates of these tests rely critically on the quality of its approximation. The kernel-based tests are built upon the notion of maximum mean discrepancy (MMD, Gretton et al., 2012), and can have inflated type-I errors. The regression-based tests have valid type-I error control, but may suffer from inadequate power. Next, we discuss in detail the conditional randomization-based tests, in particular the work of Bellot & van der Schaar (2019), and the regression-based and MMD-based tests, since our proposal is closely related to them.
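The MMD underlying the kernel-based tests compares two samples through kernel mean embeddings, and admits a simple unbiased estimator. A minimal sketch with an RBF kernel (function names are ours, not from any of the cited papers):

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """RBF (Gaussian) kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2_unbiased(S1, S2, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between samples S1 and S2."""
    m, n = len(S1), len(S2)
    Kxx = rbf_kernel(S1, S1, bandwidth)
    Kyy = rbf_kernel(S2, S2, bandwidth)
    Kxy = rbf_kernel(S1, S2, bandwidth)
    # Exclude diagonal (same-point) terms for unbiasedness.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

# Demo: near zero for samples from the same distribution, large for a shift.
rng = np.random.default_rng(1)
S1 = rng.standard_normal((200, 2))
S2 = rng.standard_normal((200, 2))
S3 = rng.standard_normal((200, 2)) + 3.0
mmd_null = mmd2_unbiased(S1, S2)
mmd_alt = mmd2_unbiased(S1, S3)
```

A CI test built on this quantity additionally has to account for the conditioning on Z, which is where the inflation of the type-I error can arise.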

2.1. CONDITIONAL RANDOMIZATION-BASED TESTS

The family of conditional randomization-based tests is built upon the following basis. If the conditional distribution P_{X|Z} of X given Z is known, then one can independently draw X_i^{(1)} ~ P_{X|Z=Z_i} for i = 1, ..., n; by construction, these samples are conditionally independent of the observed X_i's and Y_i's given the Z_i's. Here we use boldface letters to denote data matrices that consist of n samples. The joint distributions of (X, Y, Z) and (X^{(1)}, Y, Z) are the same under H0, and any large difference between the two distributions can be interpreted as evidence against H0. Therefore, one can repeat the process M times to generate X^{(1)}, ..., X^{(M)}, where X^{(m)} = (X_1^{(m)}, ..., X_n^{(m)})^T, and compute the p-value

p = (M + 1)^{-1} {1 + Σ_{m=1}^{M} I( ρ(X^{(m)}, Y, Z) ≥ ρ(X, Y, Z) )},

for some test statistic ρ, where I(•) is the indicator function. Since the triplets (X, Y, Z), (X^{(1)}, Y, Z), ..., (X^{(M)}, Y, Z) are exchangeable under H0, the p-value is valid, and it satisfies that Pr(p ≤ α | H0) ≤ α + o(1) for any 0 < α < 1.

In practice, however, P_{X|Z} is rarely known, and Bellot & van der Schaar (2019) proposed to approximate it using GANs. Specifically, they learned a generator G_X(•, •) from the observed data, then took Z_i and a noise variable v_{i,X}^{(m)} as input to obtain a sample X̃_i^{(m)} = G_X(Z_i, v_{i,X}^{(m)}), which minimizes the divergence between the distributions of (X_i, Z_i) and (X̃_i^{(m)}, Z_i). The p-value is then computed by replacing X^{(m)} with X̃^{(m)} = (X̃_1^{(m)}, ..., X̃_n^{(m)})^T. They called this test GCIT, short for generative conditional independence test. By Theorem 1 of Bellot & van der Schaar (2019), the excess type-I error of this test is upper bounded by

Pr(p ≤ α | H0) − α ≤ D = E[ d_TV( P_{X|Z}, P̃_{X|Z} ) ],   (2)

where P̃_{X|Z} denotes the conditional distribution of the generated samples, d_TV is the total variation norm between two probability distributions, the supremum in its definition is taken over all measurable sets, and the expectation in (2) is taken with respect to Z. By definition, the quantity D on the right-hand side of (2) measures the quality of the conditional distribution approximation. Bellot & van der Schaar (2019) argued that this error term is negligible due to the capacity of deep neural nets in estimating conditional distributions.
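The conditional randomization scheme described above is short to express in code. A minimal sketch, where `stat` and `sample_x_given_z` are placeholders for a chosen test statistic and an (approximate) conditional sampler:

```python
import numpy as np

def crt_pvalue(X, Y, Z, sample_x_given_z, stat, M=100, seed=0):
    """Conditional randomization test: compare the observed statistic with
    statistics computed on M pseudo-copies of X drawn from (an approximation
    of) P_{X|Z}. Valid up to the approximation error D of the sampler."""
    rng = np.random.default_rng(seed)
    t_obs = stat(X, Y, Z)
    exceed = 0
    for _ in range(M):
        X_tilde = sample_x_given_z(Z, rng)   # one pseudo-draw per observation
        exceed += stat(X_tilde, Y, Z) >= t_obs
    # The M + 1 triplets are exchangeable under H0, hence the p-value is valid.
    return (1 + exceed) / (M + 1)

# Demo under the alternative, with the true conditional sampler of X given Z.
rng = np.random.default_rng(2)
n = 300
Z = rng.standard_normal(n)
X = Z + rng.standard_normal(n)
Y = X + rng.standard_normal(n)               # Y depends on X beyond Z
stat = lambda X_, Y_, Z_: abs(np.corrcoef(X_, Y_)[0, 1])
sampler = lambda Z_, r: Z_ + r.standard_normal(len(Z_))
p_alt = crt_pvalue(X, Y, Z, sampler, stat, M=100)
```

The pseudo-copies of X correlate with Y only through Z, so under this alternative the observed statistic dominates and the p-value is small.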
To the contrary, we find this approximation error is usually not negligible, and consequently, it may inflate the type-I error and invalidate the test. We consider a simple example to elaborate.

Example 1. Suppose X is one-dimensional, and follows a simple linear regression model, X = Z^T β0 + ε, where the error ε is independent of Z and ε ~ N(0, σ0²) for some σ0² > 0. Suppose we know a priori that the linear regression model holds. We thus estimate β0 by ordinary least squares, and denote the resulting estimator by β̂. For simplicity, suppose σ0² is known too. For this simple example, we have the following result regarding the approximation error term D.

Proposition 1. Suppose the linear regression model holds, and the pseudo-samples are drawn from the estimated distribution P̃_{X|Z} = N(Zβ̂, σ0² I_n), where I_n is the n × n identity matrix. Then D is not o(1).
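Proposition 1 can be checked numerically. For two Gaussians with common covariance σ0² I_n, the total variation distance has the closed form 2Φ(‖Zβ̂ − Zβ0‖ / (2σ0)) − 1, and under the linear model ‖Z(β̂ − β0)‖² / σ0² is exactly a χ²_p draw, so this distance stays bounded away from zero no matter how large n is. A sketch under those assumptions (all code is ours, for illustration only):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tv_distance(n, p, sigma0=1.0, seed=0):
    """TV distance between N(Z beta0, sigma0^2 I_n) and N(Z beta_hat,
    sigma0^2 I_n) in the linear model X = Z beta0 + eps of Example 1."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, p))
    beta0 = rng.standard_normal(p)
    X = Z @ beta0 + sigma0 * rng.standard_normal(n)
    beta_hat = np.linalg.lstsq(Z, X, rcond=None)[0]      # OLS estimate
    delta = np.linalg.norm(Z @ (beta_hat - beta0))
    # Closed form for equal-covariance Gaussians.
    return 2.0 * Phi(delta / (2.0 * sigma0)) - 1.0

# The distance does not shrink as n grows (p fixed at 10).
dists = [tv_distance(n, p=10, seed=n) for n in (100, 1000, 10000)]
```

The resulting distances hover around 2Φ(√p / 2) − 1 at every sample size, which is exactly the sense in which D is not o(1).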

