MAXIMAL CORRELATION-BASED POST-NONLINEAR LEARNING FOR BIVARIATE CAUSAL DISCOVERY

Abstract

Bivariate causal discovery aims to determine the causal relationship between two random variables from passive observational data (since interventions are unaffordable in many scientific fields), a problem regarded as both fundamental and challenging. Designing algorithms based on the post-nonlinear (PNL) model has attracted much attention owing to the model's generality. However, the state-of-the-art (SOTA) PNL-based algorithms involve highly non-convex objectives due to the use of neural networks and non-convex losses; optimizing such objectives is often time-consuming and fails to produce meaningful solutions with finite samples. In this paper, we propose a novel method that incorporates maximal correlation into PNL model learning (MC-PNL for short) such that the underlying nonlinearities can be accurately recovered. Owing to the benign structure of our objective function, when the nonlinearities are modeled with linear combinations of random Fourier features, the target optimization problem can be solved efficiently via block coordinate descent. We also compare MC-PNL with SOTA methods on downstream synthetic and real causal discovery tasks to show its superiority in runtime and accuracy.

1. INTRODUCTION AND RELATED WORKS

Causal discovery, which aims to find causal relationships among variables, is a long-standing yet still active topic in the machine learning community. Many recent applications have emerged in various scientific domains, such as climate science (Ebert-Uphoff & Deng, 2012; Runge et al., 2019) and bioinformatics (Choi et al., 2020; Foraita et al., 2020; Shen et al., 2020). The gold standard for causal discovery is to conduct randomized experiments (via interventions); however, interventions are often expensive, unethical, or impractical. It is therefore highly desirable to discover causal relationships purely from passive observational data. In the past three decades, many pioneering algorithms for directed acyclic graph (DAG) search have been developed for multivariate causal discovery to reduce the computational complexity and improve the accuracy. For example, there are constraint-/independence-based algorithms such as IC, PC, FCI (Pearl, 2009; Spirtes et al., 2000), and RFCI (Colombo et al., 2012), among many others, as well as score-based methods such as GES (Chickering, 2002), NOTEARS (Zheng et al., 2018), etc. However, the algorithms mentioned above can merely return a Markov equivalence class (MEC) that encodes the same set of conditional independencies, with many undetermined edge directions; moreover, the discovered DAG may not necessarily be causal. In this paper, we focus on a fundamental problem, namely bivariate causal discovery, which aims to determine the causal direction between two random variables X and Y. Bivariate causal discovery is one promising route for further identification of the underlying causal DAG (Peters et al., 2017). It is a challenging task that cannot be directly solved using the existing methodologies for the multivariate case, as the two candidate DAGs, X → Y and X ← Y, lie in the same MEC. More assumptions must be imposed to make bivariate causal discovery feasible, as summarized by Peters et al. (2017). One assumption is an a priori model class restriction, e.g., the linear non-Gaussian acyclic model (LiNGAM) (Shimizu et al., 2006), the nonlinear additive noise model (ANM) (Mooij et al., 2016), the post-nonlinear (PNL) model (Zhang & Hyvärinen, 2009), etc. The other assumption is the "independence of cause and mechanism", leading to algorithms based on the trace condition (Janzing et al., 2010), IGCI (Janzing et al., 2012), distance correlations (Liu & Chan, 2016), meta-transfer (Bengio et al., 2020), CDCI (Duong & Nguyen, 2022), etc. There are

also seminal works focusing on causal discovery in linear/nonlinear dynamic systems, which are beyond the scope of this paper; the corresponding representatives are the Granger causality test (Granger, 1969) and convergent cross mapping (Sugihara et al., 2012; Ye et al., 2015).

In this work, we focus on the PNL model, which is more general than LiNGAM and ANM. The existing works merely show identifiability results with infinite data samples (i.e., a known joint distribution), while practical issues with finite sample sizes are seldom discussed. We reveal the difficulties of the current PNL-based algorithms in the finite-sample regime, such as insufficient model fitting, slow training progress, and unsatisfactory independence-test performance, and propose novel and practical solutions accordingly.

The main contributions of this work are as follows.

1. We point out various practical training issues with the existing PNL model learning algorithms, in particular PNL-MLP and AbPNL, and propose a new algorithm called MC-PNL (specifically, a maximal correlation-based algorithm with independence regularization), which achieves a better recovery of the underlying nonlinear transformations.

2. We suggest using the randomized dependence coefficient (RDC) instead of the Hilbert-Schmidt independence criterion (HSIC) for the independence test and give a unified view of some widely used dependence measures.

3. We use MC-PNL for model learning in bivariate causal discovery and show that our method outperforms other SOTA independence-test-based methods on various benchmark datasets.
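Contribution 2 above refers to the randomized dependence coefficient (RDC) of Lopez-Paz et al. (2013). As a rough illustration of how such a measure works (copula transform, random sinusoidal features, then the largest canonical correlation), the following sketch is ours and not the paper's implementation; the parameters k, s, the ridge term, and all function names are illustrative assumptions:

```python
import numpy as np

def rdc(x, y, k=10, s=0.5, seed=0):
    """Sketch of the randomized dependence coefficient for 1-D samples.

    Steps: (1) empirical copula transform via ranks, (2) random
    sinusoidal feature maps, (3) largest canonical correlation
    between the two feature sets.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    # (1) empirical copula transform: ranks scaled into [0, 1)
    cx = np.argsort(np.argsort(x)) / n
    cy = np.argsort(np.argsort(y)) / n
    # (2) augment with a bias column, project randomly, apply sin
    X = np.column_stack([cx, np.ones(n)])
    Y = np.column_stack([cy, np.ones(n)])
    fx = np.sin(s * X @ rng.normal(size=(2, k)))
    fy = np.sin(s * Y @ rng.normal(size=(2, k)))
    # (3) largest canonical correlation between fx and fy
    C = np.cov(np.column_stack([fx, fy]).T)
    Cxx, Cyy, Cxy = C[:k, :k], C[k:, k:], C[:k, k:]
    eps = 1e-9 * np.eye(k)  # small ridge for numerical stability
    M = np.linalg.solve(Cxx + eps, Cxy) @ np.linalg.solve(Cyy + eps, Cxy.T)
    eigs = np.linalg.eigvals(M).real.clip(0.0, 1.0)
    return float(np.sqrt(np.max(eigs)))
```

Because of the copula (rank) transform, the coefficient is invariant to monotone transformations of either variable, which is one reason it pairs naturally with PNL-style nonlinearities.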

2. PRELIMINARIES

In this section, we introduce HSIC as a dependence measure, the current HSIC-based causal discovery methods for the PNL model, and other relevant learning methods based on the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation. Our proposed MC-PNL method exploits all these ingredients.

2.1. HSIC SCORE AND HSIC-BASED REGRESSION

Regression by dependence minimization (Mooij et al., 2009) has attracted much attention recently. Greenfeld & Shalit (2020) have shown its power for robust learning, in particular on the unsupervised covariate shift task. Let us consider the following regression model,

Y = f_θ(X) + N,  N ⊥⊥ X,

where the additive noise N is independent (symbolized by ⊥⊥) of the input variable X, and the selected regression model f_θ is learned by minimizing the dependence between the input variable X and the residual Y − f_θ(X). A widely used dependence measure is the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005; 2007).

Definition 1 (HSIC). Let X, Z ∼ P_XZ be jointly distributed random variables, and let F, G be the reproducing kernel Hilbert spaces with kernel functions k and l. The HSIC can be expressed as

HSIC(X, Z) = E[k(X, X̄) l(Z, Z̄)] + E[k(X, X̄)] E[l(Z, Z̄)] − 2 E_XZ[ E_X̄[k(X, X̄)] E_Z̄[l(Z, Z̄)] ],

where X̄ and Z̄ denote independent copies of X and Z, respectively.

Remark 2.1. We can conclude that: (a) X ⊥⊥ Z ⇒ HSIC(X, Z) = 0; (b) with a proper universal kernel (e.g., the Gaussian kernel), X ⊥⊥ Z ⇐ HSIC(X, Z) = 0 (Gretton et al., 2005).

When the joint distribution P_XZ is unknown, given a dataset with n samples {(x_i, z_i)}_{i=1}^n, a (biased) empirical estimate is

HSIC_n(X, Z) = (n − 1)^{−2} tr(KHLH),

where K_{i,j} = k(x_i, x_j), L_{i,j} = l(z_i, z_j), and H = I − (1/n) 1 1^T ∈ R^{n×n} is a centering matrix. The Gaussian kernel k(x_i, x_j) = exp(−(x_i − x_j)^2 σ^{−2}) is commonly used, and the same for l. One
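The empirical estimate tr(KHLH)/(n − 1)^2 is straightforward to compute. A minimal NumPy sketch (the bandwidths sigma_x, sigma_z and the function names are our own illustrative choices, not from the paper):

```python
import numpy as np

def gaussian_gram(x, sigma):
    """Gram matrix K_ij = exp(-(x_i - x_j)^2 / sigma^2) for 1-D samples."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / sigma ** 2)

def hsic_biased(x, z, sigma_x=1.0, sigma_z=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(x)
    K = gaussian_gram(x, sigma_x)
    L = gaussian_gram(z, sigma_z)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

In dependence-minimization regression, such an estimator would be evaluated on (x_i, y_i − f_θ(x_i)) and driven toward zero; the estimate is always nonnegative, and it is markedly larger for dependent pairs than for independent ones.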

