MAXIMAL CORRELATION-BASED POST-NONLINEAR LEARNING FOR BIVARIATE CAUSAL DISCOVERY

Abstract

Bivariate causal discovery aims to determine the causal relationship between two random variables from passive observational data (as intervention is not affordable in many scientific fields), a problem considered both fundamental and challenging. Algorithms based on the post-nonlinear (PNL) model have attracted much attention for their generality. However, the state-of-the-art (SOTA) PNL-based algorithms involve highly non-convex objectives due to the use of neural networks and non-convex losses; optimizing such objectives is often time-consuming and fails to produce meaningful solutions with finite samples. In this paper, we propose a novel method that incorporates maximal correlation into PNL model learning (MC-PNL for short) such that the underlying nonlinearities can be accurately recovered. Owing to the benign structure of our objective function, when the nonlinearities are modeled as linear combinations of random Fourier features, the target optimization problem can be solved efficiently via block coordinate descent. We also compare MC-PNL with SOTA methods on downstream synthetic and real causal discovery tasks to show its superiority in time and accuracy. Our code is available at https://anonymous.4open.



also seminal works focusing on causal discovery in linear/nonlinear dynamic systems, which are out of the scope of this paper; representative examples are the Granger causality test (Granger, 1969) and convergent cross mapping (Sugihara et al., 2012; Ye et al., 2015). In this work, we focus on the PNL model, which is more general than LiNGAM and ANM. Existing works merely show identifiability results with infinite data samples (i.e., a known joint distribution), while practical issues with finite sample size are seldom discussed. We reveal the difficulties with current PNL-based algorithms in the finite sample regime, such as insufficient model fitting, slow training progress, and unsatisfactory independence test performance, and correspondingly propose novel and practical solutions. The main contributions of this work are as follows.

1. We point out various practical training issues with the existing PNL model learning algorithms, in particular PNL-MLP and AbPNL, and propose a new algorithm called MC-PNL (specifically, the maximal correlation-based algorithm with independence regularization), which achieves a better recovery of the underlying nonlinear transformations.
2. We suggest using the randomized dependence coefficient (RDC) instead of the Hilbert-Schmidt independence criterion (HSIC) for the independence test, and give a universal view of some widely used dependence measures.
3. We use MC-PNL for model learning in bivariate causal discovery and show that our method outperforms other SOTA independence test-based methods on various benchmark datasets.

2. PRELIMINARIES

In this section, we introduce HSIC as a dependence measure, the current HSIC-based causal discovery methods for the PNL model, and other relevant learning methods based on the Hirschfeld-Gebelein-Rényi (HGR) correlation. Our proposed MC-PNL method exploits all these ingredients.

2.1. HSIC SCORE AND HSIC-BASED REGRESSION

Regression by dependence minimization (Mooij et al., 2009) has attracted much attention recently. Greenfeld & Shalit (2020) showed its power for robust learning, in particular for the unsupervised covariate shift task. Consider the following regression model,

$$Y = f(X) + \epsilon, \quad \epsilon \perp\!\!\!\perp X,$$

where the additive noise $\epsilon$ is independent of (symbolized by $\perp\!\!\!\perp$) the input variable $X$, and the selected regression model $f_\theta$ is learned by minimizing the dependence between the input variable $X$ and the residual $Y - f_\theta(X)$. A widely used dependence measure is the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005; 2007).

Definition 1 (HSIC). Let $X, Z \sim P_{XZ}$ be jointly distributed random variables, and let $\mathcal{F}, \mathcal{G}$ be reproducing kernel Hilbert spaces with kernel functions $k$ and $l$. The HSIC can be expressed as

$$\mathrm{HSIC}(X, Z; \mathcal{F}, \mathcal{G}) = \mathbb{E}_{XZ}\mathbb{E}_{X'Z'}\, k(x, x')\, l(z, z') + \mathbb{E}_{X}\mathbb{E}_{X'}\, k(x, x')\, \mathbb{E}_{Z}\mathbb{E}_{Z'}\, l(z, z') - 2\,\mathbb{E}_{X'Z'}\big[\mathbb{E}_{X}\, k(x, x')\, \mathbb{E}_{Z}\, l(z, z')\big],$$

where $x'$ and $z'$ denote independent copies of $x$ and $z$, respectively.

Remark 2.1. We can conclude that: (a) $X \perp\!\!\!\perp Z \Rightarrow \mathrm{HSIC}(X, Z) = 0$; (b) with a proper universal kernel (e.g., the Gaussian kernel), $X \perp\!\!\!\perp Z \Leftarrow \mathrm{HSIC}(X, Z) = 0$ (Gretton et al., 2005).

When the joint distribution $P_{XZ}$ is unknown, given a dataset with $n$ samples ($\mathbf{x} = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n$, $\mathbf{z} = [z_1, z_2, \ldots, z_n]^T \in \mathbb{R}^n$), a biased HSIC estimate can be constructed as

$$\widehat{\mathrm{HSIC}}(\mathbf{x}, \mathbf{z}; \mathcal{F}, \mathcal{G}) = \frac{1}{n^2}\,\mathrm{tr}(KHLH),$$

where $K_{i,j} = k(x_i, x_j)$, $L_{i,j} = l(z_i, z_j)$, and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \in \mathbb{R}^{n \times n}$ is a centering matrix. The Gaussian kernel $k(x_i, x_j) = \exp\!\big(-(x_i - x_j)^2 \sigma^{-2}\big)$ is commonly used, and likewise for $l$. One can intuitively interpret this empirical HSIC as the inner product of two centered kernel matrices, $\frac{1}{n^2}\langle HKH, HLH \rangle$, where the kernel matrices summarize the sample similarities. Mooij et al. (2009) first proposed to use the empirical HSIC defined above for model learning.
Concretely, the regression model is a linear combination of basis functions, $f_\theta(x) = \sum_{i=1}^{k} \theta_i \varphi_i(x)$, and the parameters are learned from

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^k} \widehat{\mathrm{HSIC}}\big(\mathbf{x}, \mathbf{y} - f_\theta(\mathbf{x})\big) + \frac{\lambda}{2}\|\theta\|_2^2,$$

where $f_\theta$ is applied elementwise to the data points, and $\lambda > 0$ is a penalty parameter (we will keep using $\lambda$ as a penalty parameter in different contexts). One key advantage of this formulation is that it requires no assumption on the noise distribution. Greenfeld & Shalit (2020) implemented $f_\theta$ using neural networks and showed the learnability of the HSIC loss theoretically.
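The biased estimate $\frac{1}{n^2}\mathrm{tr}(KHLH)$ is short enough to write out directly. The following is a minimal NumPy sketch, not the authors' implementation; the function names, the bandwidth `sigma=1.0`, and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(v, sigma=1.0):
    """Gram matrix with entries exp(-(v_i - v_j)^2 / sigma^2)."""
    d = v[:, None] - v[None, :]
    return np.exp(-(d ** 2) / sigma ** 2)

def hsic_biased(x, z, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / n^2 (sigma is an assumed default)."""
    n = x.shape[0]
    K = gaussian_kernel(x, sigma)
    L = gaussian_kernel(z, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

# Toy data (assumption): one independent pair and one dependent pair.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
z_indep = rng.normal(size=300)                 # independent of x
z_dep = x ** 2 + 0.1 * rng.normal(size=300)    # nonlinear function of x

hsic_indep = hsic_biased(x, z_indep)
hsic_dep = hsic_biased(x, z_dep)
```

On such data the dependent pair should give the larger score, although the absolute values depend on the kernel bandwidth and on the scale of the variables.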

2.2. CAUSAL DISCOVERY WITH POST-NONLINEAR MODEL

The bivariate post-nonlinear model is expressed as

$$Y = f_2(f_1(X) + \epsilon),$$

where $f_1$ denotes the nonlinear effect of the cause, $\epsilon$ is the independent noise, and $f_2$ denotes the invertible post-nonlinear distortion from the sensor or measurement side. The goal is to find the causal direction $X \to Y$ from a set of passive observations of $X$ and $Y$. Note that, from the data generating process, $\epsilon$ is independent of $X$ but not of $Y$. Taking this asymmetry as a prior, one can test the causal direction by first learning the underlying transformations, $f_2^{-1}$ and $f_1$, and then checking the independence between the residual $r^{(\to)} = f_2^{-1}(Y) - f_1(X)$ and the input $X$.

The PNL-MLP algorithm proposed by Zhang & Hyvärinen (2009) tests between two hypotheses ($X \to Y$ and $X \leftarrow Y$) as follows. Under the hypothesis $X \to Y$, one can parameterize $f_1$ and $f_2^{-1}$ by two multi-layer perceptrons (MLPs) $f^{(\to)}$ and $g^{(\to)}$, and learn them by minimizing the mutual information (MI):

$$\hat{f}^{(\to)}, \hat{g}^{(\to)} \in \arg\min_{f^{(\to)},\, g^{(\to)}} \mathrm{MI}\big(r^{(\to)} := g^{(\to)}(\mathbf{y}) - f^{(\to)}(\mathbf{x});\ \mathbf{x}\big), \tag{5}$$

where $g^{(\to)}, f^{(\to)}$ are applied elementwise. The estimated residual is $\hat{r}^{(\to)} = \hat{g}^{(\to)}(\mathbf{y}) - \hat{f}^{(\to)}(\mathbf{x})$. Similarly, under the hypothesis $X \leftarrow Y$, one can obtain an estimate $\hat{r}^{(\leftarrow)} = \hat{g}^{(\leftarrow)}(\mathbf{x}) - \hat{f}^{(\leftarrow)}(\mathbf{y})$ by minimizing $\mathrm{MI}(r^{(\leftarrow)}; \mathbf{y})$. The causal direction is determined by comparing $\widehat{\mathrm{HSIC}}(\hat{r}^{(\to)}, \mathbf{x})$ and $\widehat{\mathrm{HSIC}}(\hat{r}^{(\leftarrow)}, \mathbf{y})$. If $\widehat{\mathrm{HSIC}}(\hat{r}^{(\to)}, \mathbf{x}) < \widehat{\mathrm{HSIC}}(\hat{r}^{(\leftarrow)}, \mathbf{y})$, the hypothesis $X \to Y$ is endorsed; otherwise, the hypothesis $X \leftarrow Y$ is endorsed. However, the MI between random variables is often difficult to calculate (see supplement A), and tuning the MLPs requires many tricks, as mentioned in Zhang & Hyvärinen (2009), which together make it very difficult to handle large-scale datasets with many variable pairs.
Uemura & Shimizu (2020) proposed the AbPNL method, which uses HSIC instead of MI and imposes the invertibility of $f_2$ via an auto-encoder to eliminate nonsensical solutions,

$$\min_{f,\, g,\, g'} \widehat{\mathrm{HSIC}}\big(\mathbf{x},\ r := g(\mathbf{y}) - f(\mathbf{x})\big) + \lambda \big\|\mathbf{y} - g'(g(\mathbf{y}))\big\|_2^2,$$

where $g, g'$ are the encoder and decoder MLPs. The superscript $(\to)$ is omitted for conciseness here. We summarize the architectures of the two methods above in Figure 1. Nevertheless, inherent issues exist concerning the cost function and the neural network training procedure when dealing with finite sample datasets; see Section 3.1.

2.3. PNL LEARNING THROUGH MAXIMAL CORRELATION

Another route to learning the nonlinear transformations $f$ and $g$ is through the HGR maximal correlation (Hirschfeld, 1935; Gebelein, 1941; Rényi, 1959).

Definition 2 (HGR maximal correlation). Let $X, Y$ be jointly distributed random variables. Then,

$$\rho^* = \mathrm{HGR}(X; Y) := \sup_{\substack{f: \mathcal{X} \to \mathbb{R},\ g: \mathcal{Y} \to \mathbb{R} \\ \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0 \\ \mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1}} \mathbb{E}[f(X) g(Y)], \tag{7}$$

is the HGR maximal correlation between $X$ and $Y$, and $f, g$ are the associated maximal correlation functions.

Remark 2.2. The HGR maximal correlation $\rho^*$ is attractive as a measure of dependence due to some useful properties: (1) $\rho^*$ is bounded: $0 \le \rho^* \le 1$; (2) $X$ and $Y$ are independent if and only if $\rho^* = 0$; (3) there exist $f$ and $g$ such that $f(X) = g(Y)$ with probability 1 if and only if $\rho^* = 1$.

The optimal unit-variance feature transformations, $f^*$ and $g^*$, can be found by iteratively updating $f$ and $g$ in (7). However, for causal discovery applications, one fatal issue is that the learned $f^*$ and $g^*$ are constrained to have unit variance and thus cannot reflect the true magnitudes of the underlying functions $f$ and $g$. As a consequence, the resulting residual can be incorrect for the independence tests in the next stage. We found two possible remedies in the literature, namely the alternating conditional expectation (ACE) algorithm (Breiman & Friedman, 1985) and a soft version of (7) (Soft-HGR) (Wang et al., 2019). The ACE algorithm solves the regression problem (8) by computing the conditional means alternately,

$$\min_{f, g}\ \mathbb{E}\big(f(X) - g(Y)\big)^2, \quad \text{s.t. } \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0,\ \mathbb{E}[g^2(Y)] = 1, \tag{8}$$

which only retains the unit-variance constraint on $g$. The equivalence to (7) has been established, and the optimal regression transformation $(f^{**}, g^{**})$ equals $(\rho^* f^*, g^*)$; see Theorem 5.1 in Breiman & Friedman (1985). The other formulation, Soft-HGR, relaxes the unit-variance constraints as follows,

$$\max_{f, g}\ \mathbb{E}[f(X) g(Y)] - \frac{1}{2}\mathrm{var}(f(X))\,\mathrm{var}(g(Y)), \quad \text{s.t. } \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0. \tag{9}$$
This relaxation admits the rescaled solutions $(a f^*, a^{-1} g^*)$ for any $a \in \mathbb{R} \setminus \{0\}$, which produces infinitely many equivalent local optima. This scale ambiguity results in a large number of solutions that are useless for causal discovery; the desired one should make the estimated residual independent of the input. We will show how our proposed method eliminates those undesired solutions in Section 4.

Connections to VICReg. We notice that the recently proposed Variance-Invariance-Covariance Regularization (VICReg) (Bardes et al., 2022) shares similar intuitions with the HGR maximal correlation. When the dimension of the representation vectors (i.e., $f$ and $g$) reduces to one, the covariance term disappears, and the VICReg objective becomes

$$\min_{f, g}\ \underbrace{\mathbb{E}\big(f(X) - g(Y)\big)^2}_{\text{invariance term}} + \lambda \underbrace{\big[\max(0, \gamma - \mathrm{var}(f(X))) + \max(0, \gamma - \mathrm{var}(g(Y)))\big]}_{\text{variance term}},$$

where $\lambda, \gamma > 0$ are hyper-parameters that need to be tuned. The invariance term encourages the alignment of the learned features, and the variance term encourages a $\gamma$-bounded variation to avoid trivial solutions such as $f(X) = g(Y) = \text{constant}$. To see the connection, we rewrite Soft-HGR (9) as

$$\min_{f, g}\ \underbrace{\mathbb{E}\big[f(X) - g(Y)\big]^2}_{\text{invariance term}} + \underbrace{\mathrm{var}(f(X))\,\mathrm{var}(g(Y)) - \mathrm{var}(f(X)) - \mathrm{var}(g(Y))}_{\text{variance term}}, \quad \text{s.t. } \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0,$$

in which the variances of $f$ and $g$ are also encouraged to be large but are not allowed to grow simultaneously.
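The rewriting of Soft-HGR (9) into an invariance term plus a variance term uses only the zero-mean constraints. A short worked derivation, spelled out here for convenience, also makes the scale ambiguity explicit:

```latex
\begin{aligned}
% Expand the squared difference; cross terms involving the means vanish.
\mathbb{E}\big[(f(X) - g(Y))^2\big]
  &= \mathrm{var}(f(X)) + \mathrm{var}(g(Y)) - 2\,\mathbb{E}[f(X) g(Y)]
  && (\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0), \\
% Add the variance term of the rewritten objective.
\mathbb{E}\big[(f(X) - g(Y))^2\big] + \mathrm{var}(f(X))\,\mathrm{var}(g(Y))
  - \mathrm{var}(f(X)) - \mathrm{var}(g(Y))
  &= -2\Big(\mathbb{E}[f(X) g(Y)] - \tfrac{1}{2}\,\mathrm{var}(f(X))\,\mathrm{var}(g(Y))\Big), \\
% Scale ambiguity: the Soft-HGR objective is unchanged under (f, g) -> (a f, g / a).
\mathbb{E}\big[a f(X)\, a^{-1} g(Y)\big] - \tfrac{1}{2}\,\mathrm{var}(a f(X))\,\mathrm{var}(a^{-1} g(Y))
  &= \mathbb{E}[f(X) g(Y)] - \tfrac{1}{2}\,\mathrm{var}(f(X))\,\mathrm{var}(g(Y)),
  && a \neq 0 .
\end{aligned}
```

The second line shows that minimizing the rewritten objective is the same as maximizing (9), and the third line shows why every rescaled pair $(a f^*, a^{-1} g^*)$ attains the same objective value.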

3. PRACTICAL ISSUES WITH EXISTING ALGORITHMS

In this section, we summarize several practical issues of the existing algorithms for PNL learning, including among others PNL-MLP (Zhang & Hyvärinen, 2010) and AbPNL (Uemura & Shimizu, 2020) . These issues motivate our novel MC-PNL method to be introduced in Section 4, see the comparisons in terms of their architectures in Figure 1 .

3.1. ISSUES ON MODEL LEARNING

Over-fitting issue. The general idea of PNL model learning, according to Section 2.2, is to encourage independence between the input and the estimated residual. Both PNL-MLP and AbPNL use neural networks to parameterize $f$ and $g$, but it is doubtful that meaningful representations can really be learned with finite samples. Let us review the dependence minimization problem,

$$\min_{f, g}\ \widehat{\mathrm{HSIC}}(\mathbf{x}, \mathbf{r}) = \frac{1}{n^2}\,\mathrm{tr}(K_{xx} H L_{rr} H), \tag{12}$$

where $\mathbf{r} = f(\mathbf{x}) - g(\mathbf{y})$. We argue that it is extremely difficult to learn meaningful representations of $f$ and $g$ by minimizing the HSIC score alone, because $f$ and $g$ have enough degrees of freedom to fit arbitrary random noise. We conducted experiments using both wide over-parameterized and narrow deep neural networks with sufficient representation power. In our simulation results (see supplement B), for both network architectures, the objective values can reach zero but the resulting estimates are meaningless. This is unsurprising, as one can force $\mathbf{r}$ to be samples from arbitrary independent random noise (Yun et al., 2019; Zhang et al., 2021). To address this, we propose to combine dependence minimization with maximal correlation, which helps to obtain the desired solutions; see Figure 1 (c) for an illustration and Section 4 for details.

Optimization issue. The optimization of neural networks is a long-standing problem, and there is not yet any study of the optimization landscape of the HSIC loss with neural networks. Typically, first-order methods such as stochastic gradient descent are used in the existing causal discovery methods, and initialization is crucial to the causal discovery accuracy; see supplement B. In this paper, we suggest parameterizing both $f$ and $g$ as linear combinations of random Fourier features and using a linear kernel for HSIC, which admits a benign landscape with symmetry (see Chapter 7 in Wright & Ma (2022)) for the non-convex optimization.

3.2. ISSUES ON INDEPENDENCE TEST

As the independence test is critical to the accuracy of causal discovery, the dependence measure has to be chosen carefully. Although HSIC is widely used, it has several drawbacks (e.g., the choice of kernel and the corresponding hyper-parameters are user-defined, and the values of HSIC depend on the scale of the random variables). In this section, we show experimentally that the HSIC score is not the best choice, and we favor the randomized dependence coefficient (RDC) (Lopez-Paz et al., 2013), particularly for finite samples. We generated various synthetic datasets following the PNL model (see supplement C), in which we know in advance that the injected noise satisfies $\epsilon \perp\!\!\!\perp X$ and $\epsilon \not\perp\!\!\!\perp Y$. Thus, we are able to compare various dependence measures by checking whether $\mathrm{Dep}(\mathbf{x}, \boldsymbol{\epsilon}) < \mathrm{Dep}(\mathbf{y}, \boldsymbol{\epsilon})$ on various datasets. The compared dependence measures are HSIC (Gretton et al., 2005), its normalized variant (NOCCO) (Fukumizu et al., 2007), and RDC (Lopez-Paz et al., 2013). We also study the impact of different kernel choices: linear, Gaussian radial basis function (RBF), and rational quadratic (RQ) kernels. We note that RDC is a computationally tractable estimator inspired by the HGR maximal correlation. The results show that RDC outperforms the other dependence measures, especially when the sample size is small; see Table 1. Thus, we advocate using RDC to measure dependence with finite samples. Finally, we give a universal view of the aforementioned dependence measures in supplement D.
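For readers unfamiliar with RDC, the following NumPy sketch illustrates the idea: apply a copula (rank) transform, project onto random sine features, and take the largest canonical correlation. It is a simplified illustration rather than the reference implementation of Lopez-Paz et al. (2013); the feature count `k=10`, the scale `s=1.0`, the SVD-based canonical correlation step, and the tolerance `tol` are assumptions made here.

```python
import numpy as np

def _copula(v):
    """Empirical copula transform: ranks scaled to (0, 1]."""
    ranks = np.argsort(np.argsort(v)) + 1
    return ranks / v.shape[0]

def _sine_features(u, k, s, rng):
    """k random sine features sin(w * u + b), with w, b ~ N(0, s^2); shape (n, k)."""
    w = rng.normal(scale=s, size=k)
    b = rng.normal(scale=s, size=k)
    return np.sin(np.outer(u, w) + b)

def _orthonormal_basis(F, tol=1e-8):
    """Orthonormal basis of the column space of the centered feature matrix."""
    F = F - F.mean(axis=0)
    U, sv, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, sv > tol * sv.max()]  # drop numerically degenerate directions

def rdc(x, y, k=10, s=1.0, seed=0):
    """Largest canonical correlation between random sine features of x and y."""
    rng = np.random.default_rng(seed)
    Ux = _orthonormal_basis(_sine_features(_copula(x), k, s, rng))
    Uy = _orthonormal_basis(_sine_features(_copula(y), k, s, rng))
    # Canonical correlations are the singular values of Ux^T Uy.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(rho.max(), 0.0, 1.0))

# Toy check (assumption): an independent pair versus a noisy linear pair.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
rdc_indep = rdc(x, rng.normal(size=2000))
rdc_dep = rdc(x, x + 0.1 * rng.normal(size=2000))
```

Because the score is a canonical correlation, it lies in $[0, 1]$ and is invariant to monotone rescaling of either variable, which is one reason it is less sensitive to variable scale than HSIC.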

4. PROPOSED METHOD

In this section, we propose a maximal correlation-based post-nonlinear model learning framework, called MC-PNL, to accurately estimate the nonlinear functions and compute the corresponding residuals. Afterwards, independence tests are conducted to determine the causal direction.

4.1. MAXIMAL CORRELATION-BASED PNL MODEL LEARNING

As we saw in the previous sections, minimizing HSIC (12) requires no assumption on the noise distribution and encourages the independence of the residual, but it can easily get stuck at meaningless local minima. Methods based on the HGR maximal correlation can learn meaningful transformations, as the name suggests, but do not necessarily produce an independent residual. To combine their strengths, we propose the following MC-PNL objective,

$$\min_{f, g}\ -\mathbb{E}[f(X) g(Y)] + \frac{1}{2}\mathrm{var}(f(X))\,\mathrm{var}(g(Y)) + \lambda\,\mathrm{Dep}\big(X, f(X) - g(Y)\big), \quad \text{s.t. } \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0, \tag{13}$$

where $\mathrm{Dep}(\cdot, \cdot) \ge 0$ is a dependence measure (e.g., HSIC with different kernel functions), and $\lambda > 0$ is a hyper-parameter that penalizes the dependence between the input variable $X$ and the estimated residual $f(X) - g(Y)$. This novel objective can learn meaningful feature transformations through the Soft-HGR term and resolve the scale ambiguity through the dependence minimization term. The objective (13) is consistent with the MI minimization principle under the assumptions of invertible PNL generating functions and Gaussian noise; see details in supplement E.

Parameterization with Random Features

For ease of optimization, we parameterize the transformation functions as linear combinations of random features, namely $f(x; \alpha) := \alpha^T \phi(x)$ and $g(y; \beta) := \beta^T \psi(y)$, where the random features $\phi(x) \in \mathbb{R}^{k_1}$, $\psi(y) \in \mathbb{R}^{k_2}$ are nonlinear projections as described in López-Paz et al. (2013); see supplement F. For a given dataset $\{(x_i, y_i)\}_{i=1}^n$, the corresponding feature matrices are denoted as $\Phi := [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)] \in \mathbb{R}^{k_1 \times n}$ and $\Psi := [\psi(y_1), \psi(y_2), \ldots, \psi(y_n)] \in \mathbb{R}^{k_2 \times n}$. We further denote the residual vector as $\mathbf{r} := \Phi^T \alpha - \Psi^T \beta$. Consequently, (13) can be written as the following non-convex programming problem,

$$\min_{\alpha, \beta}\ J(\alpha, \beta) := -\frac{1}{n}\alpha^T \Phi \Psi^T \beta + \frac{1}{2n^2}\alpha^T \Phi \Phi^T \alpha\, \beta^T \Psi \Psi^T \beta + \lambda\,\mathrm{Dep}(\mathbf{x}, \mathbf{r}), \quad \text{s.t. } \alpha^T \Phi \mathbf{1} = \beta^T \Psi \mathbf{1} = 0, \tag{14}$$

where $\mathbf{1}$ is an all-ones vector, and the dependence measure $\mathrm{Dep}(\mathbf{x}, \mathbf{r})$ can be set to the HSIC score with a linear kernel, namely,

$$\begin{aligned}
\widehat{\mathrm{HSIC}}_{\mathrm{lin}}(\mathbf{x}, \mathbf{r}) &= \frac{1}{n^2}\mathrm{tr}(K_{xx} H L^{\mathrm{lin}}_{rr} H) = \frac{1}{n^2}\mathrm{tr}(K_{xx} H \mathbf{r}\mathbf{r}^T H) \\
&= \frac{1}{n^2}\mathrm{tr}\big(K_{xx} H (\Phi^T \alpha - \Psi^T \beta)(\Phi^T \alpha - \Psi^T \beta)^T H\big) \\
&= \frac{1}{n^2}\big(\alpha^T \Phi H K_{xx} H \Phi^T \alpha + \beta^T \Psi H K_{xx} H \Psi^T \beta - 2\alpha^T \Phi H K_{xx} H \Psi^T \beta\big).
\end{aligned} \tag{15}$$

Remark: We adopt the HSIC with linear kernel $L^{\mathrm{lin}}_{rr}$ mainly for its favorable optimization structure, as the resulting HSIC score is a quadratic form in both $\alpha$ and $\beta$. Note that the HSIC penalty term is always non-negative, but the Soft-HGR objective can be negative.

The above problem can be solved via a simple block coordinate descent (BCD) algorithm that updates $\alpha$ and $\beta$ iteratively; see Algorithm 1. Essentially, (14) is multi-convex (Xu & Yin, 2013), and in each update (line 3 or 4 in Algorithm 1) the sub-problem is a linearly constrained quadratic program. When the sub-problem is strictly convex, one can obtain the unique minimum in closed form in each update, which admits a convergence guarantee to a critical point (Grippo & Sciandrone, 2000). More details on the sub-problem optimization and the landscape study can be found in supplement G.
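The step from the trace form to the quadratic form in the linear-kernel HSIC can be checked numerically. The sketch below builds small random sine feature matrices and compares the two expressions; the sample size, feature counts, kernel bandwidth, and toy data are illustrative assumptions, and the helper `sine_features` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2, s = 200, 5, 5, 2.0  # illustrative sizes (assumption)

# Toy PNL-style data (assumption).
x = rng.uniform(0.0, 1.0, size=n)
y = np.exp(np.sin(7.0 * x) + rng.normal(scale=0.3, size=n))

def sine_features(v, k, s, rng):
    """Random sine feature matrix of shape (k, n)."""
    w = rng.normal(scale=s, size=k)
    b = rng.normal(scale=s, size=k)
    return np.sin(np.outer(w, v) + b[:, None])

Phi = sine_features(x, k1, s, rng)
Psi = sine_features(y, k2, s, rng)
alpha = rng.normal(size=k1)
beta = rng.normal(size=k2)

H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
Kxx = np.exp(-(x[:, None] - x[None, :]) ** 2)   # Gaussian kernel on x
G = H @ Kxx @ H

# Trace form: tr(K_xx H r r^T H) / n^2 with r = Phi^T alpha - Psi^T beta.
r = Phi.T @ alpha - Psi.T @ beta
hsic_trace = np.trace(Kxx @ H @ np.outer(r, r) @ H) / n ** 2

# Quadratic form in alpha and beta.
hsic_quad = (alpha @ Phi @ G @ Phi.T @ alpha
             + beta @ Psi @ G @ Psi.T @ beta
             - 2.0 * alpha @ Phi @ G @ Psi.T @ beta) / n ** 2
```

Both expressions equal $\mathbf{r}^T H K_{xx} H \mathbf{r} / n^2$, which is non-negative because $H K_{xx} H$ is positive semi-definite.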
Algorithm 1 BCD for problem (14)
1: Initialize $\alpha^{(0)}$ and $\beta^{(0)}$ // Use random initialization
2: for $t = 1 : T$ do
3:   Update $\alpha^{(t)} \leftarrow \arg\min_{\alpha} J(\alpha, \beta^{(t-1)})$, s.t. $\alpha^T \Phi \mathbf{1} = 0$.
4:   Update $\beta^{(t)} \leftarrow \arg\min_{\beta} J(\alpha^{(t)}, \beta)$, s.t. $\beta^T \Psi \mathbf{1} = 0$.
5:   if stopping criterion is met then
6:     return $\alpha^{(t)}, \beta^{(t)}$
7:   end if
8: end for

Remark: We can also impose the invertibility of $g$ by constraining the derivative $\frac{d}{dy} g(y)$ to be positive (or negative) in line 4, i.e., $\Psi'^T \beta > 0$, where $\Psi' = [\frac{d}{dy}\psi(y_1), \frac{d}{dy}\psi(y_2), \ldots, \frac{d}{dy}\psi(y_n)] \in \mathbb{R}^{k_2 \times n}$.

Fine-tune: Algorithm 1 may produce solutions with distortions (see Figure 2), probably due to the use of the linear kernel. To cope with this, one can enlarge the dependence penalty $\lambda$ and use HSIC with universal kernels or other dependence measures. In addition, we propose a banded loss to reinforce a banded residual plot; see supplement H.
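To make the alternating updates concrete, here is a minimal NumPy sketch of the BCD loop with the linear-kernel HSIC penalty. It is not the authors' code and simplifies the problem in three labeled ways: the features are centered beforehand so that the zero-mean constraints hold automatically (instead of solving a constrained quadratic program), a small `ridge` term is added to the objective so that each sub-problem is strictly convex, and the loop runs for a fixed number of iterations `T` rather than using a stopping criterion. All function names and default values are illustrative.

```python
import numpy as np

def sine_features(v, k, s, rng):
    """Centered random sine features, shape (k, n); each row sums to zero."""
    w = rng.normal(scale=s, size=k)
    b = rng.normal(scale=s, size=k)
    F = np.sin(np.outer(w, v) + b[:, None])
    return F - F.mean(axis=1, keepdims=True)

def objective(alpha, beta, Phi, Psi, G, lam, ridge):
    """Soft-HGR term + lam * linear-kernel HSIC + ridge (ridge is an assumption)."""
    n = Phi.shape[1]
    f, g = Phi.T @ alpha, Psi.T @ beta
    r = f - g
    soft_hgr = -(f @ g) / n + (f @ f) * (g @ g) / (2.0 * n ** 2)
    hsic_lin = (r @ G @ r) / n ** 2
    return soft_hgr + lam * hsic_lin + 0.5 * ridge * (alpha @ alpha + beta @ beta)

def block_update(A, B, b_coef, G, lam, ridge):
    """Closed-form minimizer over the coefficients of A with the other block fixed."""
    n = A.shape[1]
    c = b_coef @ B @ B.T @ b_coef  # n * (empirical variance of the fixed block)
    M = (c * A @ A.T + 2.0 * lam * A @ G @ A.T) / n ** 2 + ridge * np.eye(A.shape[0])
    rhs = A @ B.T @ b_coef / n + 2.0 * lam * A @ G @ B.T @ b_coef / n ** 2
    return np.linalg.solve(M, rhs)

def bcd(x, y, k=30, s=2.0, lam=5.0, ridge=1e-6, T=50, seed=0):
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    Phi, Psi = sine_features(x, k, s, rng), sine_features(y, k, s, rng)
    H = np.eye(n) - np.ones((n, n)) / n
    G = H @ np.exp(-(x[:, None] - x[None, :]) ** 2) @ H  # H K_xx H
    alpha, beta = rng.normal(size=k), rng.normal(size=k)  # random initialization
    history = [objective(alpha, beta, Phi, Psi, G, lam, ridge)]
    for _ in range(T):
        alpha = block_update(Phi, Psi, beta, G, lam, ridge)   # line 3
        beta = block_update(Psi, Phi, alpha, G, lam, ridge)   # line 4
        history.append(objective(alpha, beta, Phi, Psi, G, lam, ridge))
    return alpha, beta, history

# Toy usage on standardized PNL-style data (assumption).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200)
y = np.exp(np.sin(7.0 * x) + rng.normal(scale=0.3, size=200))
x_std, y_std = (x - x.mean()) / x.std(), (y - y.mean()) / y.std()
alpha, beta, history = bcd(x_std, y_std)
```

Since each block update is an exact minimizer of the ridge-regularized objective in that block, the recorded objective values should be non-increasing, which gives a simple sanity check on an implementation.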

4.2. DISTINGUISH CAUSE FROM EFFECT VIA INDEPENDENCE TEST

Following the framework proposed by Zhang & Hyvärinen (2009), we distinguish cause from effect according to Algorithm 2. We first fit nonlinear models $f^{(\to)}, g^{(\to)}$ under hypothesis $X \to Y$, and $f^{(\leftarrow)}, g^{(\leftarrow)}$ under hypothesis $X \leftarrow Y$. After the learning iterations, we conduct independence tests. If $\mathrm{Dep}(\hat{r}^{(\to)}, \mathbf{x}) < \mathrm{Dep}(\hat{r}^{(\leftarrow)}, \mathbf{y})$, the hypothesis $X \to Y$ is supported; otherwise, the hypothesis $X \leftarrow Y$ is supported. We use the RDC for the independence test, as introduced in Section 3.2.

Algorithm 2 The MC-PNL method for causal direction prediction.
Input: The standardized data $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$.
Output: The causal score $C_{X \to Y}$ and direction.
1. Fit PNL models via Algorithm 1 and estimate residuals under the hypotheses $X \to Y$ and $X \leftarrow Y$.
   • Under hypothesis $X \to Y$: $\hat{r}^{(\to)} = \hat{g}^{(\to)}(\mathbf{y}) - \hat{f}^{(\to)}(\mathbf{x})$.
   • Under hypothesis $X \leftarrow Y$: $\hat{r}^{(\leftarrow)} = \hat{g}^{(\leftarrow)}(\mathbf{x}) - \hat{f}^{(\leftarrow)}(\mathbf{y})$.
2. Calculate the causal score $C_{X \to Y} := \mathrm{Dep}(\hat{r}^{(\leftarrow)}, \mathbf{y}) - \mathrm{Dep}(\hat{r}^{(\to)}, \mathbf{x})$.
3. Output the causal score $C_{X \to Y}$ and direction $:= X \to Y$ if $C_{X \to Y} > 0$; $X \leftarrow Y$ if $C_{X \to Y} < 0$.

Towards trustworthy decisions, Liu & Chan (2016) proposed to make no decision when $|C_{X \to Y}|$ is less than a threshold $\delta > 0$. In addition, the bootstrap (Efron, 1992; Zoubir & Boashash, 1998) can be used for uncertainty quantification; see supplement I.
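Steps 2 and 3 of Algorithm 2, together with the thresholded variant, amount to a few lines of code. The following sketch assumes the two dependence values have already been computed (e.g., by RDC on the fitted residuals); the function name, the string labels, and the `"undecided"` outcome are illustrative choices, not part of the source.

```python
def causal_decision(dep_forward, dep_backward, delta=0.0):
    """Return (score, direction) from the two residual dependence values.

    dep_forward  : Dep(r_forward, x), dependence under hypothesis X -> Y
    dep_backward : Dep(r_backward, y), dependence under hypothesis X <- Y
    delta        : abstain when |score| <= delta (assumed handling of ties)
    """
    score = dep_backward - dep_forward  # causal score C_{X -> Y}
    if score > delta:
        return score, "X->Y"
    if score < -delta:
        return score, "X<-Y"
    return score, "undecided"

# Hypothetical dependence values for illustration only.
score_a, direction_a = causal_decision(0.10, 0.40)
score_b, direction_b = causal_decision(0.40, 0.10)
score_c, direction_c = causal_decision(0.20, 0.25, delta=0.1)
```

A positive score means the forward residual looks more independent of its input than the backward one, which is the asymmetry the PNL model predicts for the true direction.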

5. EXPERIMENTS

In the following, we show the performance of MC-PNL in model learning and its application to bivariate causal discovery.

5.1. NONLINEAR FUNCTION FITTING

For better demonstration, we generated two synthetic datasets from the PNL model, $Y = f_2(f_1(X) + \epsilon)$, each containing 1000 samples. The data generation mechanisms are as follows:
• Syn-1: $f_1(X) = X^{-1} + 10X$, $f_2(Z) = Z^3$, $X \sim U(0.1, 1.1)$, $\epsilon \sim U(0, 5)$,
• Syn-2: $f_1(X) = \sin(7X)$, $f_2(Z) = \exp(Z)$, $X \sim U(0, 1)$, $\epsilon \sim N(0, 0.3^2)$.

We apply Algorithm 1 to both datasets and show the learned nonlinear transformations as well as the corresponding residual plots in Figure 2. The underlying nonlinear functions are correctly learned under the true hypothesis, but with certain distortions. We also show that such distortions can be removed by fine-tuning with our proposed banded loss or the HSIC-RBF loss; see supplement H.

Convergence Results. We demonstrate the convergence profile of our algorithm on Syn-2; see Figure 3. Results for Syn-1 can be found in supplement J. The top row shows snapshots of the learned representations, where we do not impose independence regularization ($\lambda = 0$). The algorithm, starting from different random initializations, converges quickly to local minimizers sharing the same objective value. The bottom row is with independence regularization $\lambda = 5$, where the solutions have a sign symmetry.
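The two generating mechanisms can be reproduced with a few lines of NumPy. The sketch below reads the first term of Syn-1 as $X^{-1}$ and interprets $N(0, 0.3^2)$ as a standard deviation of 0.3; the function names and seeds are illustrative.

```python
import numpy as np

def make_syn1(n=1000, seed=0):
    """Syn-1: Y = (1/X + 10 X + eps)^3, X ~ U(0.1, 1.1), eps ~ U(0, 5)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.1, 1.1, size=n)
    eps = rng.uniform(0.0, 5.0, size=n)
    y = (1.0 / x + 10.0 * x + eps) ** 3
    return x, y

def make_syn2(n=1000, seed=0):
    """Syn-2: Y = exp(sin(7 X) + eps), X ~ U(0, 1), eps ~ N(0, 0.3^2)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    eps = rng.normal(0.0, 0.3, size=n)  # 0.3 is the standard deviation
    y = np.exp(np.sin(7.0 * x) + eps)
    return x, y

x1, y1 = make_syn1()
x2, y2 = make_syn2()
```

In Syn-1 the inner function $1/X + 10X$ is non-monotonic on $(0.1, 1.1)$, while in Syn-2 the outer exponential makes the noise multiplicative in $Y$, so neither dataset is well described by a plain additive noise model.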

5.2. BIVARIATE CAUSAL DISCOVERY

We evaluated the causal discovery accuracy on both synthetic and real datasets.

Synthetic Datasets: The generated synthetic datasets all follow the PNL model. We considered the following two settings: 1) PNL-A: $f_1$ are general nonlinear functions generated by polynomials with random coefficients, and $f_2$ are monotonic nonlinear functions generated by unconstrained monotonic neural networks (UMNN) (Wehenkel & Louppe, 2019); 2) PNL-B: both $f_1$ and $f_2$ are monotonic functions generated by UMNN. The variances of $f_1, f_2$ are rescaled to 1. The input variable $X$ is sampled either from a Gaussian mixture (mixG) or a uniform (unif) distribution, and the injected noise is generated from normal distributions $N(0, ns^2)$, where $ns \in \{0.2, 0.4, 0.6, 0.8, 1\}$. Each configuration contains 100 data pairs, and each data pair has 1000 samples.

Gene Data: Discovering gene-gene causal relationships is one important application. We used the data from the DREAM4 competition (D4-S1, D4-S2A, D4-S2B, D4-S2C) (Marbach et al., 2009; 2010) and the scRNA-seq data (GSE57872) (Han et al., 2017); see supplement K.

Baselines & Evaluation: Thanks to the implementation by Kalainathan et al. (2020), we can easily compare our proposed method with various existing algorithms. We compared our proposed algorithm on both synthetic and real datasets with several baseline algorithms, including ANM (Hoyer et al., 2008), CDS (Fonollosa, 2019), IGCI (Janzing et al., 2012), RECI (Blöbaum et al., 2018), CDCI (Duong & Nguyen, 2022), OT-PNL (Tu et al., 2022), and AbPNL (Uemura & Shimizu, 2020). Our implementation of MC-PNL follows Algorithm 1 (without fine-tuning), and we empirically set $\lambda = 5$ (the choice of $\lambda$ is briefly discussed in supplement L). We also conducted causal discovery on the PNL model learned by the ACE algorithm. The ROC-AUC score is used for the evaluation.

(5) is equivalent to maximizing $\mathbb{E} \log p(r^{(\to)}) + \mathbb{E} \log \left|\frac{d}{dy} g^{(\to)}(y)\right|$ (Zhang & Hyvärinen, 2009), where $p$ is the assumed noise density.
We find this objective interpretable, since the first term, $\mathbb{E} \log p(r^{(\to)})$, can be understood as the data fitting term, and the second term, $\mathbb{E} \log \left|\frac{d}{dy} g^{(\to)}(y)\right|$, can be understood from an information-geometric perspective (Daniušis et al., 2010). However, this equivalent form requires a known noise distribution to calculate the log-likelihood. Some works (Ma et al., 2020; Uemura & Shimizu, 2020) avoid this difficulty by using HSIC instead of MI.

B EXPERIMENTS ON MINIMIZING HSIC

In this section, we show the PNL model learning results obtained by minimizing (12). We generated two synthetic datasets from the PNL model, $Y = f_2(f_1(X) + \epsilon)$, each containing 1000 data samples. The data generation mechanisms are as follows (see Figure 4):
• Syn-1: $f_1(X) = X^{-1} + 10X$, $f_2(Z) = Z^3$, $X \sim U(0.1, 1.1)$, $\epsilon \sim U(0, 5)$,
• Syn-2: $f_1(X) = \sin(7X)$, $f_2(Z) = \exp(Z)$, $X \sim U(0, 1)$, $\epsilon \sim N(0, 0.3^2)$.

We build different MLPs with the following configurations:
• Narrow deep MLP: the input and output are both one-dimensional; there are 9 hidden layers, each with 5 neurons. The activation function is Leaky-ReLU.
• Wide over-parameterized MLP: the input and output are both one-dimensional; there is a single hidden layer with 9000 neurons. The activation function is Leaky-ReLU.

We use the default initialization method in PyTorch (Paszke et al., 2019), and make sure that exactly the same initial weights are used for the narrow/wide MLPs (i.e., the initializations for different datasets are the same).

Optimization Setup: We set the batch size to 32. We use Adam (Kingma & Ba, 2015) for the optimization (the learning rates are $10^{-3}$ and $10^{-6}$ for the narrow deep and wide over-parameterized MLPs, respectively, while all other parameters are set to their defaults). We report the learning results in Figure 5. The learned transformations (see rows 3 and 4 in Figure 5) deviate far from the underlying functions and are quite similar across datasets. A possible reason is that the solutions were started from the same initialization and became trapped at local minima near the initialization.

To verify whether such an HSIC-based PNL learning algorithm is stable for causal discovery, we further evaluate AbPNL on the following dataset. We build 100 data pairs with different random seeds, following the same mechanism, Syn-1, each containing 1000 data samples. We then apply AbPNL (Uemura & Shimizu, 2020) with different initializations to each of those data pairs.
The results in Table 3 show that the causal discovery stability of AbPNL is not satisfactory.

C SYNTHETIC DATASETS FOR INDEPENDENCE TEST

In this section, we describe the synthetic data generation from the PNL model for the independence test. The data were generated from the following model,

$$Y = f_2(f_1(X) + \epsilon), \quad X \sim \mathrm{GMM}, \quad \epsilon \sim N(0, \sigma^2),$$

where $f_1, f_2$ are randomly initialized monotonic neural networks (Wehenkel & Louppe, 2019) with 3 layers and 100 integration steps, and each layer contains 100 units. The cause term $X$ is sampled from a Gaussian mixture model as described in Lopez-Paz et al. (2017). The datasets were configured with various noise levels and sample sizes. There are three different injected noise levels, $\sigma \in \{0.1, 1, 10\}$, and three different sample sizes, $N \in \{1000, 2000, 5000\}$. Under each configuration, we generated 100 data pairs for evaluating the independence test accuracy.

D A UNIVERSAL VIEW OF DEPENDENCE MEASURES

The dependence measures discussed in Section 3.2 are all closely related to the mean squared contingency introduced by Rényi (1959) and later rediscovered through its squared version, called squared-loss mutual information (SMI) (Suzuki et al., 2009),

$$\mathrm{SMI} := \iint p(x) p(y) \left(\frac{p(x, y)}{p(x) p(y)} - 1\right)^2 dx\, dy = \iint \frac{p(x, y)}{p(x) p(y)}\, p(x, y)\, dx\, dy - 1. \tag{16}$$

When the density ratio $\mathrm{DR}(x, y) := \frac{p(x, y)}{p(x) p(y)}$ is constant 1 (namely, $X$ and $Y$ are independent), the SMI is zero. To estimate the SMI, one can first approximate $\mathrm{DR}(x, y)$ by a surrogate function $\mathrm{DR}_\theta(x, y)$ parameterized by $\theta$. The optimal parameter $\hat{\theta}$ can be obtained by minimizing the following squared-error loss $J_{\mathrm{DR}}$,

$$\begin{aligned}
J_{\mathrm{DR}}(\theta) &:= \iint \big(\mathrm{DR}_\theta(x, y) - \mathrm{DR}(x, y)\big)^2 p(x) p(y)\, dx\, dy \\
&= \iint \mathrm{DR}_\theta(x, y)^2\, p(x) p(y)\, dx\, dy - 2 \iint \mathrm{DR}_\theta(x, y)\, p(x, y)\, dx\, dy + \mathrm{Const}.
\end{aligned} \tag{17}$$

Then the empirical SMI can be calculated as

$$\widehat{\mathrm{SMI}} = \frac{1}{n}\sum_{j=1}^{n} \mathrm{DR}_{\hat{\theta}}(x_j, y_j) - 1.$$

We show that, with different parameterizations of the density ratio, the resulting SMI is equivalent to different dependence measures; see Table 4.

Table 4.
Density ratio model | Corresponding dependence measure
$\mathrm{DR}_\theta(x, y) = 1 + \sum_{i=1}^{n} \theta_i K(x, x_i) L(y, y_i)$ | variant of LSMI (Sugiyama & Yamada, 2012)
$\mathrm{DR}_\theta(x, y) = 1 + \sum_{i=1}^{n} \frac{1}{n} K(x, x_i) L(y, y_i)$ | HSIC (Gretton et al., 2005)
$\mathrm{DR}_\theta(x, y) = 1 + \sum_{i=1}^{m} f_i(x) g_i(y)$ | m-mode HGR correlation (Wang et al., 2019)
$\mathrm{DR}_\theta(x, y) = 1 + f(x) g(y)$ [1] | HGR correlation (Rényi, 1959)
[1] When $f, g$ are linear combinations of random features, $f(x) = \alpha^T \phi(x)$, $g(y) = \beta^T \psi(y)$, the corresponding dependence measure is RDC (López-Paz et al., 2013).

Sugiyama & Yamada (2012) proposed to approximate the density ratio by $\mathrm{DR}_\theta(x, y) = \sum_{i=1}^{n} \theta_i K(x, x_i) L(y, y_i)$, where $\hat{\theta}$ has a closed-form solution obtained by minimizing (17). They then approximated the SMI using the empirical average of Equation (16),

$$\frac{1}{n}\sum_{j=1}^{n} \mathrm{DR}_{\hat{\theta}}(x_j, y_j) - 1 = \frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{n} \hat{\theta}_i K(x_j, x_i) L(y_j, y_i) - 1.$$

It can be shown that the first term is the empirical HSIC when $\hat{\theta}_i = \frac{1}{n}$ for all $i$.
We argue that there is a flaw in the above, since both the SMI and the HSIC score should be zero when $X$ and $Y$ are independent. A simple modification is to model the density ratio by $\mathrm{DR}_\theta(x, y) = 1 + \sum_{i=1}^{n} \theta_i K(x, x_i) L(y, y_i)$. The constant 1 here absorbs the independent part, so that the remaining terms model the dependence only. This modification does not affect the quadratic form of $J_{\mathrm{DR}}(\theta)$ and keeps a good interpretation. The SMI reduces to the HSIC score when $\theta_i = \frac{1}{n}$ for all $i$.

We extend this idea to approximate the density ratio by $\mathrm{DR}_\theta(x, y) = 1 + f(x) g(y)$, where $f, g$ are zero-mean and unit-variance functions parameterized by $\theta$; the resulting SMI is then equal to the HGR maximal correlation. Similarly, the constant 1 captures the independent part, and $f(x) g(y)$ captures the dependence.

Proposition 1. The density ratio estimation problem (17) is equivalent to the maximal HGR correlation problem (7) when the density ratio is modeled in the form $\mathrm{DR}_\theta(x, y) = 1 + f(x) g(y)$, and $f, g$ are restricted to zero-mean and unit-variance functions.

Proof. We substitute $\mathrm{DR}_\theta(x, y)$ into Equation (17),

$$\begin{aligned}
J_{\mathrm{DR}}(f, g) &= \iint (1 + f(x) g(y))^2 p(x) p(y)\, dx\, dy - 2 \iint (1 + f(x) g(y))\, p(x, y)\, dx\, dy + \mathrm{Const} \\
&= 1 + 2\,\mathbb{E}(f(X))\,\mathbb{E}(g(Y)) + \mathrm{var}(f(X))\,\mathrm{var}(g(Y)) - 2 - 2\,\mathbb{E}(f(X) g(Y)) + \mathrm{Const}.
\end{aligned}$$

Then it is not hard to see that $\min_{f, g} J_{\mathrm{DR}}(f, g)$, subject to $\mathbb{E}(f) = \mathbb{E}(g) = 0$ and $\mathrm{var}(f) = \mathrm{var}(g) = 1$, is equivalent to the maximal HGR correlation problem (7).

Proposition 2. The density ratio estimation problem (17) is equivalent to the Soft-HGR problem (9) when the density ratio is modeled in the form $\mathrm{DR}_\theta(x, y) = 1 + f(x) g(y)$, and $f, g$ are restricted to zero-mean functions.

We further note that the above density ratio estimation can be regarded as a truncated singular value decomposition $\mathrm{DR}_\theta(x, y) = 1 + \sum_{i=1}^{m} f_i(x) g_i(y)$, where $m = 1$.
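When letting $m > 1$ and imposing zero-mean and unit-variance constraints on all $f_i$ and $g_i$, the corresponding $J_{\mathrm{DR}}$ minimization problem is equivalent to solving the m-mode HGR maximal correlation (Wang et al., 2019; Lee, 2021).

Definition 3 (m-mode HGR maximal correlation). Given $1 \le m \le \min\{|\mathcal{X}|, |\mathcal{Y}|\}$, the m-mode maximal correlation problem for random variables $X \in \mathcal{X}$, $Y \in \mathcal{Y}$ is

$$(\mathbf{f}^*, \mathbf{g}^*) \in \arg\max_{\substack{\mathbf{f}: \mathcal{X} \to \mathbb{R}^m,\ \mathbf{g}: \mathcal{Y} \to \mathbb{R}^m \\ \mathbb{E}[\mathbf{f}(X)] = \mathbb{E}[\mathbf{g}(Y)] = \mathbf{0} \\ \mathbb{E}[\mathbf{f}(X)\mathbf{f}^T(X)] = \mathbb{E}[\mathbf{g}(Y)\mathbf{g}^T(Y)] = I}} \mathbb{E}\big[\mathbf{f}^T(X)\,\mathbf{g}(Y)\big], \tag{18}$$

where $\mathbf{f} = [f_1, f_2, \ldots, f_m]^T$, $\mathbf{g} = [g_1, g_2, \ldots, g_m]^T$ are referred to as the maximal correlation functions.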
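Proposition 2 is stated without a proof in this section. A short derivation, reusing the expansion from the proof of Proposition 1 and dropping only the unit-variance constraints, can be sketched as follows:

```latex
\begin{aligned}
% Expansion of J_DR for DR_theta(x, y) = 1 + f(x) g(y), as in the proof of Proposition 1.
J_{\mathrm{DR}}(f, g)
  &= 1 + 2\,\mathbb{E}(f(X))\,\mathbb{E}(g(Y)) + \mathrm{var}(f(X))\,\mathrm{var}(g(Y))
     - 2 - 2\,\mathbb{E}(f(X) g(Y)) + \mathrm{Const} \\
% Impose only the zero-mean constraints E(f) = E(g) = 0.
  &= \mathrm{var}(f(X))\,\mathrm{var}(g(Y)) - 2\,\mathbb{E}(f(X) g(Y)) + \mathrm{Const}' \\
% Factor out -2 to expose the Soft-HGR objective (9).
  &= -2\Big(\mathbb{E}(f(X) g(Y)) - \tfrac{1}{2}\,\mathrm{var}(f(X))\,\mathrm{var}(g(Y))\Big) + \mathrm{Const}' .
\end{aligned}
```

Hence minimizing $J_{\mathrm{DR}}$ over zero-mean $f, g$ is the same as maximizing the Soft-HGR objective (9), where $\mathrm{Const}'$ collects the terms that do not depend on $f$ or $g$.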

E CONNECTIONS AMONG MI, ML, AND MC

In this section, we build connections among minimizing MI, maximum likelihood, and maximal correlation. The equivalence between minimizing MI and maximizing likelihood was established in Zhang & Hyvärinen (2009). The following proposition shows the connection to maximal correlation.

Proposition 3. Suppose the dataset $\{(x_i, y_i)\}_{i=1}^{n}$ is generated from a PNL model $Y = g^{-1}(f(X) + \epsilon)$, where $f, g$ are both invertible functions, and the noise $\epsilon$ follows a Gaussian density $p(\epsilon; \theta)$ with zero mean and variance $\theta$. Then maximizing the log-likelihood $\log p(\{(x_i, y_i)\}_{i=1}^{n})$ is equivalent to solving the regression problem (8).

Proof. Under the assumptions of Proposition 3, the log-likelihood can be written as follows:

$$
\begin{aligned}
L_n(f, g) &= \sum_{i=1}^{n} \log p(x_i, y_i; f, g, \theta) \\
&= \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} \log p(y_i \mid x_i; f, g, \theta) \\
&= \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} \log p(g(y_i) \mid f(x_i); \theta) && (f, g \text{ are invertible}) \\
&= \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} \log p(g(y_i) - f(x_i); \theta) && (\text{from PNL model}) \\
&= \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} -\frac{(g(y_i) - f(x_i))^2}{2\theta} + n \log \frac{1}{\sqrt{2\pi\theta}} && (\text{Gaussianity})
\end{aligned}
\tag{19}
$$

It is not hard to see that, with fixed $\theta$, maximizing the log-likelihood $L_n(f, g)$ is equivalent to minimizing $\|f(x) - g(y)\|^2$ over invertible $f$ and $g$. Without loss of generality, one can make $f, g$ zero mean. To avoid trivial solutions, one can further restrict $g$ to have unit variance. This establishes the equivalence to the regression problem (8).

Corollary 1. When $n \to \infty$, the ground-truth transformations $f^*, g^*$ minimize MI(x, r) to zero, achieve the optimum of (8), and maximize the log-likelihood $L_n(f^*, g^*)$.

Proof. The proof follows directly from Theorem 3 in Zhang et al. (2015).

Reformulating the problem as (8) or (9) allows efficient BCD-like optimization algorithms to be exploited.
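The last step of the proof, that for fixed $\theta$ the Gaussian log-likelihood and the squared residual rank candidate transformations identically, can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration: the toy PNL model ($f(x) = x^3 + x$, $g(y) = y^3$), the noise level, and the candidate set. The term $\sum_i \log p(x_i)$ is dropped because it does not depend on $f$ or $g$.

```python
import numpy as np

def gaussian_loglik(residual, theta):
    """Gaussian log-likelihood of residuals g(y_i) - f(x_i) with variance theta
    (the sum of log p(x_i) is omitted, as it does not depend on f or g)."""
    n = residual.size
    return -np.sum(residual**2) / (2 * theta) - 0.5 * n * np.log(2 * np.pi * theta)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
noise = 0.1 * rng.standard_normal(200)
# Toy PNL data: y = g^{-1}(f(x) + noise) with g(y) = y^3 and f(x) = x^3 + x.
y = np.cbrt(x**3 + x + noise)

g_y = y**3
candidates = {
    "true f": x**3 + x,
    "linear f": 2 * x,
    "zero f": np.zeros_like(x),
}
theta = 0.01  # fixed noise variance
loglik = {name: gaussian_loglik(g_y - f_x, theta) for name, f_x in candidates.items()}
sse = {name: np.sum((g_y - f_x) ** 2) for name, f_x in candidates.items()}

best_by_loglik = max(loglik, key=loglik.get)  # maximiser of the log-likelihood
best_by_sse = min(sse, key=sse.get)           # minimiser of the squared residual
```

Because the log-likelihood is an affine, decreasing function of the squared residual once $\theta$ is fixed, the two selections coincide for any candidate set, not only this one.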

F RANDOM FEATURE GENERATION

We generate the random features as described in López-Paz et al. (2013). The generation process has two steps: copula transformation (optional) and random nonlinear projection.

Step 1. Copula transformation. We first estimate the empirical cumulative distribution functions of both X and Y by

$$P_n^X(x) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(x_i \le x), \qquad P_n^Y(y) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(y_i \le y).$$

Then we can apply the empirical copula transformation to the data samples $\{(x_i, y_i)\}_{i=1}^{n}$, $u_i^X = P_n^X(x_i)$ and $u_i^Y = P_n^Y(y_i)$, where the marginals $U^X$ and $U^Y$ follow the uniform distribution $U(0, 1)$.

Step 2. Random nonlinear projection. We design a k-dimensional random feature vector $\phi(x) = [\sin(w_1 x + b_1), \cdots, \sin(w_k x + b_k)]^T$, where $w_i, b_i \sim N(0, s^2)$. The random feature matrix $\Phi \in \mathbb{R}^{k \times n}$ is stacked as

$$
\Phi(x; k, s) := \begin{bmatrix}
\sin(w_1 x_1 + b_1) & \cdots & \sin(w_1 x_n + b_1) \\
\vdots & \ddots & \vdots \\
\sin(w_k x_1 + b_k) & \cdots & \sin(w_k x_n + b_k)
\end{bmatrix}.
$$

One can replace the $x_i$ here by $u_i^X$ from the first step to form the random feature matrix. The same procedure can be applied to y to generate $\Psi$. The number of random Fourier features k is user-defined and typically ranges from a few tens to a few thousand (Rahimi & Recht, 2008; Theodoridis, 2015). In our experiments, we set k = 30 and s = 2.
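The two steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `empirical_copula` and `random_features` are hypothetical, and the defaults k = 30 and s = 2 are taken from the text above.

```python
import numpy as np

def empirical_copula(x):
    """Step 1: map each sample x_i to its empirical CDF value P_n(x_i)."""
    x = np.asarray(x, dtype=float)
    # P_n(x_i) = (1/n) * #{j : x_j <= x_i}
    return np.searchsorted(np.sort(x), x, side="right") / x.size

def random_features(x, k=30, s=2.0, rng=None):
    """Step 2: k x n matrix with entries sin(w_j * x_i + b_j), w_j, b_j ~ N(0, s^2)."""
    rng = np.random.default_rng(rng)
    w = rng.normal(0.0, s, size=k)
    b = rng.normal(0.0, s, size=k)
    return np.sin(np.outer(w, x) + b[:, None])

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
u = empirical_copula(x)                       # optional copula step
Phi = random_features(u, k=30, s=2.0, rng=1)  # feature matrix, shape (30, 500)
```

The matrix `Psi` for y would be built the same way with an independent draw of `w` and `b`.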

G ON THE OPTIMIZATION OF PROBLEM (14)

G.1 SUBPROBLEM: EQUALITY CONSTRAINED QUADRATIC PROGRAMMING

To simplify the notation, we rewrite the sub-problem in the following form:

$$\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{2} x^T A x - b^T x, \quad \text{s.t.} \quad v^T x = c.$$

With the KKT conditions, one can find the unique optimal solution $x^*$ by solving the following linear system,

$$
\underbrace{\begin{bmatrix} A & v \\ v^T & 0 \end{bmatrix}}_{=:\mathrm{KKT}} \begin{bmatrix} x^* \\ \lambda^* \end{bmatrix} = \begin{bmatrix} b \\ c \end{bmatrix},
$$

when the KKT matrix is non-singular. In our setting, we can choose $\Phi$ and $\Psi$ properly to make $\Phi\Phi^T$ and $\Psi\Psi^T$ positive definite, or add a small positive definite perturbation matrix $\epsilon I$, so that a unique optimum is obtained. Moreover, the sub-problem is small and easy to solve.
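A minimal sketch of this KKT solve is given below. The function name `solve_eq_qp`, the optional ridge parameter `eps`, and the random test problem are illustrative assumptions; only the linear system itself comes from the text.

```python
import numpy as np

def solve_eq_qp(A, b, v, c, eps=0.0):
    """Minimise 0.5 x^T A x - b^T x subject to v^T x = c via the KKT system.

    eps adds eps * I to A (optional perturbation to make A positive definite).
    """
    n = b.size
    kkt = np.zeros((n + 1, n + 1))
    kkt[:n, :n] = A + eps * np.eye(n)
    kkt[:n, n] = v
    kkt[n, :n] = v
    rhs = np.concatenate([b, [c]])
    sol = np.linalg.solve(kkt, rhs)   # raises LinAlgError if the KKT matrix is singular
    return sol[:n], sol[n]            # (x*, lambda*)

# Hypothetical problem data: A = M M^T is positive definite with probability one.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 20))
A = M @ M.T
b = rng.standard_normal(5)
v = rng.standard_normal(5)
x_star, lam_star = solve_eq_qp(A, b, v, c=1.0)
```

The returned pair satisfies stationarity, $A x^* + \lambda^* v = b$, and feasibility, $v^T x^* = c$. For the problem sizes discussed here (k in the tens) a dense solve is inexpensive.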

G.2 LANDSCAPE STUDY WITH HESSIAN

To simplify the notation, we rewrite

$$J(\alpha, \beta; A, B, C, D, E) = \alpha^T A \alpha\, \beta^T B \beta - \alpha^T C \beta + \alpha^T D \alpha + \beta^T E \beta,$$

where

$$
A = \frac{1}{2n^2} \Phi\Phi^T, \quad B = \Psi\Psi^T, \quad C = \frac{1}{n} \Phi\Psi^T + \frac{\lambda}{(n-1)^2} \Phi H K_{xx} H \Psi^T, \quad D = \frac{\lambda}{(n-1)^2} \Phi H K_{xx} H \Phi^T, \quad E = \frac{\lambda}{(n-1)^2} \Psi H K_{xx} H \Psi^T.
$$

Since A, B, D, and E are symmetric, the corresponding Hessian is

$$
\nabla^2 J(\alpha, \beta) = \begin{bmatrix}
2A\, \beta^T B \beta + 2D & 4A\alpha\beta^T B - C \\
4B\beta\alpha^T A - C^T & 2B\, \alpha^T A \alpha + 2E
\end{bmatrix}.
$$

We can now verify the nature of the critical points by checking their Hessians numerically. One obvious critical point is the all-zero vector 0. In our experiments, the Hessian at 0 is mostly indefinite as long as the weight λ of the convex regularization term is not too large, which means that 0 is a saddle point. In practice, the algorithm rarely converges to 0.
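The numerical check described above can be sketched as follows. This is an illustration under loud assumptions: the data, the sine features, the Gaussian kernel with unit bandwidth for $K_{xx}$, and the helper names are all hypothetical. The Hessian blocks are obtained by differentiating J directly and assuming A, B, D, E symmetric, which gives the cross block $4A\alpha\beta^T B - C$.

```python
import numpy as np

def objective(alpha, beta, A, B, C, D, E):
    """J(alpha, beta) in the simplified notation."""
    return ((alpha @ A @ alpha) * (beta @ B @ beta) - alpha @ C @ beta
            + alpha @ D @ alpha + beta @ E @ beta)

def hessian(alpha, beta, A, B, C, D, E):
    """Hessian of J in (alpha, beta); assumes A, B, D, E are symmetric."""
    H_aa = 2 * A * (beta @ B @ beta) + 2 * D
    H_ab = 4 * np.outer(A @ alpha, B @ beta) - C
    H_bb = 2 * B * (alpha @ A @ alpha) + 2 * E
    return np.block([[H_aa, H_ab], [H_ab.T, H_bb]])

def toy_matrices(lam, n=50, k=4, seed=0):
    """Build A..E from toy random sine features (all choices are illustrative)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = x**3 + 0.1 * rng.standard_normal(n)
    Phi = np.sin(np.outer(rng.normal(0, 2, k), x) + rng.normal(0, 2, k)[:, None])
    Psi = np.sin(np.outer(rng.normal(0, 2, k), y) + rng.normal(0, 2, k)[:, None])
    Kxx = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # Gaussian kernel, bandwidth 1
    Hc = np.eye(n) - np.ones((n, n)) / n                  # centering matrix H
    HKH = Hc @ Kxx @ Hc
    w = lam / (n - 1) ** 2
    A = Phi @ Phi.T / (2 * n**2)
    B = Psi @ Psi.T
    C = Phi @ Psi.T / n + w * Phi @ HKH @ Psi.T
    D = w * Phi @ HKH @ Phi.T
    E = w * Psi @ HKH @ Psi.T
    return A, B, C, D, E

k = 4
mats = toy_matrices(lam=0.0, k=k)
eig0 = np.linalg.eigvalsh(hessian(np.zeros(k), np.zeros(k), *mats))
is_saddle = eig0.min() < 0 < eig0.max()   # indefinite Hessian at the origin
```

At the origin the Hessian reduces to the block matrix with diagonal blocks 2D and 2E and off-diagonal blocks −C and −Cᵀ, so with λ = 0 it is indefinite whenever C is non-zero. Re-running `toy_matrices` with larger `lam` shows how the regularization weight affects the eigenvalues in this toy setting.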

H FINE-TUNE WITH BANDED LOSS / UNIVERSAL HSIC

In the PNL model, the injected noise is assumed to be independently and identically distributed. Thus, the residual plot should form a "horizontal band". We design a banded residual loss to fine-tune the models as follows. The data samples are separated into b bins $\{x^{(i)}, y^{(i)}\}_{i=1}^{b}$ according to the ordering of X, and we expect the residuals in those bins, $\mathrm{Res}_i = f(x^{(i)}) - g(y^{(i)})$, to have the same distribution; see Figure 6. To this end, we adopt the empirical maximum mean discrepancy (MMD) (Gretton et al., 2012). The above banded residual loss involves MMD, which is highly non-convex and makes the optimization difficult. We used projected gradient descent with momentum to optimize the loss function. The residual plot shows a band shape; see the top row in Figure 7.
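The binning idea can be sketched as below. This is a hedged illustration only: the exact form of the banded loss is not given in the text here, so the aggregation used (sum of biased empirical MMD² between each bin's residuals and the pooled residuals), the Gaussian kernel, its bandwidth, and the function names `mmd2` and `banded_residual_loss` are all assumptions.

```python
import numpy as np

def mmd2(a, b, bandwidth=1.0):
    """Biased empirical MMD^2 between 1-D samples a and b with a Gaussian kernel."""
    def gram(u, v):
        return np.exp(-((u[:, None] - v[None, :]) ** 2) / (2 * bandwidth**2))
    return gram(a, a).mean() + gram(b, b).mean() - 2 * gram(a, b).mean()

def banded_residual_loss(x, residual, n_bins=5, bandwidth=1.0):
    """Assumed aggregation: sum over bins of MMD^2(bin residuals, pooled residuals)."""
    order = np.argsort(x)                           # bins follow the ordering of X
    bins = np.array_split(residual[order], n_bins)
    return sum(mmd2(r, residual, bandwidth) for r in bins)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 400)
band = 0.3 * rng.standard_normal(400)   # i.i.d. residuals: a horizontal band
tilted = band + 0.8 * x                 # residuals that still depend on x

loss_band = banded_residual_loss(x, band)
loss_tilted = banded_residual_loss(x, tilted)
```

In this toy comparison the tilted residuals yield a clearly larger loss than the i.i.d. ones, which is the behaviour a fine-tuning loss of this kind is meant to penalize.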

J ADDITIONAL CONVERGENCE RESULTS

In this section, we show the convergence results on Syn-1 as well. 

K DETAILED DATA DESCRIPTIONS

In this section, we describe the datasets in detail.

Gene Datasets:

For D4-S1, D4-S2A, D4-S2B, and D4-S2C, we used the preprocessed data in Duong & Nguyen (2022)². D4-S1 contains 36 variable pairs with 105 samples in each pair; D4-S2A, D4-S2B, and D4-S2C contain 528, 747, and 579 variable pairs respectively, and each pair contains 210 samples. The GSE57872 dataset is built on Patel et al. (2014), in which the data take continuous values. Following Choi et al. (2020), we first screen out 657 gene pairs that have corresponding labels in the TRRUST database (Han et al., 2017). The gene expression data contain many repeated values; we examined each gene pair and deleted those repeated expression values.

L ON THE CHOICE OF λ

We tried seven different values for λ and report the AUC scores on the PNL-A-unif dataset under different noise levels. We found that MC-PNL is well suited to the small-noise regime. We also found that a smaller λ is preferred for data with small noise, and a larger λ is preferred for data with large injected noise.

² https://github.com/baosws/CDCI



Independence test-based methods. Average running time evaluated on synthetic data containing 100 pairs, where each pair has 1000 samples.

We report the comparison of ROC-AUCs in Table 2. The results are averaged over five different noise scales for the synthetic datasets. Our proposed MC-PNL consistently outperforms other independence test-based methods on the synthetic PNL data. In particular, compared with AbPNL, MC-PNL is not sensitive to the initialization and is much more efficient (w.r.t. training time); compared with ACE (without the independence regularizer), MC-PNL has better causal discovery accuracy. On real datasets, our method is quite competitive.

6 CONCLUSIONS

In this paper, we focus on PNL model learning and propose a maximal correlation-based method, which can recover the nonlinear transformations accurately and swiftly in an iterative manner. The key is to incorporate maximal correlation to avoid learning arbitrary independent noise; the proposed MC-PNL is more reliable than previous methods that are based solely on the independence loss. Besides PNL model learning, we conduct experiments on the downstream causal discovery task, where MC-PNL is superior to the SOTA independence test-based methods.

We note that the optimal solution of (8) is also one solution of (9).



Figure 1: Architectures of PNL learning frameworks.

Figure 2: Sub-figures (a) and (b) show the nonlinear function fitting on the two datasets. In each sub-figure, the top row shows the learned $f^{(\to)}(x)$ (red line) and the residual plot under the correct hypothesis X → Y, which has the lower RDC value; the bottom row is under the opposite hypothesis X ← Y.

Figure 3: Algorithm 1 converges on Syn-2. We plot snapshots of the feature transformation f at training epochs [0, 5, 10, 20], under 15 random initializations (indicated by colors). Upper: λ = 0, most initializations converge to local minimizers (symmetry: $(\alpha, \beta) \to (a\alpha, a^{-1}\beta)$). Lower: λ = 5, most initializations converge to two local minimizers (symmetry: $(\alpha, \beta) \to -(\alpha, \beta)$).

Figure 4: The ground-truth transformations $f^*$ and $g^*$ of Syn-1 (top) and Syn-2 (bottom).

Figure 5: Visualization of the learned nonlinearities (trained solely with HSIC, under different datasets and MLP configurations). From top to bottom, the convergence results, residual plot, learned f , learned g, are plotted. Each column shows one specific configuration. None of them learns meaningful nonlinearities, and the learned transformations are quite similar across datasets.

Figure 6: The construction of banded residual loss.

Figure 9: Bootstrap results of MC-PNL on eight Tuebingen datasets. We plot the histogram of the RDC estimates and the estimated causal scores of 30 replications.

Figure 10: Algorithm 1 converges on Syn-1. We plot snapshots of the feature transformation f at training epochs [0, 5, 10, 20], under 15 random initializations (indicated by colors). Upper: λ = 0, most initializations converge to local minimizers (symmetry: $(\alpha, \beta) \to (a\alpha, a^{-1}\beta)$). Lower: λ = 5, most initializations converge to two local minimizers (symmetry: $(\alpha, \beta) \to -(\alpha, \beta)$).

Figure 11: The detailed AUC scores vs. λ under five noise levels on PNL-A-unif data.

The independence test accuracy (%) with known injected noise

Comparison of bivariate causal discovery ROC-AUC on synthetic and real datasets

Comparison of bivariate causal discovery AUC on 100 realizations of Syn-1

Connections between DR parameterization and dependence measure

Density ratio surrogate function $\mathrm{DR}_\theta(x, y)$

I BOOTSTRAP FOR TRUSTWORTHY CAUSAL DISCOVERY

Bootstrap is a commonly used technique for estimating confidence intervals. In this section, we show a few examples of bootstrap with the Tuebingen data (Mooij et al., 2016). We obtained 30 estimates of RDC from the data re-sampled with replacement; see Figure 9.

