HIERARCHICAL PROBABILISTIC MODEL FOR BLIND SOURCE SEPARATION VIA LEGENDRE TRANSFORMATION

Abstract

We present a novel blind source separation (BSS) method called information geometric blind source separation (IGBSS). Our formulation is based on the log-linear model equipped with a hierarchically structured sample space, which guarantees that a set of source signals is uniquely recovered by minimizing the KL divergence from a set of mixed signals. Source signals, received signals, and mixing matrices are realized as different layers in our hierarchical sample space. Our empirical results on images and time-series data demonstrate that our approach is superior to well-established techniques and is able to separate signals with complex interactions.

1. INTRODUCTION

The objective of blind source separation (BSS) is to identify a set of source signals from a set of multivariate mixed signals¹. BSS is widely used for applications that can be framed as the "cocktail party problem". Examples include image/signal processing (Isomura & Toyoizumi, 2016), artifact removal in medical imaging (Vigário et al., 1998), and electroencephalogram (EEG) signal separation (Congedo et al., 2008). Currently, there are a number of solutions to the BSS problem. The most widely used approaches are variations of principal component analysis (PCA) (Pearson, 1901; Murphy, 2012) and independent component analysis (ICA) (Comon, 1994; Murphy, 2012). However, all of these approaches have limitations. PCA and its modern variations, such as sparse PCA (SPCA) (Zou et al., 2006), non-linear PCA (NLPCA) (Scholz et al., 2005), and robust PCA (Xu et al., 2010), extract a specified number of components with the largest variance under an orthogonality constraint, each composed of a linear combination of variables. They create a set of uncorrelated orthogonal basis vectors that represent the source signal. The basis vectors with the N largest variances are called the principal components and are the output of the model. PCA has been shown to be effective for many applications such as dimensionality reduction and feature extraction. However, for BSS, PCA assumes that the source signals are orthogonal, which is often not the case in practical applications. Similarly, ICA also attempts to find the N components with the largest variance, but relaxes the orthogonality constraint. All variations of ICA, such as infomax (Bell & Sejnowski, 1995), FastICA (Hyvärinen & Oja, 2000), and JADE (Cardoso, 1999), separate a multivariate signal into additive subcomponents by maximizing the statistical independence of each component.
ICA assumes that each component is non-Gaussian and that the relationship between the source signal and the mixed signal is an affine transformation. In addition to these assumptions, ICA is sensitive to the initialization of the weights, as the optimization is non-convex and likely to converge to a local optimum. Other methods that can perform BSS include non-negative matrix factorization (NMF) (Lee & Seung, 2001; Berne et al., 2007), dictionary learning (DL) (Olshausen & Field, 1997), and reconstruction ICA (RICA) (Le et al., 2011). NMF, DL, and RICA are degenerate approaches to recovering the source signal from the mixed signal; they are more typically used for feature extraction. NMF factorizes a matrix into two matrices with non-negative elements representing weights and features. The features extracted by NMF can be used to recover the source signal. More recently, more advanced techniques use the short-time Fourier transform (STFT) to transform the signal into the frequency domain and construct a spectrogram before applying NMF (Sawada et al., 2019). However, NMF does not maximize statistical independence, which is required to completely separate the mixed signal into the source signals, and it is also sensitive to initialization as the optimization is non-convex. Due to the non-convexity, additional constraints or heuristics for weight initialization are often applied to NMF to achieve better results (Ding et al., 2008; Boutsidis & Gallopoulos, 2008). DL can be thought of as a variation of the ICA approaches that requires an over-complete basis for the mixing matrix. DL may be advantageous because additional constraints, such as a positive code or a positive dictionary, can be applied to the model. However, since it requires an over-complete basis, information may be lost when reconstructing the source signal.
In addition, like the other approaches, DL is non-convex and sensitive to the initialization of the weights. All of the approaches above have limitations such as loss of information or non-convex optimization, and require constraints or assumptions, such as orthogonality or an affine transformation, that are not ideal for BSS. In the following, we introduce our approach to BSS, called IGBSS (Information Geometric BSS), using the log-linear model (Agresti, 2012), which can introduce relationships between possible states into its sample space (Sugiyama et al., 2017). Unlike the previous approaches, our proposed approach does not require these assumptions. We provide a flexible solution by introducing a hierarchical structure between signals into our model, which allows us to treat interactions between signals that are more complex than an affine transformation. Unlike other existing methods, our approach does not require the inversion of the mixing matrix and is able to recover the sign of the signal. Thanks to the well-developed information geometric analysis of the log-linear model (Amari, 2001), optimization of our method is achieved via convex optimization; hence it always arrives at the globally optimal, unique solution. Moreover, we theoretically show that it always minimizes the Kullback-Leibler (KL) divergence from a set of mixed signals to a set of source signals. Our experimental results demonstrate that our hierarchical model leads to better separation of signals, including complex interactions such as higher-order feature interactions (Luo & Sugiyama, 2019), than existing methods.

2. FORMULATION

BSS is formulated as a function f that separates a set of received signals X into a set of source signals Z, i.e., Z = f(X). For example, in the ICA-based formulation, the BSS problem reduces to X = AZ, where the received signal X ∈ R^{L×M}, consisting of L signals with sample size M, is an affine transformation of the source signal Z ∈ R^{N×M} with N signals by a mixing matrix A ∈ R^{L×N}. The objective is to estimate Z by learning A given X. Our idea is to use the log-linear model (Agresti, 2012), a well-known energy-based model, to take non-affine transformations into account and formulate BSS as a convex optimization problem.
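As a concrete reference point, the ICA-style generative model X = AZ can be sketched in a few lines of NumPy. The dimensions and distributions below are illustrative choices, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

L, N, M = 3, 3, 1000                      # received signals, source signals, sample size
Z = rng.standard_normal((N, M))           # latent source signals (unknown to the algorithm)
A = rng.uniform(1.0, 6.0, size=(L, N))    # mixing matrix (also unknown)
X = A @ Z                                 # observed received signals

# A BSS method sees only X and must estimate Z (and, implicitly, A).
print(X.shape)  # → (3, 1000)
```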

2.1. LOG-LINEAR MODEL ON PARTIALLY ORDERED SET

We use the log-linear model given in the form

log p(ω) = Σ_{s∈S} 1_{s⪯ω} θ_s − ψ(θ), (1)

where p(ω) ∈ (0, 1) is the probability of each state ω ∈ Ω, S ⊆ Ω is a parameter space such that a parameter value θ_s ∈ R is associated with each s ∈ S, and ψ(θ) is the partition function ensuring Σ_{ω∈Ω} p(ω) = 1. In this formulation, we assume that the set Ω of possible states, equivalent to the sample space in the statistical sense, is a partially ordered set (poset); that is, it is equipped with a partial order "⪯" (Gierz et al., 2003), and 1_{s⪯ω} = 1 if s ⪯ ω and 0 otherwise. This formulation was first introduced by Sugiyama et al. (2016) and used to model the matrix balancing problem (Sugiyama et al., 2017); it includes Boltzmann machines as a special case (Luo & Sugiyama, 2019). If we index Ω as Ω = {ω_1, ω_2, …, ω_{|Ω|}}, we obtain the following matrix form:

log p = Fθ − ψ(θ),

where p ∈ (0, 1)^{|Ω|} with p_i = p(ω_i); θ ∈ R^{|Ω|} such that θ_i = θ_{ω_i} if ω_i ∈ S and θ_i = 0 otherwise; F = (f_{ij}) ∈ {0, 1}^{|Ω|×|Ω|} with f_{ij} = 1_{ω_j ⪯ ω_i}; and ψ(θ) = (ψ(θ), …, ψ(θ)) ∈ R^{|Ω|}. Each vector is treated as a column vector, and log is an entry-wise operation. This matrix form is often used as a general form of the log-linear model (Coull & Agresti, 2003), and F is called a model matrix, which represents the relationship between states. The log-linear model requires F to be non-singular, and Sugiyama et al. (2017) showed that Equation (1) with a poset Ω always provides a non-singular model matrix; that is, F is regular as long as each entry is given as f_{ij} = 1_{ω_j ⪯ ω_i}. This property is powerful in mathematical modeling, as we can introduce any partial order structure into Ω; we will use it in the next subsection to introduce our hierarchical structure to solve BSS.
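To make the matrix form concrete, the following sketch builds the model matrix F for a tiny hand-made poset and computes p via log p = Fθ − ψ(θ); the four-state poset here is invented purely for illustration.

```python
import numpy as np

# Toy poset: ⊥ below everything, two parameter states s1, s2, one top state.
states = ["bot", "s1", "s2", "top"]
below = {"bot": [], "s1": ["bot"], "s2": ["bot"], "top": ["bot", "s1", "s2"]}

# Model matrix: F[i, j] = 1 iff ω_j ⪯ ω_i (reflexivity included).
F = np.array([[1.0 if (wj == wi or wj in below[wi]) else 0.0
               for wj in states] for wi in states])

# θ is zero outside the parameter space S = {s1, s2}; ⊥ carries the partition function.
theta = np.array([0.0, 0.3, -0.2, 0.0])
unnorm = np.exp(F @ theta)
p = unnorm / unnorm.sum()        # normalization plays the role of ψ(θ)

# Listing states along a linear extension of ⪯ makes F triangular with a unit
# diagonal, so it is non-singular, as Sugiyama et al. (2017) show in general.
print(round(np.linalg.det(F)))   # → 1
```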

2.2. LAYER CONFIGURATION FOR BLIND SOURCE SEPARATION

Our key idea is to introduce a hierarchical layered structure into the sample space Ω of the log-linear model to achieve BSS. We call this model information geometric BSS (IGBSS), as its optimality is supported by the tight connection between the log-linear model and the information geometric properties of the space of distributions (the statistical manifold), which will be shown in the next subsection. We implement three layers of BSS, the mixing layer, the source layer, and the received layer, into Ω as partial orders and learn the joint representation on it using the log-linear model. The received layer and the source layer represent the input received signal and the output source signal of BSS, respectively, and the mixing layer encodes how the source signals are mixed. In the following, we consistently assume that L is the number of received signals, M is the sample size, and N is the number of source signals. The element ⊥ denotes the least element; its parameter acts as the partition function, and θ_⊥ = −ψ(θ) always holds. We use 2D indexing of elements in each layer to make the correspondence between our formulation and the ICA-based formulation clear; that is, the three layers A, Z, and X are analogous to a mixing matrix A ∈ R^{L×N}, a source matrix Z ∈ R^{N×M}, and a received matrix X ∈ R^{L×M}, respectively². We will also use the symbols ω and s to denote elements of Ω, i.e., they can be ⊥, a_{ln}, z_{nm}, or x_{lm}. We always assume that the parameter space of the log-linear model is S = A ∪ Z ⊂ Ω, meaning that the mixing and source layers are used as parameters to represent distributions in our model. Here we introduce a partial order between layers. For each element of the three layers A, Z, and X, define

a_{ij} ⪯ z_{i′j′} if j = i′, a_{ij} ⋠ z_{i′j′} otherwise; z_{ij} ⪯ x_{i′j′} if j = j′, z_{ij} ⋠ x_{i′j′} otherwise, (2)

and we do not introduce any ordering among elements within the same layer. Since ⪯ is a partial order, transitivity always holds; for example, a_{11} ⪯ x_{22} because a_{11} ⪯ z_{12} and z_{12} ⪯ x_{22}.
The first condition encodes that the source layer is higher than the mixing layer, and the second condition encodes that the received layer is higher than the source layer. An example of our sample space with L = M = N = 2 is illustrated in Figure 1. The joint distribution for BSS is described by the log-linear model in Equation (1) over the sample space Ω = {⊥} ∪ A ∪ Z ∪ X equipped with the partial order defined in Equation (2). If we learn the joint distribution from a received signal X, we obtain probabilities on the source layer p(z_{11}), …, p(z_{NM}), which represent the normalized source signals. The rationale of our approach is as follows: the connections between the layers are structured so that the log-linear model performs a computation similar to the ICA-based approach X = AZ. Our structure ensures that each p(x_{lm}) is determined by (θ_{a_{ln}})_{n∈[N]} and (θ_{z_{nm}})_{n∈[N]} with [N] = {1, …, N}, as we always have a_{ln} ⪯ x_{lm} and z_{nm} ⪯ x_{lm}. Moreover, this formulation allows us to model interactions between signals that are more complex than an affine transformation, such as higher-order interactions, if we additionally include partial order structure within Z and/or A, which cannot be treated by a simple matrix multiplication.
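A direct transcription of the layer construction and the partial order in Equation (2) might look as follows; this is a small L = M = N = 2 instance of the full order (before any orders are removed, cf. Section 2.3), with naming conventions of our own.

```python
import numpy as np
from itertools import product

Lr, N, M = 2, 2, 2   # received signals, source signals, sample size

# States: bottom element, mixing layer A, source layer Z, received layer X.
A = [("a", l, n) for l, n in product(range(Lr), range(N))]
Z = [("z", n, m) for n, m in product(range(N), range(M))]
X = [("x", l, m) for l, m in product(range(Lr), range(M))]
Omega = [("bot",)] + A + Z + X

def leq(s, w):
    """s ⪯ w under Equation (2), closed under reflexivity and transitivity."""
    if s == w or s == ("bot",):
        return True
    if s[0] == "a" and w[0] == "z":
        return s[2] == w[1]        # a_{ln} ⪯ z_{n'm} iff n = n'
    if s[0] == "z" and w[0] == "x":
        return s[2] == w[2]        # z_{nm} ⪯ x_{l'm'} iff m = m'
    if s[0] == "a" and w[0] == "x":
        return True                # a_{ln} ⪯ z_{nm} ⪯ x_{l'm} by transitivity
    return False

# Model matrix over Ω; listing layers bottom-up gives a triangular, regular F.
F = np.array([[1.0 if leq(sj, wi) else 0.0 for sj in Omega] for wi in Omega])
print(int(round(np.linalg.det(F))))  # → 1
```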

2.3. OPTIMIZATION

We train the log-linear model by minimizing the KL divergence from an empirical distribution p̂, which is identical to the normalized received signal X ∈ R^{L×M}, to the model joint distribution p given by Equation (1), or, equivalently, by maximizing the likelihood. More precisely, we normalize a given X by dividing each entry by the sum of all entries; that is, the empirical distribution p̂ is obtained as p̂(x_{lm}) = x_{lm} / Σ_{l,m} x_{lm}. If X contains negative values, an exponential kernel exp(x_{lm}) / Σ_{l,m} exp(x_{lm}) or min-max normalization (x_{lm} + ε − min(X)) / (max(X) + ε − min(X)) can be used, where ε is an arbitrarily small value to avoid zero probability. We also assume that p̂(a_{ln}) = 0 and p̂(z_{nm}) = 0 for all a_{ln} ∈ A and z_{nm} ∈ Z. The objective function is given as

argmin_{p∈P_θ} D_KL(p̂ ∥ p) = argmin_{p∈P_θ} Σ_{ω∈Ω} p̂(ω) log( p̂(ω) / p(ω) ), (3)

where P_θ is the set of distributions that can be represented by Equation (1) with our structured sample space Ω = {⊥} ∪ A ∪ Z ∪ X and S = A ∪ Z. The remarkable property of our model is that this optimization problem is convex, so gradient-based methods are guaranteed to arrive at the globally optimal, unique solution. To show this, we analyze the geometric structure of the statistical manifold, the set of probability distributions, generated by the log-linear model. Let Ω⁺ = Ω \ {⊥}. First we introduce another parameterization (η_ω)_{ω∈Ω⁺} of the log-linear model, defined as

η_ω = Σ_{s∈Ω} 1_{ω⪯s} p(s). (4)

Note that η_⊥ = 1 always holds, so we do not include it in the parameters. In addition, for theoretical consistency, we change the parameter space used in Equation (1) from S to Ω⁺ and assume that θ_ω = 0 if ω ∉ S. Again we do not include θ_⊥ as a parameter, as it is determined by the partition function. The two parameterizations (θ_ω)_{ω∈Ω⁺} and (η_ω)_{ω∈Ω⁺} have a clear statistical interpretation: any log-linear model belongs to the exponential family, where θ and η correspond to the natural and expectation parameters, respectively.
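The pre-processing options above amount to three small functions; this is a sketch under the assumption that the normalized signal is then used as the empirical distribution p̂, and the function names are ours.

```python
import numpy as np

def normalize_sum(X):
    """Empirical distribution for non-negative X: divide by the total sum."""
    return X / X.sum()

def normalize_exp(X):
    """Exponential kernel for X containing negative values."""
    E = np.exp(X)
    return E / E.sum()

def normalize_minmax(X, eps=1e-6):
    """Min-max shift (x + eps - min) / (max + eps - min), then renormalized."""
    Y = (X + eps - X.min()) / (X.max() + eps - X.min())
    return Y / Y.sum()

def kl(p_hat, p):
    """KL divergence D(p_hat || p) of Equation (3)."""
    return float(np.sum(p_hat * np.log(p_hat / p)))

X = np.array([[1.0, -2.0], [3.0, 0.5]])
p_hat = normalize_minmax(X)
```

All three maps produce strictly positive distributions summing to one, so the KL objective is well defined.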
θ and η are connected via a Legendre transformation, which means that they are both differentiable and have a one-to-one correspondence. To simplify the notation, we denote by θ̂ and η̂ the θ and η corresponding to the empirical distribution p̂. Let P = {p | 0 < p(ω) < 1 for all ω ∈ Ω} be the set of all probability distributions. This set forms a statistical manifold with a dually flat structure, the canonical geometric structure in information geometry (Amari, 2016), with the dual coordinate system ((θ_ω)_{ω∈Ω⁺}, (η_ω)_{ω∈Ω⁺}); that is, both (θ_ω)_{ω∈Ω⁺} and (η_ω)_{ω∈Ω⁺} work as coordinate systems and determine a distribution in P. The Riemannian metric with respect to θ is given as

g_{ss′} = ∂η_s / ∂θ_{s′} = E[ (∂ log p(ω)/∂θ_s)(∂ log p(ω)/∂θ_{s′}) ] = Σ_{ω∈Ω} 1_{s⪯ω} 1_{s′⪯ω} p(ω) − η_s η_{s′}, (5)

which coincides with the Fisher information (Sugiyama et al., 2017, Theorem 3), and we will use it for the natural gradient. Now we consider two submanifolds P_θ, P_η ⊆ P, defined as

P_θ = { p ∈ P | θ_ω = 0 for all ω ∈ E }, E = Ω⁺ \ S,
P_η = { p ∈ P | η_ω = η̂_ω for all ω ∈ M }, M = S.

Note that this P_θ coincides with that in Equation (3). The submanifold P_θ is called an e-flat submanifold and P_η an m-flat submanifold in information geometry. The benefit of considering these two types of submanifolds is that, if E ∩ M = ∅ and E ∪ M = Ω⁺, it is theoretically guaranteed that the intersection P_θ ∩ P_η is always a singleton and is the optimizer of Equation (3) (Amari, 2009, Theorem 3); that is, it is the globally optimal solution of our model. Moreover, (θ_ω)_{ω∈S} forms a coordinate system of P_θ that is linearly constrained on θ.

Algorithm 1: IGBSS via natural gradient
1: Function IGBSS(X)
2:   Normalize X into the empirical distribution p̂
3:   Compute η̂ = (η̂_s)_{s∈S} from p̂
4:   Initialize (θ_s)_{s∈S} (randomly or θ_s = 0)
5:   repeat
6:     Compute p from (θ_s)_{s∈S}
7:     Compute (η_s)_{s∈S} from p
8:     (Δη_ω)_{ω∈Z} ← (η_ω)_{ω∈Z} − (η̂_ω)_{ω∈Z}
9:     (Δη_ω)_{ω∈A} ← (η_ω)_{ω∈A} − (η̂_ω)_{ω∈A}
10:    Compute the Fisher information matrix G_Z for the source layer and G_A for the mixing layer
11:    (θ_ω)_{ω∈Z} ← (θ_ω)_{ω∈Z} − G_Z⁻¹ (Δη_ω)_{ω∈Z}
12:    (θ_ω)_{ω∈A} ← (θ_ω)_{ω∈A} − G_A⁻¹ (Δη_ω)_{ω∈A}
13:  until convergence of (θ_s)_{s∈S}
14: End Function
We can therefore use the standard gradient descent strategy to optimize the log-linear model. The derivative of the KL divergence with respect to θ_s is known to be the difference between expectation parameters (Sugiyama et al., 2017, Theorem 2): (∂/∂θ_s) D_KL(p̂ ∥ p) = η_s − η̂_s, and the KL divergence D_KL(p̂ ∥ p) is minimized if and only if η_s = η̂_s for all s ∈ S. From our definition of Ω in Equation (2), we have η̂_{z_{kl}} = η̂_{z_{k′l}} for all z_{kl}, z_{k′l} ∈ Z; therefore, such elements in the source layer would all learn the same value. This problem can be avoided by removing some of the partial orders between the source and received layers. We propose to systematically remove the partial orders z_{ij} ⪯ x_{i′j} with i ≠ i′ to ensure η̂_{z_{kl}} ≠ η̂_{z_{k′l}} (see Figure 1), while other strategies are possible as long as this condition is satisfied, for example, random deletion of such orders. Using the above results, gradient descent can be directly applied to solve Equation (3). However, it may need a large number of iterations to reach convergence. To reduce the number of iterations, we propose to use the natural gradient (Amari, 1998), a second-order optimization approach that also always finds the global optimum. Let us re-index S = A ∪ Z as S = {s_1, s_2, …, s_{|S|}} and set θ = [θ_{s_1}, …, θ_{s_{|S|}}]ᵀ and η = [η_{s_1}, …, η_{s_{|S|}}]ᵀ. In each step of the natural gradient, the current θ is updated to θ_next by

θ_next = θ − G⁻¹(η − η̂),

where G = (g_{ij}) ∈ R^{|S|×|S|} is the Fisher information matrix whose entries g_{ij} = g_{s_i s_j} are given in Equation (5). Although the natural gradient requires far fewer iterations than gradient descent, the matrix inversion G⁻¹ is computationally expensive, with complexity O(|S|³). In addition, the entries of the Fisher information matrix are often too small, and optimization becomes numerically unstable.
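Since η̂ = Fᵀp̂ in the matrix form and the gradient is simply η − η̂ on the parameter coordinates, plain gradient descent is only a few lines; this is a generic sketch for any model matrix F, not the authors' implementation, and the tiny chain poset at the end is our own test case.

```python
import numpy as np

def fit_loglinear_gd(F, p_hat, S_mask, lr=0.5, iters=3000):
    """Gradient descent on the log-linear model over a poset.

    F      : |Ω|x|Ω| model matrix, F[i, j] = 1 iff ω_j ⪯ ω_i.
    p_hat  : empirical distribution over Ω.
    S_mask : boolean mask of parameter states (θ is fixed to 0 elsewhere).
    """
    theta = np.zeros(len(p_hat))
    eta_hat = F.T @ p_hat                    # target expectation parameters η̂
    for _ in range(iters):
        logits = F @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()                         # normalization implements ψ(θ)
        grad = F.T @ p - eta_hat             # ∂D_KL/∂θ_s = η_s − η̂_s
        theta[S_mask] -= lr * grad[S_mask]
    return theta, p

# Three-state chain ⊥ ⪯ s ⪯ x with parameters on s and x: the model is
# expressive enough to match any positive empirical distribution exactly.
F = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
p_hat = np.array([0.2, 0.3, 0.5])
theta, p = fit_loglinear_gd(F, p_hat, np.array([False, True, True]))
```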
To solve these problems, we separate the update steps for the source layer and the mixing layer:

(θ_{ω,next})_{ω∈Z} = (θ_ω)_{ω∈Z} − G_Z⁻¹ (Δη_ω)_{ω∈Z},
(θ_{ω,next})_{ω∈A} = (θ_ω)_{ω∈A} − G_A⁻¹ (Δη_ω)_{ω∈A},

where G_Z and G_A are the Fisher information matrices for the source and mixing layers, respectively, each constructed by assuming that all other parameters are fixed. Note that this also leads to the same global optimum. This approach reduces the time complexity to O(|Z|³ + |A|³). The full algorithm using the natural gradient is given in Algorithm 1. Computation of p from θ and of η from p can be achieved using Equations (1) and (4). We also give a more explicit description of p and η for each layer in Appendix A.1.
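A minimal sketch of one blockwise natural-gradient step, building the Fisher block of Equation (5) directly from F and p. This is a generic illustration with the block's index set supplied by the caller (one call per layer); the chain poset used at the end is our own toy example with a single block.

```python
import numpy as np

def natural_gradient_step(F, theta, p_hat, block):
    """One update θ_B ← θ_B − G_B^{-1}(η_B − η̂_B) for one layer's index set B."""
    logits = F @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    eta, eta_hat = F.T @ p, F.T @ p_hat
    B = np.asarray(block)
    # Fisher block (Eq. 5): g_{ss'} = Σ_ω 1{s⪯ω} 1{s'⪯ω} p(ω) − η_s η_{s'}
    G = (F[:, B].T * p) @ F[:, B] - np.outer(eta[B], eta[B])
    theta = theta.copy()
    theta[B] -= np.linalg.solve(G, eta[B] - eta_hat[B])
    return theta

# Chain poset ⊥ ⪯ s ⪯ x with both parameters in a single block:
F = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
p_hat = np.array([0.2, 0.3, 0.5])
theta = np.zeros(3)
for _ in range(15):                      # Newton-like steps converge quickly
    theta = natural_gradient_step(F, theta, p_hat, [1, 2])
logits = F @ theta
p = np.exp(logits - logits.max()); p /= p.sum()
```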

3. EXPERIMENTS

We empirically examine the effectiveness of IGBSS in performing BSS, using real-world image and synthetic time-series datasets, for both an affine transformation and higher-order interactions between signals. All experiments were run on CentOS Linux 7 with an Intel Xeon CPU E5-2623 v4 and an Nvidia Quadro GP100³.

3.1. BLIND SOURCE SEPARATION FOR AFFINE TRANSFORMATIONS ON IMAGES

In our experiments, we use three benchmark images widely used in computer vision from the University of Southern California's Signal and Image Processing Institute (USC-SIPI)⁴: "airplane (F-16)", "lake", and "peppers". Each image is standardized to 32×32 pixels with red, green, and blue color channels, with integer values between 0 and 255 representing the intensity of each pixel. These images, shown in Figure 2a, are the source signal Z, which is unknown to the model; they are used only as ground truth to evaluate the model's output. The equation X = AZ is used to generate the received signal X, with the values of the mixing matrix A drawn uniformly at random from real numbers between 1 and 6. The images are then rescaled to integer values in the range 0 to 255. The received signal X, which is the input to the model, consists of the three images shown in Figure 2b. The three images of the mixed signal may look visually similar; however, they are actually superpositions of the source signals with different intensities. The objective of our model is to reconstruct the source signal Z without knowing the mixing matrix A. We compare our approach to FastICA (Hyvärinen & Oja, 2000) with the log-cosh function as the signal prior; dictionary learning (DL) (Olshausen & Field, 1997) with positive-dictionary and positive-code constraints; and NMF with the coordinate descent solver and non-negative double singular value decomposition (NNDSVD) initialization (Boutsidis & Gallopoulos, 2008), with zero values replaced by the mean of the input. Since BSS is an unsupervised learning problem, the order of the signals is not recovered. We identify the corresponding signals by taking all permutations of the output and calculating the minimum Euclidean distance to the ground truth; the permutation that returns the minimum error is considered the correct order of the images.
The scale of the output is also not recovered; we therefore apply min-max normalization to the output of each model. Separation results for images are shown in Figure 2. Our proposed approach IGBSS is able to recover the majority of the "shape" of the source signal, while the intensity of each image appears to be larger than the ground truth for all images. Small residuals of each image can be seen on the other images; for instance, in the airplane (F-16) image, residuals from the lake image are clearly visible. Compared with FastICA, DL, and NMF, IGBSS performs significantly better, as all the other approaches are unable to clearly separate the mixed signal. FastICA was unable to provide a reasonable reconstruction with 3 mixed signals. To overcome this limitation of FastICA, we randomly generated another column of the mixing matrix and appended it to the current mixing matrix, creating 4 mixed signals as input, which allows FastICA to recover a more reasonable signal. The root mean square error (RMSE) of the Euclidean distance and the signal-to-noise ratio (SNR) between the reconstruction and the ground truth are calculated to quantify the results of each method. The SNR is computed as SNR_dB = 20 log₁₀(‖z_norm‖ / ‖z − z_norm‖). The full results are shown in Table 1 (top row for each experiment). In the table, we present three experiments with different RGB images from the USC-SIPI dataset; for each experiment we generate a new mixing matrix, and the second and third experiments use the images "mandrill", "splash", "jelly beans" and "mandrill", "lake", "peppers", respectively. Ground truth and resulting images for the second and third experiments are presented in the Supplement. Our results clearly show that IGBSS is superior to the other methods: IGBSS consistently produced the lowest RMSE for every experiment. Looking at the SNR, our model produced the highest SNR in the majority of cases and always recovers the same result after each run, as it is formulated as convex optimization.
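The evaluation protocol (permutation matching followed by RMSE and SNR) is easy to restate in code; the SNR line follows the formula above, read as a ratio of Euclidean norms, and the helper names are ours.

```python
import numpy as np
from itertools import permutations

def match_and_score(Z_true, Z_rec):
    """Align recovered rows to the ground truth by minimum Euclidean distance,
    then report RMSE and SNR in dB for the best permutation."""
    best_perm, best_err = None, np.inf
    for perm in permutations(range(Z_true.shape[0])):
        err = np.linalg.norm(Z_true - Z_rec[list(perm)])
        if err < best_err:
            best_perm, best_err = perm, err
    Zp = Z_rec[list(best_perm)]
    rmse = np.sqrt(np.mean((Z_true - Zp) ** 2))
    snr_db = 20 * np.log10(np.linalg.norm(Z_true) / np.linalg.norm(Z_true - Zp))
    return best_perm, rmse, snr_db

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 50))
Z_shuffled = Z[[2, 0, 1]] + 0.01 * rng.standard_normal((3, 50))
perm, rmse, snr_db = match_and_score(Z, Z_shuffled)
```

The exhaustive search over permutations is fine for the small N used here; for many signals a linear-assignment solver would be the usual replacement.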

3.2. BLIND SOURCE SEPARATION WITH HIGHER-ORDER FEATURE INTERACTIONS

We demonstrate the ability of our model to include higher-order feature interactions in BSS. We use the same benchmark images from the standard BSS experiment as the source signal Z. We generate the higher-order feature interactions of the received signal using multiplicative products of the source signals. If we take into account up to k-th order interactions (k ≤ N),

x_{lm} = Σ_n a_{ln} z_{nm} + Σ_{n₁} Σ_{n₂>n₁} a_{l n₁n₂} z_{n₁m} z_{n₂m} + Σ_{n₁} Σ_{n₂>n₁} Σ_{n₃>n₂} a_{l n₁n₂n₃} z_{n₁m} z_{n₂m} z_{n₃m} + ⋯ + Σ_{n₁} ⋯ Σ_{n_k>n_{k−1}} a_{l n₁…n_k} z_{n₁m} ⋯ z_{n_k m}.

All the other known approaches take into account only first-order interactions (that is, an affine transformation) between features. In contrast, our model can directly incorporate the higher-order features, as we make no assumption of an affine transformation. When we consider up to k-th order interactions, we additionally include elements corresponding to the new mixing parameters in the mixing layer. For example, if k = 2, nodes for a_{l n₁n₂} are added, with a_{l n₁n₂} ⪯ z_{nm} if n₁ = n or n₂ = n. Figure 3 shows experimental results for the third-order feature experiment. Our approach IGBSS shows superior reconstruction of the source signal compared to the other approaches. None of the other approaches except NMF achieves a reasonable reconstruction. NMF is able to recover the "shape" of the image; however, unlike IGBSS, NMF is a degenerate approach, so it is unable to recover all color channels in the correct proportion, because the proportions of the pixel intensities are not recovered, creating discoloring of the image, which is clearly reflected in the SNR values. In terms of both RMSE and SNR shown in Table 1, IGBSS again shows the best results for both second- and third-order interactions of signals across the three experiments.
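The k-th order generative model above can be written compactly with itertools.combinations; this sketch only generates the received signal (the forward model), and the coefficient-array layout is a hypothetical choice of ours.

```python
import numpy as np
from itertools import combinations

def mix_higher_order(Z, coeffs):
    """X with multiplicative interactions: coeffs[r] has one column per
    size-(r+1) subset {n_1 < ... < n_{r+1}} of the source signals."""
    N, M = Z.shape
    X = np.zeros((coeffs[0].shape[0], M))
    for r, A_r in enumerate(coeffs):
        prods = np.array([np.prod(Z[list(c)], axis=0)
                          for c in combinations(range(N), r + 1)])
        X += A_r @ prods
    return X

rng = np.random.default_rng(0)
N, M, L, k = 3, 10, 3, 2
Z = rng.uniform(0.1, 1.0, (N, M))
coeffs = [rng.uniform(1.0, 6.0, (L, len(list(combinations(range(N), r + 1)))))
          for r in range(k)]
X = mix_higher_order(Z, coeffs)
```

With k = 1 this reduces exactly to the affine model X = AZ.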

3.3. TIME SERIES DATA ANALYSIS

We demonstrate the effectiveness of our model on time-series data. In our experiments, we create three signals with 500 observations each using a sinusoidal function, a sign (square-wave) function, and a sawtooth function. The synthetic data simulate typical signals from a wide range of applications, including audio, medical, and sensor data. We randomly generate a mixing matrix by drawing from a uniform distribution with values between 0.5 and 2. In our experiment, we compare both min-max normalization and the exponential kernel as pre-processing steps and compare our approach with FastICA. Experimental results are illustrated in Figure 4. These results show that IGBSS is superior to the ICA approaches because it recovers both the shape and the sign of the signal, while the ICA approaches recover only the shape and are unable to recover the sign. This means that ICA can recover a flipped signal. We paired the recovered signals of ICA with the ground truth by finding the signal and sign with the lowest RMSE; in any practical application this is not possible for ICA, because the latent signal is unknown. Through visual inspection, IGBSS recovers all signals with high accuracy, while FastICA recovers only the first-order interaction and fails to produce a reasonable recovery for second- and third-order interactions. In addition to the visual comparison, we performed a quantitative analysis using the RMSE against the ground truth; results are shown in Table 2. FastICA shows better performance for first-order interactions. However, the second- and third-order SNR results show that FastICA is unable to recover a reasonable signal because the noise is dominant. IGBSS shows superior performance and recovers the signal for second- and third-order interactions with better scores for both RMSE and SNR.
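The synthetic sources can be generated with NumPy alone, using np.sign of a sinusoid for the square wave and a modulo ramp for the sawtooth; the frequencies below are arbitrary choices of ours, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 500
t = np.linspace(0.0, 8.0, M)

Z = np.stack([
    np.sin(2 * np.pi * t),                  # sinusoidal source
    np.sign(np.sin(2 * np.pi * 1.3 * t)),   # sign (square-wave) source
    2.0 * ((1.7 * t) % 1.0) - 1.0,          # sawtooth source in [-1, 1)
])
A = rng.uniform(0.5, 2.0, size=(3, 3))      # mixing matrix as in the experiment
X = A @ Z                                   # observed mixed time series
```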

4. CONCLUSION

We have proposed a novel blind source separation (BSS) method, called Information Geometric Blind Source Separation (IGBSS). We have formulated our approach using the log-linear model, which enables us to introduce a hierarchical structure into its sample space to achieve BSS. We have theoretically shown that IGBSS has desirable properties for BSS, such as unique recovery of the source signals, as it solves a convex optimization problem that minimizes the KL divergence from mixed signals to source signals. We have experimentally shown that IGBSS recovers images and signals closer to the ground truth than independent component analysis (ICA), dictionary learning (DL), and non-negative matrix factorization (NMF). Thanks to the flexibility of the hierarchical structure, IGBSS is able to separate signals with complex interactions such as higher-order interactions. Our model is superior to the other approaches because it is non-degenerate and is able to recover the sign of the signal. Since our approach is flexible and requires fewer assumptions than alternative approaches, it can be applied to various real-world applications such as medical imaging, signal processing, and image processing.

A APPENDIX

A.1 PARAMETER COMPUTATION FOR EACH LAYER

In the following, we give p, η, and the gradient for each layer, which are used in gradient descent.

Received Layer (Input Layer): The probability p(x) on the received layer for x ∈ X is obtained as

log p(x) = Σ_{z∈Z} 1_{z⪯x} θ_z + Σ_{a∈A} 1_{a⪯x} θ_a + θ_⊥,   η_x = Σ_{x′∈X} 1_{x⪯x′} p(x′) = p(x).

We do not need to compute a gradient for the received layer, as there is no parameter on this layer and θ_x = 0 for all x ∈ X.

Source Layer (Output Layer): The probability p(z) on the source layer for each z ∈ Z is given as

log p(z) = Σ_{z′∈Z} 1_{z′⪯z} θ_{z′} + Σ_{a∈A} 1_{a⪯z} θ_a + θ_⊥ = θ_z + Σ_{a∈A} 1_{a⪯z} θ_a + θ_⊥,
η_z = Σ_{x∈X} 1_{z⪯x} p(x) + Σ_{z′∈Z} 1_{z⪯z′} p(z′) = Σ_{x∈X} 1_{z⪯x} p(x) + p(z).

Thus the gradient for the source layer is given as

∂/∂θ_z D_KL(p̂ ∥ p) = η_z − η̂_z = Σ_{x∈X} 1_{z⪯x} (p(x) − p̂(x)) + p(z).

Mixing Layer: The probability p(a) on this layer for each a ∈ A is given as

log p(a) = Σ_{a′∈A} 1_{a′⪯a} θ_{a′} + θ_⊥ = θ_a + θ_⊥,
η_a = Σ_{x∈X} 1_{a⪯x} p(x) + Σ_{z∈Z} 1_{a⪯z} p(z) + Σ_{a′∈A} 1_{a⪯a′} p(a′) = Σ_{x∈X} 1_{a⪯x} p(x) + Σ_{z∈Z} 1_{a⪯z} p(z) + p(a).

The gradient for the mixing layer is given as

∂/∂θ_a D_KL(p̂ ∥ p) = η_a − η̂_a = Σ_{x∈X} 1_{a⪯x} (p(x) − p̂(x)) + Σ_{z∈Z} 1_{a⪯z} p(z) + p(a).

Parameter values θ_a in the mixing layer represent the degree of mixing between source signals. Hence they can be used to perform feature selection and extraction. For example, if θ_a = 0 in the extreme case, the corresponding node a makes no contribution to the source mixing.

A.2 FEATURE EXTRACTION FOR A 2D POINT CLOUD EXPERIMENT

We demonstrate the effectiveness of IGBSS in identifying independent components on a 2-dimensional point cloud, to be used for feature extraction or dimensionality reduction. In our experiment, we generate a 2-dimensional point cloud from a standard Student's t-distribution with 1.3 degrees of freedom in each dimension, scaling the first dimension by 1/5 and the second by 1/10, as illustrated in Figure 5a. We then randomly generate a mixing matrix to produce the mixed signal shown in Figure 5b. We run IGBSS with min-max normalization as a pre-processing step and compare it to PCA and ICA. We apply the reverse transformation of the min-max normalization to the recovered signal and plot the results in Figure 5. From the experimental results, we can see that PCA recovers the same scale of the point cloud; however, the sign of the signal is not recovered, as we obtained a sign-reversed signal. PCA also recovers signals that are orthogonal to the direction of largest variance. Therefore, the axes of the point cloud recovered by PCA do not align with the source signal in Figure 5a; that is, the axes do not run parallel to the x- and y-axes but remain in the same orientation as the mixed signal. This is undesirable, as the signal is still mixed, and in blind source separation we would like to recover the signal in the same orientation as the source signal. ICA aims to recover statistically independent signals, generally taken to be the axes with the largest variances, which are not necessarily orthogonal to each other. However, the limitation of ICA is that it is unable to recover the sign and the scale of the signal; therefore, the scale of the recovered signal does not match the source signal. In our experiment, we plot the results with unit variance, as the recovered signal is generally unnormalized in ICA. Since our experiment is synthetically generated, we can quantitatively measure the error of each approach by normalizing both the recovered signal and the source signal by their standard deviations and then computing the root mean squared error (RMSE) and the signal-to-noise ratio (SNR). The results are shown in Table 3. Our proposed approach IGBSS has clear advantages: it recovers the same orientation as the source signal while preserving the signal.
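Generating the heavy-tailed 2-D point cloud and its mixed version takes only a few lines; the sample size and the mixing range below are our assumptions, not values stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Independent Student-t coordinates (1.3 degrees of freedom), scaled
# anisotropically by 1/5 and 1/10 as in the experiment.
S = np.stack([rng.standard_t(1.3, n) / 5.0,
              rng.standard_t(1.3, n) / 10.0])
A = rng.uniform(0.5, 2.0, size=(2, 2))   # hypothetical random mixing matrix
X = A @ S                                 # mixed point cloud, one column per point
```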

A.3 RUNTIME ANALYSIS

In our experiments, we used a learning rate of 1.0 for gradient descent. Although the time complexity of each iteration of the natural gradient is O(|Z|³ + |A|³ + |Ω||S|), which is larger than the O(|Ω||S|²) of gradient descent, the natural gradient reaches convergence faster because it converges quadratically and requires significantly fewer iterations than gradient descent, which converges linearly. Increasing the size of the input increases |Ω| only, while the numbers of parameters |Z| and |A| remain the same. Since the complexity of the natural gradient is linear with respect to the input size |Ω|, increasing the size of the input is unlikely to increase the runtime significantly; in practical applications it is unlikely that |A| > |Z|. Our experimental analysis in Figure 6 supports this: our model scales linearly for both the natural gradient and gradient descent when increasing the order of interactions. The difference between the runtimes of the natural gradient and gradient descent becomes larger as the order of interactions increases.

A.4 SIGN INVERSION IN FASTICA

We demonstrate the problem of sign inversion in ICA. We use the same experimental set-up explained in Section 3.1 on blind source separation for affine transformations. We run FastICA on the dataset used for experiment 1 with first-order interactions and show the output of several runs in Figure 7 to illustrate the sign-inversion problem. In none of the 6 runs was FastICA able to obtain the correct sign of the signal. This means that applying FastICA to applications where the sign of the signal is important is quite problematic.



¹ Mixed signals and received signals are used interchangeably throughout this article.
² We abuse notation and write an entry x_{lm} of X also for its corresponding state in X to avoid complicated notation.
³ The code is available in the supplementary material and will be made publicly available online after the peer review process.
⁴ http://sipi.usc.edu/database/



The time complexity to compute p in Line 6 of Algorithm 1 is O(|Ω||S|). The complexity to compute Δη in Lines 8 and 9 is O(|Z|) + O(|A|) = O(|S|). Therefore, the total complexity of each iteration is O(|Z|³ + |A|³ + |Ω||S|).

Figure 2: First-order interaction experiment.

Figure 3: Third-order interaction experiment.

Figure 4: Time series signal experiment.

Figure 5: 2-dimensional point cloud experiment.

Figure 6: Experimental analysis of the scalability with the number of parameters and higher-order features in the model, for both the natural gradient approach and gradient descent.

Figure 7: Six different runs of FastICA with the same input dataset as experiment 1 with first-order interactions. The differing results demonstrate that FastICA is non-convex, leading to potentially problematic results such as sign inversion.

{⊥} ∪ A ∪ Z ∪ X with A = {a_{11}, …, a_{LN}}, Z = {z_{11}, …, z_{NM}}, and X = {x_{11}, …, x_{LM}}.


Table 1: Signal-to-Noise Ratio of the reconstructed signal. (*) Results for Figure 2. (†) Results for Figure 3. Scores are means ± standard deviation over 40 runs; a different weight initialization was applied after each run.
Second: 0.200 ± 0.000, 0.280 ± 0.049, 0.515 ± 0.007, 0.709 ± 0.000, 10.862 ± 0.000, 10.171 ± 2.353, 0.529 ± 0.228, −5.579 ± 0.000

Table 2: Quantitative results for the time-series separation experiment (mean ± standard deviation over 40 runs).

Table 3: Signal-to-Noise Ratio (SNR) and Root Mean Square Error (RMSE) between the recovered signal and the latent source signal for the 2-dimensional point cloud experiment.

