HIERARCHICAL PROBABILISTIC MODEL FOR BLIND SOURCE SEPARATION VIA LEGENDRE TRANSFORMATION

Abstract

We present a novel blind source separation (BSS) method, called information geometric blind source separation (IGBSS). Our formulation is based on the log-linear model equipped with a hierarchically structured sample space, which has a theoretical guarantee to uniquely recover a set of source signals by minimizing the KL divergence from a set of mixed signals. Source signals, received signals, and mixing matrices are realized as different layers in our hierarchical sample space. Our empirical results on images and time-series data demonstrate that our approach is superior to well-established techniques and is able to separate signals with complex interactions.

1. INTRODUCTION

The objective of blind source separation (BSS) is to identify a set of source signals from a set of multivariate mixed signals.¹ BSS is widely used for applications that can be regarded as instances of the "cocktail party problem". Examples include image/signal processing (Isomura & Toyoizumi, 2016), artifact removal in medical imaging (Vigário et al., 1998), and electroencephalogram (EEG) signal separation (Congedo et al., 2008). Currently, there are a number of solutions to the BSS problem. The most widely used approaches are variations of principal component analysis (PCA) (Pearson, 1901; Murphy, 2012) and independent component analysis (ICA) (Comon, 1994; Murphy, 2012). However, all of these approaches have limitations. PCA and its modern variations, such as sparse PCA (SPCA) (Zou et al., 2006), non-linear PCA (NLPCA) (Scholz et al., 2005), and robust PCA (Xu et al., 2010), extract a specified number of components with the largest variance under an orthogonality constraint, each composed of a linear combination of variables. They create a set of uncorrelated orthogonal basis vectors that represent the source signals. The N basis vectors with the largest variance are called the principal components and are the output of the model. PCA has been shown to be effective for many applications such as dimensionality reduction and feature extraction. For BSS, however, PCA assumes that the source signals are orthogonal, which is not the case in most practical applications. Similarly, ICA also attempts to find the N components with the largest variance, but relaxes the orthogonality constraint. All variations of ICA, such as infomax (Bell & Sejnowski, 1995), FastICA (Hyvärinen & Oja, 2000), and JADE (Cardoso, 1999), separate a multivariate signal into additive subcomponents by maximizing the statistical independence of each component.
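As a concrete point of reference for the ICA approaches above, the following is a minimal sketch using scikit-learn's FastICA; the toy signals and variable names are our own illustration, not part of this work:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
# Two non-Gaussian sources: a sinusoid and a square wave.
Z = np.stack([np.sin(3 * t), np.sign(np.sin(5 * t))])  # sources, shape (2, 2000)
A = np.array([[1.0, 0.5],
              [0.4, 1.2]])                              # mixing matrix
X = A @ Z                                               # mixed (received) signals

ica = FastICA(n_components=2, random_state=0)
Z_hat = ica.fit_transform(X.T).T  # recovered sources, shape (2, 2000)
```

FastICA maximizes the non-Gaussianity of the components; as discussed above, the recovered sources are identifiable only up to permutation, scaling, and sign.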
ICA assumes that each component is non-Gaussian and that the relationship between the source signals and the mixed signals is an affine transformation. In addition to these assumptions, ICA is sensitive to the initialization of the weights, as the optimization is non-convex and is likely to converge to a local optimum. Other potential methods that can perform BSS include non-negative matrix factorization (NMF) (Lee & Seung, 2001; Berne et al., 2007), dictionary learning (DL) (Olshausen & Field, 1997), and reconstruction ICA (RICA) (Le et al., 2011). NMF, DL, and RICA are degenerate approaches to recovering the source signals from the mixed signals; they are more typically used for feature extraction. NMF factorizes a matrix into two matrices with nonnegative elements representing weights and features. The features extracted by NMF can be used to recover the source signals. More recently, advanced techniques apply the short-time Fourier transform (STFT) to move the signal into the frequency domain and construct a spectrogram before applying NMF (Sawada et al., 2019). However, NMF does not maximize statistical independence, which is required to completely separate the mixed signals into the source signals, and it is also sensitive to initialization as the optimization is non-convex. Due to this non-convexity, additional constraints or heuristics for weight initialization are often applied to NMF to achieve better results (Ding et al., 2008; Boutsidis & Gallopoulos, 2008). DL can be thought of as a variation of the ICA approaches that requires an over-complete basis for the mixing matrix. DL may be advantageous because additional constraints, such as positivity of the code or the dictionary, can be applied to the model. However, since it requires an over-complete basis, information may be lost when reconstructing the source signals.
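The NMF factorization described above can be sketched with scikit-learn on toy nonnegative data (our own illustration, not the paper's experimental setup):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
X = rng.random((4, 100))  # nonnegative mixed signals: 4 channels x 100 samples

# Factorize X ~= W @ H into nonnegative weights W and nonnegative features H.
model = NMF(n_components=2, init="random", random_state=1, max_iter=500)
W = model.fit_transform(X)  # (4, 2) weights
H = model.components_       # (2, 100) features
```

Rerunning with a different `random_state` generally yields a different factorization, which illustrates the sensitivity to initialization noted above.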
In addition, like all the other approaches, DL is non-convex and sensitive to the initialization of the weights. All previous approaches thus suffer from limitations such as loss of information or non-convex optimization, and they require constraints or assumptions, such as orthogonality or an affine transformation, that are not ideal for BSS. In the following, we introduce our approach to BSS, called IGBSS (Information Geometric BSS), using the log-linear model (Agresti, 2012), which can introduce relationships between possible states into its sample space (Sugiyama et al., 2017). Unlike the previous approaches, our proposal does not rely on these assumptions or suffer from these limitations. We provide a flexible solution by introducing a hierarchical structure between signals into our model, which allows us to treat interactions between signals that are more complex than an affine transformation. Unlike other existing methods, our approach does not require the inversion of the mixing matrix and is able to recover the sign of the signal. Thanks to the well-developed information geometric analysis of the log-linear model (Amari, 2001), optimization of our method is achieved via convex optimization, hence it always arrives at the globally optimal, unique solution. Moreover, we theoretically show that it always minimizes the Kullback-Leibler (KL) divergence from a set of mixed signals to a set of source signals. Our experimental results demonstrate that our hierarchical model leads to better separation of signals, including complex interactions such as higher-order feature interactions (Luo & Sugiyama, 2019), than existing methods.

2. FORMULATION

BSS is formulated as a function $f$ that separates a set of received signals $X$ into a set of source signals $Z$, i.e., $Z = f(X)$. For example, if one employs an ICA-based formulation, the BSS problem reduces to $X = AZ$, where the received signal $X \in \mathbb{R}^{L \times M}$, consisting of $L$ signals with sample size $M$, is an affine transformation of the source signal $Z \in \mathbb{R}^{N \times M}$, consisting of $N$ signals, by a mixing matrix $A \in \mathbb{R}^{L \times N}$. The objective is to estimate $Z$ by learning $A$ given $X$. Our idea is to use the log-linear model (Agresti, 2012), a well-known energy-based model, to take non-affine transformations into account and to formulate BSS as a convex optimization problem.
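The shapes in the ICA-based formulation above can be checked with a minimal NumPy sketch (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

L, N, M = 3, 2, 5                  # received signals, source signals, sample size
rng = np.random.default_rng(0)
Z = rng.standard_normal((N, M))    # source signals   Z in R^{N x M}
A = rng.standard_normal((L, N))    # mixing matrix    A in R^{L x N}
X = A @ Z                          # received signals X in R^{L x M}
```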

2.1. LOG-LINEAR MODEL ON PARTIALLY ORDERED SET

We use the log-linear model given in the form $\log p(\omega) = \sum_{s \in S} \mathbb{1}[s \preceq \omega]\, \theta_s - \psi(\theta)$, where $p(\omega) \in (0, 1)$ is the probability of each state $\omega \in \Omega$, $S \subseteq \Omega$ is a parameter space such that a parameter value $\theta_s \in \mathbb{R}$ is associated with each $s \in S$, and $\psi(\theta)$ is the partition function ensuring $\sum_{\omega \in \Omega} p(\omega) = 1$. In this formulation, we assume that the set $\Omega$ of possible states, equivalent to the sample space in the statistical sense, is a partially ordered set (poset); that is, it is equipped with a partial order "$\preceq$" (Gierz et al., 2003), and $\mathbb{1}[s \preceq \omega] = 1$ if $s \preceq \omega$ and $0$ otherwise. This formulation was first introduced by Sugiyama et al. (2016) and used to model the matrix balancing problem (Sugiyama et al., 2017), which includes Boltzmann machines as a special case (Luo & Sugiyama, 2019). If we index $\Omega$ as $\Omega = \{\omega_1, \omega_2, \dots, \omega_{|\Omega|}\}$, we obtain the following matrix form: $\log \mathbf{p} = \mathbf{F}\boldsymbol{\theta} - \boldsymbol{\psi}(\theta)$, where $\mathbf{p} \in (0, 1)^{|\Omega|}$ with $p_i = p(\omega_i)$, $\boldsymbol{\theta} \in \mathbb{R}^{|\Omega|}$ such that $\theta_i = \theta_{\omega_i}$ if $\omega_i \in S$ and $\theta_i = 0$ otherwise, $\mathbf{F} = (f_{ij}) \in \{0, 1\}^{|\Omega| \times |\Omega|}$ with $f_{ij} = \mathbb{1}[\omega_j \preceq \omega_i]$, and $\boldsymbol{\psi}(\theta) = (\psi(\theta), \dots, \psi(\theta)) \in \mathbb{R}^{|\Omega|}$. Each vector
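To make the matrix form concrete, here is a minimal sketch (our own toy example, not from the paper) using the poset of subsets of {0, 1} ordered by inclusion: F holds the indicators f_ij = 1 if ω_j ⪯ ω_i, and ψ(θ) is computed so that the probabilities sum to one:

```python
import numpy as np

# Toy sample space Ω: subsets of {0, 1}, partially ordered by set inclusion.
omega = [frozenset(), frozenset({0}), frozenset({1}), frozenset({0, 1})]

# f_ij = 1 if omega_j ⪯ omega_i (i.e., omega_j is a subset of omega_i), else 0.
F = np.array([[1.0 if oj <= oi else 0.0 for oj in omega] for oi in omega])

theta = np.array([0.0, 0.7, -0.3, 0.5])  # parameters; θ_i = 0 for ω_i outside S
psi = np.log(np.exp(F @ theta).sum())    # partition function ψ(θ)
p = np.exp(F @ theta - psi)              # log p = Fθ - ψ(θ)
```

After normalization, `p.sum()` equals 1 and every entry lies in (0, 1), matching the constraints on $\mathbf{p}$ above.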



¹ Mixed signals and received signals are used interchangeably throughout this article.

