DEEP GATED CANONICAL CORRELATION ANALYSIS

Abstract

Canonical Correlation Analysis (CCA) models can extract informative correlated representations from multimodal unlabelled data. Despite their success, CCA models may break down if the number of variables exceeds the number of samples. We propose Deep Gated-CCA, a method for learning correlated representations based on a sparse subset of variables from two observed modalities. The proposed procedure learns two non-linear transformations and simultaneously gates the input variables to identify a subset of the most correlated variables. The non-linear transformations are learned by training two neural networks to maximize a shared correlation loss defined on their outputs. Gating is obtained by adding an approximate ℓ0 regularization term applied to the input variables. This approximation relies on a recently proposed continuous Gaussian-based relaxation of Bernoulli variables, which act as gates. We demonstrate the efficacy of the method on several synthetic and real examples. Most notably, the method outperforms other linear and non-linear CCA models.

1. INTRODUCTION

Canonical Correlation Analysis (CCA) (Hotelling, 1936; Thompson, 2005) is a classic statistical method for finding maximally correlated linear transformations of two modalities (or views). Given two centered modalities X ∈ R^{D_x × N} and Y ∈ R^{D_y × N}, with N samples and D_x and D_y features respectively, CCA seeks canonical vectors a_i ∈ R^{D_x} and b_i ∈ R^{D_y} such that the projections u_i = a_i^T X and v_i = b_i^T Y, i = 1, ..., d, maximize the sample correlations

a_i, b_i = argmax Corr(u_i, v_i), subject to <u_i, u_j> = δ_{ij}, <v_i, v_j> = δ_{ij}, i, j = 1, ..., d,   (1)

i.e. the u_i (and the v_i) form orthonormal bases. While CCA enjoys a closed-form solution via a generalized eigenvalue problem, it is restricted to the linear transformations A = [a_1, ..., a_d] and B = [b_1, ..., b_d]. To identify non-linear relations between input variables, several extensions of CCA have been proposed, such as kernel CCA (KCCA) and deep CCA (Andrew et al., 2013). One key limitation of these models is that they typically require more samples than features, i.e. N > D_x, D_y. When there are more variables than samples, estimation based on the closed-form solution of the CCA problem (Eq. 1) breaks down (Suo et al., 2017). Moreover, in high-dimensional data some of the variables are often uninformative and should be omitted from the transformations. For these reasons, there has been growing interest in sparse CCA models. Sparse CCA (SCCA) (Waaijenborg et al., 2008; Hardoon & Shawe-Taylor, 2011; Suo et al., 2017) uses an ℓ1 penalty to encourage sparsity of the canonical vectors a_i and b_i. This not only removes the degeneracy inherent to N < D_x, D_y, but can also improve interpretability and performance. One caveat of this approach is its high computational complexity, which can be reduced by replacing the orthonormality constraints on the u_i and v_i with orthonormality constraints on the a_i and b_i.
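As a concrete reference point, the closed-form linear CCA solution described above can be computed by whitening the two views and taking an SVD of the whitened cross-covariance. The following numpy sketch illustrates this; the function name and the small ridge term `eps` (added for numerical stability) are our additions, not part of the paper:

```python
import numpy as np

def linear_cca(X, Y, d, eps=1e-6):
    """Classic linear CCA via SVD of the whitened cross-covariance.

    X: (Dx, N) and Y: (Dy, N) column-sample matrices, as in the text.
    Returns canonical vectors A (Dx, d), B (Dy, d) and the top-d correlations.
    """
    N = X.shape[1]
    X = X - X.mean(axis=1, keepdims=True)   # center each view
    Y = Y - Y.mean(axis=1, keepdims=True)
    Sxx = X @ X.T / (N - 1) + eps * np.eye(X.shape[0])  # lightly regularized covariances
    Syy = Y @ Y.T / (N - 1) + eps * np.eye(Y.shape[0])
    Sxy = X @ Y.T / (N - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S is symmetric PSD)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(T)
    A = inv_sqrt(Sxx) @ U[:, :d]   # canonical vectors a_1, ..., a_d
    B = inv_sqrt(Syy) @ Vt[:d].T   # canonical vectors b_1, ..., b_d
    return A, B, s[:d]             # s holds the canonical correlations
```

Note that forming and inverting Sxx and Syy is exactly what becomes ill-posed when N < D_x, D_y, which motivates the sparse variants discussed below.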
This procedure is known as simplified-SCCA (Parkhomenko et al., 2009; Witten et al., 2009), which enjoys a closed-form solution. There has been limited work on extending these models to sparse non-linear CCA. Specifically, there are two kernel-based extensions: two-stage kernel CCA (TSKCCA) by Yoshida et al. (2017) and SCCA based on the Hilbert-Schmidt Independence Criterion (SCCA-HSIC) by Uurtio et al. (2018). However, these models suffer from the same limitations as KCCA and do not scale to the high-dimensional regime. This paper presents a sparse CCA model that can be optimized using standard deep learning methodologies. The method combines the differentiable loss presented in DCCA (Andrew et al., 2013) with an approximate ℓ0 regularization term designed to sparsify the input variables of both X and Y. Our regularization relies on a recently proposed Gaussian-based continuous relaxation of Bernoulli random variables, termed gates (Yamada et al., 2020). The gates are applied to the input features to sparsify X and Y. The gate parameters are trained jointly via stochastic gradient descent to maximize the correlation between the representations of X and Y, while simultaneously selecting only the subsets of the most correlated input features. We apply the proposed method to synthetic data and demonstrate that it improves the estimation of the canonical vectors compared with SCCA models. We then use the method to identify informative variables in multichannel noisy seismic data and show its advantage over other CCA models.
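To make the gating mechanism concrete: in the Gaussian-based relaxation of Yamada et al. (2020), each feature is multiplied by a gate z_i = clip(μ_i + ε_i, 0, 1) with ε_i ~ N(0, σ²), and the expected number of open gates, the approximate ℓ0 term, is Σ_i Φ(μ_i / σ), where Φ is the standard Gaussian CDF. A minimal numpy sketch of these two ingredients (function names are ours, for illustration only):

```python
import numpy as np
from math import erf, sqrt

def gauss_cdf(x):
    """Standard Gaussian CDF Phi(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def sample_gates(mu, sigma, rng):
    """Gaussian-based relaxed Bernoulli gates: mu + noise, hard-clipped to [0, 1].
    The clipped values multiply the input features elementwise."""
    eps = rng.standard_normal(mu.shape) * sigma
    return np.clip(mu + eps, 0.0, 1.0)

def expected_l0(mu, sigma):
    """Differentiable surrogate for the expected number of open gates:
    P(z_i > 0) = Phi(mu_i / sigma), summed over all features."""
    return sum(gauss_cdf(m / sigma) for m in mu)
```

During training, the gate means μ would be treated as trainable parameters, with `expected_l0` added (scaled by a regularization weight) to the negated correlation loss; the clipping keeps exact zeros reachable, which is what produces feature selection.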

1.3. SPARSE CCA

Several authors have proposed solutions to the problem of recovering sparse canonical vectors. The key advantages of sparse vectors are that they enable identifying correlated representations even in the regime of N < D_x, D_y, and that they allow unsupervised feature selection. Following the formulation by Suo et al. (2017), SCCA can be described using the following regularized objective

a, b = argmin −Cov(a^T X, b^T Y) + τ_1 ‖a‖_1 + τ_2 ‖b‖_1, subject to ‖a^T X‖_2 ≤ 1, ‖b^T Y‖_2 ≤ 1,

where τ_1 and τ_2 are regularization parameters controlling the sparsity of the canonical vectors a and b. Note that the relaxed inequality constraints on a^T X and b^T Y make the problem bi-convex; however, if ‖a^T X‖_2 < 1 or ‖b^T Y‖_2 < 1, then the covariance in the objective is no longer equal to the correlation.
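Under the simplified-SCCA relaxation mentioned earlier (unit-norm constraints on a and b rather than on a^T X and b^T Y), this kind of ℓ1-penalized objective can be attacked by alternating soft-thresholding steps, warm-started from the leading singular vector of the cross-covariance. The following numpy sketch is in the spirit of Witten et al. (2009); it is an illustration, not the exact solver of Suo et al. (2017):

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: the proximal operator of the l1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def simplified_scca(X, Y, tau_a, tau_b, n_iter=50):
    """Alternating soft-thresholding for a simplified-SCCA-style objective.

    X: (Dx, N), Y: (Dy, N). Unit-norm constraints are placed on a and b,
    with a warm start from the leading right singular vector of Sxy.
    """
    N = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxy = Xc @ Yc.T / (N - 1)               # cross-covariance between views
    b = np.linalg.svd(Sxy)[2][0]            # warm start: leading right singular vector
    a = np.zeros(X.shape[0])
    for _ in range(n_iter):
        a = soft_threshold(Sxy @ b, tau_a)  # update a with b fixed, then renormalize
        if np.linalg.norm(a) > 0:
            a /= np.linalg.norm(a)
        b = soft_threshold(Sxy.T @ a, tau_b)
        if np.linalg.norm(b) > 0:
            b /= np.linalg.norm(b)
    return a, b
```

Each half-step is a convex problem with a closed-form solution, which is exactly the bi-convexity the relaxed constraints buy; larger τ values zero out more coordinates of a and b.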



Andrew et al. (2013) present a deep neural network that learns correlated representations. Their Deep Canonical Correlation Analysis (DCCA) extracts two non-linear transformations of X and Y with maximal correlation. DCCA trains two neural networks with a joint loss aiming to maximize the total correlation of the networks' outputs. The parameters of the networks are learned by applying stochastic gradient descent to the following objective:

θ*_X, θ*_Y = argmax_{θ_X, θ_Y} Corr(f(X; θ_X), g(Y; θ_Y)),

where θ_X and θ_Y are the trainable parameters, and f(X), g(Y) ∈ R^d are the desired correlated representations.
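For fixed network outputs, the DCCA correlation objective can be evaluated in closed form: it equals the sum of the top-d singular values of T = Σ_11^{−1/2} Σ_12 Σ_22^{−1/2}, computed from the (regularized) covariances of the two outputs (Andrew et al., 2013). A numpy sketch of this loss evaluation (training would differentiate it through the networks; the function name is ours):

```python
import numpy as np

def total_correlation(H1, H2, d, eps=1e-6):
    """Sum of the top-d canonical correlations between network outputs
    H1 (d1, N) and H2 (d2, N) -- the quantity DCCA maximizes."""
    N = H1.shape[1]
    H1 = H1 - H1.mean(axis=1, keepdims=True)
    H2 = H2 - H2.mean(axis=1, keepdims=True)
    S11 = H1 @ H1.T / (N - 1) + eps * np.eye(H1.shape[0])  # regularized output covariances
    S22 = H2 @ H2.T / (N - 1) + eps * np.eye(H2.shape[0])
    S12 = H1 @ H2.T / (N - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    # Singular values of T are the canonical correlations of the outputs
    return np.linalg.svd(T, compute_uv=False)[:d].sum()
```

In practice the networks are trained by minimizing the negative of this quantity over mini-batches, with the small ridge term `eps` keeping the matrix square roots well conditioned.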

