DEEP GATED CANONICAL CORRELATION ANALYSIS

Abstract

Canonical Correlation Analysis (CCA) models can extract informative correlated representations from multimodal unlabelled data. Despite their success, CCA models may break down when the number of variables exceeds the number of samples. We propose Deep Gated-CCA, a method for learning correlated representations based on a sparse subset of variables from two observed modalities. The proposed procedure learns two non-linear transformations and simultaneously gates the input variables to identify a subset of the most correlated variables. The non-linear transformations are learned by training two neural networks to maximize a shared correlation loss defined on their outputs. Gating is obtained by adding an approximate ℓ0 regularization term applied to the input variables. This approximation relies on a recently proposed continuous Gaussian-based relaxation of Bernoulli variables, which act as gates. We demonstrate the efficacy of the method on several synthetic and real examples. Most notably, the method outperforms other linear and non-linear CCA models.
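The gating mechanism described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names and the noise scale σ are assumptions, following the clipped-Gaussian relaxation of Bernoulli gates mentioned in the abstract.

```python
import numpy as np
from scipy.stats import norm


def sample_gates(mu, sigma=0.5, rng=None):
    """Sample relaxed Bernoulli gates z in [0, 1], one per input variable.

    Each gate is a clipped Gaussian, z_d = clip(mu_d + eps_d, 0, 1) with
    eps_d ~ N(0, sigma^2), so gradients can flow through the parameter mu_d.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(0.0, sigma, size=np.shape(mu))
    return np.clip(mu + eps, 0.0, 1.0)


def l0_penalty(mu, sigma=0.5):
    """Smooth surrogate for the l0 norm of the gates.

    The expected number of open gates is E[sum_d 1{z_d > 0}]
    = sum_d Phi(mu_d / sigma), where Phi is the standard Gaussian CDF.
    Adding this term to the loss pushes uninformative gates closed.
    """
    return norm.cdf(np.asarray(mu) / sigma).sum()
```

For strongly positive mu_d the gate saturates at 1 (variable kept); for strongly negative mu_d it saturates at 0 (variable discarded), and the penalty counts only the open gates.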

1. INTRODUCTION

Canonical Correlation Analysis (CCA) (Hotelling, 1936; Thompson, 2005) is a classic statistical method for finding maximally correlated linear transformations of two modalities (or views). Consider modalities X ∈ R^{D_x × N} and Y ∈ R^{D_y × N}, which are centered and consist of N samples with D_x and D_y features, respectively. CCA seeks canonical vectors a_i ∈ R^{D_x} and b_i ∈ R^{D_y} such that the projections u_i = a_i^T X and v_i = b_i^T Y, i = 1, ..., d, maximize the sample correlations between u_i and v_i, where the u_i (and the v_i) form an orthonormal basis, i.e.,

    a_i, b_i = argmax Corr(u_i, v_i)  subject to  <u_i, u_j> = δ_{i,j}, <v_i, v_j> = δ_{i,j}, i, j = 1, ..., d.    (1)

While CCA enjoys a closed-form solution via a generalized eigenpair problem, it is restricted to the linear transformations A = [a_1, ..., a_d] and B = [b_1, ..., b_d].

Linear and non-linear canonical correlation models have been widely used in unsupervised and semi-supervised learning. When d is set to a dimension satisfying d < D_x, D_y, these models find reduced-dimensional representations that may be useful for clustering, classification, or manifold learning in many applications, for example in biology (Pimentel et al., 2018), neuroscience (Al-Shargie et al., 2017), medicine (Zhang et al., 2017), and engineering (Chen et al., 2017). One key limitation of these models is that they typically require more samples than features, i.e. N > D_x, D_y. If there are more variables than samples, estimation based on the closed-form solution of the CCA problem (in Eq. 1) breaks down (Suo et al., 2017). Moreover, in high-dimensional data, some of the variables are often uninformative and should therefore be omitted.

In order to identify non-linear relations between input variables, several extensions of CCA have been proposed. Kernel methods such as Kernel CCA (Bach & Jordan, 2002), Non-parametric CCA (Michaeli et al., 2016), and Multi-view Diffusion Maps (Lindenbaum et al., 2020) learn the non-linear relations in reproducing kernel Hilbert spaces. These methods have several shortcomings: they are limited to a pre-designed kernel, they require O(N^2) computations for training, and they have poor interpolation and extrapolation capabilities. To overcome these limitations, Andrew et al. (2013) proposed Deep CCA, which learns parametric non-linear transformations of the input modalities X and Y. The transformations are learned by training two neural networks to maximize the total correlation between their outputs.
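As a concrete sketch, the total-correlation objective that Deep CCA maximizes can be computed from two N × d network outputs as below. This is illustrative NumPy code under the standard formulation (whitened cross-covariance), not the authors' implementation; the ridge term eps is an assumed regularizer for numerical stability.

```python
import numpy as np


def total_correlation(H1, H2, eps=1e-4):
    """Sum of the canonical correlations between two N x d representations.

    Centers both views, whitens each by its (ridge-regularized) covariance,
    and returns the trace norm (sum of singular values) of the whitened
    cross-covariance -- the quantity Deep CCA maximizes.
    """
    N, d = H1.shape
    H1 = H1 - H1.mean(axis=0)
    H2 = H2 - H2.mean(axis=0)
    S11 = H1.T @ H1 / (N - 1) + eps * np.eye(d)  # view-1 covariance
    S22 = H2.T @ H2 / (N - 1) + eps * np.eye(d)  # view-2 covariance
    S12 = H1.T @ H2 / (N - 1)                    # cross-covariance

    def inv_sqrt(S):
        # Inverse matrix square root via the symmetric eigendecomposition.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()
```

Feeding the same matrix as both views yields a value close to d (each canonical correlation is near 1), while two independent random views yield a value near 0; gradient-based training moves the networks toward the former.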

