ADVERSARIAL REPRESENTATION LEARNING FOR CANONICAL CORRELATION ANALYSIS

Abstract

Canonical correlation analysis (CCA) provides a framework to map multimodal data into a maximally correlated latent space. Deep versions of CCA replace the linear maps with deep transformations to enable more flexible correlated data representations; however, optimization of the CCA target requires computation over sufficiently large sample batches. Here, we present a deep, adversarial approach to CCA, adCCA, that can be efficiently solved by standard mini-batch training. We reformulate CCA under the assumption that the different modalities are embedded with identical latent distributions and derive a tractable deep CCA target. We implement the new target and the distribution constraint within an adversarial framework to efficiently learn the canonical representations. adCCA learns maximally correlated representations across modalities while preserving class information within individual modalities. Further, adCCA removes the need for feature transformation and normalization and can be applied directly to diverse modalities and feature encodings. Numerical studies show that the performance of adCCA is robust to data transformations, binary encodings, and corruptions. Together, adCCA provides a scalable approach to aligning data across modalities without compromising sample class information within each modality.

1. INTRODUCTION

Data samples can be measured with different modalities (e.g., image or text), encoded in different formats, and modeled by different distributions. Integrative analysis of multimodal data provides the opportunity in many machine learning tasks to combine partial information from each modality and achieve better performance than any single modality alone (Ngiam et al. (2011); Srivastava & Salakhutdinov (2012)). Canonical correlation analysis (CCA) (Thompson (1984)) is one of the most classical and general approaches for multimodal data integration. It learns linear mappings between data modalities that achieve maximal cross-modality correlations. Replacing the linear mappings in CCA with deep functions allows non-linear, flexible transformations and better-correlated representations. However, learning the deep CCA target function requires optimization over all data samples (Andrew et al. (2013)) or a sufficiently large data batch (Wang et al. (2015)), which is incompatible with the standard batch-based learning strategies widely used in deep learning and limits its power on large-scale datasets. Therefore, many recent deep CCA approaches (Wang et al. (2016); Dutton (2020); Karami & Schuurmans (2021)) sidestep the original CCA formulation and instead focus on learning a joint representation for the paired modalities using approaches that are compatible with mini-batch training (Appendix A). Here, we propose a multimodal adversarial learning framework for deep CCA learning: adCCA. Mathematical analysis provides an optimization target for CCA that is amenable to mini-batch training under the requirement that the different modality distributions are identical in latent space. adCCA formulates this requirement as a penalty function that, during optimization, brings the two latent distributions into alignment.
As the latent space representations converge (distribution-wise), maximizing the optimization target leads to highly correlated latent representations (sample-wise) for the two modalities, as illustrated with numerical experiments. Thus, adCCA is derived from a deep CCA framework that follows the original correlation target of CCA, yet can be directly optimized by mini-batch training. In more detail, adCCA learns representations in multiple steps. First, initial latent representations are provided by modality-specific autoencoders (AEs). Second, the Wasserstein distance is used to measure overall differences between the two latent distributions, and an inner product is used to measure sample-wise similarities. Then, adCCA uses an adversarial framework to minimize latent-space distribution differences and maximize sample-wise similarity (Goodfellow et al. (2014)). Together, the final output representations from the two modalities are expected to have similar latent distributions (from the Wasserstein distance terms) and high cross-modality sample similarity (from the inner product term), while still maintaining a faithful representation of the original features (from the AEs). The design of the adCCA framework has several advantages. One major advantage is flexibility in input features. Often in multimodal analysis, the total loss is composed of loss terms from different modalities (Wan et al. (2021); Andrew et al. (2013); Ngiam et al. (2011); Jain et al. (2005)). If features from diverse modalities have different distributions or data formats, the joint optimization can be biased towards a certain modality. In the adCCA framework, each modality representation is generated from a modality-specific encoder, and the two latent representations are related only indirectly through the adversarial network. Another major advantage is the ability to learn correlation without losing information in the original features.
In particular, deep CCA approaches use deep transformations to learn canonical representations with high correlation; in doing so, however, these representations may lose class information contained in the original feature space. This point is typically not considered in deep CCA frameworks. In adCCA, the use of AEs in learning forces the latent representations to retain a faithful reconstruction of the original data. We benchmark adCCA against 10 different methods, using both synthetic and real datasets. In the evaluation, we focused on two aspects. Cross-modality consistency: correlation of sample points in the latent space for both modalities (Appendix Fig. D.1B). Class preservation: ability of the latent representation to retain the original class information (Appendix Fig. D.1C). The ideal method should provide embeddings that are both correlated per sample and retain class structure across both modalities (Appendix Fig. D.1D). We show that representations from adCCA both achieve high cross-modality consistency in the joint latent space and preserve class information from the original feature distributions. We demonstrate that adCCA maintains stable performance across features of different modality types, feature distributions, and degradation levels. The major contributions of this work are: (1) a reformulation of CCA with deep transformations into a target that can be optimized through standard mini-batch training; (2) an adversarial learning framework for efficient canonical representation learning and an optimization strategy for stable adversarial training between two latent representations; (3) an expanded approach for evaluating the performance of CCA approaches on both cross-modality consistency and single-modality information preservation; and (4) an extensive test and demonstration of adCCA's performance.

2. ADVERSARIAL REPRESENTATION LEARNING FOR CCA

2.1 PRELIMINARY

Here, we consider a general framework for CCA. (An overview of existing CCA variants is provided in Appendix A.) Given a pair of feature vectors from two modalities, x ∈ R^p and y ∈ R^q, CCA learns two mapping functions f_1 and f_2 that maximize the correlation after mapping:

    max corr(z_x, z_y)  s.t.  z_x = f_1(x), z_y = f_2(y),    (1)

where z_x ∈ R and z_y ∈ R are the corresponding canonical variables. The transformation functions f_1 and f_2 can be linear mappings (in the original CCA), implicit kernel functions (in kernel CCA), or deep transformations (in deep CCA). These functions are learned through optimization over the entire dataset (Andrew et al., Appendix B.1), so mini-batch training strategies cannot be leveraged. Here, we aim to reformulate the CCA learning target into a form that enables optimization with general deep-learning optimizers. To this end, we relax the CCA learning target with an additional identical-latent-distribution assumption:

Theorem 1. Assuming the latent representations from modalities x and y follow the same distribution π with mean µ and variance σ^2, the optimization target of CCA (Eq. 1) can be rewritten as:

    max E_{(x,y)∼p_data}[z_x z_y]  s.t.  z_x = f_1(x), z_y = f_2(y), z_x ∼ π(µ, σ^2), z_y ∼ π(µ, σ^2),    (2)

where (x, y) ∼ p_data indicates paired modal features x and y drawn from the data.

Proof. Refer to Appendix B.2.

Because the latent distributions are assumed identical with the same fixed µ and σ^2, the new target function can be expressed simply as the average of the product of canonical variables. This formulation avoids the noted problems with mini-batch optimization in CCA, where µ and σ^2 are not fixed. We next extend the optimization target to learn k canonical variables simultaneously. Deep CCA (Appendix B.1) employs the singular value decomposition (SVD) to output canonical variables that are orthogonal to each other. This step requires the use of large data batch sizes.
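Theorem 1's reduction can be checked numerically: when the two canonical variables share the same fixed mean and variance, the correlation and the plain inner-product target agree, and the inner product, unlike the correlation, is estimated unbiasedly from mini-batches. A small NumPy check with synthetic numbers of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Paired 1-D canonical variables with identical marginals (mu = 0, sigma = 1).
n = 256 * 400
shared = rng.normal(size=n)
z_x = 0.8 * shared + 0.6 * rng.normal(size=n)  # unit variance by construction
z_y = 0.8 * shared + 0.6 * rng.normal(size=n)

corr = np.corrcoef(z_x, z_y)[0, 1]
inner = np.mean(z_x * z_y)  # E[z_x z_y], the Theorem 1 target
# With mu = 0 and sigma = 1 the two quantities coincide up to sampling error.

# Mini-batch estimates of E[z_x z_y] average back to the full-data value,
# whereas a correlation with per-batch mean/variance estimates would not.
batch_means = [np.mean(z_x[i:i + 256] * z_y[i:i + 256])
               for i in range(0, n, 256)]
```

This is what makes Eq. 2 compatible with standard mini-batch training: each batch contributes an unbiased estimate of the global objective.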
To achieve more efficient optimization, we forgo the requirement for orthogonality. Instead, we use two independent autoencoders to generate canonical representations. We add the autoencoders as constraint terms and extend Theorem 1 to a multi-variable form:

Corollary 1. Multidimensional canonical representations z_x ∈ R^k and z_y ∈ R^k are learned from two individual autoencoders with encoders f_1(x) and f_2(y) and decoders g_1(z_x) and g_2(z_y). Under the assumption that corresponding entries of the two latent representations z_x^(j), z_y^(j) follow the same latent distribution with mean µ and variance σ^2, the optimization target of CCA can be reformulated as:

    max E_{(x,y)∼p_data}[z_x^T z_y]
    s.t.  z_x = f_1(x), z_y = f_2(y),
          ∥x − g_1(z_x)∥_2^2 = 0, ∥y − g_2(z_y)∥_2^2 = 0,
          z_x^(j), z_y^(j) ∼ π(µ, σ^2) for j = 1 : k.    (3)

Proof. Refer to Appendix B.3.

Based on Corollary 1, the new target function for CCA has three requirements: 1) maximize the inner product between the latent representations of the two modalities; 2) require the two latent distributions from the two modalities to be the same; and 3) fix the mean and variance of each latent dimension to constants.

2.2. MODEL FORMULATION

These requirements can be satisfied using existing deep-learning paradigms. For (1), the inner product is a general term to measure sample similarities and is widely used in deep-learning approaches, such as simCLR (Chen et al. (2020)); for (2), distribution differences can be measured by the Wasserstein distance (Arjovsky et al. (2017)) or by other divergences used in machine learning tasks, such as image generation (Makhzani et al. (2015)); for (3), constant mean and variance can be achieved through batch normalization.

Next, we describe the adCCA framework (Fig. 1; Appendix C). Here, we consider paired modality features x and y collected from the same sample. First, we employ an autoencoder (AE) for each modality to learn the initial latent representations z_x and z_y:

    min_{φ_x, θ_x} E_{x∼p_x} ∥r^(x)_{φ_x}[g^(x)_{θ_x}(x)] − x∥_2^2,  min_{φ_y, θ_y} E_{y∼p_y} ∥r^(y)_{φ_y}[g^(y)_{θ_y}(y)] − y∥_2^2,    (4)

where g^(x)_{θ_x} and r^(x)_{φ_x} are the encoder and decoder for modality x with network parameter sets θ_x and φ_x (resp. g^(y)_{θ_y} and r^(y)_{φ_y} for modality y). The reconstruction loss encourages the AE bottleneck layer to faithfully encode the information contained in the original feature space.

Second, a Wasserstein distance is used to measure the distance between the latent distributions of the two modalities:

    W(p_{z_x}, p_{z_y}) = inf_{γ_z ∈ Π(p_{z_x}, p_{z_y})} E_{(z_x, z_y)∼γ_z}[∥z_x − z_y∥] = inf_{γ ∈ Π(p_x, p_y)} E_{(x,y)∼γ}[∥z_x − z_y∥]
    s.t.  z_x = g^(x)_{θ_x}(x), z_y = g^(y)_{θ_y}(y),    (5)

where p_{z_x}, p_{z_y} are the distributions of z_x and z_y, and Π(p_{z_x}, p_{z_y}) represents the space of joint distributions for z_x and z_y with marginal distributions p_{z_x} and p_{z_y} (the same notation applies to Π(p_x, p_y)). As previously shown (Arjovsky et al. (2017)), the Wasserstein distance can also be formulated as a supremum over 1-Lipschitz functions f_w:

    W(p_{z_x}, p_{z_y}) = sup_{∥f_w∥_L ≤ 1} E_{x∼p_x}[f_w(z_x)] − E_{y∼p_y}[f_w(z_y)].    (6)

Following the Wasserstein GAN structure (Arjovsky et al. (2017)), we minimize this distance in an adversarial paradigm:

    max_w min_{θ_x, θ_y} E_{x∼p_x}[f_w(z_x)] − E_{y∼p_y}[f_w(z_y)]  s.t.  z_x = g^(x)_{θ_x}(x), z_y = g^(y)_{θ_y}(y).    (7)

Here, the AE encoders g^(x)_{θ_x} and g^(y)_{θ_y} from above are reused as generative networks, and f_w is modeled by a discriminator network with parameter set w. Maximizing the function by optimizing the discriminator parameters w increases the measured difference between the two representations z_x and z_y, while minimizing the function by tuning the generator parameters decreases the distance by finding better encoder representations.

Third, we add the inner product to Eq. 7 and formulate the learning target used by adCCA:

    max_w min_{θ_x, θ_y} −E_{(x,y)∼p_data}[z_x^T z_y] + λ ( E_{x∼p_x}[f_w(z_x)] − E_{y∼p_y}[f_w(z_y)] )
    s.t.  z_x = g^(x)_{θ_x}(x), z_y = g^(y)_{θ_y}(y),    (8)

where λ > 0 is a multiplier used to balance the distribution penalty (given by the Wasserstein distance) with the sample pairwise-similarity term (given by the inner product of latent representations). Taken together, the overall adCCA model requires the optimization of Eqs. 4 and 8. We note that the two latent representations are provided by two modality-specific AEs and are not linked directly in the model. Thus, there is no requirement to preprocess the raw features x and y for balanced learning, and each AE can be replaced with a modality-specific structure (e.g., a convolutional network for images and an attention model for language).
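The per-batch losses of Eqs. 4 and 8 can be sketched in PyTorch. This is our own minimal illustration: the MLP encoders, the critic, and all layer sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, d_hidden=64):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

# Illustrative sizes: p = 20, q = 30, latent dimension k = 10.
enc_x, dec_x = mlp(20, 10), mlp(10, 20)  # AE for modality x
enc_y, dec_y = mlp(30, 10), mlp(10, 30)  # AE for modality y
critic = mlp(10, 1)                      # f_w, the discriminator

x = torch.randn(128, 20)  # a paired mini-batch
y = torch.randn(128, 30)
z_x, z_y = enc_x(x), enc_y(y)

# Eq. 4: modality-specific reconstruction losses.
rec = (((dec_x(z_x) - x) ** 2).sum(1).mean()
       + ((dec_y(z_y) - y) ** 2).sum(1).mean())

# Eq. 8, generator side: negative inner product plus the weighted
# Wasserstein estimate (the critic is maximized in its own step).
lam = 1.0
inner = (z_x * z_y).sum(1).mean()                 # E[z_x^T z_y]
w_dist = critic(z_x).mean() - critic(z_y).mean()  # E[f_w(z_x)] - E[f_w(z_y)]
gen_loss = -inner + lam * w_dist
```

Every term here is a plain batch average, which is what makes the objective compatible with standard mini-batch optimizers.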

2.3. INTUITIVE INTERPRETATION OF THE ADCCA LOSS FUNCTION

In this section, we provide intuition for how the formulated loss function yields correlated embeddings from multimodal data. In Eq. 8, the two Wasserstein distance terms force the two latent representations, z_x and z_y, to share similar distribution shapes. However, this overall distribution penalty does not align different modality features from the same sample. The inner product term explicitly encourages representations generated from paired modality features of the same sample to maximize their similarity. Together, adCCA encourages similarity across modalities in the latent space at both the distribution scale (Wasserstein distance) and the sample scale (feature correlation). The AEs in Eq. 4 encourage z_x and z_y to preserve information from each modality. This prevents the adversarial network from converging to trivial solutions. For example, z_x and z_y could converge to zero vectors, which satisfy both the distribution and similarity requirements but carry no information from the original data. By using both Eqs. 4 and 8, multimodal representations can be learned that both preserve the original information and achieve high cross-modality correlations.

2.4. MODEL OPTIMIZATION

Algorithm 1: adCCA optimization
    Data: {(x, y)}_n ∼ p_data, x ∈ R^p, y ∈ R^q
    Result: {(z_x, z_y)}_n, z_x ∈ R^k, z_y ∈ R^k

    # Initialize z_x and z_y with autoencoders
    for i ← 1 : epoch(AE) do
        φ_x, θ_x ← min E_{x∼p_x} ∥r^(x)_{φ_x}[g^(x)_{θ_x}(x)] − x∥_2^2;
        φ_y, θ_y ← min E_{y∼p_y} ∥r^(y)_{φ_y}[g^(y)_{θ_y}(y)] − y∥_2^2;
    end
    z_x = g^(x)_{θ_x}(x), z_y = g^(y)_{θ_y}(y);

    # Optimize z_x and z_y iteratively
    for i ← 1 : epoch(AD) do
        # y-step: fix z_x and optimize z_y
        φ_y, θ_y ← min E_{y∼p_y} ∥r^(y)_{φ_y}[g^(y)_{θ_y}(y)] − y∥_2^2;
        w ← max E_{x∼p_x}[f_w(z_x)] − E_{y∼p_y}[f_w(z_y)];
        θ_y ← min −E_{y∼p_y}[f_w(z_y)] − λ E_{(x,y)∼p_data}[z_x^T z_y];
        # x-step: fix z_y and optimize z_x
        φ_x, θ_x ← min E_{x∼p_x} ∥r^(x)_{φ_x}[g^(x)_{θ_x}(x)] − x∥_2^2;
        w ← max E_{x∼p_x}[f_w(z_x)] − E_{y∼p_y}[f_w(z_y)];
        θ_x ← min E_{x∼p_x}[f_w(z_x)] − λ E_{(x,y)∼p_data}[z_x^T z_y];
    end
    return z_x = g^(x)_{θ_x}(x), z_y = g^(y)_{θ_y}(y)

In the standard generative adversarial network framework (Goodfellow et al. (2014); Arjovsky et al. (2017)), we only need to optimize one generator and discriminate its output against real samples. However, in adCCA, we need to optimize two generative networks and match the distributions between both latent representations. During training, these two latent space representations change epoch by epoch, which creates a challenge for stable optimization. Here, we propose a multi-step training process for stable adCCA optimization in Algorithm 1 (illustrated in Appendix Fig. D.2). First, autoencoders are trained independently for each modality to provide an initial embedding. This enables adCCA to work with raw input feature data from different modalities and different distributions (see Section 3.1.2). Next, we fix the representation for modality x and optimize the generative network for modality y through the adversarial learning step (y-step). During this y-step, we optimize the autoencoder, discriminator and generator.
The discriminator identifies distributional differences, which are then minimized by "moving" the latent distribution of y towards the latent distribution of x. Meanwhile, the autoencoder ensures that the latent representation faithfully reflects information in the original data. Then, we fix the representation of y and optimize the generator for x (x-step). The x-step and y-step are performed iteratively.
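The y-step of Algorithm 1 can be sketched as follows. This is a hypothetical PyTorch implementation under our own naming; the x-step is symmetric, and weight clipping stands in for the 1-Lipschitz constraint as in the original Wasserstein GAN.

```python
import torch
import torch.nn as nn

def y_step(x, y, enc_x, enc_y, dec_y, critic, opts, lam=1.0, clip=0.01):
    """One y-step of Algorithm 1: z_x is held fixed while the modality-y
    autoencoder, the critic, and the modality-y encoder are updated in turn."""
    # 1) AE update for modality y (Eq. 4).
    opts["ae_y"].zero_grad()
    ((dec_y(enc_y(y)) - y) ** 2).sum(1).mean().backward()
    opts["ae_y"].step()

    # 2) Critic update: ascend E[f_w(z_x)] - E[f_w(z_y)] (negated for descent).
    opts["critic"].zero_grad()
    with torch.no_grad():
        z_x, z_y = enc_x(x), enc_y(y)
    (-(critic(z_x).mean() - critic(z_y).mean())).backward()
    opts["critic"].step()
    for p in critic.parameters():  # weight clipping approximates 1-Lipschitz
        p.data.clamp_(-clip, clip)

    # 3) Generator update: min -E[f_w(z_y)] - lam * E[z_x^T z_y].
    opts["enc_y"].zero_grad()
    z_y = enc_y(y)
    with torch.no_grad():
        z_x = enc_x(x)  # z_x is fixed during the y-step
    (-critic(z_y).mean() - lam * (z_x * z_y).sum(1).mean()).backward()
    opts["enc_y"].step()
```

An x-step would mirror this with the roles of the modalities swapped, matching the alternating schedule in Algorithm 1.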

3. EXPERIMENTS

In this section, we show the results of experiments using simulated and real data. We compared the proposed adCCA with other CCA approaches as well as multimodal learning frameworks that can complete the same task. Specifically, we compared adCCA to two classes of existing approaches. The first class (individual representation learning methods) provides an individual representation for each modality. This class contains 6 approaches: classical CCA; kernel CCA (radial basis function and polynomial kernels) (Perry et al. (2021)); CCA with deep transformations (DCCA) (Andrew et al. (2013)); simCLR (Chen et al. (2020)), which employs the same encoder to learn similar representations from multimodal inputs; and contrastive autoencoders (contrast AE), which use a contrastive loss to constrain similarities in the latent space (and can be viewed as adCCA without the distribution penalty). The second class (joint representation learning methods) provides a combined representation for both modalities. This class contains 4 approaches: CCA with the latent distribution constraint under a variational framework (VCCA, and VCCA-p with private latents) (Wang et al. (2016)) or an adversarial framework (ACCA) (Dutton (2020)), and deep probabilistic CCA (DPCCA) (Karami & Schuurmans (2021)). Detailed configurations are provided in Appendix E.1.

3.1. EXPERIMENTS WITH SYNTHETIC DATA

We first performed simulation studies by generating synthetic multimodal features with known sample labels. Here, we considered 10 ground-truth classes in the data and generated synthetic features using multi-layer perceptron networks with random mapping matrices for the two modalities (Appendix E.2). Then, we used the multimodal approaches to learn representations in 10-dimensional latent spaces.
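The generation process can be sketched as follows. This is our hypothetical reconstruction of the Appendix E.2 setup; the class counts match the text, but the layer width, scales, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# 10 ground-truth classes embedded in a k-dimensional latent space,
# pushed into each modality through a random one-hidden-layer map.
n, k, p, q = 1000, 10, 50, 80
labels = rng.integers(0, 10, size=n)
centers = rng.normal(scale=3.0, size=(10, k))
latent = centers[labels] + rng.normal(size=(n, k))  # class-structured latent

def random_mlp(z, d_out, rng):
    """Random-weight MLP with a tanh nonlinearity (no training involved)."""
    w1 = rng.normal(size=(z.shape[1], 64))
    w2 = rng.normal(size=(64, d_out))
    return np.tanh(z @ w1) @ w2

x = random_mlp(latent, p, rng)  # modality-x features
y = random_mlp(latent, q, rng)  # modality-y features
```

Because both modalities are deterministic functions of the same class-structured latent, paired samples share class identity while their feature spaces differ arbitrarily.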

3.1.1. BASELINE COMPARISONS

We set high feature drop-out rates for modality x (90% feature missing ratio and 0.1 noise variance) and high noise levels for modality y (10% feature missing ratio and 0.5 noise variance). Based on visualization of the raw feature spaces using UMAP (Becht et al. (2019)) (Fig. 2, left), we observed that the high drop-out rates in x degraded the data class information more severely than the high noise levels in y. We then tested the ability to map both modalities to the same latent space. First, we evaluated the degree of correlation between the latent representations of the two modalities across different methods. We co-embedded the two modal representations using UMAP. (We note that ACCA, VCCA(-p) and DPCCA can only output a joint representation and were not evaluated in this comparison.) For most approaches, representations from different modalities were mixed in the latent space (Fig. 2, right). Quantitative assessment in 10-dimensional latent spaces using correlation coefficients and the entropy of mixing (Appendix E.3) showed that CCA, DCCA and the proposed adCCA had reasonable performance in combining representations from different modalities (Table 1). Kernel CCA also achieved good performance with an appropriate kernel choice (the RBF kernel in this case). Second, we evaluated whether the representations of the two modalities in the latent space reflected sample clusters present in the original data. It was evident from 2D representation embeddings with the original class labels (Fig. 3) that DCCA and adCCA were best suited to preserve the cluster structure of the original data.
We quantified the quality of the latent representations in reflecting true sample classes using three measures: 1) clustering the latent representations and comparing cluster identities with the true class labels using the adjusted Rand index (ARI); 2) calculating between- and within-class distances using the true labels and quantifying sample compactness using the Silhouette coefficient; and 3) constructing simple classifiers in the latent space using the true labels and reporting classification accuracy. Details of these measures are provided in Appendix E.3. The results of these evaluations (Table 2) confirmed that DCCA and adCCA performed well at retaining class information from the original data. While some of the other approaches could align multimodal representations in the latent space (i.e., high correlation; Table 1), they did so at the cost of losing spatial coherency present in the original features (Fig. 3, Table 2). In general, higher correlation need not lead to better class predictions for classical approaches, as their only learning target is the maximization of the correlation between the two representations. To achieve a high correlation, the feature information from the original data can be greatly compromised (an illustration is provided in the Appendix). Overall, our experiments suggest that the adCCA framework can achieve a balance between cross-modality consistency (encouraged by the adversarial loss) and single-modality preservation (encouraged by the autoencoder loss). We provide a 2D illustration of how adCCA gradually aligns the two modalities in the latent space during training (Appendix D.3), as well as a systematic structure test highlighting the impact of the key components of the adCCA model architecture in achieving high correlation and classification performance (Appendix Table F.1).
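The entropy-of-mixing score used above (Appendix E.3) can be sketched as follows. This is our brute-force interpretation of the metric, with the neighborhood size as a free parameter; the paper's exact definition may differ.

```python
import numpy as np

def entropy_of_mixing(z_x, z_y, n_neighbors=30):
    """For each point in the co-embedding, the entropy of modality labels
    among its nearest neighbors: ~1 bit when the two modalities are well
    mixed, ~0 when they occupy separate regions of the latent space."""
    z = np.vstack([z_x, z_y])
    mod = np.r_[np.zeros(len(z_x)), np.ones(len(z_y))]  # modality labels
    ents = []
    for i in range(len(z)):
        d = np.linalg.norm(z - z[i], axis=1)
        nn = np.argsort(d)[1:n_neighbors + 1]  # skip the point itself
        p = mod[nn].mean()                     # fraction from modality y
        ents.append(0.0 if p in (0.0, 1.0)
                    else -(p * np.log2(p) + (1 - p) * np.log2(1 - p)))
    return float(np.mean(ents))
```

Two overlapping point clouds score near 1 bit, while two well-separated clouds score near 0, matching the intuition that a good co-embedding mixes the modalities.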

3.1.2. PERFORMANCE WITH INPUT TRANSFORMATIONS

A known challenge in the integrative analysis of multimodal data is balancing contributions from different modalities, especially when data are encoded in different formats or the data distributions are unknown. In the adCCA framework, to enable flexibility of input encoding, independent AEs construct latent encodings for each modality, and the representations are then aligned indirectly through the adversarial network. To test the stability of performance, we considered two types of transformations of the raw features y used in the previous experiments: 1) a feature distribution transformation, applying two sequential log transformations to entry values; and 2) a feature encoding transformation, mapping features into binary codes with locality-sensitive hashing (Appendix E.2). We evaluated how these transformations changed the consistency between the two modal representations (i.e., correlation) as well as the preservation of the original data classes (i.e., classification accuracy). From Fig. 4, we observed that some methods had stable class prediction scores but unstable correlation (e.g., DCCA), while others had unstable class prediction scores but stable correlation (e.g., KCCA-RBF). A few were stable under all tested measurements (e.g., adCCA, CCA, KCCA-poly, VCCA-p). The joint representation learning approaches (ACCA, VCCA and DPCCA) showed strong accuracy changes under the log transform, implying that differing distributions of multimodal features can impede effective information combination. Notably, adCCA was stable under both transforms and performed well overall (Fig. 4, upper right quadrants of the top scatter plots). Full results are provided in Appendix Tables F.2, F.3, F.4 and F.5.
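As an illustration of the binary encoding transformation, random-hyperplane hashing is a standard locality-sensitive hashing scheme; the sketch below is our stand-in and need not match the exact Appendix E.2 procedure.

```python
import numpy as np

def random_hyperplane_hash(features, n_bits=64, seed=0):
    """Binary-code each feature vector by the sign of random projections.
    Nearby vectors (small angle) agree on most bits, so local structure
    survives even though every feature becomes a 0/1 code."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(features.shape[1], n_bits))
    return (features @ planes > 0).astype(np.uint8)
```

After this transformation, modality y has a completely different value range and distribution from modality x, which is exactly the mismatch the independent-AE design is meant to absorb.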

3.2. EXPERIMENTS WITH REAL DATA

Next, we performed evaluations on real datasets with multimodal features. We considered a multi-view MNIST dataset, where two views of each digit are related through transformations, and a CITE-seq multimodal dataset on spleen and lymph node tissue, where gene and protein features were measured directly from single cells.

3.2.1. MNIST: DIGITS WITH TWO TRANSFORMED VIEWS

For each digit, we applied random image transformations (rotation, translation, scaling, shearing and random pixel corruption) to generate two views of the same digit image (Appendix E.4). We then learned single or joint representations for the modalities. We benchmarked how well representations from the two views mixed in the latent space and how well these representations reflected the different digit classes. From the results (Fig. 5A), we found that both the classical and kernel CCA approaches showed compromised performance. DCCA and adCCA still maintained high consistency between the two views as well as class information. Further, the runtime comparison (Appendix Fig. D.4) showed that adCCA was slower than VCCA, DPCCA and simCLR due to its multi-step training, but faster than the classical approaches KCCA and DCCA. We selected these two top approaches and stress-tested them with expanded transformation ranges of 2-fold or 3-fold (Appendix E.4). Under these increasing degradation levels, adCCA exhibited the most robust performance, with less loss in correlation and accuracy (Fig. 5B). Finally, we tested performance on human spleen and lymph node cells measured by CITE-seq (Gayoso et al. (2021)). This technology profiles genes and proteins simultaneously from the same cells. After data preprocessing (Appendix E.5), 9,264 cells remained; modality x includes 2,000 genes and modality y includes 112 proteins. It has been reported that the gene data follow a negative binomial distribution, while the protein data follow a negative binomial mixture distribution (Gayoso et al. (2021)). Transformations of the raw data may reduce biological signal. Therefore, we evaluated how well the different computational methods could directly integrate these two types of data. Prediction results were based on 5-fold cross-validation using a linear SVM. We note that simCLR could not be evaluated because of the unequal modal dimensions.
We benchmarked using correlations of the multimodal representations as well as their prediction accuracies on major classes (5) and subclasses (35) of cell types (Table 3). For correlation, most CCA approaches showed reasonable performance in consistency of the latent embeddings. However, approaches outputting joint representations (ACCA and VCCA) had better prediction performance. adCCA maintained both high correlation and high prediction accuracy.

4. DISCUSSION AND FUTURE WORKS

In this work, we formulated the CCA task under the identical latent distribution assumption and derived the adCCA framework, which can be efficiently solved by standard mini-batch optimizers. The training of adCCA includes both a cross-modality consistency constraint and single-modality information preservation. Analysis of both simulated and real datasets suggests that adCCA provides an effective approach for learning correlated latent representations while maintaining original class information. Compared with the original CCA framework, adCCA has several limitations and directions for future exploration. First, adCCA's latent representations are output from independent encoders and therefore cannot maintain orthogonality of the representations. Second, adCCA's latent representations are limited by the AE framework, though the AE can be replaced with more general (such as attention-based) or modality-specific representation learning frameworks to improve performance. Third, adCCA training involves simultaneous optimization of the GAN and AE, which can be time-consuming to reach a balanced state. Finally, the inner product used to constrain similarities between modalities in adCCA could be replaced by error models (such as L_2 or Huber loss) to enforce stronger similarity constraints. Nevertheless, adCCA provides a starting point for scalable alignment of data across modalities that preserves structure within each modality.

APPENDIX A RELATED WORKS

In the development of CCA, early improvements (Lai & Fyfe (2000); Hardoon et al. (2004)) focused on employing kernels to generalize the model to non-linear transformations. DCCA (Andrew et al. (2013)) introduced deep learning to CCA and replaced the core matrix transformation with a multi-layered neural network, and DTCCA (Wong et al. (2021)) further extended this to more than two views. With progress in variational neural networks (Kingma & Welling (2013)), variational CCA (VCCA) (Wang et al. (2016)) re-formulated CCA under a graphical model and proposed a variational CCA using a generative framework. VCCA assumed multimodal data were generated from a shared latent representation and solved the model under the variational autoencoder framework. Similarly, deep probabilistic CCA (DPCCA) (Karami & Schuurmans (2021)) took a probabilistic interpretation of CCA and decomposed the multimodal features into a shared distribution and a modality-specific distribution. We note that VCCA and DPCCA differ from the original CCA and DCCA, as they combine all modalities into a joint representation instead of learning a unique transformation for each modality. Our proposed adCCA is different from these approaches in that our model is strictly formulated within the classical CCA framework. The generative adversarial network (GAN) (Goodfellow et al. (2014)) framework was originally proposed to learn the generative process underlying a data distribution. The adversarial autoencoder (Makhzani et al. (2015)) was the first work to use an adversarial loss to match an autoencoder's latent representation to an arbitrary prior distribution. A combination of CCA and GAN was used in ACCA (Dutton (2020)), which can be viewed as an extension of variational CCA: instead of using a Kullback-Leibler (KL) divergence to constrain the prior and posterior distributions in latent space, ACCA used an adversarial loss to match the latent representation to arbitrary priors. DACCA (Fan et al. (2020)) concatenates multiview generation and deep CCA for data generation and representation learning. Most adversarial learning frameworks focus on constraining the distributions between generated and real samples, rather than between latent representations from different modalities. Our proposed adCCA differs from existing approaches in that it directly uses the adversarial loss to constrain the latent distributions of multimodal features. We summarize the major differences in representation learning and training among the CCA variants in Appendix Table F.6.

APPENDIX B MATHEMATICAL FORMULATIONS

B.1 OPTIMIZATION OF THE GENERAL CCA FRAMEWORK

Let Z_x ∈ R^{n×k} and Z_y ∈ R^{n×k} be mean-centered, matrix-form representations of the two modalities produced by functions f^(x)_{w_x}(X) and f^(y)_{w_y}(Y) parameterized by w_x and w_y, where X ∈ R^{n×p} and Y ∈ R^{n×q}. The correlation can be formulated as a matrix trace norm:

    corr(Z_x, Z_y) = ∥S∥_trace,  S = Σ_xx^{−1/2} Σ_xy Σ_yy^{−1/2},    (B.1)

where Σ_xx = (1/(n−1)) Z_x^T Z_x, Σ_yy = (1/(n−1)) Z_y^T Z_y, and Σ_xy = (1/(n−1)) Z_x^T Z_y. Based on Andrew et al. (2013), the optimization of w_x and w_y requires the gradient of the correlation to be calculated:

    ∂corr(Z_x, Z_y)/∂w_x = (∂corr(Z_x, Z_y)/∂Z_x) · (∂Z_x/∂w_x) = (1/(n−1)) (2 M_xx Z_x + M_xy Z_y) · (∂f^(x)_{w_x}/∂w_x),    (B.2)

where M_xy = Σ_xx^{−1/2} U V^T Σ_yy^{−1/2} and M_xx = −(1/2) Σ_xx^{−1/2} U D U^T Σ_xx^{−1/2}, with U and V from the singular value decomposition (SVD) of S, S = U D V^T. The term ∂corr(Z_x, Z_y)/∂w_y is formulated analogously. The calculation in Eq. B.2 requires an SVD over all samples at every training iteration and is therefore not tractable for standard mini-batch training.
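Eq. B.1 can be computed directly, which also makes the batch-size issue concrete: the trace norm requires covariance estimates and an SVD over the full sample. A NumPy sketch with our own helper names:

```python
import numpy as np

def cca_corr_trace_norm(Z_x, Z_y):
    """Total canonical correlation of Eq. B.1: the trace norm (sum of
    singular values) of S = Sxx^{-1/2} Sxy Syy^{-1/2} on centered data."""
    n = Z_x.shape[0]
    Z_x = Z_x - Z_x.mean(0)
    Z_y = Z_y - Z_y.mean(0)
    Sxx = Z_x.T @ Z_x / (n - 1)
    Syy = Z_y.T @ Z_y / (n - 1)
    Sxy = Z_x.T @ Z_y / (n - 1)

    def inv_sqrt(M):
        # Symmetric inverse square root via eigendecomposition.
        vals, vecs = np.linalg.eigh(M)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    S = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(S, compute_uv=False).sum()
```

For k-dimensional representations the value is bounded by k, attained when the two views are perfectly correlated; every quantity inside depends on the full covariance matrices, which is why mini-batch estimates of this objective are unreliable.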

B.2 PROOF OF THEOREM 1

Proof. $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$ represent a pair of random variables with $p$ and $q$ dimensions in two modalities. $z_x = f_1(x) \in \mathbb{R}$ and $z_y = f_2(y) \in \mathbb{R}$ map the modality features to a pair of canonical variables through transformations $f_1$ and $f_2$. With that, a general optimization target for CCA can be formulated as:

$$f_1^*, f_2^* = \arg\max_{f_1, f_2} \mathrm{corr}(z_x, z_y) = \arg\max_{f_1, f_2} \frac{E[(z_x - \mu_x)(z_y - \mu_y)]}{\sigma_x \sigma_y} = \arg\max_{f_1, f_2} \frac{E(z_x z_y) - \mu_x \mu_y}{\sigma_x \sigma_y}, \tag{B.3}$$

where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of $z_x$ (resp. $\mu_y$ and $\sigma_y$ for $z_y$). Here, $z_x$ and $z_y$ are assumed to follow the same distribution $\pi$ with shared mean $\mu$ and standard deviation $\sigma$, which rewrites the target as:

$$f_1^*, f_2^* = \arg\max_{f_1, f_2} \frac{E(z_x z_y) - \mu_x \mu_y}{\sigma_x \sigma_y} = \arg\max_{f_1, f_2} \frac{E(z_x z_y) - \mu^2}{\sigma^2} \stackrel{\mu,\, \sigma \in C}{=} \arg\max_{f_1, f_2} E(z_x z_y), \tag{B.4}$$

where $C$ is the set of constants.

B.3 PROOF OF COROLLARY 1

Proof. $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$ are random variables with $p$ and $q$ dimensions from two modalities. Two canonical representation vectors $z_x$ and $z_y$ with $k$ dimensions are learned by two autoencoders and constrained by the correct reconstruction of the original features. The $j$th entries of $z_x$ and $z_y$ follow the latent distribution with fixed mean $\mu$ and variance $\sigma^2$. Following Theorem 1, the learning target for CCA with multiple canonical variables can be generalized as:

$$f_1^*, f_2^* = \arg\max_{f_1, f_2} \sum_{j=1}^{k} \mathrm{corr}\big(z_x^{(j)}, z_y^{(j)}\big) = \arg\max_{f_1, f_2} \sum_{j=1}^{k} E\big(z_x^{(j)} z_y^{(j)}\big) = \arg\max_{f_1, f_2} \sum_{j=1}^{k} \frac{1}{n} \sum_{i=1}^{n} z_x^{(i,j)} z_y^{(i,j)} = \arg\max_{f_1, f_2} \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{k} z_x^{(i,j)} z_y^{(i,j)} \right) = \arg\max_{f_1, f_2} \frac{1}{n} \sum_{i=1}^{n} z_x^{(i,:)T} z_y^{(i,:)} = \arg\max_{f_1, f_2} E\big[z_x^T z_y\big], \tag{B.5}$$

where $z_x^{(j)}$ is the $j$th entry of the vector; $z_x^{(i,j)}$ is the $j$th entry of the representation learned from sample $i$; $z_x^{(i,:)}$ is the whole representation vector of sample $i$ (resp. for $z_y$).
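The practical upshot of Eq. B.5 is that the objective decomposes over samples, so a mini-batch mean is an unbiased estimate of the full-data target. A minimal numpy sketch (sizes, seed, and noise level are illustrative):

```python
import numpy as np

def cca_inner_product_loss(Zx, Zy):
    """Negated Eq. B.5 objective, -E[z_x^T z_y], estimated on any batch.

    Unlike the SVD-based gradient of Eq. B.2, this is a plain per-sample
    average, so mini-batch estimates are unbiased for the full objective.
    """
    return -np.mean(np.sum(Zx * Zy, axis=1))

rng = np.random.default_rng(1)
Z_full_x = rng.normal(size=(1000, 8))
Z_full_y = Z_full_x + 0.1 * rng.normal(size=(1000, 8))  # correlated views

full = cca_inner_product_loss(Z_full_x, Z_full_y)       # full-data value
idx = rng.choice(1000, size=64, replace=False)          # one mini-batch
batch = cca_inner_product_loss(Z_full_x[idx], Z_full_y[idx])
```

The mini-batch value `batch` fluctuates around `full` with the usual $O(1/\sqrt{B})$ sampling error, which is exactly what makes standard stochastic optimizers applicable.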

C BACKGROUND

C.1 AUTOENCODERS

Autoencoders are a general unsupervised representation learning framework. Given a sample feature vector $x \in \mathbb{R}^p$, an autoencoder learns a compact representation $z \in \mathbb{R}^q$ with an encoder network such that the original feature can be faithfully reconstructed by a decoder network. The whole framework is optimized by minimizing the reconstruction error:

$$\min_{\phi, \theta} E_{x \sim p_x} \big\| r_\phi[g_\theta(x)] - x \big\|_2^2, \tag{C.1}$$

where $g_\theta$ and $r_\phi$ are the encoder and decoder networks parameterized by $\theta$ and $\phi$, respectively, and $p_x$ is the distribution of the input. The representation is the output of the encoder, $z = g_\theta(x)$.
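As a concrete sketch of Eq. C.1, the snippet below trains a linear autoencoder by plain gradient descent on rank-deficient data. This is illustrative only: the paper's encoders are deep networks, and the sizes, learning rate, and iteration count here are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# rank-5 data in 20 dimensions, scaled to roughly unit variance
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 20)) / np.sqrt(5)

W_enc = 0.1 * rng.normal(size=(20, 5))   # linear encoder g_theta
W_dec = 0.1 * rng.normal(size=(5, 20))   # linear decoder r_phi
lr, n = 0.01, len(X)

mse0 = np.mean((X @ W_enc @ W_dec - X) ** 2)   # error before training
for _ in range(3000):
    Z = X @ W_enc                  # latent code z = g_theta(x)
    E = Z @ W_dec - X              # reconstruction residual
    grad_dec = (Z.T @ E) / n       # gradient of Eq. C.1 w.r.t. phi
    grad_enc = (X.T @ (E @ W_dec.T)) / n   # ... w.r.t. theta
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
mse = np.mean((X @ W_enc @ W_dec - X) ** 2)    # error after training
```

Because the latent dimension matches the data rank, the reconstruction error drops well below its initial value; with deep encoders the same objective is optimized by backpropagation instead of these hand-written gradients.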

C.2 WASSERSTEIN GAN

The Wasserstein GAN is an improved GAN framework: it replaces the core classification network with a critic that uses the Wasserstein distance to measure the difference between the generated and real sample distributions. Given samples $x$ and a prior distribution $z \sim p_z$, the Wasserstein GAN is learned via a two-player minimax optimization:

$$\min_\theta \max_w \; E_{x \sim p_x} f_w(x) - E_{z \sim p_z} f_w[g_\theta(z)], \tag{C.2}$$

where $g_\theta$ and $f_w$ are the generator and discriminator (critic) with parameters $\theta$ and $w$; $p_x$ is the actual data distribution and $p_z$ is the prior distribution used to generate new samples. For Eq. C.2 to estimate the Wasserstein-1 distance, $f_w$ must be constrained to be 1-Lipschitz, e.g., via weight clipping or a gradient penalty.

Under review as a conference paper at ICLR 2023
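Eq. C.2 admits a quick numerical sanity check in one dimension, where the Wasserstein-1 distance between equal-size empirical samples has a closed form over sorted values. The sketch below (illustrative; the linear critic stands in for $f_w$ and is not the paper's network) also shows the Kantorovich duality that Eq. C.2 exploits:

```python
import numpy as np

def w1_empirical(a, b):
    """Closed-form empirical Wasserstein-1 in 1D: mean absolute
    difference of the sorted (equal-size) samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, size=10_000)   # "real" samples
y = rng.normal(loc=2.0, size=10_000)   # "generated" samples

w1 = w1_empirical(x, y)                # close to the true W1 = 2 (a pure shift)
# Any 1-Lipschitz critic lower-bounds W1 (Kantorovich duality);
# the linear critic f(t) = -t is tight for a pure location shift.
lower = np.mean(-x) - np.mean(-y)
```

In the adversarial training of Eq. C.2, the inner maximization searches over critics to tighten exactly this lower bound, while the generator minimizes the resulting distance estimate.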

D SUPPLEMENTARY FIGURES

[Figure D.4 panel: runtime (s) per method — CCA, KCCA-poly, KCCA-RBF, DCCA, simCLR, Contrast AE, ACCA, VCCA, VCCA-p, DPCCA, adCCA.]

E EXPERIMENT DETAILS

E.1 COMPARISON OF APPROACHES

Here we set the latent dimension to 10 for all approaches. For deep-learning-based methods, we used the same encoding network [1024, 512, 256, 10] to learn the latent representation.

CCA: we used the multiview CCA function in the mvlearn python package 1 (version 0.5.0) with regularization parameter regs = 0.5.

KCCA: kernel CCA is implemented via the KMCCA function in the mvlearn python package. For the radial basis function (RBF) kernel, we set γ = 1, regs = 0.01, and sval_thresh = 1e-5. For the polynomial kernel, we set the kernel parameters to degree = 2, coef0 = 0.1, and regs = 0.01.

DCCA: we used the DCCA function in mvlearn and set the training epochs to 500. Default values were used for all other parameters.

E.3 EVALUATION METRIC

Entropy of mixing: we stack the latent representations from $x$ and $y$ as $[Z_x^T; Z_y^T]^T$. In the combined data, we identify the 100 nearest neighbors of each sample and calculate the proportion of neighbors from the same or the other modality. The entropy of mixing is given by:

$$-\big[p \log p + (1 - p) \log(1 - p)\big], \tag{E.3}$$

where $p$ is the proportion of neighbors belonging to modality $x$.

Classification analysis: we employed a 5-fold classification accuracy evaluation. In each data split, we fit a linear support vector machine (SVM) or a k-nearest-neighbor (kNN) classifier on the training set (80%) and calculated classification accuracy on the test set (20%). We used the sklearn implementations of both classifiers. For kNN, we used 10 nearest neighbors to build the classifier.

Silhouette score: we used the implementation from sklearn. Given the true sample classes, $d_a$ is the average distance between samples in the same class, and $d_b$ is the average distance between a sample and the samples in its nearest other class. The score is defined as:

$$\frac{d_b - d_a}{\max(d_a, d_b)}, \tag{E.4}$$

where a positive value indicates that the within-class distance is smaller than the between-class distance.
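The entropy-of-mixing metric above can be implemented directly with a brute-force neighbor search; the sketch below is illustrative (sizes, the two synthetic test cases, and the numerical clipping are all our choices):

```python
import numpy as np

def entropy_of_mixing(Zx, Zy, n_neighbors=100):
    """Eq. E.3 averaged over all samples of the stacked data [Zx; Zy].

    For each sample, p is the fraction of its n_neighbors nearest
    neighbors (Euclidean, excluding itself) from modality x.
    """
    Z = np.vstack([Zx, Zy])
    labels = np.array([0] * len(Zx) + [1] * len(Zy))  # 0 = modality x
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                      # exclude self
    nn = np.argsort(d2, axis=1)[:, :n_neighbors]
    p = (labels[nn] == 0).mean(axis=1)
    p = np.clip(p, 1e-12, 1 - 1e-12)                  # avoid log(0)
    return float(np.mean(-(p * np.log(p) + (1 - p) * np.log(1 - p))))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(300, 10))
well_mixed = entropy_of_mixing(mixed[:150], mixed[150:])        # ~log 2
separated = entropy_of_mixing(mixed[:150], mixed[150:] + 50.0)  # ~0
```

Well-aligned modalities give an entropy near log 2 ≈ 0.69 (neighbors split evenly between modalities), while separated modalities give an entropy near 0.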

E.4 MNIST EXPERIMENTS

We applied rotation (range [-10°, 10°]), translation (range [0, 0.1]), scaling (range [0.9, 1.1]), shearing (range [0°, 15°]), and random pixel corruption (10%) to the digit images. Each view was obtained through this transformation sequence with randomly chosen transformation parameters. After the transformation, the two views were input to each approach, and the results were compared with the original digit labels. In the experiments with increased corruption levels, we further widened the range of each transform two- and three-fold (e.g., the rotation angle range was broadened to 2·[-10°, 10°] and 3·[-10°, 10°]) and then tested accuracy.
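The per-view transformation sampling can be sketched as the composition of an affine matrix; the snippet below is illustrative only (the `sample_view_affine` helper, the composition order, and the `level` scaling are our assumptions, and the actual image resampling step is omitted):

```python
import numpy as np

def sample_view_affine(rng, level=1):
    """Sample one view's transform parameters at a given corruption level
    and compose them into a 2x3 affine matrix (sketch; the paper's exact
    composition order is not specified here)."""
    deg = rng.uniform(-10, 10) * level              # rotation, degrees
    tx, ty = rng.uniform(0, 0.1, size=2) * level    # translation
    scale = 1.0 + rng.uniform(-0.1, 0.1) * level    # isotropic scaling
    shear = np.deg2rad(rng.uniform(0, 15) * level)  # shear angle
    t = np.deg2rad(deg)
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])         # rotation
    Sh = np.array([[1.0, np.tan(shear)],
                   [0.0, 1.0]])                     # shear
    A = scale * (R @ Sh)                            # 2x2 linear part
    return np.hstack([A, [[tx], [ty]]])             # 2x3 affine matrix

rng = np.random.default_rng(0)
M1 = sample_view_affine(rng)   # parameters for view 1
M2 = sample_view_affine(rng)   # independent parameters for view 2
```

Each view draws its own parameters, so the two views of a digit differ geometrically while sharing the underlying label; raising `level` widens every parameter range, matching the increased-corruption experiments.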

E.5 CITESEQ EXPERIMENTS

The dataset was downloaded from a previous publication 7. We used the DLN111-D1 batch, which includes 9,264 cells in total from spleen and lymph node. Each cell includes genes (> 20,000 dimensions) and surface proteins (112 dimensions). For the gene modality, we selected the top 2,000 variable genes as input; for the protein modality, we input all proteins. Cells are classified into 5 major cell types and 35 cell subtypes.

F SUPPLEMENTARY TABLES



1 https://mvlearn.github.io
2 https://github.com/Spijkervet/SimCLR/blob/master/simclr
3 https://github.com/bcdutton/AdversarialCanonicalCorrelationAnalysis
4 https://github.com/Karami-m/Deep-Probabilistic-Multi-View
5 https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/wgan_gp/wgan_gp.py
6 http://ethen8181.github.io/machine-learning/recsys/content_based/lsh_text.html
7 https://github.com/YosefLab/totalVI_reproducibility/tree/master/data



Figure 1: adCCA framework and loss function design.

(resp. g^(y)_θy and r^(y)_φy for modality y); p_x and p_y are the data distributions of x and y. The latent representations are given by the modality-specific encoders as z_x = g^(x)_θx(x) and z_y = g^(y)_θy(y).

Figure 2: Evaluation of cross-modality consistency. Left: visualization of raw features from the two modalities using UMAP (Becht et al. (2019)). Color indicates the sample classes. Right: joint visualization of the two modalities in the same latent space. Color indicates the modality each representation was learned from. KCCA-poly: CCA with a polynomial kernel; KCCA-RBF: CCA with a radial basis function kernel.

Figure 3: Evaluation of preservation of the original data classes in latent space. UMAP embeddings of the latent representations are as in Fig. 2, with color annotations indicating the true sample labels.


We note that, compared with the SVD-based solution in Eq. B.2, the different dimensions of the latent representations are no longer orthogonal to each other.

Figure D.1: Cartoon illustration of possible outcomes of canonical representation learning on two modalities. (A) Sample raw features from two modalities visualized in 2D. Color indicates the class labels. (B-D) Three scenarios of canonical representations learned from two modalities. (B) 2D embedding of canonical representations with high cross-modality correlation but little class information. (C) 2D embedding of canonical representations with good separation between classes but low correlation between modalities. (D) The ideal case: canonical representations from the two modalities and the same class are well mixed, while different classes are well separated.

Figure D.4: Runtime of the compared approaches on the MNIST dataset.

Figure 5: Integrative analysis of two transformed views from the MNIST dataset.

Integrative analysis of gene and protein modalities from CITE-seq.

Table F.1: Model structure analysis of adCCA. Note: removing the Wasserstein distance term is equivalent to the Contrast AE approach.

Table F.3: Classification evaluation of log-transformed features. Table F.4: Consistency evaluation of binary-encoded features.

Table F.5: Classification evaluation of binary-encoded features. Table F.6: A summary of representation learning and training for CCA variants.


simCLR: we used the NT-Xent simCLR implementation 2 and set the training epochs to 500. The learning rate was chosen by grid search over [1e-2, 1e-3, 1e-4].

Contrast AE: the implementation is based on the adCCA framework with the adversarial learning structure removed; only the similarity constraint is used in learning. The parameter settings and training strategy are the same as in adCCA.

ACCA: we used the implementation from the authors 3. The latent representation is generated from both modalities (the q(z|x, y) mode, by setting num_z = 2). The L2 reconstruction loss is used.

VCCA and VCCA-p: we used the implementation from ACCA. We followed the original VCCA framework and assumed the joint latent representation is generated from a single view.

DPCCA: we modified the model from the original paper 4. Parameters were selected by grid search over learning rates [1e-2, 1e-3, 1e-4] and epochs [500, 1000, 2000]. λ was set to 1; we empirically determined that it has little effect on the final results.

adCCA: the model is implemented with the PyTorch package. The network structure used in the experiments is presented in Appendix Fig. D.5. Each layer includes BatchNorm1d and a ReLU activation function. The Adam optimizer was used for both the AE and GAN networks. β (the coefficients for the running averages of the gradient and its square) was selected by grid search, giving [0.9, 0.99] for the discriminator and [0.5, 0.9] for the generator and AE. The hyperparameter λ balancing the inner-product term and the Wasserstein distance was set to 1. Training ran for 500 epochs with a learning rate of 1e-3. In each iteration, the discriminator optimizer was run twice and the generator optimizer five times for better convergence. In the GAN optimization, we used WGAN with gradient penalty 5. Source code and a demonstration of how to run it are provided in the supplementary materials.

E.2 SIMULATION DESIGN

We first pre-determined the k classes existing in the data. To get raw features X ∈ R^(n×p) for modality x, we used a multivariate normal distribution with randomly determined means to generate the latent code Z_x ∈ R^(n×d). Next, we projected the latent code to a new space with a multi-layer neural network whose random projection matrices lie in R^(m×p). Finally, we added random noise and random feature dropout to obtain the final feature, where δ^(x)(·; ξ) denotes random feature sampling at a probability ξ ∈ [0, 1] and ε is a Gaussian noise matrix with the same shape as X^(2). The resulting X is the final input feature for modality x. Similarly, we obtained the input feature Y ∈ R^(n×q) for modality y using the same framework with different random projection matrices and drop-out functions.

In the experiments, we set k = 50, p = q = 1,000, and the noise variances to 0.1 and 0.9 for x and y, respectively. Further, 90% and 10% of the features were randomly missing for these two modalities.

In the feature transformation experiments (Section 3.1.2), for the feature distribution transformation, we linearly shifted features to a minimal value of 0 and then applied two sequential log1p transformations; for the feature encoding transformation, we used locality-sensitive hashing (LSH) (Datar et al. (2004)), implemented from GitHub 6, to encode features into 1,000 bits.
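One possible instantiation of this generator is sketched below. It is illustrative only: the `project_modality` helper, the two-layer tanh projection, and the small latent sizes are our assumptions, while the feature dimensions, noise variances, and dropout rates follow the values stated above.

```python
import numpy as np

def project_modality(rng, Z, p, noise, dropout):
    """Project a shared latent code into one modality's feature space:
    random two-layer nonlinear projection, additive Gaussian noise,
    then random feature dropout (a fraction `dropout` of features lost)."""
    W1 = rng.normal(size=(Z.shape[1], 64))
    W2 = rng.normal(size=(64, p))
    X = np.tanh(Z @ W1) @ W2                      # random projection network
    X = X + rng.normal(scale=noise, size=X.shape) # additive noise
    keep = rng.random(p) > dropout                # random feature dropout
    return X[:, keep]

rng = np.random.default_rng(0)
n, k, d = 500, 5, 10                              # illustrative (paper: k = 50)
classes = rng.integers(k, size=n)
means = rng.normal(scale=3.0, size=(k, d))
Z = means[classes] + rng.normal(size=(n, d))      # class-structured shared latent code

# two modalities of the SAME samples: different projections, noise, dropout
X = project_modality(rng, Z, p=1000, noise=0.1, dropout=0.9)
Y = project_modality(rng, Z, p=1000, noise=0.9, dropout=0.1)
```

Because both modalities are projected from the same latent code, the class structure is shared across views while the feature spaces, noise levels, and missingness patterns differ, which is exactly the setting the simulation is designed to test.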

