ADVERSARIAL REPRESENTATION LEARNING FOR CANONICAL CORRELATION ANALYSIS

Abstract

Canonical correlation analysis (CCA) provides a framework to map multimodal data into a maximally correlated latent space. Deep CCA replaces the linear maps with deep transformations to enable more flexible correlated data representations; however, optimizing the CCA target requires calculation over sufficiently large sample batches. Here, we present a deep, adversarial approach to CCA, adCCA, that can be efficiently solved by standard mini-batch training. We reformulate CCA under the assumption that the different modalities are embedded with identical latent distributions, and derive a tractable deep CCA target. We implement the new target and the distribution constraint within an adversarial framework to efficiently learn the canonical representations. adCCA learns maximally correlated representations across modalities while preserving class information within individual modalities. Further, adCCA removes the need for feature transformation and normalization and can be directly applied to diverse modalities and feature encodings. Numerical studies show that the performance of adCCA is robust to data transformations, binary encodings, and corruptions. Together, adCCA provides a scalable approach to align data across modalities without compromising sample class information within each modality.

1. INTRODUCTION

Data samples can be measured with different modalities (e.g., image or text), encoded in different formats, and modeled by different distributions. Integrative analysis of multimodal data provides the opportunity, in many machine learning tasks, to combine partial information from each modality and achieve better performance than any single modality alone (Ngiam et al. (2011); Srivastava & Salakhutdinov (2012)). Canonical correlation analysis (CCA) (Thompson (1984)) is one of the most classical and general approaches for multimodal data integration. It learns linear mappings between data modalities that achieve maximal cross-modality correlations. Replacing the linear mappings in CCA with deep functions allows non-linear, flexible transformations and better correlated representations. However, learning the deep CCA target function requires optimization over all data samples (Andrew et al. (2013)) or a sufficiently large data batch (Wang et al. (2015)), which is incompatible with the standard batch-based learning strategies widely used in deep learning and limits its power on large-scale datasets. Therefore, many recent deep CCA approaches (Wang et al. (2016); Dutton (2020); Karami & Schuurmans (2021)) sidestep the original CCA formulation and instead focus on learning a joint representation for the paired modalities using approaches compatible with mini-batch training (Appendix A). Here, we propose a multimodal adversarial learning framework for deep CCA learning: adCCA. Mathematical analysis provides an optimization target for CCA amenable to mini-batch training under the requirement that the different modality distributions are identical in latent space. adCCA formulates this requirement as a penalty function that, during optimization, brings the two latent distributions into alignment.
As the latent space representations converge (distribution-wise), maximizing the optimization target leads to highly correlated latent representations (sample-wise) for the two modalities, as illustrated in our numerical experiments. Thus, adCCA is derived from a deep CCA framework that follows the original correlation target of CCA, yet can be directly optimized by mini-batch training. In more detail, adCCA learns representations in multiple steps. First, initial latent representations are provided by modality-specific autoencoders (AEs). Second, the Wasserstein distance is used to measure the overall difference between the two latent distributions, and an inner product is used to measure sample-wise similarities. Then, adCCA uses an adversarial framework (Goodfellow et al. (2014)) to minimize the latent distribution difference and maximize sample-wise similarity. Together, the final output representation from each modality is expected to have a similar latent distribution (from the Wasserstein distance terms) and high cross-modality sample similarity (from the inner product term), while still maintaining a faithful representation of the original features (from the AE). The design of the adCCA framework has several advantages. One major advantage is flexibility in input features. Often in multimodal analysis, the total loss is composed of loss terms from different modalities (Wan et al. (2021)). If features from diverse modalities have different distributions or data formats, the joint optimization can be biased towards a certain modality. In the adCCA framework, each modality representation is generated from a modality-specific encoder, and the two latent representations are related only indirectly through the adversarial network. Another major advantage is the ability to learn correlation without losing information in the original features.
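The three loss ingredients described above can be sketched as a single per-mini-batch objective. The function below is an illustrative reconstruction, not the paper's exact loss: the name `adcca_batch_loss`, the weights `lam` and `gamma`, and the stand-in `critic` callable are all hypothetical, and in practice the critic would be a network trained adversarially under a Lipschitz constraint rather than a fixed function.

```python
import numpy as np

def adcca_batch_loss(x, y, z_x, z_y, x_hat, y_hat, critic, lam=1.0, gamma=1.0):
    """Sketch of a per-mini-batch adCCA-style objective (hypothetical form).

    Combines the three ingredients described in the text:
      * AE reconstruction terms keep each latent faithful to its modality,
      * a critic-based (Wasserstein-style) penalty pulls the two latent
        distributions together,
      * a sample-wise inner product rewards cross-modality similarity.
    """
    # Autoencoder reconstruction error for each modality
    recon = np.mean((x - x_hat) ** 2) + np.mean((y - y_hat) ** 2)
    # Kantorovich-Rubinstein-style estimate: mean critic gap between latents
    w_dist = np.mean(critic(z_x)) - np.mean(critic(z_y))
    # Sample-wise similarity between paired latent vectors
    similarity = np.mean(np.sum(z_x * z_y, axis=1))
    return recon + lam * w_dist - gamma * similarity
```

Because every term is a batch mean, the objective can be estimated on arbitrary mini-batches, unlike the classical CCA correlation, which needs batch-level covariance estimates.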
In particular, deep CCA approaches use deep transformations to learn canonical representations with high correlation; however, in doing so these representations may lose class information contained in the original feature space. This point is typically not considered in deep CCA frameworks. In adCCA, the use of an AE in learning forces the latent representation to retain a faithful reconstruction of the original data. We benchmark adCCA against 10 different methods, using both synthetic and real datasets. In the evaluation, we focus on two aspects. Cross-modality consistency: correlation of sample points in the latent space for both modalities (Appendix Fig. D.1B). Class preservation: ability of the latent representation to retain the original class information (Appendix Fig. D.1C). The ideal method should provide embeddings that are both correlated per sample and retain class structure across both modalities (Appendix Fig. D.1D). We show that representations from adCCA both achieve high cross-modality consistency in the joint latent space and preserve class information from the original feature distributions. We demonstrate that adCCA maintains stable performance across features of different modality types, feature distributions, and degradation levels. The major contributions of this work are: (1) a reformulation of CCA with deep transformations into a target that can be optimized through standard mini-batch training; (2) an adversarial learning framework for efficient canonical representation learning and an optimization strategy for stable adversarial training between two latent representations; (3) an expanded approach for evaluating the performance of CCA approaches for both cross-modality consistency and single-modality information preservation; (4) an extensive test and demonstration of adCCA's performance.
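The two evaluation aspects can be made concrete with simple proxy metrics. The functions below are illustrative stand-ins rather than the paper's actual protocol (which is described in Appendix D): pairing accuracy of nearest cross-modal neighbors as a proxy for cross-modality consistency, and leave-one-out 1-NN label accuracy as a proxy for class preservation.

```python
import numpy as np

def cross_modality_consistency(z_x, z_y):
    """Fraction of samples whose nearest point in the other modality's
    embedding is their own paired sample (higher is better)."""
    d = np.linalg.norm(z_x[:, None, :] - z_y[None, :, :], axis=2)
    return np.mean(np.argmin(d, axis=1) == np.arange(len(z_x)))

def class_preservation(z, labels):
    """Leave-one-out 1-nearest-neighbor label accuracy within one
    modality's latent space (higher means class structure is retained)."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # exclude each point itself
    return np.mean(labels[np.argmin(d, axis=1)] == labels)
```

A method can score perfectly on one metric and poorly on the other, which is why both are needed: a collapsed embedding can pair samples well while destroying class structure, and an identity-like embedding can do the reverse.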

2. ADVERSARIAL REPRESENTATION LEARNING FOR CCA

2.1 PRELIMINARY Here, we consider a general framework for CCA. (An overview of existing CCA variants is provided in Appendix A.) Given a pair of feature vectors from two modalities, x ∈ R^p and y ∈ R^q, CCA tries to learn two mapping functions f_1 and f_2 that maximize the correlation after mapping: max corr(z_x, z_y) s.t. z_x = f_1(x), z_y = f_2(y), where z_x ∈ R and z_y ∈ R are the corresponding canonical variables. The transformation functions f_1 and f_2 can be linear mappings (in the original CCA), implicit kernel functions (in kernel CCA), or deep transformations (in deep CCA). These functions are learned through optimization over the entire dataset (Andrew et al. (2013); Appendix B.1), and mini-batch training strategies cannot be leveraged. Here, we aim to reformulate the CCA learning target into a form that enables optimization through general deep-learning optimizers. To this end, we relax the CCA learning target with an additional identical latent distribution assumption: the mapped representations z_x and z_y are required to follow the same distribution in the latent space.
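For reference, when f_1 and f_2 are linear, the correlation objective above has a closed-form solution via an SVD of the whitened cross-covariance matrix. The sketch below illustrates this classical case; the small `reg` ridge term is an implementation choice for numerical stability, not part of the formulation in the text.

```python
import numpy as np

def linear_cca(X, Y, k=1, reg=1e-6):
    """Classical linear CCA via SVD of the whitened cross-covariance.

    Returns projections A (p x k), B (q x k) such that corr(X @ A[:, i],
    Y @ B[:, i]) is maximized, plus the top-k canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]      # canonical directions for modality x
    B = Wy @ Vt.T[:, :k]   # canonical directions for modality y
    return A, B, s[:k]     # s holds the canonical correlations
```

Note that `Cxx`, `Cyy`, and `Cxy` are estimated over all n samples; this is exactly the full-dataset dependence that prevents naive mini-batch training of deep CCA.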




