STOCHASTIC CANONICAL CORRELATION ANALYSIS: A RIEMANNIAN APPROACH

Abstract

We present an efficient stochastic algorithm (RSG+) for canonical correlation analysis (CCA) derived via a differential geometric perspective on the underlying optimization task. We show that exploiting the Riemannian structure of the problem reveals natural strategies for modified forms of manifold stochastic gradient descent schemes that have been variously used in the literature for numerical optimization on manifolds. Our development complements existing methods for this problem, which either require O(d^3) time complexity per iteration with an O(1/sqrt(t)) convergence rate (where d is the dimensionality) or extract only the top-1 component with an O(1/t) convergence rate. In contrast, our algorithm achieves O(d^2 k) runtime complexity per iteration for extracting the top-k canonical components with an O(1/t) convergence rate. We present a theoretical analysis as well as experiments describing the empirical behavior of our algorithm, including a potential application of this idea to training fair models when the label of the protected attribute is missing or otherwise unavailable.

1. INTRODUCTION

Canonical correlation analysis (CCA) is a popular method for evaluating correlations between two sets of variables. It is commonly used in unsupervised multi-view learning, where the multiple views of the data may correspond to image, text, audio, and so on (Rupnik & Shawe-Taylor, 2010).

The goal in CCA is to find linear combinations of two random variables X and Y that have maximum correlation with each other. Formally, the CCA problem is defined in the following way. Given a pair of random variables, a d_x-variate random variable X and a d_y-variate random variable Y, with unknown joint probability distribution, find the projection matrices U in R^{d_x x k} and V in R^{d_y x k}, with k <= min{d_x, d_y}, such that the correlation is maximized:

    maximize  trace(U^T E_{X,Y}[X^T Y] V)
    s.t.      U^T E_X[X^T X] U = I_k,   V^T E_Y[Y^T Y] V = I_k        (1)

Here, X and Y are samples of X and Y, respectively. The objective function in (1) is the expected cross-correlation in the projected space, and the constraints specify that different canonical components should be decorrelated. Let C_X = E_X[X^T X] and C_Y = E_Y[Y^T Y] be the covariance matrices, and let C_XY = E_{(X,Y)}[X^T Y] denote the cross-covariance. Define the whitened cross-covariance T := C_X^{-1/2} C_XY C_Y^{-1/2}, and let Phi_k (resp. Psi_k) contain the top-k left (resp. right) singular vectors of T. It is known (Golub & Zha, 1992) that the optimum of (1) is achieved at U* = C_X^{-1/2} Phi_k, V* = C_Y^{-1/2} Psi_k.

In practice, we may be given two views of N samples as X in R^{N x d_x} and Y in R^{N x d_y}. A natural approach to solving CCA is based on the following sequence of steps. We first compute the empirical covariance and cross-covariance matrices, namely, C_X = (1/N) X^T X, C_Y = (1/N) Y^T Y, and C_XY = (1/N) X^T Y. We then calculate the empirical whitened cross-covariance matrix T and, finally, compute U*, V* by applying a k-truncated SVD to T.
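The batch procedure above (empirical covariances, whitening, then a k-truncated SVD) can be sketched as follows. This is a minimal NumPy illustration of the classical closed-form solution, not the RSG+ algorithm of this paper; the function name `cca_svd` and the small ridge term `reg` (added so the covariance inverses are well-defined) are our own choices for the sketch, and the data are assumed centered.

```python
import numpy as np

def cca_svd(X, Y, k, reg=1e-8):
    """Top-k canonical directions via whitening + truncated SVD (batch CCA)."""
    N = X.shape[0]
    # Empirical (regularized) covariance and cross-covariance matrices.
    Cx = X.T @ X / N + reg * np.eye(X.shape[1])
    Cy = Y.T @ Y / N + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / N

    # Inverse matrix square roots via eigendecomposition (Cx, Cy are SPD).
    def inv_sqrt(C):
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(w ** -0.5) @ Q.T

    Wx, Wy = inv_sqrt(Cx), inv_sqrt(Cy)
    T = Wx @ Cxy @ Wy                 # whitened cross-covariance
    Phi, s, PsiT = np.linalg.svd(T)   # singular values s are canonical correlations
    U = Wx @ Phi[:, :k]               # U* = Cx^{-1/2} Phi_k
    V = Wy @ PsiT[:k].T               # V* = Cy^{-1/2} Psi_k
    return U, V, s[:k]
```

Note that the eigendecomposition and SVD make this O(d^3) per solve, which is exactly the cost the stochastic approach in this paper is designed to avoid.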



Further examples of such multi-view settings appear in Chaudhuri et al. (2009); Luo et al. (2015). Classical CCA formulations have also been extended to leverage advances in representation learning: for example, Andrew et al. (2013) showed how CCA can be interfaced with deep neural networks, enabling modern use cases. Many results over the last few years have used CCA or its variants for problems including measuring representational similarity in deep neural networks (Morcos et al., 2018), speech recognition (Couture et al., 2019), and more.

