STOCHASTIC CANONICAL CORRELATION ANALYSIS: A RIEMANNIAN APPROACH

Abstract

We present an efficient stochastic algorithm (RSG+) for canonical correlation analysis (CCA), derived via a differential geometric perspective on the underlying optimization task. We show that exploiting the Riemannian structure of the problem reveals natural strategies for modified forms of manifold stochastic gradient descent schemes that have been variously used in the literature for numerical optimization on manifolds. Our development complements existing methods for this problem, which either require O(d^3) time complexity per iteration with an O(1/√t) convergence rate (where d is the dimensionality) or only extract the top-1 component with an O(1/t) convergence rate. In contrast, our algorithm achieves O(d^2 k) runtime complexity per iteration for extracting the top-k canonical components with an O(1/t) convergence rate. We present our theoretical analysis as well as experiments describing the empirical behavior of our algorithm, including a potential application of this idea for training fair models when labels for the protected attribute are missing or otherwise unavailable.

1. INTRODUCTION

Canonical correlation analysis (CCA) is a popular method for evaluating correlations between two sets of variables. It is commonly used in unsupervised multi-view learning, where the multiple views of the data may correspond to image, text, audio and so on Rupnik & Shawe-Taylor (2010); Chaudhuri et al. (2009); Luo et al. (2015). Classical CCA formulations have also been extended to leverage advances in representation learning; for example, Andrew et al. (2013) showed how CCA can be interfaced with deep neural networks, enabling modern use cases. Many results over the last few years have used CCA or its variants for problems including measuring representational similarity in deep neural networks Morcos et al. (2018), speech recognition Couture et al. (2019), etc. The goal in CCA is to find linear combinations within two random variables X and Y which have maximum correlation with each other. Formally, the CCA problem is defined in the following way. Given a pair of random variables, a d_x-variate random variable X and a d_y-variate random variable Y, with unknown joint probability distribution, find the projection matrices U ∈ R^{d_x×k} and V ∈ R^{d_y×k}, with k ≤ min{d_x, d_y}, such that the correlation is maximized:

    maximize_{U,V}  trace(U^T E_{X,Y}[X^T Y] V)
    s.t.  U^T E_X[X^T X] U = I_k,   V^T E_Y[Y^T Y] V = I_k        (1)

Here, X, Y are samples of X and Y respectively. The objective function in (1) is the expected cross-correlation in the projected space, and the constraints specify that different canonical components should be decorrelated. Let C_X = E_X[X^T X] and C_Y = E_Y[Y^T Y] be the covariance matrices, and let C_XY = E_{(X,Y)}[X^T Y] denote the cross-covariance. Define the whitened cross-covariance T := C_X^{-1/2} C_XY C_Y^{-1/2}, and let Φ_k (resp. Ψ_k) contain the top-k left (resp. right) singular vectors of T. It is known Golub & Zha (1992) that the optimum of (1) is achieved at U* = C_X^{-1/2} Φ_k, V* = C_Y^{-1/2} Ψ_k.
In practice, we may be given two views of N samples as X ∈ R^{N×d_x} and Y ∈ R^{N×d_y}. A natural approach to solving CCA is based on the following sequence of steps. We first compute the empirical covariance and cross-covariance matrices, namely, C_X = (1/N) X^T X, C_Y = (1/N) Y^T Y and C_XY = (1/N) X^T Y. We then calculate the empirical whitened cross-covariance matrix T and, finally, compute U*, V* by applying a k-truncated SVD to T. Runtime and memory considerations. The above procedure is simple but is only feasible when the data matrices are small. In most modern applications, not only are the datasets large, but the dimension d (let d = max{d_x, d_y}) of each sample can also be quite high, especially if representations are being learned using deep neural network models. As a result, the computational footprint of the foregoing algorithm can be quite high. This has motivated the study of stochastic optimization routines for solving CCA. Observe that, in contrast to the typical settings where stochastic optimization schemes are most effective, the CCA objective does not decompose over samples in the dataset. Many efficient strategies have been proposed in the literature, for example by Ge et al. (2016; 2017). Often, the search space for U and V corresponds to the entire R^{d×k} (ignoring the constraints for the moment). But if the formulation could be cast in a form which involves approximately writing U and V as a product of several matrices with nicer properties, we may obtain specialized routines tailored to exploit those properties. Such a reformulation is not difficult to derive, where the matrices used to express U and V can be identified as objects that live in well-studied geometric spaces. Then, utilizing the geometry of the space and borrowing relevant tools from differential geometry leads to an efficient approximate algorithm for top-k CCA which optimizes the population objective in a streaming fashion. Contributions.
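The classical pipeline above (empirical covariances, whitening, k-truncated SVD) can be sketched in a few lines of NumPy. This is a minimal illustration of the baseline procedure, not the RSG+ algorithm; the function name and helper are ours, and we assume mean-centered data matrices:

```python
import numpy as np

def cca_topk(X, Y, k):
    """Classical top-k CCA via a k-truncated SVD of the whitened
    cross-covariance T = C_X^{-1/2} C_XY C_Y^{-1/2} (cf. Golub & Zha, 1992).
    Assumes X (N x d_x) and Y (N x d_y) are mean-centered."""
    N = X.shape[0]
    C_X = X.T @ X / N
    C_Y = Y.T @ Y / N
    C_XY = X.T @ Y / N

    def inv_sqrt(C):
        # Inverse matrix square root via eigendecomposition (C assumed SPD).
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    Cx_is, Cy_is = inv_sqrt(C_X), inv_sqrt(C_Y)
    T = Cx_is @ C_XY @ Cy_is
    Phi, s, PsiT = np.linalg.svd(T)
    U = Cx_is @ Phi[:, :k]      # U* = C_X^{-1/2} Phi_k
    V = Cy_is @ PsiT[:k, :].T   # V* = C_Y^{-1/2} Psi_k
    return U, V, s[:k]          # s[:k] are the top-k canonical correlations
```

The O(d^3) cost of the eigendecomposition and SVD in this sketch is exactly the per-iteration bottleneck that motivates the stochastic, whitening-free approach developed below.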
(a) First, we re-parameterize the top-k CCA problem as an optimization problem on specific matrix manifolds and show that it is equivalent to the original formulation in (1). (b) Informed by the geometry of the manifold, we derive stochastic gradient descent algorithms for solving the re-parameterized problem with O(d^2 k) cost per iteration and provide convergence rate guarantees. (c) This analysis provides a direct mechanism to obtain an upper bound on the number of iterations needed to guarantee a given error w.r.t. the population objective for the CCA problem. (d) The algorithm works in a streaming manner, so it easily scales to large datasets and we do not need to assume access to the full dataset at the outset. (e) We present empirical evidence for both the standard CCA model and the DeepCCA setting Andrew et al. (2013), describing advantages and limitations.

2. STOCHASTIC CCA: REFORMULATION, ALGORITHM AND ANALYSIS

The formulation of stochastic CCA and the subsequent optimization scheme will seek to utilize the geometry of the feasible set for computational gains. Specifically, we will use the following manifolds (please see Absil et al. (2007) for more details): (a) Stiefel St(p, n): the manifold of n × p (p < n) column-orthonormal matrices, i.e., St(p, n) = {X ∈ R^{n×p} | X^T X = I_p}. (b) Grassmannian Gr(p, n): the manifold of p-dimensional subspaces of R^n, with p < n. (c) Rotations SO(n): the manifold/group of n × n special orthogonal matrices, i.e., SO(n) = {X ∈ R^{n×n} | X^T X = XX^T = I_n, det(X) = 1}. We summarize certain geometric properties/operations for these manifolds in the Appendix; they have also been leveraged in recent works for other problems Li et al. (2020); Rezende et al. (2020). Let us recall the objective function for CCA as given in (1). We denote by X ∈ R^{N×d_x} the matrix consisting of the samples {x_i} drawn from a zero-mean random variable X ∼ X, and by Y ∈ R^{N×d_y} the matrix consisting of the samples {y_i} drawn from a zero-mean random variable Y ∼ Y. For notational and formulation simplicity, we assume that d_x = d_y = d in the remainder of the paper, although the results hold for general d_x and d_y. Let C_X, C_Y be the covariance matrices of X, Y, and let C_XY be the cross-covariance matrix between X and Y. Then, we can write the CCA objective as

    max_{U,V}  F = trace(U^T C_XY V)
    s.t.  U^T C_X U = I_k,   V^T C_Y V = I_k        (2)
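To make the Stiefel geometry concrete, a single generic manifold-SGD step on St(p, n) consists of projecting a Euclidean gradient onto the tangent space at the current iterate and retracting the update back onto the manifold. The sketch below uses a standard QR retraction; it illustrates the kind of operation the algorithm builds on, not the specific RSG+ update, and the function names are ours:

```python
import numpy as np

def stiefel_project_tangent(X, G):
    """Project a Euclidean gradient G onto the tangent space of
    St(p, n) at X: G - X sym(X^T G)."""
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2

def qr_retract(X, xi):
    """Map X + xi back onto the Stiefel manifold via the Q factor
    of a thin QR decomposition (signs fixed for uniqueness)."""
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.diag(R))

def manifold_sgd_step(X, G, lr=0.1):
    """One ascent step on St(p, n): project, scale, retract."""
    return qr_retract(X, lr * stiefel_project_tangent(X, G))
```

Because the retraction operates on an n × p matrix, each step costs O(n p^2) rather than the O(n^3) of a full re-orthogonalization, which is the source of the O(d^2 k) per-iteration complexity claimed above.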



Ge et al. (2016); Wang et al. (2016) present Empirical Risk Minimization (ERM) models which optimize the empirical objective. More recently, Gao et al. (2019); Bhatia et al. (2018); Arora et al. (2017) describe proposals that optimize the population objective. To summarize the approaches succinctly: if we are satisfied with identifying the top-1 component of CCA, effective schemes are available, utilizing either extensions of Oja's rule Oja (1982) to the generalized eigenvalue problem Bhatia et al. (2018) or the alternating SVRG algorithm Gao et al. (2019). Otherwise, a stochastic approach must make use of an explicit whitening operation which incurs a cost of O(d^3) per iteration Arora et al. (2017). Observation. Most approaches either directly optimize (1) or instead a reparametrized or regularized form Ge et al. (2016); Allen-Zhu & Li (2016); Arora et al. (2017).
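As a point of reference for the streaming schemes mentioned above, plain Oja's rule estimates the top eigenvector of a covariance matrix from one sample per step. The sketch below shows this basic rule only (not the generalized-eigenvalue extension of Bhatia et al. (2018) or the CCA variant); the function name and learning rate are our own choices:

```python
import numpy as np

def oja_top1(samples, d, lr=0.01, seed=0):
    """Plain Oja's rule: streaming estimate of the top eigenvector
    of E[x x^T], processing one sample x per step."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for x in samples:
        w += lr * x * (x @ w)    # stochastic ascent step on w^T C w
        w /= np.linalg.norm(w)   # retract back to the unit sphere
    return w
```

The renormalization after every step plays the same role as a retraction in manifold SGD; the Riemannian view developed in this paper generalizes this idea from the unit sphere to the matrix manifolds needed for top-k CCA.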

