STOCHASTIC CANONICAL CORRELATION ANALYSIS: A RIEMANNIAN APPROACH

Abstract

We present an efficient stochastic algorithm (RSG+) for canonical correlation analysis (CCA), derived via a differential geometric perspective on the underlying optimization task. We show that exploiting the Riemannian structure of the problem reveals natural strategies for modified forms of manifold stochastic gradient descent schemes that have been variously used in the literature for numerical optimization on manifolds. Our development complements existing methods for this problem, which either require O(d^3) time complexity per iteration with an O(1/sqrt(t)) convergence rate (where d is the dimensionality) or only extract the top-1 component with an O(1/t) convergence rate. In contrast, our algorithm achieves O(d^2 k) runtime complexity per iteration for extracting the top k canonical components, with an O(1/t) convergence rate. We present our theoretical analysis as well as experiments describing the empirical behavior of our algorithm, including a potential application of this idea to training fair models where the label of the protected attribute is missing or otherwise unavailable.

1. INTRODUCTION

Canonical correlation analysis (CCA) is a popular method for evaluating correlations between two sets of variables. It is commonly used in unsupervised multi-view learning, where the multiple views of the data may correspond to image, text, audio and so on (Rupnik & Shawe-Taylor, 2010; Chaudhuri et al., 2009; Luo et al., 2015). Classical CCA formulations have also been extended to leverage advances in representation learning; for example, Andrew et al. (2013) showed how CCA can be interfaced with deep neural networks, enabling modern use cases. Many results over the last few years have used CCA or its variants for problems including measuring representational similarity in deep neural networks (Morcos et al., 2018), speech recognition (Couture et al., 2019), and others. The goal in CCA is to find linear combinations of two random variables X and Y which have maximum correlation with each other. Formally, the CCA problem is defined in the following way. Given a pair of random variables, a d_x-variate random variable X and a d_y-variate random variable Y, with unknown joint probability distribution, find projection matrices U ∈ R^{d_x×k} and V ∈ R^{d_y×k}, with k ≤ min{d_x, d_y}, such that the correlation is maximized:

maximize trace(U^T E_{X,Y}[X^T Y] V)   s.t.   U^T E_X[X^T X] U = I_k,   V^T E_Y[Y^T Y] V = I_k    (1)

Here, X, Y are samples of X and Y respectively. The objective function in (1) is the expected cross-correlation in the projected space, and the constraints specify that different canonical components should be decorrelated. Let C_X = E_X[X^T X] and C_Y = E_Y[Y^T Y] be the covariance matrices, and let C_XY = E_{(X,Y)}[X^T Y] denote the cross-covariance. Define the whitened cross-covariance T := C_X^{-1/2} C_XY C_Y^{-1/2}, and let Φ_k (and Ψ_k) contain the top-k left (and right) singular vectors of T. It is known (Golub & Zha, 1992) that the optimum of (1) is achieved at U* = C_X^{-1/2} Φ_k, V* = C_Y^{-1/2} Ψ_k.
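To make the closed-form solution above concrete, here is a small self-contained sketch (our own illustration, not the authors' code); it uses empirical covariance estimates in place of the expectations, and a small ridge term `reg` that we add for numerical stability:

```python
import numpy as np

def cca_svd(X, Y, k, reg=1e-8):
    """Top-k CCA via whitening + truncated SVD, following U* = C_X^{-1/2} Phi_k."""
    N = X.shape[0]
    Cx = X.T @ X / N + reg * np.eye(X.shape[1])   # empirical covariance C_X
    Cy = Y.T @ Y / N + reg * np.eye(Y.shape[1])   # empirical covariance C_Y
    Cxy = X.T @ Y / N                             # empirical cross-covariance C_XY

    def inv_sqrt(C):
        # inverse matrix square root via eigendecomposition
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    Cx_is, Cy_is = inv_sqrt(Cx), inv_sqrt(Cy)
    T = Cx_is @ Cxy @ Cy_is                       # whitened cross-covariance
    Phi, s, PsiT = np.linalg.svd(T)
    U = Cx_is @ Phi[:, :k]                        # U* = C_X^{-1/2} Phi_k
    V = Cy_is @ PsiT[:k].T                        # V* = C_Y^{-1/2} Psi_k
    return U, V, s[:k]                            # s[:k]: canonical correlations
```

The O(d^3) cost of the eigendecompositions and the SVD is exactly the bottleneck that motivates the stochastic approach developed in this paper.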
In practice, we may be given two views of N samples as X ∈ R^{N×d_x} and Y ∈ R^{N×d_y}. A natural approach to solving CCA is based on the following sequence of steps. We first compute the empirical covariance and cross-covariance matrices, namely C_X = (1/N) X^T X, C_Y = (1/N) Y^T Y and C_XY = (1/N) X^T Y. We then calculate the empirical whitened cross-covariance matrix T and, finally, compute U*, V* by applying a k-truncated SVD to T. Runtime and memory considerations. The above procedure is simple but is only feasible when the data matrices are small. In most modern applications, not only are the datasets large but the dimension d (let d = max{d_x, d_y}) of each sample can also be quite high, especially if representations are being learned using deep neural network models. As a result, the computational footprint of the foregoing algorithm can be quite high. This has motivated the study of stochastic optimization routines for solving CCA. Observe that, in contrast to the typical settings where stochastic optimization schemes are most effective, the CCA objective does not decompose over samples in the dataset. Many efficient strategies have been proposed in the literature: for example, Ge et al. (2016); Wang et al. (2016) present Empirical Risk Minimization (ERM) models which optimize the empirical objective. More recently, Gao et al. (2019); Bhatia et al. (2018); Arora et al. (2017) describe proposals that optimize the population objective. To summarize the approaches succinctly: if we are satisfied with identifying the top-1 component of CCA, effective schemes are available by utilizing either extensions of Oja's rule (Oja, 1982) to the generalized eigenvalue problem (Bhatia et al., 2018) or the alternating SVRG algorithm (Gao et al., 2019). Otherwise, a stochastic approach must make use of an explicit whitening operation which incurs a cost of O(d^3) per iteration (Arora et al., 2017). Observation.
Most approaches either directly optimize (1) or instead a reparametrized or regularized form (Ge et al., 2016; Allen-Zhu & Li, 2016; Arora et al., 2017). Often, the search space for U and V corresponds to the entire R^{d×k} (ignoring the constraints for the moment). But if the formulation could be cast in a form which involved approximately writing U and V as a product of several matrices with nicer properties, we may obtain specialized routines tailored to exploit those properties. Such a reformulation is not difficult to derive: the matrices used to express U and V can be identified as objects that live in well-studied geometric spaces. Then, utilizing the geometry of the space and borrowing relevant tools from differential geometry leads to an efficient approximate algorithm for top-k CCA which optimizes the population objective in a streaming fashion. Contributions. (a) First, we re-parameterize the top-k CCA problem as an optimization problem on specific matrix manifolds, and show that it is equivalent to the original formulation in equation 1. (b) Informed by the geometry of the manifolds, we derive stochastic gradient descent algorithms for solving the re-parameterized problem with O(d^2 k) cost per iteration and provide convergence rate guarantees. (c) This analysis provides a direct mechanism to obtain an upper bound on the number of iterations needed to guarantee a given error w.r.t. the population objective for the CCA problem. (d) The algorithm works in a streaming manner, so it easily scales to large datasets and we do not need to assume access to the full dataset at the outset. (e) We present empirical evidence for both the standard CCA model and the DeepCCA setting (Andrew et al., 2013), describing advantages and limitations.

2. STOCHASTIC CCA: REFORMULATION, ALGORITHM AND ANALYSIS

The formulation of stochastic CCA and the subsequent optimization scheme will seek to utilize the geometry of the feasible set for computational gains. Specifically, we will use the following manifolds (see Absil et al. (2007) for more details): (a) Stiefel St(p, n): the manifold of n × p, p < n, column-orthonormal matrices, i.e., St(p, n) = {X ∈ R^{n×p} | X^T X = I_p}. (b) Grassmannian Gr(p, n): the manifold of p-dimensional subspaces of R^n, with p < n. (c) Rotations SO(n): the manifold/group of n × n special orthogonal matrices, i.e., SO(n) = {X ∈ R^{n×n} | X^T X = XX^T = I_n, det(X) = 1}. We summarize certain geometric properties/operations for these manifolds in the Appendix; they have also been leveraged in recent works for other problems (Li et al., 2020; Rezende et al., 2020). Let us recall the objective function for CCA as given in (1). We denote by X ∈ R^{N×d_x} the matrix consisting of the samples {x_i} drawn from a zero-mean random variable X, and by Y ∈ R^{N×d_y} the matrix consisting of samples {y_i} drawn from a zero-mean random variable Y. For notational and formulation simplicity, we assume that d_x = d_y = d in the remainder of the paper, although the results hold for general d_x and d_y. Let C_X, C_Y be the covariance matrices of X, Y, and let C_XY be the cross-covariance matrix between X and Y. Then, we can write the CCA objective as

max_{U,V} F = trace(U^T C_XY V)   subject to   U^T C_X U = I_k,   V^T C_Y V = I_k    (2)

Here, U ∈ R^{d×k} (V ∈ R^{d×k}) is the matrix consisting of {u_j} ({v_j}), where ({u_j}, {v_j}) are the canonical directions. The constraints in equation 2 are called whitening constraints. Let us define matrices Ũ, Ṽ ∈ R^{d×k} which lie on the Stiefel manifold St(k, d). Also, let S_u, S_v ∈ R^{k×k} denote upper-triangular matrices and Q_u, Q_v ∈ SO(k). We can rewrite the above equation and the constraints as follows.
A Reformulation for CCA:

max_{Ũ, Ṽ, S_u, S_v, Q_u, Q_v}  F = trace(U^T C_XY V),  with U = Ũ Q_u S_u, V = Ṽ Q_v S_v    (3a)
subject to  U^T C_X U = I_k;  V^T C_Y V = I_k    (3b)
Ũ, Ṽ ∈ St(k, d);  Q_u, Q_v ∈ SO(k);  S_u, S_v upper triangular

Here, we will maximize (3a) with respect to Ũ, Ṽ, S_u, S_v, Q_u and Q_v satisfying equation 3b. Main adjustment from (2) to (3): In (2), while U and V should decorrelate C_X and C_Y respectively, the optimization/search is unrestricted and treats them as arbitrary matrices. In contrast, equation 3 additionally decomposes U and V as U = Ũ Q_u S_u and V = Ṽ Q_v S_v with the components as structured matrices. Hence, the optimization is regularized. The above adjustment raises two questions: (i) does there exist a non-empty feasible set for (3)? (ii) if a solution to (3) can be found (which we will describe later), how "good" is this solution for the CCA objective, i.e., for (2)? Existence of a feasible solution: We need to evaluate whether the constraints in (3b) can be satisfied at all. Observe that by choosing Ũ to be the top-k principal directions of X, S_u to be the diagonal matrix whose entries are the inverse square roots of the top-k eigenvalues of C_X, and Q_u to be the identity rotation, we can easily satisfy the "whitening constraint"; hence Ũ Q_u S_u is a feasible solution for U in (3), and similarly for V. From this starting point, we can optimize the objective while maintaining feasibility. Is the solution of equation 3 a good approximation for equation 2? We can show that, under some assumptions, the estimator for canonical correlation, i.e., the solution of equation 3, is consistent, i.e., it solves equation 2. We will state this formally shortly. Before characterizing the properties of a solution of equation 3, we first provide some additional intuition behind equation 3 and describe how it helps computationally. Intuition behind the decomposition U = Ũ Q_u S_u: A key observation is the following.
Recall that by performing principal component analysis (PCA), the resulting projection matrix exactly satisfies the decorrelation condition needed for the "whitening constraint" in equation 2 (the projection matrix consists of the eigenvectors of X^T X). A natural question to ask is: can we utilize a streaming PCA algorithm to obtain an efficient streaming CCA algorithm? Let us assume that our estimate of the canonical correlation directions, i.e., the solution of equation 3, lies in the principal subspace calculated above. If so, we can use the decomposition U = Ũ A_u (analogously for V), where Ũ ∈ St(k, d) contains the principal directions and A_u is a full-rank k × k matrix containing the coefficients of the span. But maintaining the full-rank constraint during optimization is hard, so we further decompose A_u as A_u = Q_u S_u with Q_u ∈ SO(k) and S_u upper triangular. Additionally, we require the diagonal of S_u to be non-zero to keep S_u full rank. During optimization, we can maintain the non-zero diagonal entries by optimizing the log of the diagonal entries instead. Why does equation 3 help? First, note that CCA seeks to maximize the total correlation under the constraint that different components are decorrelated. The difficult part of the optimization is ensuring decorrelation, which leads to the higher complexity of existing streaming CCA algorithms. In contrast, equation 3 separates equation 2 into finding the principal components and finding the linear coefficients for the span of the principal directions. Then, by utilizing an efficient streaming PCA algorithm, a lower complexity can be achieved. We defer the specific details of the optimization to the next sub-section. First, we show formally why substituting equation 2 with equation 3a-equation 3b is sensible under some assumptions.
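To make the feasibility argument concrete, the following sketch (our own illustration; we take Q_u to be the identity rotation, which is one convenient element of SO(k)) verifies numerically that U = Ũ Q_u S_u satisfies the whitening constraint U^T C_X U = I_k when Ũ holds the top-k principal directions and S_u contains the inverse square roots of the top-k eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 8, 3, 5000
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))  # correlated samples
C_x = X.T @ X / N                                              # empirical covariance

w, E = np.linalg.eigh(C_x)                  # eigenvalues in ascending order
U_tilde = E[:, ::-1][:, :k]                 # top-k principal directions (point on St(k, d))
S_u = np.diag(1.0 / np.sqrt(w[::-1][:k]))   # inverse square roots of top-k eigenvalues
Q_u = np.eye(k)                             # identity rotation in SO(k)

U = U_tilde @ Q_u @ S_u
print(np.allclose(U.T @ C_x @ U, np.eye(k), atol=1e-8))  # prints: True
```

Since U_tilde^T C_x U_tilde is exactly the diagonal matrix of top-k eigenvalues, pre- and post-multiplying by S_u yields the identity, which is the whitening constraint in (3b).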

2.1. HOW TO USE THE REFORMULATION IN EQUATION 3?

We first state some mild assumptions needed for the analysis. Assumptions: (a) The random variables X ∼ N(0, Σ_x) and Y ∼ N(0, Σ_y) with Σ_x ⪰ cI_d and Σ_y ⪰ cI_d for some c > 0. (b) The samples X and Y drawn from X and Y respectively have zero mean. (c) For a given k ≤ d, Σ_x and Σ_y have non-zero top-k eigenvalues. A high-level solution to optimize F in equation 3: Recall the following scheme which we briefly summarized earlier. (a) Initialize Ũ, Ṽ ∈ St(k, d) as the top-k eigenvectors of C_X = (1/N) X^T X and C_Y = (1/N) Y^T Y respectively; initialize Q_u and Q_v in SO(k) (the identity is a convenient choice). (b) Set S_u and S_v to be diagonal matrices whose diagonal entries are the square roots of the inverses of the top-k eigenvalues (diagonal matrices satisfy the upper-triangular property). Observe that with this initialization, the constraints in equation 3b are satisfied. With a feasible solution for U and V in hand, we may optimize equation 3a while satisfying equation 3b. The specific details of how this is done are not critical at this point, as long as we assume that a suitable numerical optimization scheme exists and can be implemented. With the component matrices, we can construct the solution as U = Ũ Q_u S_u and V = Ṽ Q_v S_v. Why does the solution make sense? We now show how the presented solution, assuming access to an effective numerical procedure, approximates the CCA problem presented in equation 2. We formally state the result in the following theorem with a proof sketch (the appendix includes the full proof), after first stating a proposition and a definition. Definition 1. A random variable X is called sub-Gaussian if the norm ‖X‖ := inf {s ≥ 0 | E_X[exp(trace(X^T X)/s^2)] ≤ 2} is finite. If U ∈ R^{d×k}, then XU is sub-Gaussian (Vershynin, 2017). Proposition 1 (Reiß et al. (2020)). Let X be a random variable which follows a sub-Gaussian distribution.
Let X̂ be the approximation of X ∈ R^{N×d} (samples drawn from X) using the top-k principal vectors. Let C_X be the covariance of X and Ĉ_X its empirical estimate. Assume λ_i is the i-th eigenvalue of C_X for i = 1, ..., d − 1, with λ_i ≥ λ_{i+1} for all i. Then the PCA reconstruction error E_k = ‖X − X̂‖ (in the Frobenius-norm sense) can be upper bounded as

E_k ≤ min( √(2k) ‖Δ‖_2 , 2‖Δ‖_2^2 / (λ_k − λ_{k+1}) ),  where Δ = Ĉ_X − C_X.

The proposition says that the error between the data matrix X and the reconstruction X̂ using the top-k principal vectors is bounded. Recall from (2) and (3) that the values of the CCA objective are denoted by F and F̂ respectively. The following theorem states that we can bound the error E = F − F̂ (the proof is in the Appendix); the proof upper-bounds E by the reconstruction error of the data projected onto the principal directions, using Proposition 1. Theorem 1. Under the hypotheses and assumptions above, the approximation error E = F − F̂ is bounded and goes to zero, while the whitening constraints in equation 3b remain satisfied. Now, the only unresolved issue is an optimization scheme for equation 3a that keeps the constraints in equation 3b satisfied by leveraging the geometry of the feasible set.

2.2. HOW TO NUMERICALLY OPTIMIZE (3A) SATISFYING CONSTRAINTS IN (3B)?

Overview. We now describe how to maximize the formulation in equation 3a-equation 3b with respect to Ũ, Ṽ, Q_u, Q_v, S_u and S_v. We first compute the top-k principal vectors to get Ũ and Ṽ. Then, we use gradient update rules to solve for Q_u, Q_v, S_u and S_v to improve the objective. Since all these matrices are "structured", care must be taken to ensure that the matrices remain on their respective manifolds; this is where the geometry of the manifolds offers desirable properties. We re-purpose Riemannian stochastic gradient descent (RSGD) to do this, so we call our algorithm RSG+. Of course, more sophisticated Riemannian optimization techniques can be substituted in; for instance, different Riemannian optimization methods are available in Absil et al. (2007), and optimization schemes for many manifolds are offered in PyManOpt (Boumal et al., 2014). The algorithm block is presented in Algorithm 1 (a directly implementable version of the algorithm, including the expressions for the gradients, is presented in Appendix A.3). Let F_pri = trace(Ũ^T C_X Ũ) + trace(Ṽ^T C_Y Ṽ) be the contribution from the principal directions, which we use to ensure the "whitening constraint". Let F_can = trace(U^T C_XY V) be the contribution from the canonical correlation directions. The algorithm consists of four main blocks denoted by different colors: (a) the red block deals with the gradient calculation for the top-k principal vectors (via F_pri) with respect to Ũ, Ṽ; (b) the green block describes the calculation of the gradient corresponding to the canonical directions (via F_can) with respect to Ũ, Ṽ, S_u, S_v, Q_u and Q_v; (c) the gray block combines the gradients from both F_pri and F_can with respect to the unknowns Ũ, Ṽ, S_u, S_v, Q_u and Q_v; and finally (d) the blue block performs a batch update of the canonical directions using Riemannian gradient updates. Gradient calculations.
The gradient update for Ũ, Ṽ is divided into two parts. (a) The red-block gradient updates the "principal" directions (denoted ∇_Ũ F_pri and ∇_Ṽ F_pri), which are specifically designed to satisfy the whitening constraint. Since this requires updating the principal subspaces, gradient descent needs to proceed on the manifold of subspaces, i.e., on the Grassmannian. (b) The green-block gradient from the objective function in equation 3 is denoted ∇_Ũ F_can and ∇_Ṽ F_can. To ensure that the Riemannian gradient update for Ũ and Ṽ stays on the manifold St(k, d), we need to make sure that the gradients ∇_Ũ F_can and ∇_Ṽ F_can lie in the tangent space of St(k, d). To do so, we first calculate the Euclidean gradient and then project it onto the tangent space of St(k, d). The gradient updates for Q_u, Q_v, S_u, S_v are given in the green block, denoted ∇_{Q_u} F_can, ∇_{Q_v} F_can, ∇_{S_u} F_can and ∇_{S_v} F_can. Note that, unlike the previous step, these gradients only have components from the canonical correlation computation. As before, this step requires first computing the Euclidean gradient and then projecting onto the tangent space of the underlying Riemannian manifold, i.e., SO(k) or the space of upper-triangular matrices. Finally, we obtain the gradient to update the canonical directions by combining the gradients, as shown in the gray block. With these gradients, we can perform a batch update as shown in the blue block. Using the convergence results presented next in Propositions 2-3, this scheme can be shown (under some assumptions) to approximately optimize the CCA objective in equation 2. We can now move to the convergence properties of the algorithm. We present two results stating asymptotic convergence for the top-k principal vectors and the canonical directions in the algorithm. Proposition 3. Let {A_l} be the iterates of Riemannian gradient descent on a manifold M with step sizes {γ_l} satisfying (a) Σ_l γ_l^2 < ∞ and (b) Σ_l γ_l = ∞. Suppose {A_l} lie in a compact set K ⊂ M. We also suppose that there exists D > 0 such that g_{A_l}(∇_{A_l} F, ∇_{A_l} F) ≤ D. Then ∇_{A_l} F → 0 as l → ∞. Notice that in our problem, the manifold M can be Gr(p, n), St(p, n) or SO(p); hence all the assumptions in Proposition 3 are satisfied as long as the step sizes satisfy the aforementioned conditions. One example of step sizes satisfying the property is γ_l = 1/(l+1).
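As an illustration of the project-then-retract pattern described above, here is a sketch (our own naming, not the authors' implementation) of one Riemannian SGD step on St(k, d); a first-order QR retraction is used in place of the exact exponential map, a common substitution in practice:

```python
import numpy as np

def stiefel_tangent_project(X, G):
    """Project a Euclidean gradient G onto T_X St(k, d):
       G - X sym(X^T G), where sym(A) = (A + A^T) / 2."""
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2.0

def qr_retract(X, xi):
    """First-order retraction: the Q factor of X + xi, with diag(R) > 0
       to make the decomposition (and hence the retraction) unique."""
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.diag(R))

def rsgd_step(X, G, lr):
    """One Riemannian SGD step: project the Euclidean gradient, then retract."""
    return qr_retract(X, -lr * stiefel_tangent_project(X, G))
```

The analogous steps for SO(k) and the upper-triangular factors follow the same pattern with their respective tangent-space projections.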

2.3. CONVERGENCE RATE AND COMPLEXITY OF THE RSG+ ALGORITHM

In this section, we describe the convergence rate and complexity of the algorithm proposed in Algorithm 1. Observe that the key component of Algorithm 1 is the Riemannian gradient update. Let A_t be a generic entity updated in the algorithm via the Riemannian gradient update A_{t+1} = Exp_{A_t}(−γ_t ∇_{A_t} F) (see Golub & Reinsch (1971)). Since the matrices involved are of size d × k, each such update costs O(d^2 k); all other calculations are dominated by this term.

3. EXPERIMENTS

We first evaluate RSG+ for extracting the top-k canonical components on three benchmark datasets and show that it performs favorably compared with Arora et al. (2017). Then, we show that RSG+ also fits into feature learning in DeepCCA (Andrew et al., 2013), and can scale to large feature dimensions where the non-stochastic method fails. Finally, we show that RSG+ can be used to improve the fairness of deep neural networks without needing labels of protected attributes during training.

Algorithm 1 (RSG+): Riemannian SGD based algorithm to compute canonical directions. Input: X ∈ R^{N×d_x}, Y ∈ R^{N×d_y}, k > 0. Output: U ∈ R^{d_x×k}, V ∈ R^{d_y×k}. Initialize Ũ, Ṽ, Q_u, Q_v, S_u, S_v; partition X, Y into batches of size B, with the j-th batches denoted X_j and Y_j. (The full listing, with gradient expressions, is in Appendix A.3.)

Metric. We use the proportion of correlations captured (PCC), partly due to its efficiency, especially for relatively large datasets. Let Û ∈ R^{d_x×k}, V̂ ∈ R^{d_y×k} denote the estimated subspaces returned by RSG+, and U* ∈ R^{d_x×k}, V* ∈ R^{d_y×k} denote the true canonical subspaces (all for top-k). The PCC is defined as PCC = TCC(XÛ, YV̂) / TCC(XU*, YV*), where TCC is the sum of canonical correlations between two matrices. Performance. See A.4 for the implementation details. The performance in terms of PCC as a function of the number of seen samples (arriving in a streaming fashion) is shown in Fig. 1, and the runtime is reported in A.5. Our RSG+ captures more correlation than MSG (Arora et al., 2017) while being 5-10 times faster. One case where RSG+ underperforms Arora et al. (2017) is when the top-k eigenvalues are dominated by the top-l eigenvalues with l < k (Fig. 1b): on the Mediamill dataset, the top-4 eigenvalues of the covariance matrix of view 1 are 8.61, 2.99, 1.15, 0.37. The first eigenvalue is dominantly large compared with the rest, and RSG+ performs better for k = 1 but worse than Arora et al. (2017) for k = 2, 4.
We also plot the runtime of RSG+ under different data dimensions (setting d_x = d_y = d) and numbers of total samples drawn from a joint Gaussian distribution in A.5. (Algorithm 1, continued: for each batch j, the gradients ∇_Ũ F_pri, ∇_Ṽ F_pri for the top-k principal vectors are computed after partitioning X_j (Y_j) into L = B/k blocks of size d_x × k (d_y × k); see Appendix A.3 for the remainder of the listing.) We implemented the method from Yger et al. (2012) and conducted experiments on the three datasets above. The results are shown in Table 1. We tune the step size within [0.0001, 0.1] and use β = 0.99 as in their paper. On MNIST and Mediamill, the method performs comparably with ours, except for the k = 4 case on MNIST where it does not converge well. Since this algorithm also has O(d^3) complexity, its runtime is 100× more than ours on MNIST and 20× more on Mediamill. On CIFAR10, we failed to find a suitable step size for convergence.
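For reference, the PCC metric described above can be sketched as follows (our own minimal implementation; `tcc` sums the canonical correlations between two projected views, computed by whitening each view and summing the singular values of their cross-covariance):

```python
import numpy as np

def tcc(A, B):
    """Sum of canonical correlations between the columns of A and B."""
    def whiten(M):
        M = M - M.mean(axis=0)
        C = M.T @ M / M.shape[0]
        w, Q = np.linalg.eigh(C)
        return M @ Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
    Aw, Bw = whiten(A), whiten(B)
    # singular values of the whitened cross-covariance = canonical correlations
    s = np.linalg.svd(Aw.T @ Bw / A.shape[0], compute_uv=False)
    return s.sum()

def pcc(X, Y, U_hat, V_hat, U_star, V_star):
    """Proportion of correlations captured, relative to the true subspaces."""
    return tcc(X @ U_hat, Y @ V_hat) / tcc(X @ U_star, Y @ V_star)
```

When the estimated subspaces coincide with the true ones, PCC equals 1 by construction.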

3.2. CCA FOR DEEP FEATURE LEARNING

Background and motivation. A deep neural network (DNN) extension of CCA was proposed by Andrew et al. (2013) and has become popular in multi-view representation learning tasks. The idea is to learn a deep neural network as the mapping from the original data space to a latent space where the canonical correlations are maximized. We refer the reader to Andrew et al. (2013) for details of the task. Since deep neural networks are usually trained using SGD on mini-batches, this requires an estimate of the CCA objective at every iteration in a streaming fashion; thus our RSG+ is a natural choice here. We conduct experiments on a noisy version of the MNIST dataset to evaluate RSG+. View 2 is randomly sampled from the same class as view 1. Then we add independent uniform noise from [0, 1] to each pixel. Finally, the image is truncated into [0, 1] to form view 2.

3.3. TRAINING FAIR MODELS

Background and motivation. Fairness is becoming an important issue to consider in the design of learning algorithms. A common strategy to make an algorithm fair is to remove the influence of one or more protected attributes when training the models; see Lokhande et al. (2020). Most methods assume that the labels of protected attributes are known during training, but this may not always be possible. CCA enables considering a slightly different setting, where we may not have per-sample protected attributes, which may be sensitive or hard to obtain for third parties (Price & Cohen, 2019). On the other hand, we assume that a model trained to predict the protected attribute labels is provided. For example, if the protected attribute is gender, we only assume that a well-trained classifier which predicts gender from the samples is available, rather than sample-wise gender values themselves. We next demonstrate that fairness of the model, using standard measures, can be improved via constraints on correlation values from CCA. Method. Our strategy is inspired by Morcos et al.
(2018), which showed that canonical correlations can reveal similarity between neural networks: when two networks (with the same architecture) are trained using different labels/schemes, for example, canonical correlations can indicate how similar their features are. Our observation is the following. Consider a classifier trained on gender (the protected attribute) and another classifier trained on attractiveness. If the features extracted by the latter model share high similarity with those of the model trained to predict gender, then the latter model is likely influenced by features in the image pertinent to gender, which will lead to an unfairly biased model. We show that by imposing a loss on the canonical correlations between the network being trained (for which we lack per-sample protected-attribute information) and a well-trained classifier pre-trained on the protected attributes, we can obtain a fairer model. This may enable training fairer models in settings which would otherwise be difficult. Implementation details. To simulate the case where we only have a pretrained network on protected attributes, we train a ResNet-18 (He et al., 2016) on the gender attribute, and when we train the classifier to predict attractiveness, we add a loss using the canonical correlations between these two networks on intermediate layers: L_total = L_cross-entropy + L_CCA, where the first term is the standard cross-entropy term and the second term is the canonical correlation. See A.7 for more details of training/evaluation. Results. We choose two commonly used error metrics for fairness: difference in Equality of Opportunity (Hardt et al., 2016) (DEO), and difference in Demographic Parity (Yao & Huang, 2017) (DDP). See appendix A.6 for a more detailed explanation of the two metrics. We conduct experiments by applying the canonical correlation loss on three different layers in ResNet-18.
In Table 3, we can see that applying the canonical correlation loss generally improves the DEO and DDP metrics (lower is better) over the standard model (trained using the cross-entropy loss only). Specifically, applying the loss on early layers such as conv0 and conv1 yields better performance than applying it at a relatively late layer like conv2. Another promising aspect of our approach is that it can easily handle the case where the protected attribute is a continuous variable (as long as a well-trained regression network for the protected attribute is given), while other methods such as Lokhande et al. (2020); Zhang et al. (2018) need to first discretize the variable and then enforce constraints, which can be much more involved.
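A minimal sketch of the loss described above; the names `feats_task`, `feats_prot`, the top-k truncation, and the weight `lam` are our own assumptions for illustration (the paper's exact loss and layer choices are in A.7):

```python
import numpy as np

def cca_penalty(feats_task, feats_prot, k=4, eps=1e-6):
    """Sum of the top-k canonical correlations between two feature batches.
    A high value means the task network's features resemble those of the
    protected-attribute network; the penalty pushes this down. The small
    eps ridge is our addition for numerical stability."""
    def whiten(M):
        M = M - M.mean(axis=0)
        C = M.T @ M / M.shape[0] + eps * np.eye(M.shape[1])
        w, Q = np.linalg.eigh(C)
        return M @ Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
    A, B = whiten(feats_task), whiten(feats_prot)
    s = np.linalg.svd(A.T @ B / A.shape[0], compute_uv=False)
    return s[:k].sum()

def total_loss(ce_loss, feats_task, feats_prot, lam=0.1):
    # L_total = L_cross-entropy + lam * L_CCA (lam is an assumed weighting)
    return ce_loss + lam * cca_penalty(feats_task, feats_prot)
```

In actual training this term would be computed on intermediate-layer activations of the two networks within each mini-batch (with a differentiable implementation); the numpy version here only illustrates the quantity being penalized.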

4. RELATED WORK

Stochastic CCA: There has been much interest in designing scalable and provable algorithms for CCA. Ma et al. (2015) proposed the first stochastic algorithm for CCA, though convergence is only proven for the non-stochastic version. Wang et al. (2016) designed an algorithm which uses alternating SVRG combined with shift-and-invert preconditioning, with global convergence. These stochastic methods, together with Ge et al. (2016); Allen-Zhu & Li (2016), which reduce the CCA problem to a generalized eigenvalue problem and solve it via an efficient power method, all belong to the class of methods that solve the empirical CCA problem. This can be seen as an ERM approximation of the original population objective, and it requires numerically optimizing the empirical CCA objective on a fixed dataset. These methods usually assume access to the full dataset at the outset, which is not well suited to many practical applications where data tend to arrive in a streaming way. Recently, there has been increasing interest in the population CCA problem (Arora et al., 2017; Gao et al., 2019). The main difficulty in the population setting is that we have limited knowledge of the objective unless we know the distributions of X and Y. Arora et al. (2017) handle this problem by deriving an estimate of the gradient of the population objective whose error can be properly bounded, so that applying proximal gradient steps to a convex relaxation of the objective provably converges. Gao et al. (2019) provide a tightened analysis of the time complexity of the algorithm in Wang et al. (2016), and provide a sample complexity under certain distributional assumptions. The problem we solve in this work is the same as that in Arora et al. (2017).

5. CONCLUSIONS

In this work, we presented a stochastic approach (RSG+) for the CCA model based on the observation that the solution of CCA can be decomposed into a product of matrices which lie on certain structured spaces. This affords specialized numerical schemes and makes the optimization more efficient. The optimization is based on Riemannian stochastic gradient descent, and we provide a proof of its O(1/t) convergence rate with O(d^2 k) time complexity per iteration. In experimental evaluations, we find that RSG+ behaves favorably relative to the baseline stochastic CCA method in capturing the correlation in the datasets. We also demonstrate the use of RSG+ in the DeepCCA setting, showing feasibility when scaling to large dimensions, as well as an interesting use case in training fair models.

A APPENDIX A.1 A BRIEF REVIEW OF RELEVANT DIFFERENTIAL GEOMETRY CONCEPTS

To make the paper self-contained, we briefly review certain differential geometry concepts. We only include a condensed description, as needed for our algorithm and analysis, and refer the interested reader to Boothby (1986) for a comprehensive and rigorous treatment of the topic. Riemannian manifold: A Riemannian manifold M (of dimension m) is a (smooth) topological space which is locally diffeomorphic to the Euclidean space R^m. Additionally, M is equipped with a Riemannian metric g, defined as g_X : T_X M × T_X M → R, where T_X M is the tangent space of M at X; see Fig. 2. If X ∈ M and U ∈ T_X M, the Riemannian exponential map at X, denoted Exp_X : T_X M → M, is defined as Exp_X(U) = γ(1), where γ : [0, 1] → M is the geodesic with γ(0) = X and dγ/dt|_{t=0} = U. In general, Exp_X is not invertible, but the inverse Exp_X^{-1} : U ⊂ M → T_X M is defined on U = B_r(X), where r is called the injectivity radius (Boothby, 1986) of M. This concept will be useful in defining the mechanics of gradient descent on the manifold. In our reformulation, we make use of the following manifolds, specifically when decomposing U and V into a product of several matrices: (a) St(p, n): the manifold of n × p column-orthonormal matrices; (b) Gr(p, n): the manifold of p-dimensional subspaces of R^n; (c) SO(n): the manifold/group of n × n special orthogonal matrices, i.e., the space of orthogonal matrices with determinant 1. Differential geometry of SO(n): SO(n) is a compact Riemannian manifold; hence, by the Hopf-Rinow theorem, it is also a geodesically complete manifold (Helgason, 2001). Its geometry is well understood; we recall a few relevant concepts here and refer the reader to Helgason (2001) for details. SO(n) has a Lie group structure and the corresponding Lie algebra so(n) is defined as so(n) = {W ∈ R^{n×n} | W^T = −W}.
In other words, so(n) (the set of left-invariant vector fields with the associated Lie bracket) can be identified with the set of n × n anti-symmetric matrices. The Lie bracket [·,·] on so(n) is the matrix commutator: for U, V ∈ so(n), [U, V] = UV - VU. We can define a Riemannian metric on SO(n) by ⟨U, V⟩_X = trace(U^T V), where U, V ∈ T_X SO(n) and X ∈ SO(n); it can be shown that this metric is bi-invariant. Under this bi-invariant metric, the Riemannian exponential and inverse exponential maps are as follows: for X, Y ∈ SO(n) and U ∈ T_X SO(n), Exp_X(U) = X exp(X^T U) and Exp_X^{-1}(Y) = X log(X^T Y), where exp and log denote the matrix exponential and logarithm, respectively.
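These two maps are straightforward to realize numerically with the matrix exponential and logarithm. A minimal sketch (the function names are ours, not from the paper):

```python
import numpy as np
from scipy.linalg import expm, logm

def so_exp(X, U):
    """Exp_X(U) = X expm(X^T U) on SO(n); here X^T U is anti-symmetric."""
    return X @ expm(X.T @ U)

def so_log(X, Y):
    """Exp_X^{-1}(Y) = X logm(X^T Y) on SO(n), valid when Y lies within
    the injectivity radius of X (rotation angle of X^T Y below pi)."""
    return X @ np.real(logm(X.T @ Y))
```

Note that tangent vectors at X have the form XW with W anti-symmetric, so X^T U lands in so(n) before the matrix exponential is applied.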

Differential Geometry of the Stiefel manifold:

The set of all full column rank n × p real matrices forms a Stiefel manifold St(p, n), where n ≥ p. The compact Stiefel manifold is the set of all column-orthonormal real matrices. When p < n, St(p, n) can be identified with SO(n)/SO(n - p). Note that when we consider the quotient space SO(n)/SO(n - p), we identify SO(n - p) with the subgroup F(SO(n - p)) of SO(n), where F : SO(n - p) → SO(n), defined by X ↦ [[I_p, 0], [0, X]], is an isomorphism from SO(n - p) onto F(SO(n - p)).

Differential geometry of the Grassmannian Gr(p, n): The Grassmann manifold (or Grassmannian) is defined as the set of all p-dimensional linear subspaces of R^n and is denoted by Gr(p, n), where p, n ∈ Z+ and n ≥ p. At every point X ∈ St(p, n), we define the vertical space V_X ⊂ T_X St(p, n) to be Ker(Π_{*X}). Further, given g^St, we define the horizontal space H_X to be the g^St-orthogonal complement of V_X. Now, from the theory of principal bundles, for every vector field Ũ on Gr(p, n), we define the horizontal lift of Ũ to be the unique vector field U on St(p, n) for which U_X ∈ H_X and Π_{*X} U_X = Ũ_{Π(X)}, for all X ∈ St(p, n). As Π is a Riemannian submersion, the isomorphism Π_{*X}|_{H_X} : H_X → T_{Π(X)} Gr(p, n) is an isometry from (H_X, g^St_X) to (T_{Π(X)} Gr(p, n), g^Gr_{Π(X)}). So g^Gr is defined as:

g^Gr_{Π(X)}(Ũ_{Π(X)}, Ṽ_{Π(X)}) = g^St_X(U_X, V_X) = trace((X^T X)^{-1} U_X^T V_X)    (4)

where Ũ, Ṽ ∈ T_{Π(X)} Gr(p, n), Π_{*X} U_X = Ũ_{Π(X)}, Π_{*X} V_X = Ṽ_{Π(X)}, and U_X, V_X ∈ H_X. We covered the exponential map and the Riemannian metric above; their explicit forms for the manifolds listed above are collected in Table 4 for easy reference.

Manifold | g_X(U, V) | Exp_X(U) | Exp_X^{-1}(Y)
St(p, n), Kaneko et al. (2012) | trace(U^T V) | Ū V̄^T, where Ū S̄ V̄^T = svd(X + U) | (Y - X) - X(Y - X)^T X
Gr(p, n), Absil et al. (2004) | trace(Π_*^{-1}(U)^T Π_*^{-1}(V)) | Π(Ū V̄^T), where Ū S̄ V̄^T = svd(X̄ + Ū), X = Π(X̄) | Ȳ(X̄^T Ȳ)^{-1} - X̄, where Y = Π(Ȳ)
SO(n), Subbarao & Meer (2009) | trace((X^T U)^T (X^T V)) | X expm(X^T U) | X logm(X^T Y)

Table 4: Explicit forms for some operations we need. Π(X) returns X's column space; Π_* is Π's differential.
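The St(p, n) row of Table 4 is simple to implement: the Exp column is the SVD-based retraction, and the Exp^{-1} column is its first-order inverse. A small sketch under those formulas (function names are ours):

```python
import numpy as np

def st_exp(X, U):
    """Table 4, St(p, n): with Ub S Vb^T = svd(X + U), Exp_X(U) = Ub Vb^T."""
    Ub, _, Vbt = np.linalg.svd(X + U, full_matrices=False)
    return Ub @ Vbt

def st_log(X, Y):
    """Table 4, St(p, n): Exp_X^{-1}(Y) = (Y - X) - X (Y - X)^T X."""
    D = Y - X
    return D - X @ D.T @ X
```

By construction, the SVD-based map always returns a column-orthonormal matrix, which is what makes it a valid retraction onto the compact Stiefel manifold.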

A.2 PROOF OF THEOREM 1

We first restate the assumptions from Section 3. Let F be the optimal trace value of Eq. (2) and F̃ the optimal trace value of Eqs. (3a) and (3b). We next restate Theorem 1 and give its proof.

Theorem. Under the assumptions and notation above, the approximation error E = |F - F̃| is bounded, and it goes to zero as the whitening constraints in Eq. (3b) are satisfied.

Proof. Let Q_u, S_u, Q_v, S_v be the solutions of Eqs. (3a) and (3b), let Ū and V̄ be the matrices of top-k eigenvectors of (1/N) X^T X and (1/N) Y^T Y respectively, and let U, V be the solutions of Eq. (2). Let X̃_u = X Ū Q_u S_u and Ỹ_v = Y V̄ Q_v S_v; also let X_u = XU and Y_v = YV. Observe that the means of X̃_u, Ỹ_v, X_u and Y_v are zero. Moreover, the sample covariances of X_u and Y_v are given by U^T C_X U and V^T C_Y V respectively; thus, by the constraints in Eq. (2), X_u^T X_u = I_k and Y_v^T Y_v = I_k. Denote these covariance matrices by C(X_u) and C(Y_v). Analogously, the sample covariances of X̃_u and Ỹ_v are given by S_u^T Q_u^T Ū^T C_X Ū Q_u S_u and S_v^T Q_v^T V̄^T C_Y V̄ Q_v S_v; denote them by C(X̃_u) and C(Ỹ_v). Using Def. 1, X_u, Y_v, X̃_u and Ỹ_v follow sub-Gaussian distributions.

Let F = trace(U^T C_XY V), which can be rewritten as F = trace(X_u^T Y_v). Similarly, let F̃ = trace(S_u^T Q_u^T Ū^T C_XY V̄ Q_v S_v), which can be rewritten as F̃ = trace(X̃_u^T Ỹ_v). Consider the approximation error between the objective values, E = |F - F̃| = |trace(X_u^T Y_v) - trace(X̃_u^T Ỹ_v)|. By von Neumann's trace inequality and the Cauchy-Schwarz inequality, we have

E = |trace(X_u^T Y_v - X̃_u^T Ỹ_v)|
  ≤ |trace((X̃_u - X_u)^T (Ỹ_v - Y_v))|
  ≤ Σ_i σ_i(X̃_u - X_u) σ_i(Ỹ_v - Y_v)    (von Neumann's trace inequality)
  ≤ ‖X̃_u - X_u‖_F ‖Ỹ_v - Y_v‖_F    (Cauchy-Schwarz inequality)    (A.1)

where σ_i(A) denotes the i-th singular value of a matrix A and ‖·‖_F denotes the Frobenius norm.
Now, using Proposition 1, we get

‖X̃_u - X_u‖_F ≤ min{ √(2k) ‖∆_x‖_2, 2‖∆_x‖_2^2 / (λ^x_k - λ^x_{k+1}) }
‖Ỹ_v - Y_v‖_F ≤ min{ √(2k) ‖∆_y‖_2, 2‖∆_y‖_2^2 / (λ^y_k - λ^y_{k+1}) }    (A.2)

where ∆_x = C(X_u) - C(X̃_u) and ∆_y = C(Y_v) - C(Ỹ_v). Here, the λ^x's and λ^y's are the eigenvalues of C(X_u) and C(Y_v), respectively. Now, note that C(X_u) = I_k and C(Y_v) = I_k, as X_u and Y_v are solutions of Eq. (2). Furthermore, assume λ^x_k - λ^x_{k+1} ≥ Λ and λ^y_k - λ^y_{k+1} ≥ Λ for some Λ > 0. Then we can rewrite Eq. (A.1) as

E ≤ min{ √(2k) ‖I_k - C(X̃_u)‖_2, 2‖I_k - C(X̃_u)‖_2^2 / Λ } · min{ √(2k) ‖I_k - C(Ỹ_v)‖_2, 2‖I_k - C(Ỹ_v)‖_2^2 / Λ }.
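The two inequalities used in the chain of Eq. (A.1) can be sanity-checked numerically: von Neumann's trace inequality bounds |trace(A^T B)| by the sum of pairwise singular-value products, and Cauchy-Schwarz bounds that sum by the product of Frobenius norms. A quick check on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 3))

lhs = abs(np.trace(A.T @ B))                  # |trace(A^T B)|
sa = np.linalg.svd(A, compute_uv=False)
sb = np.linalg.svd(B, compute_uv=False)
mid = float(np.sum(sa * sb))                  # von Neumann bound
rhs = np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")  # Cauchy-Schwarz bound
```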

A.3 RSG+ ALGORITHM

Here we show our algorithm with more details about the gradients in every step in Alg. 2.

A.4 IMPLEMENTATION DETAILS OF CCA ON FIXED DATASET

Implementation details. On all three benchmark datasets, we pass over the data only once for both our RSG+ and MSG (Arora et al., 2017), and we use the code from Arora et al. (2017) to produce the MSG results. We conduct experiments with different target-space dimensions k = 1, 2, 4; the choice of k is motivated by the fact that the spectrum of these datasets decays quickly. Since our RSG+ processes data in small blocks, we let the data arrive in mini-batches (the mini-batch size was set to 100).

A.5 RUNTIME OF RSG+ AND BASELINE METHODS

The runtime comparison of RSG+ and MSG is reported in Table 5; our algorithm is 5-10 times faster. We also plot the runtime of our algorithm for different data dimensions (setting d_x = d_y = d) and different numbers of total samples drawn from a joint Gaussian distribution in Fig. 3.

A.6 ERROR METRICS FOR FAIRNESS

Equality of Opportunity (EO) (Hardt et al., 2016): A classifier h satisfies EO if the prediction is independent of the protected attribute s (in our experiments, s is a binary variable where s = 1 denotes Male and s = 0 denotes Female), conditioned on the classification label y ∈ {0, 1}. We use the difference in false negative rate (conditioned on y = 1) across the two groups identified by the protected attribute s as the error metric, denoted DEO. Demographic Parity (DP) (Yao & Huang, 2017): A classifier h satisfies DP if the likelihood of a positive prediction is independent of the protected attribute s. We denote the difference in demographic parity between the two groups identified by the protected attribute as DDP.
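Both error metrics reduce to simple group-wise rate differences. A minimal sketch of how DEO and DDP could be computed from binary predictions (function and variable names are ours, not from the paper):

```python
import numpy as np

def deo(y_true, y_pred, s):
    """DEO: absolute difference of false negative rates (conditioned on
    y = 1) between the groups s = 0 and s = 1."""
    rates = []
    for g in (0, 1):
        mask = (s == g) & (y_true == 1)
        rates.append(np.mean(y_pred[mask] == 0))
    return abs(rates[0] - rates[1])

def ddp(y_pred, s):
    """DDP: absolute difference of positive-prediction rates between
    the groups s = 0 and s = 1."""
    return abs(np.mean(y_pred[s == 0]) - np.mean(y_pred[s == 1]))
```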

A.7 IMPLEMENTATION DETAILS OF FAIRNESS EXPERIMENTS

Implementation details. The network is trained for 20 epochs with learning rate 0.01 and batch size 256. We follow Donini et al. (2018) in using NVP (novel validation procedure) to evaluate our results: first, we search for the hyperparameters that achieve the highest classification score, and then report the performance of the model with the minimum fairness error metric among those whose accuracy lies within the top 90% of accuracies. When we apply our RSG+ to certain layers, we first use a randomized projection to project the features to 1k dimensions, and then extract the top-10 canonical components for training. As in our previous experiments on DeepCCA, the batch method does not scale to 1k dimensions.

A.8 RESNET-18 ARCHITECTURE AND POSITION OF CONV-0,1,2 IN TABLE 3

ResNet-18 contains a first convolutional layer followed by normalization, a nonlinear activation, and max pooling. It then has four residual blocks, followed by average pooling and a fully connected layer. We denote the position after the first convolutional layer as conv0, the position after the first residual block as conv1, and the position after the second residual block as conv2. We choose early layers since late layers close to the final fully connected layer have features more directly relevant to the classification variable (attractiveness in this case).



Output (final step of Algorithm 2): U = Ū Q_u S_u and V = V̄ Q_v S_v;



Proposition 2 (Chakraborty et al. (2020)). (Asymptotically) If the samples X are drawn from a Gaussian distribution, then the gradient update rule presented in Step 5 of Algorithm 1 returns an orthonormal basis: the top-k principal vectors of the covariance matrix C_X.

Proposition 3 (Bonnabel (2013)). Consider a connected Riemannian manifold M with injectivity radius bounded from below by I > 0. Assume that the sequence of step sizes (γ_l) satisfies the condition Σ_l γ_l = ∞ and Σ_l γ_l^2 < ∞,
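Step 5 of Algorithm 1 is not reproduced here, but the flavor of Proposition 2 can be illustrated with a generic Oja-style streaming principal-subspace update followed by an SVD retraction onto the Stiefel manifold. This is our own sketch under that interpretation, not the paper's exact update rule:

```python
import numpy as np

def streaming_pca_step(U, x, step):
    """One Oja-style streaming update: move U along the sample gradient
    x x^T U of the Rayleigh trace, then retract onto St(k, d) via SVD."""
    G = np.outer(x, x) @ U
    P, _, Qt = np.linalg.svd(U + step * G, full_matrices=False)
    return P @ Qt

# toy run: recover the top-2 principal subspace of a diagonal covariance
rng = np.random.default_rng(0)
d, k = 8, 2
scales = np.array([5.0, 3.0] + [0.1] * (d - 2))  # strong gap after 2 directions
U = np.linalg.qr(rng.standard_normal((d, k)))[0]
for t in range(3000):
    x = scales * rng.standard_normal(d)
    U = streaming_pca_step(U, x, step=1.0 / (t + 10))
```

With the large eigengap assumed above, the iterate U aligns with the span of the first two coordinate axes while remaining orthonormal at every step.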

Figure 1: Performance on three datasets in terms of PCC as a function of # of seen samples.

CelebA (Wang et al., 2015b) consists of 200K celebrity face images collected from the internet. There are up to 40 labels, each of which is binary-valued. Here, we follow Lokhande et al. (2020) and focus on the attractiveness attribute (which we want to train a classifier to predict); gender is treated as "protected," since using it may lead to an unfair classifier according to Lokhande et al. (2020).

Gao et al. (2019) optimize the population objective of CCA in a streaming fashion. Riemannian optimization: Riemannian optimization generalizes standard Euclidean optimization methods to smooth manifolds, and takes the following form: given f : M → R, solve min_{x∈M} f(x), where M is a Riemannian manifold. One advantage is that it provides a natural way to express many constrained optimization problems as unconstrained ones. Applications include matrix and tensor factorization (Ishteva et al., 2011; Tan et al., 2014), PCA (Edelman et al., 1998), CCA (Yger et al., 2012), and so on. Yger et al. (2012) rewrite the CCA formulation as Riemannian optimization on the Stiefel manifold. In our work, we further explore the capabilities of the Riemannian optimization framework, decomposing the linear space spanned by the canonical vectors into products of several matrices which lie on different Riemannian manifolds.

Figure 2: Schematic description of an exemplar manifold M and a visual illustration of the Exp and Exp^{-1} maps.

Assumptions: (a) The random variables X ∼ N(0, Σ_x) and Y ∼ N(0, Σ_y) with Σ_x ⪯ cI_d and Σ_y ⪯ cI_d for some c > 0. (b) The samples X and Y drawn from X and Y, respectively, have zero mean. (c) For a given k ≤ d, Σ_x and Σ_y have non-zero top-k eigenvalues.

As C(X̃_u) → I_k or C(Ỹ_v) → I_k, we have E → 0. Observe that these limiting conditions on C(X̃_u) and C(Ỹ_v) are exactly what the "whitening" constraint enforces. In other words, since C(X_u) = I_k and C(Y_v) = I_k, as C(X̃_u) and C(Ỹ_v) converge to C(X_u) and C(Y_v), the approximation error goes to zero.

where γ_t is the step size at time step t. Also assume {A_t} ⊂ M for a Riemannian manifold M. The following proposition states that, under certain assumptions, the Riemannian gradient update has a convergence rate of O(1/t).

Proposition 4 (Nemirovski et al. (2009); Bécigneul & Ganea (2018)). Let {A_t} lie inside a geodesic ball of radius less than the minimum of the injectivity radius and the strong convexity radius of M. Assume M is a geodesically complete Riemannian manifold with sectional curvature lower bounded by κ ≤ 0. Moreover, assume that the sum of the step sizes {γ_t} diverges while the sum of the squared step sizes converges. Then the Riemannian gradient descent update given by A_{t+1} = Exp_{A_t}(-γ_t ∇_{A_t} F), with a bounded gradient, i.e., ‖∇_{A_t} F‖ ≤ C < ∞ for some C ≥ 0, converges at the rate O(1/t).

All the Riemannian manifolds we use, i.e., Gr(k, d), St(k, d) and SO(k), are geodesically complete, and these manifolds have non-negative sectional curvature, i.e., lower bounded by κ = 0. Hence, as long as the Riemannian updates lie inside the geodesic ball of radius less than the minimum of the injectivity and convexity radii, the convergence rate for RGD applies in our setting.

Running time. To evaluate time complexity, we must examine the main compute-heavy steps. The basic modules are the Exp and Exp^{-1} maps for the St(k, d), Gr(k, d) and SO(k) manifolds (see Table 4 in the appendix). Observe that the complexity of these modules is driven by the SVD required by the Exp map for the St and Gr manifolds. Our algorithm involves structured matrices
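The update A_{t+1} = Exp_{A_t}(-γ_t ∇_{A_t} F) from Proposition 4 can be illustrated on the simplest manifold with a closed-form Exp map, the unit sphere. This is a hedged toy example (the sphere is not one of the manifolds RSG+ uses), minimizing the linear functional F(x) = -v^T x whose minimizer over the sphere is v itself:

```python
import numpy as np

def sphere_exp(x, u):
    """Exponential map on the unit sphere S^{d-1} at x, for tangent u."""
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return x
    return np.cos(nu) * x + np.sin(nu) * (u / nu)

v = np.array([0.6, 0.0, 0.8])          # unit-norm target; minimizer of F
x = np.array([0.0, 1.0, 0.0])          # initial point on the sphere
for t in range(1, 501):
    egrad = -v                          # Euclidean gradient of F(x) = -v^T x
    rgrad = egrad - (x @ egrad) * x     # project onto the tangent space at x
    x = sphere_exp(x, -(1.0 / t) * rgrad)  # A_{t+1} = Exp_{A_t}(-γ_t ∇F)
```

The step sizes γ_t = 1/t satisfy the divergence/square-summability conditions of the proposition, and the iterate converges to v while staying exactly on the manifold.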

...let the l-th block be denoted by Z^x_l (Z^y_l); 3. Orthogonalize each block, and let the orthogonalized block be denoted by Ẑ^x_l (Ẑ^y_l). Here, ∇_Ū F_can, ∇_V̄ F_can, ∇_{Q_u} F_can, ∇_{Q_v} F_can, ∇_{S_u} F_can, ∇_{S_v} F_can denote the Riemannian gradients of Ū, V̄, Q_u, Q_v, S_u and S_v for the objective in Eq. (3), i.e., the CCA objective.

We conduct experiments on three benchmark datasets (MNIST (LeCun et al., 2010), Mediamill (Snoek et al., 2006) and CIFAR-10 (Krizhevsky, 2009)) to evaluate the performance of RSG+ in extracting the top-k canonical components. To the best of our knowledge, Arora et al. (2017) is the only previous work which stochastically optimizes the population objective in a streaming fashion and can extract the top-k components, so we compare our RSG+ with the matrix stochastic gradient (MSG) method proposed there (of the two methods proposed in Arora et al. (2017), we choose MSG because it performs better in their experiments). The details of the three datasets and how we process them are as follows. MNIST (LeCun et al., 2010): MNIST contains grey-scale images of size 28 × 28; we use its full training set of 60K images. Every image is split into left/right halves, which are used as the two views. Mediamill (Snoek et al., 2006): Mediamill contains around 25.8K paired features of videos and corresponding commentary, of dimensions 120 and 101, respectively. CIFAR-10 (Krizhevsky, 2009): CIFAR-10 contains 60K 32 × 32 color images. As with MNIST, we split each image into left/right halves and use them as the two views. Evaluation metric. We use the Proportion of Correlations Captured (PCC), which is widely used (Ma et al., 2015; Ge et al.).
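PCC compares the correlations captured by an estimate against those of the exact solution. As a reference point, the exact canonical correlations of a fixed dataset are the singular values of the whitened cross-covariance T = C_X^{-1/2} C_XY C_Y^{-1/2} defined in Section 1. A minimal sketch on synthetic two-view data sharing one latent signal (all names are ours):

```python
import numpy as np

def canonical_correlations(X, Y, k, eps=1e-8):
    """Top-k canonical correlations: singular values of the whitened
    cross-covariance T = C_X^{-1/2} C_XY C_Y^{-1/2}."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cx, Cy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    T = inv_sqrt(Cx) @ Cxy @ inv_sqrt(Cy)
    return np.linalg.svd(T, compute_uv=False)[:k]

# two synthetic "views" driven by one shared latent variable z
rng = np.random.default_rng(0)
z = rng.standard_normal((500, 1))
X = z @ rng.standard_normal((1, 4)) + 0.1 * rng.standard_normal((500, 4))
Y = z @ rng.standard_normal((1, 3)) + 0.1 * rng.standard_normal((500, 3))
rho = canonical_correlations(X, Y, k=2)
```

Because the two views share exactly one latent factor, the top canonical correlation is close to 1 and the second one is near the noise level.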

Results of Yger et al. (2012) (on CIFAR-10, our implementation of Yger et al. (2012) faces convergence issues).

Results of feature learning on MNIST. N/A means the method fails to yield a result on our hardware. After the network is trained on the CCA objective, we use a linear Support Vector Machine (SVM) to measure the classification accuracy of the output latent features. Andrew et al. (2013) use the closed-form CCA objective on the current batch directly, which costs O(d^3) memory and time per iteration.

The Grassmannian is a symmetric space and can be identified with the quotient space SO(n)/S(O(p) × O(n - p)), where S(O(p) × O(n - p)) is the set of all n × n matrices whose top-left p × p and bottom-right (n - p) × (n - p) submatrices are orthogonal, all other entries are 0, and whose overall determinant is 1. A point X ∈ Gr(p, n) can be specified by a basis X; we say X = Col(X) if X is a basis of X, where Col(·) is the column-span operator. It is easy to see that the general linear group GL(p) acts isometrically, freely and properly on St(p, n). Moreover, Gr(p, n) can be identified with the quotient space St(p, n)/GL(p). Hence, the projection map Π : St(p, n) → Gr(p, n), with Π(X) = Col(X), is a Riemannian submersion. Moreover, the triplet (St(p, n), Π, Gr(p, n)) is a fiber bundle.
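The identification Gr(p, n) = St(p, n)/GL(p) can be checked numerically: two bases related by an invertible p × p transform span the same subspace and therefore define the same orthogonal projector. A small sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))            # a full-rank basis: a point of Gr(2, 5)
A = np.array([[2.0, 0.5], [-0.3, 1.5]])    # an invertible element of GL(2)
Xb = X @ A                                 # another basis of the same subspace

def projector(X):
    """Orthogonal projector onto Col(X); it depends only on the subspace,
    not on the particular basis chosen."""
    return X @ np.linalg.inv(X.T @ X) @ X.T
```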

Table 5: Wallclock runtime of one pass through the data for our RSG+ and for MSG on MNIST, Mediamill and CIFAR (average of 5 runs).


Algorithm 2: Riemannian SGD based algorithm (RSG+) to compute canonical directions.
1. Initialize Ū, V̄, Q_u, Q_v, S_u, S_v;
2. Partition the data X, Y into batches of size B; let the j-th batch be denoted by X_j and Y_j;
Here, Upper(·) returns the upper triangular part of the input matrix, and ∇ denotes the Euclidean gradient operator. For completeness, the closed-form expressions of the gradients are:

