A GENERALIZED EIGENGAME WITH EXTENSIONS TO DEEP MULTIVIEW REPRESENTATION LEARNING

Abstract

Generalized Eigenvalue Problems (GEPs) encompass a range of interesting dimensionality reduction methods. Development of efficient stochastic approaches to these problems would allow them to scale to larger datasets. Canonical Correlation Analysis (CCA) is one example of a GEP for dimensionality reduction which has found extensive use in problems with two or more views of the data. Deep learning extensions of CCA require large mini-batch sizes, and therefore large memory consumption, in the stochastic setting to achieve good performance, and this has limited their application in practice. Inspired by the Generalized Hebbian Algorithm, we develop an approach to solving stochastic GEPs in which all constraints are softly enforced by Lagrange multipliers. Then, by considering the integral of this Lagrangian function (its pseudo-utility), and inspired by recent formulations of Principal Components Analysis and GEPs as games with differentiable utilities, we develop a game-theory-inspired approach to solving GEPs. We show that our approaches share much of the theoretical grounding of the previous Hebbian and game-theoretic approaches in the linear case, but that our method permits extension to general function approximators like neural networks for certain GEPs for dimensionality reduction, including CCA, which means our method can be used for deep multiview representation learning. We demonstrate the effectiveness of our method for solving GEPs in the stochastic setting on canonical multiview datasets and demonstrate state-of-the-art performance for optimizing Deep CCA.

1. INTRODUCTION

A Generalised Eigenvalue Problem (GEP) is defined by two symmetric matrices A, B ∈ R^{d×d}. They are usually characterised by the set of solutions to the equation

Aw = λBw    (1)

with λ ∈ R, w ∈ R^d, called (generalised) eigenvalue and (generalised) eigenvector respectively. Note that by taking B = I we recover the standard eigenvalue problem. We shall only be concerned with the case where B is positive definite, to avoid degeneracy; in this case one can find a basis of eigenvectors spanning R^d. Without loss of generality, take w_1, ..., w_d to be such a basis of eigenvectors, with decreasing corresponding eigenvalues λ_1 ≥ ... ≥ λ_d. The following variational characterisation (Stewart & Sun, 1990) provides a useful alternative, iterative definition: w_k solves

max_{w ∈ R^d} w⊤Aw  subject to  w⊤Bw = 1, w⊤Bw_j = 0 for j = 1, ..., k − 1.    (2)

There is also a simpler (non-iterative) variational characterisation of the top-k subspace (that spanned by {w_1, ..., w_k}), namely

max_{W ∈ R^{d×k}} trace(W⊤AW)  subject to  W⊤BW = I_k    (3)

again see Stewart & Sun (1990); the drawback of this characterisation is that it only recovers the subspace and not the individual eigenvectors. We shall see that these two different characterisations lead to different algorithms for the GEP. Many classical dimensionality reduction methods can be viewed as GEPs, including but not limited to Principal Components Analysis (Hotelling, 1933), Partial Least Squares (Haenlein & Kaplan, 2004), Fisher Discriminant Analysis (Mika et al., 1999), and Canonical Correlation Analysis (CCA) (Hotelling, 1992). Each of the problems above is defined at a population level, using population values of the matrices A, B, usually functionals of appropriate covariance matrices.
The practical challenge is the sample version: to estimate the population GEP when we only have estimates of A, B through some finite number of samples (z_n)_{n=1}^N; classically, one just solves the GEP with A, B estimated by plugging in the relevant sample covariance matrices. However, for very large datasets, the dimensionality of the associated GEPs makes it memory- and compute-intensive to compute solutions using existing full-batch algorithms; these are usually variants of the singular value decomposition in which successive eigenvalue-eigenvector pairs are calculated sequentially by deflation (Mackey, 2008), and so cannot exploit parallelism over the eigenvectors. This work was motivated in particular by CCA, a classical method for learning representations of data with two or more distinct views: a problem known as multiview (representation) learning. Multiview learning methods are useful for learning representations of data with multiple sets of features, or 'views'. CCA identifies projections or subspaces in at least two different views that are highly correlated and can be used to generate robust low-dimensional representations for a downstream prediction task, to discover relationships between views, or to generate representations of a view that is missing at test time. CCA has been widely applied across a range of fields such as neuroimaging (Krishnan et al., 2011), finance (Cassel et al., 2000), and imaging genetics (Hansen et al., 2021). Deep learning functional forms are often highly effective for modelling extremely large datasets, as they have more expressivity than linear models and scale better than kernel methods. While PCA has found a natural stochastic non-linear extension in the popular autoencoder architecture (Kramer, 1991), applications of Deep CCA (Andrew et al., 2013) have been more limited because estimation of the constraints in the problem outside the full-batch setting is more challenging.
In particular, DCCA performs badly when its objective is maximized using stochastic mini-batches. This is unfortunate, as DCCA would appear to be well suited to a number of multiview machine learning applications: a number of successful deep multiview machine learning methods (Suzuki & Matsuo, 2022) and certain self-supervised learning approaches (Zbontar et al., 2021) are designed around similar principles to DCCA, namely maximizing the consensus between non-linear models of different views (Nguyen & Wang, 2020). Recently, a number of algorithms have been proposed to approximate GEPs (Arora et al., 2012), and CCA specifically (Bhatia et al., 2018), in the 'stochastic' or 'data-streaming' setting; these can yield large computational savings. Typically, the computational complexity of classical GEP algorithms is O((N + k)d²); by exploiting parallelism (both between eigenvectors and between samples in a mini-batch), we can reduce this down to O(dk) (Arora et al., 2016). Stochastic algorithms also introduce a form of regularisation which can be very helpful in these high-dimensional settings. A key motivation for us was a recent line of work reformulating top-k eigenvalue problems as games (Gemp et al., 2020; 2021), later extended to GEPs in Gemp et al. (2022). We shall refer to these ideas as the 'EigenGame framework'. Unfortunately, their GEP extension is very complicated, with 3 different hyperparameters; this complication is needed because they constrain their estimates to lie on the unit sphere, which is a natural geometry for the usual eigenvalue problem but not for the GEP.
By replacing this unit sphere constraint with a Lagrange multiplier penalty, we obtain a much simpler method (GHA-GEP) with only a single hyperparameter; this is a big practical improvement because the convergence of such algorithms is mostly sensitive to the step-size (learning rate) parameter (Li & Jordan, 2021), and it allows a practitioner to explore many more learning rates for the same computational budget. We also propose a second class of methods (δ-EigenGame), defined via optimising explicit utility functions rather than via updates, which enjoys the same practical advantages and similar performance. These utilities give unconstrained variational forms for GEPs that we have not seen elsewhere in the literature and which may be of independent interest; their key practical advantage is that they contain only linear factors of A and B, so we can easily obtain unbiased estimates of their gradients. The other key advantage of these utility-based methods is that they can easily be extended to use deep learning to solve problems motivated by GEPs. In particular, we propose a simple but powerful method for the Deep CCA problem.

1.1. NOTATION

We have collected here some notational conventions which we think may provide a helpful reference for the reader. We shall always have A, B ∈ R^{d×d}. We denote (estimates to or dummy variables for) the i-th generalised eigenvector by w_i; and denote CCA directions u_i ∈ R^p, v_i ∈ R^q. The number of directions we want to estimate will be k. For stochastic algorithms, we denote the batch size by b. We use ⟨•, •⟩ for inner products; implicitly we always take the Euclidean inner product over vectors and the Frobenius or 'trace' inner product for matrices.

2. A CONSTRAINT-FREE ALGORITHM FOR GEPS

Our first proposed method solves the general form of the generalized eigenvalue problem in equation (2) for the top-k eigenvalues and their associated eigenvectors in parallel. We are thus interested in both the top-k subspace problem and the top-k eigenvectors themselves. Our method extends the Generalized Hebbian Algorithm to GEPs, and we thus refer to it as GHA-GEP. In the full-batch version of our algorithm, each eigenvector estimate has updates of the form

Δ_i^GHA-GEP = Aŵ_i − Σ_{j≤i} Bŵ_j (ŵ_j⊤Aŵ_i)    (4)
            = Aŵ_i − Bŵ_i (ŵ_i⊤Aŵ_i) − Σ_{j<i} Bŵ_j (ŵ_j⊤Aŵ_i)
            = Aŵ_i − Σ_{j≤i} Bŵ_j Γ_ij    (5)

where the first term is a 'reward', the j = i term a 'variance penalty', and the j < i terms 'orthogonality penalties'; here ŵ_j is our estimate of the eigenvector associated with the j-th largest eigenvalue, and in the stochastic setting we can replace A and B with their unbiased estimates Â and B̂. We will use the notation Γ_ij = ŵ_j⊤Aŵ_i to facilitate comparison with previous work in Appendix A. Γ_ij has a natural interpretation as the Lagrange multiplier for the constraint w_i⊤Bw_j = 0; indeed, Chen et al. (2019) prove that ŵ_j⊤Aŵ_i is the optimal value of the corresponding Lagrange multiplier for their GEP formulation; we summarise this derivation in Appendix C.2 for ease of reference. We also label the terms as rewards and penalties to facilitate discussion with respect to the EigenGame framework in Appendix A.3 and recent work in self-supervised learning in Appendix E.

Proposition 2.1 (Unique stationary point). Given exact parents and assuming the top-k generalized eigenvalues of A and B are distinct and positive, the only stable stationary point of the iteration defined by (5) is the i-th eigenvector w_i (up to sign).
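The GHA-GEP update above is simple to vectorise over all k players at once. The following NumPy sketch is our own illustration (not the paper's reference implementation); the hierarchical j ≤ i sum becomes an upper-triangular mask on the matrix of Γ values.

```python
import numpy as np

def gha_gep_update(W, A, B):
    """One full-batch GHA-GEP step for all k players in parallel:
    Delta_i = A w_i - sum_{j <= i} B w_j (w_j^T A w_i).

    W: (d, k) eigenvector estimates, columns ordered by eigenvalue estimate.
    A, B: (d, d) symmetric matrices (or their unbiased mini-batch estimates)."""
    AW, BW = A @ W, B @ W
    Gamma = W.T @ AW            # Gamma[j, i] = w_j^T A w_i
    Gamma = np.triu(Gamma)      # keep only the hierarchical j <= i terms
    return AW - BW @ Gamma
```

In use one simply iterates W ← W + η · gha_gep_update(W, Â, B̂) with a small learning rate η, substituting mini-batch estimates Â, B̂ in the stochastic setting.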

2.1. DEFINING UTILITIES AND PSEUDO-UTILITIES WITH LAGRANGIAN FUNCTIONS

Now observe that our proposed updates can be written as the gradients of a Lagrangian pseudo-utility function:

PU_i^GHA-GEP(ŵ_i | ŵ_j<i, Γ) = ½ ŵ_i⊤Aŵ_i + ½ Γ_ii (1 − ŵ_i⊤Bŵ_i) − Σ_{j<i} Γ_ij ŵ_j⊤Bŵ_i    (6)

We show how this result is closely related to the pseudo-utility functions in Chen et al. (2019), and in Appendix C.3 suggest an alternative pseudo-utility function for the work in Gemp et al. (2021) which, unlike the original work, does not require stop-gradient operators. If we plug the relevant ŵ_i and ŵ_j terms into Γ, we obtain the following utility function:

U_i^δ(ŵ_i; ŵ_j<i) = ½ ŵ_i⊤Aŵ_i + ½ (ŵ_i⊤Aŵ_i)(1 − ŵ_i⊤Bŵ_i) − Σ_{j<i} (ŵ_i⊤Aŵ_j)(ŵ_j⊤Bŵ_i)
                 = (ŵ_i⊤Aŵ_i) − ½ (ŵ_i⊤Aŵ_i)(ŵ_i⊤Bŵ_i) − Σ_{j<i} (ŵ_i⊤Aŵ_j)(ŵ_j⊤Bŵ_i)    (7)

A remarkable fact is that this utility function actually defines a solution to the GEP! We prove the following consistency result in Appendix B.1.

Proposition 2.2 (Unique stationary point). Assume the top-i generalized eigenvalues of the GEP (2) are positive and distinct. Then the unique maximizer of the utility in (7) for exact parents is precisely the i-th eigenvector (up to sign).

An immediate corollary is:

Corollary 2.1. The top-k generalized eigenvectors form the unique, strict Nash equilibrium of ∆-EigenGame.

Furthermore, the penalty terms in the utility function (6) have a natural interpretation as a projection deflation, as shown in Appendix C.5. This utility function allows us to formalise ∆-EigenGame, whose solution corresponds to the top-k solution of equation (2). Definition 2.1.
Let ∆-EigenGame be the game with players i ∈ {1, ..., k}, strategy space ŵ_i ∈ R^d, where d is the dimensionality of A and B, and utilities U_i^δ defined in equation (7). Next note that it is easy to compute the derivative

Δ_i^δ = ∂U_i^δ(ŵ_i; ŵ_j<i)/∂ŵ_i    (8)
      = 2Aŵ_i − {Aŵ_i (ŵ_i⊤Bŵ_i) + (ŵ_i⊤Aŵ_i) Bŵ_i} − Σ_{j<i} {Aŵ_j (ŵ_j⊤Bŵ_i) + (ŵ_j⊤Aŵ_i) Bŵ_j}
      = Δ_i^GHA-GEP + {Aŵ_i − Σ_{j≤i} Aŵ_j (ŵ_j⊤Bŵ_i)}

This motivates an alternative algorithm for the GEP which we call δ-EigenGame (where, consistent with previous work, we use upper case for the game and lower case for its associated algorithm).
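The gradient Δ_i^δ can be vectorised in the same way as the GHA-GEP update; a minimal NumPy sketch (our own illustration, not the paper's implementation) is:

```python
import numpy as np

def delta_eigengame_grad(W, A, B):
    """Gradient of the utilities U_i^delta for all k players in parallel:
    Delta_i = 2 A w_i - sum_{j <= i} [A w_j (w_j^T B w_i) + B w_j (w_j^T A w_i)].
    The j = i term is the variance penalty; j < i terms are orthogonality penalties."""
    AW, BW = A @ W, B @ W
    G = np.triu(W.T @ AW)   # G[j, i] = w_j^T A w_i, kept for j <= i only
    C = np.triu(W.T @ BW)   # C[j, i] = w_j^T B w_i, kept for j <= i only
    return 2 * AW - AW @ C - BW @ G
```

A quick consistency check: at the exact, B-orthonormal top-k generalized eigenvectors the j < i cross terms vanish and the j = i term reduces to Aw_i − λ_i Bw_i, so the gradient is zero, as Proposition 2.2 requires.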

2.2. STOCHASTIC/DATA-STREAMING VERSIONS

This paper is motivated by cases where the algorithm only has access to unbiased sample estimates of A and B. These estimates, denoted Â and B̂, are therefore random variables. A nice property of both our proposed GHA-GEP and δ-EigenGame is that A and B appear as multiplications in both of their updates (as opposed to as divisors). This means that we can simply substitute in our unbiased estimates at each iteration. For GHA-GEP this gives updates based on stochastic unbiased estimates of the gradient

Δ̂_i^GHA-GEP = Âŵ_i − Σ_{j≤i} B̂ŵ_j (ŵ_j⊤Âŵ_i)

which we use to form Algorithm 1. Likewise we can form stochastic updates for δ-EigenGame

Δ̂_i^δ = 2Âŵ_i − {Âŵ_i (ŵ_i⊤B̂ŵ_i) + (ŵ_i⊤Âŵ_i) B̂ŵ_i} − Σ_{j<i} {Âŵ_j (ŵ_j⊤B̂ŵ_i) + (ŵ_j⊤Âŵ_i) B̂ŵ_j}    (10)

which gives Algorithm 2. Furthermore, the simplicity of the form of the updates means that, in contrast to previous work, our updates in the stochastic setting require only one hyperparameter: the learning rate.

2.3. COMPLEXITY AND IMPLEMENTATION

For the GEPs we are motivated by, and in particular for CCA, Â and B̂ are low-rank matrices (specifically, they have rank at most b, where b is the mini-batch size). This means that, like previous variants of EigenGame, our algorithm has a per-iteration cost of O(bdk²). We can similarly leverage parallel computing over both the eigenvectors (players) and the data to achieve a theoretical complexity of O(dk). A particular benefit of our proposed form is that we only require one hyperparameter, which makes hyperparameter tuning particularly efficient. This is particularly important as prior work has demonstrated that methods related to the stochastic power method are highly sensitive to the choice of learning rate (Li & Jordan, 2021). Indeed, by using a decaying learning rate the user can in principle run our algorithm just once to a desired accuracy given their computational budget. This is in contrast to recent work proposing an EigenGame solution to stochastic GEPs (Gemp et al., 2022), which requires three hyperparameters.

3. APPLICATION TO CCA AND EXTENSION TO DEEP CCA

Previous EigenGame approaches have not been extended to include deep learning functions. Gemp et al. (2020) noted that the objectives of the players in α-EigenGame were all generalized inner products which should extend to general function approximators. However, it was unclear how to translate the constraints in previous EigenGame approaches to the neural network setting. In contrast, we have shown that our work is constraint free but can still be written completely as generalized inner products for certain GEPs and, in particular, dimensionality reduction methods like CCA.

3.1. CANONICAL CORRELATION ANALYSIS

Suppose we have vector-valued random variables X ∈ R^p, Y ∈ R^q. Then CCA (Hotelling, 1992) defines a sequence of pairs of 'canonical directions' (u_i, v_i) ∈ R^{p+q} by the iterative maximisations

max_{u∈R^p, v∈R^q} Cov(u⊤X, v⊤Y)  subject to  Var(u⊤X) = Var(v⊤Y) = 1, Cov(u⊤X, u_j⊤X) = Cov(v⊤Y, v_j⊤Y) = 0 for j < i.

Now write Cov(X) = Σ_XX, Cov(Y) = Σ_YY, Cov(X, Y) = Σ_XY. It is straightforward to show (Borga, 1998) that CCA corresponds to a GEP with

A = [[0, Σ_XY], [Σ_YX, 0]],  B = [[Σ_XX, 0], [0, Σ_YY]],  w = (u; v),  d = p + q.

For the sample version of CCA, suppose we have observations (x_n, y_n)_{n=1}^N, which have been preprocessed to have mean zero. Then the classical CCA estimator solves the GEP above with the population covariances replaced by sample covariances (Anderson, 2003). To define our algorithm in the stochastic case, suppose that at time step t we define Â_t, B̂_t by plugging in the sample covariances of the mini-batch at time t.
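The block construction above is easy to verify numerically. The following SciPy sketch (our own illustration; it assumes N > p + q so the sample B is positive definite, otherwise a ridge term would be needed) builds the GEP matrices from sample covariances and solves the full-batch problem directly:

```python
import numpy as np
from scipy.linalg import eigh

def cca_via_gep(X, Y, k):
    """Full-batch CCA as the GEP Aw = lambda Bw (Borga, 1998).
    X: (N, p) and Y: (N, q) mean-centred data; returns top-k (U, V, correlations)."""
    N, p = X.shape
    q = Y.shape[1]
    Sxx, Syy, Sxy = X.T @ X / N, Y.T @ Y / N, X.T @ Y / N
    A = np.zeros((p + q, p + q)); B = np.zeros((p + q, p + q))
    A[:p, p:] = Sxy; A[p:, :p] = Sxy.T       # off-diagonal cross-covariance blocks
    B[:p, :p] = Sxx; B[p:, p:] = Syy         # block-diagonal within-view covariances
    vals, vecs = eigh(A, B)                  # generalized eigenvalues, ascending
    idx = np.argsort(vals)[::-1][:k]         # top-k (eigenvalues come in +/- pairs)
    W = vecs[:, idx]
    return W[:p], W[p:], vals[idx]           # canonical directions and correlations
```

The top-k generalized eigenvalues of this pencil are exactly the canonical correlations, which gives a convenient ground-truth check against the projected data.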

3.2. δ-EIGENGAME FOR CCA

We defined CCA by maximising correlation between linear functionals of the two views of the data; we can extend this to DCCA by instead considering non-linear functionals defined by deep neural networks. Consider neural networks f, g which respectively map X and Y to a d-dimensional subspace. We will refer to the k-th dimension of these subspaces using f_k(X) and g_k(Y), where f(X) = [f_1(X), ..., f_d(X)] and g(Y) = [g_1(Y), ..., g_d(Y)]. Deep CCA finds f and g which maximize Corr(f_i(X), g_i(Y)) subject to orthogonality constraints. To motivate an algorithm, note that (7) is just a function of the inner products

⟨ŵ_i, Aŵ_j⟩ = Cov(u_i⊤X, v_j⊤Y) + Cov(v_i⊤Y, u_j⊤X)
⟨ŵ_i, Bŵ_j⟩ = Cov(u_i⊤X, u_j⊤X) + Cov(v_i⊤Y, v_j⊤Y)

So replacing u_i⊤X with f_i(X) and v_i⊤Y with g_i(Y), and using the short-hand

Ã_ij = Cov(f_i(X), g_j(Y)) + Cov(g_i(Y), f_j(X))    (13)
B̃_ij = Cov(f_i(X), f_j(X)) + Cov(g_i(Y), g_j(Y))    (14)

we obtain the objective

U_i^δ(f_i, g_i | f_j<i, g_j<i) = 2Ã_ii − Ã_ii B̃_ii − 2 Σ_{j<i} Ã_ij B̃_ij    (15)

Next observe by symmetry of the matrices Ã, B̃ that if we sum the first k utilities we obtain

U_k^sum = Σ_{i=1}^k U_i^δ = 2 Σ_{i=1}^k Ã_ii − Σ_{i=1}^k Ã_ii B̃_ii − 2 Σ_{i=1}^k Σ_{j<i} Ã_ij B̃_ij
        = 2 trace(Ã) − Σ_{i,j=1}^k Ã_ij B̃_ij = 2 trace(Ã) − trace(Ã B̃⊤) = trace(Ã (2I_k − B̃))    (16)

The key strength of this covariance-based formulation is that we can obtain a full-batch algorithm by simply plugging in the sample covariance over the full batch, and a mini-batch update by plugging in sample covariances on the mini-batch. We define DCCA-EigenGame in Algorithm 3, where we slightly abuse notation: we write mini-batches in matrix form X_t ∈ R^{p×b}, Y_t ∈ R^{q×b} and use the short-hand f(X_t), g(Y_t) to denote applying f, g to each sample in the mini-batch.
Algorithm 3 DCCA-EigenGame
Input: stream of mini-batches of size b, X_t ∈ R^{b×p}, Y_t ∈ R^{b×q}; neural networks f(X), g(Y) parameterized by θ and ψ; learning rate η
for t = 1 to T do
  Construct unbiased estimates Ã and B̃ from f(X_t) and g(Y_t)
  U ← trace(Ã(2I_k − B̃))
  ∇f ← ∂U/∂f, ∇g ← ∂U/∂g
  θ_{t+1} ← θ_t + η∇f, ψ_{t+1} ← ψ_t + η∇g
end for

We have motivated a loss function for SGD by a heuristic argument; we now give a theoretical result justifying the choice. Recall that the top-k variational characterisation of the GEP in (3) was hard to use in practice because of the constraints; we can use it to prove that the form above characterises the GEP.

Proposition 3.1 (Subspace characterisation). The top-k subspace for the GEP (1) can be characterised by

max_{W ∈ R^{d×k}} trace(W⊤AW (2I_k − W⊤BW))

We prove this result in Appendix B.3. We also provide an alternative derivation of the utility in (16).
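The loss used in Algorithm 3 takes only a few lines to compute. The following NumPy sketch is our own illustration (not the paper's implementation); it assumes the batch outputs FX = f(X_t) and GY = g(Y_t) are given as (b, k) arrays, and in a deep learning framework the same expression would be differentiated automatically with respect to the network parameters.

```python
import numpy as np

def dcca_eigengame_loss(FX, GY):
    """U = trace(A_tilde (2 I_k - B_tilde)) computed on one (mini-)batch of
    network outputs FX = f(X_t), GY = g(Y_t), each of shape (b, k).
    Maximizing U over the network parameters targets the top-k CCA subspace."""
    b, k = FX.shape
    FX = FX - FX.mean(0); GY = GY - GY.mean(0)       # centre the embeddings
    Sfg = FX.T @ GY / (b - 1)                        # cross-covariance
    Sff = FX.T @ FX / (b - 1); Sgg = GY.T @ GY / (b - 1)
    A_t = Sfg + Sfg.T                                # A_tilde_ij
    B_t = Sff + Sgg                                  # B_tilde_ij
    return np.trace(A_t @ (2 * np.eye(k) - B_t))
```

Note that the loss involves only sample covariances of the embeddings, so plugging in mini-batch covariances gives the stochastic version directly.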

4. RELATED WORK

In particular we note the contemporaneous work in Gemp et al. (2022), termed γ-EigenGame, which directly addresses the stochastic GEP setting we have described in this work using an EigenGame-inspired approach. Since their method was designed around the Rayleigh quotient form of GEPs, it takes a different and more complicated form and requires additional hyperparameters in order to remove bias from the updates in the stochastic setting, because their proposed utility function contains random variables in denominator terms. It is also not clear that their updates are the gradients of a utility function. Meng et al. (2021) developed an algorithm, termed RSG+, for streaming CCA which stochastically approximates the principal components of each view in order to approximate the top-k CCA problem, in effect transforming the data so that B = I to simplify the problem. Arora et al. (2017) developed a Matrix Stochastic Gradient method for finding the top-k CCA subspace. However, the efficiency of this method depends on mini-batch sizes of 1 and scales poorly to larger mini-batch sizes. While there have also been a number of approaches to the top-1 CCA problem (Li & Jordan, 2021; Bhatia et al., 2018), the closest methods in motivation and performance to our work on the linear problem are γ-EigenGame, SGHA, and RSG+. The original DCCA (Andrew et al., 2013) was defined by the objective max tracenorm(Σ̂_XX^{−1/2} Σ̂_XY Σ̂_YY^{−1/2}) and demonstrated strong performance in multiview learning tasks when optimized with the full-batch L-BFGS optimizer (Liu & Nocedal, 1989). However, when the objective is evaluated on small mini-batches, the whitening matrices Σ̂_XX^{−1/2} and Σ̂_YY^{−1/2} are likely to be ill-conditioned, causing gradient estimation to be biased. Wang et al.
(2015b) observed that despite the biased gradients, the original DCCA objective could still be used in the stochastic setting for large enough mini-batches, a method referred to in the literature as stochastic optimization with large mini-batches (DCCA-STOL). Wang et al. (2015c) developed a method which adaptively approximated the covariance of the embedding for each view in order to whiten the targets of a regression in each view. This mean square error type loss can then be decoupled across samples in a method called non-linear orthogonal iterations (DCCA-NOI). To the best of our knowledge this method is the current state-of-the-art for DCCA optimisation using stochastic mini-batches. 

5. EXPERIMENTS

In this section we replicate experiments from recent work on stochastic CCA and Deep CCA in order to demonstrate the accuracy and efficiency of our method.

5.1. STOCHASTIC SOLUTIONS TO CCA

In this section we compare GHA-GEP and δ-EigenGame to previous methods for approximating CCA in the stochastic setting. We optimize for the top-8 eigenvectors for the MediaMill, Split MNIST and Split CIFAR datasets, replicating Gemp et al. (2022) and Meng et al. (2021) with double the number of components and mini-batch size 128, and comparing our methods to theirs. We use the SciPy package (Virtanen et al., 2020) to solve the population GEPs as a ground truth, and use the proportion of correlation captured (PCC) by the learnt subspace relative to this population ground truth as our metric (defined in Appendix F.2). Figure 1 shows that for all three datasets, both GHA-GEP and δ-EigenGame exhibit faster per-iteration convergence than prior work, and Figure 2 shows the same holds in terms of runtime. They also demonstrate comparable or higher PCC at convergence. In these experiments δ-EigenGame was found to outperform GHA-GEP. These results were broadly consistent across mini-batch sizes from 32 to 128, as we demonstrate in further experiments in Appendix G.1. The strong performance of GHA-GEP and δ-EigenGame is likely because their updates adaptively weight the objective and constraints of the problem and are not constrained arbitrarily to the unit sphere. We further explore the shape of the utility function in Appendix C.4.

5.2. STOCHASTIC SOLUTIONS TO DEEP CCA

In this section we compare DCCA-EigenGame and DCCA-SGHA to previous methods for optimizing DCCA in the stochastic setting. We replicated an experiment from Wang et al. (2015c) and compare our proposed methods to DCCA-NOI and DCCA-STOL. Like previous work, we use the total correlation captured (TCC) of the learnt subspace as a metric (defined in Appendix F.1). Figure 3 shows that, in all three datasets, DCCA-EigenGame finds higher correlations in the validation data than all methods except DCCA-STOL with n = 500, with typically faster convergence in early iterations compared to DCCA-NOI.

6. CONCLUSION

We have presented two novel algorithms for solving stochastic GEPs. The first, GHA-GEP, was based on extending the popular GHA, and we showed how it can be understood as optimising a Lagrangian pseudo-utility function. The second, δ-EigenGame, was developed by substituting the optimal Lagrange multipliers to give a proper utility function, which allowed us to define the solution of a GEP as the Nash equilibrium of ∆-EigenGame. Our proposed methods have simple and elegant forms and require only one hyperparameter, making them extremely practical, and both demonstrated comparable or better runtime and performance compared to prior work. We also showed how this approach can be used to optimize Deep CCA and demonstrated state-of-the-art performance when using stochastic mini-batches. We believe that this will allow researchers to apply DCCA to a much wider range of problems. In future work, we will apply δ-EigenGame to other practically interesting GEPs such as Generalized CCA for more than two views and Fisher Discriminant Analysis. We will also explore extensions of other GEPs to the deep learning case in order to build principled deep representations.

A COMPARISON TO PREVIOUS WORK

A.1 GENERALIZED HEBBIAN ALGORITHM

Our update is closely related to the Generalized Hebbian Algorithm (GHA) (Sanger, 1989) for solving the PCA problem, with updates:

Δ_i^GHA = Aŵ_i − Σ_{j≤i} ŵ_j (ŵ_j⊤Aŵ_i) = Aŵ_i − Σ_{j≤i} ŵ_j Γ_ij

which was originally designed to be solved sequentially rather than in parallel. Note that for GEPs where B = I_d, like PCA, our proposed method collapses exactly to GHA.
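For reference, this B = I_d special case (Sanger's rule) can be sketched in a few lines of NumPy (our own illustration, run here in parallel over all k components rather than sequentially as in the original):

```python
import numpy as np

def gha_update(W, A):
    """Sanger's Generalized Hebbian Algorithm step for PCA:
    Delta_i = A w_i - sum_{j <= i} w_j (w_j^T A w_i).
    W: (d, k) component estimates; A: (d, d) symmetric (e.g. a covariance)."""
    AW = A @ W
    return AW - W @ np.triu(W.T @ AW)   # triu mask keeps the j <= i terms
```

Iterating W ← W + η · gha_update(W, A) recovers the top-k eigenvectors of A, illustrating that the GHA-GEP update differs only by the B factor on the penalty terms.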

A.2 STOCHASTIC GENERALIZED HEBBIAN ALGORITHM

To understand how our method extends GHA to generalized eigenvalue problems we consider the Stochastic Generalized Hebbian Algorithm (SGHA) (Chen et al., 2019). SGHA is derived from the min-max Lagrangian form of (3):

min_{W∈R^{d×k}} max_{Γ∈R^{k×k}} L(W, Γ) = −tr(W⊤AW) + ⟨Γ, W⊤BW − I_k⟩    (20)

where W is a matrix that captures the top-k subspace (but not necessarily the top-k eigenvectors) and Γ is a Lagrange multiplier that enforces the constraint in (2) along with the B-orthogonality of the eigenvectors. By solving for the KKT conditions of equation (20), we have Γ = W⊤AW, and the authors propose to combine the primal and dual updates into a single step to give symmetrical updates for each eigenvector:

Δ_i^SGHA = Aŵ_i − Σ_j Bŵ_j (ŵ_j⊤Aŵ_i) = Aŵ_i − Σ_j Bŵ_j Γ_ij    (21)

The key difference between our method and SGHA is that here the sum runs over all j: no hierarchy is imposed on the eigenvectors, so the method can only recover the top-k subspace. This is in contrast to our proposal, all EigenGame methods, and indeed the original GHA. As noted by Gemp et al. (2020), imposing a hierarchy often appears to improve the stability of the algorithm in experiments and has the additional benefit of returning ordered eigenvectors.

A.3 µ-EIGENGAME

Finally, our method is closely related to µ-EigenGame (Gemp et al., 2021), though this is only defined for the B = I_d case. Their method restricts estimates to lie on the unit sphere, using Riemannian optimization tools to update in directions defined by

Δ_i^µ = Aŵ_i − Σ_{j<i} ŵ_j (ŵ_j⊤Aŵ_i) = Aŵ_i − Σ_{j<i} ŵ_j Γ_ij

where the difference from our proposal is again in the range of the sum: µ-EigenGame does not have the j = i term in its penalty (and therefore does not use the Γ_ii Lagrange multiplier associated with the unit variance constraint ŵ_i⊤ŵ_i = 1).

B PROOFS AND FURTHER THEORETICAL ANALYSIS B.1 ∆-EIGENGAME THEORY

We recall proposition 2.2:

Proposition 2.2 (Unique stationary point). Assume the top-i generalized eigenvalues of the GEP (2) are positive and distinct. Then the unique maximizer of the utility in (7) for exact parents is precisely the i-th eigenvector (up to sign).

Proof. For ease of reading the proofs in this appendix, we slightly change notation, and index the normalised solutions to the GEP with superscripts: ⟨w^(i), Bw^(i)⟩ = 1, ⟨w^(i), Aw^(i)⟩ = λ^(i) for all i, while we continue to index our estimates with subscripts. We can write our estimates in this basis to define the coefficients ŵ_i = Σ_p ν_i^(p) w^(p). Next define

m_i = Σ_p (ν_i^(p))²,    z_i^(j) = (ν_i^(j))² / Σ_p (ν_i^(p))² = (ν_i^(j))² / m_i

so that the vector z_i = (z_i^(j))_j takes values in the simplex. Then we have:

⟨ŵ_i, Aw^(j)⟩ = λ^(j) ν_i^(j),   ⟨ŵ_i, Bw^(j)⟩ = ν_i^(j),   ⟨ŵ_i, Aŵ_i⟩ = Σ_p λ^(p) (ν_i^(p))²,   ⟨ŵ_i, Bŵ_i⟩ = Σ_p (ν_i^(p))² = m_i

Consider the utility function for player i with exact parents ŵ_j = w^(j) for j < i:

u_i(ŵ_i | w^(j<i)) = 2⟨ŵ_i, Aŵ_i⟩ − ⟨ŵ_i, Bŵ_i⟩⟨ŵ_i, Aŵ_i⟩ − 2 Σ_{j<i} ⟨ŵ_i, Bw^(j)⟩⟨w^(j), Aŵ_i⟩
  = 2 Σ_p λ^(p) (ν_i^(p))² − 2 Σ_{j<i} λ^(j) (ν_i^(j))² − (Σ_p λ^(p) (ν_i^(p))²)(Σ_p (ν_i^(p))²)
  = 2 Σ_{j≥i} λ^(j) (ν_i^(j))² − (Σ_p λ^(p) (ν_i^(p))²)(Σ_p (ν_i^(p))²)
  = (2m_i − m_i²) Σ_{j≥i} λ^(j) z_i^(j) − m_i² Σ_{j<i} λ^(j) z_i^(j)

which is maximized when m_i = 1 and z_i^(j) = δ_ij, which implies ν_i^(i) = ±1 and ν_i^(j) = 0 for j ≠ i.

Note that in the previous lemma, the utility function took a simple form when we chose the true generalised eigenvectors as a basis; indeed, when using coefficients with respect to this basis, the utility depended only on the generalised eigenvalues and not on the basis itself. This simple form shows how our utility interacts naturally with the geometry of the GEP. We will now analyse the corresponding simple form of the update steps.

B.2 GHA-GEP THEORY

We recall proposition 2.1:

Proposition 2.1 (Unique stationary point). Given exact parents and assuming the top-k generalized eigenvalues of A and B are distinct and positive, the only stable stationary point of the iteration defined by (5) is the i-th eigenvector w_i (up to sign).

Proof. Let w^(j), ν_i^(j) be defined as in the proof of proposition 2.2. Then the update is

Δ_i = Aŵ_i − Bŵ_i (ŵ_i⊤Aŵ_i) − Σ_{j<i} Bŵ_j (ŵ_j⊤Aŵ_i) = Aŵ_i − Σ_{j≤i} Bŵ_j (ŵ_j⊤Aŵ_i)
    = Σ_p λ^(p) ν_i^(p) Bw^(p) − Σ_{j≤i} (Σ_p ν_j^(p) Bw^(p)) (Σ_q λ^(q) ν_i^(q) ν_j^(q))
    = Σ_p Bw^(p) [ λ^(p) ν_i^(p) − Σ_{j≤i} ν_j^(p) Σ_q λ^(q) ν_i^(q) ν_j^(q) ]

If we define the diagonal matrix Λ = diag({λ^(p)}_p) and the vector ν_i = (ν_i^(p))_{p=1}^d, then we can rewrite Σ_q λ^(q) ν_i^(q) ν_j^(q) = ν_j⊤Λν_i. Now, since the Bw^(p) are linearly independent, we can equate coefficients and combine into vector form to obtain the update step for the coefficient vectors, which we shall notate by

Δ(ν)_i = Λν_i − Σ_{j≤i} ν_j (ν_j⊤Λν_i) = (I − Σ_{j≤i} ν_j ν_j⊤) Λν_i = (I − Σ_{j<i} ν_j ν_j⊤) Λν_i − ν_i (ν_i⊤Λν_i)

Only now do we consider the assumption of exact parents. This corresponds to the coefficient vectors ν_j = e_j for all j < i, where e_j is the j-th unit vector ((e_j)_k = δ_jk). Then

I − Σ_{j<i} ν_j ν_j⊤ = diag(0_{(i−1)×(i−1)}, I_{(d−i+1)×(d−i+1)})

So if we write λ̄_i = ν_i⊤Λν_i, the update equations for our coefficients become:

Δ(ν)_i^(p) = −λ̄_i ν_i^(p)            for p < i    (24)
Δ(ν)_i^(p) = (λ^(p) − λ̄_i) ν_i^(p)   for p ≥ i    (25)

We can now observe the qualitative behaviour:
• For p < i, the coefficients are shrunk towards zero (for sufficiently small step sizes).
• For p ≥ i, the coefficients grow or decay depending on their generalised eigenvalue: the larger the eigenvalue, the more the component grows (or the less it shrinks), so over time only the i-th component is selected.
• The overall magnitude of the solution shrinks faster when λ̄_i is large.

A stationary point of the iteration therefore requires ν_i^(p) = 0 for p < i. Then for each p ≥ i we must have either λ̄_i = λ^(p) or ν_i^(p) = 0. Furthermore, the i-th component grows at a faster rate than any of the other components, so provided it was non-zero at initialisation, it will be extracted uniquely. Finally, if ν_i^(p) = 0 for all p ≠ i but ν_i^(i) ≠ 0, then we must have λ̄_i = λ^(i) and also λ̄_i = λ^(i) (ν_i^(i))²; combining these gives ν_i^(i) = ±1, as required.

B.2.1 DISCUSSION OF CONTINUOUS DYNAMICS

In particular, note that in the continuous-time case above with exact parents, we can write the solutions to (d/dt) ν_i(t) = Δ(ν)_i, with Δ(ν)_i as in (24, 25), as

ν_i^(p)(t) = ν_i^(p)(0) exp( 1_{p≥i} λ^(p) t − ∫_0^t λ̄_i(s) ds )

So when z_i^(i)(0) ≠ 0 (hopefully this is almost sure), the trajectories on the simplex satisfy, for j ≠ i,

z_i^(j)(t) / z_i^(i)(t) = (z_i^(j)(0) / z_i^(i)(0)) exp( 2(1_{j≥i} λ^(j) − λ^(i)) t ) → 0  as t → ∞

and we do indeed select the correct coefficient vector at an exponential rate. Note in particular that this equation for the trajectory on the simplex is decoupled from the trajectory of the norm m_i of the coefficients. Of course, we are really interested in the case of inexact parents. We can provide a heuristic argument similar to one of Gemp et al. (2021). Note that the updates for w_i only depend on its parents w_j<i, and one can show that an O(ϵ) error in the parents propagates to an O(ϵ) error in the child gradient. We know w_1 will converge very fast to arbitrary accuracy; then the gradient for w_2 will be very close to that corresponding to exact parents, so it will quickly converge to a similar order of accuracy; then the gradient of w_3 will be close to that for exact parents, and so on.

B.2.2 EXTENDING TO STOCHASTIC CASE

The real case of interest is the discrete-time case with mini-batches. Gemp et al. (2021) claim that their algorithm converges almost surely provided the step-size sequence η_t satisfies

Σ_{t=1}^∞ η_t = ∞,    Σ_{t=1}^∞ η_t^2 < ∞    (26)

Their key tool is a result on Stochastic Approximation (SA) on Riemannian manifolds (Shah, 2017). This result extends the now-classical ODE method for the analysis of SA schemes to Riemannian manifolds, mostly drawing on the presentation of Borkar (2008). One key difficulty in applying the literature on SA schemes is obtaining stability bounds (showing that the estimates never get too big); this becomes trivial for updates on compact manifolds like the unit sphere, which is why Gemp et al. (2021) are able to apply their SA tool 'out of the box'. In our case, because we do not restrict to the unit sphere, we are able to apply more classical results on SA, for example Kushner et al. (2003); however, we would need to prove the corresponding stability estimates. These should hold intuitively, because the variance penalty term should keep the estimates small, but they are technically difficult. We note that obtaining such stability estimates has attracted a lot of theoretical attention, but in practice they are often unnecessary, because only a bounded subset of the parameter space is physically sensible. This applies to our GEP: it only makes sense to consider vectors w with w^⊤Bw ≤ 1; and though B is unknown in general, we may well be able to lower bound its eigenvalues, giving a bounded parameter space of interest. We could modify our algorithm to project onto this bounded subset of parameter space. The theory of Kushner et al. (2003) can be applied to such a projected SA scheme; indeed, this theory has the added advantage of requiring weaker conditions on the step sizes, namely only that

Σ_{t=1}^∞ η_t = ∞,    η_t → 0

Note that such step-size schedules with slower decay are sometimes observed to give better empirical results in other SA problems (Kushner et al., 2003). We now point out what we understand to be a technical oversight in the proof of almost sure convergence in Gemp et al. (2021). Their proof that w_i converges a.s. to its true value given fixed exact parents w_{j<i} appears valid; as does the conclusion that w_i converges a.s. to a corresponding optimum given fixed inexact parents; and so does their statement that if the parents are close to correct then the corresponding optimum is close to correct. However, this does not say anything about the convergence of w_i when the parents are inexact and varying; in particular, the arguments of Shah (2017) do not apply in this case. We believe it would be possible to fix this oversight by considering a suitable coupling of solution paths starting from an ϵ-covering of a neighbourhood of the true solution. Alternatively, it may be possible to apply the result of Shah (2017) to the combined estimates (w_1, . . . , w_k). Similar analysis will be needed for GHA-GEP because we also propose parallel updates for computational speed. We next note that this SA literature also applies to our δ-EigenGame algorithm, whose updates are unbiased estimates of the gradients of the utilities U_i^δ. In this case, the analysis may be more straightforward because we can also apply the existing literature on stochastic gradient descent. We have not yet had time to make the discussion above more rigorous; we plan to do so in future work. Our algorithms fit very naturally into the well-studied SA framework, and we expect this literature to contain useful intuition and suggestions for implementation, as well as theoretical guarantees.
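To make the projection idea concrete, here is a minimal sketch of a single projected update; the use of the mini-batch estimate B̂ to define the projection set and the unit radius are illustrative assumptions on our part:

```python
import numpy as np

def projected_sa_step(w, A_hat, B_hat, parents, eta):
    """One delta-EigenGame-style update followed by projection onto
    {w : w^T B_hat w <= 1}, keeping the iterates in a bounded set."""
    direction = A_hat @ w - B_hat @ w * (w @ A_hat @ w)
    for wj in parents:
        direction -= B_hat @ wj * (wj @ A_hat @ w)
    w = w + eta * direction
    q = w @ B_hat @ w
    if q > 1.0:                      # project back onto the B-ball
        w = w / np.sqrt(q)
    return w

# A Robbins-Monro schedule such as eta_t = 1/t satisfies (26); the projected
# theory also admits slower-decaying schedules such as eta_t = t**-0.5,
# which violates the square-summability condition but still has eta_t -> 0.
w = np.array([5.0, -5.0])
A_hat = np.eye(2)
B_hat = np.diag([2.0, 3.0])
w = projected_sa_step(w, A_hat, B_hat, parents=[], eta=0.1)
print(w @ B_hat @ w <= 1.0 + 1e-12)  # True: the iterate stays in the bounded set
```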

B.3 SUBSPACE CHARACTERISATION

We recall proposition 3.1:

Proposition 3.1 (Subspace characterisation). The top-k subspace for the GEP (1) can be characterised by

max_{W ∈ R^{d×k}} trace( W^⊤ A W (2 I_k − W^⊤ B W) )

Proof. Let the objective function be h(W). Firstly, note that because B is assumed positive definite, h becomes negative for sufficiently large values of W, in particular outside some compact set C. Secondly, note that h is positive for some small W. Therefore any maximizer must lie in C. Next, note that h is a composition of the trace (a linear map) with a matrix polynomial, so it is differentiable, and in particular continuous; so the maximum on C is attained. Further, at any maximizer W, the derivative h′(W) is zero. We now compute it:

h′(W) = 2AW(2 I_k − W^⊤ B W) − 2BW(W^⊤ A W)

(This proof is continued in the annex.)

C.4 UTILITY SHAPE

Lemma C.1. Let ŵ_i = m(cos(θ_i) w_i + sin(θ_i) ∆_i), where ŵ_i^⊤ B ŵ_i = m. Then:

U_i(ŵ_i, w_{j<i}) = U_i(m w_i, w_{j<i}) − sin^2(θ_i) ( U_i(m w_i, w_{j<i}) − U_i(m ∆_i, w_{j<i}) )

We will show that this result follows from similar logic to Gemp et al. (2020) once the scaling factor m is accounted for.

Proof. Let ∆_i = Σ_{l=1}^d p_l w_l, with ∥p∥ = 1. Decomposing the utility function for player i we have:

U_i(ŵ_i, w_{j<i}) = 2⟨ŵ_i, Aŵ_i⟩ − ⟨ŵ_i, Bŵ_i⟩⟨ŵ_i, Aŵ_i⟩ − 2 Σ_{j<i} ⟨ŵ_i, Bw_j⟩⟨w_j, Aŵ_i⟩    (35)
= (2m − m^2)(cos^2(θ_i) λ_ii + sin^2(θ_i)⟨∆_i, A∆_i⟩) − 2m Σ_{j<i} ⟨cos(θ_i) w_i + sin(θ_i) ∆_i, Aw_j⟩⟨cos(θ_i) w_i + sin(θ_i) ∆_i, Bw_j⟩    (36)
= (2m − m^2)(cos^2(θ_i) λ_ii + sin^2(θ_i)⟨∆_i, A∆_i⟩) − 2m Σ_{j<i} sin^2(θ_i) ⟨∆_i, Aw_j⟩⟨∆_i, Bw_j⟩    (37)
= (2m − m^2) λ_ii − (2m − m^2) sin^2(θ_i) λ_ii + sin^2(θ_i) ( (2m − m^2)⟨∆_i, A∆_i⟩ − 2m Σ_{j<i} ⟨∆_i, Aw_j⟩⟨∆_i, Bw_j⟩ )    (38)
= (2m − m^2) λ_ii − sin^2(θ_i) ( (2m − m^2) λ_ii − (2m − m^2)⟨∆_i, A∆_i⟩ + 2m Σ_{j<i} ⟨∆_i, Aw_j⟩⟨∆_i, Bw_j⟩ )    (39)
= U_i(m w_i, w_{j<i}) − sin^2(θ_i) ( U_i(m w_i, w_{j<i}) − U_i(m ∆_i, w_{j<i}) )

C.5 UTILITY AS A PROJECTION DEFLATION

U_i^δ = 2 ŵ_i^⊤ [ I − Σ_{j<i} B ŵ_j ŵ_j^⊤ ] A ŵ_i − ŵ_i^⊤ B ŵ_i ŵ_i^⊤ A ŵ_i

Analogously to the previous work on α-EigenGame (Gemp et al., 2020), the matrix [ I − Σ_{j≤i} B ŵ_j ŵ_j^⊤ ] has a natural interpretation as a projection deflation.
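Returning to Proposition 3.1, the first-order condition is easy to sanity-check numerically: at a B-orthonormal basis of the top-k generalised eigenvectors, the derivative of h (as computed in the proof, up to overall sign) vanishes, and h attains the sum of the top-k generalised eigenvalues. A sketch with random matrices:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
d, k = 8, 3
M = rng.standard_normal((d, d)); A = (M + M.T) / 2            # symmetric
N = rng.standard_normal((d, d)); B = N @ N.T + d * np.eye(d)  # positive definite

evals, evecs = eigh(A, B)   # ascending; eigenvectors satisfy evecs.T @ B @ evecs = I
W = evecs[:, -k:]           # top-k generalised eigenvectors, B-orthonormal

h = np.trace(W.T @ A @ W @ (2 * np.eye(k) - W.T @ B @ W))
grad = 2 * A @ W @ (2 * np.eye(k) - W.T @ B @ W) - 2 * B @ W @ (W.T @ A @ W)

print(abs(h - evals[-k:].sum()) < 1e-8)  # True: h equals the sum of top-k eigenvalues
print(np.abs(grad).max() < 1e-8)         # True: W is a stationary point of h
```

Since W^⊤BW = I_k and AW = BWΛ_k at this point, both facts follow directly from the algebra in the proof; the code simply confirms them in floating point.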

F.5 DEEP CCA

We use two variants of paired MNIST datasets. The first is identical to the split MNIST dataset in the previous section. The second, harder problem is closely related to an experiment by Wang et al. (2015a). Their 'noisy' paired MNIST data pairs two different digits of the same class: the first is rotated randomly, while the second has additive Gaussian noise. Finally, we use the X-Ray Microbeam (XRMB) dataset from Arora et al. (2016). For all of the datasets, we use two encoders with 50 latent dimensions and two hidden layers of size 800 with leaky ReLU activation functions, a similar architecture to that used in Wang et al. (2015c). We compare our proposed method to DCCA-NOI at mini-batch sizes of 20 and 100, and to DCCA-STOL at mini-batch sizes of 100 and 500 (DCCA-STOL cannot be used for mini-batch sizes smaller than the number of latent dimensions).
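For concreteness, a minimal sketch of the encoder architecture described above (two hidden layers of width 800, leaky-ReLU activations, a 50-dimensional linear output). We use plain numpy for the forward pass; the initialisation scheme and the absence of bias terms are simplifications of ours:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def make_encoder(input_dim, hidden=800, latent=50, seed=0):
    """Two hidden layers of width 800 with leaky-ReLU, 50-d linear output."""
    rng = np.random.default_rng(seed)
    sizes = [input_dim, hidden, hidden, latent]
    weights = [rng.standard_normal((m, n)) / np.sqrt(m)
               for m, n in zip(sizes, sizes[1:])]

    def encode(X):
        for W in weights[:-1]:
            X = leaky_relu(X @ W)
        return X @ weights[-1]  # linear final layer -> latent representation

    return encode

encoder = make_encoder(392)        # e.g. one half of a split 28x28 MNIST image
Z = encoder(np.zeros((4, 392)))
print(Z.shape)                     # (4, 50)
```

In the experiments one such encoder is trained per view, and the GEP objective couples their 50-dimensional outputs.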

F.6 DCCA HYPERPARAMETERS

We trained for 30 epochs on each dataset with mini-batch sizes of 20 and 100. The learning rate was tuned over η ∈ {10^−1, 10^−2, 10^−3, 10^−4, 10^−5}, and the DCCA-NOI parameter ρ was tuned between 0 and 1. We use PyTorch (Paszke et al., 2019) with the Adam optimizer (Kingma & Ba, 2014).

G ADDITIONAL EXPERIMENTS

G.1 STOCHASTIC CCA WITH SMALLER MINI-BATCH SIZES

In this section we repeat the experiments described in the main text with smaller mini-batch sizes (64 and 32). Results for mini-batch sizes 32 and 64 are broadly similar to those in the main text for mini-batch size 128. In the MNIST data we can see again that there is a tradeoff between speed of convergence in early iterations and the quality of the solution. 

G.2 PARTIAL LEAST SQUARES

The Partial Least Squares (PLS; Wold et al., 1984) problem can also be formulated as a similar GEP, but with B replaced by the identity matrix. PLS is equivalent to finding the singular value decomposition (SVD) of the covariance matrix X^⊤Y. It can be interpreted as an infinitely ridge-regularised CCA in which the covariance matrices Σ_XX and Σ_YY are replaced by identity matrices; this corresponds to assuming no collinearity between variables.
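The SVD view can be checked directly: stacking a pair of singular vectors of the cross-covariance C gives an eigenvector of the symmetric block matrix A = [[0, C], [C^⊤, 0]] with B = I, with eigenvalue equal to the corresponding singular value. A sketch on random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 6, 4
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))

C = (X - X.mean(0)).T @ (Y - Y.mean(0)) / n   # cross-covariance estimate
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# PLS as a GEP with B = I: A w = lambda w for the symmetric block matrix A.
A = np.block([[np.zeros((p, p)), C],
              [C.T, np.zeros((q, q))]])
w = np.concatenate([U[:, 0], Vt[0]]) / np.sqrt(2)
print(np.allclose(A @ w, s[0] * w))           # True
```

This is the sense in which PLS is the B = I special case of the GEP (1) studied in the paper.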

G.2.1 PLS WITH STOCHASTIC MINI-BATCHES

In this experiment we compare our method to the stochastic power method (Arora et al., 2016), γ-EigenGame, and SGHA on the stochastic PLS problem. For these experiments we use the Proportion of Variance captured (PV). This is the sum of the singular values of the representation learnt by each stochastic optimisation method, as a proportion of the sum of the singular values of the representation learnt using the population ground truth (i.e. the sum of the top-k singular values of the covariance matrix X^⊤Y). Figure 5 shows that all of the methods perform similarly in terms of variance captured across the datasets. While the stochastic power method converges very quickly on the MNIST and CIFAR data, its solutions can be suboptimal. The performance of δ-EigenGame is arguably more surprising for the PLS problem, because both the stochastic power method and γ-EigenGame explicitly enforce the constraints at each iteration, whereas δ-EigenGame only enforces the constraints via penalty terms.
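A sketch of the PV metric as we read the description above; the assumption that the learnt weights have been reduced to orthonormal bases U, V is ours:

```python
import numpy as np

def proportion_of_variance(C, U, V):
    """Sum of singular values captured by the subspaces (U, V), relative to
    the top-k singular values of the cross-covariance C (k = columns of U)."""
    k = U.shape[1]
    captured = np.linalg.svd(U.T @ C @ V, compute_uv=False).sum()
    best = np.linalg.svd(C, compute_uv=False)[:k].sum()
    return captured / best

rng = np.random.default_rng(0)
C = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(C)
pv = proportion_of_variance(C, U[:, :3], Vt[:3].T)
print(abs(pv - 1.0) < 1e-8)   # True: the exact top-3 SVD captures everything
```

Any suboptimal pair of subspaces yields PV strictly below 1, which is how the stochastic methods in Figure 5 are scored.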



Footnote: or, more generally, Hermitian (Stewart & Sun, 1990).

Footnote: Though the authors called this SGHA, it is rather different from the original proposal, because it is a subspace method rather than an iterative one.



) from the paper of Wang et al. (2015c) in Appendix C.1.

Figure 1: CCA with stochastic mini-batches: proportion of correlation captured with respect to Scipy ground truth by GHA-GEP and δ-EigenGame vs prior work. The maximum value is 1.

Figure 2: CCA with stochastic mini-batches: proportion of correlation captured with respect to Scipy ground truth by GHA-GEP and δ-EigenGame vs prior work. The maximum value is 1.

Figure 3: Total correlation captured by the 50 latent dimensions in the validation data. The maximum value is 50. The top row shows results for mini-batch size 20 and the bottom row shows results for mini-batch size 100.

Figure 4: CCA with stochastic mini-batches of size 64 (top) and 32 (bottom): proportion of correlation captured with respect to Scipy ground truth by δ-EigenGame vs prior work. The maximum value is 1.

Figure 5: PLS with stochastic mini-batches: proportion of variance captured with respect to Scipy ground truth by δ-EigenGame vs prior work. The maximum value is 1.

Figure 6: PLS with stochastic mini-batches of size 64 (top) and 32 (bottom): proportion of variance captured with respect to Scipy ground truth by δ-EigenGame vs prior work. The maximum value is 1.

Algorithm 1 A Sample-Based Generalized Hebbian Algorithm for GEPs
Input: data stream Z_t consisting of b samples from z_n; learning rates (η_t)_t; number of time steps T; number of eigenvectors to compute k.
Initialise: (ŵ_i)_{i=1}^k with random uniform entries
for t = 1 to T do
    Construct independent unbiased estimates Â and B̂ from Z_t
    for i = 1 to k do
        ŵ_i ← ŵ_i + η_t ∆̂_i^GHA-GEP

Algorithm 2 A Sample-Based δ-EigenGame for GEPs
Input: data stream Z_t consisting of b samples from z_n; learning rates (η_t)_t; number of time steps T; number of eigenvectors to compute k.
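A minimal full-batch sketch of the inner loop, using the per-player direction ∆_i = Aŵ_i − Σ_{j≤i} Bŵ_j(ŵ_j^⊤Aŵ_i) derived in Appendix B. We use exact matrices with a known generalised eigenstructure rather than mini-batch estimates, to keep the example deterministic; the spectrum, dimensions, and step size are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, k = 5, 2
N = rng.standard_normal((d, d))
B = N @ N.T + d * np.eye(d)                  # positive definite
lam = np.array([5.0, 4.0, 3.0, 2.0, 1.0])    # chosen generalised eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Wstar = np.linalg.solve(np.linalg.cholesky(B).T, Q)   # Wstar.T @ B @ Wstar = I
A = B @ Wstar @ np.diag(lam) @ Wstar.T @ B            # so A w* = lam B w*

W = rng.standard_normal((d, k))
W /= np.sqrt(np.sum(W * (B @ W), axis=0))    # B-normalise the initial guesses

eta = 0.01
for _ in range(20_000):
    for i in range(k):
        w = W[:, i]
        delta = A @ w
        for j in range(i + 1):               # parents j < i plus the self term
            delta -= B @ W[:, j] * (W[:, j] @ A @ w)
        W[:, i] = w + eta * delta

# Each column aligns with the corresponding top generalised eigenvector (up to sign).
V = eigh(A, B)[1][:, ::-1][:, :k]
cos = np.abs(np.sum(W * V, axis=0)) / (np.linalg.norm(W, axis=0) * np.linalg.norm(V, axis=0))
print(np.round(cos, 3))   # each entry close to 1
```

In the stochastic algorithm above, A and B are simply replaced by the independent unbiased mini-batch estimates Â and B̂ at every step.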

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261-272, 2020.

Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In International Conference on Machine Learning, pp. 1083-1092. PMLR, 2015a.

ANNEX

Setting this to zero, left-multiplying by W^⊤, and using the previous notation Ã = W^⊤AW, B̃ = W^⊤BW, gives B̃Ã + ÃB̃ = 2Ã, and so (B̃ − I)Ã + Ã(B̃ − I) = 0; hence S := (B̃ − I)Ã is skew-symmetric (using that Ã, B̃ are both symmetric). But then right-multiplying by Ã^{−1} gives SÃ^{−1} = B̃ − I, which is symmetric. So in fact we must have both sides equal to zero, and therefore B̃ = I. But then, by the arguments at the start of the proof, we see that any maximizer of h must in fact have W^⊤BW = I; so (17) is indeed equivalent to (3), and any maximizer recovers the top-k subspace.

We now indulge in some vague intuition: a key strength of this unconstrained formulation is that it is straightforward to transform with respect to arbitrary changes of basis; therefore one can perform the analysis in the basis of generalised eigenvectors. By contrast, the orthogonality constraint in (3) only permits orthogonal changes of basis. This may give some intuition for why µ-EigenGame only works in the B = I case, while our approach is effective for general GEPs.

C FURTHER CONNECTIONS TO PREVIOUS WORK

[equation omitted] where F = (f(x_1), . . . , f(x_N)), G = (g(y_1), . . . , g(y_N)) are matrices whose columns are the images of the training data under functions f, g defined by neural networks in some classes of functions F, G with input and output dimensions (p, d_x) and (q, d_y) respectively. (Observe that this optimisation is really targeting the population problem.) We now abuse notation and write R_k(f(X), g(Y)) for the sum of the first k canonical correlations of the pair of random variables f(X), g(Y). It is well known that, if we fix f, g in the above, then the optimisation problem defines the top-k subspace for CCA. So we can write the optimisation as [equation omitted]. But then, using proposition 3.1, this optimisation is also equivalent to [equation omitted], where we define [equation omitted]. We have now almost recovered the form of (16); the only difference is that there is an optimisation over W in the above. To finish the derivation we follow Wang et al. (2015c) and define the augmented function classes [equation omitted], which precisely matches our objective in (16).

We now comment on this analysis. Wang et al. (2015c) propose DCCA to find a pair of low-dimensional feature maps under which the two sets of data are highly correlated. Intuitively, this analysis says that if we take a sufficiently expressive class of neural networks, we only need to consider a k-dimensional latent space to recover the top-k subspace of 'deep canonical directions'. Note also that one only needs to apply a k-dimensional classical CCA to recover the top-k directions from this subspace. Finally, we warn that in general these directions may be highly non-unique, and that many of the nice properties of CCA depend on the structure of Euclidean space and do not hold for DCCA.

C.2 DERIVATION OF THE SGHA ALGORITHM FROM CHEN ET AL. (2019)

The Lagrangian function in Chen et al. (2019) corresponding to (3) is given as

L(W, Γ) = trace(W^⊤AW) − ⟨Γ, W^⊤BW − I_k⟩

Differentiating with respect to W gives

∂L/∂W = 2AW − 2BWΓ

Left-multiplying by W^⊤ and using the constraint W^⊤BW = I_k shows that at any stationary point we have

Γ = W^⊤AW

They then plug this value of Γ into a gradient descent step for W to obtain the update direction

AW − BW(W^⊤AW)

(where we follow their exposition and drop the factor of 2 at this point). Note that this technique of plugging in the optimal dual variable is, to our knowledge, non-standard. The algorithm needs their theoretical results for more concrete justification.

C.3 PSEUDO-UTILITIES IN PREVIOUS WORK

Recall that we defined the pseudo-utility of GHA-GEP to be [equation omitted]. Considering a single 'player', this utility can be written [equation omitted]. We can also write the updates of µ-EigenGame (Gemp et al., 2021) as arising from a Lagrangian pseudo-utility [equation omitted]. Note that this is a slightly different expression from that given by the authors.

D VECTORIZED δ-EIGENGAME

Algorithm 4 Vectorized δ-EigenGame
Input: data stream Z_t consisting of b samples from z_n; learning rate η.
for t = 1 to T do
    Construct independent unbiased estimates Â and B̂ from Z_t

where triu(·) returns a matrix with the entries below the main diagonal set to zero.
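The role of triu can be checked against the per-vector update: with M = Ŵ^⊤AŴ, column i of BŴ·triu(M) equals Σ_{j≤i} Bŵ_j(ŵ_j^⊤Aŵ_i), so zeroing the strictly-lower-triangular entries is exactly what restricts each player to itself and its parents. A sketch (the vectorised form below is our reading of the algorithm; the matrices are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
S = rng.standard_normal((d, d)); A = (S + S.T) / 2
N = rng.standard_normal((d, d)); B = N @ N.T + np.eye(d)
W = rng.standard_normal((d, k))

# Vectorised update: triu zeroes the entries below the main diagonal,
# so column i only interacts with columns j <= i.
delta_vec = A @ W - B @ W @ np.triu(W.T @ A @ W)

# Per-vector form of the same update, as in Appendix B.
delta_loop = np.zeros_like(W)
for i in range(k):
    delta_loop[:, i] = A @ W[:, i]
    for j in range(i + 1):
        delta_loop[:, i] -= B @ W[:, j] * (W[:, j] @ A @ W[:, i])

print(np.allclose(delta_vec, delta_loop))   # True
```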

E A CONNECTION TO SELF-SUPERVISED LEARNING METHODS

Reorganizing our update equation (5), we find that the intuition behind our method can also be understood in terms of three terms: encouraging variance in A, penalizing variance in B, and discouraging covariance. The motivation is similar to that of recent work in self-supervised learning (Zbontar et al., 2021), and in particular the VICReg method of Bardes et al. (2021). Recent work has shown links between several self-supervised learning approaches and classical spectral embedding methods (Balestriero & LeCun, 2022), some of which can be represented as GEPs. Like CCA, many self-supervised learning approaches are based on finding a function which is invariant between an image and its augmented version, i.e. the learnt representations of the two are correlated.

F EXPERIMENT DETAILS

F.1 TOTAL CORRELATION CAPTURED (TCC)

This is the sum of the canonical correlations of the learnt representation (i.e. the sum of the top-k canonical correlations of X and Y ).

F.2 PROPORTION OF CORRELATION CAPTURED (PCC)

This is the sum of the canonical correlations of the learnt representation as a proportion of the sum of the canonical correlations of the representation learnt using the population ground truth (i.e. the sum of the top-k canonical correlations of X and Y).
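Both metrics reduce to computing canonical correlations of the transformed views, which are the singular values of the product of orthonormal bases for the centred views. A sketch (the QR-plus-SVD computation is a standard identity, not code from the paper):

```python
import numpy as np

def canonical_correlations(Zx, Zy):
    """Canonical correlations between two views, via QR + SVD on centred data."""
    Qx, _ = np.linalg.qr(Zx - Zx.mean(0))
    Qy, _ = np.linalg.qr(Zy - Zy.mean(0))
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

rng = np.random.default_rng(0)
Zx = rng.standard_normal((200, 5))

# TCC of a 5-dimensional view with itself: all correlations are 1, so TCC = 5.
tcc_self = canonical_correlations(Zx, Zx).sum()
print(round(tcc_self, 6))   # 5.0

# PCC divides the learnt TCC by the TCC of the population ground-truth solution.
```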

F.3 STOCHASTIC CCA

The latter two datasets are formed from left and right halves of the canonical datasets (LeCun et al., 2010; Krizhevsky et al., 2009) . With the same initialization for all methods, we trained for 10 epochs on each dataset with a mini-batch size of 128 and illustrate the models with the best performance in the validation set.

F.4 STOCHASTIC CCA HYPERPARAMETERS

The learning rate was tuned over η ∈ {10^−1, 10^−2, 10^−3, 10^−4, 10^−5}, and the γ-EigenGame parameter γ was tuned over the same range. We used Jax (Babuschkin et al., 2020) to optimize the linear CCA models using the Jaxline framework. We used WandB (Biewald, 2020) for experiment tracking.

