A GENERALIZED EIGENGAME WITH EXTENSIONS TO DEEP MULTIVIEW REPRESENTATION LEARNING

Abstract

Generalized Eigenvalue Problems (GEPs) encompass a range of interesting dimensionality reduction methods. Development of efficient stochastic approaches to these problems would allow them to scale to larger datasets. Canonical Correlation Analysis (CCA) is one example of a GEP for dimensionality reduction which has found extensive use in problems with two or more views of the data. Deep learning extensions of CCA require large mini-batch sizes, and therefore large memory consumption, to achieve good performance in the stochastic setting, and this has limited their application in practice. Inspired by the Generalized Hebbian Algorithm, we develop an approach to solving stochastic GEPs in which all constraints are softly enforced by Lagrange multipliers. Then, by considering the integral of this Lagrangian function (its pseudo-utility), and inspired by recent formulations of Principal Components Analysis and GEPs as games with differentiable utilities, we develop a game-theory-inspired approach to solving GEPs. We show that our approaches share much of the theoretical grounding of the previous Hebbian and game-theoretic approaches for the linear case, but our method permits extension to general function approximators such as neural networks for certain GEPs for dimensionality reduction, including CCA; our method can therefore be used for deep multiview representation learning. We demonstrate the effectiveness of our method for solving GEPs in the stochastic setting on canonical multiview datasets, and demonstrate state-of-the-art performance for optimizing Deep CCA.

1. INTRODUCTION

A Generalised Eigenvalue Problem (GEP) is defined by two symmetric¹ matrices A, B ∈ R^{d×d}. They are usually characterised by the set of solutions to the equation

Aw = λBw (1)

with λ ∈ R, w ∈ R^d, called (generalised) eigenvalue and (generalised) eigenvector respectively. Note that by taking B = I we recover the standard eigenvalue problem. We shall only be concerned with the case where B is positive definite, to avoid degeneracy; in this case one can find a basis of eigenvectors spanning R^d. Without loss of generality, take w_1, . . . , w_d to be such a basis of eigenvectors, with decreasing corresponding eigenvalues λ_1 ≥ · · · ≥ λ_d. The individual eigenvectors can then be characterised by the iterative maximisations

w_k = argmax_w w^⊤ A w subject to w^⊤ B w = 1, w^⊤ B w_j = 0 for j = 1, . . . , k − 1, (2)

see Stewart & Sun (1990). There is also a simpler (non-iterative) variational characterisation for the top-k subspace (that spanned by {w_1, . . . , w_k}), namely

max_{W ∈ R^{d×k}} trace(W^⊤ A W) subject to W^⊤ B W = I_k (3)

again see Stewart & Sun (1990); the drawback of this characterisation is that it only recovers the subspace and not the individual eigenvectors. We shall see that these two different characterisations lead to different algorithms for the GEP. Many classical dimensionality reduction methods can be viewed as GEPs, including but not limited to Principal Components Analysis (Hotelling, 1933), Partial Least Squares (Haenlein & Kaplan, 2004), Fisher Discriminant Analysis (Mika et al., 1999), and Canonical Correlation Analysis (CCA) (Hotelling, 1992). Each of the problems above is defined at a population level, using population values of the matrices A, B, usually functionals of some appropriate covariance matrices. The practical challenge is the sample version: to estimate the population GEP where we only have estimates of A, B through some finite number of samples (z_n)_{n=1}^N; classically, one just solves the GEP with A, B estimated by plugging in the relevant sample covariance matrices.
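As a concrete illustration, the defining equation (1) and the subspace characterisation (3) can be checked numerically with SciPy's generalized-eigenproblem solver; the matrices below are synthetic stand-ins for the population A, B:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, k = 5, 2

# Random symmetric A and positive-definite B.
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
N = rng.standard_normal((d, d))
B = N @ N.T + d * np.eye(d)

# eigh solves the GEP A w = lambda B w; eigenvalues are returned ascending.
lam, W = eigh(A, B)
lam, W = lam[::-1], W[:, ::-1]                   # sort descending

# The defining equation (1), and B-orthonormality of the eigenvectors.
assert np.allclose(A @ W, B @ W * lam)
assert np.allclose(W.T @ B @ W, np.eye(d), atol=1e-8)

# Characterisation (3): the optimum trace(W_k^T A W_k) subject to
# W_k^T B W_k = I_k equals the sum of the top-k eigenvalues.
Wk = W[:, :k]
print(np.trace(Wk.T @ A @ Wk), lam[:k].sum())    # these agree
```

Note that `eigh` normalises the eigenvectors so that W^⊤BW = I rather than W^⊤W = I, which is exactly the B-geometry appearing in (2) and (3).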
However, for very large datasets, the dimensionality of the associated GEPs makes it memory- and compute-intensive to find solutions using existing full-batch algorithms; these are usually variants of the singular value decomposition in which successive eigenvalue-eigenvector pairs are calculated sequentially by deflation (Mackey, 2008), and so cannot exploit parallelism over the eigenvectors. This work was motivated in particular by CCA, a classical method for learning representations of data with two or more distinct views: a problem known as multiview (representation) learning. Multiview learning methods are useful for learning representations of data with multiple sets of features, or 'views'. CCA identifies projections or subspaces in at least two different views that are highly correlated; these can be used to generate robust low-dimensional representations for a downstream prediction task, to discover relationships between views, or to generate representations of a view that is missing at test time. CCA has been widely applied across a range of fields such as neuroimaging (Krishnan et al., 2011), finance (Cassel et al., 2000), and imaging genetics (Hansen et al., 2021). Deep learning functional forms are often highly effective for modelling extremely large datasets, as they have more expressivity than linear models and scale better than kernel methods. While PCA has found a natural stochastic non-linear extension in the popular autoencoder architecture (Kramer, 1991), applications of Deep CCA (DCCA) (Andrew et al., 2013) have been more limited because the constraints in the problem are more challenging to estimate outside the full-batch setting. In particular, DCCA performs badly when its objective is maximized using stochastic mini-batches.
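To make the CCA-as-GEP connection concrete, the sketch below uses the standard reduction of two-view CCA to a GEP, with A holding the cross-covariances and B the within-view covariances; the two-view data here is synthetic, and the result is cross-checked against the equivalent SVD formulation:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
n, p, q = 500, 4, 3
X = rng.standard_normal((n, p))
Y = X[:, :q] + 0.5 * rng.standard_normal((n, q))   # two correlated views
X = X - X.mean(0)
Y = Y - Y.mean(0)

Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

# CCA as a GEP: A w = rho * B w with block-structured A and B.
A = np.block([[np.zeros((p, p)), Cxy], [Cxy.T, np.zeros((q, q))]])
B = np.block([[Cxx, np.zeros((p, q))], [np.zeros((q, p)), Cyy]])

rho = eigh(A, B, eigvals_only=True)[-1]   # top generalized eigenvalue

# Cross-check: top singular value of the whitened cross-covariance.
def inv_sqrt(C):
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
print(rho, np.linalg.svd(T, compute_uv=False)[0])  # these agree
```

The top generalized eigenvalue of this pencil is exactly the top canonical correlation; the eigenvector stacks the projection directions for the two views.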
This is unfortunate, as DCCA would appear to be well suited to a number of multiview machine learning applications: a number of successful deep multiview machine learning methods (Suzuki & Matsuo, 2022) and certain self-supervised learning approaches (Zbontar et al., 2021) are designed around principles similar to DCCA, namely maximizing the consensus between non-linear models of different views (Nguyen & Wang, 2020). Recently, a number of algorithms have been proposed to approximate GEPs (Arora et al., 2012), and CCA specifically (Bhatia et al., 2018), in the 'stochastic' or 'data-streaming' setting; these can yield substantial computational savings. Typically, the computational complexity of classical GEP algorithms is O((N + k)d²); by exploiting parallelism (both between eigenvectors and between samples in a mini-batch), we can reduce this to O(dk) (Arora et al., 2016). Stochastic algorithms also introduce a form of regularisation which can be very helpful in these high-dimensional settings. A key motivation for us was a recent line of work reformulating top-k eigenvalue problems as games (Gemp et al., 2020; 2021), later extended to GEPs in Gemp et al. (2022). We shall refer to these ideas as the 'EigenGame framework'. Unfortunately, their GEP extension is very complicated, with 3 different hyperparameters; this complication is needed because they constrain their estimates to lie on the unit sphere, which is a natural geometry for the usual eigenvalue problem but not for the GEP. By replacing this unit-sphere constraint with a Lagrange multiplier penalty, we obtain a much simpler method (GHA-GEP) with only a single hyperparameter. This is a significant practical improvement because the convergence of such algorithms is mostly sensitive to the step-size (learning rate) parameter (Li & Jordan, 2021), and it allows a practitioner to explore many more learning rates for the same computational budget.
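The flavour of a Hebbian-style update with softly enforced constraints can be illustrated with the following toy sketch on a synthetic pencil. This is a Sanger-style update presented for intuition only, not necessarily the exact GHA-GEP update proposed here; the triangular term softly enforces both the B-normalisation and B-orthogonality to higher-ranked ("parent") vectors, and in the stochastic setting A and B would be replaced by per-mini-batch estimates:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, lr, steps = 6, 2, 0.005, 20000

# A synthetic pencil (A, B) with known generalized eigenvalues.
N = rng.standard_normal((d, d))
B = N @ N.T / (2 * d) + np.eye(d)                # positive definite
lam = np.array([3.0, 2.0, 1.0, 0.5, 0.2, 0.1])   # chosen spectrum
V0 = rng.standard_normal((d, d))
L = np.linalg.cholesky(V0.T @ B @ V0)
V = V0 @ np.linalg.inv(L).T                      # B-orthonormal: V.T @ B @ V = I
A = B @ V @ np.diag(lam) @ V.T @ B               # so A @ V = B @ V @ diag(lam)

# Hebbian-style ascent: no projection step; the constraints of (2) are
# only enforced softly through the update itself.
W = 0.1 * rng.standard_normal((d, k))
for _ in range(steps):
    T = np.triu(W.T @ A @ W)                     # rewards and soft penalties
    W += lr * (A @ W - B @ W @ T)

print(np.round(W.T @ B @ W, 3))                  # approx. the identity
print(np.round(np.diag(W.T @ A @ W), 3))         # approx. top-2 eigenvalues
```

At a fixed point, each column w_i satisfies A w_i = Σ_{j≤i} (w_j^⊤ A w_i) B w_j with w_i^⊤ B w_i = 1, i.e. the columns recover the top-k generalized eigenpairs individually, not just their span.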
We also propose a second class of method (δ-EigenGame), defined via optimising explicit utility functions rather than via updates, which enjoys the same practical advantages and similar performance. These utilities give unconstrained variational forms for GEPs that we have not seen elsewhere in the literature and may be of independent interest; their key practical advantage is that they contain only linear factors of A and B, so we can easily obtain unbiased estimates of their gradients. The other key advantage of these utility-based methods is that they



¹ Or, more generally, Hermitian; see Stewart & Sun (1990).

