THE SYMMETRIC GENERALIZED EIGENVALUE PROBLEM AS A NASH EQUILIBRIUM

Abstract

The symmetric generalized eigenvalue problem (SGEP) is a fundamental concept in numerical linear algebra. It captures the solution of many classical machine learning problems such as canonical correlation analysis, independent component analysis, partial least squares, linear discriminant analysis, principal component analysis, and others. Despite this, most general solvers are prohibitively expensive when dealing with streaming data sets (i.e., minibatches), and research has instead concentrated on finding efficient solutions to specific problem instances. In this work, we develop a game-theoretic formulation of the top-$k$ SGEP whose Nash equilibrium is the set of generalized eigenvectors. We also present a parallelizable algorithm with guaranteed asymptotic convergence to the Nash. Current state-of-the-art methods require $O(d^2 k)$ runtime complexity per iteration, which is prohibitively expensive when the number of dimensions ($d$) is large. We show how to modify this parallel approach to achieve $O(dk)$ runtime complexity. Empirically, we demonstrate that the resulting algorithm is able to solve a variety of SGEP problem instances, including a large-scale analysis of neural network activations.

1. INTRODUCTION

This work considers the symmetric generalized eigenvalue problem (SGEP),

$$Av = \lambda Bv, \qquad (1)$$

where $A$ is symmetric and $B$ is symmetric, positive definite. While the SGEP is not a common sight in modern machine learning literature, remarkably, it underlies several fundamental problems. Most obviously, when $A = X^\top X$, $B = I$, and $X$ is a data matrix, we recover the ubiquitous SVD/PCA. By considering other forms of $A$ and $B$, however, we recover other well-known problems. In general, we assume $A$ and $B$ consist of sums or expectations over outer products (e.g., $X^\top Y$ or $\mathbb{E}[xy^\top]$) to enable efficient matrix-vector products. These include, but are not limited to:

Canonical Correlation Analysis (CCA): Given a dataset of paired observations (or views) $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$ (e.g., gene expressions $x$ and medical imaging $y$ corresponding to the same patient), CCA returns the linear projections of $x$ and $y$ that are maximally correlated. CCA is particularly useful for learning multi-modal representations of data and in semi-supervised learning (McWilliams et al., 2013); it is effectively the multi-view generalization of PCA (Guo & Wu, 2019), where $A$ and $B$ contain the cross- and auto-covariances of the two views respectively:

$$A = \begin{bmatrix} 0 & \mathbb{E}[xy^\top] \\ \mathbb{E}[yx^\top] & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} \mathbb{E}[xx^\top] & 0 \\ 0 & \mathbb{E}[yy^\top] \end{bmatrix}.$$

* Asterisk denotes equal contribution. † Work done while at DeepMind.

Independent Component Analysis (ICA): ICA seeks the directions in the data which are most structured or, alternatively, appear least Gaussian (Hyvärinen & Oja, 2000). A common SGEP formulation of ICA uncovers latent variables which maximize the non-Gaussianity of the data as defined by its excess kurtosis. ICA has famously been proposed as a solution to the so-called cocktail-party source-separation problem in audio processing and has been used for denoising and, more generally, the discovery of explanatory latent factors in data.
Here $A$ and $B$ are the excess kurtosis and the covariance of the data respectively (Parra & Sajda, 2003):

$$A = \mathbb{E}[\langle x, x \rangle\, xx^\top] - \operatorname{tr}(B)\,B - 2B^2, \qquad B = \mathbb{E}[xx^\top].$$

Normalized Graph Laplacians: The graph Laplacian matrix ($L$) is central to tasks such as spectral clustering ($A = L$, $B = I$), where its eigenvectors are known to solve a relaxation of min-cut (Von Luxburg, 2007). Alternatives, such as the random-walk normalized Laplacian ($A = L$, $B$ is the diagonal node-degree matrix), approximate other min-cut objectives.

Partial least squares can similarly be formulated as a SGEP (Montana, 2010). Likewise, linear discriminant analysis (LDA) can be formulated as a SGEP and learns a label-aware projection of the data that separates classes well (Rao, 1948). More examples and uses of the SGEP can be found in (Bie et al., 2005; Borga et al., 1997).

We now shift focus to the mathematical properties and challenges of the corresponding SGEP. In this work, we assume the matrices $A$ and $B$ above can either be defined using expectations under a data distribution (e.g., $\mathbb{E}_{x \sim p(x)}[xx^\top]$) or means over a finite sample dataset (e.g., $\frac{1}{n} X^\top X$ where $X \in \mathbb{R}^{n \times d_x}$). In either case, we typically assume the data has mean zero unless specified otherwise.

Note that the SGEP, $Av = \lambda Bv$, is equivalent to the eigenvalue problem $B^{-1}Av = \lambda v$. There are two reasons for working with the SGEP instead: 1) inverting $B$ is prohibitively expensive for a large matrix, and 2) while $A$ and $B \succ 0$ are symmetric, $B^{-1}A$ is not, which hides useful information about the eigenvalues and eigenvectors (they are necessarily real and $B$-orthogonal). This also highlights that the SGEP is a fundamentally more challenging problem than SVD, and why a direct application of previous game-theoretic approaches such as (Gemp et al., 2021; 2022) is not possible.

The complexity of solving the SGEP exactly is $O(d^3)$, where $d$ is the dimension of the square matrix $A$ (equivalently $B$). Several libraries exist for solving the SGEP in memory (Tzounas et al., 2020).
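The properties of the symmetric-definite pencil noted above (real eigenvalues, $B$-orthogonal eigenvectors, and the non-symmetry of $B^{-1}A$) can be checked directly with an off-the-shelf dense solver. The following is a toy sketch, not the paper's method: the matrices are random illustrations and SciPy's `eigh` plays the role of a generic SGEP solver.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d = 6
# Illustrative random symmetric A and symmetric positive-definite B.
M = rng.normal(size=(d, d))
A = (M + M.T) / 2
N = rng.normal(size=(d, d))
B = N @ N.T + d * np.eye(d)   # diagonal shift guarantees positive definiteness

# eigh solves the symmetric-definite pencil A v = lambda B v directly,
# without ever forming B^{-1} A; eigenvectors come back B-orthonormal.
vals, V = eigh(A, B)

C = np.linalg.solve(B, A)     # B^{-1} A: same eigenvalues, but NOT symmetric
```

Even though `C` is non-symmetric, `vals` are real, `V.T @ B @ V` is the identity ($B$-orthogonality), and `A @ V == B @ V @ diag(vals)` holds up to numerical error.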
There is also a vast numerics literature, to which we cannot do justice here, that considers large matrices (Sorensen, 2002). We specifically focus on the stochastic, streaming-data setting, which is of particular interest for machine learning methods that learn by iterating over small minibatches of data (e.g., stochastic gradient descent). Under this setting, machine learning research has developed simple approximate solvers for singular value decomposition (SVD) that scale to very large datasets (Allen-Zhu & Li, 2017b). Similarly, in this work, we contribute a simple, elegant solution to the SGEP, including:

• A game whose Nash equilibrium is the top-$k$ SGEP solution,
• An easily parallelizable algorithm with $O(dk)$ per-iteration complexity relying only on matrix-vector products,
• An empirical analysis of neural similarity on activations 1000× larger than prior work.

The game and accompanying algorithm are developed synergistically to achieve a formulation that is amenable to analysis and naturally leads to an elegant and efficient algorithm.
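The reliance on matrix-vector products is what makes the streaming setting tractable: because $A$ and $B$ are expectations over outer products, $Av$ can be estimated from a minibatch without ever materializing a $d \times d$ matrix. A minimal sketch (shapes and variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 2_000, 64
X = rng.normal(size=(m, d))   # one minibatch of m centered samples
v = rng.normal(size=d)

# Naive: form A = X^T X / m explicitly -- O(m d^2) time and O(d^2) memory.
# Matvec-only: A v = X^T (X v) / m -- O(m d) time and O(d) extra memory,
# computed right-to-left so only length-m and length-d vectors are stored.
Av = X.T @ (X @ v) / m
```

Repeating this for $k$ eigenvector estimates gives the $O(dk)$ per-iteration flavor claimed above (up to the minibatch size $m$).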

2. GENERALIZED EIGENGAME: PLAYERS, STRATEGIES, AND UTILITIES

In this work, we take the approach of defining the top-k SGEP as a k-player game. It is an open question how to define a k-player game appropriately such that key properties of the SGEP are



These normalized variants, in particular, are important for computing representations used to learn value functions in reinforcement learning, such as successor features (Machado et al., 2017a; Stachenfeld et al., 2014; Machado et al., 2017b), an extension of proto-value functions (Mahadevan, 2005), which uses the unnormalized graph Laplacian ($A = L$, $B = I$).
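The random-walk normalized Laplacian mentioned above is itself a small SGEP, $Lv = \lambda Dv$. A toy sketch of spectral bi-partitioning (the six-node graph and the sign-threshold clustering are illustrative assumptions, and SciPy's dense solver stands in for the paper's algorithm):

```python
import numpy as np
from scipy.linalg import eigh

# Tiny graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
Adj = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
D = np.diag(Adj.sum(axis=1))      # diagonal node-degree matrix
L = D - Adj                       # unnormalized graph Laplacian

# Random-walk normalized Laplacian as an SGEP: L v = lambda D v.
vals, V = eigh(L, D)

fiedler = V[:, 1]                 # eigenvector of the second-smallest eigenvalue
labels = (fiedler > 0).astype(int)  # sign threshold recovers the two clusters
```

The smallest eigenvalue is 0 (constant eigenvector, since the graph is connected), and thresholding the second eigenvector separates the two triangles, mirroring the min-cut relaxation discussed in the main text.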

