EIGENGAME: PCA AS A NASH EQUILIBRIUM

Abstract

We present a novel view of principal component analysis (PCA) as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient-based updates. The resulting algorithm, which combines elements of Oja's rule with a generalized Gram-Schmidt orthogonalization, is naturally decentralized and hence parallelizable through message passing. We demonstrate the scalability of the algorithm with experiments on large image datasets and neural network activations. We discuss how this new view of PCA as a differentiable game can lead to further algorithmic developments and insights.

1. INTRODUCTION

The principal components of data are the vectors that align with the directions of maximum variance. These have two main purposes: a) as interpretable features and b) for data compression. Recent methods for principal component analysis (PCA) focus on the latter, explicitly stating objectives to find the k-dimensional subspace that captures maximum variance (e.g., Tang (2019)), leaving the problem of rotating within this subspace to, for example, a more efficient downstream singular value decomposition (SVD) step [1]. This point is subtle, yet critical. For example, any pair of two-dimensional, orthogonal vectors spans all of R^2 and therefore captures maximum variance of any two-dimensional dataset. However, for these vectors to be principal components, they must, in addition, align with the directions of maximum variance, which depend on the covariance of the data. By learning the optimal subspace rather than the principal components themselves, objectives focused on subspace error ignore the first purpose of PCA. In contrast, modern nonlinear representation learning techniques focus on learning features that are both disentangled (uncorrelated) and low dimensional (Chen et al., 2016; Mathieu et al., 2018; Locatello et al., 2019; Sarhan et al., 2019).

It is well known that the PCA solution for a d-dimensional dataset X ∈ R^{n×d} is given by the eigenvectors of X^T X or, equivalently, the right singular vectors of X. Impractically, the cost of computing the full SVD scales as O(min{nd^2, n^2 d}) time and O(nd) space (Shamir, 2015; Tang, 2019). For moderately sized data, randomized methods can be used (Halko et al., 2011). Beyond this, stochastic, or online, methods based on Oja's rule (Oja, 1982) or power iterations (Rutishauser, 1971) are common. Another option is to use streaming k-PCA algorithms such as Frequent Directions (FD) (Ghashami et al., 2016) or Oja's algorithm [2] (Allen-Zhu and Li, 2017) with storage complexity O(kd).
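The subspace-versus-components distinction above can be made concrete with a short NumPy sketch (illustrative, not from the paper): the eigenvectors of X^T X match the right singular vectors of X, while any rotation of the top-k components spans the same subspace and captures the same variance, yet its columns are no longer principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 5, 2
# Columns of X have distinct variances, so eigenvalues are distinct.
X = rng.normal(size=(n, d)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])

# Principal components: eigenvectors of X^T X (equivalently, the
# right singular vectors of X).
M = X.T @ X
eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
V = eigvecs[:, ::-1][:, :k]               # top-k eigenvectors

_, _, Vt = np.linalg.svd(X, full_matrices=False)
# Same components, up to the sign of each vector.
assert np.allclose(np.abs(V), np.abs(Vt[:k].T), atol=1e-6)

# Any rotation Q of V spans the same top-k subspace and captures the
# same variance, but its columns are not principal components.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W = V @ Q
var_V = np.trace(V.T @ M @ V)
var_W = np.trace(W.T @ M @ W)
assert np.isclose(var_V, var_W)           # identical captured variance
```

This is exactly why subspace-error objectives are blind to the rotation within the subspace: the trace is invariant under V → VQ for orthogonal Q.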
Sampling or sketching methods also scale well but, again, focus on the top-k subspace (Sarlos, 2006; Cohen et al., 2017; Feldman et al., 2020). In contrast to these approaches, we view each principal component (equivalently, eigenvector) as a player in a game whose objective is to maximize its own local utility function in controlled competition with the other vectors. The proposed utility gradients are interpretable as a combination of Oja's rule and a generalized Gram-Schmidt process. We make the following contributions:

• A novel formulation of PCA as finding the Nash equilibrium of a suitable game,
• A sequential, globally convergent algorithm for approximating the Nash on full-batch data,
• A decentralized algorithm with experiments demonstrating the approach as competitive with modern streaming k-PCA algorithms on synthetic and real data,
• A demonstration of the scalability of the approach: we compute the top-32 principal components of the matrix of RESNET-200 activations on the IMAGENET dataset (n ≈ 10^6, d ≈ 20 · 10^6).

Each of these contributions is important. Novel formulations often lead to deeper understanding of problems, thereby opening doors to improved techniques. In particular, k-player games are in general complex and hard to analyze, whereas PCA has been well studied; by combining the two fields we hope to develop useful analytical tools. Our specific formulation is important because it obviates the need for any centralized orthonormalization step and lends itself naturally to decentralization. Lastly, theory and experiments support the viability of this approach for continued research.

[1] After learning the top-k subspace V̂ ∈ R^{d×k}, the rotation can be recovered via an SVD of X V̂.
[2] FD approximates the top-k subspace; Oja's algorithm approximates the top-k eigenvectors.
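To convey the flavor of an update combining Oja's rule with a generalized Gram-Schmidt process, here is a hypothetical stochastic sketch. The function name `oja_deflation_step`, the learning rate, and the exact deflation form are our illustrative choices, not the paper's utility gradients.

```python
import numpy as np

def oja_deflation_step(V, x, lr=1e-3):
    """One stochastic update of k eigenvector estimates on a sample x.

    Hypothetical sketch: an Oja-style Hebbian step combined with a
    Gram-Schmidt-style deflation that pushes each vector V[:, i] away
    from the directions claimed by its parents V[:, j], j < i. This
    mirrors the combination described in the text but is not the
    paper's exact update.
    """
    d, k = V.shape
    for i in range(k):
        grad = (x @ V[:, i]) * x              # Oja/Hebbian direction
        for j in range(i):                    # deflate against parents
            grad -= (grad @ V[:, j]) * V[:, j]
        v = V[:, i] + lr * grad
        V[:, i] = v / np.linalg.norm(v)       # retract to the unit sphere
    return V

# Tiny demo on anisotropic Gaussian samples.
rng = np.random.default_rng(0)
scales = np.array([3.0, 1.0, 0.3])            # distinct variances
V = np.linalg.qr(rng.normal(size=(3, 2)))[0]  # random orthonormal init
for _ in range(2000):
    x = scales * rng.normal(size=3)
    V = oja_deflation_step(V, x)
```

Note that player i only needs to read its parents' current vectors, which is the kind of message-passing structure that makes such updates naturally decentralizable.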

2. PCA AS AN EIGEN-GAME

We adhere to the following notation. Vectors and matrices meant to approximate principal components (equivalently, eigenvectors) are designated with hats, v̂ and V̂ respectively, whereas the true principal components are v and V. Subscripts indicate which eigenvalue a vector is associated with; for example, v_i is the eigenvector with the ith largest eigenvalue. In this work, we assume each eigenvalue is distinct. By an abuse of notation, v_{j<i} refers to the set of vectors {v_j | j ∈ {1, ..., i-1}}, which are also referred to as the parents of v_i (v_i is their child). Sums over indices should be clear from context, e.g., Σ_{j<i} = Σ_{j=1}^{i-1}. The Euclidean inner product is written ⟨u, v⟩ = u^T v. We denote the unit sphere by S^{d-1} and the simplex by Δ^{d-1} in d-dimensional ambient space.

Outline of derivation. As argued in the introduction, the PCA problem is often misinterpreted as learning a projection of the data into a subspace that captures maximum variance (equivalently, maximizing the trace of a suitable matrix R introduced below). This is in contrast to the original goal of learning the principal components. We first develop the intuition for deriving our utility functions by (i) showing that maximizing the trace of R alone is not sufficient for recovering all principal components (equivalently, eigenvectors), and (ii) showing that minimizing the off-diagonal terms of R is a complementary objective to maximizing the trace and can recover all components. We then consider learning only the top-k components and construct utilities that are consistent with findings (i) and (ii), equal the true eigenvalues at the Nash of the game we construct, and result in a game that is amenable to analysis.

Derivation of player utilities. The eigenvalue problem for a symmetric matrix X^T X = M ∈ R^{d×d} is to find a matrix of d orthonormal column vectors V (which implies V is full-rank) such that M V = V Λ with Λ diagonal. Given a solution to this problem, the columns of V are known as eigenvectors and the corresponding entries of Λ are eigenvalues.
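As a quick numerical sanity check of the eigenvalue problem just stated (illustrative, not from the paper), `np.linalg.eigh` returns an orthonormal V and eigenvalues Λ satisfying M V = V Λ for symmetric M:

```python
import numpy as np

# Build a symmetric matrix M = X^T X as in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
M = X.T @ X                                # symmetric d x d matrix

lam, V = np.linalg.eigh(M)                 # eigenvalues in ascending order
Lam = np.diag(lam)

assert np.allclose(M @ V, V @ Lam)         # M V = V Lambda
assert np.allclose(V.T @ V, np.eye(4))     # orthonormal columns
```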
By left-multiplying by V^T and recalling that V^T V = V V^T = I by orthonormality (i.e., V is unitary), we can rewrite the equality as

    V^T M V = V^T V Λ = Λ.    (1)

Let V̂ denote a guess or estimate of the true eigenvectors V and define R(V̂) := V̂^T M V̂. The PCA problem is often posed as maximizing the trace of R (equivalently, minimizing reconstruction error):

    max_{V̂^T V̂ = I} Σ_i R_ii = Tr(R) = Tr(V̂^T M V̂) = Tr(V̂ V̂^T M) = Tr(M),    (2)

where the last step uses V̂ V̂^T = I for square V̂. Surprisingly, the objective in (2) is independent of V̂, so it cannot be used to recover all (i.e., k = d) the eigenvectors of M; this establishes (i). Alternatively, Equation (1) implies the eigenvalue problem can be phrased as ensuring all off-diagonal terms of R are zero, thereby ensuring R is diagonal; this establishes (ii):

    min_{V̂^T V̂ = I} Σ_{i≠j} R_ij^2.

It is worth examining the entries of R in detail. Diagonal entries R_ii = ⟨v̂_i, M v̂_i⟩ are recognized as Rayleigh quotients because ||v̂_i|| = 1 by the constraints. Off-diagonal entries R_ij = ⟨v̂_i, M v̂_j⟩ measure the alignment between v̂_i and v̂_j under a generalized inner product ⟨·,·⟩_M.
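Claims (i) and (ii) can be verified numerically. The sketch below (illustrative, not from the paper) checks that the trace objective in (2) is constant over orthonormal guesses when k = d, and that the off-diagonal entries of R vanish exactly when V̂ diagonalizes M:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
M = A.T @ A                                    # symmetric, plays the role of X^T X

# (i) For any full orthonormal guess (k = d), the trace objective is
# constant: Tr(Q^T M Q) = Tr(Q Q^T M) = Tr(M).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthonormal guess
R = Q.T @ M @ Q
assert np.isclose(np.trace(R), np.trace(M))

# (ii) Off-diagonal terms of R are generically nonzero for a random
# guess, and vanish when the columns are eigenvectors of M.
off = R - np.diag(np.diag(R))
assert np.linalg.norm(off) > 1e-6              # random guess: nonzero

_, V = np.linalg.eigh(M)
R_star = V.T @ M @ V
off_star = R_star - np.diag(np.diag(R_star))
assert np.linalg.norm(off_star) < 1e-8         # eigenvectors: diagonal R
```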

