APPROXIMATION ALGORITHMS FOR SPARSE PRINCIPAL COMPONENT ANALYSIS

Abstract

Principal component analysis (PCA) is a widely used dimension reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches have been proposed to obtain sparse principal direction loadings; these methods are collectively termed Sparse Principal Component Analysis (SPCA). In this paper, we present three provably accurate, polynomial-time approximation algorithms for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. The first algorithm is based on randomized matrix multiplication; the second is based on a novel deterministic thresholding scheme; and the third is based on a semidefinite programming relaxation of SPCA. All three algorithms come with provable accuracy guarantees and run in low-degree polynomial time. Our empirical evaluations confirm our theoretical findings.

1. INTRODUCTION

Principal Component Analysis (PCA) and the related Singular Value Decomposition (SVD) are fundamental data analysis and dimension reduction tools in a wide range of areas, including machine learning and multivariate statistics. They return a set of orthogonal vectors of decreasing importance that are often interpreted as fundamental latent factors underlying the observed data. Even though the vectors returned by PCA and SVD have strong optimality properties, they are notoriously difficult to interpret in terms of the underlying processes generating the data (Mahoney & Drineas, 2009), since they are linear combinations of all available data points or all available features. The concept of Sparse Principal Component Analysis (SPCA) was introduced in the seminal work of d'Aspremont et al. (2007), where sparsity constraints were enforced on the singular vectors in order to improve interpretability. A prominent example where sparsity improves interpretability is document analysis, where sparse principal components can be mapped to specific topics by inspecting the (few) keywords in their support (d'Aspremont et al., 2007; Mahoney & Drineas, 2009; Papailiopoulos et al., 2013).

Formally, given a positive semidefinite (PSD) matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, SPCA can be defined as follows:¹
$$Z^* = \max_{\mathbf{x} \in \mathbb{R}^n,\, \|\mathbf{x}\|_2 \le 1} \mathbf{x}^\top \mathbf{A} \mathbf{x}, \quad \text{subject to } \|\mathbf{x}\|_0 \le k. \qquad (1)$$
In the above formulation, $\mathbf{A}$ is a covariance matrix representing, for example, all pairwise feature or object similarities for an underlying data matrix. Therefore, SPCA can be applied to either the object or the feature space of the data matrix, while the parameter $k$ controls the sparsity of the resulting vector and is part of the input. Let $\mathbf{x}^*$ denote a vector that achieves the optimal value $Z^*$ in the above formulation. Intuitively, the optimization problem of eqn. (1) seeks a sparse, unit-norm vector $\mathbf{x}^*$ that maximizes the data variance.
It is well-known that solving the above optimization problem is NP-hard (Moghaddam et al., 2006a) and that its hardness is due to the sparsity constraint. Indeed, if the sparsity constraint were removed, the resulting optimization problem could be easily solved by computing the top left or right singular vector of $\mathbf{A}$, and its maximal value $Z^*$ would be equal to the top singular value of $\mathbf{A}$.

Notation. We use bold letters to denote matrices and vectors. For a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, we denote its $(i,j)$-th entry by $\mathbf{A}_{i,j}$; its $i$-th row by $\mathbf{A}_{i*}$; its $j$-th column by $\mathbf{A}_{*j}$; and its 2-norm by $\|\mathbf{A}\|_2$.

¹ Recall that the $p$-th power of the $\ell_p$ norm of a vector $\mathbf{x} \in \mathbb{R}^n$ is defined as $\|\mathbf{x}\|_p^p = \sum_{i=1}^n |x_i|^p$ for $0 < p < \infty$. For $p = 0$, $\|\mathbf{x}\|_0$ is a semi-norm denoting the number of non-zero entries of $\mathbf{x}$.
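To make the problem formulation concrete, the sketch below (not one of the algorithms proposed in this paper) solves SPCA exactly by brute force: it enumerates all size-$k$ supports and, for each, takes the top eigenvalue of the corresponding principal submatrix. This is exponential in $n$ and only feasible for tiny instances, but it illustrates both the sparsity constraint in eqn. (1) and the fact that the unconstrained optimum equals the top eigenvalue of $\mathbf{A}$. The function name and test matrix are our own illustrative choices.

```python
import itertools
import numpy as np

def spca_brute_force(A, k):
    """Exact SPCA by enumerating all size-k supports.

    Exponential in n: checks all (n choose k) principal submatrices,
    so this is for illustration only, not a practical algorithm.
    Returns the optimal value Z* and an optimal k-sparse unit vector.
    """
    n = A.shape[0]
    best_val, best_x = -np.inf, None
    for support in itertools.combinations(range(n), k):
        S = list(support)
        # Restricted to support S, the problem is ordinary PCA on A[S, S].
        vals, vecs = np.linalg.eigh(A[np.ix_(S, S)])
        if vals[-1] > best_val:
            best_val = vals[-1]
            x = np.zeros(n)
            x[S] = vecs[:, -1]  # embed the top eigenvector of A[S, S]
            best_x = x
    return best_val, best_x

# A small random PSD (covariance-like) matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 5))
A = B.T @ B

z_pca = np.linalg.eigvalsh(A)[-1]   # unconstrained optimum: top eigenvalue
z_spca, x = spca_brute_force(A, 2)  # optimum under the 2-sparsity constraint
```

Note that `z_spca <= z_pca` always holds, since the sparsity constraint can only shrink the feasible set, and setting `k = n` recovers the unconstrained PCA value exactly.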

