APPROXIMATION ALGORITHMS FOR SPARSE PRINCIPAL COMPONENT ANALYSIS

Abstract

Principal component analysis (PCA) is a widely used dimension reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed; these are collectively termed Sparse Principal Component Analysis (SPCA). In this paper, we present three provably accurate, polynomial-time approximation algorithms for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. The first algorithm is based on randomized matrix multiplication; the second is based on a novel deterministic thresholding scheme; and the third is based on a semidefinite programming relaxation of SPCA. All algorithms come with provable guarantees and run in low-degree polynomial time. Our empirical evaluations confirm our theoretical findings.

1. INTRODUCTION

Principal Component Analysis (PCA) and the related Singular Value Decomposition (SVD) are fundamental data analysis and dimension reduction tools in a wide range of areas including machine learning, multivariate statistics, and many others. They return a set of orthogonal vectors of decreasing importance that are often interpreted as fundamental latent factors underlying the observed data. Even though the vectors returned by PCA and SVD have strong optimality properties, they are notoriously difficult to interpret in terms of the underlying processes generating the data (Mahoney & Drineas, 2009), since they are linear combinations of all available data points or all available features. The concept of Sparse Principal Component Analysis (SPCA) was introduced in the seminal work of d'Aspremont et al. (2007), where sparsity constraints were enforced on the singular vectors in order to improve interpretability. A prominent example where sparsity improves interpretability is document analysis, where sparse principal components can be mapped to specific topics by inspecting the (few) keywords in their support (d'Aspremont et al., 2007; Mahoney & Drineas, 2009; Papailiopoulos et al., 2013). Formally, given a positive semidefinite (PSD) matrix $A \in \mathbb{R}^{n \times n}$, SPCA can be defined as follows:¹

$$Z^* = \max_{x \in \mathbb{R}^n,\ \|x\|_2 \le 1} x^\top A x, \quad \text{subject to } \|x\|_0 \le k. \qquad (1)$$

In the above formulation, $A$ is a covariance matrix representing, for example, all pairwise feature or object similarities for an underlying data matrix. Therefore, SPCA can be applied to either the object or the feature space of the data matrix, while the parameter $k$ controls the sparsity of the resulting vector and is part of the input. Let $x^*$ denote a vector that achieves the optimal value $Z^*$ in the above formulation. Intuitively, the optimization problem of eqn. (1) seeks a sparse, unit-norm vector $x^*$ that maximizes the data variance.
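For small instances, the SPCA objective can be evaluated exactly by enumeration, which makes the definition concrete: for each support $S$ of size $k$, the best unit vector supported on $S$ is the top eigenvector of the $k \times k$ principal submatrix $A[S,S]$. The following sketch (illustrative only, with a hypothetical random covariance matrix; it is exponential in $n$ and not one of the paper's algorithms) computes $Z^*$ this way:

```python
import itertools
import numpy as np

def spca_brute_force(A, k):
    """Exhaustively solve eqn. (1) for tiny n: for every support S with
    |S| = k, the best unit vector supported on S is the top eigenvector
    of the principal submatrix A[S, S], achieving its top eigenvalue."""
    n = A.shape[0]
    best_val, best_x = -np.inf, None
    for S in itertools.combinations(range(n), k):
        idx = np.array(S)
        sub = A[np.ix_(idx, idx)]
        vals, vecs = np.linalg.eigh(sub)   # symmetric eigendecomposition
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(n)
            best_x[idx] = vecs[:, -1]      # pad with zeros to a vector in R^n
    return best_val, best_x

# Tiny PSD covariance matrix (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 5))
A = B.T @ B
Z_star, x_star = spca_brute_force(A, k=2)
```

Since there are $\binom{n}{k}$ supports, this enumeration is exactly the combinatorial structure that makes the problem NP-hard in general.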
It is well-known that solving the above optimization problem is NP-hard (Moghaddam et al., 2006a) and that its hardness is due to the sparsity constraint. Indeed, if the sparsity constraint were removed, the resulting optimization problem could be easily solved by computing the top left or right singular vector of $A$, and its maximal value $Z^*$ would equal the top singular value of $A$.

Notation. We use bold letters to denote matrices and vectors. For a matrix $A \in \mathbb{R}^{n \times n}$, we denote its $(i,j)$-th entry by $A_{i,j}$; its $i$-th row by $A_{i*}$ and its $j$-th column by $A_{*j}$; its 2-norm by $\|A\|_2$; and its Frobenius norm by $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$. We use the notation $A \succeq 0$ to denote that the matrix $A$ is symmetric positive semidefinite (PSD) and $\mathrm{Tr}(A) = \sum_i A_{i,i}$ to denote its trace, which is also equal to the sum of its singular values. Given a PSD matrix $A \in \mathbb{R}^{n \times n}$, its Singular Value Decomposition is given by $A = U \Sigma U^\top$, where $U$ is the matrix of left/right singular vectors and $\Sigma$ is the diagonal matrix of singular values.

¹ Recall that the $p$-th power of the $\ell_p$ norm of a vector $x \in \mathbb{R}^n$ is defined as $\|x\|_p^p = \sum_{i=1}^n |x_i|^p$ for $0 < p < \infty$. For $p = 0$, $\|x\|_0$ is a semi-norm denoting the number of non-zero entries of $x$.
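The two facts used above, that the unconstrained problem is solved by the top eigenvector and that the trace of a PSD matrix equals the sum of its singular values, can be checked numerically on a small hypothetical covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((20, 6))
A = B.T @ B                      # PSD covariance-like matrix

# Without the sparsity constraint, max_{||x||_2 <= 1} x^T A x is attained
# at the top eigenvector of A, with value equal to the top eigenvalue.
vals, vecs = np.linalg.eigh(A)   # eigenvalues in ascending order
top_val, top_vec = vals[-1], vecs[:, -1]
assert np.isclose(top_vec @ A @ top_vec, top_val)

# For a PSD matrix, eigenvalues coincide with singular values, so the
# trace equals their sum.
assert np.isclose(np.trace(A), vals.sum())
```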

1.1. OUR CONTRIBUTIONS

We present three algorithms for SPCA and associated quality-of-approximation results (Theorems 2.2, 3.1, and 4.1). All three algorithms are simple, intuitive, and run in time $O(n^{3.5})$ or less. They return a vector that is provably sparse and, when applied to the input covariance matrix $A$, provably captures a fraction of the optimal solution $Z^*$. We note that in all three algorithms, the output vector has a sparsity that depends on $k$ (the target sparsity of the original SPCA problem of eqn. (1)) and $\epsilon$ (an accuracy parameter between zero and one).

The first algorithm is based on randomized, approximate matrix multiplication: it randomly (but non-uniformly) selects a subset of $O(k/\epsilon^2)$ columns of $A^{1/2}$ (the square root of the PSD matrix $A$) and computes its top right singular vector. The output of this algorithm is precisely this singular vector, padded with zeros to become a vector in $\mathbb{R}^n$. It turns out that this simple algorithm, which, surprisingly, has not been analyzed in prior work, returns an $O(k/\epsilon^2)$-sparse vector $y \in \mathbb{R}^n$ that satisfies (with constant probability that can be amplified as desired; see Section 2 for details)

$$y^\top A y \ \ge\ \frac{1}{2} Z^* - \epsilon \sqrt{\frac{Z^* \cdot \mathrm{Tr}(A)}{k}}.$$

Notice that the above bound depends on both $Z^*$ and its square root and therefore is not a relative error bound. The second term scales as a function of the trace of $A$ divided by $k$, which depends on the properties of the matrix $A$ and the target sparsity.

The second algorithm is a deterministic thresholding scheme. It computes a small number of the top singular vectors of the matrix $A$ and then applies a deterministic thresholding scheme to those singular vectors to (eventually) construct a sparse vector $z \in \mathbb{R}^n$ that satisfies

$$z^\top A z \ \ge\ \frac{1}{2} Z^* - \frac{3\epsilon}{2}\, \mathrm{Tr}(A).$$

Our analysis provides unconditional guarantees for the accuracy of the solution of this simple thresholding scheme. To the best of our knowledge, no such analyses have appeared in prior work (see Section 1.2 for details).
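The sampling step of the first algorithm can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: here we sample columns of $A^{1/2}$ with probabilities proportional to their squared norms, while the precise sampling distribution, sample size, and guarantees appear in Section 2. The random data matrix is hypothetical.

```python
import numpy as np

def sample_spca(A, s, rng):
    """Sketch of the randomized algorithm: sample s columns of A^{1/2}
    (non-uniformly, here proportionally to squared column norms), take the
    top right singular vector of the sampled submatrix, and pad it with
    zeros to obtain an s-sparse unit vector in R^n."""
    vals, U = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)              # guard tiny negative eigenvalues
    half = U @ np.diag(np.sqrt(vals)) @ U.T      # A^{1/2}
    probs = (half ** 2).sum(axis=0)
    probs /= probs.sum()
    S = rng.choice(A.shape[0], size=s, replace=False, p=probs)
    _, _, Vt = np.linalg.svd(half[:, S], full_matrices=False)
    y = np.zeros(A.shape[0])
    y[S] = Vt[0]                                 # top right singular vector
    return y

rng = np.random.default_rng(0)
B = rng.standard_normal((12, 8))
A = B.T @ B                                      # PSD input
y = sample_spca(A, s=3, rng=rng)
```

Note that $y^\top A y = \|A^{1/2} y\|_2^2$ equals the squared top singular value of the sampled column submatrix, which is what the analysis of Section 2 relates to $Z^*$.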
The error bound of the second algorithm is weaker than the one provided by the first algorithm, but the second algorithm is deterministic and does not need to compute the square root (i.e., all singular vectors and singular values) of the matrix $A$.

Our third algorithm provides novel bounds for the following standard convex relaxation of the problem of eqn. (1):

$$\max_{Z \in \mathbb{R}^{n \times n},\ Z \succeq 0} \mathrm{Tr}(AZ) \quad \text{s.t.} \quad \mathrm{Tr}(Z) \le 1 \ \text{ and } \ \sum_{i,j} |Z_{i,j}| \le k. \qquad (2)$$

It is well-known that the optimal solution to eqn. (2) is at least the optimal solution to eqn. (1). We present a novel, two-step rounding scheme that converts the optimal solution matrix $Z \in \mathbb{R}^{n \times n}$ to a vector $z \in \mathbb{R}^n$ that has expected sparsity² $\tilde{O}(k^2/\epsilon^2)$ and satisfies

$$z^\top A z \ \ge\ \gamma_Z (1 - \epsilon) \cdot Z^* - \epsilon.$$

Here, $\gamma_Z$ is a constant that depends precisely on the top singular value of $Z$, the condition number of $Z$, and the extent to which the SDP relaxation of eqn. (2) is able to capture the original problem (see Theorem 4.1 and the following discussion for details). To the best of our knowledge, this is the first analysis of a rounding scheme for the convex relaxation of eqn. (2) that does not assume a specific model for the covariance matrix $A$.

Applications to Sparse Kernel PCA. Our algorithms have immediate applications to sparse kernel PCA (SKPCA), where the input matrix $A \in \mathbb{R}^{n \times n}$ is instead implicitly given as a kernel matrix whose $(i,j)$-th entry is the value $k(i,j) := \langle \phi(X_{i*}), \phi(X_{j*}) \rangle$ for some kernel function $\phi$ that implicitly maps an observation vector into some high-dimensional feature space. Although A is not explicit,
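The claim that the optimum of the relaxation is at least $Z^*$ follows because every feasible point of eqn. (1) lifts to a feasible point of eqn. (2): for a unit-norm, $k$-sparse $x$, the matrix $Z = xx^\top$ is PSD, has $\mathrm{Tr}(Z) = \|x\|_2^2 \le 1$, satisfies $\sum_{i,j} |Z_{i,j}| = \|x\|_1^2 \le k\|x\|_2^2 \le k$ by Cauchy-Schwarz, and achieves $\mathrm{Tr}(AZ) = x^\top A x$. A quick numerical check on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 3
B = rng.standard_normal((30, n))
A = B.T @ B                            # PSD covariance-like matrix

# Build a random unit-norm, k-sparse x and lift it to Z = x x^T.
x = np.zeros(n)
idx = rng.choice(n, size=k, replace=False)
x[idx] = rng.standard_normal(k)
x /= np.linalg.norm(x)
Z = np.outer(x, x)

# Z is feasible for the relaxation of eqn. (2) ...
assert np.isclose(np.trace(Z), 1.0)            # trace constraint
assert np.abs(Z).sum() <= k + 1e-9             # ell_1 constraint
# ... and its objective value matches the SPCA objective at x.
assert np.isclose(np.trace(A @ Z), x @ A @ x)
```

Hence the relaxation's optimum upper-bounds $Z^*$, which is what makes the rounding analysis of Section 4 meaningful.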



² For simplicity of presentation, and following the lines of (Fountoulakis et al., 2017), we assume that the rows and columns of the matrix A have unit norm; this assumption was not necessary for the previous two algorithms and can be removed as in (Fountoulakis et al., 2017). We are also hiding a poly-logarithmic factor for simplicity, hence the Õ(·) notation. See Theorem 4.1 for a detailed statement.

