APPROXIMATION ALGORITHMS FOR SPARSE PRINCIPAL COMPONENT ANALYSIS

Abstract

Principal component analysis (PCA) is a widely used dimension reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present three provably accurate, polynomial time, approximation algorithms for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. The first algorithm is based on randomized matrix multiplication; the second algorithm is based on a novel deterministic thresholding scheme; and the third algorithm is based on a semidefinite programming relaxation of SPCA. All algorithms come with provable guarantees and run in low-degree polynomial time. Our empirical evaluations confirm our theoretical findings.

1. INTRODUCTION

Principal Component Analysis (PCA) and the related Singular Value Decomposition (SVD) are fundamental data analysis and dimension reduction tools in a wide range of areas, including machine learning, multivariate statistics, and many others. They return a set of orthogonal vectors of decreasing importance that are often interpreted as fundamental latent factors underlying the observed data. Even though the vectors returned by PCA and SVD have strong optimality properties, they are notoriously difficult to interpret in terms of the underlying processes generating the data (Mahoney & Drineas, 2009), since they are linear combinations of all available data points or all available features. The concept of Sparse Principal Component Analysis (SPCA) was introduced in the seminal work of (d'Aspremont et al., 2007), where sparsity constraints were enforced on the singular vectors in order to improve interpretability. A prominent example where sparsity improves interpretability is document analysis, where sparse principal components can be mapped to specific topics by inspecting the (few) keywords in their support (d'Aspremont et al., 2007; Mahoney & Drineas, 2009; Papailiopoulos et al., 2013). Formally, given a positive semidefinite (PSD) matrix A ∈ R^{n×n}, SPCA can be defined as follows:

Z* = max_{x ∈ R^n, ‖x‖₂ ≤ 1} xᵀAx, subject to ‖x‖₀ ≤ k.   (1)

In the above formulation, A is a covariance matrix representing, for example, all pairwise feature or object similarities for an underlying data matrix. Therefore, SPCA can be applied to either the object or the feature space of the data matrix, while the parameter k controls the sparsity of the resulting vector and is part of the input. Let x* denote a vector that achieves the optimal value Z* in the above formulation. Intuitively, the optimization problem of eqn. (1) seeks a sparse, unit norm vector x* that maximizes the data variance.
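To make the optimization problem of eqn. (1) concrete, here is a minimal brute-force sketch (the function name is ours, not from the paper): for a fixed support of size k, the best unit vector is the top eigenvector of the corresponding k×k principal submatrix, so exhaustive search over supports solves tiny instances exactly. The search is exponential in general, consistent with the NP-hardness of the problem.

```python
import itertools
import numpy as np

def spca_brute_force(A, k):
    """Exhaustively solve eqn. (1): over every support of size k, the best
    unit vector is the top eigenvector of the k x k principal submatrix,
    and the objective value is its top eigenvalue. Only feasible for tiny n."""
    n = A.shape[0]
    best_val, best_x = -np.inf, None
    for support in itertools.combinations(range(n), k):
        sub = A[np.ix_(support, support)]
        w, V = np.linalg.eigh(sub)            # eigenvalues in ascending order
        if w[-1] > best_val:
            best_val = w[-1]
            best_x = np.zeros(n)
            best_x[list(support)] = V[:, -1]  # pad eigenvector with zeros
    return best_val, best_x
```

For example, on A = I + 10·vvᵀ with v supported on two coordinates, the optimal 2-sparse solution recovers the support of v.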
It is well-known that solving the above optimization problem is NP-hard (Moghaddam et al., 2006a) and that its hardness is due to the sparsity constraint. Indeed, if the sparsity constraint were removed, the resulting optimization problem could be easily solved by computing the top singular vector of A, and its maximal value Z* would be equal to the top singular value of A.

Notation. We use bold letters to denote matrices and vectors. For a matrix A ∈ R^{n×n}, we denote its (i, j)-th entry by A_{i,j}; its i-th row by A_{i*} and its j-th column by A_{*j}; its 2-norm by ‖A‖₂ = max_{‖x‖₂=1} ‖Ax‖₂; and its (squared) Frobenius norm by ‖A‖²_F = Σ_{i,j} A²_{i,j}. Recall that the p-th power of the ℓ_p norm of a vector x ∈ R^n is defined as ‖x‖ᵖ_p = Σ_{i=1}^n |x_i|ᵖ for 0 < p < ∞; for p = 0, ‖x‖₀ is a semi-norm counting the number of non-zero entries of x. We use the notation A ⪰ 0 to denote that the matrix A is symmetric positive semidefinite (PSD) and Tr(A) = Σ_i A_{i,i} to denote its trace, which is also equal to the sum of its singular values. Given a PSD matrix A ∈ R^{n×n}, its Singular Value Decomposition is given by A = UΣUᵀ, where U is the matrix of (coinciding left and right) singular vectors and Σ is the diagonal matrix of singular values.

1.1. OUR CONTRIBUTIONS

We present three algorithms for SPCA and associated quality-of-approximation results (Theorems 2.2, 3.1, and 4.1). All three algorithms are simple, intuitive, and run in O(n^3.5) time or less. They return a vector that is provably sparse and, when applied to the input covariance matrix A, provably captures a fraction of the optimal solution Z*. We note that in all three algorithms, the output vector has a sparsity that depends on k (the target sparsity of the original SPCA problem of eqn. (1)) and ε (an accuracy parameter between zero and one). The first algorithm is based on randomized, approximate matrix multiplication: it randomly (but non-uniformly) selects a subset of O(k/ε²) columns of A^{1/2} (the square root of the PSD matrix A) and computes the top right singular vector of the sampled matrix. The output of this algorithm is precisely this singular vector, padded with zeros to become a vector in R^n. It turns out that this simple algorithm, which, surprisingly, has not been analyzed in prior work, returns an O(k/ε²)-sparse vector y ∈ R^n that satisfies (with constant probability that can be amplified as desired; see Section 2 for details):

yᵀAy ≥ (1/2)·Z* − ε·√(Z*·Tr(A)/k).

Notice that the above bound depends on both Z* and its square root and is therefore not a relative error bound. The second term scales as the square root of the trace of A divided by k, and thus depends on the properties of the matrix A and the target sparsity. The second algorithm is a deterministic thresholding scheme. It computes a small number of the top singular vectors of the matrix A and then applies a deterministic thresholding scheme on those singular vectors to (eventually) construct a sparse vector z ∈ R^n that satisfies

zᵀAz ≥ (1/2)·Z* − (3/2)·ε·Tr(A).

Our analysis provides unconditional guarantees for the accuracy of the solution of this simple thresholding scheme. To the best of our knowledge, no such analyses have appeared in prior work (see Section 1.2 for details).
The error bound of the second algorithm is weaker than the one provided by the first algorithm, but the second algorithm is deterministic and does not need to compute the square root (i.e., all singular vectors and singular values) of the matrix A. Our third algorithm provides novel bounds for the following standard convex relaxation of the problem of eqn. (1):

max_{Z ∈ R^{n×n}, Z ⪰ 0} Tr(AZ), subject to Tr(Z) ≤ 1 and Σ_{i,j} |Z_{i,j}| ≤ k.   (2)

It is well-known that the optimal value of eqn. (2) is at least the optimal value of eqn. (1). We present a novel, two-step rounding scheme that converts the optimal solution matrix Z ∈ R^{n×n} into a vector z ∈ R^n that has expected sparsity Õ(k²/ε²) and satisfies

zᵀAz ≥ γ_Z·(1 − ε)·Z* − ε.

Here, γ_Z is a constant that precisely depends on the top singular value of Z, the condition number of Z, and the extent to which the SDP relaxation of eqn. (2) is able to capture the original problem (see Theorem 4.1 and the following discussion for details). To the best of our knowledge, this is the first analysis of a rounding scheme for the convex relaxation of eqn. (2) that does not assume a specific model for the covariance matrix A.

Applications to Sparse Kernel PCA. Our algorithms have immediate applications to sparse kernel PCA (SKPCA), where the input matrix A ∈ R^{n×n} is instead given implicitly as a kernel matrix whose (i, j)-th entry is k(i, j) := ⟨φ(X_{i*}), φ(X_{j*})⟩ for some kernel function φ that implicitly maps an observation vector into a high-dimensional feature space. Although A is not explicit, we can query all O(n²) entries of A in O(n²) time, assuming an oracle that computes the kernel function k. We can then apply our SPCA algorithms and achieve polynomial runtime with the same approximation guarantees.

1.2. PRIOR WORK

SPCA was formally introduced by (d'Aspremont et al., 2007); however, previously studied PCA approaches based on rotating (Jolliffe, 1995) or thresholding (Cadima & Jolliffe, 1995) the top singular vector of the input matrix seemed to work well, at least in practice, under sparsity constraints. Following (d'Aspremont et al., 2007), there has been an abundance of interest in SPCA. (Jolliffe et al., 2003) considered LASSO (SCoTLASS) on an ℓ1 relaxation of the problem, while (Zou & Hastie, 2005) proposed an elastic-net-based regression formulation; theoretical guarantees for SDP-based approaches were later obtained by (Amini & Wainwright, 2009). Notably, (Amini & Wainwright, 2009) achieved provable theoretical guarantees for the SDP and thresholding approach of (d'Aspremont et al., 2007) in a specific, high-dimensional spiked covariance model, in which a base matrix is perturbed by adding a sparse maximal eigenvector. In other words, the input matrix is the identity matrix plus a "spike", i.e., a sparse rank-one matrix. Despite the variety of heuristic-based sparse PCA approaches, very few theoretical guarantees have been provided for SPCA; this is partially explained by a line of hardness-of-approximation results. The sparse PCA problem is well-known to be NP-hard (Moghaddam et al., 2006a). (Magdon-Ismail, 2017) shows that if the input matrix is not PSD, then even the sign of the optimal value cannot be determined in polynomial time unless P = NP, ruling out any multiplicative approximation algorithm. In the case where the input matrix is PSD, (Chan et al., 2016) shows that it is NP-hard to approximate the optimal value up to multiplicative (1 + ε) error, ruling out any polynomial-time approximation scheme (PTAS). Moreover, they show Small-Set Expansion hardness for any polynomial-time constant-factor approximation algorithm, as well as that the standard SDP relaxation might have an exponential gap. We conclude by summarizing prior work that offers provable guarantees (beyond the work of (Amini & Wainwright, 2009)), typically under some assumptions on the input matrix.
(d'Aspremont et al., 2014) showed that the SDP relaxation can be used to derive provable bounds when the covariance input matrix is formed by a number of data points sampled from Gaussian models with a single sparse singular vector. (Papailiopoulos et al., 2013) presented a combinatorial algorithm that analyzed a specific set of vectors in a low-dimensional eigenspace of the input matrix and presented relative error guarantees for the optimal objective, under the assumption that the input covariance matrix has a decaying spectrum. (Asteris et al., 2011) gave a polynomial-time algorithm that solves sparse PCA exactly for input matrices of constant rank. (Chan et al., 2016) showed that sparse PCA can be approximated in polynomial time within a factor of n^{-1/3} and also highlighted an additive PTAS of (Asteris et al., 2015) based on the idea of finding multiple disjoint components and solving bipartite maximum weight matching problems. This PTAS needs time n^{poly(1/ε)}, whereas all of our algorithms have running times that are a low-degree polynomial in n.

2. SPARSE PCA VIA RANDOMIZED MATRIX MULTIPLICATION

Our first algorithm for SPCA leverages primitives and ideas from randomized matrix multiplication (Drineas & Kannan, 2001; Drineas et al., 2006; Drineas & Mahoney, 2016; 2018; Woodruff, 2014). Let P ∈ R^{m×n} and Q ∈ R^{n×p} and recall that their product PQ equals PQ = Σ_{i=1}^n P_{*i} Q_{i*}, where P_{*i} denotes the i-th column of P and Q_{i*} the i-th row of Q. A well-known approach to approximate the product PQ is to sample a subset of columns of P (we will do this without replacement) and the corresponding rows of Q (Drineas & Mahoney, 2018). This sampling is captured by the following diagonal sampling matrix.

Algorithm 1 Construct sampling matrix
Input: probabilities p_i, i = 1 … n, and integer s ≤ n.
Output: diagonal sampling matrix S ∈ R^{n×n}.
1: for i = 1 … n do
2:   p̂_i ← min{s·p_i, 1};
3:   S_ii ← 1/√p̂_i with probability p̂_i, and S_ii ← 0 otherwise;

The following lemma bounds the error when Algorithm 1 is used to approximate matrix multiplication.

Lemma 2.1 Given matrices P ∈ R^{m×n} and Q ∈ R^{n×p}, let S ∈ R^{n×n} be constructed using Algorithm 1 with p_i = ‖P_{*i}‖₂² / ‖P‖²_F for i = 1 … n. Then,

E[‖PS²Q − PQ‖²_F] ≤ (1/s)·‖P‖²_F·‖Q‖²_F.

Our SPCA algorithm uses the above primitive to approximate the product of the square root of the input matrix A and its top right singular vector v. Thus, the proposed SPCA algorithm sparsifies the top right singular vector v of A without losing too much of the variance that is captured by v. Interestingly, this conceptually simple algorithm has not been formally analyzed in prior work. Algorithm 2 details our approach.

Algorithm 2 SPCA via randomized matrix multiplication
Input: A ∈ R^{n×n}, sparsity parameter k, accuracy parameter ε ∈ (0, 1).
Output: y ∈ R^n satisfying E[‖y‖₂²] ≤ 1 and E[‖y‖₀] ≤ 4k/ε².
1: X ← A^{1/2};
2: Use Algorithm 1 to construct S ∈ R^{n×n} with p_i = ‖X_{*i}‖₂² / ‖X‖²_F and s = 4k/ε²;
3: Let v ∈ R^n be the top right singular vector of XS;
4: y ← Sv;

Theorem 2.2 Let k be the sparsity parameter and ε ∈ (0, 1] be the accuracy parameter. Let S ∈ R^{n×n} be the sampling matrix of Lemma 2.1 with s = 4k/ε².
Then, Algorithm 2 returns a vector y with expected sparsity at most s (i.e., E[‖y‖₀] ≤ s) and expected squared two-norm at most one (i.e., E[‖y‖₂²] ≤ 1) such that, with probability at least 1/4,

yᵀAy ≥ (1/2)·Z* − ε·√(Z*·Tr(A)/k).   (4)

See Appendix A.1 for a proof of the above theorem. We note that the success probability of Algorithm 2 can be trivially amplified by repeating the algorithm t times and keeping the vector y that maximizes yᵀAy; the failure probability of the overall approach then diminishes exponentially fast as a function of t, to at most (3/4)^t. Finally, the running time of Algorithm 2 is dominated by the computation of a square root of the matrix A in the first step, which takes O(n³) time via the computation of the SVD of A.
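The two steps above can be sketched in a few lines of numpy; this is a minimal illustration under our reading of Algorithms 1 and 2 (helper names are ours), with the square root of A computed via its eigendecomposition:

```python
import numpy as np

def sampling_matrix(p, s, rng):
    """Algorithm 1 (sketch): diagonal S with S_ii = 1/sqrt(p_hat_i) with
    probability p_hat_i = min(s * p_i, 1), and S_ii = 0 otherwise."""
    p_hat = np.minimum(s * np.asarray(p, dtype=float), 1.0)
    keep = rng.random(len(p_hat)) < p_hat
    d = np.zeros(len(p_hat))
    d[keep] = 1.0 / np.sqrt(p_hat[keep])
    return np.diag(d)

def spca_randomized(A, k, eps, rng):
    """Algorithm 2 (sketch): sample ~4k/eps^2 columns of A^{1/2} with
    probabilities proportional to squared column norms, take the top right
    singular vector of the sampled matrix, and map it back via S."""
    # Square root of the PSD matrix A via its eigendecomposition.
    w, U = np.linalg.eigh(A)
    X = U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T
    p = np.sum(X**2, axis=0) / np.sum(X**2)   # column-norm probabilities
    s = int(np.ceil(4 * k / eps**2))
    S = sampling_matrix(p, s, rng)
    _, _, Vt = np.linalg.svd(X @ S)
    v = Vt[0]                                  # top right singular vector of XS
    return S @ v                               # sparse output y
```

The output y is non-zero only on the sampled coordinates, matching the expected sparsity bound of Theorem 2.2.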

3. SPCA VIA THRESHOLDING

Our second algorithm is based on a thresholding scheme using the top ℓ right singular vectors of the PSD matrix A. Given A and an accuracy parameter ε, our approach first computes Σ_ℓ ∈ R^{ℓ×ℓ} (the diagonal matrix of the top ℓ singular values of A) and U_ℓ ∈ R^{n×ℓ} (the matrix of the top ℓ right singular vectors of A), for ℓ = ⌈1/ε⌉. Then, it deterministically selects a subset of O(k/ε²) columns of Σ_ℓ^{1/2} U_ℓᵀ using a simple thresholding scheme based on the norms of the columns of Σ_ℓ^{1/2} U_ℓᵀ. (Recall that k is the sparsity parameter of the SPCA problem.) In the last step, it returns the top right singular vector of the matrix consisting of the chosen columns of Σ_ℓ^{1/2} U_ℓᵀ. Notice that this right singular vector is an O(k/ε²)-dimensional vector, which is finally expanded to a vector in R^n by appropriate padding with zeros. This sparse vector is our approximate solution to the SPCA problem of eqn. (1). This simple algorithm is somewhat reminiscent of prior thresholding approaches for SPCA. However, to the best of our knowledge, no provable a priori bounds were known for such algorithms without strong assumptions on the input matrix. This might be due to the fact that prior approaches focused on thresholding only the top right singular vector of A, whereas our approach thresholds the top ℓ = ⌈1/ε⌉ right singular vectors of A. This slight relaxation allows us to present provable bounds for the proposed algorithm. In more detail, let the SVD of A be A = UΣUᵀ. Let Σ_ℓ ∈ R^{ℓ×ℓ} be the diagonal matrix of the top ℓ singular values and let U_ℓ ∈ R^{n×ℓ} be the matrix of the top ℓ right (or, equivalently, left) singular vectors. Let R = {i₁, …, i_{|R|}} be the set of indices of rows of U_ℓ that have squared norms at least ε/k and let R̄ be its complement. Here |R| denotes the cardinality of the set R and R ∪ R̄ = {1, …, n}. Let R ∈ R^{n×|R|} be a sampling matrix that selects the columns of U_ℓᵀ whose indices are in the set R. Given this notation, we are now ready to state Algorithm 3.

Algorithm 3 SPCA via thresholding

Input: A ∈ R^{n×n}, sparsity k, error parameter ε > 0.
Output: z ∈ R^n such that ‖z‖₂ = 1 and ‖z‖₀ ≤ k/ε².
1: ℓ ← ⌈1/ε⌉;
2: Compute U_ℓ ∈ R^{n×ℓ} (the top ℓ singular vectors of A) and Σ_ℓ ∈ R^{ℓ×ℓ} (the top ℓ singular values of A);
3: Let R be the set of indices of rows of U_ℓ with squared norm at least ε/k, and let R ∈ R^{n×|R|} be the associated sampling matrix;
4: y ← argmax_{‖x‖₂=1} ‖Σ_ℓ^{1/2} U_ℓᵀ R x‖₂², with y ∈ R^{|R|};
5: return z = Ry ∈ R^n;

Notice that Ry satisfies ‖Ry‖₂ = ‖y‖₂ = 1 (since R has orthonormal columns) and ‖Ry‖₀ ≤ |R|. Since R is the set of rows of U_ℓ with squared norms at least ε/k and ‖U_ℓ‖²_F = ℓ = ⌈1/ε⌉, it follows that |R| ≤ k/ε². Thus, the vector returned by Algorithm 3 has sparsity at most k/ε² and unit norm.

Theorem 3.1 Let k be the sparsity parameter and ε ∈ (0, 1] be the accuracy parameter. Then, the vector z ∈ R^n (the output of Algorithm 3) has sparsity at most k/ε², unit norm, and satisfies

zᵀAz ≥ (1/2)·Z* − (3/2)·ε·Tr(A).

We defer the proof of Theorem 3.1 to Appendix A.2. The running time of Algorithm 3 is dominated by the computation of the top ℓ singular vectors and singular values of the matrix A. In practice, any iterative method, such as subspace iteration using a random initial subspace or a Krylov subspace method, can be used towards this end; such methods run in time roughly proportional to nnz(A). However, our current analysis does not account for the inevitable approximation error incurred by such methods. One could always use the SVD of the full matrix A (in O(n³) time) to compute the top ℓ singular vectors and singular values of A. Finally, we highlight that, as an intermediate step in the proof of Theorem 3.1, we need to prove the following lemma (see Appendix A.2 for its proof):

Lemma 3.2 Let A ∈ R^{n×n} be a PSD matrix; let Σ ∈ R^{n×n} (respectively, Σ_ℓ ∈ R^{ℓ×ℓ}) be the diagonal matrix of all (respectively, top ℓ) singular values; and let U ∈ R^{n×n} (respectively, U_ℓ ∈ R^{n×ℓ}) be the matrix of all (respectively, top ℓ) singular vectors. Then, for ℓ = ⌈1/ε⌉ and all unit vectors x ∈ R^n,

‖Σ_ℓ^{1/2} U_ℓᵀ x‖₂² ≥ ‖Σ^{1/2} Uᵀ x‖₂² − ε·Tr(A).

Footnote 3: Each column of R has a single non-zero entry (set to one), corresponding to one of the |R| selected columns. Formally, R_{i_t,t} = 1 for t = 1, …, |R|; all other entries of R are set to zero.
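The thresholding scheme of Algorithm 3 can be sketched directly in numpy; this is a minimal illustration under our reading of the algorithm (the function name is ours), using a full eigendecomposition rather than an iterative top-ℓ solver:

```python
import numpy as np

def spca_threshold(A, k, eps):
    """Algorithm 3 (sketch): keep coordinates whose rows of U_l have squared
    norm >= eps/k, take the top right singular vector of the retained columns
    of Sigma_l^{1/2} U_l^T, and pad it back to R^n with zeros."""
    l = int(np.ceil(1.0 / eps))
    w, U = np.linalg.eigh(A)                   # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:l]              # indices of the top-l eigenpairs
    U_l = U[:, idx]
    sig_l = np.clip(w[idx], 0.0, None)
    R = np.where(np.sum(U_l**2, axis=1) >= eps / k)[0]   # thresholded rows
    M = np.sqrt(sig_l)[:, None] * U_l[R].T     # Sigma_l^{1/2} U_l^T, columns in R
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    z = np.zeros(A.shape[0])
    z[R] = Vt[0]                               # top right singular vector, padded
    return z
```

On a spiked matrix A = I + 10·vvᵀ with sparse v, the returned z is unit norm, supported on the thresholded coordinates, and captures most of the spike's variance.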
The above simple lemma is very much at the heart of our proof of Theorem 3.1 and, unlike prior work, allows us to provide provably accurate bounds for the thresholding Algorithm 3.

Using an approximate SVD solution. The guarantee of Theorem 3.1 for Algorithm 3 uses an exact SVD computation, which could take O(n³) time. We can further improve the running time by using an approximate SVD algorithm, such as the randomized block Krylov method of Musco & Musco (2015), which runs in nearly input-sparsity time.

4. SPCA VIA A SEMIDEFINITE PROGRAMMING RELAXATION

Our third algorithm solves the convex relaxation of eqn. (2) to obtain an optimal solution Z* ∈ R^{n×n}. We then need to convert the matrix Z* ∈ R^{n×n} into a sparse vector that will be the output of our approximate SPCA algorithm and will satisfy certain accuracy guarantees. Towards that end, we employ a novel two-step rounding procedure. First, a critical observation is that generating a random Gaussian vector g ∈ R^n and computing the vector Z*g ∈ R^n results in an unbiased estimator of the trace of (Z*)ᵀAZ*, in the following sense:

E_g[gᵀ(Z*)ᵀAZ*g] = Tr((Z*)ᵀAZ*).

Using von Neumann's trace inequality, we can prove that

E_g[gᵀ(Z*)ᵀAZ*g] = Tr((Z*)ᵀAZ*) ≥ γ_{Z*}·Tr(AZ*) ≥ γ_{Z*}·Z*.

Here γ_{Z*} is a constant that precisely depends on the top singular value of Z*, the condition number of Z*, and the extent to which the SDP relaxation of eqn. (2) is able to capture the original problem (see Theorem 4.1 for the exact expression of γ_{Z*}). The above inequality implies that, at least in expectation, we could use the vector Z*g as a "rounding" of the output of the semidefinite programming relaxation. However, there is absolutely no guarantee that the vector Z*g is sparse. Thus, in order to sparsify Z*g, we employ a separate sparsification procedure, where each entry of Z*g is kept (and rescaled) with probability proportional to its magnitude.
This procedure is similar to the one proposed in (Fountoulakis et al., 2017) and guarantees that larger entries of Z*g are more likely to be kept, while smaller entries of Z*g are more likely to be set to zero, without too much loss in accuracy. We also note that, to ensure a sufficiently high probability of success for the overall approach, we generate multiple Gaussian vectors and keep the one that maximizes the quantity gᵀ(Z*)ᵀAZ*g. See Algorithm 4 and Algorithm 5 for a detailed presentation of our approach.

Algorithm 4 SPARSIFY
Input: y ∈ R^n and sparsity parameter s.
Output: z ∈ R^n with E[‖z‖₀] ≤ s.
1: for i = 1 … n do
2:   z_i ← (1/p_i)·y_i with probability p_i = min{1, s·|y_i| / ‖y‖₁}, and z_i ← 0 otherwise.

Algorithm 5 SPCA via the SDP relaxation
1: Solve the relaxation of eqn. (2) to obtain the optimal solution Z* ∈ R^{n×n};
2: Set the sparsity parameter s (see Theorem 4.1);
3: Generate M random Gaussian vectors g₁, …, g_M in R^n;
4: y ← Z*g_j, where j ← argmax_{i=1…M} g_iᵀ(Z*)ᵀAZ*g_i;
5: z ← SPARSIFY(y, s);

The running time of the algorithm is dominated by the time needed to solve the semidefinite programming relaxation of eqn. (2), which, in our setting, is O(n^3.5) (Alizadeh, 1995). We do note that SDP solvers such as the one in (Alizadeh, 1995) return an additive error approximation to the optimal solution. However, the running time dependence of SDP solvers on the additive error γ is logarithmic in 1/γ, and thus highly accurate approximations can be derived without a significant increase in the number of iterations of the solver. Thus, for the sake of clarity, we initially omit this additive error from the analysis and address the approximate solution at the end. Our main quality-of-approximation result for Algorithm 5 is Theorem 4.1. For simplicity of presentation, and following the lines of (Fountoulakis et al., 2017), we assume that the rows/columns of A have been normalized to have unit norms; this assumption can be relaxed as in (Fountoulakis et al., 2017). In the statement of the theorem, we will use the notation Z₁ to denote the best rank-one approximation to the matrix Z.
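The two-step rounding described above (Gaussian rounding followed by SPARSIFY) admits a short numpy sketch; the function names are ours, and the step that solves the SDP itself is assumed to have been done elsewhere (the sketch takes its solution Z* as input):

```python
import numpy as np

def sparsify(y, s, rng):
    """Algorithm 4 (sketch): keep entry y_i (rescaled by 1/p_i) with
    probability p_i = min(1, s*|y_i| / ||y||_1), so E[nnz(z)] <= s."""
    p = np.minimum(1.0, s * np.abs(y) / np.sum(np.abs(y)))
    keep = rng.random(len(y)) < p
    return np.where(keep, y / np.where(p > 0, p, 1.0), 0.0)

def round_sdp_solution(Zstar, A, s, M, rng):
    """Rounding steps of Algorithm 5 (sketch): among M Gaussian vectors,
    pick g maximizing g^T Z*^T A Z* g, then sparsify y = Z* g."""
    G = rng.standard_normal((M, Zstar.shape[0]))
    ys = G @ Zstar.T                            # row i equals (Z* g_i)^T
    scores = np.einsum('ij,jk,ik->i', ys, A, ys)  # y_i^T A y_i for each i
    return sparsify(ys[np.argmax(scores)], s, rng)
```

Note that when s·|y_i|/‖y‖₁ ≥ 1 for every i, SPARSIFY keeps all entries unchanged, so the procedure only perturbs y when the sparsity budget actually binds.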
Theorem 4.1 Given a PSD matrix A ∈ R^{n×n}, a sparsity parameter k, and an error tolerance ε > 0, let Z be an optimal solution to the relaxed SPCA problem of eqn. (2). Assume that

Tr(AZ) ≤ α·Tr(AZ₁)   (6)

for some constant α ≥ 1. Then, Algorithm 5 outputs a vector z ∈ R^n that, with probability at least 5/8, satisfies E[‖z‖₀] = O(k² log^{5/2}(1/ε) / ε²), ‖z‖₂ = O(log(1/ε)), and

zᵀAz ≥ γ_Z·(1 − ε)·Z* − ε.

Here γ_Z = (1 − (1 − 1/κ(Z))·(1 − 1/α))·σ₁(Z), with σ₁(Z) and κ(Z) being the top singular value and the condition number of Z, respectively. Similar to Theorem 2.2, the probability of success can be boosted to 1 − δ by repeating the algorithm O(log(1/δ)) times in parallel. Moreover, by using Markov's inequality, we can also guarantee a vector z with sparsity O(k² log^{5/2}(1/ε) / ε²) with probability 1 − δ, rather than just in expectation. We now discuss the condition of eqn. (6) and the constant γ_Z. Our assumption simply says that much of the trace of the matrix AZ should be captured by the trace of AZ₁, as quantified by the constant α. For example, if Z were a rank-one matrix, then the assumption would hold with α = 1. As the trace of AZ₁ fails to approximate the trace of AZ (which intuitively implies that the SDP relaxation of eqn. (2) did not sufficiently capture the original problem), the constant α increases and the quality of the approximation decreases. More precisely, first notice that the constant γ_Z is upper bounded by one, because σ₁(Z) ≤ 1 by the constraints of eqn. (2). Second, the quality of the approximation increases as γ_Z approaches one. This happens if either the condition number of Z is close to one or the constant α is close to one; at the same time, σ₁(Z) also needs to be close to one. Clearly, these conditions are satisfied if Z is well approximated by Z₁. In our experiments, we indeed observed that α is close to one and that the top singular value of Z is close to one, which together imply that γ_Z is also close to one (Appendix, Table 6).
The proof of Theorem 4.1 is delegated to Appendix A.3 (as Theorem A.13). As part of the proof, we bound the ℓ₂ norm of the vector z of Algorithm 5 by showing that, with probability at least 3/4, ‖z‖₂ = O(log(1/ε)). Notice that this slightly relaxes the requirement that z have unit norm; however, even for accuracy ε close to machine precision, log(1/ε) is a small constant.

Using an approximate SDP solution. The guarantee of Theorem 4.1 for Algorithm 5 uses an optimal solution Z* to the SDP relaxation in eqn. (2). In practice, we only obtain an approximate solution Z̃ to eqn. (2) using a standard SDP solver, e.g., (Alizadeh, 1995); since the solver's running time depends only logarithmically on the additive error, this approximation affects our guarantees by a negligible additive term.

5. EXPERIMENTS

Comparisons and metrics. We compare our Algorithm 2 (spca-r), Algorithm 3 (spca-d), and Algorithm 5 (spca-sdp) with the solutions returned by spca (Zou et al., 2006), as well as the simple MaxComp heuristic (Cadima & Jolliffe, 1995). We define the quantity f(y) = yᵀAy / ‖A‖₂ to measure the quality of an approximate solution y ∈ R^n to the SPCA problem. Notice that 0 ≤ f(y) ≤ 1 for all y with ‖y‖₂ ≤ 1. As f(y) gets closer to one, the vector y captures more of the variance of the matrix A that corresponds to its top singular value and corresponding singular vector. Our goal is to identify a sparse vector y with f(y) ≈ 1. On a first benchmark dataset, we take k = 7; Table 1 shows that while spca-d (Algorithm 3) and spca-r (Algorithm 2) perform very similarly to spca and dspca, with percentage of variance explained (PVE) uniformly better than spca, our SDP-based method spca-sdp (Algorithm 5) exactly recovers the optimal solution: its output matches that of both dec and cwpca and is very close to that of spca-lowrank. We also apply our algorithms to another benchmark artificial example from (Zou et al., 2006).
Since the outputs of our Algorithm 2 and Algorithm 5 may have norms that slightly exceed one, and in order to have a fair comparison between different methods, we normalize our outputs in the same way as in (Fountoulakis et al., 2017) (Appendix B).

Results. In our experiments, for spca-d, spca-r, and spca-sdp, we fix the sparsity s to be equal to k, so that all algorithms return the same number of non-zero elements. In Figures 1a-1c we evaluate the performance of the different SPCA algorithms by plotting f(y) against ‖y‖₀, i.e., the sparsity of the output vector, on data from chromosome 1, chromosome 2, and the gene expression data. Note that the performance of our SDP-based method (spca-sdp) is indeed comparable with the state-of-the-art dec, cwpca, and spca-lowrank, while both spca-d and spca-r are better than, or at least comparable to, both spca and maxcomp. However, in practice, the running time of the SDP relaxation is substantially higher than that of our other methods, which highlights the interesting trade-offs between accuracy and computation discussed in (Amini & Wainwright, 2009). See Appendix B for more experimental results.
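The quality metric f(y) defined above is straightforward to compute; a minimal sketch (the function name is ours):

```python
import numpy as np

def pve_metric(y, A):
    """Quality metric f(y) = y^T A y / ||A||_2: the fraction of the variance
    corresponding to the top singular value of A that is captured by y."""
    return float(y @ A @ y) / np.linalg.norm(A, 2)  # ord=2 is the spectral norm
```

For a unit vector y, f(y) = 1 exactly when y is a top eigenvector of A, which is the sense in which f(y) ≈ 1 certifies a good sparse solution.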

6. CONCLUSION AND OPEN PROBLEMS

We presented three provably accurate, polynomial-time approximation algorithms for SPCA that do not impose restrictive assumptions on the input covariance matrix. Future directions include: (i) extending the proposed algorithms to handle more than one sparse singular vector, by deflation or other strategies; and (ii) exploring matching lower bounds and/or improving the guarantees of Theorems 2.2, 3.1, and 4.1.

A APPENDIX

A.1 SPCA VIA RANDOMIZED MATRIX MULTIPLICATION: PROOFS

First, we prove two lemmas that are crucial in proving Lemma 2.1. Throughout, p̂_i = min{s·p_i, 1} denotes the retention probabilities used by Algorithm 1.

Lemma A.1 Given matrices P ∈ R^{m×n} and Q ∈ R^{n×p}, let S ∈ R^{n×n} be constructed using Algorithm 1. Then, for any indices i ∈ {1, …, m} and j ∈ {1, …, p},

E[(PS²Q)_{ij}] = (PQ)_{ij} and Var[(PS²Q)_{ij}] = Σ_{k=1}^n P²_{ik} Q²_{kj} / p̂_k − Σ_{k=1}^n P²_{ik} Q²_{kj}.

Proof: Write S²_{kk} = Z_k / p̂_k, where Z_k ∼ Ber(p̂_k) indicates whether the k-th diagonal entry of S was kept. For any i, j, we have

E[(PS²Q)_{ij}] = E[Σ_{k=1}^n P_{ik} S²_{kk} Q_{kj}] = E[Σ_{k=1}^n P_{ik} (Z_k / p̂_k) Q_{kj}] = Σ_{k=1}^n (P_{ik} Q_{kj} / p̂_k)·E[Z_k] = Σ_{k=1}^n P_{ik} Q_{kj} = (PQ)_{ij},

since Z²_k = Z_k ∼ Ber(p̂_k) and thus E[Z²_k] = E[Z_k] = p̂_k. Similarly,

Var[(PS²Q)_{ij}] = Var[Σ_{k=1}^n P_{ik} (Z_k / p̂_k) Q_{kj}] = Σ_{k=1}^n (P_{ik} Q_{kj} / p̂_k)²·Var[Z_k] = Σ_{k=1}^n ((1 − p̂_k)/p̂_k)·P²_{ik} Q²_{kj} = Σ_{k=1}^n P²_{ik} Q²_{kj} / p̂_k − Σ_{k=1}^n P²_{ik} Q²_{kj}. □

Lemma A.2 Given matrices P ∈ R^{m×n} and Q ∈ R^{n×p}, let S ∈ R^{n×n} be constructed using Algorithm 1. Then,

E[‖PS²Q − PQ‖²_F] = Σ_{i=1}^n ‖P_{*i}‖₂² ‖Q_{i*}‖₂² / p̂_i − Σ_{i=1}^n ‖P_{*i}‖₂² ‖Q_{i*}‖₂².

Here P_{*i} and Q_{i*} are the i-th column of P and the i-th row of Q, respectively.

Proof: Using Lemma A.1, we have

E[‖PQ − PS²Q‖²_F] = Σ_{i=1}^m Σ_{j=1}^p E[((PQ)_{ij} − (PS²Q)_{ij})²] = Σ_{i=1}^m Σ_{j=1}^p Var[(PS²Q)_{ij}] = Σ_{k=1}^n (1/p̂_k − 1)·(Σ_{i=1}^m P²_{ik})·(Σ_{j=1}^p Q²_{kj}) = Σ_{k=1}^n ‖P_{*k}‖₂² ‖Q_{k*}‖₂² / p̂_k − Σ_{k=1}^n ‖P_{*k}‖₂² ‖Q_{k*}‖₂². □

Proof of Lemma 2.1: From Lemma A.2, splitting the sum according to whether p_i ≤ 1/s (so that p̂_i = s·p_i) or p_i > 1/s (so that p̂_i = 1),

E[‖PS²Q − PQ‖²_F] = Σ_{i: p_i ≤ 1/s} ‖P_{*i}‖₂² ‖Q_{i*}‖₂² / (s·p_i) + Σ_{i: p_i > 1/s} ‖P_{*i}‖₂² ‖Q_{i*}‖₂² − Σ_{i=1}^n ‖P_{*i}‖₂² ‖Q_{i*}‖₂² ≤ Σ_{i: p_i ≤ 1/s} ‖P_{*i}‖₂² ‖Q_{i*}‖₂² / (s·p_i) ≤ Σ_{i=1}^n ‖P_{*i}‖₂² ‖Q_{i*}‖₂² / (s·p_i).

We conclude the proof by setting p_i = ‖P_{*i}‖₂² / ‖P‖²_F, which makes the last expression equal to (1/s)·‖P‖²_F·Σ_{i=1}^n ‖Q_{i*}‖₂² = (1/s)·‖P‖²_F·‖Q‖²_F. □

Proof of Theorem 2.2: In Lemma 2.1, let P = X and Q = x* to get

E[‖XS²x* − Xx*‖₂²] ≤ (1/s)·‖X‖²_F·‖x*‖₂² ≤ (1/s)·‖X‖²_F.

The last inequality follows from ‖x*‖₂ ≤ 1. Moreover, by Markov's inequality, with probability at least 3/4,

‖XS²x* − Xx*‖₂² ≤ (4/s)·‖X‖²_F = (ε²/k)·‖X‖²_F.   (9)

Let x̂ = Sx*.
Taking square roots of both sides of the above inequality, and applying the triangle inequality on its left-hand side, we get:

|‖Xx*‖₂ − ‖XSx̂‖₂| ≤ (ε/√k)·‖X‖_F
⇒ ‖XSx̂‖₂ ≥ √Z* − (ε/√k)·‖X‖_F
⇒ ‖XSx̂‖₂² ≥ Z* + (ε²/k)·‖X‖²_F − (2ε/√k)·√Z*·‖X‖_F.   (10)

Note that XS²x* = XSx̂ and Z* = ‖Xx*‖₂². Dropping the non-negative term (ε²/k)·‖X‖²_F in eqn. (10), we conclude

‖XSx̂‖₂² ≥ Z* − (2ε/√k)·√Z*·‖X‖_F.   (11)

Next, using sub-multiplicativity on the left-hand side of eqn. (11),

‖XSx̂‖₂² ≤ ‖XS‖₂²·‖x̂‖₂² = ‖XSv‖₂²·‖x̂‖₂²,   (12)

where v ∈ R^n is the top right singular vector of XS, so that ‖XSv‖₂ = ‖XS‖₂. Letting x*_i denote the i-th entry of x*, we have

E[‖x̂‖₂²] = E[Σ_{i=1}^n x*_i²·Z²_i / p̂_i] = Σ_{i=1}^n (x*_i² / p̂_i)·p̂_i = ‖x*‖₂² = 1,   (13)

since E[Z²_i] = p̂_i. Using Markov's inequality, with probability at least 1/2,

‖x̂‖₂² ≤ 2.   (14)

Conditioning on this event, we can rewrite eqn. (12) as follows:

‖XSx̂‖₂² ≤ 2·‖XSv‖₂² = 2·(Sv)ᵀXᵀX(Sv) = 2·yᵀXᵀXy.   (15)

Combining eqns. (11) and (15), we conclude

yᵀXᵀXy ≥ (1/2)·Z* − (ε/√k)·√Z*·‖X‖_F.

Using XᵀX = A and Tr(A) = ‖X‖²_F concludes the proof of eqn. (4). Finally, following the lines of eqn. (13), we can prove E[‖y‖₂²] = E[‖Sv‖₂²] = 1. To conclude the proof of the theorem, notice that the failure probability is at most 1/4 + 1/2 = 3/4, by a union bound on the failure probabilities of eqns. (9) and (14). □

A.2 SPCA VIA THRESHOLDING: PROOFS

We will use the notation of Section 3. For notational convenience, let σ₁, …, σ_n be the diagonal entries of the matrix Σ ∈ R^{n×n}, i.e., the singular values of A.

Proof of Lemma 3.2: Let U_{ℓ,⊥} ∈ R^{n×(n−ℓ)} be a matrix whose columns form a basis for the subspace perpendicular to the subspace spanned by the columns of U_ℓ, and let Σ_{ℓ,⊥} ∈ R^{(n−ℓ)×(n−ℓ)} be the diagonal matrix of the bottom n−ℓ singular values of A. Notice that U = [U_ℓ, U_{ℓ,⊥}] and Σ = diag(Σ_ℓ, Σ_{ℓ,⊥}); thus,

UΣ^{1/2}Uᵀ = U_ℓ Σ_ℓ^{1/2} U_ℓᵀ + U_{ℓ,⊥} Σ_{ℓ,⊥}^{1/2} U_{ℓ,⊥}ᵀ.

By the Pythagorean theorem,

‖UΣ^{1/2}Uᵀx‖₂² = ‖U_ℓ Σ_ℓ^{1/2} U_ℓᵀ x‖₂² + ‖U_{ℓ,⊥} Σ_{ℓ,⊥}^{1/2} U_{ℓ,⊥}ᵀ x‖₂².
Using the unitary invariance of the vector two-norm and sub-multiplicativity, we get

‖Σ_ℓ^{1/2} U_ℓᵀ x‖₂² ≥ ‖Σ^{1/2} Uᵀ x‖₂² − ‖Σ_{ℓ,⊥}^{1/2}‖₂²·‖U_{ℓ,⊥}ᵀ x‖₂² ≥ ‖Σ^{1/2} Uᵀ x‖₂² − ‖Σ_{ℓ,⊥}^{1/2}‖₂²,

where the last step uses ‖U_{ℓ,⊥}ᵀ x‖₂ ≤ ‖x‖₂ = 1. We conclude the proof by noting that ‖Σ^{1/2} Uᵀ x‖₂² = xᵀUΣUᵀx = xᵀAx and that

‖Σ_{ℓ,⊥}^{1/2}‖₂² = σ_{ℓ+1} ≤ (1/ℓ)·Σ_{i=1}^n σ_i = Tr(A)/ℓ ≤ ε·Tr(A).

The inequality above follows since σ₁ ≥ σ₂ ≥ … ≥ σ_ℓ ≥ σ_{ℓ+1} ≥ … ≥ σ_n, so σ_{ℓ+1} is at most the average of the top ℓ singular values; the final step sets ℓ = ⌈1/ε⌉. □

Proof of Theorem 3.1: Let R = {i₁, …, i_{|R|}} be the set of indices of rows of U_ℓ (columns of U_ℓᵀ) that have squared norms at least ε/k and let R̄ be its complement. Here |R| denotes the cardinality of the set R and R ∪ R̄ = {1, …, n}. Let R ∈ R^{n×|R|} be the sampling matrix that selects the columns of U_ℓᵀ whose indices are in the set R, and let R_⊥ ∈ R^{n×(n−|R|)} be the sampling matrix that selects the columns whose indices are in R̄. Thus, each column of R (respectively, R_⊥) has a single non-zero entry, equal to one, corresponding to one of the |R| (respectively, |R̄|) selected columns. Formally, R_{i_t,t} = 1 for t = 1, …, |R|, while all other entries of R are set to zero; R_⊥ is defined analogously. The following properties are easy to prove: RRᵀ + R_⊥R_⊥ᵀ = I_n; RᵀR = I; R_⊥ᵀR_⊥ = I; R_⊥ᵀR = 0. Recall that x* is the optimal solution of the SPCA problem of eqn. (1). We proceed as follows:

‖Σ_ℓ^{1/2} U_ℓᵀ x*‖₂² = ‖Σ_ℓ^{1/2} U_ℓᵀ (RRᵀ + R_⊥R_⊥ᵀ) x*‖₂² ≤ 2·‖Σ_ℓ^{1/2} U_ℓᵀ RRᵀ x*‖₂² + 2·‖Σ_ℓ^{1/2} U_ℓᵀ R_⊥R_⊥ᵀ x*‖₂² ≤ 2·‖Σ_ℓ^{1/2} U_ℓᵀ RRᵀ x*‖₂² + 2·σ₁·‖U_ℓᵀ R_⊥R_⊥ᵀ x*‖₂².   (16)

The above inequalities follow from ‖a + b‖₂² ≤ 2‖a‖₂² + 2‖b‖₂² and sub-multiplicativity. We now bound the second term on the right-hand side of the above inequality:

‖U_ℓᵀ R_⊥ R_⊥ᵀ x*‖₂ = ‖Σ_i (U_ℓᵀ R_⊥)_{*i} (R_⊥ᵀ x*)_i‖₂ ≤ Σ_i ‖(U_ℓᵀ R_⊥)_{*i}‖₂·|(R_⊥ᵀ x*)_i| ≤ √(ε/k)·Σ_i |(R_⊥ᵀ x*)_i| = √(ε/k)·‖R_⊥ᵀ x*‖₁ ≤ √(ε/k)·√k = √ε.   (17)

In the above derivations we used standard properties of norms and the fact that the columns of U_ℓᵀ with indices in the set R̄ have squared norm at most ε/k.
The last inequality follows from $\|R_\perp^\top x^*\|_1 \le \|x^*\|_1 \le \sqrt{k}$, since $x^*$ has at most $k$ non-zero entries and Euclidean norm at most one. Recall that the vector $y$ of Algorithm 3 maximizes $\|\Sigma_\ell^{1/2} U_\ell^\top R x\|_2$ over all vectors $x$ of appropriate dimensions (including $R^\top x^*$), and thus
$$\|\Sigma_\ell^{1/2} U_\ell^\top R y\|_2 \ge \|\Sigma_\ell^{1/2} U_\ell^\top R R^\top x^*\|_2. \quad (18)$$
Combining eqns. (16), (17), and (18), we get
$$\tfrac{1}{2}\,\|\Sigma_\ell^{1/2} U_\ell^\top x^*\|_2^2 \le \|\Sigma_\ell^{1/2} U_\ell^\top z\|_2^2 + \epsilon\operatorname{Tr}(A). \quad (19)$$
In the above we used $z = R y$ (as in Algorithm 3) and $\sigma_1 \le \operatorname{Tr}(A)$. Notice that $U_\ell \Sigma_\ell^{1/2} U_\ell^\top z + U_{\ell,\perp} \Sigma_{\ell,\perp}^{1/2} U_{\ell,\perp}^\top z = U \Sigma^{1/2} U^\top z$, and use the Pythagorean theorem to get
$$\|U_\ell \Sigma_\ell^{1/2} U_\ell^\top z\|_2^2 + \|U_{\ell,\perp} \Sigma_{\ell,\perp}^{1/2} U_{\ell,\perp}^\top z\|_2^2 = \|U \Sigma^{1/2} U^\top z\|_2^2.$$
Using the unitary invariance of the two-norm and dropping a non-negative term, we get the bound
$$\|\Sigma_\ell^{1/2} U_\ell^\top z\|_2^2 \le \|\Sigma^{1/2} U^\top z\|_2^2. \quad (20)$$
Combining eqns. (20) and (19), we conclude
$$\tfrac{1}{2}\,\|\Sigma_\ell^{1/2} U_\ell^\top x^*\|_2^2 \le \|\Sigma^{1/2} U^\top z\|_2^2 + \epsilon\operatorname{Tr}(A). \quad (21)$$
We now apply Lemma 3.2 to the optimal vector $x^*$ to get
$$\|\Sigma^{1/2} U^\top x^*\|_2^2 - \epsilon\operatorname{Tr}(A) \le \|\Sigma_\ell^{1/2} U_\ell^\top x^*\|_2^2.$$
Combining with eqn. (21), we get
$$z^\top A z \ge \tfrac{1}{2}\,Z^* - \tfrac{3}{2}\,\epsilon\operatorname{Tr}(A).$$
In the above we used $\|\Sigma^{1/2} U^\top z\|_2^2 = z^\top A z$ and $\|\Sigma^{1/2} U^\top x^*\|_2^2 = (x^*)^\top A x^* = Z^*$. □

A.3 SPCA VIA A SEMIDEFINITE PROGRAMMING RELAXATION: PROOFS

We start with a lemma arguing that the sparsification procedure of Algorithm 5 does not significantly distort the $\ell_2$ norm of the input/output vectors.

Lemma A.3 Let $y$ and $z$ be defined as in Algorithm 5. If $\|y\|_1 \le \alpha$, then, with probability at least $15/16$,
$$\|z - y\|_2^2 \le \frac{16\alpha^2}{s}.$$
Proof: Notice that
$$\mathbb{E}\,\|z - y\|_2^2 = \sum_{i=1}^n \Big(\frac{1}{p_i} - 1\Big) y_i^2 \le \sum_{i=1}^n \frac{y_i^2}{p_i} \le \|y\|_1 \sum_{i=1}^n \frac{|y_i|}{s} = \frac{\|y\|_1^2}{s},$$
which is at most $\alpha^2/s$ from our assumptions. The lemma follows by Markov's inequality. □

To prove Lemma 4.3, we start with the following consequence of the triangle inequality, e.g., Lemma 3 from (Fountoulakis et al., 2017).

Lemma A.4 Let $y$ and $z$ be defined as in Algorithm 5. Then
$$|y^\top A y - z^\top A z| \le 2\,|y^\top A (y - z)| + |(y - z)^\top A (y - z)|.$$
Proof: This is Lemma 3 from (Fountoulakis et al., 2017). □

We now proceed to upper bound the two terms in the above lemma separately.
The following lemma bounds the first term.

Lemma A.5 Let $y$ and $z$ be defined as in Algorithm 5. If $\|y\|_1 \le \alpha$ and $\|y\|_2 \le \beta$, then, with probability at least $15/16$, $|y^\top A (y - z)| \le 4\alpha\beta/\sqrt{s}$.

Proof: Recall that we set $z_i = y_i/p_i$ with probability $p_i$ and zero otherwise, for all $i = 1 \ldots n$. Then,
$$\mathbb{E}\big[(y^\top A (y - z))^2\big] = \sum_{i=1}^n \Big(\frac{1}{p_i} - 1\Big) y_i^2\, (A_{i*} y)^2.$$
Using $|A_{i*} y| \le \|A_{i*}\|_2 \|y\|_2 \le \beta$ (from our assumption on the $\ell_2$ norm of $y$ as well as our assumption that the rows/columns of $A$ have unit norm), it follows that
$$\mathbb{E}\big[(y^\top A (y - z))^2\big] \le \beta^2 \sum_{i=1}^n \frac{y_i^2}{p_i} \le \beta^2\,\frac{\|y\|_1^2}{s} \le \frac{\alpha^2\beta^2}{s}.$$
The lemma follows from Markov's inequality. □

The next lemma provides an upper bound for the second term on the right-hand side of Lemma A.4.

Proof: Let $\zeta_i = 1/p_i$ with probability $p_i$ and zero otherwise, for all $i = 1 \ldots n$. Then,
$$\mathbb{E}\big[((y - z)^\top A (z - y))^2\big] = \sum_{a,b,c,d} A_{a,c} A_{b,d}\, y_a y_b y_c y_d \cdot \mathbb{E}\big[(1 - \zeta_a)(1 - \zeta_b)(1 - \zeta_c)(1 - \zeta_d)\big].$$
We immediately have $\mathbb{E}[1 - \zeta_i] = 0$. Thus, if any of the indices $a, b, c, d$ appears only once in the above summation, then $\mathbb{E}[(1 - \zeta_a)(1 - \zeta_b)(1 - \zeta_c)(1 - \zeta_d)] = 0$. Let
$$B_1 = \sum_{a \ne b} A_{a,b}^2\, y_a^2 y_b^2\, \mathbb{E}\big[(1 - \zeta_a)^2 (1 - \zeta_b)^2\big], \qquad B_2 = \sum_{a \ne b} A_{a,a} A_{b,b}\, y_a^2 y_b^2\, \mathbb{E}\big[(1 - \zeta_a)^2 (1 - \zeta_b)^2\big],$$
$$B_3 = \sum_{a \ne b} A_{a,b} A_{b,a}\, y_a^2 y_b^2\, \mathbb{E}\big[(1 - \zeta_a)^2 (1 - \zeta_b)^2\big], \qquad B_4 = \sum_{a=1}^n A_{a,a}^2\, y_a^4\, \mathbb{E}\big[(1 - \zeta_a)^4\big].$$
It now follows that $\mathbb{E}\big[((y - z)^\top A (z - y))^2\big] = \sum_{i=1}^4 B_i$. Using $|A_{i,j}| \le 1$ for all $i, j$, we can bound $B_1$, $B_2$, and $B_3$ by
$$\max_{i=1,2,3} B_i \le \Big(\sum_{a=1}^n y_a^2\, \mathbb{E}\big[(1 - \zeta_a)^2\big]\Big)\Big(\sum_{b=1}^n y_b^2\, \mathbb{E}\big[(1 - \zeta_b)^2\big]\Big).$$
Using $\mathbb{E}[(1 - \zeta_i)^2] = 1/p_i - 1$ for all $i$, we get
$$\max_{i=1,2,3} B_i \le \Big(\sum_{a=1}^n \Big(\frac{1}{p_a} - 1\Big) y_a^2\Big)^2 \le \Big(\frac{\|y\|_1^2}{s}\Big)^2 \le \frac{\alpha^4}{s^2},$$
where the last inequality follows from $\|y\|_1 \le \alpha$.

The next two lemmas bound the $\ell_2$ and $\ell_1$ norms of the vectors $y_i$ for all $i = 1 \ldots M$. We will bound the norm of a single vector $y$ (dropping the index) and then apply a union bound over all $M$ vectors.

Lemma A.7 Let $y$ be defined as in Algorithm 5. Then, $\Pr\big[\|y\|_2 \ge 2\sqrt{\log M}\big] \le 1/M^2$.
Proof: Let $Z = U\Sigma V^\top$ be the singular value decomposition of $Z$ and let $\sigma_i = \Sigma_{i,i}$, for $i = 1 \ldots n$, be the singular values of $Z$. Since $\operatorname{Tr}(Z) = 1$, it follows that $\sum_{i=1}^n \sigma_i = 1$ and also $\sigma_i \le 1$ for all $i = 1 \ldots n$. Additionally,
$$\sum_{i=1}^n \sigma_i^2 \le \sum_{i=1}^n \sigma_i \le 1. \quad (22)$$
Then,
$$\|y\|_2^2 = \|Z g\|_2^2 = g^\top Z^\top Z g = g^\top V \Sigma^2 V^\top g = \|\Sigma V^\top g\|_2^2.$$
The rotational invariance of the Gaussian distribution implies that $\|y\|_2 \sim \|h\|_2$, where $h$ is a random vector whose $i$-th entry satisfies $h_i \sim \mathcal{N}(0, \sigma_i^2)$. Hence,
$$\mathbb{E}\,\|y\|_2^2 = \mathbb{E}\,\|h\|_2^2 = \sum_{i=1}^n \sigma_i^2 \le 1.$$
Now, from Markov's inequality, for any $C > 0$,
$$\Pr\Big[\|y\|_2 \ge t + \frac{\log M}{t}\Big] = \Pr\big[e^{C\|y\|_2} \ge e^{Ct + C\log M/t}\big] \le \frac{\mathbb{E}\, e^{C\|y\|_2}}{e^{Ct}\, M^{C/t}}.$$
Then,
$$\mathbb{E}\, e^{C\|y\|_2} = \mathbb{E}\, e^{C\|h\|_2} \le \prod_{i=1}^n \frac{2}{\sqrt{2\pi}\,\sigma_i} \int_0^\infty e^{Cx}\, e^{-x^2/2\sigma_i^2}\, dx = \prod_{i=1}^n \frac{2}{\sqrt{2\pi}\,\sigma_i}\, e^{C^2\sigma_i^2/2} \int_0^\infty e^{-\big(\frac{x}{\sqrt{2}\sigma_i} - \frac{C\sigma_i}{\sqrt{2}}\big)^2}\, dx$$
$$\le \prod_{i=1}^n \frac{2}{\sqrt{\pi}}\, e^{C^2\sigma_i^2/2} \int_0^\infty e^{-t^2}\, dt = \prod_{i=1}^n e^{C^2\sigma_i^2/2} = \exp\Big(\sum_{i=1}^n C^2\sigma_i^2/2\Big).$$
Using eqn. (22), we get $\mathbb{E}\, e^{C\|y\|_2} \le e^{C^2/2}$. Setting $C = 2t$, we get
$$\Pr\Big[\|y\|_2 \ge t + \frac{\log M}{t}\Big] \le \frac{e^{C^2/2}}{e^{Ct}\, M^{C/t}} = \frac{1}{M^2}.$$
Setting $t = \sqrt{\log M}$ (so that $t + \log M/t = 2\sqrt{\log M}$) concludes the proof. □

Prior to bounding the $\ell_1$ norm of $y$, we present a measure-concentration result that will be useful in our proof. First, recall the definition of $L$-Lipschitz functions.

Definition A.8 Let $f: \mathbb{R}^n \to \mathbb{R}$ be any function. If $|f(x) - f(y)| \le L\,\|x - y\|_2$ for all $x, y \in \mathbb{R}^n$, then $f$ is $L$-Lipschitz.

Theorem A.9 (Gaussian Lipschitz Concentration) (Wainwright, 2015) Let $f$ be an $L$-Lipschitz function and let $g \in \mathbb{R}^n$ be a vector of i.i.d. standard Gaussians. Then $f(g)$ is sub-Gaussian with variance proxy $L^2$ and, for all $t \ge 0$,
$$\Pr\big[|f(g) - \mathbb{E}[f(g)]| \ge t\big] \le 2e^{-t^2/2L^2}.$$

Lemma A.10 Let $y$ be defined as in Algorithm 5. Then, $\Pr\big[\|y\|_1 \ge k(1 + 2\sqrt{\log M})\big] \le 1/M^2$.

Proof: Since $g_j \sim \mathcal{N}(0,1)$ for all $j = 1 \ldots n$, the 2-stability of the Gaussian distribution implies that
$$\mathbb{E}\big[\|Z g\|_1\big] = \sum_{i=1}^n \mathbb{E}\Big|\sum_{j=1}^n Z_{i,j}\, g_j\Big| = \sum_{i=1}^n \sqrt{\frac{2}{\pi}}\,\|Z_{i*}\|_2 = \sqrt{\frac{2}{\pi}}\,\|Z\|_{1,2}.$$
Let $f(x) = \|Z x\|_1$. The triangle inequality implies that
$$\big|\,\|Z x\|_1 - \|Z y\|_1\,\big| \le \sum_{i=1}^n |Z_{i*} x - Z_{i*} y| = \sum_{i=1}^n |Z_{i*}(x - y)|.$$
Thus, by Cauchy–Schwarz,
$$\big|\,\|Z x\|_1 - \|Z y\|_1\,\big| \le \sum_{i=1}^n \|Z_{i*}\|_2\, \|x - y\|_2,$$
and $f(x)$ is $\|Z\|_{1,2}$-Lipschitz.⁴ Using Theorem A.9,
$$\Pr\Big[\Big|\,\|y\|_1 - \sqrt{\tfrac{2}{\pi}}\,\|Z\|_{1,2}\Big| \ge t\Big] \le 2e^{-t^2/2\|Z\|_{1,2}^2}$$

⁴ Recall that the $L_{p,q}$ norm of $A$ is $\|A\|_{p,q} = \big(\sum_{i=1}^n \big(\sum_{j=1}^n |A_{i,j}|^q\big)^{p/q}\big)^{1/p}$.

We now prove an inequality that was used in eqn. (5) to compare $y^\top A y$ and $\operatorname{Tr}(AZ)$.

Lemma A.11 Let $Z, A \in \mathbb{R}^{n \times n}$ be PSD matrices with $\operatorname{Tr}(AZ) \le \alpha \operatorname{Tr}(AZ_1)$ for some $\alpha \ge 1$, where $Z_1$ is the best rank-1 approximation of $Z$. Then, $\operatorname{Tr}(ZAZ) \ge \gamma_Z \cdot \operatorname{Tr}(AZ)$. Here
$$\gamma_Z = \Big(1 - \Big(1 - \frac{1}{\kappa(Z)}\Big)\Big(1 - \frac{1}{\alpha}\Big)\Big)\,\sigma_1(Z),$$
with $\sigma_1(Z)$ and $\kappa(Z)$ being the top singular value and condition number of $Z$, respectively.

Proof: For simplicity of exposition, assume that $\operatorname{rank}(Z) = n$. Let $Z = U\Sigma U^\top$ be the SVD of $Z$. Write $U = (U_1\; U_{1,\perp})$ and $\Sigma = \big[\begin{smallmatrix}\Sigma_1 & 0\\ 0 & \Sigma_{1,\perp}\end{smallmatrix}\big]$, so that $Z_1 = U_1 \Sigma_1 U_1^\top$ and $Z_{1,\perp} = U_{1,\perp} \Sigma_{1,\perp} U_{1,\perp}^\top$. As $Z_1$ is the best rank-1 approximation of $Z$, we have $Z_1 Z_{1,\perp} = Z_{1,\perp} Z_1 = 0$. Using this, we rewrite $\operatorname{Tr}(ZAZ)$ as follows:
$$\operatorname{Tr}(ZAZ) = \operatorname{Tr}\big((Z_1 + Z_{1,\perp}) A (Z_1 + Z_{1,\perp})\big) = \operatorname{Tr}(Z_1 A Z_1) + \operatorname{Tr}(Z_{1,\perp} A Z_{1,\perp}) + \operatorname{Tr}(A Z_{1,\perp} Z_1) + \operatorname{Tr}(A Z_1 Z_{1,\perp})$$
$$= \operatorname{Tr}(Z_1 A Z_1) + \operatorname{Tr}(Z_{1,\perp} A Z_{1,\perp}), \quad (23)$$
where the third equality follows from the invariance of the matrix trace under cyclic permutations and the last step is due to $Z_1 Z_{1,\perp} = Z_{1,\perp} Z_1 = 0$. Next, we rewrite $\operatorname{Tr}(AZ)$ as
$$\operatorname{Tr}(AZ) = \operatorname{Tr}(AZ_1) + \operatorname{Tr}(AZ_{1,\perp}) = \operatorname{Tr}(A Z_1 Z_1^\dagger Z_1) + \operatorname{Tr}(AZ_{1,\perp}) = \operatorname{Tr}(Z_1^\dagger Z_1 A Z_1) + \operatorname{Tr}(AZ_{1,\perp}) \le \sigma_1(Z_1^\dagger)\cdot\sigma_1(Z_1 A Z_1) + \operatorname{Tr}(AZ_{1,\perp}), \quad (24)$$
where $Z_1^\dagger$ is the pseudo-inverse of $Z_1$; we have used the fact that $Z_1 Z_1^\dagger Z_1 = Z_1$, and the last inequality follows from von Neumann's trace inequality. Now, noting that $\sigma_1(Z_1^\dagger) = 1/\sigma_1(Z)$ and that $\sigma_1(Z_1 A Z_1) \le \operatorname{Tr}(Z_1 A Z_1)$, applying eqn. (23) we have
$$\operatorname{Tr}(AZ) \le \frac{1}{\sigma_1(Z)}\big(\operatorname{Tr}(ZAZ) - \operatorname{Tr}(Z_{1,\perp} A Z_{1,\perp})\big) + \operatorname{Tr}(AZ_{1,\perp}). \quad (25)$$

Proof of Theorem A.13: Consider Algorithm 5 and let $Z^*$ be an optimal solution to the SPCA semidefinite relaxation of eqn. (2).
Then, as already discussed, $(x^*)^\top A x^* \le \operatorname{Tr}(A Z^*)$, where $x^*$ is the optimal solution to the SPCA problem of eqn. (1). Then, using Lemma A.11, it follows that
$$\gamma_{Z^*}\operatorname{Tr}(A Z^*) \le \operatorname{Tr}\big((Z^*)^\top A Z^*\big).$$
Applying Theorem A.12 to the matrix $(Z^*)^\top A Z^*$ and using our choice of $y$ in Algorithm 4, we get
$$y^\top A y \ge \frac{1}{M} \sum_{i=1}^M g_i^\top (Z^*)^\top A Z^* g_i \ge (1 - \epsilon)\operatorname{Tr}\big((Z^*)^\top A Z^*\big),$$
with probability at least $7/8$. By Lemma 4.3, we have $y^\top A y \le z^\top A z + \epsilon$ with probability at least $3/4$. Thus, with probability at least $5/8$,
$$(1 - \epsilon)\,\gamma_{Z^*}\cdot Z^* = (1 - \epsilon)\,\gamma_{Z^*}\cdot (x^*)^\top A x^* \le z^\top A z + \epsilon.$$
To conclude the proof, we need to bound the $\ell_2$ norm of the solution vector $z$. Let $\mathcal{E}$ be the event that $\|Z^* g_i\|_1 \le k(1 + 2\sqrt{\log M})$ and $\|Z^* g_i\|_2 \le 2\sqrt{\log M}$ for all $i = 1 \ldots M$. From Lemma A.7, Lemma A.10, and the union bound, we have $\Pr[\mathcal{E}] \ge 1 - 2/M$. Conditioned on $\mathcal{E}$, Lemma A.3 implies that, with probability at least $15/16$,
$$\|y - z\|_2^2 \le \frac{16\, k^2 (1 + 2\sqrt{\log M})^2}{s}.$$
Therefore, with probability at least $15/16 - 2/M \ge 3/4$ (since $M \ge 16$), an application of the triangle inequality gives
$$\|z\|_2 \le \|y\|_2 + \|z - y\|_2 \le 2\sqrt{\log M} + \frac{4k(1 + 2\sqrt{\log M})}{\sqrt{s}}.$$
Using our chosen values for $\alpha$ and $\beta$ concludes the proof. □
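The sparsification step analyzed above is easy to check numerically. The sketch below is our own illustration (not the paper's code): it assumes sampling probabilities $p_i = \min(1, s|y_i|/\|y\|_1)$, which is the choice consistent with the bound $\mathbb{E}\|z-y\|_2^2 \le \|y\|_1^2/s$ in the proof of Lemma A.3, sets $z_i = y_i/p_i$ with probability $p_i$ and zero otherwise, and verifies the bound empirically.

```python
import numpy as np

# Sketch (our own illustration, hypothetical parameter choices): the
# sparsification step of Lemma A.3, with p_i = min(1, s|y_i| / ||y||_1).
rng = np.random.default_rng(3)

n, s = 500, 50
y = rng.standard_normal(n) / n                 # a dense input vector
p = np.minimum(1.0, s * np.abs(y) / np.abs(y).sum())

trials = 5000
errs = np.empty(trials)
for t in range(trials):
    keep = rng.random(n) < p                   # keep entry i w.p. p_i
    z = np.where(keep, y / p, 0.0)             # unbiased: E[z] = y
    errs[t] = ((z - y) ** 2).sum()

bound = (np.abs(y).sum() ** 2) / s             # ||y||_1^2 / s
avg_err = errs.mean()                          # should be below the bound
```

The average squared distortion stays below $\|y\|_1^2/s$, as the lemma predicts; the slack is exactly $\sum_i y_i^2$ for the uncapped probabilities.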

B ADDITIONAL NOTES ON EXPERIMENTS

In addition, we normalize the outputs of Algorithm 2 and Algorithm 5 by keeping the rows and columns of A corresponding to the non-zero elements of the output vectors, computing the top singular vector of the induced matrix, and padding it with zeros. The above two considerations make our comparisons fair in terms of the function f(y) (see Section 5 for the definition of f(y)). For Algorithm 3, we fix the threshold parameter ℓ to 30 for the human genetic data as well as for the text data; we set ℓ = 10 for the gene expression data. Finally, for Algorithm 5, we fix M (the number of random Gaussian vectors) to 300 and we use Python's cvxpy package to solve eqn. (2). All experiments were run on a single-core Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz.
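The zero-padding normalization described above can be sketched as follows; this is our own minimal implementation, and `normalize_output` is a hypothetical helper name, not code from the paper.

```python
import numpy as np

# Sketch of the normalization step: keep the rows/columns of A on the
# support of the output vector, take the top eigenvector of the induced
# principal submatrix, and pad it back with zeros.
def normalize_output(A, x):
    support = np.flatnonzero(x)
    A_sub = A[np.ix_(support, support)]
    w, V = np.linalg.eigh(A_sub)       # eigenvalues in ascending order
    y = np.zeros(A.shape[0])
    y[support] = V[:, -1]              # top eigenvector, unit norm
    return y

rng = np.random.default_rng(4)
B = rng.standard_normal((20, 8))
A = B @ B.T                            # a PSD "covariance" matrix
x = np.zeros(20)
x[[1, 5, 7, 11]] = 1.0                 # some k-sparse candidate support
y = normalize_output(A, x)
val = y @ A @ y                        # the variance f(y) captured by y
```

By construction, y has the same support as x, unit Euclidean norm, and maximizes y^T A y over all unit vectors with that support.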

B.1 REAL DATA

Population genetics data. We use population genetics data from the Human Genome Diversity Panel (Consortium, 2007) and the HAPMAP (Li et al., 2008). In particular, we use the 22 matrices (one for each chromosome) that encode all autosomal genotypes. Each matrix contains 2,240 rows and a varying number of columns equal to the number of single nucleotide polymorphisms (SNPs, well-known biallelic loci of genetic variation across the human genome) in the respective chromosome. The columns of each matrix were mean-centered as a preprocessing step. See Table 4 for summary statistics. Gene expression data. We also use a lung cancer gene expression dataset (GSE10072) from the NCBI Gene Expression Omnibus database (Landi et al., 2008). This dataset contains 107 samples (58 cases and 49 controls) and 22,215 features. Both the population genetics and the gene expression datasets are interesting in the context of sparse PCA beyond numerical evaluations, since the sparse components can be directly interpreted to identify small sets of SNPs or genes that capture the data variance. We take A to be the exact covariance matrix of (X₁, X₂, …, X₁₀) to compute the top principal component. As the first two factors, V₁ and V₂, are each associated with four variables, while the last one, V₃, is associated with only two, and noting that all three factors have roughly the same variance, V₁ and V₂ are almost equally important, and both are significantly more important than V₃. Therefore, for the first sparse principal component, the ideal solution would be to use either (X₁, X₂, X₃, X₄) or (X₅, X₆, X₇, X₈). Using the true covariance matrix and the oracle knowledge that the ideal sparsity is k = 4, we apply our algorithms and compare them with the spca algorithm of (Zou et al., 2006) as well as the SDP-based algorithm of (d'Aspremont et al., 2007).
We found that while two of our methods, namely spca-d (Algorithm 3) and spca-sdp (Algorithm 5), are able to identify the correct sparsity pattern of the optimal solution, spca-r (Algorithm 2) wrongly includes the variable X₁₀ instead of X₈, possibly due to the high correlation between V₂ and V₃ (see Table 2 for details). However, the output of spca-d is much more interpretable, even though it has slightly lower PVE than spca-r. In our additional experiments on the large datasets, Figure 2b shows the performance of various SPCA algorithms on the synthetic data. Notice that the performance of the maxcomp heuristic is worse than spca as well as our algorithms. This is quite evident from the way we constructed the synthetic data: in particular, turning the bottom n/4 elements of Ṽ into large values guarantees that these would not be good elements to retain in the construction of the output vector in maxcomp, as they fail to capture the right sparsity pattern. On the other hand, our algorithms perform better than or comparably to spca. As with the real data, the performance of spca-sdp closely matches that of dec, cwpca, and spca-lowrank. In Figure 3, we demonstrate how our algorithms perform on CHR 3 and CHR 4 of the population genetics data. We see behavior similar to that observed for CHR 1 and CHR 2 in Figures 1a–1b. In Table 3, we report the variance f(y) captured by the output vectors of different methods for the text data, which again validates the accuracy of our algorithms.

Dataset                               σ₁(Z)     κ(Z)          α        γ_Z
Pit Props with k = 7                  0.9999    3.95 × 10¹⁰   1.0000   0.9999
Data of Zou et al. (2006) with k = 4  0.9999    5.14 × 10¹⁰   1.0000   0.9999
CHR 1 with k = 10,000                 0.9985    7.12 × 10⁶    1.0017   0.9968
CHR 2 with k = 10,000                 0.9996    6.94 × 10⁶    1.0003   0.9982
CHR 3 with k = 10,000                 0.9995    9.47 × 10⁷    1.0005   0.9989
CHR 4 with k = 10,000                 0.9998    1.27 × 10⁷    1.0002   0.9991
Gene expression with k = 5,000        0.9913    2.05 × 10⁶    1.0001   0.9914
Text Classification with k = 5,000    0.9997    5.78 × 10⁶    1.0001   0.9987
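The synthetic-data construction referenced above (detailed in Appendix B.2) can be sketched as follows. This is our own code, at much smaller dimensions (m = 2⁴, n = 2⁶) than in the experiments, and with a single Givens rotation G(i, j, θ) standing in for the full rotation G_n(θ).

```python
import numpy as np

# Sketch (our own, reduced-size illustration) of X = U Sigma V^T + E_sigma
# with Hadamard-based factors, following the description in Appendix B.2.
def hadamard(p):
    """Sylvester construction of a 2**p x 2**p Hadamard matrix."""
    H = np.array([[1.0]])
    for _ in range(p):
        H = np.block([[H, H], [H, -H]])
    return H

def givens(n, i, j, theta):
    """Givens rotation of the i-j plane by angle theta."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = c; G[j, j] = c; G[i, j] = -s; G[j, i] = s
    return G

rng = np.random.default_rng(7)
m, n, sigma = 2**4, 2**6, 1e-3

U = hadamard(4) / np.sqrt(m)                   # orthonormal columns
Vt = hadamard(6) / np.sqrt(n)                  # \tilde{V}, normalized columns
V = givens(n, 0, n // 2, 0.27 * np.pi) @ Vt    # rotated factor matrix

S = np.zeros((m, n))
S[0, 0] = 100.0
for i in range(1, m):
    S[i, i] = np.exp(-(i + 1))                 # Sigma_ii = e^{-i}, i = 2..m

X = U @ S @ V.T + sigma * rng.standard_normal((m, n))
```

Since U and V are orthogonal, the singular values of the noiseless part are exactly 100, e^{-2}, …, e^{-m}, so the top singular value of X is approximately 100.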



For simplicity of presentation, and following the lines of (Fountoulakis et al., 2017), we assume that the rows and columns of the matrix A have unit norm; this assumption was not necessary for the previous two algorithms and can be removed as in (Fountoulakis et al., 2017). We are also hiding a poly-logarithmic factor for simplicity, hence the Õ(·) notation. See Theorem 4.1 for a detailed statement.



left singular vectors of $A$) and $\Sigma_\ell \in \mathbb{R}^{\ell \times \ell}$ (square roots of the top $\ell$ singular values of $A$); 3: Let $R = \{i_1, \ldots, i_{|R|}\}$ be the set of rows of $U_\ell$ with squared norms at least $\epsilon/k$ and let $R \in \mathbb{R}^{n \times |R|}$ be the associated sampling matrix (see text for details); 4: $y$

$\operatorname{Tr}(A)$ and $\sigma_1(\Sigma_\ell) \le \operatorname{Tr}(A)$. The randomized block Krylov method of Musco & Musco (2015) recovers these guarantees up to a multiplicative $(1+\epsilon)$ factor, using $O\big(\frac{\log n}{\sqrt{\epsilon}}\big) \cdot \operatorname{nnz}(A)$ runtime. Thus, by rescaling $\epsilon$, we recover the same guarantees as Theorem 3.1 using an approximate SVD with nearly input-sparsity runtime.

4 SPCA VIA A SEMIDEFINITE PROGRAMMING RELAXATION

Our third algorithm is based on the SDP relaxation of eqn. (2). Recall that solving eqn. (2) returns a PSD matrix $Z^* \in \mathbb{R}^{n \times n}$ that, by the definition of the semidefinite programming relaxation, satisfies $\operatorname{Tr}(A Z^*) \ge Z^*$, where (overloading notation) $Z^*$ on the right-hand side is the true optimal value of the SPCA problem of eqn. (1).
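The approximate-SVD substitution mentioned above can be illustrated with a simple randomized subspace iteration; this is our own sketch (the paper's analysis uses the stronger block Krylov method of Musco & Musco (2015)), on a test matrix with a hand-picked spectrum.

```python
import numpy as np

# Sketch (our own illustration): approximate the top-ell eigenpairs of a
# PSD matrix by randomized subspace iteration instead of an exact SVD.
rng = np.random.default_rng(9)

n, ell = 200, 5
Q0, _ = np.linalg.qr(rng.standard_normal((n, n)))
spec = np.concatenate([[10, 9, 8, 7, 6], np.linspace(1, 0.01, n - 5)])
A = (Q0 * spec) @ Q0.T                 # PSD with known eigenvalues `spec`

G = rng.standard_normal((n, ell))      # random start block
Q, _ = np.linalg.qr(A @ G)
for _ in range(10):                    # a few power iterations
    Q, _ = np.linalg.qr(A @ Q)

# Rayleigh-Ritz: approximate top-ell eigenpairs of A from the subspace Q
T = Q.T @ A @ Q
w, _ = np.linalg.eigh(T)
approx_top = w[-1]
true_top = np.linalg.eigvalsh(A)[-1]
rel_err = abs(approx_top - true_top) / true_top
```

With the spectral gap above (λ₆/λ₁ = 1/10), a handful of iterations already drives the relative error of the top Ritz value far below the ε-level accuracy the theory asks for.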

Let $y$ and $z$ be defined as in Algorithm 5. If $\|y\|_1 \le \alpha$ and $\|y\|_2 \le \beta$, then
$$|y^\top A y - z^\top A z| \le 2\,|y^\top A (y - z)| + |(y - z)^\top A (y - z)|.$$
Moreover, with probability at least $7/8$, we have $|y^\top A(y-z)| \le 4\alpha\beta/\sqrt{s}$ and $|(y-z)^\top A(y-z)| \le 64\alpha^4/s^2 + 96\alpha\beta^3/s$.

Lemma 4.3 Let $M = 80/\epsilon^2$, $\alpha = k(1 + 2\sqrt{\log M})$, and $\beta = 2\sqrt{\log M}$. If the sparsity parameter $s$ is set to $s = 450\,\alpha^2\beta^3/\epsilon^2$, then, with probability at least $3/4$, we have $y^\top A y \le z^\top A z + \epsilon$.

Letting $y = Z^* g$, we now conclude the proof by combining eqn. (5) with the above lemma to bound (at least in expectation) the accuracy of Algorithm 5. To get a high-probability bound, we leverage a result by (Avron & Toledo, 2011) on estimating the trace of PSD matrices. This approach allows us to properly analyze step 4 of Algorithm 5, which uses multiple random Gaussian vectors to achieve measure concentration (see Appendix A.3 for details).

Fig. 1: Experimental results on real data: f(y) vs. sparsity.

denotes equality in distribution. By the independence of the $Z_k$'s, and noting that $\operatorname{var}(Z_k) = p_k(1 - p_k)$, we have that

Lemma A.6 Let $y$ and $z$ be defined as in Algorithm 5. If $\|y\|_1 \le \alpha$ and $\|y\|_2 \le \beta$, then, with probability at least $15/16$,
$$|(y - z)^\top A (y - z)| \le \frac{64\alpha^4}{s^2} + \frac{96\alpha\beta^3}{s}.$$

Fig. 2: Experimental results on synthetic data with m = 2⁷ and n = 2¹²: (a) the red and the blue lines are the sorted absolute values of the elements of the first column of matrices V and Ṽ, respectively; (b) f(y) vs. sparsity ratio.

Fig. 3: Experimental results on real data: f(y) vs. sparsity ratio.

considered a non-convex regression-type approximation, penalized similarly to LASSO. Additional heuristics based on LASSO (Ando et al., 2009) and non-convex $\ell_1$ regularizations (Zou & Hastie, 2005; Zou et al., 2006; Sriperumbudur et al., 2007; Shen & Huang, 2008) have also been explored. Random sampling approaches based on non-convex $\ell_1$ relaxations (Fountoulakis et al., 2017) have also been studied; we highlight that, unlike our approach, (Fountoulakis et al., 2017) solved a non-convex relaxation of the SPCA problem and thus perhaps relied on locally optimal solutions. Additionally, (Moghaddam et al., 2006b) considered a branch-and-bound heuristic motivated by greedy spectral ideas. (Journée et al., 2010; Papailiopoulos et al., 2013; Kuleshov, 2013; Yuan & Zhang, 2013) further explored other spectral approaches based on iterative methods similar to the power method. (Yuan & Zhang, 2013) specifically designed a sparse PCA algorithm with early stopping for the power method, based on the target sparsity. Another line of work focused on using semidefinite programming (SDP) relaxations (d'Aspremont et al., 2007; d'Aspremont et al., 2008; Amini &

Formally, let the random variables $Z_i \overset{\text{iid}}{\sim} \text{Bernoulli}(p_i)$, $i = 1 \ldots n$, denote whether the $i$-th column of $P$ and the $i$-th row of $Q$ are sampled. Define the diagonal sampling-and-rescaling matrix $S \in \mathbb{R}^{n \times n}$ by $S = \operatorname{diag}\{Z_1/\sqrt{p_1}, \ldots, Z_n/\sqrt{p_n}\}$. The sampling probabilities $\{p_i\}_{i=1}^n$ do not have to sum up to one. The number of sampled column/row pairs, denoted by $s$, satisfies $\mathbb{E}[s] =$
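A minimal sketch of this sampling-and-rescaling scheme (our own illustration, with arbitrary uniform probabilities p_i = 1/2): since E[Z_i²/p_i] = 1, the product PSSQ is an unbiased estimator of PQ.

```python
import numpy as np

# Sketch (our own illustration) of randomized matrix multiplication via
# the diagonal matrix S = diag(Z_1/sqrt(p_1), ..., Z_n/sqrt(p_n)):
# E[P S S Q] = P Q, because E[Z_i^2 / p_i] = 1.
rng = np.random.default_rng(6)

m, n, q = 15, 300, 12
P = rng.standard_normal((m, n))
Q = rng.standard_normal((n, q))
p = np.full(n, 0.5)                    # uniform sampling probabilities

trials = 4000
acc = np.zeros((m, q))
for t in range(trials):
    Z = (rng.random(n) < p).astype(float)   # Z_i ~ Bernoulli(p_i)
    d = Z / p                               # diagonal of S @ S
    acc += (P * d) @ Q                      # P S S Q: sampled, rescaled pairs
est = acc / trials
rel_err = np.linalg.norm(est - P @ Q) / np.linalg.norm(P @ Q)
```

In practice one uses non-uniform probabilities proportional to the column/row norms to minimize the variance; uniform p_i is the simplest choice for the demonstration.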

be the optimal solution to the relaxed SPCA problem of eqn. (2); 2: $M \leftarrow 80/\epsilon^2$ and $s \leftarrow O\big(k^2 \log^{5/2}(1/\epsilon)/\epsilon^2\big)$

We compare the outputs of our algorithms with those of state-of-the-art sparse PCA solvers, such as the coordinate-wise optimization algorithm of Beck & Vaisbourd (2016) (cwpca) and the block decomposition algorithm of Yuan et al. (2019) (dec), along with other standard methods such as Papailiopoulos et al. (2013) (spca-lowrank), d'Aspremont et al. (2007) (dspca), and Zou et al. (2006) (spca). For the implementation of dec, we use the coordinate descent method with workset (6, 6), and for cwpca, we use the greedy coordinate-wise (GCW) method. First, in order to explore the sparsity patterns of the outputs of our algorithms and how close they are to those of standard methods, we apply our methods to the pit props data, which was introduced in (Jeffers, 1967) and is a benchmark example used to test sparse PCA. It is a 13 × 13 correlation matrix, originally calculated from 180 observations with 13 explanatory variables. While existing algorithms aimed to extract the top 6 principal components, we restrict ourselves to the top principal component. In particular, we apply our algorithms to the Pit Props matrix with a view to extracting a top principal component having a sparsity pattern similar to that of Beck & Vaisbourd (2016),

, please see Appendix B.3 for a detailed discussion. Tightness of our bounds. We also verify the tightness of the theoretical lower bounds of our results against the guarantee of (Papailiopoulos et al., 2013) on the pit props data. We take ε = 0.1 and find that the lower bound of our spca-sdp (Algorithm 5) (dashed red line on the left) is indeed very close to that of Papailiopoulos et al. (2013) with d = 3 (dashed grey line on the left). Nevertheless, the accuracy parameter of Papailiopoulos et al. (2013) typically relies on the spectrum of A, i.e., for a highly accurate output, it can be much smaller depending on the structure of A, in which case the difference between the lower bounds of our Algorithm 5 and Papailiopoulos et al. (2013) becomes even smaller. Next, we further demonstrate the empirical performance of our algorithms on larger real-world datasets as well as on a synthetic dataset, similar to (Fountoulakis et al., 2017) (see Appendix B). We use human genetic data from the HGDP (Consortium, 2007) and HAPMAP (Li et al., 2008) (22 matrices, one for each chromosome). In addition, we also use a lung cancer gene expression dataset (107 × 22,215) from (Landi et al., 2008) and a sparse document-term matrix (2,858 × 12,427) created using the Text-to-Matrix Generator (TMG) (Zeimpekis & Gallopoulos, 2006) (see Appendix B).

To bound $B_4$, use $\|y\|_1 \le \alpha$ and $\|y\|_2 \le \beta$ to get

for all $t \ge 0$. Setting $t = 2\sqrt{\log M}$ and noting that $\|Z\|_{1,2} \le \|Z\|_{1,1} \le k$, we get the claimed bound. Proof of Lemma 4.3: Using Lemma A.7 and Lemma A.10, we conclude that $\|y\|_1 \le \alpha$ and $\|y\|_2 \le \beta$ both hold with probability at least $1 - 2/M$. Using Lemma A.4, we get
$$|y^\top A y - z^\top A z| \le 2\,|y^\top A (y - z)| + |(y - z)^\top A (y - z)|.$$

Table 3: Text data: f(y) vs. the sparsity parameter k for various SPCA algorithms.

Table 4: Statistics of the population genetics data.

Table 5: Statistics of gene expression and text data.

Values of $\sigma_1(Z)$, $\kappa(Z)$, $\alpha$, and $\gamma_Z$ for various datasets.


Next, we will show that $\operatorname{Tr}(Z_{1,\perp} A Z_{1,\perp}) \ge \sigma_n(Z) \cdot \operatorname{Tr}(A Z_{1,\perp})$. First, note that $\Sigma_{1,\perp} \succeq \sigma_n(Z) \cdot I_{n-1}$, as $\sigma_n(Z) \le \sigma_i(Z)$ for all $i = 2, \ldots, n$. Therefore, pre- and post-multiplying both sides by $\Sigma_{1,\perp}^{1/2}$, we further have $\Sigma_{1,\perp}^2 \succeq \sigma_n(Z) \cdot \Sigma_{1,\perp}$. Again, pre- and post-multiplying both sides by $U_{1,\perp}$ and $U_{1,\perp}^\top$, we have:
$$U_{1,\perp}\,\Sigma_{1,\perp}^2\, U_{1,\perp}^\top \succeq \sigma_n(Z)\, U_{1,\perp}\,\Sigma_{1,\perp}\, U_{1,\perp}^\top = \sigma_n(Z)\, Z_{1,\perp}. \quad (26)$$
As the matrix $A$ is also PSD, it has a PSD square root $A^{1/2}$ such that $A = A^{1/2} \cdot A^{1/2}$. Now, pre- and post-multiplying both sides of eqn. (26) by $A^{1/2}$, we have
$$A^{1/2}\, U_{1,\perp}\,\Sigma_{1,\perp}^2\, U_{1,\perp}^\top A^{1/2} \succeq \sigma_n(Z)\, A^{1/2}\, Z_{1,\perp}\, A^{1/2}. \quad (27)$$
Next, we rewrite $\operatorname{Tr}(Z_{1,\perp} A Z_{1,\perp})$ as follows:
$$\operatorname{Tr}(Z_{1,\perp} A Z_{1,\perp}) = \operatorname{Tr}(A^{1/2}\, Z_{1,\perp} Z_{1,\perp}\, A^{1/2}) = \sum_{i=1}^n e_i^\top A^{1/2}\, U_{1,\perp}\,\Sigma_{1,\perp}^2\, U_{1,\perp}^\top A^{1/2}\, e_i \ge \sigma_n(Z)\, \operatorname{Tr}(A Z_{1,\perp}). \quad (28)$$
In the above, $e_1, e_2, \ldots, e_n \in \mathbb{R}^n$ are the canonical basis vectors, and we have used the invariance of the matrix trace under cyclic permutations. Finally, the inequality in eqn. (28) directly follows from eqn. (27), as eqn. (27) boils down to
$$\sum_{i=1}^n e_i^\top A^{1/2}\, U_{1,\perp}\,\Sigma_{1,\perp}^2\, U_{1,\perp}^\top A^{1/2}\, e_i \ge \sigma_n(Z) \sum_{i=1}^n e_i^\top A^{1/2}\, Z_{1,\perp}\, A^{1/2}\, e_i = \sigma_n(Z)\, \operatorname{Tr}(A Z_{1,\perp}).$$
Next, we combine eqns. (25) and (28) and replace $\kappa(Z) = \sigma_1(Z)/\sigma_n(Z)$ to get
$$\operatorname{Tr}(ZAZ) \ge \sigma_1(Z)\operatorname{Tr}(AZ) - \sigma_1(Z)\Big(1 - \frac{1}{\kappa(Z)}\Big)\operatorname{Tr}(A Z_{1,\perp}) \ge \sigma_1(Z)\Big(1 - \Big(1 - \frac{1}{\kappa(Z)}\Big)\Big(1 - \frac{1}{\alpha}\Big)\Big)\operatorname{Tr}(AZ) = \gamma_Z\,\operatorname{Tr}(AZ),$$
where the equality $\operatorname{Tr}(AZ) = \operatorname{Tr}(AZ_1) + \operatorname{Tr}(AZ_{1,\perp})$ holds, and the last inequality is due to our assumption that $\operatorname{Tr}(AZ) \le \alpha\operatorname{Tr}(AZ_1)$, which implies $\operatorname{Tr}(AZ_{1,\perp}) \le (1 - 1/\alpha)\operatorname{Tr}(AZ)$. This concludes the proof. □

To finalize our proof, we use the following result of (Avron & Toledo, 2011) for estimating the trace of PSD matrices.

Theorem A.12 (Avron & Toledo, 2011) Given a PSD matrix $A \in \mathbb{R}^{n \times n}$, let $M = 80/\epsilon^2$ and let $g_i$ (for $i = 1 \ldots M$) be standard Gaussian random vectors. Then, with probability at least $7/8$,
$$\frac{1}{M}\sum_{i=1}^M g_i^\top A g_i \ge (1 - \epsilon)\operatorname{Tr}(A).$$

We are now ready to prove the correctness of Algorithm 5 by establishing Theorem A.13.

Theorem A.13 Let $Z$ be an optimal solution to the relaxed SPCA problem of eqn. (2), and assume that $\operatorname{Tr}(AZ) \le \alpha\operatorname{Tr}(AZ_1)$ for some constant $\alpha \ge 1$, where $Z_1$ is the best rank-1 approximation of $Z$. Then, there exists an algorithm that takes as input a PSD matrix $A \in \mathbb{R}^{n \times n}$, an approximation parameter $\epsilon > 0$, and a sparsity parameter $k$, and outputs a vector $z$ such that, with probability at least $5/8$,
$$z^\top A z \ge (1 - \epsilon)\,\gamma_Z \cdot Z^* - \epsilon.$$
In the above, $M = 80/\epsilon^2$ and
$$\gamma_Z = \Big(1 - \Big(1 - \frac{1}{\kappa(Z)}\Big)\Big(1 - \frac{1}{\alpha}\Big)\Big)\,\sigma_1(Z),$$
with $\sigma_1(Z)$ and $\kappa(Z)$ being the top singular value and condition number of $Z$, respectively.

Text classification data.
We also evaluate our algorithms on a text classification dataset used in (Fountoulakis et al., 2017). It consists of two publicly available standard test collections for ad hoc information retrieval system evaluation: the Cranfield collection, which contains 1,398 abstracts of aerodynamics journal articles, and the CISI (Centre for Inventions and Scientific Information) data, which contains 1,460 information science abstracts. Finally, using these two collections, a sparse 2,858 × 12,427 document-term matrix was created using the Text-to-Matrix Generator (TMG) (Zeimpekis & Gallopoulos, 2006), with the entries representing the weight of each term in the corresponding document. See Table 5 for summary statistics.

B.2 SYNTHETIC DATA

We also use a synthetic dataset generated by the same mechanism as in (Fountoulakis et al., 2017). Specifically, we construct the $m \times n$ matrix $X$ as $X = U\Sigma V^\top + E_\sigma$. Here, $E_\sigma$ is a noise matrix containing i.i.d. Gaussian elements with zero mean, and we set $\sigma = 10^{-3}$; $U \in \mathbb{R}^{m \times m}$ is a Hadamard matrix with normalized columns; $\Sigma = (\tilde{\Sigma}\;\, 0) \in \mathbb{R}^{m \times n}$, where $\tilde{\Sigma} \in \mathbb{R}^{m \times m}$ is a diagonal matrix with $\tilde{\Sigma}_{11} = 100$ and $\tilde{\Sigma}_{ii} = e^{-i}$ for $i = 2, \ldots, m$; and $V \in \mathbb{R}^{n \times n}$ satisfies $V = G_n(\theta)\,\tilde{V}$, where $\tilde{V} \in \mathbb{R}^{n \times n}$ is also a Hadamard matrix with normalized columns. Here $G(i, j, \theta) \in \mathbb{R}^{n \times n}$ denotes a Givens rotation matrix, which rotates the $i$–$j$ plane by an angle $\theta$, and $G_n(\theta)$ is composed of such rotations. For $\theta \approx 0.27\pi$ and $n = 2^{12}$, the matrix $G_n(\theta)$ rotates the bottom $n/2$ components of the columns of $\tilde{V}$, making half of them almost zero and the other half larger. Figure 2a shows the absolute values of the elements of the first column of the matrices $V$ and $\tilde{V}$. Additionally, in order to further explore the sparsity patterns of the outputs of our algorithms and how close they are to those of standard methods, we further apply our methods to a simulation example proposed by (Zou et al., 2006). We describe it below.

B.3 ADDITIONAL EXPERIMENTS

Artificial data of (Zou et al., 2006). In this example, three hidden factors $V_1$, $V_2$, and $V_3$ are created in the following way:
$$V_1 \sim \mathcal{N}(0, 290), \quad V_2 \sim \mathcal{N}(0, 300), \quad V_3 = -0.3\,V_1 + 0.925\,V_2 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1),$$
where $V_1$, $V_2$, and $\varepsilon$ are independent. Next, we create 10 observable variables $X_1, X_2, \ldots, X_{10}$ in the following way:
$$X_i = V_1 + \varepsilon_i^1, \quad \varepsilon_i^1 \sim \mathcal{N}(0, 1), \quad i = 1, 2, 3, 4,$$
$$X_i = V_2 + \varepsilon_i^2, \quad \varepsilon_i^2 \sim \mathcal{N}(0, 1), \quad i = 5, 6, 7, 8,$$
$$X_i = V_3 + \varepsilon_i^3, \quad \varepsilon_i^3 \sim \mathcal{N}(0, 1), \quad i = 9, 10,$$
where the $\varepsilon_i^j$ are independent for $i = 1, 2, \ldots, 10$ and $j = 1, 2, 3$.
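The construction above can be simulated directly. The sketch below is our own code: the definition V₃ = −0.3V₁ + 0.925V₂ + ε follows Zou et al. (2006), and a large sample covariance stands in for the exact covariance matrix used in the text.

```python
import numpy as np

# Sketch (our own simulation) of the (Zou et al., 2006) artificial data.
rng = np.random.default_rng(8)

N = 100000                               # number of simulated observations
V1 = rng.normal(0.0, np.sqrt(290), N)
V2 = rng.normal(0.0, np.sqrt(300), N)
# V3's definition is taken from Zou et al. (2006), not from the text above:
V3 = -0.3 * V1 + 0.925 * V2 + rng.normal(0.0, 1.0, N)

X = np.empty((N, 10))
for i in range(4):
    X[:, i] = V1 + rng.normal(0.0, 1.0, N)   # X1..X4 load on V1
for i in range(4, 8):
    X[:, i] = V2 + rng.normal(0.0, 1.0, N)   # X5..X8 load on V2
for i in range(8, 10):
    X[:, i] = V3 + rng.normal(0.0, 1.0, N)   # X9, X10 load on V3

A = np.cov(X, rowvar=False)              # 10 x 10 covariance matrix
```

On this covariance matrix, the two four-variable blocks (X₁–X₄ and X₅–X₈) carry nearly equal variance, which is exactly the regime where an SPCA method with k = 4 should recover one block or the other.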

