LEARNING A LATENT SIMPLEX IN INPUT-SPARSITY TIME

Abstract

We consider the problem of learning a latent k-vertex simplex K ⊂ R^d, given access to A ∈ R^{d×n}, which can be viewed as a data matrix with n points that are obtained by randomly perturbing latent points in the simplex K (potentially beyond K). A large class of latent variable models, such as adversarial clustering, mixed membership stochastic block models, and topic models, can be cast as learning a latent simplex. Bhattacharyya and Kannan (SODA, 2020) give an algorithm for learning such a latent simplex in time roughly O(k · nnz(A)), where nnz(A) is the number of non-zeros in A. We show that the dependence on k in the running time is unnecessary, given a natural assumption about the mass of the top k singular values of A, which holds in many of these applications. Further, we show this assumption is necessary, as otherwise an algorithm for learning a latent simplex would imply an algorithmic breakthrough for spectral low-rank approximation. At a high level, Bhattacharyya and Kannan provide an adaptive algorithm that makes k matrix-vector product queries to A, where each query is a function of all queries preceding it. Since each matrix-vector product requires nnz(A) time, their overall running time appears unavoidable. Instead, we obtain a low-rank approximation to A in input-sparsity time and show that the column space thus obtained has small sin Θ (angular) distance to the right top-k singular space of A. Our algorithm then selects k points in the low-rank subspace with the largest inner product (in absolute value) with k carefully chosen random vectors. By working in the low-rank subspace, we avoid reading the entire matrix in each iteration and thus circumvent the Θ(k · nnz(A)) running time.

1. INTRODUCTION

We study the problem of learning the k vertices M_{*,1}, …, M_{*,k} of a latent k-dimensional simplex K in R^d, using n data points generated from K and then possibly perturbed by a stochastic, deterministic, or adversarial source before being given to the algorithm.
In particular, the resulting points observed as input data could be heavily perturbed, so that the initial points may no longer be discernible, or they could lie outside the simplex K. Recent work of Bhattacharyya & Kannan (2020) unifies several stochastic models for unsupervised learning problems, including k-means clustering, topic models (Blei, 2012), and mixed membership stochastic block models (Airoldi et al., 2014), under the problem of learning a latent simplex. In general, identifying the latent simplex can be computationally intractable. However, many applications do not require the full generality. For example, in a mixture model such as a Gaussian mixture, the data is assumed to be generated from a convex combination of density functions. Thus, it may be possible to efficiently and approximately learn the latent simplex given certain distributional properties in these models. Indeed, Bhattacharyya & Kannan (2020) showed that given certain reasonable geometric assumptions that are typically satisfied by real-world instances of Latent Dirichlet Allocation, Stochastic Block



Models and Clustering, there exists an Õ(k · nnz(A)) time algorithm for recovering the vertices of the underlying simplex. We show that, given an additional natural assumption, we can remove the dependence on k and obtain a true input-sparsity time algorithm. We begin by defining the model along with our new assumption:

Definition 1.1 (Latent Simplex Model). Let M_{*,1}, M_{*,2}, …, M_{*,k} ∈ R^d denote the vertices of a k-simplex K, and let P_{*,1}, P_{*,2}, …, P_{*,n} ∈ R^d be n points in the convex hull of K. Given σ > 0, we observe n points A_{*,1}, A_{*,2}, …, A_{*,n} ∈ R^d such that ‖A − P‖_2 ≤ σ√n. Further, we make the following assumptions on the data generation process:

1. Well-Separateness. For all ℓ ∈ [k], M_{*,ℓ} has non-trivial mass in the orthogonal complement of the span of the remaining vertices, i.e., for all ℓ ∈ [k], ‖Proj(M_{*,ℓ}, Null(M \ M_{*,ℓ}))‖_2 ≥ α max_{ℓ'} ‖M_{*,ℓ'}‖_2, where Proj(x, U) denotes the orthogonal projection of x onto the subspace U.

2. Proximate Latent Points. For all ℓ ∈ [k], there exists a set S_ℓ ⊆ [n] such that |S_ℓ| ≥ δn and, for all j ∈ S_ℓ, ‖M_{*,ℓ} − P_{*,j}‖_2 ≤ 4σ/√δ.

3. Spectrally Bounded Perturbation. The spectrum of A − P is bounded, i.e., for a sufficiently large constant c, σ/√δ ≤ α² min_ℓ ‖M_{*,ℓ}‖_2 / (c k⁹).

4. Significant Singular Values. Let A = Σ_{i∈[d]} σ_i u_i v_i^T be the singular value decomposition of A and let 0 < φ ≤ nnz(A)/(n · poly(k)). We assume that for all i ∈ [k], σ_i > φ · σ_{k+1}, and that ‖A − A_k‖_F² ≤ φ ‖A − A_k‖_2².

These assumptions are natural across many interesting applications; see Section 2 for more details. Bhattacharyya & Kannan (2020) introduced the Well-Separateness (1), Proximate Latent Points (2), and Spectrally Bounded Perturbation (3) assumptions. We include the additional Significant Singular Values assumption (4), which is crucial for obtaining a faster running time; we discuss this in more detail below. Our main algorithmic result can then be stated as follows:

Theorem 1.2 (Learning a Latent Simplex in Input-Sparsity Time).
Given k ≥ 2 and A ∈ R^{d×n} drawn from the Latent Simplex Model (Definition 1.1), there exists an algorithm that runs in O(nnz(A) + (n + d)·poly(k/φ)) time and outputs vectors A_{R_1}, …, A_{R_k} such that, upon permuting the columns of M, with probability at least 1 − 1/Ω(√k), for all ℓ ∈ [k] we have ‖A_{R_ℓ} − M_{*,ℓ}‖_2 ≤ 300 k⁴ σ/(α√δ).

Our result implies faster algorithms for various stochastic models that can be formulated as special cases of the Latent Simplex Model, including Latent Dirichlet Allocation for topic modeling, Mixed Membership Stochastic Block Models, and adversarial clustering. We summarize the connections to these applications below. We then describe our algorithm and provide an outline of our analysis; we defer all formal proofs to the supplementary material.
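The generative process of Definition 1.1 can be illustrated with a minimal simulation. This is a sketch under stated assumptions: Gaussian noise is used as one admissible spectrally bounded perturbation (the model only constrains ‖A − P‖_2), and all dimensions are hypothetical.

```python
import numpy as np

def sample_latent_simplex(d, n, k, sigma, rng):
    """Minimal simulation of the Latent Simplex Model (Definition 1.1).

    Columns of M are the simplex vertices, columns of P are latent points
    in the convex hull of the vertices, and A is a perturbation of P.
    Gaussian noise is one admissible perturbation; the model only
    requires ||A - P||_2 <= sigma * sqrt(n).
    """
    M = rng.standard_normal((d, k))                  # vertices M_{*,1..k}
    W = rng.dirichlet(np.ones(k) / k, size=n).T      # k x n convex weights
    P = M @ W                                        # latent points in conv(K)
    A = P + (sigma / np.sqrt(d)) * rng.standard_normal((d, n))
    return M, P, A

rng = np.random.default_rng(0)
M, P, A = sample_latent_simplex(d=50, n=200, k=5, sigma=0.1, rng=rng)
```

For these parameters the spectral norm of the noise, roughly σ(1 + √(n/d)), sits comfortably below the model's budget of σ√n.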

2. CONNECTION TO STOCHASTIC MODELS

We first formalize the connection between the Latent Simplex Model and numerous stochastic models. In particular, we show that topic models like Latent Dirichlet Allocation (LDA) and Stochastic Block Models can be viewed as special cases of the Latent Simplex Model; we defer discussion on Adversarial Clustering to the supplementary material.

2.1. TOPIC MODELS

Probabilistic topic models attempt to identify abstract topics in a collection of documents by discovering latent semantic structure (Blei & Jordan, 2003; Blei & Lafferty, 2006; Hoffman et al., 2010; Zhu et al., 2012; Blei, 2012). Each document in the corpus is represented by a bag-of-words vectorization with the corresponding word frequencies. The standard statistical assumption is that the generative process for the corpus is a joint probability distribution over both the observed and hidden random variables, where the hidden random variables can be interpreted as representative documents for each topic. The goal is then to design algorithms that can learn the underlying topics. The topics can be viewed geometrically as k latent vectors M_{*,1}, M_{*,2}, …, M_{*,k} ∈ R^d, where d is the size of the dictionary and M_{i,ℓ} is the expected frequency of word i in topic ℓ. Since each vector M_{*,ℓ} represents a probability distribution, Σ_i M_{i,ℓ} = 1. Let M be the corresponding d × k matrix. One important stochastic model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), where each document, consisting of m words, is generated as follows:

• For each document j, we pick topic weights (W_{j,1}, …, W_{j,k}) ∼ Dir(1/k), where Dir(1/k) is the Dirichlet distribution over the unit simplex. The topic distribution of document j is determined by the topic weights and given by P_{*,j} = Σ_{ℓ∈[k]} W_{j,ℓ} · M_{*,ℓ}, where the P_{*,j} are latent points.

• We then generate the j-th document with m words by taking i.i.d. samples from Mult(P_{*,j}), the multinomial distribution with P_{*,j} as the probability vector. The resulting observed document is denoted by the vector A_{*,j}, where for all i ∈ [d], A_{i,j} = (1/m) Σ_{t=1}^m X_{ij}^{(t)}, with X_{ij}^{(t)} ∼ Bern(P_{i,j}); here X_{ij}^{(t)} = 1 if the i-th word was chosen in the t-th draw while generating the j-th document, and 0 otherwise.
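The LDA generative process above can be sketched directly. This is a toy simulation: the topic matrix below is random and all sizes are illustrative.

```python
import numpy as np

def generate_lda_corpus(M, n, m, rng):
    """Toy simulation of the LDA generative process described above.

    M is the d x k topic matrix (columns are word distributions). Each of
    the n documents draws topic weights from Dir(1/k), forms the mixture
    P_{*,j}, then draws m i.i.d. words; A_{*,j} holds empirical word
    frequencies, so E[A_{*,j}] = P_{*,j}.
    """
    d, k = M.shape
    W = rng.dirichlet(np.ones(k) / k, size=n)        # n x k topic weights
    P = M @ W.T                                      # d x n latent points
    A = np.zeros((d, n))
    for j in range(n):
        words = rng.choice(d, size=m, p=P[:, j])     # m multinomial draws
        A[:, j] = np.bincount(words, minlength=d) / m
    return P, A

rng = np.random.default_rng(1)
M = rng.dirichlet(np.ones(20), size=3).T             # 20 words, 3 topics
P, A = generate_lda_corpus(M, n=100, m=500, rng=rng)
```

Each column of P and of A sums to one, matching the requirement that documents and topics are probability vectors.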
The data generation process of LDA can thus be viewed as a special case of the Latent Simplex Model, where the j-th document is the data point A_{*,j} generated from the stochastic vector P_{*,j}, a point in the simplex K. The vertices of the simplex are the k topic vectors M_{*,1}, …, M_{*,k}; the goal is then to recover the vertices of K. We formally justify our assumptions below.

Lemma 2.1 (LDA as a Latent Simplex). Given A, P, M following the LDA model as described above, such that for all ℓ ∈ [k], ‖M_{*,ℓ}‖_2 = Ω(1), m, n = Ω(poly(k/α)), and δ = cσ/√k, assumptions (2), (3), and (4) from Definition 1.1 are satisfied with high probability.

2.2. MIXED MEMBERSHIP STOCHASTIC BLOCK MODELS

The Stochastic Block Model (Airoldi et al., 2008; Miller et al., 2009; Xing et al., 2010; Fu et al., 2009; Li et al., 2016; Fan et al., 2016) is a well-studied stochastic model for generating random graphs, where the vertices are partitioned into k communities and edges within each community are more likely to occur than edges across communities. Given communities C_1, C_2, …, C_k, there exists a k × k symmetric latent matrix B, where B_{ℓ_1,ℓ_2} is the probability that there exists an edge between vertices in C_{ℓ_1} and C_{ℓ_2}. The MMBM can be formalized as the following stochastic process:

• For each j ∈ [n], vertex j picks a probability vector W_{*,j} ∈ R^k representing community membership probabilities that sum to 1, i.e., W_{*,j} ∼ Dir(1/k).

• For each pair (j_1, j_2) ∈ [n] × [n], vertex j_1 picks a community ℓ_1 ∼ Mult(W_{*,j_1}) and vertex j_2 picks a community ℓ_2 ∼ Mult(W_{*,j_2}). The edge (j_1, j_2) is included in the graph with probability B_{ℓ_1,ℓ_2}.

Since Σ_{ℓ_1,ℓ_2} W_{ℓ_1,j_1} B_{ℓ_1,ℓ_2} W_{ℓ_2,j_2} is the probability of the edge (j_1, j_2), the latent variable matrix P of edge probabilities can be written as P = W^T B W. However, the reduction is not straightforward, since P now depends quadratically on W, and the known polynomial time algorithms for recovering B directly rely on semidefinite programming. Further, they require non-degeneracy assumptions in order to compute a tensor decomposition provably in polynomial time (Anandkumar et al., 2014; Hopkins & Steurer, 2017). However, we can pose the problem of recovering the k underlying communities differently: we first pick a random subset V_1 ⊂ [n] of d vertices, and represent the ℓ-th community by a d-dimensional vector of the probabilities that a vertex in [n] \ V_1 belonging to community ℓ has an edge with each of the d vertices in V_1.
We now define W_{(1)} to be the k × d matrix representing the fractional membership weights of vertices in V_1, and W_{(2)} to be the analogous k × n matrix for vertices in [n] \ V_1. Observe that the probability matrix P can now be represented as W_{(1)}^T B W_{(2)}. The reduction to the Latent Simplex Model can then be stated as follows: given a data matrix A, which is the adjacency matrix of the community graph, with latent variable matrix P, recover the simplex M = W_{(1)}^T B. Further, Airoldi et al. (2008) assume that each column of W_{(2)} is picked from the Dirichlet distribution with parameter 1/k. Combined with tools from random matrix theory (Vershynin, 2010), Bhattacharyya & Kannan (2020, Lemma 7.2) show that the Proximate Latent Points and Spectrally Bounded Perturbation assumptions hold for Stochastic Block Models. As for the Significant Singular Values assumption, it is satisfied when σ is a small enough polynomial in k.

Justifying Significant Singular Values. We give the following further justification for assumption (4) in the supplementary material: a faster algorithm using only the assumptions from Bhattacharyya & Kannan (2020) would imply an algorithmic breakthrough for spectral low-rank approximation and partially resolve the first open question of Woodruff (2014).

Theorem 2.2 (Spectral LRA and Learning a Simplex (informal)). There exists a distribution over instances such that learning a latent simplex in o(nnz(A) · k) time with good probability implies a constant-factor spectral low-rank approximation algorithm in the same running time.
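The bipartite reduction described above can be sketched as a small simulation. All parameters here are hypothetical, and B is an arbitrary symmetric matrix of edge probabilities.

```python
import numpy as np

def mmbm_bipartite_instance(k, d, n, rng):
    """Toy instance of the bipartite MMBM reduction described above.

    W1 (k x d) and W2 (k x n) hold Dirichlet(1/k) membership vectors for
    the sampled vertex set V_1 and the remaining vertices, B is a k x k
    symmetric matrix of community edge probabilities, and the observed
    block A of the adjacency matrix has independent Bernoulli entries
    with means P = W1^T B W2. The latent simplex is M = W1^T B.
    """
    B = rng.uniform(0.05, 0.5, size=(k, k))
    B = (B + B.T) / 2                                # symmetric probabilities
    W1 = rng.dirichlet(np.ones(k) / k, size=d).T     # k x d memberships
    W2 = rng.dirichlet(np.ones(k) / k, size=n).T     # k x n memberships
    P = W1.T @ B @ W2                                # d x n edge probabilities
    A = (rng.random((d, n)) < P).astype(float)       # observed adjacency block
    return W1.T @ B, P, A

rng = np.random.default_rng(2)
M, P, A = mmbm_bipartite_instance(k=4, d=30, n=50, rng=rng)
```

Every entry of P is a convex combination of entries of B, so P is a valid matrix of edge probabilities, and A is the 0/1 data matrix handed to the learning algorithm.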

3. ALGORITHM AND ANALYSIS

Preliminaries. We use n, d, and k to denote the number of data points, the dimension of the ambient space, and the number of vertices of K, respectively. We use A_{*,j} to denote the j-th column of a matrix A. For A ∈ R^{d×n} with rank r, its singular value decomposition, SVD(A) = UΣV^T, guarantees that U is a d × r matrix with orthonormal columns, V^T is an r × n matrix with orthonormal rows, and Σ is an r × r diagonal matrix. The diagonal entries of Σ are the singular values of A, denoted σ_1 ≥ σ_2 ≥ … ≥ σ_r. Given an integer k ≤ r, we define the truncated singular value decomposition of A that zeros out all but the top k singular values of A, i.e., A_k = UΣ_k V^T, where Σ_k has only k non-zero entries along the diagonal. It is well known that the truncated SVD computes the best rank-k approximation to A under the Frobenius norm, i.e., A_k = argmin_{rank(X)≤k} ‖A − X‖_F. Given an orthonormal basis U for a subspace, we use P_U = UU^T to denote the corresponding projection matrix. We consider the following notion of subspace distance:

Definition 3.1 (sin Θ Distance). For any two subspaces R, S of R^d, the sin Θ distance between R and S is defined as sin Θ(R, S) = max_{u∈R} min_{v∈S} sin θ(u, v) = max_{u∈R, ‖u‖=1} min_{v∈S} ‖u − v‖.

We use the notion of spectral low-rank approximation to obtain a compact representation of the input and compute matrix-vector products efficiently. We also require the notion of mixed spectral-Frobenius low-rank approximation; this guarantee is weaker than spectral low-rank approximation but admits faster algorithms.

Definition 3.2 (Spectral Low-Rank Approximation, Spectral-Frobenius Low-Rank Approximation). Given a matrix A, an integer k, and ε > 0, a rank-k matrix B satisfies a relative-error spectral low-rank approximation guarantee if ‖A − B‖_2² ≤ (1 + ε)‖A − A_k‖_2². B satisfies a mixed spectral-Frobenius low-rank approximation guarantee if ‖A − B‖_2² ≤ (1 + ε)‖A − A_k‖_2² + (ε/k)‖A − A_k‖_F².
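A simple sketching-based construction in the spirit of Definition 3.2 can be written in a few lines. This is a simplified sketch, not the exact algorithm of the references: we project A onto the column span of A·S for a CountSketch matrix S with k² columns and then truncate to rank k.

```python
import numpy as np

def countsketch_lra(A, k, rng):
    """Simplified sketch of a mixed spectral-Frobenius low-rank
    approximation: project A onto the column span of A S, where S is a
    CountSketch matrix (one random +/-1 entry per row) with k^2 columns,
    then truncate to rank k. Returns factors Y (d x k) and Z^T (k x n)."""
    d, n = A.shape
    c = k * k
    S = np.zeros((n, c))
    S[np.arange(n), rng.integers(0, c, size=n)] = rng.choice([-1.0, 1.0], size=n)
    Q, _ = np.linalg.qr(A @ S)            # orthonormal basis for the sketch
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    Y = Q @ (U[:, :k] * s[:k])            # d x k left factor
    Zt = Vt[:k, :]                        # k x n right factor
    return Y, Zt
```

Forming A·S costs O(nnz(A)) time since S has one nonzero per row, and the remaining work involves only d × k² and k² × n matrices.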

3.1. OVERVIEW

In this section, we provide an overview of our algorithmic techniques and discuss the main challenges we overcome to obtain an input-sparsity time algorithm.

Our Techniques. The starting point in Bhattacharyya & Kannan (2020) is that the smoothed polytope, obtained by averaging points in the data matrix A, is itself close in operator norm to the latent points in the convex hull of K. This fact is captured by the following lemma:

Lemma 3.3 (Subset Smoothing). For any S ⊂ [n], let A_S be the vector obtained by averaging the columns of A indexed by S, and define P_S similarly. Then, for ‖A − P‖_2 ≤ σ√n, we have ‖A_S − P_S‖_2 ≤ σ√(n/|S|).

Our main insight is that we can approximately optimize a linear function over the smoothed polytope by working with a rank-k spectral approximation to A instead. Geometrically, this implies that while the smoothed polytope is perhaps d-dimensional, projecting it onto the k-dimensional space spanned by the top-k singular vectors of the data matrix A suffices to recover the latent k-simplex K. This is surprising, since the data matrix can contain points significantly far from the latent polytope. Further, this approach presents several challenges: we do not have access to the left singular space of A, and even if we were provided this subspace exactly, it is unclear why it spans a set of points that approximate the vertices of K. Finally, the points obtained by smoothing the projected polytope have no immediate relation to points in the smoothed high-dimensional polytope considered by Bhattacharyya & Kannan (2020). We would like to begin by computing a spectral low-rank approximation (Definition 3.2) of A. Since a low-rank approximation to A can be represented in factored form YZ^T, where Y is d × k and Z^T is k × n, any matrix-vector product of the form YZ^T · x requires only (n + d)k time. Thus, optimizing a linear function k times over a smoothed low-rank polytope requires only (n + d)k² time, circumventing the previous bound of k · nnz(A).
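The cost saving from the factored form is elementary: associating the product right-to-left never materializes the d × n matrix. A minimal illustration with hypothetical dimensions:

```python
import numpy as np

def factored_matvec(Y, Zt, x):
    """Compute (Y Z^T) x in O((n + d) k) time by evaluating Z^T x first,
    never forming the d x n product Y Z^T explicitly."""
    return Y @ (Zt @ x)

rng = np.random.default_rng(3)
d, n, k = 40, 60, 4
Y = rng.standard_normal((d, k))
Zt = rng.standard_normal((k, n))
x = rng.standard_normal(n)
# Same result as the dense product, at a fraction of the cost.
assert np.allclose(factored_matvec(Y, Zt, x), (Y @ Zt) @ x)
```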
However, the best known algorithm for spectral low-rank approximation (Theorem 1 in Musco & Musco (2015)) requires Õ(nnz(A) · k/√ε) time and thus provides no improvement. A natural direction to pursue is then to compute a Frobenius-norm low-rank approximation (which requires nnz(A) time) of A and use this as our proxy. However, a Frobenius low-rank approximation is too coarse to obtain a subspace that is close to the top-k singular vectors of A.

Algorithm 1: Learning a Latent k-Simplex in Input-Sparsity Time
Input: A matrix A ∈ R^{d×n}, an integer k, and ε > 0.
1. Using the algorithm from Lemma 3.4, compute rank-k matrices Y, Z such that YZ^T is a spectral low-rank approximation to A, i.e., ‖A − YZ^T‖_2² ≤ (1 + ε)‖A − A_k‖_2².
2. Let S = ∅. For each t ∈ [k]:
   (a) Let U_t be an orthonormal basis for the vectors in S.
   (b) Compute the projection matrix P_t = U_t U_t^T onto the span of S.
   (c) Let g ∼ N(0, I_k) and let u_t = gY^T(I_d − P_t)YZ^T be a random vector in R^n. Compute R_t ⊂ [n], the subset of δn indices corresponding to the largest coordinates of u_t in absolute value.
   (d) Let A_{R_t} be the average of the columns of A indexed by R_t. Update S = S ∪ {A_{R_t}}.
Output: The set of vectors A_{R_1}, A_{R_2}, …, A_{R_k} as our approximation to the vertices of the latent k-simplex K.

Instead, we compute a mixed spectral-Frobenius low-rank approximation (see Definition 3.2), which runs in O(nnz(A) + dk²) time but whose error guarantee is weaker; in particular, it incurs an additive (1/k)‖A − A_k‖_F² term. Here, we use the assumption we introduced (Significant Singular Values) to show that the low-rank matrix obtained from this algorithm also satisfies a relative-error spectral low-rank approximation guarantee. The next challenge is that the aforementioned guarantee only bounds the spectral norm of A − YZ^T in terms of the (k + 1)-st singular value of A.
This guarantee does not relate how close the subspaces spanned by the columns and rows of the low-rank approximation are to the top-k singular spaces of A. A key technical contribution of our work is thus to prove that the subspaces obtained via spectral low-rank approximation are close to the true left and right top-k singular spaces in angular (sin Θ) distance. We note that such a guarantee is crucial to approximately optimizing a linear function over A. Further, this result provides an intriguing connection between spectral low-rank approximation and power iteration. It is well known that power iteration suffices to obtain a subspace that is close to the top-k subspace of a matrix in sin Θ distance, which at first glance appears much stronger than spectral low-rank approximation. However, our work implies that it suffices to compute a spectral low-rank approximation, which provides a succinct representation of the data matrix and can be computed faster than power iteration in several natural settings. In the context of learning the latent simplex, given a spectral low-rank approximation YZ^T, we first restrict to the column span of Y, which w.l.o.g. has orthonormal columns, and iteratively generate k vectors in this subspace. In the first iteration, we generate a random vector gY^T and compute gY^T YZ^T. We then consider the δn largest coordinates (in absolute value) of gY^T YZ^T. While the resulting vector does not have strong provable guarantees, we show that averaging the columns of A corresponding to these indices results in a vector, A_{R_1}, which intuitively corresponds to efficiently optimizing a linear function over a low-rank approximation to the smoothed polytope, where the smoothed polytope is obtained by averaging over all subsets of δn data points. Our next contribution is to show that the vector A_{R_1} obtained by this algorithmic process is indeed close to a vertex of K.
To obtain an approximation to the remaining vertices of K, we consider the following iterative process: in the t-th iteration, consider the subspace Y^T(I − P_t), where (I − P_t) is the projection onto the orthogonal complement of the span of A_{R_1}, A_{R_2}, …, A_{R_{t−1}}. Then generate a random vector gY^T(I − P_t), compute the δn largest coordinates of gY^T(I − P_t)YZ^T, and average the corresponding columns of A to obtain the output vector A_{R_t}. We prove that after k iterations, the vectors A_{R_1}, A_{R_2}, …, A_{R_k} approximate all the vertices of the latent simplex K within the desired accuracy and running time. In contrast, prior work of Bhattacharyya & Kannan (2020) uses power iteration to approximate the left top-k singular space U_k of A by a subspace V that is poly(α/k)-close in sin Θ distance. Each step of the power iteration uses O(nnz(A) + dk²) time and is repeated log(d) times. Next, they pick a random vector u_1 in the subspace spanned by V and compute A_{R_1} = argmax_{S : |S|=δn} |u_1 · A_S|, using the resulting vector as an approximation to some vertex M_{*,ℓ_1}. They then repeat the above procedure k times, and in the i-th iteration pick u_i to be a uniformly random direction in the (k − i)-dimensional subspace constructed as follows: let V_{i−1} be an orthonormal basis for A_{R_1}, A_{R_2}, …, A_{R_{i−1}}, and sample u_i from the component of V orthogonal to V_{i−1}. Intuitively, this corresponds to sampling a random vector from the subspace orthogonal to the set of vertex approximations picked thus far. The resulting k vectors A_{R_1}, …, A_{R_k} are the approximation to the vertices of the latent simplex. Since they directly optimize over the smoothed polytope, the correctness analysis is more straightforward. However, each iteration of the algorithm requires optimizing a linear function over the smoothed polytope, and in particular computing u_i · A; thus, the overall running time is dominated by k · nnz(A).
Since the latent simplex satisfies the Well-Separateness condition, the inner product with a random direction is maximized by a unique vertex. Intuitively, it appears necessary to project away from the set of vectors obtained up to the i-th iteration in order to learn new vertices of K. The inherently iterative nature of the algorithm combined with matrix-vector product lower bounds indicates that the new algorithmic ideas we introduce are in fact necessary.
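Putting the pieces of this overview together, the main algorithm can be sketched numerically as follows. This is a simplified sketch: a truncated SVD stands in for the input-sparsity low-rank approximation of Lemma 3.4, and the dense projection matrix is formed explicitly for readability.

```python
import numpy as np

def learn_latent_simplex(A, k, delta, rng):
    """Sketch of the selection loop of Algorithm 1. The input-sparsity
    low-rank approximation of Lemma 3.4 is replaced by a truncated SVD
    for simplicity; the loop follows steps 2(a)-(d)."""
    d, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Y, Zt = U[:, :k], np.diag(s[:k]) @ Vt[:k, :]     # A ~ Y Z^T
    selected = []
    for _ in range(k):
        if selected:
            Q, _ = np.linalg.qr(np.column_stack(selected))
            Pt = Q @ Q.T                              # projection onto selected span
        else:
            Pt = np.zeros((d, d))
        g = rng.standard_normal(k)
        # u_t = g Y^T (I - P_t) Y Z^T, evaluated without forming Y Z^T
        u = ((g @ Y.T @ (np.eye(d) - Pt)) @ Y) @ Zt
        R = np.argsort(-np.abs(u))[: max(1, int(delta * n))]
        selected.append(A[:, R].mean(axis=1))         # subset smoothing
    return np.column_stack(selected)
```

On an easy instance where the data consists of well-separated clusters of identical points, this sketch recovers the cluster centers exactly, illustrating how projecting away the selected vertices forces each iteration toward a new vertex.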

3.2. TECHNICAL DISCUSSION

In this section, we provide an outline of our proof; we defer the full proofs to the supplementary material. We start with a spectral low-rank approximation of A. We then use the right factor as an approximation to Σ_k V_k^T and the left factor as an approximation to U_k.

Lemma 3.4 (Input-Sparsity Spectral LRA (Cohen et al., 2015; 2017)). Given a matrix A ∈ R^{d×n}, ε, δ > 0, and k ∈ N, there exists an algorithm that outputs matrices Y, Z such that, with probability at least 1 − δ, ‖A − YZ^T‖_2² ≤ (1 + ε)‖A − A_k‖_2² + (ε/k)‖A − A_k‖_F², in time O(nnz(A) + (n + d)·poly(k/(εδ))).

Under the Significant Singular Values condition (4), setting ε = φ in Lemma 3.4 implies that, with probability 99/100, (1/poly(k)) Σ_{i=k+1}^n σ_i² = (1/poly(k))‖A − A_k‖_F² ≤ σ_{k+1}² = ‖A − A_k‖_2², and thus ‖A − YZ^T‖_2² ≤ 2‖A − A_k‖_2². Further, such a matrix YZ^T can be computed in O(nnz(A) + (n + d)·poly(k/φ)) time. Next, we show that if YZ^T is a good rank-k spectral approximation to A, then the subspace spanned by the columns of Y must be close to the column span of U_k, the top-k left singular vectors of A. We begin by recalling Wedin's sin Θ theorem, which relates norms of projectors to angular distance:

Theorem 3.5 (sin Θ Theorem (Wedin, 1972)). Let R, S ∈ R^{d×n} and let 0 < m ≤ ℓ be integers. Let R_m and S_ℓ denote the subspaces spanned by the top m singular vectors of R and the top ℓ singular vectors of S, respectively. Suppose γ = σ_m(R) − σ_{ℓ+1}(S) > 0. Then sin Θ(R_m, S_ℓ) ≤ ‖R − S‖_2 / γ.

We use the above theorem to bound the spectral norm of the projectors onto the relevant subspaces.

Lemma 3.6 (Proximity of Subspace Projections). Let Y be as defined in Algorithm 1 and let U_k be the subspace spanned by the top k left singular vectors of A. Let P_Y and P_{U_k} be the d × d projection matrices onto the column spans of Y and U_k. Then ‖P_Y − P_{U_k}‖_2 ≤ 1/(1000k^{10}).

Proof Sketch. Suppose, by way of contradiction, that ‖P_Y − P_{U_k}‖_2 ≥ 1/(1000k^{10}).
This implies that ‖U_k U_k^T − YY^T‖_F² is large, and thus ‖U_k^T Y‖_F² is at most k − 1/(1000k^{10})². Intuitively, we show that if this inner product term is small, we can obtain a lower bound on ‖A − P_Y A‖_2, as follows: an upper bound on ‖U_k^T Y‖_F² suffices to obtain an upper bound on the k-th singular value of U_k^T YY^T U_k via averaging. This, in turn, lower bounds ‖A − P_Y A‖_2 = ‖UΣ − YY^T UΣ‖_2 by σ_k(A)/(1000k^{10})², contradicting the Significant Singular Values assumption.

Our analysis proceeds via induction on the number of iterations performed by the algorithm. Suppose our algorithm has selected t points from our approximation of the top-k subspace, and that these points are reasonably close to t vertices of the k-simplex. In the (t + 1)-st iteration, we again bound the sin Θ distance between Y^T(I − P_t), which corresponds to our approximation of the top-k subspace projected away from the selected vectors, and the actual k-simplex projected away from the corresponding vertices closest to our selected vectors. This argues that we can continue selecting random vectors in the subspace spanned by Y^T(I − P_t) as a close approximation to random vectors in M(I − P_t). We first bound the k-th singular values of the matrix of simplex vertices (M) and of the latent variables (P):

Lemma 3.7 (Bhattacharyya & Kannan, 2020). If the underlying points M satisfy the Well-Separateness and Spectrally Bounded Perturbation assumptions, then σ_k(M) ≥ 1000k^{8.5}σ/(α²√δ) and σ_k(P) ≥ 995k^{8.5}σ√n/α².

Next, we prove Lemma 3.8 (stated below), relating the angular distance of the subspace obtained in the r-th iteration of the algorithm, Y(I − P_r), to the optimal subspace, M(I − P_r).

Proof Sketch. Let y ∈ Y(I_d − P_r) be a unit vector. Using the sin Θ theorem and the inductive hypothesis, it can be shown that there exists x ∈ Span(M) with ‖x − y‖_2 ≤ α²/(500k^{8.5}). Let z = x − M̂M̂†x be the component of x in Null(M̂), where M̂ denotes the matrix whose columns are the vertices of K closest to the selected points A_{R_1}, …, A_{R_r} (see Lemma 3.8).
We can then bound ‖x − z‖_2 ≤ ‖x − y‖_2 + ‖M̂(M̂^T M̂)^{−1}(M̂^T − Â^T)y‖_2, where Â is the matrix of vectors selected thus far and Â^T y = 0. Combining the aforementioned observations, we can bound ‖x − z‖_2 by α²/(500k^{8.5}) + k^{4.5}σ/(α√δ · σ_k(M̂)). We then observe that y ∈ Y(I_d − P_r) and z ∈ Span(M) ∩ Null(M̂), and appeal to Lemma 10.1 in Bhattacharyya & Kannan (2020) to yield sin Θ(Y(I_d − P_r), Span(M) ∩ Null(M̂)) ≤ α/(100k⁴). To prove the second half of the claim, it suffices to show that the dimension of Y(I_d − P_r) is k − r, since Span(M) ∩ Null(M̂) has dimension k − r and the sin Θ distance is symmetric between two subspaces of the same dimension. By construction, Y has dimension k, so Y(I_d − P_r) has dimension at least k − r. If its dimension were larger, there would exist vectors u_1, …, u_{k−r+1} ∈ Y(I_d − P_r) and vectors v_1, …, v_{k−r+1} ∈ Span(M) ∩ Null(M̂) such that ‖u_i − v_i‖_2 < α/(100k⁴). We can then upper and lower bound |v_a · v_b| for all a ≠ b to conclude that v_1, …, v_{k−r+1} are pairwise orthogonal, contradicting the fact that Span(M) ∩ Null(M̂) is a (k − r)-dimensional space, and the claim follows.

Now we need to show that our algorithm (1) is well-defined and (2) preserves the invariant that the (i + 1)-st point, sampled from Y^T(I − P_i), is also reasonably close to some new vertex of the k-simplex. We show the selection procedure is well-defined in Lemma 3.9 by arguing that there exists a unique solution to the maximization problem.

Lemma 3.9 (Optimization is Well-Defined). Let u ∈ R^d be a random unit vector in the span of Y^T(I_d − P_r), where P_r is the orthogonal projection onto A_{R_1}, …, A_{R_r}. Then there exists a constant c > 0 such that with probability at least 1 − c/k:



Throughout the paper, we use the notation Õ(·) to suppress poly-logarithmic factors.



Lemma 3.8 (Angular Distance Between Subspaces). For some r ∈ [k], let M̂ = [M_{*,ℓ_1} ∘ ⋯ ∘ M_{*,ℓ_r}] be the matrix whose r columns are the vertices of the latent k-simplex M closest to the first r points selected by Algorithm 1, A_{R_1}, …, A_{R_r}, respectively. Suppose ‖A_{R_i} − M_{*,ℓ_i}‖_2 ≤ 300k⁴σ/(α√δ) for each i ∈ [r], and let P_r be the orthogonal projection onto A_{R_1}, …, A_{R_r}. Then sin Θ(Y(I_d − P_r), Span(M) ∩ Null(M̂)) ≤ α/(100k⁴) and sin Θ(Span(M) ∩ Null(M̂), Y(I_d − P_r)) ≤ α/(100k⁴).
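The sin Θ distance of Definition 3.1, which Lemma 3.8 bounds, can be computed numerically via the identity sin Θ(R, S) = ‖(I − P_S)Q_R‖_2, where Q_R is an orthonormal basis of R. A small numerical sketch:

```python
import numpy as np

def sin_theta(R, S):
    """sin Theta distance (Definition 3.1) between the column spans of R
    and S: the largest distance from a unit vector in span(R) to span(S),
    computed as ||(I - P_S) Q_R||_2 for an orthonormal basis Q_R."""
    QR, _ = np.linalg.qr(R)
    QS, _ = np.linalg.qr(S)
    return np.linalg.norm(QR - QS @ (QS.T @ QR), 2)

# Rotating a direction by angle t away from a subspace gives sin Theta = sin(t).
t = 0.3
R = np.array([[np.cos(t)], [np.sin(t)], [0.0]])
S = np.array([[1.0], [0.0], [0.0]])
```

Note that the distance is asymmetric when the subspaces have different dimensions, which is why Lemma 3.8 states the bound in both directions.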

Figure 4.2: Mean runtime comparison of algorithms across parameters on real-world data.

Figure 4.1: Mean runtime comparison of algorithms across parameters on synthetic data.

running time of both algorithms across each dataset among choices of k ∈ {20, 50, 100}. We observe that the resulting matrix has sparsity roughly 1000, which is consistent with p ≈ 1/n and is much less than the sparsity parameters tested on our synthetic data.


ACKNOWLEDGMENTS

A.B. and D.W. were supported by Office of Naval Research (ONR) grant N00014-18-1-2562 and National Science Foundation (NSF) grant no. CCF-1815840. D.W. and S.Z. were supported by National Institutes of Health (NIH) grant 5R01 HG 10798-2 and a Simons Investigator Award.


Published as a conference paper at ICLR 2021

2. For all a ∉ {ℓ_1, …, ℓ_r}, |u · M_{*,a}| ≥ (0.0989/k⁴) · α max_ℓ ‖M_{*,ℓ}‖_2.

We then show that the algorithm preserves the aforementioned invariant by proving that the unique solution A_{R_i} cannot correspond to one of the vertices of the k-simplex found in the first i rounds; thus, the solution A_{R_i} corresponds to a new vertex of M. We then show that A_{R_i} is close to this new vertex, preserving the inductive hypothesis.

Lemma 3.10 (Recovery Guarantees). Proof Sketch. We consider the case u · A_{R_{r+1}} ≥ 0, as the analysis for the opposite case is symmetric. It can be shown that ℓ_{r+1} ∉ {ℓ_1, …, ℓ_r}; thus, applying Lemma 3.9, u · M_{*,ℓ_{r+1}} ≥ 0.0989 α max_ℓ ‖M_{*,ℓ}‖_2 / k⁴. By the Proximate Latent Points assumption, there exists a set S_{ℓ_{r+1}} of size δn such that ‖P_{*,j} − M_{*,ℓ_{r+1}}‖_2 ≤ 4σ/√δ for all j ∈ S_{ℓ_{r+1}}, so that ‖P_{*,S_{ℓ_{r+1}}} − M_{*,ℓ_{r+1}}‖_2 ≤ 4σ/√δ. Then, by Lemma 3.1 in Bhattacharyya & Kannan (2020), together with Lemma 3.9 and the Spectrally Bounded Perturbation assumption, we obtain matching upper and lower bounds on u · A_{R_{r+1}}; combining these bounds, straightforward computations yield the claim.

In contrast to Bhattacharyya & Kannan (2020), we only need input-sparsity time to compute the low-rank approximation to A. The subsequent k iterations of selecting points from Y^T(I − P_i) are computed in the low-dimensional space and contribute only lower-order terms. Hence, the dominating term in the final running time is the input-sparsity time used to compute the low-rank approximation YZ^T.

4. EMPIRICAL EVALUATION

In this section, we describe a series of experiments demonstrating the advantage of our algorithm, performed in Python 3.6.9 on an Intel Core i7-8700K 3.70 GHz CPU with 12 cores and 64 GB DDR4 memory, using an Nvidia GeForce GTX 1080 Ti 11 GB GPU, on both synthetic and real-world data. Whereas previous work requires computing the top-k subspace as a pre-processing step, our main improvement is that we only require a crude approximation. Thus, we compared the running time for finding the top-k subspace, as required by Bhattacharyya & Kannan (2020), to that for finding a mixed spectral-Frobenius approximation using an input-sparsity algorithm, as required by our algorithm. For the former, we use the svds method from the scipy sparse linalg package, optimized by LAPACK. For the latter, Cohen et al. (2015; 2017) show that using a sparse CountSketch matrix (Clarkson & Woodruff, 2013; Meng & Mahoney, 2013; Nelson & Nguyen, 2013), i.e., a matrix with O(k²) columns and a single nonzero entry in each row, in a random location and with a random sign, suffices to obtain a mixed spectral-Frobenius guarantee; we evaluate such a matrix with exactly k² columns. Across all parameters and datasets, the input-sparsity procedure used by our algorithm significantly outperforms the optimized power iteration methods required by Bhattacharyya & Kannan (2020).

Synthetic Data. Since our theoretical results are most interesting when k ≪ d ≪ n, we set n = 50000, d = 1000, and k ∈ {20, 50, 100}, and generate a random d × n matrix A whose entries are independently 1 with probability p ∈ {1/500, 1/2000, 1/5000} and 0 with probability 1 − p. In Figure 4.1, we report the average running time of both algorithms over 5 independent runs for each choice of p and k.

Social Networks.
We also evaluate the algorithms on the email-Eu-core network dataset of email interactions between individuals at a large European research institution (Yin et al., 2017; Leskovec et al., 2007), and the com-Youtube dataset of friendships on the YouTube social network (Yang & Leskovec, 2015), both accessed through the Stanford Network Analysis Project (SNAP). In the former, there are n = d = 1005 nodes in the adjacency matrix with 25571 total edges, forming k = 42 communities. In the latter, there are 1134890 nodes with 8385 communities, from which we extract a d × n matrix with n = 100000 and d = 1000 to represent a bipartite graph, as described in Section 2.2 and in Bhattacharyya & Kannan (2020). In Figure 4.2, we report the

