AN EFFICIENT PROTOCOL FOR DISTRIBUTED COLUMN SUBSET SELECTION IN THE ENTRYWISE $\ell_p$ NORM

Abstract

We give a distributed protocol with nearly-optimal communication and number of rounds for Column Subset Selection with respect to the entrywise $\ell_1$ norm ($k$-CSS$_1$), and more generally, for the $\ell_p$-norm with $1 \le p < 2$. We study matrix factorization in $\ell_1$-norm loss, rather than the more standard Frobenius norm loss, because the $\ell_1$ norm is more robust to noise, which is observed to lead to improved performance in a wide range of computer vision and robotics problems. In the distributed setting, we consider $s$ servers in the standard coordinator model of communication, where the columns of the input matrix $A \in \mathbb{R}^{d \times n}$ ($n \gg d$) are distributed across the $s$ servers. We give a protocol in this model with $O(sdk)$ communication, 1 round, and polynomial running time, which achieves a multiplicative $k^{\frac{1}{p}-\frac{1}{2}}\,\mathrm{poly}(\log nd)$-approximation to the best possible column subset. A key ingredient in our proof is the reduction to the $\ell_{p,2}$-norm, which corresponds to the $p$-norm of the vector of Euclidean norms of each of the columns of $A$. This enables us to use strong coreset constructions for Euclidean norms, which previously had not been used in this context. This naturally also allows us to implement our algorithm in the popular streaming model of computation. We further propose a greedy algorithm for selecting columns, which can be used by the coordinator, and show the first provable guarantees for a greedy algorithm for the $\ell_{1,2}$ norm. Finally, we implement our protocol and demonstrate significant practical advantages on real-world data analysis tasks.

1. INTRODUCTION

Column Subset Selection ($k$-CSS) is a widely studied approach for rank-$k$ approximation and feature selection. In $k$-CSS, one seeks a small subset $U \in \mathbb{R}^{d \times k}$ of $k$ columns of a data matrix $A \in \mathbb{R}^{d \times n}$, typically with $n \gg d$, for which there is a right factor $V$ such that $\|UV - A\|$ is small under some norm $\|\cdot\|$. $k$-CSS is a special case of low rank approximation for which the left factor is an actual subset of columns. The main advantage of $k$-CSS over general low rank approximation is that the resulting factorization is more interpretable, as columns correspond to actual features, while general low rank approximation takes linear combinations of such features. In addition, $k$-CSS preserves the sparsity of the data matrix $A$.

$k$-CSS has been extensively studied in the Frobenius norm (Guruswami & Sinop, 2012; Boutsidis et al., 2014; Boutsidis & Woodruff, 2017; Boutsidis et al., 2008) and operator norms (Halko et al., 2011; Woodruff, 2014). A number of recent works (Song et al., 2017; Chierichetti et al., 2017; Dan et al., 2019; Ban et al., 2019; Mahankali & Woodruff, 2020) studied this problem in the $\ell_p$ norm ($k$-CSS$_p$) for $1 \le p < 2$. The $\ell_1$ norm is less sensitive to outliers, and better at handling missing data and non-Gaussian noise, than the Frobenius norm (Song et al., 2017). Specifically, the $\ell_1$ norm leads to improved performance in many real-world applications, such as structure-from-motion (Ke & Kanade, 2005) and image denoising (Yu et al., 2012).

Distributed low-rank approximation arises naturally when a dataset is too large to store on one machine, when computing a rank-$k$ approximation on a single machine takes prohibitively long, or when the data is collected simultaneously on multiple machines. Despite the flurry of recent work on $k$-CSS$_p$, this problem remains largely unexplored in the distributed setting.
This should be contrasted with Frobenius norm column subset selection and low rank approximation, for which a number of results in the distributed model are known; see, e.g., Altschuler et al. (2016); Balcan et al. (2015; 2016); Boutsidis et al. (2016).

We consider a widely applicable model in the distributed setting, where $s$ servers communicate with a central coordinator via 2-way channels. This model can simulate arbitrary point-to-point communication by having the coordinator forward a message from one server to another; this increases the total communication by a factor of 2 and an additive $\log s$ bits per message to identify the destination server. We consider the column partition model, in which each column of $A \in \mathbb{R}^{d \times n}$ is held by exactly one server. The column partition model is widely studied and arises naturally in many real-world scenarios such as federated learning (Farahat et al., 2013; Altschuler et al., 2016; Liang et al., 2014). In the column partition model, we typically have $n \gg d$, i.e., $A$ has many more columns than rows. Hence, we desire a protocol for distributed $k$-CSS$_p$ whose communication cost is only logarithmic in the large dimension $n$, and which has a fast running time.

Figure 1: Overview of the protocol. Step 1: Server $i$ applies a dense $p$-stable sketching matrix $S$ to reduce the row dimension of the data matrix $A_i$. $S$ is shared between all servers. Step 2: Server $i$ constructs a strong coreset for its sketched data matrix $SA_i$, which is a subsampled and reweighted set of columns of $SA_i$. Server $i$ then sends the coreset $SA_iT_i$, as well as the corresponding unsketched, unweighted columns $A_iD_i$ selected in the strong coreset $SA_iT_i$, to the coordinator. Step 3: The coordinator concatenates the $SA_iT_i$ column-wise, applies $k$-CSS$_{p,2}$ to the concatenated columns and computes the set of indices of the selected columns. Step 4: The coordinator recovers the set of selected columns $A_I$ from the unsketched, unweighted columns $A_iD_i$ through the previously computed indices.
In addition, it is important that our protocol only uses a small constant number of communication rounds (meaning back-and-forth exchanges between servers and the coordinator). Indeed, otherwise, the servers and coordinator would need to interact more, making the protocol sensitive to failures in the machines, e.g., if they go offline. Further, a 1-round protocol can naturally be adapted to a single-pass streaming algorithm when we consider applications with limited memory and access to the data. In fact, our protocol can be easily extended to yield such a streaming algorithm.

In the following, we denote by $A_{i*}$ and $A_{*j}$ the $i$-th row and $j$-th column of $A$ respectively, for $i \in [d]$, $j \in [n]$. We denote by $A_T$ the subset of columns of $A$ with indices in $T \subseteq [n]$. The entrywise $\ell_p$-norm of $A$ is $\|A\|_p = \left(\sum_{i=1}^{d}\sum_{j=1}^{n} |A_{ij}|^p\right)^{\frac{1}{p}}$. The $\ell_{p,2}$ norm is defined as $\|A\|_{p,2} = \left(\sum_{j=1}^{n} \|A_{*j}\|_2^p\right)^{\frac{1}{p}}$. We consider $1 \le p < 2$. We denote the best rank-$k$ approximation error for $A$ in the $\ell_p$ norm by $\mathrm{OPT} := \min_{\text{rank-}k\ A_k} \|A - A_k\|_p$. Given an integer $k > 0$, we say $U \in \mathbb{R}^{d \times k}$, $V \in \mathbb{R}^{k \times n}$ are the left and right factors of a rank-$k$ factorization for $A$ in the $\ell_p$ norm with approximation factor $\alpha$ if $\|UV - A\|_p \le \alpha \cdot \mathrm{OPT}$.

Since general rank-$k$ approximation in the $\ell_1$ norm is NP-hard (Gillis & Vavasis, 2015), we follow previous work and consider bi-criteria $k$-CSS algorithms which obtain polynomial running time. Instead of outputting exactly $k$ columns, such algorithms return a subset of $O(k)$ columns of $A$, suppressing logarithmic factors in $k$ or $n$. It is known that the best approximation factor to OPT that can be obtained through the span of a column subset of size $O(k)$ is $\Omega(k^{1/2-\gamma})$ for $p = 1$ (Song et al., 2017) and $\Omega(k^{1/p-1/2-\gamma})$ for $p \in (1, 2)$ (Mahankali & Woodruff, 2020), where $\gamma$ is an arbitrarily small constant.

1.1. PREVIOUS APPROACHES TO $k$-CSS$_p$ IN THE DISTRIBUTED SETTING

If one only wants to obtain a good left factor $U$, and not necessarily a column subset of $A$, in the column partition model, one could simply sketch the columns of $A_i$ by applying an oblivious sketching matrix $S$ on each server. Each server sends $A_i \cdot S$ to the coordinator. The coordinator obtains $U = AS$ as a column-wise concatenation of the $A_iS$'s. Song et al. (2017) showed that $AS$ achieves an $O(\sqrt{k})$ approximation to OPT, and this protocol only requires $O(sdk)$ communication, $O(1)$ rounds and polynomial running time. However, while $AS$ is a good left factor, it does not correspond to an actual subset of columns of $A$.

Obtaining a subset of columns that approximates $A$ well with respect to the $p$-norm in a distributed setting is non-trivial. One approach due to Song et al. (2017) is to take the matrix $AS$ described above, sample rows according to the Lewis weights (Cohen & Peng, 2015) of $AS$ to get a right factor $V$, which is in the row span of $A$, and then use the Lewis weights of $V$ to in turn sample columns of $A$. Unfortunately, this protocol only achieves a loose $O(k^{3/2})$ approximation to OPT (Song et al., 2017). Moreover, it is not known how to do Lewis weight sampling in a distributed setting.

Alternatively, one could adapt existing single-machine $k$-CSS$_p$ algorithms to the distributed setting under the column partition model. Existing works on polynomial time $k$-CSS$_p$ (Chierichetti et al., 2017; Song et al., 2019b; Dan et al., 2019; Mahankali & Woodruff, 2020) give bi-criteria algorithms, and are based on a recursive framework with multiple rounds, which is as follows: in each round, $O(k)$ columns are selected uniformly at random, and with high probability, the selected columns can provide a good approximation to a constant fraction of all columns of $A$. Among the remaining columns that are not well approximated, $O(k)$ columns are recursively selected until all columns of $A$ are well approximated, resulting in a total of $O(\log n)$ rounds.
A naïve extension of this bi-criteria $k$-CSS$_p$ framework to a distributed protocol requires $O(\log n)$ rounds, as in each round, the servers and the coordinator need to communicate with each other in order to find the columns that are covered well and select from the remaining unselected columns. To reduce this to a single round, one might consider running the $O(\log n)$ round selection procedure on the coordinator only. In order to do this, the coordinator needs to first collect all columns of $A$ from the servers, but directly communicating all columns is prohibitive.

Alternatively, one could first apply $k$-CSS$_p$ on $A_i$ to obtain factors $U_i$ and $V_i$ on each server, and then send the coordinator all of the $U_i$ and $V_i$. The coordinator then column-wise stacks the $U_iV_i$ to obtain $U \cdot V$ and selects $O(k)$ columns from $U \cdot V$. Even though this protocol applies to all $p \ge 1$, it achieves a loose $O(k^2)$ approximation to OPT and requires a prohibitive $O(n + d)$ communication cost (see Appendix A). One could instead try to just communicate the matrices $U_i$ to the coordinator, which results in much less communication, but this no longer gives a good approximation. Indeed, while each $U_i$ serves as a good approximation locally, there may be columns that are locally not important, but become globally important when all of the matrices $A_i$ are put together. What is really needed here is a small coreset $C_i$ for each $A_i$ so that if one concatenates all of the $C_i$ to obtain $C$, any good column subset of the coreset $C$ corresponds to a good column subset for $A$. Unfortunately, coresets for the entrywise $\ell_p$-norm are not known to exist.

1.2. OUR CONTRIBUTIONS

Our Distributed Protocol. We overcome these problems and propose the first efficient protocol for distributed $k$-CSS$_p$ ($1 \le p < 2$) in the column partition model that selects $O(k)$ columns of $A$ achieving an $O(k^{1/p-1/2})$-approximation to the best possible subset of columns and requires only $O(sdk)$ communication cost, 1 round and polynomial time. Figure 1 gives an overview of the protocol. We note that our subset of columns does not necessarily achieve an $O(k^{1/p-1/2})$-approximation to OPT itself, although it does achieve such an approximation to the best possible subset of columns. Using the fact that there always exists a subset of columns providing an $O(k^{1/p-1/2})$-approximation to OPT (Song et al., 2017), we conclude that our subset of columns achieves an $O(k^{2/p-1})$-approximation to OPT. Recently, and independently of our work, Mahankali & Woodruff (2020) show how to obtain a subset of columns achieving an $O(k^{1/p-1/2})$-approximation to OPT itself; however, such a subset is found by uniformly sampling columns in $O(\log n)$ adaptive rounds using the recursive sampling framework above, and is inherently hard to implement in a distributed setting with fewer rounds. In contrast, our protocol achieves 1 round of communication, which is optimal.

To bypass the lack of strong coresets for the $\ell_p$ norm, we make use of a strong coreset, i.e., a sampled and reweighted subset of columns of each $A_i$ that approximates the cost of all potential left factors of $A_i$, by first embedding all subspaces spanned by any subset of $O(k)$ columns of $A$ from $\ell_p$-space to Euclidean space. We denote this new norm as the $\ell_{p,2}$ norm, which is the $p$-th root of the sum of the $p$-th powers of the $\ell_2$ norms of the columns. To reduce the error incurred by switching to the $\ell_{p,2}$-norm, we reduce the row dimension of $A$ by left-multiplying by an oblivious sketching matrix $S$ shared across servers, resulting in an overall approximation factor of only $O(k^{1/p-1/2})$. Afterwards, each server sends its strong coreset to the coordinator.
The coordinator, upon receiving the coresets from each server, runs an $O(1)$-approximate bi-criteria $k$-CSS$_{p,2}$ algorithm to select the final column subset, giving an overall $O(k^{1/p-1/2})$ approximation to the best column subset.

We introduce several new technical ideas in the analysis of our protocol. Our work is the first to apply a combination of oblivious sketching in the $p$-norm via $p$-stable random variables and strong coresets in the $\ell_{p,2}$ norm (Sohler & Woodruff, 2018; Huang & Vishnoi, 2020) to distributed $k$-CSS. Furthermore, to show that our oblivious sketching step only increases the final approximation error by a logarithmic factor, we combine a net argument with a union bound over all possible subspaces spanned by column subsets of $A$ of size $O(k)$. Previous arguments involving sketching, such as those by Song et al. (2017); Ban et al. (2019); Mahankali & Woodruff (2020), only consider a single subspace at a time.

Theoretical Guarantees and Empirical Benefits for Greedy $k$-CSS$_{1,2}$. We also propose a greedy algorithm to select columns in the $k$-CSS$_{1,2}$ step of our protocol, and show the first additive error guarantee compared to the best possible subset $A_S$ of columns, i.e., our cost is at most $(1 - \epsilon)\min_V \|A_SV - A\|_{1,2} + \epsilon\|A\|_{1,2}$. Similar error guarantees were known for the Frobenius norm (Altschuler et al., 2016), though nothing was known for the $\ell_{1,2}$ norm. We also implement our protocol and experiment with distributed $k$-CSS$_1$ on various real-world datasets. We compare the $O(1)$-approximate bi-criteria $k$-CSS$_{1,2}$ and the greedy $k$-CSS$_{1,2}$ as different possible subroutines in our protocol, and show that greedy $k$-CSS$_{1,2}$ yields an improvement in practice.

2.1. THE COLUMN PARTITION MODEL

We consider a model where there are $s$ servers, the $i$-th of which holds $A_i \in \mathbb{R}^{d \times n_i}$, and a coordinator which initially does not hold any data. Each server talks only to the coordinator, via a 2-way communication channel. The communication cost is the total number of words transferred between the servers and the coordinator over the course of the protocol. Each word is $O(\log(snd))$ bits. The overall data matrix $A \in \mathbb{R}^{d \times n}$ is $A = [A_1, A_2, \ldots, A_s]$ (the column-wise concatenation of the $A_i$'s). Here, $n$ is defined to be $\sum_{i=1}^{s} n_i$. Typically, in the column partition model, $n \gg d$.

2.2. p-STABLE

1-p p follows a p-stable distribution.

3. PRELIMINARIES FOR OUR PROTOCOL

We first note a standard relationship between the $\ell_p$ norm and the $\ell_{p,2}$ norm.

Lemma 1. For a matrix $A \in \mathbb{R}^{d \times n}$ and $p \in [1, 2)$, $\|A\|_{p,2} \le \|A\|_p \le d^{\frac{1}{p}-\frac{1}{2}}\|A\|_{p,2}$.
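The two norms and the sandwich inequality of Lemma 1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `entrywise_norm`, `norm_p2` and `lemma1_holds` are illustrative and do not come from the paper.

```python
import numpy as np

def entrywise_norm(A, p):
    """Entrywise l_p norm: (sum_ij |A_ij|^p)^(1/p)."""
    return float((np.abs(A) ** p).sum() ** (1.0 / p))

def norm_p2(A, p):
    """l_{p,2} norm: the l_p norm of the vector of column-wise Euclidean norms."""
    col_norms = np.linalg.norm(A, axis=0)
    return float((col_norms ** p).sum() ** (1.0 / p))

def lemma1_holds(A, p, tol=1e-9):
    """Check ||A||_{p,2} <= ||A||_p <= d^(1/p - 1/2) * ||A||_{p,2} numerically."""
    d = A.shape[0]
    lower = norm_p2(A, p)
    middle = entrywise_norm(A, p)
    upper = d ** (1.0 / p - 0.5) * lower
    return lower <= middle + tol and middle <= upper + tol

# Hypothetical test matrix: i.i.d. Gaussian entries.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 50))
print(all(lemma1_holds(A, p) for p in (1.0, 1.25, 1.5, 1.9)))
```

Both sides of the inequality can be tight: columns with a single non-zero entry attain the left inequality with equality, while an all-ones column attains the right one.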

3.1. $\ell_p$-NORM OBLIVIOUS SKETCHING

We left-multiply $A$ by an oblivious sketching matrix $S$ with $p$-stable random variables so that we lose only an $O(k^{\frac{1}{p}-\frac{1}{2}})$ approximation factor when we switch to the $\ell_{p,2}$ norm. The purpose of the next two lemmas is to show that we can perform oblivious sketching while preserving the costs of all possible column subsets up to logarithmic factors.

We first show a lower bound on the approximation error for a sketched subset of columns, $\|SA_TV - SA\|_p$, which holds simultaneously for any arbitrary subset $A_T$ of chosen columns, and for any arbitrary right factor $V$.

Lemma 2. […] $V \in \mathbb{R}^{|T| \times n}$, $\|A_TV - A\|_p \le \|SA_TV - SA\|_p$.

Next, we show an upper bound on the approximation error of $k$-CSS$_p$ on a sketched subset of columns, $\|SA_TV - SA\|_p$, which holds for a fixed subset of columns $A_T$ and a fixed right factor $V$.

Lemma 3 (Sketched Error Upper Bound). Let $A \in \mathbb{R}^{d \times n}$ and $k \in \mathbb{N}$. Let $t = k \cdot \mathrm{poly}(\log(nd))$, and let $S \in \mathbb{R}^{t \times d}$ be a matrix whose entries are i.i.d. standard $p$-stable random variables, rescaled by $\Theta(1/t^{\frac{1}{p}})$. Then, for a fixed subset $T \subset [n]$ of columns with $|T| = k \cdot \mathrm{poly}(\log k)$ and a fixed $V \in \mathbb{R}^{|T| \times n}$, with probability $1 - o(1)$, we have

$$\min_V \|SA_TV - SA\|_p \le \min_V O(\log^{1/p}(nd))\,\|A_TV - A\|_p.$$
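For $p = 1$ the $p$-stable distribution is the Cauchy distribution, and the rescaling $\Theta(1/t^{1/p})$ becomes $1/t$. The sketch below, assuming NumPy, builds such a dense Cauchy sketching matrix and illustrates 1-stability only; it does not verify Lemmas 2 or 3. The function name `cauchy_sketch` and the dimensions are hypothetical choices for illustration.

```python
import numpy as np

def cauchy_sketch(t, d, rng):
    """Dense 1-stable (Cauchy) sketching matrix in R^{t x d}, rescaled by 1/t."""
    return rng.standard_cauchy(size=(t, d)) / t

rng = np.random.default_rng(0)
d, t = 200, 4001          # illustrative sizes, not the paper's settings
S = cauchy_sketch(t, d, rng)
x = rng.standard_normal(d)

# 1-stability: each coordinate of t * (S @ x) is distributed as ||x||_1 times
# a standard Cauchy variable, whose absolute value has median 1. The median of
# |t * S x| therefore concentrates around ||x||_1.
ratio = np.median(np.abs(t * (S @ x))) / np.abs(x).sum()
print(ratio)
```

The median is used here because Cauchy variables have no finite mean, so averaging the coordinates of $|Sx|$ would not concentrate.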

3.2. STRONG CORESETS IN THE $\ell_{p,2}$ NORM

To enable sub-linear communication cost in the number $n$ of columns, the $i$-th server sends the coordinator a strong coreset of columns of $SA_i$, which is a reweighted subset of the columns of $SA_i$. Such strong coresets preserve the error incurred by any rank-$k$ projection, up to a constant factor, in the $\ell_{p,2}$ norm. The coreset of a matrix $A \in \mathbb{R}^{d \times n}$ is usually denoted as $AT$, where $T = D \cdot W \in \mathbb{R}^{n \times t}$ is a sampling and reweighting matrix and $t$ is the number of columns to be included in the coreset. The sampling matrix $D \in \mathbb{R}^{n \times t}$ has exactly one 1 in each column, in the row indexed by the column of $A$ to be included in the coreset, and 0 everywhere else. The reweighting matrix $W$ is a diagonal $t \times t$ matrix with weights associated with each sample in the coreset.

Lemma 4 (Strong Coreset in $\ell_{p,2}$ norm). Let $A \in \mathbb{R}^{d \times n}$, $k \in \mathbb{N}$, $p \in [1, 2)$, and $\epsilon, \delta \in (0, 1)$. Then, in $n \cdot \mathrm{poly}(k \log n/\epsilon)$ time, one can find a sampling and reweighting matrix $T$ with $O(d \log d/\epsilon^2) \cdot \log(1/\delta)$ columns such that, with probability $1 - \delta$, for all rank-$k$ matrices $U$,

$$\min_{\text{rank-}k\ V} \|UV - AT\|_{p,2} = (1 \pm \epsilon) \min_{\text{rank-}k\ V} \|UV - A\|_{p,2}.$$

$AT$ is called a strong coreset of $A$.

3.3. POLYNOMIAL TIME, $O(1)$-APPROXIMATE BI-CRITERIA $k$-CSS$_{p,2}$

After server $i$ sends a strong coreset to the coordinator, the coordinator does $k$-CSS on a column-wise concatenation of these coresets, in the $\ell_{p,2}$ norm rather than the $\ell_p$ norm. We give a polynomial time, $O(1)$-approximate bi-criteria $k$-CSS$_{p,2}$ algorithm for $p \in [1, 2)$.

Theorem 5 (Bicriteria $O(1)$-Approximation Algorithm for $k$-CSS$_{p,2}$). Let $A \in \mathbb{R}^{d \times n}$ and $k \in \mathbb{N}$. There exists an algorithm that runs in $(\mathrm{nnz}(A) + d^2) \cdot k\,\mathrm{poly}(\log k)$ time and outputs a rescaled subset of columns $U \in \mathbb{R}^{d \times O(k)}$ of $A$ and a right factor $V \in \mathbb{R}^{O(k) \times n}$ for which $V = \arg\min_V \|UV - A\|_{p,2}$, such that with probability $1 - o(1)$,

$$\|UV - A\|_{p,2} \le O(1) \cdot \min_{\text{rank-}k\ A_k} \|A_k - A\|_{p,2}.$$

Our polynomial time bi-criteria $k$-CSS$_{p,2}$ algorithm is based on that of Clarkson & Woodruff (2015).
The main difference is that the algorithm of Clarkson & Woodruff (2015) outputs a subset with $O(k^2)$ columns due to the usage of $\ell_p$ leverage scores, whereas we reduce the number of selected columns to $O(k)$ by using $\ell_p$ Lewis weights (Cohen & Peng, 2015). Details are given in Appendix C.

Algorithm 1: An efficient protocol for bi-criteria $k$-CSS$_p$ in the column partition model

Initial State: Server $i$ holds matrix $A_i \in \mathbb{R}^{d \times n_i}$, $\forall i \in [s]$.

Coordinator: Generate a dense $p$-stable sketching matrix $S \in \mathbb{R}^{k\,\mathrm{poly}(\log(nd)) \times d}$. Send $S$ to all servers.

Server $i$: Compute $SA_i$. Let the number of samples in the coreset be $t = O(k\,\mathrm{poly}(\log(nd)))$. Construct a coreset of $SA_i$ under the $\ell_{p,2}$ norm by applying a sampling matrix $D_i$ of size $n_i \times t$ and a diagonal reweighting matrix $W_i$ of size $t \times t$. Let $T_i = D_iW_i$. Send $SA_iT_i$ along with $A_iD_i$ to the coordinator.

Coordinator: Column-wise stack the $SA_iT_i$ to obtain $SAT = [SA_1T_1, SA_2T_2, \ldots, SA_sT_s]$. Apply $k$-CSS$_{p,2}$ on $SAT$ to obtain the indices $I$ of the subset of selected columns with size $O(k \cdot \mathrm{poly}(\log k))$. Since the $D_i$'s are sampling matrices, the coordinator can recover the original columns of $A$ by mapping the indices $I$ to the $A_iD_i$'s. Denote the final selected subset of columns by $A_I$. Send $A_I$ to all servers.

Server $i$: Solve $\min_{V_i} \|A_IV_i - A_i\|_p$ to obtain the right factor $V_i$. $A_I$ and $V$ will be factors of a rank-$k \cdot \mathrm{poly}(\log k)$ factorization of $A$, where $V$ is the (implicit) column-wise concatenation of the $V_i$.
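The data flow of Algorithm 1 can be simulated on a single machine for $p = 1$. The following is a rough sketch under loud assumptions: the coreset step samples columns with probability proportional to their Euclidean norm (a simple stand-in, not the strong coreset construction of Lemma 4), and the coordinator uses a plain greedy selection instead of the algorithm of Theorem 5. All function names (`residual_cost`, `select_columns`, `server_message`, `protocol`) and sizes are hypothetical.

```python
import numpy as np

def residual_cost(M, U):
    """l_{1,2} cost of fitting every column of M in the span of the columns of U."""
    coef, *_ = np.linalg.lstsq(U, M, rcond=None)
    return np.linalg.norm(M - U @ coef, axis=0).sum()

def select_columns(M, k):
    """Stand-in for k-CSS_{1,2}: greedily add the column that lowers the cost most."""
    chosen = []
    for _ in range(k):
        rest = [j for j in range(M.shape[1]) if j not in chosen]
        best = min(rest, key=lambda j: residual_cost(M, M[:, chosen + [j]]))
        chosen.append(best)
    return chosen

def server_message(A_i, S, t, rng):
    """Server side: sketch, sample and reweight; returns (S A_i T_i, A_i D_i)."""
    SA = S @ A_i
    norms = np.linalg.norm(SA, axis=0)
    probs = norms / norms.sum()          # assumed sampling distribution
    idx = rng.choice(A_i.shape[1], size=t, replace=True, p=probs)
    weights = 1.0 / (t * probs[idx])     # diagonal of W_i
    return SA[:, idx] * weights, A_i[:, idx]

def protocol(blocks, k, sketch_rows, t, seed=0):
    """One round: servers send coresets, the coordinator selects actual columns."""
    rng = np.random.default_rng(seed)
    d = blocks[0].shape[0]
    S = rng.standard_cauchy(size=(sketch_rows, d)) / sketch_rows  # shared sketch
    msgs = [server_message(A_i, S, t, rng) for A_i in blocks]
    SAT = np.hstack([m[0] for m in msgs])   # concatenated sketched coresets
    AD = np.hstack([m[1] for m in msgs])    # matching unsketched columns
    I = select_columns(SAT, k)
    return AD[:, I]                          # A_I: actual columns of A

# Hypothetical input: 3 servers, each holding 100 columns in R^30.
rng = np.random.default_rng(42)
blocks = [rng.standard_normal((30, 100)) for _ in range(3)]
A_I = protocol(blocks, k=4, sketch_rows=12, t=20)
print(A_I.shape)
```

Because each server also sends the unsketched columns $A_iD_i$, the coordinator can return genuine columns of $A$ without a second round of communication.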

4. AN EFFICIENT PROTOCOL FOR DISTRIBUTED $k$-CSS$_p$

Theorem 6 (A Protocol for Distributed $k$-CSS$_p$). In the column partition model, let $A \in \mathbb{R}^{d \times n}$ be the data matrix whose columns are partitioned across $s$ servers and suppose server $i$ holds a subset of columns $A_i \in \mathbb{R}^{d \times n_i}$, where $n = \sum_{i \in [s]} n_i$. Then, given $p \in [1, 2)$ and a desired rank $k \in \mathbb{N}$, Algorithm 1 outputs a subset of columns $A_I \in \mathbb{R}^{d \times k\,\mathrm{poly}(\log k)}$ in $O(\mathrm{nnz}(A)k + kd + k^3)$ time, such that with probability $1 - o(1)$,

$$\min_V \|A_IV - A\|_p \le O(k^{1/p-1/2}) \min_{L \subset [n], |L| = k}\ \min_V \|A_LV - A\|_p.$$

Algorithm 1 uses 1 round of communication and $O(sdk)$ words of communication.

Proof. Approximation Factor. In the following proof, let $L \subset [n]$, $|L| = k$ denote the best possible subset of $k$ columns of $A$, i.e., the subset for which the $k$-CSS$_p$ cost $\min_V \|A_LV - A\|_p$ is minimized. First, note that

$$\begin{aligned}
\min_V \|A_IV - A\|_p &\le \|A_IV' - A\|_p && V' := \arg\min_V \|SA_IV - SA\|_{p,2}\\
&\le \|SA_IV' - SA\|_p && \text{by Lemma 2}\\
&\le O(k^{\frac{1}{p}-\frac{1}{2}})\,\|SA_IV' - SA\|_{p,2} && \text{by Lemma 1, since } S \text{ has } k \cdot \mathrm{poly}(\log(nd)) \text{ rows.}
\end{aligned}$$

$SA_I$ is the set of selected columns output by the bi-criteria $O(1)$-approximation $k$-CSS$_{p,2}$ algorithm. Let $(SAT)^*$ denote the best rank-$k$ approximation to $SAT$. By Theorem 5,

$$O(k^{\frac{1}{p}-\frac{1}{2}})\,\|SA_IV' - SA\|_{p,2} \le O(k^{\frac{1}{p}-\frac{1}{2}}) \cdot O(1)\,\|(SAT)^* - SAT\|_{p,2} \le O(k^{\frac{1}{p}-\frac{1}{2}}) \min_V \|SA_LV - SAT\|_{p,2}.$$

Note that $SAT = [SA_1T_1, \ldots, SA_sT_s]$ is a column-wise concatenation of all coresets of $SA_i$, $\forall i \in [s]$. By Lemma 4,

$$\begin{aligned}
\Big(\min_V \|SA_LV - SAT\|_{p,2}^p\Big)^{1/p} &= \Big(\sum_{i=1}^{s} \min_{V_i} \|SA_LV_i - SA_iT_i\|_{p,2}^p\Big)^{1/p}\\
&= \Big(\sum_{i=1}^{s} (1 \pm \epsilon)^p \min_{V_i} \|SA_LV_i - SA_i\|_{p,2}^p\Big)^{1/p}\\
&= (1 \pm \epsilon)\Big(\sum_{i=1}^{s} \min_{V_i} \|SA_LV_i - SA_i\|_{p,2}^p\Big)^{1/p}\\
&= (1 \pm \epsilon) \min_V \|SA_LV - SA\|_{p,2}.
\end{aligned}$$

Hence,

$$\begin{aligned}
O(k^{\frac{1}{p}-\frac{1}{2}}) \min_V \|SA_LV - SAT\|_{p,2} &\le O(k^{\frac{1}{p}-\frac{1}{2}}) \min_V \|SA_LV - SA\|_{p,2}\\
&\le O(k^{\frac{1}{p}-\frac{1}{2}}) \min_V \|SA_LV - SA\|_p && \text{by Lemma 1}\\
&\le O(k^{\frac{1}{p}-\frac{1}{2}}) \cdot \log^{1/p}(nd) \min_V \|A_LV - A\|_p && \text{by Lemma 3.}
\end{aligned}$$

Therefore, suppressing the $\mathrm{poly}(\log(nd))$ factor, we conclude: $\min_V \|A_IV - A\|_p \le O(k^{\frac{1}{p}-\frac{1}{2}}) \min_V \|A_LV - A\|_p$.

Communication […] $\cdot\, k\,\mathrm{poly}(\log k) \le k^3\,\mathrm{poly}(\log(knd))$ to find the set of selected columns.
Since the number of selected columns is $O(k\,\mathrm{poly}(\log k))$, it then takes the protocol $O(k\,\mathrm{poly}(\log k))$ time to map the indices of the output columns from $k$-CSS$_{p,2}$ to recover the original columns $A_I$. Therefore, the overall running time for the protocol to find the subset of columns $A_I$ is $O(\mathrm{nnz}(A)k + kd + k^3)$, suppressing a low degree polynomial dependency on $\log(knd)$. After the servers receive $A_I$, it is possible to solve $\min_{V_i} \|A_IV_i - A_i\|_p$ in $O(\mathrm{nnz}(A_I)) + \mathrm{poly}(d \log n)$ time, $\forall i \in [s]$, due to Wang & Woodruff (2019); Yang et al. (2018).

5. GREEDY $k$-CSS$_{1,2}$

We propose a greedy algorithm, shown in Algorithm 2, for $k$-CSS$_{1,2}$, which can be used in place of the algorithm described in Theorem 5. The basic version of this algorithm, discussed in Appendix D, performs $k$-CSS$_{1,2}$ by simply selecting, at each iteration, the additional column among those of $A$ that reduces the approximation error the most. Our analysis of that algorithm is inspired by the analysis of Greedy $k$-CSS$_2$ for the Frobenius norm in Altschuler et al. (2016). Here we provide the first additive error guarantee, compared to the best possible subset of columns, for the greedy $k$-CSS$_{1,2}$ algorithm. For a faster running time, we make use of the Lazier-than-lazy heuristic described in Section 5.2 of Altschuler et al. (2016), where in each iteration, rather than considering all columns of $A$ as candidate additional columns of $A_T$, we only sample a subset of the columns of $A$ of size $O(\frac{n \log(1/\delta)}{k})$, and pick the column among those that improves the objective the most.

Theorem 7. Let $A \in \mathbb{R}^{d \times n}$ be the data matrix and $k \in \mathbb{N}$ be the desired rank. Let $A_S$ be the best possible subset of $k$ columns, i.e., $A_S = \arg\min_{A_S} \min_V \|A_SV - A\|_{1,2}$. Let $\sigma$ be the minimum non-zero singular value of the matrix $B$ of normalized columns of $A_S$ (the $j$-th column of $B$ is $B_{*j} = (A_S)_{*j}/\|(A_S)_{*j}\|_2$). Then, if $T \subset [n]$ is the subset of columns selected by Greedy $k$-CSS$_{1,2}$, the following holds with $|T| = \Omega(\frac{k}{\sigma^2\epsilon^2})$:

$$\min_V \|A_TV - A\|_{1,2} \le (1 - \epsilon) \min_{S \subset [n], |S| = k, V \in \mathbb{R}^{k \times n}} \|A_SV - A\|_{1,2} + \epsilon\|A\|_{1,2}.$$

Similarly, if $T \subset [n]$ is the subset of columns selected by Lazier-than-lazy Greedy $k$-CSS$_{1,2}$, the following holds with $|T| = \Omega(\frac{k}{\sigma^2\epsilon^2})$ and $\delta = \epsilon$:

$$\mathbb{E}\big[\min_V \|A_TV - A\|_{1,2}\big] \le (1 - \epsilon) \min_{S \subset [n], |S| = k, V \in \mathbb{R}^{k \times n}} \|A_SV - A\|_{1,2} + \epsilon\|A\|_{1,2}.$$

Algorithm 2: Lazier-than-lazy Greedy $k$-CSS$_{1,2}$. This version of the greedy algorithm is based on Section 5.2 of Altschuler et al. (2016).

Input: The data matrix $A \in \mathbb{R}^{d \times n}$, the number of iterations $r \le n$, a parameter $\delta \in (0, 1)$.
Output: A subset of columns $A_T$ from $A$, where $|T| = r$.
$A_T \leftarrow \emptyset$
for $i = 1$ to $r$ do
    $\bar{T} \leftarrow$ a subset of $\frac{n \log(1/\delta)}{k}$ columns of $A$, each selected uniformly at random (excluding the columns whose indices are in $T$)
    Column $j^* \leftarrow \arg\min_{j \in \bar{T}} \big(\min_V \|A_{T \cup j}V - A\|_{1,2}\big)$
    $A_T \leftarrow A_{T \cup j^*}$
end for

Table 1:
Dataset | Size | # servers $s$ | Column distribution | Rank $k$
synthetic | $(2000 + k) \times (2000 + k)$ | 2 | 1001, 1002 | […]

If we let $|T| = \Omega(\frac{k}{\sigma^2\epsilon^2})$, then the overall running time of Algorithm 2 is $O(\frac{n \log(1/\epsilon)}{\sigma^2\epsilon^2} \cdot F)$, where $F$ is the running time needed to evaluate $\min_V \|A_{T \cup j}V - A\|_{1,2}$ for a fixed candidate $j \in \bar{T}$. We can get $F = O(\frac{dk^2}{\sigma^4\epsilon^4} + \frac{ndk}{\sigma^2\epsilon^2})$ by computing the pseudoinverse $A_{T \cup j}^{\dagger}$. Since the error upper bound for greedy $k$-CSS$_{p,2}$ depends on $\|A\|_{1,2}$, it is not directly comparable to the error upper bound for the proposed $k$-CSS$_{p,2}$ from Subsection 3.3, which achieves a multiplicative $O(1)$-approximation to the best rank-$k$ approximation. We empirically compare the two versions of $k$-CSS$_{p,2}$ for $p = 1$ in Section 6.
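Algorithm 2 can be sketched in a few lines. The version below is a minimal illustration assuming NumPy; it relies on the fact that the $\ell_{1,2}$ objective decouples over columns, so the inner minimization over $V$ is an ordinary least squares fit per column. The names `cost_12` and `lazier_greedy` and the test matrix are hypothetical, and no attempt is made to reach the running time stated above.

```python
import math
import numpy as np

def cost_12(A, cols):
    """min_V ||A_T V - A||_{1,2}; each column of A is fit by least squares."""
    if not cols:
        return np.linalg.norm(A, axis=0).sum()
    U = A[:, cols]
    V, *_ = np.linalg.lstsq(U, A, rcond=None)
    return np.linalg.norm(U @ V - A, axis=0).sum()

def lazier_greedy(A, r, k, delta, rng):
    """Pick r columns; each round scores only a uniform sample of candidates."""
    n = A.shape[1]
    sample_size = max(1, math.ceil(n * math.log(1.0 / delta) / k))
    T = []
    for _ in range(r):
        remaining = [j for j in range(n) if j not in T]
        m = min(sample_size, len(remaining))
        candidates = rng.choice(remaining, size=m, replace=False)
        j_star = min(candidates, key=lambda j: cost_12(A, T + [int(j)]))
        T.append(int(j_star))
    return T

# Hypothetical input: an exactly rank-3 matrix, so 3 good columns give zero error.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 30))
T = lazier_greedy(A, r=3, k=3, delta=0.1, rng=rng)
print(T, cost_12(A, T))
```

Setting `delta` close to 0 makes the candidate sample cover all remaining columns, which recovers the basic greedy algorithm at a higher cost per iteration.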

6. EXPERIMENTS

We implement our protocol for distributed $k$-CSS$_p$ in Algorithm 1, setting $p = 1$, which enables us to compare two subroutines on the coordinator: Regular $k$-CSS$_{1,2}$ from Section 3.3 and Greedy $k$-CSS$_{1,2}$ from Section 5. We compare our $k$-CSS$_1$ protocol against a commonly applied baseline for $\ell_p$ low rank approximation (used by Song et al. (2019a); Chierichetti et al. (2017)): rank-$k$ Singular Value Decomposition (SVD).

Datasets. We demonstrate the benefits of our $k$-CSS$_1$ protocol on one synthetic dataset and one real-world application. We present a summary of the datasets, along with the number of servers $s$, the column distribution across servers and the rank $k$ we consider for each dataset in Table 1. The synthetic dataset constructs a data matrix $M \in \mathbb{R}^{(k+n) \times (k+n)}$ such that the top left $k \times k$ submatrix is the identity matrix multiplied by $n^{\frac{3}{2}}$, and the bottom right $n \times n$ submatrix has all 1's. The optimal rank-$k$ left factor consists of one of the last $n$ columns along with $k - 1$ of the first $k$ columns, incurring an error of $n^{\frac{3}{2}}$ in the $\ell_1$ norm and an error of $n^3$ in the squared $\ell_2$ norm. SVD, however, will not cover any of the last $n$ columns, and thus will get an error of $n^2$ in both the $\ell_1$ and squared $\ell_2$ norms. We set $n = 2000$ and apply i.i.d. Gaussian noise to each entry with mean 0 and standard deviation 0.01. We also consider a real-world application, term-document clustering, where a $k$-CSS$_2$ algorithm has previously been applied (Mahoney & Drineas, 2009). TechTC contains 139 documents processed in bag-of-words representation with a dictionary of 18446 words. Such a representation naturally results in a sparse matrix. $k$-CSS$_p$ is used to select the top $k$ most representative words.

Hyperparameters. Experiment hyperparameters are summarized in Table 2. We denote the number of rows in our 1-stable (Cauchy) sketching matrix by cauchy size, and the strong coreset size by coreset size. We have two additional hyperparameters for regular $k$-CSS$_{1,2}$.
We denote the number of rows in the sparse embedding matrix of $O(k)$ rows by sketch size, and the number of non-zero entries in each column of the sparse embedding matrix by sparsity.

Table 2: A summary of the hyperparameters used for each dataset. Note that a coreset size that is too small compared to cauchy size will incur a large $\ell_1$ cost, and coreset size needs to be increased as the rank $k$ increases.

[…] $\|A\|_1 \cdot \sum_{k \in K} \frac{1}{20\,\|A_{*k}\|_1} \min_v \|A_Iv - A_{*k}\|_1$. The approximated $\ell_1$ error can be shown to be an unbiased estimate of $\min_V \|A_IV - A\|_1$ with small variance by Hoeffding's bound.

Results. We present our empirical results in Figure 2. The distributed protocol performs better using GREEDY $k$-CSS$_{1,2}$ than REGULAR $k$-CSS$_{1,2}$ on both datasets, and in other settings we include in the supplementary material. We also conducted experiments on three different datasets, bcsstk13, isolet, and caltech-101, extensively comparing the approximation error as well as the running time for different algorithms. A comprehensive report of our results is given in Appendix G.
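The synthetic construction from the Datasets paragraph, and the gap between SVD and a good column subset in the $\ell_1$ norm, can be reproduced at small scale. The sketch below assumes NumPy, uses $n = 50$ and no noise for a quick deterministic check (the experiments use $n = 2000$ with noise), and fits the right factor by least squares, which only upper-bounds the $\ell_1$-optimal fit. The function names are illustrative.

```python
import numpy as np

def synthetic_matrix(k, n, noise_std=0.01, seed=0):
    """(k+n) x (k+n) matrix: n^{3/2} * I_k in the top left, all ones bottom right."""
    rng = np.random.default_rng(seed)
    M = np.zeros((k + n, k + n))
    M[:k, :k] = np.eye(k) * n ** 1.5
    M[k:, k:] = 1.0
    return M + rng.normal(0.0, noise_std, size=M.shape)

def l1_error_svd(M, k):
    """Entrywise l_1 error of the rank-k truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_k = (U[:, :k] * s[:k]) @ Vt[:k]
    return np.abs(M - M_k).sum()

def l1_error_columns(M, cols):
    """Entrywise l_1 error of a least squares fit in the span of the given columns."""
    U = M[:, cols]
    V, *_ = np.linalg.lstsq(U, M, rcond=None)
    return np.abs(M - U @ V).sum()

k, n = 3, 50
M = synthetic_matrix(k, n, noise_std=0.0)
svd_err = l1_error_svd(M, k)
# k - 1 of the first k columns plus one of the last n columns.
css_err = l1_error_columns(M, [0, 1, k])
print(svd_err, css_err)
```

In the noiseless case the SVD error is $n^2$ (the whole all-ones block is missed), while the column subset misses only one scaled identity column and pays $n^{3/2}$.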

7. CONCLUSION

In this work, we give the first protocol with nearly-optimal communication and number of rounds for distributed $k$-CSS$_p$ ($1 \le p < 2$) in the column partition model, which achieves an $O(k^{1/p-1/2})$-approximation to the best possible subset of columns, with $O(sdk)$ communication cost, 1 round and polynomial time. To achieve a good approximation factor, we use dense $p$-stable sketching and work with the $\ell_{p,2}$ norm, which enables us to use an efficient construction of strong coresets and an $O(1)$-approximation bi-criteria $k$-CSS$_{p,2}$ algorithm. We further propose a greedy algorithm for $k$-CSS$_{1,2}$ and show the first additive error upper bound compared to the best possible subset of columns. We implement our distributed protocol using both greedy $k$-CSS$_{1,2}$ and regular $k$-CSS$_{1,2}$. Our results empirically show that greedy $k$-CSS$_{1,2}$ gives substantial improvements over regular $k$-CSS$_{1,2}$ on real-world datasets. As future work, it remains open whether an $O(k^{1/p-1/2})$-approximation factor to the best possible subset of columns is optimal for distributed $k$-CSS$_p$ ($1 \le p < 2$).

A. A HIGH COMMUNICATION COST PROTOCOL FOR $k$-CSS$_p$ ($p \ge 1$)

We start by describing a protocol for distributed $k$-CSS$_p$, which works for all $p \ge 1$, in the column partition model, and which achieves an $O(k^2)$-approximation to the best rank-$k$ approximation, using $O(1)$ rounds and polynomial time but requiring a communication cost that is linear in $n + d$. The inputs are a column-wise partitioned data matrix $A \in \mathbb{R}^{d \times n}$ distributed across $s$ servers and a rank parameter $k \in \mathbb{N}$. Each server $i$ holds part of the data matrix $A_i \in \mathbb{R}^{d \times n_i}$, $\forall i \in [s]$, such that $\sum_{i=1}^{s} n_i = n$. We use a single machine, polynomial time bi-criteria $k$-CSS$_p$ algorithm as a subroutine of the protocol, e.g., Algorithm 3 in Chierichetti et al. (2017), which selects a subset of $O(k)$ columns $A_T$ of the data matrix $A \in \mathbb{R}^{d \times n}$ in polynomial time, for which $\min_X \|A_TX - A\|_p \le O(k) \min_{\text{rank-}k\ A_k} \|A - A_k\|_p$, $\forall p \ge 1$.

Algorithm 3: A protocol for $k$-CSS$_p$ ($p \ge 1$)

Initial State: Server $i$ holds matrix $A_i \in \mathbb{R}^{d \times n_i}$, $\forall i \in [s]$.

Server $i$: Apply polynomial time bi-criteria $k$-CSS$_p$ on $A_i$ to obtain a subset $U_i$ of columns as the left factor. Solve for the right factor $V_i = \arg\min_{V_i} \|U_iV_i - A_i\|_p$. Send $U_i$ and $V_i$ to the coordinator.

Coordinator: Column-wise concatenate the $U_iV_i$ to obtain $UV = [U_1V_1, \ldots, U_sV_s]$. Apply a polynomial time bi-criteria $k$-CSS$_p$ algorithm on $UV$ to obtain a subset $C$ of columns. Send $C$ to each server.

Server $i$: Solve $\min_{X_i} \|CX_i - A_i\|_p$ to obtain the right factor.

Approximation Factor. Let $UV$ denote the column-wise concatenation of the $U_iV_i$. Let $X^* = \arg\min_X \|CX - A\|_p$. Then,

$$\begin{aligned}
\|CX^* - A\|_p &\le \|CX^* - UV\|_p + \|UV - A\|_p && \text{by the triangle inequality}\\
&\le O(k) \min_{\text{rank-}k\ (UV)_k} \|UV - (UV)_k\|_p + \|UV - A\|_p && \text{by the } O(k)\text{-approximation of } k\text{-CSS}_p\\
&\le O(k)\,\|UV - A\|_p = O(k)\Big(\sum_{i=1}^{s} \|U_iV_i - A_i\|_p\Big)\\
&\le O(k)\Big(\sum_{i=1}^{s} O(k) \min_{\text{rank-}k\ A_i^*} \|A_i - A_i^*\|_p\Big) && \text{by the } O(k)\text{-approximation of } k\text{-CSS}_p\\
&\le O(k^2) \sum_{i=1}^{s} \|A_i - (A^*)_i\|_p && A^* = \arg\min_{\text{rank-}k\ A^*} \|A - A^*\|_p\\
&= O(k^2)\,\|A - A^*\|_p.
\end{aligned}$$

Communication Cost.
Since $U_i \in \mathbb{R}^{d \times O(k)}$ and $V_i \in \mathbb{R}^{O(k) \times n_i}$, sending $U_i$ and $V_i$ costs $O(skn)$. Since $C \in \mathbb{R}^{d \times O(k)}$, sending $C$ from the coordinator to all servers costs $O(sdk)$. Thus the overall communication cost is $O(s(n + d)k)$.

Running time. According to Chierichetti et al. (2017), applying the $k$-CSS$_p$ algorithm and solving $\ell_p$ regression can both be done in polynomial time. Thus the overall running time of the protocol is polynomial.

Problems with this protocol. Although this protocol works for all $p \ge 1$, a communication cost that linearly depends on the large dimension $n$ is too high, and furthermore, the output $C$ is not a subset of columns of $A$, because the protocol applies $k$-CSS$_p$ on a concatenation of both the left factor $U_i$ and the right factor $V_i$. $U_i$ is a subset of columns of $A_i$ but $V_i$ is not necessarily a sampling matrix. One might wonder whether it is possible that each server only sends $U_i$ and the coordinator then runs $k$-CSS$_p$ on a concatenation of the $U_i$. This will not necessarily give a good approximation to $\min_{\text{rank-}k\ A_k} \|A - A_k\|_p$ because the columns not selected in the $U_i$ locally on each server might become globally important. Finally, although it is possible to improve the approximation factor to $O(k)$ by making use of an $O(\sqrt{k})$-approximation algorithm for $\ell_p$-low rank approximation that also selects a subset of columns (Mahankali & Woodruff, 2020), this protocol would still suffer from all of the aforementioned problems.

We begin the proof by first showing that applying a dense $p$-stable sketch to a vector will not shrink its $p$-norm. This is done in Lemma 2.1. We further observe that although $p$-stable random variables are heavy-tailed, we can still bound their tail probabilities by applying Lemma 9 from Meng & Mahoney (2013). We note this in Lemma 2.2. Note that the $X_i$'s do not need to be independent in this lemma. Equipped with Lemma 2.1, Lemma 2.2 and a net argument, we can now establish a lower bound on $\|SA_TV - SA\|_p$.
We first show in Lemma 2.3 that, with high probability, for any arbitrarily selected subset $A_T$ of columns and for an arbitrary column $A_{*j}$, the error incurred to fit $SA_{*j}$ using the columns of $SA_T$ is no less than the error incurred to fit $A_{*j}$ using the columns of $A_T$. We then apply a union bound over all subsets

, where $\alpha_p > 0$ is a constant that is at most $2^{p-1}$.

Proof. Lemma 9 from Meng & Mahoney (2013) for $p \in (1, 2)$.

$(x) = \int_0^x \frac{2}{\pi(t^2+1)}\,dt = 1 - \Theta(\frac{1}{x})$. Thus, for any $i \in [t]$ and $j \in [m]$, $\Pr[|S_{ij}| \le D] = 1 - \Theta(\frac{1}{D})$.

Case 2: $p \in (1, 2)$. We apply the upper tail bound of $p$-stable random variables in Lemma 2.2. For any fixed $i \in [t]$ and $j \in [m]$, $\Pr[|S_{ij}|^p \le D^p] \ge 1 - \Theta(\frac{\log(t)}{D^p})$, which implies $\Pr[|S_{ij}| \le D] \ge 1 - \Theta(\frac{1}{D})$. Therefore, for $p \in [1, 2)$,

$= \|y\|_p - O(\frac{1}{m^2}) = \|x\|_p - O(\frac{1}{m^2})\|x\|_p$ (since $\|x\|_p = \|y\|_p = 1$).

For a sufficiently large $m$, $O(\frac{1}{m^2})$ is at most $\frac{1}{2}$, and thus $\frac{\|x\|_p}{2} = \frac{1}{2} \ge \frac{1}{m^2}$. This implies $\|Sx\|_p \ge \|x\|_p - \frac{\|x\|_p}{2} = \frac{\|x\|_p}{2}$. We can rescale $S$ by a factor of 2 so that $\|Sx\|_p \ge \|x\|_p$. We have shown that $\|Sy\|_p \ge \|y\|_p$ holds simultaneously for all unit vectors $y$ in the column span of $A_{T,j}$. By linearity, we conclude that $\|Sy\|_p \ge \|y\|_p$ ($1 \le p < 2$) holds simultaneously for all $y$ in the column span of $A_{T,j}$.

Step 2: Next, we apply a union bound over all possible subsets $T \subset [n]$ of chosen columns from $A$ and all possible single columns $A_{*j}$ for $j \in$

From Step 1, we have shown that event $E_2$ fails with probability $D^{O(m \log m)} / e^{t}$. Thus event $E_2$ fails over all possible subsets $T$ and all possible single columns $A_{*j}$ with probability at most

$$\frac{D^{O(m \log m)}}{e^{t}} \cdot \binom{d}{m-1} \cdot d \le \frac{D^{O(m \log m)}}{e^{t}} \cdot d^{O(m)} \cdot d.$$

The above failure probability is $o(1)$ as long as

$$t = \Theta\big(\log(D^{O(m \log m)} \cdot d^{O(m)} \cdot d)\big) = \Theta\big(O(m \log m) \log(D) + O(m) \log d\big) = \Theta\big(O(k\,\mathrm{poly}(\log k)) \log(mt) + O(k\,\mathrm{poly}(\log k) \log(d))\big).$$

Thus it suffices to have $t = k\,\mathrm{poly}(\log nd)$ to have failure probability at most $1/e^{O(t)}$.

Now we condition on a single global event $E_1$ and on event $E_2$ holding for all possible $T$ and $A_{*j}$. We conclude that with probability $1 - \Theta(\frac{mt}{D}) - \frac{1}{e^{O(t)}}$, the following holds simultaneously for all $T \subset [n]$ and for all $A_{*j}$ for which $j \in$

$-\Theta(\frac{mt}{D}) - \frac{n}{e^{O(t)}} = 1 - o(1)$,

$$\|A_T V - A\|_p = \Big( \sum_{j=1}^{n} \|A_T y_j - A_j\|_p^p \Big)^{1/p} \le \Big( \sum_{j=1}^{n} \|S(A_T y_j - A_j)\|_p^p \Big)^{1/p} = \|SA_T V - SA\|_p.$$
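As an informal illustration of the no-contraction property in the case $p = 1$, the following sketch draws a dense matrix of i.i.d. standard Cauchy (1-stable) entries and compares $\|Sx\|_1$ with $\|x\|_1$ for one vector. The helper names, the dimensions and the seed are illustrative assumptions, and the check covers a single vector only, not the uniform guarantee over all subsets proved above.

```python
import math
import random

def cauchy_sketch(t, d, rng):
    """Return a t x d matrix of i.i.d. standard Cauchy (1-stable) entries."""
    # tan(pi * (U - 1/2)) with U uniform on [0, 1) is standard Cauchy.
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(d)]
            for _ in range(t)]

def matvec(S, x):
    """Plain matrix-vector product for a matrix stored as a list of rows."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]

def l1_norm(v):
    return sum(abs(v_i) for v_i in v)

rng = random.Random(0)   # fixed seed, for reproducibility only
d, t = 20, 200           # illustrative sizes, not the parameters of the proof
S = cauchy_sketch(t, d, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
ratio = l1_norm(matvec(S, x)) / l1_norm(x)
print(f"||Sx||_1 / ||x||_1 = {ratio:.2f}")
```

By 1-stability, each coordinate of $Sx$ is distributed as $\|x\|_1$ times a standard Cauchy variable, so the printed ratio behaves like a sum of $t$ absolute Cauchy values and falls below 1 only with negligible probability.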

B.3 UPPER BOUND ON THE COST

We show an upper bound on the approximation error of $k$-CSS$_p$ on a sketched subset of columns, $\|SA_T V - SA\|_p$, which holds for a fixed subset $A_T$ of columns and for the minimizing right factor $V = \arg\min_V \|SA_T V - SA\|_p$ for that subset of columns. We first adapt Lemma E.17 from Song et al. (2017) to establish an upper bound on the error $\|SA_T V - SA\|_p$ for any fixed $V$ in Lemma 3.1. We then apply Lemma 3.1 to the minimizer $V$ to conclude the upper bound in Lemma 3.

$$\min_V \|SA_T V - SA\|_p \le \min_V O(\log^{1/p}(nd)) \|A_T V - A\|_p$$

Here, the failure probability can be an arbitrarily small constant.

Proof. Let $X_1^* = \arg\min_X \|SA_T X - SA\|_p$ and $X_2^* = \arg\min_X \|A_T X - A\|_p$. By Lemma 3.1,

$$\|SA_T X_1^* - SA\|_p^p \le \|SA_T X_2^* - SA\|_p^p \le O(\log(k\,\mathrm{poly}(\log n)\,d)) \|A_T X_2^* - A\|_p^p \le O(\log(nd)) \|A_T X_2^* - A\|_p^p.$$

Therefore, $\min_X \|SA_T X - SA\|_p \le \min_X O(\log^{1/p}(nd)) \|A_T X - A\|_p$.

B.4 STRONG CORESETS IN THE $\ell_{p,2}$ NORM

Lemma 4 (Strong Coreset in $\ell_{p,2}$ norm). Let $A \in \mathbb{R}^{d \times n}$, $k \in \mathbb{N}$, $p \in [1, 2)$, and $\varepsilon, \delta \in (0, 1)$. Then, in $n \cdot \mathrm{poly}(k \log n / \varepsilon)$ time, one can find a sampling and reweighting matrix $T$ with $O(d \log d / \varepsilon^2) \cdot \log(1/\delta)$ columns such that, with probability $1 - \delta$, for all rank-$k$ matrices $U$,

$$\min_{\text{rank-}k\ V} \|UV - AT\|_{p,2} = (1 \pm \varepsilon) \min_{\text{rank-}k\ V} \|UV - A\|_{p,2}.$$

$AT$ is called a strong coreset of $A$.

Proof. We can obtain $T$ with $O(d (\log d) / \varepsilon^2)$ columns using the strong coreset construction from Lemma 16 in Sohler & Woodruff (2018). Note that the coreset construction for $k$-subspace approximation in Sohler & Woodruff (2018) aims at removing a dependence on $d$ in the coreset size. The algorithm first finds a $\mathrm{poly}(k)$-dimensional subspace $S$ by running a dimensionality reduction algorithm and then constructs coresets in the lower dimensional subspace $S$, resulting in a coreset size of $\mathrm{poly}(k/\varepsilon)$. In our case, however, we do not want the coreset size to depend polynomially on $k$, whereas a linear dependence on $d$ suffices.
Thus, instead of running their dimensionality reduction algorithm to find such a subspace $S$ to project $A$ onto, we directly use the column span of the input matrix $A$ as the subspace $S$, appended with a row of zeros, to construct the $B$ in Lemma 16 of Sohler & Woodruff (2018). Note that in the original algorithm, the appended column encodes the distances from the input matrix $A$ to the subspace $S$, but in our case it is all 0's. The guarantees and running time claimed above then follow immediately from Lemma 16 of Sohler & Woodruff (2018).

We note that the size of our coreset for $k$-subspace approximation can be further reduced to $O(k)$, suppressing a logarithmic dependence on $k$, $\frac{1}{\varepsilon}$ and $\frac{1}{\delta}$, using an additional $O(nd)$ time, by combining Corollary 5.16 of Huang & Vishnoi (2020) with the importance sampling scheme from Stage 2 of Algorithm 1 in Huang & Vishnoi (2020). Furthermore, though not explicitly stated in Lemma 16 of Sohler & Woodruff (2018), the coreset size has a $\log(\frac{1}{\delta})$ dependence on the failure probability $\delta$ due to the importance sampling (Sohler & Woodruff, 2018; Huang & Vishnoi, 2020). Note that an accuracy of $(1 \pm \varepsilon)$ is desired for our extension to the streaming model, described in Appendix E, and that for both Algorithm 1 and Algorithm 7 we perform a union bound over strong coreset constructions, for which a $\log(1/\delta)$ dependence on $\delta$ is sufficient.
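Since Lemma 4 is stated for the $\ell_{p,2}$ norm while the final guarantee is for the entrywise $\ell_p$ norm, it may help to see both norms side by side. The sketch below computes them for a small matrix and checks the standard comparison $\|A\|_{p,2} \le \|A\|_p \le d^{1/p - 1/2} \|A\|_{p,2}$ for $1 \le p \le 2$, which follows column by column from $\|v\|_2 \le \|v\|_p \le d^{1/p-1/2}\|v\|_2$. The function names and the example matrix are illustrative assumptions.

```python
import math

def entrywise_p_norm(A, p):
    """(sum over all entries of |A_ij|^p)^(1/p), for A stored as a list of rows."""
    return sum(abs(a) ** p for row in A for a in row) ** (1.0 / p)

def lp2_norm(A, p):
    """The p-norm of the vector of Euclidean norms of the columns of A."""
    d, n = len(A), len(A[0])
    col_norms = [math.sqrt(sum(A[i][j] ** 2 for i in range(d))) for j in range(n)]
    return sum(c ** p for c in col_norms) ** (1.0 / p)

A = [[3.0, 0.0],
     [4.0, 1.0]]          # columns (3, 4) and (0, 1)
p, d = 1.0, len(A)
norm_p = entrywise_p_norm(A, p)    # 3 + 4 + 0 + 1
norm_p2 = lp2_norm(A, p)           # 5 + 1
upper = d ** (1.0 / p - 0.5) * norm_p2
print(norm_p2, norm_p, upper)
```

The factor $d^{1/p-1/2}$ in this comparison has the same form as the $k^{1/p-1/2}$ factor in the approximation guarantee, with the number of rows of the matrix in place of $d$.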

C POLYNOMIAL TIME, O(1)-APPROXIMATE BI-CRITERIA k-CSS p,2

We give a detailed analysis of the polynomial time, $O(1)$-approximate $k$-CSS$_{p,2}$ algorithm presented in Algorithm 4, which is based on ideas in Clarkson & Woodruff (2015). We first use a sparse embedding matrix $S$ to obtain an $O(1)$-approximate left factor. We then use $\ell_p$-Lewis weight sampling (Cohen & Peng, 2015) to select a subset of columns.

Algorithm 4 Polynomial time, $O(1)$-approximation for $k$-CSS$_{p,2}$ ($1 \le p < 2$)
Input: The data matrix $A \in \mathbb{R}^{d \times n}$, rank $k \in \mathbb{N}$
Output: The left factor $U \in \mathbb{R}^{d \times O(k)}$ and the right factor $V \in \mathbb{R}^{O(k) \times n}$ such that $\|UV - A\|_{p,2} \le O(1) \min_{\text{rank-}k\ A_k} \|A_k - A\|_{p,2}$
$S \leftarrow$ $O(k) \times d$ sparse embedding matrix, with sparsity $s = \mathrm{poly}(\log k)$.
$S' \leftarrow$ $n \times O(k)$ sampling matrix, each column of which is a standard basis vector chosen randomly according to the $\ell_p$ Lewis weights of the columns of $SA$.
return $U \leftarrow AS'$, $V \leftarrow (AS')^{\dagger} A$ {$\dagger$ denotes the Moore-Penrose pseudoinverse.}

Proof. The proof is the same as the proof of Theorem 32 in Clarkson & Woodruff (2015), except that we adopt a different construction of the sparse embedding matrix $S$, which reduces the number of rows from $O(k^2)$ to $O(k)$ at the price of increased sparsity $s$.

Consider $A_k = \arg\min_{\text{rank-}k\ A_k} \|A_k - A\|_{p,2}$. Let $V_k$

C.2 LEWIS WEIGHTS SAMPLING

The $\ell_p$ Lewis weights are an inherent property of a matrix. By Cohen & Peng (2015), the unique set of $\ell_p$ Lewis weights $w$ for a matrix $M \in \mathbb{R}^{n \times d}$ is defined as follows: the weight $w_i$ of the $i$-th column $M_{*i}$ satisfies

$$w_i^{2/p} = M_{*i}^T (M W^{1-2/p} M^T)^{-1} M_{*i},$$

where $W$ is the diagonal matrix with $W_{ii} = w_i$, $\forall i \in [d]$. We first note in Theorem 5.6 that sampling by Lewis weights provides a subspace embedding in the $\ell_p$-norm. We further apply a randomized version of Dvoretzky's Theorem in Theorem 5.7, which allows an embedding from the $\ell_p$ norm into a low-dimensional Euclidean subspace with very small distortion and thus enables us to switch between the $\ell_p$ norm and the $\ell_{p,2}$ norm. Based on Theorem 5.6 and Theorem 5.7, we show in Theorem 5.8 that Lewis weights sampling provides a good subset of columns, on which the analysis of $k$-CSS$_{p,2}$ is based.

Theorem 5.6. ($\ell_p$-Lewis Weights Subspace Embedding) Let $A \in \mathbb{R}^{n \times d}$ and $t = O(d)$. For $1 \le p < 2$, there exists a distribution $(\lambda_1, \lambda_2, \ldots, \lambda_n)$ on the rows of $A$ such that if we generate a matrix $S$ with $t$ rows, each chosen independently as the $i$-th standard basis vector times $\frac{1}{(r\lambda_i)}$

Proof. This follows from Theorem 1.2 from Paouris et al. (2017).

Theorem 5.8. (Subset of Columns by Lewis Weights Sampling) Let $A \in \mathbb{R}^{d \times n}$. Let $S \in \mathbb{R}^{m \times d}$ be a sparse embedding matrix, with $m = O(k \cdot \mathrm{poly}(\log k)\,\mathrm{poly}(\frac{1}{\varepsilon}))$. Further, let $S' \in \mathbb{R}^{n \times t}$ be a sampling matrix whose columns are random standard basis vectors generated according to the $\ell_p$ Lewis weights of the columns of $SA$, with $t = k \cdot \mathrm{poly}(\log k)$. Then, for $\hat{X} = \arg\min_{\text{rank-}k\ X} \|XSAS' - AS'\|_{p,2}$, the following holds with constant probability,

Now let $G \in \mathbb{R}^{\Theta(d) \times d}$ be a rescaled random matrix whose entries are i.i.d. standard Gaussian random variables as in Theorem 5.7. We apply Theorem 5.7 to transform between the $\ell_p$ space and the Euclidean space. Since the transformation in both directions can be done with very small distortion, we obtain a $\Theta(1)$ approximation.
With constant probability, we have

$$\|\hat{X}SA - A\|_{p,2} \le \Theta(1) \min_{\text{rank-}k\ A_k} \|A_k - A\|_{p,2}.$$

Proof. Let $X^* = \arg\min_{\text{rank-}k\ X^*} \|X^*SA - A\|_p$,

$$\|X^*SA - \hat{X}SA\|_{p,2} = \Theta(1) \|G(X^* - \hat{X})SA\|_p \quad \text{(by Theorem 5.7)}$$

Under review as a conference paper at ICLR 2021

Algorithm 5 Greedy $k$-CSS$_{1,2}$. Here we denote the set of selected columns $T$ from $A$ by $A_T$ and the set of unselected columns by $A_{\overline{T}}$.
Input: The data matrix $A \in \mathbb{R}^{d \times n}$, the number of iterations $r \le n$.
Output: A subset of columns $A_T$ from $A$, where $|T| = r$.
$A_T \leftarrow \emptyset$
for $i = 1$ to $r$ do
    Column $j^* \leftarrow \arg\min_{j \in \overline{T}} (\min_V \|A_{T \cup j} V - A\|_{1,2})$
    $A_T \leftarrow A_{T \cup j^*}$
end for

Algorithm 6 Lazier-than-lazy Greedy $k$-CSS$_{1,2}$. This version of the greedy algorithm is based on Section 5.2 of Altschuler et al. (2016).
Input: The data matrix $A \in \mathbb{R}^{d \times n}$, the number of iterations $r \le n$, a parameter $\delta \in (0, 1)$.
Output: A subset of columns $A_T$ from $A$, where $|T| = r$.
$A_T \leftarrow \emptyset$
for $i = 1$ to $r$ do
    $\overline{T} \leftarrow$ a subset of $\frac{n \log(1/\delta)}{k}$ columns of $A$, each selected uniformly at random (excluding the columns whose indices are in $T$)
    Column $j^* \leftarrow \arg\min_{j \in \overline{T}} (\min_V \|A_{T \cup j} V - A\|_{1,2})$
    $A_T \leftarrow A_{T \cup j^*}$
end for

First we analyze Greedy $k$-CSS$_{1,2}$ (Algorithm 5), then show how this analysis can be extended to its Lazier-than-lazy variant (Algorithm 6). We first show in Lemma 7.2 an improvement of the utility function with one additional column when projecting a single vector, based on Lemma 7.1 from Altschuler et al. (2016). We then show an improvement of the utility function when projecting a matrix in Lemma 7.3, by applying Lemma 7.2 and Jensen's Inequality, following the analysis in Altschuler et al. (2016). Finally, we conclude our analysis for Greedy $k$-CSS$_{1,2}$ in Theorem 7.

Notation. Consider the input matrix $A \in \mathbb{R}^{d \times n}$ ($n \gg d$). Let $B$ be the matrix of normalized columns of $A$, where the $j$-th column of $B$ is $B_{*j} = A_{*j} / \|A_{*j}\|_2$. Let $\pi_T : \mathbb{R}^d \to \mathbb{R}^d$ be the projection onto the column span of $A_T$ or, equivalently, $B_T$. Let $\sigma_{\min}(M)$ denote the minimum singular value of a matrix $M$.
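The selection rule of Greedy $k$-CSS$_{1,2}$ can be sketched as follows. This is a direct, unoptimized illustration rather than the implementation used in our experiments: it relies on the fact that, for a fixed set of columns, the inner minimization over $V$ is solved column by column by Euclidean projection, so the objective is the sum of the 2-norms of the projection residuals. The function name, the tolerance `tol` and the toy matrix are assumptions made for this example.

```python
import math

def greedy_css_l12(A, r, tol=1e-12):
    """Greedily pick up to r column indices of A (stored as a list of d rows)."""
    d, n = len(A), len(A[0])
    # residuals[j]: column j minus its projection onto the selected columns.
    residuals = [[A[i][j] for i in range(d)] for j in range(n)]
    selected = []
    for _ in range(r):
        best_j, best_cost = None, math.inf
        for j in range(n):
            norm_j = math.sqrt(sum(x * x for x in residuals[j]))
            if j in selected or norm_j <= tol:
                continue  # already chosen, or already inside the current span
            q = [x / norm_j for x in residuals[j]]  # new unit direction
            # l_{1,2} error if column j were added: sum of residual 2-norms.
            cost = 0.0
            for res in residuals:
                dot = sum(a * b for a, b in zip(q, res))
                cost += math.sqrt(max(0.0, sum(x * x for x in res) - dot * dot))
            if cost < best_cost:
                best_j, best_cost = j, cost
        if best_j is None:
            break  # every remaining column is already spanned
        norm_b = math.sqrt(sum(x * x for x in residuals[best_j]))
        q = [x / norm_b for x in residuals[best_j]]
        updated = []
        for res in residuals:
            dot = sum(a * b for a, b in zip(q, res))
            updated.append([x - dot * q_i for x, q_i in zip(res, q)])
        residuals = updated
        selected.append(best_j)
    error = sum(math.sqrt(sum(x * x for x in res)) for res in residuals)
    return selected, error

# Columns (1,0,0), (0,1,0), (1,1,0), (2,1,0): rank 2, so two columns suffice.
A = [[1.0, 0.0, 1.0, 2.0],
     [0.0, 1.0, 1.0, 1.0],
     [0.0, 0.0, 0.0, 0.0]]
selected, error = greedy_css_l12(A, r=2)
print(selected, error)
```

On this toy input the first pick is the column $(2, 1, 0)$, since it leaves the smallest sum of residual norms, and after the second pick the error is zero up to rounding.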
To aid our analysis, we define a utility function as follows, inspired by Altschuler et al. (2016). For a subset $T \subset [n]$ and a matrix $M \in \mathbb{R}^{d \times t}$ (or a vector $M \in \mathbb{R}^d$),

$$\Phi_M(T) = \|M\|_{1,2} - \|M - \pi_T M\|_{1,2} = \sum_{i=1}^{t} \big( \|M_{*i}\|_2 - \|M_{*i} - \pi_T M_{*i}\|_2 \big) = \sum_{i=1}^{t} \Phi_{M_{*i}}(T).$$

Observe that as the number of columns selected and added to $T$ increases, we get a more accurate estimation of $M$ and thus the approximation error $\|M - \pi_T M\|_{1,2}$ decreases, which results in an increase in the utility function $\Phi_M(T)$.

Lemma 7.1. Let $S, T \subset [n]$ be two sets of column indices, with $\|\pi_S u\|_2 \ge \|\pi_T u\|_2$ for some vector $u \in \mathbb{R}^d$. Then,

$$\sum_{i=1}^{k} \big( \|\pi_{T_i'} u\|_2^2 - \|\pi_T u\|_2^2 \big) \ge \sigma_{\min}(B_S)^2 \frac{(\|\pi_S u\|_2^2 - \|\pi_T u\|_2^2)^2}{4 \|\pi_S u\|_2^2}.$$

Proof. Lemma 2 from Altschuler et al. (2016), except that we replace the condition for $S$ and $T$, i.e., $\Phi_u(S) \ge \Phi_u(T)$ in Altschuler et al. (2016), with $\|\pi_S u\|_2 \ge \|\pi_T u\|_2$. The two conditions are equivalent, since

$$\Phi_u(S) \ge \Phi_u(T) \Leftrightarrow \|u - \pi_S u\|_2 \le \|u - \pi_T u\|_2 \Leftrightarrow \|u\|_2^2 - \|\pi_S u\|_2^2 \le \|u\|_2^2 - \|\pi_T u\|_2^2 \Leftrightarrow \|\pi_S u\|_2 \ge \|\pi_T u\|_2.$$

Lemma 7.2. (Utility Improvement by Projecting a Single Vector) Let $S, T \subset [n]$ be two sets of column indices, with $\Phi_u(S) \ge \Phi_u(T)$ for some vector $u \in \mathbb{R}^d$. Let $k = |S|$. For $i \in [k]$, let $T_i' = T \cup \{i\}$. Then,

$$\sum_{i=1}^{k} \big( \Phi_u(T_i') - \Phi_u(T) \big) \ge \sigma_{\min}(B_S)^2 \frac{(\Phi_u(S) - \Phi_u(T))^3}{16 \Phi_u(S)^2}.$$

Proof. For convenience in the analysis, we define a function $g : (-\infty, \|u\|_2^2] \to \mathbb{R}_{\ge 0}$ by $g(x) = \sqrt{\|u\|_2^2 - x}$. Note that $|g'(x)| = \frac{1}{2\sqrt{\|u\|_2^2 - x}}$.
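$$\begin{aligned}
\sum_{i=1}^{k} \big( \Phi_u(T_i') - \Phi_u(T) \big)
&= \sum_{i=1}^{k} \big( \|u - \pi_T u\|_2 - \|u - \pi_{T_i'} u\|_2 \big) && \text{(by definition of } \Phi\text{)} \\
&= \sum_{i=1}^{k} \Big( \sqrt{\|u\|_2^2 - \|\pi_T u\|_2^2} - \sqrt{\|u\|_2^2 - \|\pi_{T_i'} u\|_2^2} \Big) && \text{(by the Pythagorean Theorem)} \\
&= \sum_{i=1}^{k} \big( g(\|\pi_T u\|_2^2) - g(\|\pi_{T_i'} u\|_2^2) \big) \\
&\ge \sum_{i=1}^{k} |g'(\|\pi_T u\|_2^2)| \big( \|\pi_{T_i'} u\|_2^2 - \|\pi_T u\|_2^2 \big) && \text{(by the Mean Value Theorem)} \\
&= \sum_{i=1}^{k} \frac{1}{2\sqrt{\|u\|_2^2 - \|\pi_T u\|_2^2}} \big( \|\pi_{T_i'} u\|_2^2 - \|\pi_T u\|_2^2 \big) \\
&= \frac{1}{2\|u - \pi_T u\|_2} \sum_{i=1}^{k} \big( \|\pi_{T_i'} u\|_2^2 - \|\pi_T u\|_2^2 \big) && \text{(by the Pythagorean Theorem)} \\
&\ge \frac{1}{2\|u - \pi_T u\|_2} \cdot \sigma_{\min}(B_S)^2 \frac{(\|\pi_S u\|_2^2 - \|\pi_T u\|_2^2)^2}{4\|\pi_S u\|_2^2} && \text{(by Lemma 7.1)} \\
&= \sigma_{\min}(B_S)^2 \frac{(\|u - \pi_T u\|_2^2 - \|u - \pi_S u\|_2^2)^2}{8 \|\pi_S u\|_2^2 \, \|u - \pi_T u\|_2} && \text{(by the Pythagorean Theorem)} \\
&= \frac{\sigma_{\min}(B_S)^2}{8 \|\pi_S u\|_2^2 \, \|u - \pi_T u\|_2} \cdot (\|u - \pi_T u\|_2 - \|u - \pi_S u\|_2)^2 \cdot (\|u - \pi_T u\|_2 + \|u - \pi_S u\|_2)^2 \\
&\ge \sigma_{\min}(B_S)^2 \frac{(\Phi_u(S) - \Phi_u(T))^2 \cdot \|u - \pi_T u\|_2}{8 \|\pi_S u\|_2^2} && \text{(since } \|u - \pi_S u\|_2 \ge 0\text{)}
\end{aligned}$$

We now lower bound $\frac{\|u - \pi_T u\|_2}{\|\pi_S u\|_2^2}$ as follows.

$$\begin{aligned}
\frac{\|u - \pi_T u\|_2}{\|\pi_S u\|_2^2}
&= \frac{\|u - \pi_T u\|_2}{\|u\|_2^2 - \|u - \pi_S u\|_2^2} && \text{(by the Pythagorean Theorem)} \\
&= \frac{\|u - \pi_T u\|_2}{\|u\|_2 - \|u - \pi_S u\|_2} \cdot \frac{1}{\|u\|_2 + \|u - \pi_S u\|_2} \\
&= \frac{1}{\Phi_u(S)} \cdot \frac{\|u - \pi_T u\|_2}{\|u\|_2 + \|u - \pi_S u\|_2} && \text{(by definition of } \Phi\text{)} \\
&\ge \frac{1}{2\Phi_u(S)} \cdot \frac{\|u - \pi_T u\|_2}{\|u\|_2} && \text{(since } \|u - \pi_S u\|_2 \le \|u\|_2\text{)} \\
&= \frac{1}{2\Phi_u(S)} \cdot \frac{\|u\|_2 - \Phi_u(T)}{\|u\|_2} && \text{(by definition of } \Phi\text{)} \\
&= \frac{1}{2\Phi_u(S)} \cdot \Big( 1 - \frac{\Phi_u(T)}{\|u\|_2} \Big) \\
&\ge \frac{1}{2\Phi_u(S)} \cdot \Big( 1 - \frac{\Phi_u(T)}{\Phi_u(S)} \Big) && \text{(since } \Phi_u(S) \le \|u\|_2\text{)} \\
&= \frac{\Phi_u(S) - \Phi_u(T)}{2\Phi_u(S)^2}
\end{aligned}$$

Therefore,

$$\sum_{i=1}^{k} \big( \Phi_u(T_i') - \Phi_u(T) \big) \ge \sigma_{\min}(B_S)^2 \frac{(\Phi_u(S) - \Phi_u(T))^2 \cdot \|u - \pi_T u\|_2}{8 \|\pi_S u\|_2^2} \ge \sigma_{\min}(B_S)^2 \frac{(\Phi_u(S) - \Phi_u(T))^3}{16 \Phi_u(S)^2}.$$

Lemma 7.3. (Utility Improvement by Projecting a Matrix) Let $A \in \mathbb{R}^{d \times n}$, and let $T, S \subset [n]$ be two sets of column indices, with $\Phi_A(S) \ge \Phi_A(T)$. Furthermore, let $k = |S|$. Then, there exists a column index $i \in S$ such that

$$\Phi_A(T \cup \{i\}) - \Phi_A(T) \ge \sigma_{\min}(B_S)^2 \frac{(\Phi_A(S) - \Phi_A(T))^3}{16 k \Phi_A(S)^2}.$$

Proof. The proof mostly follows the proof of Lemma 1 in Altschuler et al. (2016). We combine Lemma 7.2 with Jensen's inequality to conclude an improvement of the utility function with one additional column when projecting a matrix instead of a single column. For $j \in [n]$, we define $\delta_j = \min\big(1, \frac{\Phi_{A_{*j}}(T)}{\Phi_{A_{*j}}(S)}\big)$.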
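Note that $\delta_j$ is 1 if the $j$-th column $A_{*j}$ has a larger projection onto $B_T$ than onto $B_S$, and $\frac{\Phi_{A_{*j}}(T)}{\Phi_{A_{*j}}(S)}$ otherwise. Let $k = |S|$. For $i \in [k]$, let $T_i' = T \cup \{i\}$.

$$\begin{aligned}
\frac{1}{\sigma_{\min}(B_S)^2} \sum_{i=1}^{k} \big( \Phi_A(T_i') - \Phi_A(T) \big)
&= \frac{1}{\sigma_{\min}(B_S)^2} \sum_{j=1}^{n} \sum_{i=1}^{k} \big( \Phi_{A_{*j}}(T_i') - \Phi_{A_{*j}}(T) \big) && \text{(by definition of } \Phi\text{)} \\
&\ge \sum_{j=1}^{n} \frac{(1 - \delta_j)^3}{16} \cdot \Phi_{A_{*j}}(S) && \text{(by Lemma 7.2)} \\
&= \frac{\Phi_A(S)}{16} \sum_{j=1}^{n} (1 - \delta_j)^3 \cdot \frac{\Phi_{A_{*j}}(S)}{\sum_{i=1}^{n} \Phi_{A_{*i}}(S)} && \text{(note } \Phi_A(S) = \textstyle\sum_{i=1}^{n} \Phi_{A_{*i}}(S)\text{)} \\
&\ge \frac{\Phi_A(S)}{16} \Big( \sum_{j=1}^{n} (1 - \delta_j) \cdot \frac{\Phi_{A_{*j}}(S)}{\sum_{i=1}^{n} \Phi_{A_{*i}}(S)} \Big)^3 && \text{(by Jensen's Inequality)} \\
&= \frac{1}{16 \Phi_A(S)^2} \Big( \sum_{j=1}^{n} (1 - \delta_j) \cdot \Phi_{A_{*j}}(S) \Big)^3 \\
&\ge \frac{1}{16 \Phi_A(S)^2} \Big( \sum_{j=1}^{n} \big( \Phi_{A_{*j}}(S) - \Phi_{A_{*j}}(T) \big) \Big)^3 \\
&= \frac{(\Phi_A(S) - \Phi_A(T))^3}{16 \Phi_A(S)^2}
\end{aligned}$$

The second-to-last step holds since $1 - \delta_j \ge 1 - \frac{\Phi_{A_{*j}}(T)}{\Phi_{A_{*j}}(S)}$ implies $(1 - \delta_j) \cdot \Phi_{A_{*j}}(S) \ge \Phi_{A_{*j}}(S) - \Phi_{A_{*j}}(T)$. Hence,

$$\sum_{i=1}^{k} \big( \Phi_A(T_i') - \Phi_A(T) \big) \ge \sigma_{\min}(B_S)^2 \frac{(\Phi_A(S) - \Phi_A(T))^3}{16 \Phi_A(S)^2}.$$

This implies there is at least one column of $B_S$, with index $i \in S$, such that when $i$ is added to $T$, the utility function $\Phi_A(T)$ increases by at least $\frac{1}{k} \cdot \sigma_{\min}(B_S)^2 \frac{(\Phi_A(S) - \Phi_A(T))^3}{16 \Phi_A(S)^2}$.

Theorem 7. Let $A \in \mathbb{R}^{d \times n}$ be the data matrix and $k \in \mathbb{N}$ be the desired rank. Let $A_S$ be the best possible subset of $k$ columns, i.e., $A_S = \arg\min_{A_S} \min_V \|A_S V - A\|_{1,2}$. Let $\sigma$ be the minimum non-zero singular value of the matrix $B$ of normalized columns of $A_S$ (the $j$-th column of $B$ is $B_{*j} = (A_S)_{*j} / \|(A_S)_{*j}\|_2$). Then, if $T \subset [n]$ is the subset of columns selected by Greedy $k$-CSS$_{1,2}$, the following holds with $|T| = \Omega(\frac{k}{\sigma^2 \varepsilon^2})$,

$$\min_V \|A_T V - A\|_{1,2} \le (1 - \varepsilon) \min_{S \subset [n], |S| = k, V \in \mathbb{R}^{k \times n}} \|A_S V - A\|_{1,2} + \varepsilon \|A\|_{1,2}.$$

Proof. The proof follows that of Theorem 1 in Altschuler et al. (2016). Let $T_t$ be the subset of columns of $B$ selected by Greedy $k$-CSS$_{1,2}$ after $t$ iterations. Notice that $T_0 = \emptyset$. In addition, define $F = \Phi_A(S) = \Phi_A(S) - \Phi_A(T_0)$, $\Delta_0 = F$, and $\Delta_i = \frac{\Delta_0}{2^i}$ for $i \in \mathbb{N}$. Let $\Delta_i \ge \Phi_A(S) - \Phi_A(T_t) \ge \Delta_{i+1} = \frac{\Delta_i}{2}$. Our goal is to bound the number of iterations needed for the gap between $\Phi_A(S)$ and $\Phi_A(T_t)$ to become less than $\frac{\Delta_i}{2}$. Consider some iteration $s$ for which $\Phi_A(S) - \Phi_A(T_s) \ge \frac{\Delta_i}{2}$.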
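The improvement of the utility function after adding a column of $B$ to $T_s$ through greedy selection is at least the improvement of the utility function after adding the best column of $B_S$ to $T_s$. By Lemma 7.3, this is at least

$$\sigma^2 \frac{(\Phi_A(S) - \Phi_A(T_s))^3}{16 k \Phi_A(S)^2} \ge \frac{\sigma^2 \cdot \Delta_i^3}{16 \cdot 8 \cdot k \cdot F^2} = \frac{\sigma^2 \Delta_i^3}{128 k F^2}.$$

If the gap $\Phi_A(S) - \Phi_A(T_t)$ after $t$ iterations is at most $\Delta_i$ and at least $\frac{\Delta_i}{2}$, then after at most $\frac{64 k F^2}{\sigma^2 \Delta_i^2}$ iterations, the gap becomes at most $\frac{\Delta_i}{2}$. We can use this to bound the number of iterations required for the gap to become at most $\varepsilon F$. Take $N \in \mathbb{N}$ such that $\Delta_{N+1} \le \varepsilon F \le \Delta_N$. Then the number of iterations required for the gap to become at most $\Delta_{N+1}$ is at most

$$\begin{aligned}
\sum_{i=0}^{N} \frac{64 k F^2}{\sigma^2 \Delta_i^2}
&= \frac{64 k F^2}{\sigma^2} \sum_{i=0}^{N} \frac{1}{\Delta_i^2} \\
&= \frac{64 k F^2}{\sigma^2 \Delta_{N+1}^2} \sum_{i=0}^{N} \frac{1}{4^{N+1-i}} && \text{(since } \Delta_{N+1} = \Delta_i / 2^{N+1-i}\text{)} \\
&\le \frac{256 k}{3 \sigma^2 \varepsilon^2} && \text{(since } \Delta_{N+1} \ge \tfrac{\varepsilon F}{2} \text{ and } \textstyle\sum_{i=0}^{N} \frac{1}{4^{N+1-i}} \le \frac{1}{3}\text{)}
\end{aligned}$$

Therefore, after $|T| = \Omega(\frac{k}{\sigma^2 \varepsilon^2})$ iterations, we have

$$\begin{aligned}
& \Phi_A(S) - \Phi_A(T) \le \varepsilon \Phi_A(S) \\
\Rightarrow\ & \|A\|_{1,2} - \|A - \pi_S A\|_{1,2} - (\|A\|_{1,2} - \|A - \pi_T A\|_{1,2}) \le \varepsilon (\|A\|_{1,2} - \|A - \pi_S A\|_{1,2}) \\
\Rightarrow\ & \|A - \pi_T A\|_{1,2} \le (1 - \varepsilon) \|A - \pi_S A\|_{1,2} + \varepsilon \|A\|_{1,2}
\end{aligned}$$

Since $S$ is the set of indices for the best possible subset of $k$ columns, the above is equivalent to

$$\min_V \|A_T V - A\|_{1,2} \le (1 - \varepsilon) \min_{S \subset [n], |S| = k, V \in \mathbb{R}^{k \times n}} \|A_S V - A\|_{1,2} + \varepsilon \|A\|_{1,2}.$$

We now analyze Lazier-than-lazy Greedy $k$-CSS$_{1,2}$ (Algorithm 6), based on the analysis of the Lazier-than-lazy greedy heuristic in Altschuler et al. (2016). The first step is the following lemma, based on Lemma 6 of Altschuler et al. (2016), which shows that in expectation, the utility improves by a large amount on each iteration.

$$\mathbb{E}\big[\max_{i \in \overline{T}} \Phi_A(T \cup \{i\})\big] - \Phi_A(T) \ge (1 - \delta) \cdot \sigma_{\min}(B_S)^2 \cdot \frac{(\Phi_A(S) - \Phi_A(T))^3}{16 k \Phi_A(S)^2}$$

Proof. The proof is nearly identical to the proof of Lemma 6 of Altschuler et al. (2016); we include the full proof for completeness. The first step in the proof is showing that $\overline{T} \cap (S \setminus T)$ is nonempty with high probability. Then, by conditioning on $\overline{T} \cap (S \setminus T)$ being nonempty, we can show that the expected increase in utility is large.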
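For the purpose of this analysis, we assume that the columns of $\overline{T}$ are sampled independently with replacement. At the end of the proof, we discuss sampling the columns of $\overline{T}$ without replacement. First, observe that

$$\begin{aligned}
\Pr[\overline{T} \cap (S \setminus T) = \emptyset]
&= \prod_{t=1}^{O(\frac{n \log(1/\delta)}{k})} \Big( 1 - \frac{|S \setminus T|}{n - |T|} \Big) = \Big( 1 - \frac{|S \setminus T|}{n - |T|} \Big)^{O(\frac{n \log(1/\delta)}{k})} \\
&\le e^{-\frac{|S \setminus T|}{n - |T|} \cdot \frac{n \log(1/\delta)}{k}} && \text{(by } 1 - x \le e^{-x}\text{)} \\
&\le e^{-\frac{|S \setminus T| \log(1/\delta)}{k}} && \text{(because } n - |T| < n\text{)}
\end{aligned}$$

meaning that

$$\Pr[\overline{T} \cap (S \setminus T) \ne \emptyset] \ge 1 - e^{-\frac{|S \setminus T| \log(1/\delta)}{k}} = 1 - \delta^{\frac{|S \setminus T|}{k}} \ge (1 - \delta) \frac{|S \setminus T|}{k},$$

since $|S \setminus T| \le k$, and $1 - \delta^x \ge (1 - \delta) x$ for $x, \delta \in [0, 1]$. Therefore,

$$\begin{aligned}
\mathbb{E}\big[\max_{i \in \overline{T}} \Phi_A(T \cup \{i\}) - \Phi_A(T)\big]
&\ge \Pr[\overline{T} \cap (S \setminus T) \ne \emptyset] \cdot \mathbb{E}\big[\max_{i \in \overline{T}} \Phi_A(T \cup \{i\}) - \Phi_A(T) \,\big|\, \overline{T} \cap (S \setminus T) \ne \emptyset\big] \\
&\ge (1 - \delta) \frac{|S \setminus T|}{k} \cdot \mathbb{E}\big[\max_{i \in \overline{T}} \Phi_A(T \cup \{i\}) - \Phi_A(T) \,\big|\, \overline{T} \cap (S \setminus T) \ne \emptyset\big] \\
&\ge (1 - \delta) \frac{|S \setminus T|}{k} \cdot \mathbb{E}\big[\max_{i \in \overline{T}} \Phi_A(T \cup \{i\}) - \Phi_A(T) \,\big|\, |\overline{T} \cap (S \setminus T)| = 1\big] \\
&= (1 - \delta) \frac{|S \setminus T|}{k} \cdot \frac{\sum_{i \in S \setminus T} (\Phi_A(T \cup \{i\}) - \Phi_A(T))}{|S \setminus T|} \\
&= (1 - \delta) \cdot \frac{\sum_{i \in S} (\Phi_A(T \cup \{i\}) - \Phi_A(T))}{|S|} \\
&\ge (1 - \delta) \cdot \frac{1}{|S|} \cdot \sigma_{\min}(B_S)^2 \frac{(\Phi_A(S) - \Phi_A(T))^3}{16 \Phi_A(S)^2}
\end{aligned}$$

Here the third inequality holds since it is always better for $\overline{T} \cap (S \setminus T)$ to be larger, the second equality holds since $k = |S|$ and $\Phi_A(T \cup \{i\}) = \Phi_A(T)$ for $i \in T$, and the last inequality follows from the proof of Lemma 7.3. This proves the lemma in the case where the columns are sampled with replacement.

Now, we discuss what happens when sampling without replacement. Note that the expected increase in utility can only be higher if the columns of $\overline{T}$ are sampled without replacement. Intuitively, this is because if $\overline{T}$ has some repeated columns, then it is always better to replace those repeated columns with other columns of $A$. Thus, for each instance of $\overline{T}$ where some columns are sampled multiple times, we can "move" all of the probability mass from this instance of $\overline{T}$ to other sets $\overline{T}' \subset [n] \setminus T$ which contain $\overline{T}$ but do not have repeated elements. This leads to the uniform distribution on subsets of $[n] \setminus T$ with no repeated elements, i.e., the distribution that results from sampling without replacement.

Using this lemma, we analyze the convergence of Lazier-than-lazy Greedy $k$-CSS$_{1,2}$ (Algorithm 6):

Theorem 8.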
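Let $A \in \mathbb{R}^{d \times n}$ be the data matrix and $k \in \mathbb{N}$ the desired rank. Let $A_S$ be the best subset of $k$ columns, i.e., $A_S = \arg\min_{A_S} \min_V \|A_S V - A\|_{1,2}$. Let $\sigma$ be the minimum non-zero singular value of the matrix $B$ of normalized columns of $A_S$ (meaning the $j$-th column of $B$ is $B_{*j} = (A_S)_{*j} / \|(A_S)_{*j}\|_2$). Then, if $T \subset [n]$ is the subset of columns selected by Lazier-than-lazy Greedy $k$-CSS$_{1,2}$ (Algorithm 6), the following holds if $|T| = \Omega(\frac{k}{\sigma^2 \varepsilon^2})$:

$$\mathbb{E}\big[\min_V \|A_T V - A\|_{1,2}\big] \le (1 - \varepsilon) \min_{S \subset [n], |S| = k, V \in \mathbb{R}^{k \times n}} \|A_S V - A\|_{1,2} + \varepsilon \|A\|_{1,2}.$$

Proof. The proof uses the same strategy as that of Theorem 5 of Altschuler et al. (2016) (and Theorem 7 above), with minor modifications. Let $T_t$ be the subset of columns of $B$ selected by the algorithm after $t$ iterations (in particular, $T_0 = \emptyset$). In addition, let $F = \Phi_A(S) = \Phi_A(S) - \Phi_A(T_0)$, $\Delta_0 = F$, and $\Delta_{i+1} = \frac{\Delta_i}{2}$. Now, fix a time $t$ such that for some $i$, $\Delta_i \ge \Phi_A(S) - \Phi_A(T_t) \ge \Delta_{i+1} = \frac{\Delta_i}{2}$. Then, we bound the number of additional iterations $t'$ needed so that

$$\mathbb{E}[\Phi_A(S) - \Phi_A(T_{t+t'}) \mid T_t] < \Delta_{i+1}.$$

For convenience, for each $k' \ge 0$, define $E_{k'} := \mathbb{E}[\Phi_A(T_{t+k'}) \mid T_t]$. Then, our goal is to find $t'$ such that $\Phi_A(S) - E_{t'} < \Delta_{i+1}$. Observe that from Lemma 7.4 above, we obtain

$$\begin{aligned}
E_{k'+1} - E_{k'}
&= \mathbb{E}\big[\Phi_A(T_{t+k'+1}) - \Phi_A(T_{t+k'}) \,\big|\, T_t\big] \\
&= \mathbb{E}\Big[\mathbb{E}\big[\Phi_A(T_{t+k'+1}) - \Phi_A(T_{t+k'}) \,\big|\, T_{t+k'}\big] \,\Big|\, T_t\Big] && \text{(by } \mathbb{E}[\mathbb{E}[X \mid Y]] = \mathbb{E}[X]\text{)} \\
&\ge \mathbb{E}\Big[(1 - \delta) \cdot \sigma_{\min}(B_S)^2 \cdot \frac{(\Phi_A(S) - \Phi_A(T_{t+k'}))^3}{16 k \Phi_A(S)^2} \,\Big|\, T_t\Big] && \text{(by Lemma 7.4)} \\
&= \frac{(1 - \delta) \cdot \sigma_{\min}(B_S)^2}{16 k \Phi_A(S)^2} \cdot \mathbb{E}\big[(\Phi_A(S) - \Phi_A(T_{t+k'}))^3 \,\big|\, T_t\big] \\
&\ge \frac{(1 - \delta) \cdot \sigma_{\min}(B_S)^2}{16 k \Phi_A(S)^2} \cdot \mathbb{E}\big[\Phi_A(S) - \Phi_A(T_{t+k'}) \,\big|\, T_t\big]^3 && \text{(by Jensen's Inequality)} \\
&= (1 - \delta) \cdot \sigma_{\min}(B_S)^2 \cdot \frac{(\Phi_A(S) - E_{k'})^3}{16 k \Phi_A(S)^2}
\end{aligned}$$

Now, suppose that $\Delta_i \ge \Phi_A(S) - E_s \ge \Delta_{i+1}$ for $s = 0, \ldots, t' - 1$. Then, for all such $s$, $E_{s+1} - E_s \ge \frac{(1 - \delta) \sigma_{\min}(B_S)^2 \Delta_{i+1}^3}{16 k F^2}$. Summing these inequalities for $s = 0, \ldots, t' - 1$, we find that

$$E_{t'} - E_0 \ge \frac{(1 - \delta) \sigma_{\min}(B_S)^2}{16 k F^2} \cdot \Delta_{i+1}^3 \cdot t',$$

and for the increase from $E_0$ to $E_{t'}$ to be greater than $\Delta_{i+1}$, it suffices to have

$$t' \ge \frac{32 k F^2}{\Delta_{i+1}^2 \cdot (1 - \delta) \sigma_{\min}(B_S)^2}.$$

In summary, if $\Phi_A(S) - \mathbb{E}[\Phi_A(T_t)] \le \Delta_i$, then in at most $\frac{32 k F^2}{\Delta_{i+1}^2 \cdot (1 - \delta) \sigma_{\min}(B_S)^2}$ iterations, $\Phi_A(S) - \mathbb{E}[\Phi_A(T_t)] \le \Delta_{i+1}$. Thus, if we let $N \in \mathbb{N}$ be such that $\Delta_{N+1} \le \frac{\varepsilon}{\sqrt{1 - \delta}} F \le \Delta_N$, then the number of iterations $t$ needed to have $\Phi_A(S) - \mathbb{E}[\Phi_A(T_t)] < \Delta_{N+1}$ is at most

$$\begin{aligned}
\sum_{i=0}^{N} \frac{32 k F^2}{\Delta_{i+1}^2 \cdot (1 - \delta) \sigma_{\min}(B_S)^2}
&= \frac{32 k F^2}{(1 - \delta) \sigma_{\min}(B_S)^2} \sum_{i=0}^{N} \frac{1}{\Delta_{i+1}^2} \\
&= \frac{32 k F^2}{(1 - \delta) \sigma_{\min}(B_S)^2} \sum_{i=0}^{N} \frac{1}{4^{N-i}} \cdot \frac{1}{\Delta_{N+1}^2} \\
&\le \frac{32 k F^2}{(1 - \delta) \sigma_{\min}(B_S)^2} \cdot \frac{4 (1 - \delta)}{\varepsilon^2 F^2} \sum_{i=0}^{N} \frac{1}{4^{N-i}} && \text{(since } \Delta_{N+1} \ge \tfrac{\varepsilon F}{2\sqrt{1 - \delta}}\text{)} \\
&= \frac{128 k}{\sigma_{\min}(B_S)^2 \varepsilon^2} \sum_{i=0}^{N} \frac{1}{4^i} \le \frac{512 k}{3 \sigma_{\min}(B_S)^2 \varepsilon^2}
\end{aligned}$$

Thus, after $t = O(\frac{k}{\sigma_{\min}(B_S)^2 \varepsilon^2})$ iterations,

$$\Phi_A(S) - \mathbb{E}[\Phi_A(T_t)] \le \frac{\varepsilon}{\sqrt{1 - \delta}} \Phi_A(S),$$

meaning

$$\|A\|_{1,2} - \|A - \pi_S A\|_{1,2} - \mathbb{E}\big[\|A\|_{1,2} - \|A - \pi_T A\|_{1,2}\big] \le \frac{\varepsilon}{\sqrt{1 - \delta}} \|A\|_{1,2} - \frac{\varepsilon}{\sqrt{1 - \delta}} \|A - \pi_S A\|_{1,2},$$

and rearranging gives

$$\mathbb{E}\big[\|A - \pi_T A\|_{1,2}\big] \le \Big( 1 - \frac{\varepsilon}{\sqrt{1 - \delta}} \Big) \|A - \pi_S A\|_{1,2} + \frac{\varepsilon}{\sqrt{1 - \delta}} \|A\|_{1,2}.$$

This completes the proof (note that we can select $\delta = \varepsilon$, meaning $\frac{1}{\sqrt{1 - \delta}} = O(1)$ for $\varepsilon < \frac{1}{2}$).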
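As a numerical sanity check of the phase-counting step in the proof above, the following sketch sums the per-phase iteration counts $\frac{32 k F^2}{\Delta_{i+1}^2 (1-\delta) \sigma^2}$ with $\Delta_i = F / 2^i$ and compares the total against the closed form $\frac{512 k}{3 \sigma^2 \varepsilon^2}$. The function name and the parameter values are illustrative assumptions; $F$ cancels because $F^2 / \Delta_{i+1}^2 = 4^{i+1}$.

```python
import math

def lazier_greedy_iteration_bound(k, sigma, eps, delta):
    """Return (sum over phases, closed-form bound) for the phase argument."""
    # Stop once the gap is at most target * F, where target = eps / sqrt(1 - delta).
    target = eps / math.sqrt(1.0 - delta)
    assert 0.0 < target <= 1.0
    # Choose N with Delta_{N+1} <= target * F <= Delta_N, i.e. 2^N <= 1/target <= 2^(N+1).
    N = math.floor(math.log2(1.0 / target))
    # Phase i costs 32 k F^2 / (Delta_{i+1}^2 (1 - delta) sigma^2), with F^2 / Delta_{i+1}^2 = 4^(i+1).
    per_phase = [32.0 * k * 4 ** (i + 1) / ((1.0 - delta) * sigma ** 2)
                 for i in range(N + 1)]
    closed_form = 512.0 * k / (3.0 * sigma ** 2 * eps ** 2)
    return sum(per_phase), closed_form

total, bound = lazier_greedy_iteration_bound(k=10, sigma=0.5, eps=0.1, delta=0.1)
print(total <= bound)
```

The same computation with the constant 64 in place of 32, $\Delta_i$ in place of $\Delta_{i+1}$, and $\delta = 0$ gives the corresponding check for the bound $\frac{256 k}{3 \sigma^2 \varepsilon^2}$ in Theorem 7.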

E EXTENSION TO THE STREAMING MODEL

In this section, we describe how our protocol in Algorithm 1 can be made into a 1-pass streaming algorithm for column subset selection in the $\ell_p$ norm. The algorithm is shown in Algorithm 7, and is analyzed in Theorem 9 below. The algorithm and its analysis follow the standard merge-and-reduce framework (see McGregor (2014)).

Theorem 9 (Analysis of Algorithm 7). Let $A \in \mathbb{R}^{d \times n}$ and $k \in \mathbb{N}$, and assume Algorithm 7 sees the columns of $A$ one at a time in the stream $\mathcal{S}$. Then, Algorithm 7 returns $U \in \mathbb{R}^{d \times O(k)}$ such that

$$\min_{V \in \mathbb{R}^{O(k) \times n}} \|UV - A\|_p \le O(k^{1/p - 1/2}) \min_{A_k \text{ rank } k} \|A_k - A\|_p$$

with probability 0.9. Moreover, the space complexity of Algorithm 7 is $O(dk)$.

Proof. Let $r = O(k)$ be the bi-criteria rank of Algorithm 4. Then, we will be repeatedly applying Lemma 4 with $k$ being equal to $r$, i.e., we will create coresets which preserve the errors when projecting onto all subspaces of dimension $r$. Now, at every iteration of Algorithm 7, each element $(L, t)$ of $\mathcal{L}$ can be thought of as holding a coreset of the columns of $A$ in some interval $I$ in $[n]$. We prove the following intermediate lemma by induction on $t$:

Lemma 10. Let $B$ be the concatenation of all the sketched columns in $L$ (which have been reweighted by multiple applications of Lemma 4). Then, for all subspaces $V \subset \mathbb{R}^{O(k)}$ of dimension at most $r$,

$$\|B - P_V B\|_{p,2} = \Big( 1 \pm \frac{1}{\log n} \Big)^{t} \|SA_I - P_V SA_I\|_{p,2}.$$

Proof. We proceed by induction on $t$. The lemma is clear when $t = 0$, since in that case, the columns of $B$ are simply sketched columns of $A$ which have not been re-weighted. Now, suppose $t > 0$, and suppose the lemma holds for smaller values of $t$. Then, the sketched columns in $L$ must have been obtained as follows: there previously existed two elements $(L_1, t - 1)$ and $(L_2, t - 1)$ of $\mathcal{L}$, such that if $B_1$ is the concatenation of the sketched columns in $L_1$, $B_2$ is the concatenation of the sketched columns in $L_2$, and $T$ is a coreset for the concatenation $B_3$ of $B_1$ and $B_2$, then $B = B_3 T$.
The sketched columns in $L_1$ and $L_2$ form coresets for two intervals in $[n]$, which we denote by $I_1$ and $I_2$ respectively. Applying Lemma 4, we find that

$$\begin{aligned}
\|B - P_V B\|_{p,2}^p
&= \Big( 1 \pm \frac{1}{\log n} \Big)^{p} \|B_3 - P_V B_3\|_{p,2}^p \\
&= \Big( 1 \pm \frac{1}{\log n} \Big)^{p} \Big( \|B_1 - P_V B_1\|_{p,2}^p + \|B_2 - P_V B_2\|_{p,2}^p \Big) \\
&= \Big( 1 \pm \frac{1}{\log n} \Big)^{p} \cdot \Big( 1 \pm \frac{1}{\log n} \Big)^{p(t-1)} \Big( \|SA_{I_1} - P_V SA_{I_1}\|_{p,2}^p + \|SA_{I_2} - P_V SA_{I_2}\|_{p,2}^p \Big) \\
&= \Big( 1 \pm \frac{1}{\log n} \Big)^{pt} \cdot \|SA_I - P_V SA_I\|_{p,2}^p
\end{aligned}$$

where the second and last equalities hold because the $p$-th power of the $\ell_{p,2}$ norm of a matrix decomposes across the columns of the matrix, and the third equality is by the induction hypothesis. By taking $p$-th roots, we find that

$$\|B - P_V B\|_{p,2} = \Big( 1 \pm \frac{1}{\log n} \Big)^{t} \|SA_I - P_V SA_I\|_{p,2}.$$

This proves the lemma.

Now, if $L$ is the unique element of $\mathcal{L}$ remaining at the end of the for loop in Algorithm 7, let $B$ be the concatenation of the sketched columns in $L$. Then, $L$ is a coreset for all the columns of $A$, and by the above lemma, the distortion of $L$ is at most $(1 \pm \frac{1}{\log n})^{\log n} = \Theta(1)$, since $t \le \log n$, $(1 + \frac{1}{\log n})^{\log n} \le e$, and $(1 - \frac{1}{\log n})^{\log n} \ge \frac{1}{4}$ once $\log n \ge 2$. (Note that, as in all applications of the merge-and-reduce framework, the coresets contained in $\mathcal{L}$ over the course of the algorithm form a binary tree, with the leaf nodes being contiguous intervals of length $O(k)$.) Hence, for all subspaces $V$ of dimension at most $r$,

$$\|B - P_V B\|_{p,2} = \Theta(1) \|SA - P_V SA\|_{p,2},$$

and in particular, if $M \in \mathbb{R}^{O(k) \times O(k)}$ is the matrix formed by concatenating the columns of $B$ selected by running Algorithm 4, then

$$\|SA - P_M SA\|_{p,2} \le \Theta(1) \min_{T \subset [n], |T| \le k} \|SA - P_T SA\|_{p,2}.$$

Algorithm 7 1-pass streaming algorithm for $k$-CSS$_p$. $\mathcal{L}$ is a collection of coresets of columns. Over the course of the algorithm, each element $L \in \mathcal{L}$ will represent a contiguous subset of $2^t$ columns, for some $t \in [\log n]$. To determine when two coresets should be merged, we will also keep track of the size of each coreset in $\mathcal{L}$. Hence, each element of $\mathcal{L}$ is of the form $(L, t)$, where $L$ is a list of sketched columns (and their corresponding unsketched columns) and $t$ is the number of times this list has been involved in a merging operation.
Input: A stream $\mathcal{S}$ in which the columns of the data matrix $A \in \mathbb{R}^{d \times n}$ arrive one at a time, and the target rank $k \in \mathbb{N}$
Output: The left factor $U \in \mathbb{R}^{d \times O(k)}$
$S \leftarrow$ an $O(k) \times d$ random matrix with i.i.d. standard $p$-stable entries
$\varepsilon \leftarrow \frac{1}{\log n}$
$\delta \leftarrow O(\frac{1}{n})$
$f \leftarrow O(k)$
$\mathcal{L} \leftarrow \emptyset$
for each column $A_{*j}$ that arrives from $\mathcal{S}$ do
    if $\mathcal{L}$ is empty then
        $L \leftarrow \{\}$, where $\{\}$ is the empty list.
    else if the last element $(J, t)$ of $\mathcal{L}$ is such that $t = 0$ (i.e., it has been merged 0 times) then
        $L \leftarrow J$
        Remove $(J, t)$ from the end of $\mathcal{L}$.
    else
        $L \leftarrow \{\}$
    end if
    $L \leftarrow L \cup \{(SA_{*j}, A_{*j})\}$
    $\mathcal{L} \leftarrow \mathcal{L} \cup (L, 0)$
    /*
    $C \leftarrow$ a coreset of $L \cup L'$, computed as specified in Lemma 4. The $k$ in the statement of Lemma 4 will be the bi-criteria rank of Algorithm 4, which is $O(k)$. The parameters $\delta$ and $\varepsilon$ will be as specified at the beginning of this algorithm. (Only compute the coreset of the sketched columns; for those which are included in the coreset, include the corresponding unsketched columns as well, but without re-weighting them.)
    $L \leftarrow (C, t + 1)$. Note that for each of the re-scaled columns that are included in $C$, we include their original indices in $A$ as well.

2016)). Throughout this protocol, GREEDY($A$, $T$, $r$) denotes a single-machine procedure, which does the following: if $A \in \mathbb{R}^{d \times n}$ is a data matrix, and $T \in \mathbb{R}^{d \times t}$ is a set of columns (not necessarily of $A$), then a subset $S$ of $r$ columns of $T$ is constructed iteratively over $r$ steps, such that at each step, the new column of $T$ to add to $S$ is chosen greedily. That is, the chosen column increases $\|\pi_S A\|_2^2$ the most, or equivalently, decreases $\|A - \pi_S A\|_2^2$ the most. In other words, it is the same as our Greedy $k$-CSS$_{1,2}$ algorithm, but for the Frobenius norm rather than the $\ell_{1,2}$-norm.
Input: The data matrix $A \in \mathbb{R}^{d \times n}$, target rank $k \in \mathbb{N}$, the number of servers $s \in \mathbb{N}$. The columns of $A$ are assumed to be randomly partitioned among servers $T_1, T_2, \ldots, T_s$.
Output: A subset of columns $A_T$ from $A$ with $|T| = O(\frac{k}{\sigma})$, where, if $A_{OPT}$ is the optimal subset of columns of $A$ of size $k$ and the columns of $A_{OPT}$ are normalized to have unit $\ell_2$ norm, then $\sigma$ is the smallest singular value of $A_{OPT}$.
$S_i \leftarrow$ GREEDY($A$, $T_i$, $\frac{32k}{\sigma}$) for all $i \in [s]$ (The $i$-th server performs this computation.)
Each server sends its $S_i$ to the coordinator.
$T \leftarrow \cup_{i=1}^{s} S_i$ (This computation is done by the coordinator.)
$S \leftarrow$ GREEDY($A$, $T$, $\frac{12k}{\sigma}$)
The coordinator returns $\arg\max_{S' \in \{S, S_1, S_2, \ldots, S_s\}} \|\pi_{S'} A\|_2^2$

In our implementation of Algorithm 8, we make use of Projection-Cost Preserving sketches (PCPs). Another optimization described in Altschuler et al. (2016), which we use in our implementation, is the LAZIER-THAN-LAZY-GREEDY algorithm. The difference between GREEDY($A$, $T$, $r$) and LAZIER-THAN-LAZY-GREEDY($A$, $T$, $r$) is as follows: at each of the $r$ iterations, GREEDY considers all columns of $T$ and chooses the one that leads to the most improvement in the objective, whereas LAZIER-THAN-LAZY-GREEDY samples $\frac{|T| \log(1/\delta)}{r}$ columns uniformly at random for some small $\delta$ (which we take in our implementation to be 0.005) and, out of those columns, chooses the one which leads to the greatest improvement (where $|T|$ is the number of columns in $T$). This leads to a significant speedup over GREEDY, and it was shown in Altschuler et al. (2016) that this does not significantly worsen the approximation guarantees that Altschuler et al. (2016) show for GREEDY.
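The difference between the two selection rules described above can be sketched as follows. This is a generic illustration under stated assumptions, not the implementation of Altschuler et al. (2016): the function name, the `gain` callback, the use of the natural logarithm in the sample size, and the toy objective are all hypothetical choices made for the example.

```python
import math
import random

def lazier_than_lazy_greedy(candidates, r, gain, delta, rng):
    """Pick r items; each round scores only a random sample of the candidates."""
    # Sample size |T| * log(1/delta) / r, rounded up (natural log assumed here).
    sample_size = math.ceil(len(candidates) * math.log(1.0 / delta) / r)
    selected = []
    for _ in range(r):
        pool = [c for c in candidates if c not in selected]
        if not pool:
            break
        # Plain GREEDY would score every element of `pool` instead of a sample.
        sample = rng.sample(pool, min(sample_size, len(pool)))
        best = max(sample, key=lambda c: gain(selected, c))
        selected.append(best)
    return selected

# Toy modular objective (hypothetical): the gain of an item is its fixed weight.
weights = {"a": 5.0, "b": 1.0, "c": 3.0, "d": 4.0}
picked = lazier_than_lazy_greedy(list(weights), r=2,
                                 gain=lambda chosen, c: weights[c],
                                 delta=0.005, rng=random.Random(0))
print(picked)
```

With only four candidates the sample size exceeds the pool, so this call reduces to plain greedy and picks the two heaviest items; the speedup appears only when $|T|$ is large relative to $\frac{|T| \log(1/\delta)}{r}$.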

F.2 SETUP

We compare our distributed protocol, in the case $p = 1$, to Algorithm 8. For both protocols, at the outset we fix the number of columns which are selected. Algorithm 8 is rewritten as Algorithm 9 to reflect this. In this section, we use $k$ to denote the number of columns ultimately selected.

Algorithm 9 Distributed Greedy Column Subset Selection for the Frobenius Norm (Algorithm 2 of Altschuler et al. (2016)). For our empirical comparison, we fix the number of columns selected at the outset.
Input: The data matrix $A \in \mathbb{R}^{d \times n}$, desired number of columns $k \in \mathbb{N}$, the number of servers $s \in \mathbb{N}$. The columns of $A$ are assumed to be randomly partitioned among servers $T_1, T_2, \ldots, T_s$.
Output: A subset of columns $A_T$ from $A$, with $|T| = k$.
$AR \leftarrow$ a PCP for $A$ as discussed in the previous section. All servers and the coordinator have access to $AR$.
$S_i \leftarrow$ GREEDY($AR$, $T_i$, $k$) for all $i \in [s]$ (The $i$-th server performs this computation.)
Each server sends its $S_i$ to the coordinator.
$T \leftarrow \cup_{i=1}^{s} S_i$ (This computation is done by the coordinator.)
$S \leftarrow$ GREEDY($AR$, $T$, $k$)
The coordinator returns $\arg\max_{S' \in \{S, S_1, S_2, \ldots, S_s\}} \|\pi_{S'}(AR)\|_2$

• secom, a 591 × 1567 matrix dataset. secom has missing entries, which we replace with 0s for the purposes of our experiments. secom is available at https://archive.ics.uci.edu/ml/datasets/SECOM.

F.2.1 PARAMETERS

We compare our distributed protocol for $k$-CSS$_1$, using Greedy $k$-CSS$_{1,2}$ as a subroutine, with Algorithm 9, for several choices of $k$: on gastro lesions, we let $k \in \{10, 20, 30\}$, while on secom, we let $k \in \{30, 60, 90, 120\}$. For our protocol, cauchy size is set to $2k$ and coreset size is set to $5k$ for both datasets (where cauchy size and coreset size have the same meanings as in the main body of this paper). For Algorithm 9, the number of columns in the PCPs is set to $8k$ on gastro lesions, and $7k$ on secom.

The hyperparameters coreset size and cauchy size, and the number of columns in the PCPs, are set this way so that the amount of communication that each algorithm is allowed is roughly equal (Algorithm 9 is allowed slightly more communication). To see this, let $d$ be the number of rows in the data matrix $A$. If $c$ is the number of columns in the PCPs, then the total communication used to transmit the PCPs between servers is $2scd$, since the servers must first send their respective $A_i R_i$ to the coordinator, and the coordinator then sends $AR$ to all of the servers. By comparison, if $r_1$ is the number of rows in the initial Cauchy matrix in our protocol, and $r_2$ is the number of columns in each coreset sent by the servers to the coordinator, then the total communication required to transmit the Cauchy matrix and the coresets is $(r_1 + r_2)sd$. We choose $r_1$, $r_2$ and $c$ so that $2scd$ is slightly higher than $(r_1 + r_2)sd$. Note that if $k$ is the number of columns ultimately selected, then our protocol uses $2dk$ additional bits of communication between the servers and the coordinator to recover the final $k$-subset of columns, while Algorithm 9 will use $sdk$ communication to send the subsets of columns $S_i$ (of size $k$) from the servers to the coordinator. This is not included in our hyperparameter calculations.
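The two communication totals above can be evaluated directly from the stated formulas. The sketch below does so for one configuration; the function names are hypothetical, and taking $d = 591$ (one dimension of secom) together with $s = 2$ and $k = 30$ is an assumption made purely for illustration.

```python
def pcp_communication(s, c, d):
    """Total cost 2*s*c*d of exchanging PCPs with c columns among s servers."""
    return 2 * s * c * d

def sketch_and_coreset_communication(s, r1, r2, d):
    """Total cost (r1 + r2)*s*d of sending the Cauchy matrix and the coresets."""
    return (r1 + r2) * s * d

# Illustrative values only: d = 591 is an assumed row count, s = 2 servers.
s, d, k = 2, 591, 30
c = 7 * k            # number of PCP columns used on secom
r1, r2 = 2 * k, 5 * k  # cauchy size and coreset size
pcp_total = pcp_communication(s, c, d)
ours_total = sketch_and_coreset_communication(s, r1, r2, d)
print(pcp_total, ours_total)
```

Neither total includes the final exchange of the selected columns, which is likewise excluded from the hyperparameter calculations above.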

F.2.2 HOW TRIALS ARE CONDUCTED

With these choices of hyperparameters, we conduct 15 trials as follows. In each trial, if A ∈ R^{d×n} is our data matrix, we shuffle the columns and then divide them equally between 2 servers (since for the theoretical guarantees of Algorithms 8 and 9 to apply, the columns should be partitioned randomly). Using this partition across 2 servers, we run our protocol and Algorithm 9. For each protocol, once the subset of columns is computed, we perform multiple-response ℓ1 regression to evaluate the error. In the next section, we report the minimum error across the 15 trials for each dataset and for each value of k. We also report the mean error, along with the standard deviation. Finally, we compute the work and the span of our protocol and Algorithm 9 using Python's time.process_time() utility. This does not include the time taken to perform multiple-response ℓ1 regression. For secom, with k = 30 and k = 60, the trials were performed on a Late-2016 MacBook Pro with a 2.7 GHz Quad-Core Intel Core i7 processor and 16 GB of memory. The rest were performed on a 2019 MacBook Pro with a 2.4 GHz Intel Core i5 processor and 8 GB of 2133 MHz LPDDR3 memory.
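The trial setup above can be sketched roughly as follows. This is an illustrative sketch only, not the code used in our experiments: the helper names `partition_columns`, `timed`, and `work_and_span` are hypothetical, and we assume the usual convention that work is the total CPU time over all machines while span is the time along the critical path (slowest server plus coordinator).

```python
import random
import time


def partition_columns(n_cols, n_servers, seed=0):
    """Shuffle column indices and split them (near-)equally among servers."""
    rng = random.Random(seed)
    idx = list(range(n_cols))
    rng.shuffle(idx)
    # Striding gives parts whose sizes differ by at most one.
    return [idx[i::n_servers] for i in range(n_servers)]


def timed(fn, *args):
    """Run fn(*args) and return (result, CPU seconds via time.process_time())."""
    start = time.process_time()
    result = fn(*args)
    return result, time.process_time() - start


def work_and_span(server_times, coordinator_time):
    """Assumed convention: work sums all CPU time, span follows the critical path."""
    work = sum(server_times) + coordinator_time
    span = max(server_times) + coordinator_time
    return work, span
```

For instance, `partition_columns(698, 2)` yields two disjoint index lists of 349 columns each, and the per-server times returned by `timed` can be passed to `work_and_span` together with the coordinator's time.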

F.3 RESULTS

Our results are shown and discussed in Figure 3 (minimum, mean, and standard deviation of the ℓ1 error) and Figure 4 (average work and span across 15 trials).

G FULL EXPERIMENTAL RESULTS

For our protocol, we considered various additional hyperparameter settings on real-world datasets (bcsstk13, isolet, and 5 images in the Caltech-101 dataset); these settings are shown in Tables 3, 4, and 5. As before, cauchy size is the number of rows in our initial Cauchy matrix, sent to the servers at the beginning of the protocol, and coreset size is the size of the strong coreset sent by each server to the coordinator. For regular k-CSS1,2, sketch size is the number of rows in the sparse embedding matrix, and sparsity is the number of nonzero entries in each column of the sparse embedding matrix. For each setting and each rank, we display the minimum error, the mean error, and the standard deviation in Figure 5. We also display the average work and span of each setting, for each rank, in Figure 6. We observe that greedy k-CSS1,2 not only performs better than all settings of regular k-CSS1,2 in minimum and mean error across multiple trials, but also has a smaller variance in performance.

G.1 DETAILS OF WORK/SPAN COMPUTATION

In Figure 6, work and span were recorded as the time taken (using Python's time.process_time() utility) to compute the column subset in our distributed protocol. In particular, this does not include the time spent performing ℓ1 regression to obtain the right factor and, consequently, the entrywise ℓ1 errors.



We provide a detailed analysis of our streaming algorithm in Appendix E.
We give this protocol and the analysis in Appendix A.
http://gabrilovich.com/resources/data/techtc/techtc300/techtc300.html
Note that in Algorithm 9, the coordinator now chooses as many columns as are chosen by the servers, as opposed to Algorithm 8, where it chooses a somewhat smaller number of columns. Note that this cannot harm the performance of the algorithm.

We compare our protocol to Algorithm 9, using the following datasets:
• gastro lesions, a 76 × 698 matrix dataset available at https://archive.ics.uci.edu/ml/datasets/Gastrointestinal+Lesions+in+Regular+Colonoscopy.



Figure 1: An overview of the proposed protocol for distributed k-CSS_p in the column partition model. Step 1: Server i applies a dense p-stable sketching matrix S to reduce the row dimension of the data matrix A_i. S is shared between all servers. Step 2: Server i constructs a strong coreset for its sketched data matrix SA_i, which is a subsampled and reweighted set of columns of SA_i. Server i then sends the coreset SA_i T_i, as well as the corresponding unsketched, unweighted columns A_i D_i selected in the strong coreset SA_i T_i, to the coordinator. Step 3: The coordinator concatenates the SA_i T_i column-wise, applies k-CSS_{p,2} to the concatenated columns, and computes the set of indices of the selected columns. Step 4: The coordinator recovers the set of selected columns A_I from the unsketched, unweighted columns A_i D_i through the previously computed indices.

cauchy size, 6k) min(4 × cauchy size, 6k)

(a) synthetic (b) TechTC

Figure 2: Results on synthetic and TechTC. The green line denotes Greedy k-CSS_{1,2}, the orange line denotes Regular k-CSS_{1,2}, and the blue line denotes SVD.

PRELIMINARIES FOR OUR UPPER BOUND PROOFS

B.1 NORMS

Lemma 1. (Norm Relationships) For a matrix $A \in \mathbb{R}^{d \times n}$ and $1 \le p < 2$, $\|A\|_{p,2} \le \|A\|_p$ and $\|A\|_p \le d^{\frac{1}{p} - \frac{1}{2}} \|A\|_{p,2}$.

Proof. Let $x \in \mathbb{R}^d$. For $0 < p < r$,

BOUND ON THE COST - NO CONTRACTION WHEN APPLYING A p-STABLE SKETCH

We show a lower bound on the approximation error for a sketched subset of columns, $\|SA_T V - SA\|_p$, in terms of $\|A_T V - A\|_p$. The lower bound holds simultaneously for any arbitrary subset $A_T$ of chosen columns, and for any arbitrary right factor $V$.
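As a sanity check on Lemma 1, the two inequalities can be verified numerically on small matrices. The sketch below is illustrative only; the function names are hypothetical, and a matrix is represented as a list of rows so that only the Python standard library is needed.

```python
def entrywise_p_norm(A, p):
    """Entrywise p-norm: (sum_{i,j} |A_ij|^p)^(1/p)."""
    return sum(abs(x) ** p for row in A for x in row) ** (1.0 / p)


def p2_norm(A, p):
    """(p,2)-norm: the p-norm of the vector of Euclidean column norms."""
    d, n = len(A), len(A[0])
    col_norms = [sum(A[i][j] ** 2 for i in range(d)) ** 0.5 for j in range(n)]
    return sum(c ** p for c in col_norms) ** (1.0 / p)


def check_lemma1(A, p, tol=1e-9):
    """Check ||A||_{p,2} <= ||A||_p <= d^(1/p - 1/2) ||A||_{p,2} for 1 <= p < 2."""
    d = len(A)
    lower = p2_norm(A, p)
    middle = entrywise_p_norm(A, p)
    upper = d ** (1.0 / p - 0.5) * lower
    return lower <= middle + tol and middle <= upper + tol
```

For the single-column matrix with entries 3 and 4 and p = 1, the (1,2)-norm is 5, the entrywise 1-norm is 7, and the upper bound is √2 · 5 ≈ 7.07, so both inequalities hold.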

Lemma 2.3. (No Contraction for All Sketched Subsets and Columns) Let $A \in \mathbb{R}^{d \times n}$ and $k \in \mathbb{N}$. Let $t = k \cdot \mathrm{poly}(\log nd)$, and let $S \in \mathbb{R}^{t \times d}$ be a matrix whose entries are i.i.d. standard p-stable random variables, rescaled by $\Theta(1/t^{1/p})$. Finally, let $m = k \cdot \mathrm{poly}(\log k)$. Then, with probability $1 - o(1)$, for all $T \subset [n]$ with $|T| = m$, for all $j \in [n]$, and for all $y \in \mathbb{R}^{|T|}$,

$$\|A_T y - A_{*j}\|_p \le \|S(A_T y - A_{*j})\|_p.$$

Proof. Step 1: We first extend Lemma 2.1 and argue, by a net argument, that applying a p-stable sketching matrix $S \in \mathbb{R}^{t \times d}$ does not shrink the norm, i.e., $\|Sy\|_p \ge \|y\|_p$ ($1 \le p < 2$), simultaneously for all $y$ in the column span of $[A_T, A_j] =: A_{T,j}$. In order to bound the p-norm of sketched vectors $y$ in a net, we begin by showing that with high probability all entries in $S$ are bounded. Let $D = \mathrm{poly}(mt)$. Consider the following two cases:

Case 1: p = 1. The entries $S_{ij}$ of the 1-stable sketching matrix are standard Cauchy random variables. Consider the half-Cauchy random variables $X_{i,j} = |S_{i,j}|$. The cumulative distribution function of a half-Cauchy random variable $x$ is F

by a union bound over all entries in $S$, if we define the event $E_1$ to mean that for all $i \in [t]$ and $j \in [m]$ we simultaneously have $|S_{ij}| \le D$, then $\Pr[E_1] \ge 1 - \Theta\left(\frac{mt}{D}\right)$. The event $E_1$ occurring implies that for any $y \in \mathbb{R}^d$, since all entries in $S$ are rescaled by $O(1/t^{1/p})$,

$$\|Sy\|_p \le \|y\|_p \, t^{1/p} \, \|S\|_\infty \le D \|y\|_p.$$

Consider the unit p-ball $B = \{y \in \mathbb{R}^d : \|y\|_p = 1, \ \exists z \in \mathbb{R}^m \text{ s.t. } y = A_{T,j} z\}$ in the column span of $A_{T,j}$. A subset $N \subset B$ is a γ-net for $B$ if for all $y \in B$ there exists some $u \in N$ such that $\|y - u\|_p \le \gamma$, for some distance $\gamma > 0$. There exists such a net $N$ for $B$ of size $|N| = \left(\frac{1}{\gamma}\right)^{O(m)}$ by a standard greedy construction, since the column span of $A_{T,j}$ has dimension at most $m + 1$. We choose $\gamma = \frac{1}{m^2 D}$, and thus $|N| \le (m^2 D)^{O(m)} = D^{O(m \log m)}$.

By applying Lemma 2.1 and a union bound over all vectors $y \in N$, the event $E_2$, that $\|Sy\|_p \ge \|y\|_p$ for all $y \in N$ simultaneously, has probability at least $1 - \frac{D^{O(m \log m)}}{e^t}$.

Consider an arbitrary unit vector $x \in B$. There exists some $y \in N$ such that $\|x - y\|_p \le \gamma = \frac{1}{m^2 D}$. Conditioning on both events $E_1$ and $E_2$, we have the following with probability $1 - \Theta\left(\frac{mt}{D}\right) - \frac{D^{O(m \log m)}}{e^t}$:

$$\begin{aligned}
\|Sx\|_p &\ge \|Sy\|_p - \|S(x - y)\|_p && \text{(triangle inequality)} \\
&\ge \|y\|_p - \|S(x - y)\|_p && \text{(by event } E_2\text{)} \\
&\ge \|y\|_p - D \|x - y\|_p && \text{(implication of event } E_1\text{)} \\
&\ge \|y\|_p - D\gamma && \text{(by } \|x - y\|_p \le \gamma\text{)}
\end{aligned}$$
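To illustrate why the union bound for event E1 yields a failure probability of order mt/D when p = 1, the tail of a half-Cauchy variable can be computed directly. The sketch below is illustrative and rests on two assumptions: it uses the standard closed form Pr[|C| > D] = 1 - (2/π) arctan(D) for an unscaled standard Cauchy variable C, and the function names are hypothetical.

```python
import math
import random


def half_cauchy_tail(D):
    """Pr[|C| > D] for a standard Cauchy C; behaves like 2 / (pi * D) for large D."""
    return 1.0 - (2.0 / math.pi) * math.atan(D)


def union_bound_failure(m, t, D):
    """Union bound on the probability that any of m * t entries exceeds D in magnitude."""
    return min(1.0, m * t * half_cauchy_tail(D))


def empirical_tail(D, samples=100000, seed=0):
    """Monte Carlo estimate of Pr[|C| > D], sampling C by inverse transform."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        c = math.tan(math.pi * (rng.random() - 0.5))
        if abs(c) > D:
            hits += 1
    return hits / samples
```

Since the tail decays only like 1/D, D must be taken polynomially larger than mt for the union bound to be o(1), which is consistent with the choice D = poly(mt) in the proof.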

[n], to argue that $\|S(A_T y - A_{*j})\|_p \ge \|A_T y - A_{*j}\|_p$ holds for all $y \in \mathbb{R}^{|T|}$ and all $T \subset [n]$, $j \in [n]$ with high probability. Note that $|T| = m = O(k \cdot \mathrm{poly}(\log k))$.

[n]: $\|A_T y - A_{*j}\|_p \le \|S(A_T y - A_{*j})\|_p$.

Lemma 2. (Lower Bound for Sketched Error) Let $A \in \mathbb{R}^{d \times n}$ and $k \in \mathbb{N}$. Let $t = k \cdot \mathrm{poly}(\log(nd))$, and let $S \in \mathbb{R}^{t \times d}$ be a matrix whose entries are i.i.d. standard p-stable random variables, rescaled by $\Theta(1/t^{1/p})$. Then, with probability $1 - o(1)$, for all $T \subset [n]$ with $|T| = k \cdot \mathrm{poly}(\log k)$ and for all $V \in \mathbb{R}^{|T| \times n}$,

$$\|A_T V - A\|_p \le \|SA_T V - SA\|_p.$$

Proof. Let $y_j$ denote the j-th column of $V$, where $j \in [n]$. By applying Lemma 2.3 and a union bound over all columns of $V$, for $m = |T| = k \cdot \mathrm{poly}(\log k)$, $t = k \cdot \mathrm{poly}(\log(nd))$ and $D = \mathrm{poly}(mt) = \mathrm{poly}(k \log(nd))$, the following holds with probability 1

Lemma 3.1. (An Upper Bound on the Norm of a Sketched Matrix) Given $A \in \mathbb{R}^{n \times d}$, $p \in [1, 2)$, $U \in \mathbb{R}^{n \times k}$ and $V \in \mathbb{R}^{k \times d}$, if $S \in \mathbb{R}^{t \times n}$ is a dense p-stable matrix whose entries are rescaled by $\Theta(1/t^{1/p})$, then with probability $1 - O(1)$,

$$\|SUV - SA\|_p^p \le O(\log(td)) \, \|UV - A\|_p^p.$$

Here, the failure probability O(1) can be arbitrarily small.

Proof. Lemma E.17 from Song et al. (2017).

Lemma 3. (Upper Bound on Sketched Error) Let $A \in \mathbb{R}^{d \times n}$ and $k \in \mathbb{N}$. Let $t = k \cdot \mathrm{poly}(\log(nd))$, and let $S \in \mathbb{R}^{t \times d}$ be a matrix whose entries are i.i.d. standard p-stable random variables, rescaled by $\Theta(1/t^{1/p})$. Then, for a fixed subset $T \subset [n]$ of columns with $|T| = k \cdot \mathrm{poly}(\log k)$, with probability $1 - O(1)$, we have

be a basis for the column space of A_k. By applying Lemma 5.3 and Lemma 5.4 on the basis V_k, we conclude the above theorem by setting the number m of rows to m = O k log 8 ( k2 ) and sparsity s = poly(log k) in the sparse embedding matrix S.

1 p with probability λ_i, then with probability $1 - o(1)$, the following holds for all $x \in \mathbb{R}^d$:

$$\Omega(1) \|Ax\|_p \le \|SAx\|_p \le O(1) \|Ax\|_p.$$

Proof. Theorem 7.1 from Cohen & Peng (2015).

Theorem 5.7. (Randomized Dvoretzky's Theorem) Let $n \in \mathbb{N}$ and $\varepsilon \in (0, 1)$. Let $r = \frac{n}{\varepsilon^2}$. Let $G \in \mathbb{R}^{r \times n}$ be a random matrix whose entries are i.i.d. standard Gaussian random variables, rescaled by $\frac{1}{\sqrt{r}}$. Then the following holds with probability $1 - e^{-\Theta(n)}$: for all $y \in \mathbb{R}^n$,

$$\|Gy\|_p = (1 \pm \varepsilon) \|y\|_2.$$

Lemma 7.4 (Expected Increase in Utility - Based on Lemma 6 of Altschuler et al. (2016)). Let $A \in \mathbb{R}^{d \times n}$, and let $T, S \subset [n]$ be two sets of column indices, with $k := |S|$ and $\Phi_A(S) \ge \Phi_A(T)$. Let T be a set of $\frac{n \log(1/\delta)}{k}$ column indices of A, chosen uniformly at random from $[n] \setminus T$. Then,

The unique element of L L ← The result of running Algorithm 4 on the sketched columns in L -for each of these sketched columns, store the corresponding unsketched column as well. U ← The d × O(k) matrix whose columns are the unsketched columns in L return U Algorithm 8 Distributed Greedy Column Subset Selection for the Frobenius Norm (Algorithm 2 of Altschuler et al. (

(a) gastro lesions: min error (b) gastro lesions: mean & std error (c) secom: min error (d) secom: mean & std error

Figure 3: Plots (a) and (c) show the minimum ℓ1 error across 15 trials on gastro lesions and secom respectively, while plots (b) and (d) show the mean ℓ1 errors and the standard deviation. For ranks 90 and 120, our protocol affords a 3% improvement in minimum error on secom, while for rank 30 on gastro lesions, our protocol gives over 40% improvement. Note that the mean error for both protocols is noticeably higher on gastro lesions; in the case of our protocol, the error in the case k = 30 is distorted by two trials with ℓ1 errors 205827 and 134945 respectively.

(a) gastro lesions: Average Work (b) gastro lesions: Average Span (c) secom: Average Work (d) secom: Average Span error

Figure 4: Plots (a) and (b) show the average work and span respectively for gastro lesions, while plots (c) and (d) show the average work and span for secom. As expected, our protocol using GREEDY k-CSS_{1,2} takes more time; to our knowledge, there is not yet an optimization similar to LAZIER-THAN-LAZY GREEDY for the ℓ1,2-norm. Nevertheless, running time is less important than communication in the distributed setting.

(a) bcsstk13: min error (b) bcsstk13: mean & std error (c) isolet: min error (d) isolet: mean & std error (e) caltech-101: min error (f) caltech-101: mean & std error

Figure 5: Results on bcsstk13, isolet, and caltech-101 from top to bottom. The left plots show the minimum error across all 15 trials; the right plots show the corresponding mean and standard deviation in error. In all plots, the first bar denotes SVD, the second bar denotes greedy k-CSS_{1,2}, and the rest of the bars denote all settings of regular k-CSS_{1,2}, at all ranks on the axis, 10 through 60.

(a) bcsstk13: work (b) bcsstk13: span (c) isolet: work (d) isolet: span (e) caltech-101: work (f) caltech-101: span

Figure 6: Results on bcsstk13, isolet, and caltech-101 from top to bottom. The left plots show the average work in seconds across all 15 trials; the right plots show the corresponding average span. In all plots, the first bar denotes greedy k-CSS_{1,2}, and the rest of the bars denote all settings of regular k-CSS_{1,2}, at all ranks on the axis, 10 through 60.

2 (Sketched Error Lower Bound). Let A ∈ R d×n and k ∈ N.

Cost. Sharing the dense p-stable sketching matrix S with all servers costs O(sdk • poly(log(nd))) communication (this can be removed with a shared random seed). Sending all coresets SA i T i (∀i ∈ [s]) and the corresponding columns A i D i to the coordinator costs O(sdk)

A summary of datasets used in the experiments.

Let us consider bounding |X * SA -XSA|p,2.

Now, we merge coresets in L as much as possible. */ while True do Exit this while loop if L has only 1 element.

3k) min(20, 5k) min(40, 3k) min(40, 5k) min(2, k/2) min(2, k/3) min(2, k/5) min(5, k/2) min(5, k/3) min(5, k/5)

All hyperparameters used on bcsstk13, when regular k-CSS_{1,2} is used. In this table, k denotes the number of columns ultimately selected. In all settings (including when greedy k-CSS_{1,2}, not included here, is used), cauchy size is either 5k or 8k, and coreset size is 5k. The setting numbers are shown on top; in the following error plots, Setting 0 refers to our protocol when greedy k-CSS_{1,2} is used. Note that sparsity can be at most sketch size, since sketch size is the number of rows of the sparse embedding matrix, while sparsity is the number of nonzero entries in any column. (There is a slight typo in Table 2 in Section 8 of our main paper, where coreset size is given as 10k.)


C.1 SPARSE EMBEDDING MATRICES

The sparse embedding matrix $S \in \mathbb{R}^{O(k) \times d}$ of Nelson & Nguyen (2013), also used by Clarkson & Woodruff (2015), is constructed as follows: each column of S has exactly s non-zero entries, placed in uniformly random locations. Each non-zero entry takes the value $\pm \frac{1}{\sqrt{s}}$ with equal probability. s is also called the sparsity of S. Let h be the hash function that picks the locations of the non-zero entries in each column of S, and let σ be the hash function that determines the sign ± of each non-zero entry.

Applying the sparse embedding matrix S to A enables us to obtain a rank-k right factor that is at most a factor of O(1) worse than the best rank-k approximation error in the ℓp,2 norm. We adapt Theorem 32 from Clarkson & Woodruff (2015) to show this in Theorem 5.5. Notice that in Theorem 32 of Clarkson & Woodruff (2015), the number of rows required for S is $O(k^2)$, but this can be reduced to $O(k)$ through a different choice of hyperparameters when constructing the sparse embedding matrix S.

We note two choices of hyperparameters, i.e., the number m of rows and the sparsity s of S, in Theorem 5.1 and Theorem 5.2, both of which give the same result. The proof of Theorem 32 from Clarkson & Woodruff (2015) uses the hyperparameters from Theorem 5.1. We replace them with the hyperparameters from Theorem 5.2 and show in Lemma 5.3 that O(k) rows of S suffice to preserve certain desired properties. We then combine Lemma 5.3 and Lemma 5.4, adapted from Clarkson & Woodruff (2015), to conclude our result in Theorem 5.5, following the analysis from Clarkson & Woodruff (2015).

Theorem 5.1. (Theorem 3 from Nelson & Nguyen (2013)) For a sparse embedding matrix $S \in \mathbb{R}^{m \times n}$ with sparsity $s = 1$ and a data matrix $U \in \mathbb{R}^{n \times d}$, let $\varepsilon \in (0, 1)$. With probability at least $1 - \delta$, all singular values of $SU$ are $(1 \pm \varepsilon)$ as long as $m \ge \delta^{-1} (d^2 + d) / (2\varepsilon - \varepsilon^2)^2$. For the hash functions used to construct S, σ is 4-wise independent and h is pairwise independent.

Theorem 5.2. (Theorem 9 from Nelson & Nguyen (2013)) For a sparse embedding matrix $S \in \mathbb{R}^{m \times n}$ with sparsity $s = \Theta(\log^3(d/\delta)/\varepsilon)$ and a data matrix $U \in \mathbb{R}^{n \times d}$, let $\varepsilon \in (0, 1)$. With probability at least $1 - \delta$, all singular values of $SU$ are $(1 \pm \varepsilon)$ as long as $m = \Omega(d \log^8(d/\delta)/\varepsilon^2)$. For the hash functions used to construct S, σ and h are both $\Omega(\log(d/\delta))$-wise independent.

Lemma 5.3. Let C be a constraint set and let $A \in \mathbb{R}^{n \times d}$, $B \in \mathbb{R}^{n \times d}$ be two arbitrary matrices. For a sparse embedding matrix $S \in \mathbb{R}^{m \times n}$, there is m = O(), such that with constant probability, the following holds for all $X \in \mathbb{R}^{d \times d}$

Proof. The proof is the same as the proof of Lemma 29 from Clarkson & Woodruff (2015), except that we use a different choice of hyperparameters in constructing S, i.e., the sparsity s and the number m of rows. In the proof of Lemma 29 from Clarkson & Woodruff (2015), the construction of S follows Theorem 5.1, where the sparsity is $s = 1$ but $m = O(d^2)$ rows are required. We replace this construction by that of Theorem 5.2, where we pick $\delta = \varepsilon^{p+1}$. Now the sparsity s is larger, but this construction reduces the number of rows required to $m = O(d)$. Since both choices of hyperparameters for constructing S result in singular values of $SU$ bounded in $(1 \pm \varepsilon)$ for any data matrix U, the rest of the proof follows.

Lemma 5.4. Consider a data matrix $A \in \mathbb{R}^{n \times d}$. Let the best rank-k matrix in the ℓp,2 norm be

Proof. Lemma 31 from Clarkson & Woodruff (2015).

Theorem 5.5. (ℓp,2-Low Rank Approximation) Let the data matrix be $A \in \mathbb{R}^{d \times n}$ and let $k \in \mathbb{N}$ be the desired rank. Let $S \in \mathbb{R}^{m \times d}$ be a sparse embedding matrix with $m = O(k \, \mathrm{poly}(\log k) \, \mathrm{poly}(1/\varepsilon))$ rows and sparsity $s = \mathrm{poly}(\log k)$. Then, the following holds with constant probability,

By Theorem 5.7

Therefore,

We are now ready to show an O(1) approximation factor and polynomial running time for the bi-criteria k-CSSp,2 algorithm presented in Algorithm 4, given in Theorem 5.

Theorem 5 (Bicriteria O(1)-Approximation Algorithm for k-CSSp,2). Let $A \in \mathbb{R}^{d \times n}$ and $k \in \mathbb{N}$. There exists an algorithm that runs in $(\mathrm{nnz}(A) + d^2) \cdot k \, \mathrm{poly}(\log k)$ time and outputs a rescaled subset of columns

Proof. Approximation Factor. First notice that the minimizer X of |XSAS − AS|p,2 has to be in the column span of AS. Thus we can write X = (AS)Y for some matrix Y. By Theorem 5.8,

We denote Y SA = V. We take the left factor U = AS, and solving for min_V |UV − A|p,2 will give us a Θ(1)-approximation. A good minimizer for the right factor V in the Euclidean space is V = (AS)† A. This concludes our result. Notice that since S is a sampling matrix with O(k) columns, we get a rank-k left factor U as a subset of columns of A, as desired.
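The sparse embedding construction described at the start of this subsection can be sketched in a few lines. This is a simplified illustration and not the construction analyzed above: it draws locations and signs with a fully random generator instead of the limited-independence hash functions h and σ, and the names `sparse_embedding` and `apply_sketch` are hypothetical. Matrices are lists of rows.

```python
import math
import random


def sparse_embedding(m, d, s, seed=0):
    """Return an m x d matrix with exactly s nonzeros (+-1/sqrt(s)) per column."""
    if not 1 <= s <= m:
        raise ValueError("sparsity s must satisfy 1 <= s <= m")
    rng = random.Random(seed)
    S = [[0.0] * d for _ in range(m)]
    scale = 1.0 / math.sqrt(s)
    for col in range(d):
        # Choose s distinct rows for this column (role of the hash h).
        for row in rng.sample(range(m), s):
            # Choose a uniformly random sign (role of the hash sigma).
            S[row][col] = scale if rng.random() < 0.5 else -scale
    return S


def apply_sketch(S, A):
    """Compute the product S A, for S of size m x d and A of size d x n."""
    cols_of_A = list(zip(*A))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_of_A]
            for row in S]
```

With this scaling every column of S has unit Euclidean norm, and the check s <= m reflects the remark in the hyperparameter tables that sparsity can be at most sketch size.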

D ANALYSIS FOR GREEDY k-CSS 1,2

We propose a greedy algorithm for selecting columns in k-CSS1,2, presented in Algorithm ??. We give a detailed analysis of the first additive approximation guarantee, relative to the best possible subset of columns, for Greedy k-CSS1,2. Our analysis is inspired by the analysis of Greedy k-CSS2 for the Frobenius norm in Altschuler et al. (2016). We then describe how the running time of Algorithm ?? can be improved to O((n/σ) · F), where F is the running time required to evaluate min_V |A_{T∪j} V − A|1,2 for a fixed j ∈ T (note that this running time is O(nk^2 + knd) if we evaluate this by computing the pseudo-inverse of A_{T∪j}). This improvement in the running time is obtained by randomly sampling candidate columns from T and adding the best of these randomly sampled columns to A_T, rather than trying all columns of T. This method was previously used in Altschuler et al. (2016) for Frobenius norm column subset selection, where it is referred to as "Lazier-than-lazy Greedy," and the general approach was first introduced in Mirzasoleiman et al. (2015). This version of the greedy algorithm is shown in Algorithm 2, and we show both versions of the greedy algorithm below for convenience.

where P_M on the left-hand side denotes the projection onto the column span of M, and P_T on the right-hand side denotes the projection onto the column span of SA_T. Hence, by Lemma 1, k) is the matrix whose columns are the unsketched columns corresponding to M, then min

and by Lemmas 2 and 3, this means min

Now, we analyze the space complexity of Algorithm 7. Note that at any iteration of the algorithm, L can hold at most log n lists of columns (since each element of L is a coreset corresponding to an interval of column indices in [n] of size 2^k for some k ∈ N, and if two adjacent coresets are of the same size then they will have been merged, so all coresets in L are of different sizes). Each list of columns is of size at most O(k), and each column has d entries, meaning the amount of space used in any iteration is at most O(dk).
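The greedy selection rule for k-CSS1,2 described at the start of this section, together with its sampled ("lazier-than-lazy") variant, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in our experiments: the function names are hypothetical, the input is given as a list of columns, and the inner minimization over V is carried out by maintaining residuals orthogonal to the selected columns, which is valid because the ℓ1,2 cost decomposes into independent Euclidean least-squares problems, one per column.

```python
import math
import random


def l12_cost(cols):
    """Sum of the Euclidean norms of the columns (the l_{1,2} norm)."""
    return sum(math.sqrt(sum(x * x for x in col)) for col in cols)


def remove_direction(cols, u):
    """Project every column onto the orthogonal complement of the unit vector u."""
    out = []
    for col in cols:
        dot = sum(c * ui for c, ui in zip(col, u))
        out.append([c - dot * ui for c, ui in zip(col, u)])
    return out


def greedy_css_l12(columns, k, sample_size=None, seed=0):
    """Greedily pick up to k column indices; sample_size enables the sampled variant."""
    rng = random.Random(seed)
    residual = [list(col) for col in columns]  # A minus its projection onto the span
    selected = []
    for _ in range(k):
        remaining = [j for j in range(len(columns)) if j not in selected]
        if sample_size is not None and sample_size < len(remaining):
            candidates = rng.sample(remaining, sample_size)
        else:
            candidates = remaining
        best_j, best_cost, best_res = None, float("inf"), None
        for j in candidates:
            norm = math.sqrt(sum(x * x for x in residual[j]))
            if norm < 1e-12:
                continue  # column j already lies in the span of the selection
            u = [x / norm for x in residual[j]]
            new_res = remove_direction(residual, u)
            cost = l12_cost(new_res)
            if cost < best_cost:
                best_j, best_cost, best_res = j, cost, new_res
        if best_j is None:
            break  # no candidate reduces the cost further
        selected.append(best_j)
        residual = best_res
    return selected, l12_cost(residual)
```

Calling `greedy_css_l12(columns, k)` tries every remaining column in each round, while passing `sample_size` evaluates only a random subset of candidates per round, trading accuracy for running time.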

F COMPARISON OF OUR PROTOCOL WITH THE DISTRIBUTED GREEDY PROTOCOL OF ALTSCHULER ET AL. (2016) FOR THE FROBENIUS NORM

In this section, we perform an empirical comparison of our protocol with the distributed greedy protocol for column subset selection in the Frobenius norm due to Altschuler et al. (2016).

F.1 DISTRIBUTED GREEDY PROTOCOL OF ALTSCHULER ET AL. (2016)

We first recall the distributed protocol of Altschuler et al. (2016), in Algorithm 8. Note that it is a bi-criteria algorithm, i.e., more than k columns are selected, and in Altschuler et al. (2016) it is shown that this gives a good approximation relative to the optimal column subset for the Frobenius norm.

Naïvely, this would require a large communication cost, since the entire data matrix A would have to be communicated between the servers and the coordinator in order for them to perform the calls GREEDY(A, T_i, 32k/σ) and GREEDY(A, T, 12k/σ). Instead, Altschuler et al. (2016)

2) columns, where each entry is independently and uniformly set to ± 1 n . Then for any matrix A ∈ R^{d×n}, with probability 1 − O(δ), the following holds: for some constant c ≥ 0, for any k-dimensional subspace U of R^d, if P_U ∈ R^{d×d} is the corresponding projection matrix, then

that is, AR is a Projection-Cost Preserving sketch for A. In other words, R preserves the cost of all projections of A onto rank-k subspaces of R^d.

Hence, when each server performs the call GREEDY(A, T_i, 32k/σ), in place of A it can use a projection-cost preserving sketch AR for A (and the coordinator can similarly use AR when it makes the call GREEDY(A, T, 12k/σ)). This is because the calls to GREEDY repeatedly compute the cost of the projection of the matrix A onto various column subsets of T_i or T, and hence all that is needed is a way to efficiently compute the cost of the projection of A onto various subsets of size O(k/σ). AR can be computed without communicating the entire matrix A between the servers, as follows:
• First, the coordinator generates R ∈ R^{n×n'} (where n' = O(k)) as specified in Theorem 11. This is sent to all the servers (which can be done with negligible cost if a random seed is sent, for example).
• Suppose the i-th server holds A_i, which consists of the columns of A with indices s_i through t_i. Then, if R_i is the (t_i − s_i + 1) × n' submatrix of R consisting of rows s_i through t_i, the i-th server can send A_i R_i to the coordinator. Since A_i R_i is a d × O(k) matrix, this step takes O(sdk) communication.
• Finally, by definition of matrix multiplication, AR = Σ_{i=1}^{s} A_i R_i. The coordinator performs this computation and sends AR to each server. This step also takes O(sdk) communication.

Table 5: All hyperparameters used on isolet, when regular k-CSS_{1,2} is used. In all settings (including when greedy k-CSS_{1,2}, not included here, is used), cauchy size is 4k, and coreset size is 4k. The setting numbers are shown on top; in the following error plots, Setting 0 refers to our protocol when greedy k-CSS_{1,2} is used.

Figure 5 shows that we encounter a tradeoff between accuracy and running time when choosing between these two subroutines. Both lead to the same overall communication cost, and since accuracy is of more interest than running time in the distributed setting, it is (empirically) preferable to use Greedy k-CSS1,2 within the protocol.
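Returning to the computation of the sketch AR in Section F.1, the identity that AR equals the sum of the per-server products A_i R_i can be illustrated directly. The sketch below is illustrative only: the function names are hypothetical, the servers are simulated by contiguous column blocks, and the normalization of the random sign matrix is left as a free `scale` parameter, since the exact scaling required by Theorem 11 is not reproduced here.

```python
import random


def matmul(X, Y):
    """Plain matrix product of X (a x b) and Y (b x c), both lists of rows."""
    cols_of_Y = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_of_Y]
            for row in X]


def random_sign_matrix(n, c, scale, seed=0):
    """n x c matrix with i.i.d. entries uniform in {+scale, -scale}."""
    rng = random.Random(seed)
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(c)]
            for _ in range(n)]


def distributed_sketch(A, R, blocks):
    """Sum the per-server products A_i R_i over column blocks (start, stop)."""
    d, c = len(A), len(R[0])
    AR = [[0.0] * c for _ in range(d)]
    for start, stop in blocks:
        A_i = [row[start:stop] for row in A]  # columns held by server i
        R_i = R[start:stop]                   # matching rows of R
        partial = matmul(A_i, R_i)            # d x c message sent to the coordinator
        for r in range(d):
            for q in range(c):
                AR[r][q] += partial[r][q]
    return AR
```

Each server's message has d × c entries regardless of how many columns it holds, which is the source of the O(sdk) communication bound when c = O(k).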

G.2 ADDITIONAL DETAILS

All experiments on the Caltech-101 dataset were run on a Late-2016 MacBook Pro with a 2.7 GHz Quad-Core Intel Core i7 processor and 16 GB of memory. All experiments on bcsstk13 and isolet were run on an AWS z1d.xlarge instance with Deep Learning AMI (Amazon Linux 2) Version 29.0.

