AN EFFICIENT PROTOCOL FOR DISTRIBUTED COLUMN SUBSET SELECTION IN THE ENTRYWISE ℓ_p NORM

Abstract

We give a distributed protocol with nearly-optimal communication and number of rounds for Column Subset Selection with respect to the entrywise ℓ_1 norm (k-CSS_1), and more generally, for the ℓ_p norm with 1 ≤ p < 2. We study matrix factorization under the ℓ_1-norm loss, rather than the more standard Frobenius-norm loss, because the ℓ_1 norm is more robust to noise, which has been observed to lead to improved performance in a wide range of computer vision and robotics problems. In the distributed setting, we consider s servers in the standard coordinator model of communication, where the columns of the input matrix A ∈ ℝ^{d×n} (n ≫ d) are distributed across the s servers. We give a protocol in this model with O(sdk) communication, 1 round, and polynomial running time, which achieves a multiplicative k^{1/p − 1/2} · poly(log nd)-approximation to the best possible column subset. A key ingredient in our proof is a reduction to the ℓ_{p,2} norm, which corresponds to the ℓ_p norm of the vector of Euclidean norms of the columns of A. This enables us to use strong coreset constructions for Euclidean norms, which previously had not been used in this context. It also allows us to implement our algorithm in the popular streaming model of computation. We further propose a greedy algorithm for selecting columns, which can be used by the coordinator, and show the first provable guarantees for a greedy algorithm for the ℓ_{1,2} norm. Finally, we implement our protocol and demonstrate significant practical advantages on real-world data analysis tasks.

1. INTRODUCTION

Column Subset Selection (k-CSS) is a widely studied approach to rank-k approximation and feature selection. In k-CSS, one seeks a small subset U ∈ ℝ^{d×k} of k columns of a data matrix A ∈ ℝ^{d×n}, typically with n ≫ d, for which there is a right factor V such that ‖UV − A‖ is small under some norm ‖·‖. k-CSS is a special case of low-rank approximation in which the left factor is an actual subset of columns. The main advantage of k-CSS over general low-rank approximation is that the resulting factorization is more interpretable: the selected columns correspond to actual features, whereas general low-rank approximation takes linear combinations of such features. In addition, k-CSS preserves the sparsity of the data matrix A. k-CSS has been extensively studied in the Frobenius norm (Guruswami & Sinop, 2012; Boutsidis et al., 2014; Boutsidis & Woodruff, 2017; Boutsidis et al., 2008) and operator norms (Halko et al., 2011; Woodruff, 2014). A number of recent works (Song et al., 2017; Chierichetti et al., 2017; Dan et al., 2019; Ban et al., 2019; Mahankali & Woodruff, 2020) studied this problem in the ℓ_p norm (k-CSS_p) for 1 ≤ p < 2. The ℓ_1 norm is less sensitive to outliers, and better at handling missing data and non-Gaussian noise, than the Frobenius norm (Song et al., 2017). Specifically, the ℓ_1 norm leads to improved performance in many real-world applications, such as structure-from-motion (Ke & Kanade, 2005) and image denoising (Yu et al., 2012). Distributed low-rank approximation arises naturally when a dataset is too large to store on one machine, when computing a rank-k approximation on a single machine would take prohibitively long, or when data is collected simultaneously on multiple machines. Despite the flurry of recent work on k-CSS_p, this problem remains largely unexplored in the distributed setting.
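To make the objective concrete, the following is a small illustrative sketch (not the paper's algorithm) of evaluating a column subset S under the Frobenius norm, where the best right factor V has a closed form via the pseudoinverse; under the entrywise ℓ_1 norm, by contrast, the best V has no closed form and is typically found column-by-column via least-absolute-deviations regression. The function name and data here are hypothetical.

```python
# Toy evaluation of the k-CSS objective ||U V - A|| for a subset S,
# using the Frobenius norm, where V* = U^+ A is optimal in closed form.
import numpy as np

def css_frobenius_error(A, S):
    """Residual ||U V* - A||_F for the column subset S of A."""
    U = A[:, S]                      # left factor: actual columns of A
    V = np.linalg.pinv(U) @ A        # Frobenius-optimal right factor
    return np.linalg.norm(U @ V - A)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))
err = css_frobenius_error(A, [0, 2])            # k = 2 columns
full = css_frobenius_error(A, list(range(8)))   # all columns: exact recovery
assert full < 1e-9
assert err >= full
```

A smaller subset can only do as well as or worse than the full column set, which the final assertion reflects.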
This should be contrasted with Frobenius-norm column subset selection and low-rank approximation, for which a number of results in the distributed model are known; see, e.g., Altschuler et al. (2016); Balcan et al. (2015; 2016); Boutsidis et al. (2016). We consider a widely applicable model in the distributed setting, where s servers communicate with a central coordinator via 2-way channels. This model can simulate arbitrary point-to-point communication by having the coordinator forward a message from one server to another; this increases the total communication by a factor of 2 and an additive log s bits per message to identify the destination server. We consider the column partition model, in which each column of A ∈ ℝ^{d×n} is held by exactly one server. The column partition model is widely studied and arises naturally in many real-world scenarios such as federated learning (Farahat et al., 2013; Altschuler et al., 2016; Liang et al., 2014). In the column partition model, we typically have n ≫ d, i.e., A has many more columns than rows. Hence, we desire a protocol for distributed k-CSS_p whose communication cost is only logarithmic in the large dimension n, together with a fast running time.
In addition, it is important that our protocol use only a small constant number of communication rounds (meaning back-and-forth exchanges between the servers and the coordinator). Otherwise, the servers and coordinator would need to interact more, making the protocol sensitive to machine failures, e.g., if a machine goes offline. Further, a 1-round protocol can naturally be adapted to a single-pass streaming algorithm for applications with limited memory and limited access to the data; in fact, our protocol easily extends to such a streaming algorithm (a detailed analysis appears in Appendix E). In the following, we denote by A_{i*} and A_{*j} the i-th row and j-th column of A, respectively, for i ∈ [d], j ∈ [n]. We denote by A_T the subset of columns of A with indices in T ⊆ [n]. The entrywise ℓ_p norm of A is ‖A‖_p = (Σ_{i=1}^d Σ_{j=1}^n |A_{ij}|^p)^{1/p}. The ℓ_{p,2} norm is defined as ‖A‖_{p,2} = (Σ_{j=1}^n ‖A_{*j}‖_2^p)^{1/p}. We consider 1 ≤ p < 2. We denote the best rank-k approximation error for A in the ℓ_p norm by OPT := min_{rank-k A_k} ‖A − A_k‖_p. Given an integer k > 0, we say U ∈ ℝ^{d×k}, V ∈ ℝ^{k×n} are the left and right factors of a rank-k factorization of A in the ℓ_p norm with approximation factor α if ‖UV − A‖_p ≤ α · OPT. Since general rank-k approximation in the ℓ_1 norm is NP-hard (Gillis & Vavasis, 2015), we follow previous work and consider bi-criteria k-CSS algorithms, which attain polynomial running time. Instead of outputting exactly k columns, such algorithms return a subset of Õ(k) columns of A, where Õ suppresses logarithmic factors in k or n. It is known that the best approximation factor to OPT obtainable through the span of a column subset of size O(k) is Ω(k^{1/2−γ}) for p = 1 (Song et al., 2017) and Ω(k^{1/p−1/2−γ}) for p ∈ (1, 2) (Mahankali & Woodruff, 2020), where γ is an arbitrarily small constant.
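The two norms defined above can be checked numerically; the following small sketch (an illustration, not from the paper) computes the entrywise ℓ_p norm and the ℓ_{p,2} norm, which is the ℓ_p norm of the vector of Euclidean column norms. The function names are hypothetical.

```python
# Numeric illustration of ||A||_p (entrywise) and ||A||_{p,2}.
import numpy as np

def entrywise_lp(A, p):
    """||A||_p = (sum_{i,j} |A_ij|^p)^{1/p}."""
    return (np.abs(A) ** p).sum() ** (1.0 / p)

def lp2_norm(A, p):
    """||A||_{p,2} = (sum_j ||A_{*j}||_2^p)^{1/p}."""
    col_norms = np.linalg.norm(A, axis=0)   # Euclidean norm of each column
    return (col_norms ** p).sum() ** (1.0 / p)

A = np.array([[3.0, 0.0],
              [4.0, 2.0]])
# Column norms are 5 and 2, so ||A||_{1,2} = 7.
assert abs(lp2_norm(A, 1) - 7.0) < 1e-12
# At p = 2 both norms coincide with the Frobenius norm.
assert abs(entrywise_lp(A, 2) - np.linalg.norm(A)) < 1e-12
assert abs(lp2_norm(A, 2) - np.linalg.norm(A)) < 1e-12
```

The p = 2 check reflects why the ℓ_{p,2} reduction is natural: for p = 2 the entrywise and column-norm views agree, and for p < 2 they differ by at most the factors the analysis bounds.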






Figure 1: An overview of the proposed protocol for distributed k-CSS_p in the column partition model. Step 1: Server i applies a dense p-stable sketching matrix S to reduce the row dimension of its data matrix A^i; S is shared among all servers. Step 2: Server i constructs a strong coreset for its sketched data matrix SA^i, which is a subsampled and reweighted set of columns of SA^i. Server i then sends the coreset SA^i T_i, as well as the corresponding unsketched, unweighted columns A^i D_i selected in the strong coreset SA^i T_i, to the coordinator. Step 3: The coordinator concatenates the SA^i T_i column-wise, applies k-CSS_{p,2} to the concatenated columns, and computes the set of indices of the selected columns. Step 4: The coordinator recovers the set of selected columns A_I from the unsketched, unweighted columns A^i D_i via the previously computed indices.
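The four steps above can be sketched as a single-machine simulation. This is a schematic illustration only, with simplifying stand-ins for the paper's components: uniform column sampling stands in for the strong coreset of Step 2, and a greedy residual-norm selection stands in for k-CSS_{p,2} in Step 3; the Cauchy sketch is the p = 1 case of a p-stable sketch.

```python
# Schematic simulation of the 1-round protocol (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
d, n, s, k, m, t = 50, 120, 3, 4, 10, 8   # m: sketch rows, t: coreset size

# Column partition: each "server" holds a block of columns of A.
A_parts = np.array_split(rng.standard_normal((d, n)), s, axis=1)
S = rng.standard_cauchy((m, d))           # Step 1: shared 1-stable sketch

coresets, originals = [], []
for A_i in A_parts:                       # each server, locally
    idx = rng.choice(A_i.shape[1], size=min(t, A_i.shape[1]), replace=False)
    coresets.append(S @ A_i[:, idx])      # Step 2: send sketched coreset ...
    originals.append(A_i[:, idx])         # ... plus the matching raw columns

C = np.hstack(coresets)                   # Step 3: coordinator concatenates
chosen, R = [], C.copy()
for _ in range(k):                        # greedy stand-in for k-CSS_{p,2}
    j = int(np.argmax(np.linalg.norm(R, axis=0)))
    chosen.append(j)
    q = C[:, j] / np.linalg.norm(C[:, j])
    R -= np.outer(q, q @ R)               # deflate the selected direction

A_I = np.hstack(originals)[:, chosen]     # Step 4: recover original columns
assert A_I.shape == (d, k)
```

Note the communication pattern this simulates: each server ships only O(t) sketched columns of length m plus t raw columns of length d, so the total communication is independent of n, as the protocol requires.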

