A FRAMEWORK FOR LEARNED COUNTSKETCH

Abstract

Sketching is a compression technique that can be applied to many problems to solve them quickly and approximately. The matrices used to project data to smaller dimensions are called "sketches". In this work, we consider the problem of optimizing sketches to obtain low approximation error over a data distribution. We introduce a general framework for "learning" and applying CountSketch, a type of sparse sketch. The sketch optimization procedure has two stages: one for optimizing the placements of the sketch's non-zero entries and another for optimizing their values. Next, we provide a way to apply learned sketches that has worst-case guarantees for approximation error. We instantiate this framework for three sketching applications: least-squares regression, low-rank approximation (LRA), and k-means clustering. Our experiments demonstrate that our approach substantially decreases approximation error compared to classical and naïvely learned sketches. Finally, we investigate the theoretical aspects of our approach. For regression and LRA, we show that our method obtains state-of-the-art accuracy for fixed time complexity. For LRA, we prove that it is strictly better to include the first optimization stage for two standard input distributions. For k-means, we derive a more straightforward means of retaining approximation guarantees.

1. INTRODUCTION

In recent years, we have seen the influence of machine learning extend far beyond the field of artificial intelligence. The underlying paradigm, which assumes that a given algorithm has an input distribution for which algorithm parameters can be optimized, has even been applied to classical algorithms. Examples of classical problems that have benefited from ML include cache eviction strategies, online algorithms for job scheduling, frequency estimation of data stream elements, and indexing strategies for data structures (Lykouris & Vassilvitskii, 2018; Purohit et al., 2018; Hsu et al., 2019; Kraska et al., 2018). This input-distribution assumption is often realistic. For example, many real-world applications use data streaming to track quantities like product purchasing statistics in real time; consecutively streamed datapoints are usually tightly correlated and closely fit certain distributions.

We are interested in how this distributional paradigm can be applied to sketching, a data compression technique. With the dramatic increase in the dimensions of data collected over the past decade, compression methods are more important than ever, so it is of practical interest to improve the accuracy and efficiency of sketching algorithms. We study a sketching scheme in which the input matrix is compressed by multiplying it with a "sketch" matrix of small dimension; this smaller, sketched input is then used to compute an approximate solution. Typically, the sketch matrix and the approximation algorithm are designed to satisfy worst-case bounds on approximation error for arbitrary inputs. With the ML perspective in mind, we examine whether it is possible to construct sketches which also have low error in expectation over an input distribution. Essentially, we aim for the best of both worlds: good performance in practice with theoretical worst-case guarantees. Further, we are interested in methods that work for multiple sketching applications.
Typically, sketching is very application-specific. The sketch construction and approximation algorithm are tailored to individual applications, like robust regression or clustering (Sarlos, 2006; Clarkson & Woodruff, 2009; 2014; 2017; Cohen et al., 2015; Makarychev et al., 2019). Instead, we consider three applications at once (regression, LRA, k-means) and propose generalizable methods, as well as extending previous application-specific work.

Our results. At a high level, our work's aim is to make sketch learning more effective, general, and ultimately, practical. We propose a framework for constructing and using learned CountSketch. We chose CountSketch because it is a sparse, input-independent sketch (Charikar et al., 2002). Specifically, it has one non-zero entry (±1) per column and does not need to be constructed anew for each input matrix it is applied to. These qualities enable CountSketch to be applied quickly, since sparse matrix multiplication is fast and we can reuse the same CountSketch for different inputs. Our "learned" CountSketch retains this characteristic sparsity pattern and input-independence, but its non-zero entries range over R. We list our main contributions and follow this with a discussion.

• Two-stage sketch optimization: first place the non-zero entries, then learn their values.
• Theoretical worst-case guarantees, two ways: we derive a time-optimal method which applies to MRR, LRA, k-means, and more; we also prove that a simpler method works for k-means.
• State-of-the-art experimental results: we show the versatility of our method on 5 data sets spanning 3 types; our method dominates on the majority of experiments.
• Theoretical analysis of the necessity of two stages: we prove that including the first stage is strictly better for LRA and two common input distributions.
• Empirical demonstration of the necessity of two stages: including the first stage gives a 12% boost for MRR and a 20% boost for LRA.
Our sketch learning algorithm first places the sparse non-zero entries using a greedy strategy, and then learns their values using gradient descent. The resulting learned CountSketch is very different from the classical CountSketch: the non-zero entries no longer have random positions and ±1 values. As a result, the usual worst-case guarantees do not hold. We sought a way to obtain worst-case guarantees that was fast and reasonably general. Our solution is a fast comparison step which performs an approximate evaluation of the learned and classical sketches and takes the better of the two. Importantly, we can run this step before the approximation algorithm without increasing its overall time complexity. As such, this solution is time-optimal and applies to MRR, LRA, k-means, and more. An alternate method was proposed by a previous work, but it was only proved for LRA (Indyk et al., 2019). This "sketch concatenation" method simply sketches with the concatenation of a learned and a classical sketch. Since it is somewhat simpler, we wanted to extend its applicability; in a novel theoretical result, we prove it works for k-means as well.

We also ran a diverse set of experiments to demonstrate the versatility and practicality of our approach. We chose five data sets spanning three categories (image, text, and graph) to test our method on three applications (MRR, LRA, k-means). Importantly, these experiments have real-world counterparts. For example, LRA and k-means can be used to compress images, applying SVD (LRA) to text data is the basis of a natural language processing technique, and LRA can be used to compute approximate max cuts on graph adjacency matrices. Ultimately, our method dominated on the vast majority of tests, giving 31% and 70% improvements over classical CountSketch for MRR and LRA, respectively. Finally, we conducted an ablation study of the components of our algorithm.
In another novel theoretical result, we prove that including the time-consuming first optimization stage is strictly better than omitting it, for LRA and two input distributions (spiked covariance and Zipfian). Empirically, this is the case for all three applications.

Related work.

In the last few years, there has been much work on leveraging ML to improve classical algorithms; we mention only a few examples here. One related body of work is data-dependent dimensionality reduction, such as an approach for pair-wise/multi-wise similarity preservation for indexing big data (Wang et al., 2017) and a method for learning linear projections for general applications (Hegde et al., 2015). We note that multiplying an input matrix on the left by a sparse sketch is equivalent to hashing its rows to a small number of bins. Thus, we also find connections with the body of work on learned hashes, most of which addresses the nearest neighbor search problem (see Wang et al. for a survey). However, in order to obtain approximation guarantees, our "hash function" (the sparse sketch) must satisfy properties which these learned hashes usually do not, such as being an affine ε-embedding (Def. A.1). In particular, we build off of the work of Indyk et al. (2019), which introduced gradient descent optimization for the LRA application and gave an LRA-specific method for worst-case guarantees. We surpass the sketching performance and breadth of this work: we introduce a sparsity-pattern optimization step, which is clearly crucial for sparse sketches, and we provide a more general method for worst-case guarantees while also extending their method to k-means.

2. PRELIMINARIES

Our learned sketches have the sparsity pattern of the classical CountSketch, whose construction is described below. We also define affine ε-embeddings, a class of sketches that includes CountSketch. The ε-embedding property is desirable because it allows us to prove that certain sketching algorithms give (1 + ε)-approximations.

Definition 2.1 (Classical CountSketch). The CountSketch (abbreviated as CS) matrix S has one non-zero entry in each column, with a random location and a random value in {±1}.

Definition 2.2 (Affine Embedding). Given a pair of matrices A and B, a matrix S is an affine ε-embedding if, for all X of the appropriate shape, ‖S(AX − B)‖_F^2 = (1 ± ε)‖AX − B‖_F^2.

Notation. We denote the singular value decomposition (SVD) of A by A = UΣV^⊤ with orthogonal U, V and diagonal Σ. Relatedly, the Moore-Penrose pseudo-inverse of A is A^† = VΣ^{-1}U^⊤, where Σ^{-1} is constructed by inverting the non-zero diagonal entries.
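As a concrete illustration, a classical CountSketch (Def. 2.1) can be built in a few lines. The sketch below is ours and uses NumPy, storing the matrix densely for clarity; in practice one keeps only the row index and sign per column, so that applying S costs O(nnz(A)).

```python
import numpy as np

def countsketch(m, n, rng):
    """Classical CountSketch (Def. 2.1): one non-zero entry per column,
    at a uniformly random row, with a uniformly random sign in {+1, -1}."""
    S = np.zeros((m, n))
    rows = rng.integers(0, m, size=n)          # random location per column
    signs = rng.choice([-1.0, 1.0], size=n)    # random value in {+-1}
    S[rows, np.arange(n)] = signs
    return S

rng = np.random.default_rng(0)
S = countsketch(8, 100, rng)
assert (np.count_nonzero(S, axis=0) == 1).all()   # CS sparsity pattern
assert set(np.unique(S[S != 0])) <= {-1.0, 1.0}   # +-1 values
A = rng.standard_normal((100, 5))
SA = S @ A   # hashes the 100 rows of A into 8 signed bins
```

Because S is input-independent, the same matrix can be reused across many inputs A, as noted above.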

3. FRAMEWORK

We describe a framework for learned CountSketch that can be adopted by many different applications, including multiple-response least-squares regression (MRR), low-rank approximation (LRA), and k-means clustering; we return to these applications in the next section. In this section, we first describe how to optimize a CountSketch over a set of training samples. Then, we explain how to use the learned CountSketch to achieve good expected performance with worst-case guarantees. By running a "fast comparison" step before the approximation algorithm, we can do this in optimal time.

SKETCH OPTIMIZATION

Most applications are optimization problems; that is, they are defined by an objective function L(·). For example, least-squares regression solves min_X ‖AX − B‖_F^2 given A ∈ R^{n×d}, B ∈ R^{n×d′}. Of course, the optimal solution is a function of the inputs:

X^* = argmin_X ‖AX − B‖_F^2 = A^†B.

However, in the sketching paradigm, we compute an approximately optimal solution as a function of the sketched (compressed) inputs. Taking a CountSketch S ∈ R^{m×n} with m ≪ n, the approximate solution is (SA)^†(SB). Our goal is to minimize the expected approximation error with respect to S, which is constrained to the set CS of CountSketch-sparse matrices. This is simply equivalent to minimizing the objective value of the approximate solution:

S^* = argmin_{S∈CS} E_{(A,B)∼D}[L_{A,B}((SA)^†(SB)) − L_{A,B}(X^*)] = argmin_{S∈CS} E_{(A,B)∼D}[L_{A,B}((SA)^†(SB))].

For ease of notation, we define G(·) as a function which maps a sketch and inputs to an approximate solution; G(·) is defined by the application-specific approximation algorithm. For MRR, G(S, (A, B)) = (SA)^†(SB). More generally, the sketch optimization objective is:

S^* = argmin_{S∈CS} E_{A∼D}[L_A(G(S, A))].   (3.1)

If the application is regression, we let A stand for the pair (A, B). We solve this constrained optimization in two stages. In both stages, we approximate the expectation in empirical risk minimization (ERM) fashion: the expectation over the true distribution is approximated by an average over a sampled batch of the training set. In the first optimization stage, we compute positions for the CountSketch-sparse non-zero entries; in the second stage, we fix the positions and optimize the non-zero values.

Stage 1: placing the non-zero entries. We want to maintain the sparsity pattern of CS (one non-zero entry per column), but we are free to place each column's non-zero entry wherever we like. A naïve method would be to evaluate the objective for the exponential number of full placements.
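Concretely, for MRR the objective (3.1) can be evaluated over a training batch as in the illustrative sketch below (the helper names are ours, not the paper's):

```python
import numpy as np

def mrr_objective(S, A, B):
    """L_{(A,B)}(G(S, (A, B))) = ||A (SA)^+ (SB) - B||_F^2."""
    X = np.linalg.pinv(S @ A) @ (S @ B)
    return np.linalg.norm(A @ X - B) ** 2

def erm_objective(S, train):
    """Empirical (ERM) estimate of objective (3.1) over a batch of (A, B) pairs."""
    return sum(mrr_objective(S, A, B) for A, B in train) / len(train)

rng = np.random.default_rng(0)
train = [(rng.standard_normal((50, 4)), rng.standard_normal((50, 2)))
         for _ in range(3)]
S = np.zeros((10, 50))
S[rng.integers(0, 10, size=50), np.arange(50)] = rng.choice([-1.0, 1.0], size=50)
val = erm_objective(S, train)
# a sketched solution can never beat the unsketched optimum on average
opt = sum(np.linalg.norm(A @ np.linalg.pinv(A) @ B - B) ** 2
          for A, B in train) / len(train)
assert val >= opt - 1e-8
```

Both optimization stages below repeatedly evaluate this kind of ERM objective.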
This is clearly intractable, so we consider a greedy alternative. In essence, we construct the sketch one non-zero entry at a time, and we choose the location of the next entry by minimizing (3.1) over the discrete set of possibilities. More precisely, we build the sketch S ∈ R^{m×n} iteratively, placing one non-zero entry at a time. For each entry, we consider m locations and 2 values (±1) per location. We evaluate the sketch optimization objective (3.1) for all 2m incremental updates to S and choose the minimizing update. In the pseudo-code below, we iterate through the n columns of S, each of which contains one non-zero entry. Note that S_{w,j} = S + w·e_j e_i^⊤ adds a single entry w at the j-th row of the i-th column of the current, partially-constructed S.

Algorithm 1 GREEDY STAGE
Require: A_train = {A_1, ..., A_N} with A_i ∈ R^{n×d}; sketch dimension m
1: initialize S = 0_{m×n}
2: for i = 1 to n do
3:   (w^*, j^*) = argmin_{w∈{±1}, j∈[m]} Σ_{A_i∈A_train} L_{A_i}(G(S_{w,j}, A_i)), where S_{w,j} = S + w·e_j e_i^⊤
4:   S[j^*, i] = w^*
5: end for

For some applications, it can be inefficient to evaluate (3.1), since doing so requires computing the approximate solution. For MRR and LRA, the approximate solution has a closed form, but for k-means, it must be computed iteratively. This is prohibitively expensive, since we perform many evaluations. In this case, we recommend finding a surrogate L(·) with a closed-form solution, as we illustrate in later sections.

Stage 2: optimizing the non-zero values. We now fix the positions of the non-zero entries and optimize their values using gradient descent. To fix the positions, we represent S as just the vector of its non-zero entries, v ∈ R^n. We denote by H(v) : R^n → R^{m×n} the function which maps this concise representation of S to the full matrix; H(·) depends on the positions computed in the previous stage, which are fixed. Now, we simply differentiate the objective (3.1), with S = H(v), with respect to v.
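Algorithm 1 can be rendered compactly in code. The sketch below is an illustrative, unoptimized version using the MRR objective; no effort is made to share computation across the 2m candidate placements.

```python
import numpy as np

def mrr_objective(S, A, B):
    X = np.linalg.pinv(S @ A) @ (S @ B)
    return np.linalg.norm(A @ X - B) ** 2

def greedy_stage(train, m, n):
    """Algorithm 1: place one non-zero entry per column of S, choosing the
    (row j, sign w) pair minimizing the summed objective over the train set."""
    S = np.zeros((m, n))
    for i in range(n):                        # one non-zero per column
        best_val, best_j, best_w = np.inf, 0, 1.0
        for j in range(m):                    # m candidate rows
            for w in (-1.0, 1.0):             # 2 candidate values
                S[j, i] = w
                val = sum(mrr_objective(S, A, B) for A, B in train)
                if val < best_val:
                    best_val, best_j, best_w = val, j, w
                S[j, i] = 0.0                 # undo the trial placement
        S[best_j, i] = best_w                 # commit the best update
    return S

rng = np.random.default_rng(1)
train = [(rng.standard_normal((6, 2)), rng.standard_normal((6, 1)))]
S = greedy_stage(train, m=3, n=6)
assert (np.count_nonzero(S, axis=0) == 1).all()   # CS sparsity preserved
```

The resulting S keeps the CountSketch sparsity pattern, but its positions are now data-dependent.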
Algorithm 2 GRADIENT STAGE
Require: A_train = {A_1, ..., A_N} with A_i ∈ R^{n×d}; H(·) from Alg. 1; learning rate α
1: for i = 1 to n_iter do
2:   sample A_batch from A_train
3:   v ← v − α·Σ_{A∈A_batch} ∂L_A(G(H(v), A))/∂v
4: end for

LEARNED SKETCH WITH WORST-CASE GUARANTEES

We first run a fast comparison between our learned sketch and a classical one, and then run the approximation algorithm with the "winner". This allows us to compute an approximate solution (to the applications we consider here, i.e., MRR and LRA) that does not perform worse than classical CountSketch. In other words, our solution has the same worst-case guarantees as classical CountSketch. The benefit is that these guarantees hold for arbitrary inputs, so our method is protected against out-of-distribution inputs as well as in-distribution inputs which were not well-represented in the training data. More precisely, for a given input A, we quickly compute a rough estimate of the approximation errors of the learned and classical CountSketches. This rough estimate can be obtained by sketching. We take the sketch with the better approximation error and use it to run the usual approximation algorithm. The choice to compute a rough estimate of the approximation error, rather than the exact value, is crucial here: it allows us to prepend this fast comparison step without increasing the time complexity of the approximation algorithm. Thus, the whole procedure is still time-optimal. Though this method is simple, an even simpler method exists for some applications. Indyk et al. proved that "sketch concatenation" (sketching with a classical sketch appended to the learned one) retains worst-case guarantees for LRA. We prove that this also works for k-means clustering (Theorem 4.6).
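The gradient stage (Algorithm 2) can be sketched as follows. The paper differentiates through G(·) with automatic differentiation; here, to keep the example dependency-free, a finite-difference gradient stands in for autodiff. This is an illustrative simplification of ours, not the authors' implementation.

```python
import numpy as np

def make_H(positions, m, n):
    """H(v): expand the vector of non-zero values v into the full m x n
    sketch, with the column-to-row map `positions` fixed by the greedy stage."""
    def H(v):
        S = np.zeros((m, n))
        S[positions, np.arange(n)] = v
        return S
    return H

def objective(S, A, B):   # MRR objective, as in Section 4.1
    X = np.linalg.pinv(S @ A) @ (S @ B)
    return np.linalg.norm(A @ X - B) ** 2

def gradient_stage(v, H, train, lr=1e-4, n_iter=20, h=1e-5):
    """Algorithm 2, with a finite-difference gradient in place of autodiff."""
    for _ in range(n_iter):
        g = np.zeros_like(v)
        for idx in range(len(v)):
            vp, vm = v.copy(), v.copy()
            vp[idx] += h
            vm[idx] -= h
            g[idx] = sum(objective(H(vp), A, B) - objective(H(vm), A, B)
                         for A, B in train) / (2 * h)
        v = v - lr * g
    return v

rng = np.random.default_rng(2)
train = [(rng.standard_normal((8, 2)), rng.standard_normal((8, 1)))]
m, n = 3, 8
positions = rng.integers(0, m, size=n)
v0 = rng.choice([-1.0, 1.0], size=n)      # start from a classical CS
H = make_H(positions, m, n)
before = sum(objective(H(v0), A, B) for A, B in train)
v1 = gradient_stage(v0, H, train)
after = sum(objective(H(v1), A, B) for A, B in train)
assert after <= before + 1e-6   # small steps should not increase the loss
```

After this stage the non-zero values range over R, while the positions from the greedy stage stay fixed.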
Algorithm 3 LEARNED-SKETCH-ALGORITHM
Require: learned sketch S_L; classical sketch S_C; trade-off parameter β; input data A
1: Define M_L, M_C such that L_A(G(S_L, A)) = ‖M_L‖_F^2 and L_A(G(S_C, A)) = ‖M_C‖_F^2
2: S ← CountSketch ∈ R^{(1/β^2)×n}, R ← CountSketch ∈ R^{(1/β^2)×d}
3: Δ_L ← ‖S M_L R^⊤‖_F^2, Δ_C ← ‖S M_C R^⊤‖_F^2
4: if Δ_L ≤ Δ_C then
5:   return G(S_L, A)
6: end if
7: return G(S_C, A)

This algorithm can be used for applications that minimize a Frobenius norm. In the MRR example, L_A(G(S_L, A)) = ‖A(S_LA)^†(S_LB) − B‖_F^2, so M_L = A(S_LA)^†(S_LB) − B. Note that β is a parameter which trades off the precision of the approximation-error estimate against the runtime of Algorithm 3.
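For MRR, Algorithm 3 might be rendered as below (an illustrative sketch; the variable names are ours). The residual matrices M_L, M_C are compressed on both sides by small CountSketches before their Frobenius norms are compared.

```python
import numpy as np

def countsketch(m, n, rng):
    S = np.zeros((m, n))
    S[rng.integers(0, m, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    return S

def choose_sketch_mrr(S_L, S_C, A, B, beta, rng):
    """Algorithm 3 for MRR: compare rough (sketched) estimates of
    ||A (SA)^+ (SB) - B||_F^2 for the learned and classical sketch,
    then return the winner."""
    m = max(1, int(1 / beta ** 2))
    Sc = countsketch(m, A.shape[0], rng)   # compresses rows of the residual
    Rc = countsketch(m, B.shape[1], rng)   # compresses columns of the residual
    def estimate(S):
        M = A @ np.linalg.pinv(S @ A) @ (S @ B) - B    # residual matrix
        return np.linalg.norm(Sc @ M @ Rc.T) ** 2      # Delta in Algorithm 3
    return S_L if estimate(S_L) <= estimate(S_C) else S_C

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 4))
B = rng.standard_normal((200, 2))
S_L = countsketch(20, 200, rng)    # placeholder "learned" sketch
S_C = countsketch(20, 200, rng)    # classical sketch
winner = choose_sketch_mrr(S_L, S_C, A, B, beta=0.5, rng=rng)
assert winner is S_L or winner is S_C
```

The estimates cost far less than the exact residual norms, which is what keeps the overall procedure time-optimal.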

4. INSTANTIATIONS

For each of three problems (least-squares regression, low-rank approximation, k-means clustering), we define the problem's objective L_A(·) and the approximation algorithm G(·), which maps a sketch and inputs (S, A) to an approximate minimizer of L_A(·).

4.1. LEAST-SQUARES REGRESSION

We consider a generalized version of ℓ_2 regression called "multiple-response regression" (MRR). Given a matrix of observations (A ∈ R^{n×d}, with n ≫ d) and a matrix of corresponding values (B ∈ R^{n×d′}, with n ≫ d′), the goal of MRR is to solve

min_X L_{(A,B)}(X) = min_X ‖AX − B‖_F^2.

Approximation algorithm. The algorithm is simply to sketch A, B with a sparse sketch S (e.g., CS) and compute the closed-form solution on SA, SB, which are small (Algorithm 4).

Algorithm 4 SKETCH-REGRESSION (Sarlos, 2006; Clarkson & Woodruff, 2017)
Require: A ∈ R^{n×d}, B ∈ R^{n×d′}, S ∈ R^{m×n}
1: return (SA)^†(SB)

Sketch optimization. For both the greedy and gradient stages, we use the objective L_{(A,B)}(G(S, (A, B))) = ‖A(SA)^†(SB) − B‖_F^2.

Learned sketch algorithm. For fixed accuracy, our learned sketch algorithm achieves state-of-the-art time complexity, besting the classical algorithm. We prove an equivalent statement: the learned sketch algorithm yields better approximation error for fixed time complexity (see Lemma A.4 for the classical algorithm's worst-case guarantee). In the following theorem, we assume Alg. 3 uses a learned sketch which is an affine β-embedding, and also a classical sketch of the same size which is an affine ε-embedding; if β < ε, the above statement holds.

Theorem 4.1 (Learned sketching with guarantees for MRR). Given a learned sparse sketching matrix S_L ∈ R^{(d^2/ε^2)×n} which attains a (1 + γ)-approximation for MRR on A ∈ R^{n×d}, B ∈ R^{n×d′}, Alg. 3 gives a (1 + O(β + min(γ, ε)))-approximation for MRR on (A, B) in O(nnz(A) + nnz(B) + d^5d′ε^{-4} + dd′β^{-4}) time, where β is a trade-off parameter.

Remark 4.2. By setting the trade-off parameter β^{-4} = O(ε^{-4}d^4), Alg. 3 has the same asymptotic runtime as the best (1 + ε)-approximation algorithm for MRR with classical sparse embedding matrices. Moreover, Alg. 3 for MRR achieves a strictly better approximation bound (1 + O(β + γ)) = (1 + o(ε)) when γ = o(ε).
On the other hand, in the worst-case scenario, when the learned sketch performs poorly (i.e., γ = Ω(ε)), the approximation guarantee of Alg. 3 remains (1 + O(ε)).
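As a quick sanity check of Algorithm 4's behavior, the snippet below solves a random MRR instance from the sketched data and compares its residual against the exact least-squares residual. The 2x tolerance is deliberately loose; the theory gives (1 + ε) for appropriately sized sketches.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, d2, m = 1000, 5, 3, 200
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d2))

# classical CountSketch S
S = np.zeros((m, n))
S[rng.integers(0, m, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)

X_sk = np.linalg.pinv(S @ A) @ (S @ B)    # SKETCH-REGRESSION output
X_opt = np.linalg.pinv(A) @ B             # exact solution

err_sk = np.linalg.norm(A @ X_sk - B) ** 2
err_opt = np.linalg.norm(A @ X_opt - B) ** 2
assert err_opt <= err_sk <= 2.0 * err_opt
```

The residual of the sketched solution is, by definition, never below the optimum; the sketch only controls how far above it lands.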

4.2. LOW-RANK APPROXIMATION

Given an input matrix A ∈ R^{n×d} and a desired rank k, the goal of LRA is to solve

min_{rank-k X} L_A(X) = min_{rank-k X} ‖X − A‖_F^2.

Approximation algorithm. We consider the time-optimal (up to lower-order terms) approximation algorithm of Sarlos; Clarkson & Woodruff; Avron et al. (Algorithm 5), with worst-case guarantees described in Lemma A.8.

Algorithm 5 SKETCH-LOWRANK (Sarlos, 2006; Clarkson & Woodruff, 2017; Avron et al., 2016)
Require: A ∈ R^{n×d}, S ∈ R^{m_S×n}, R ∈ R^{m_R×d}, S_2 ∈ R^{m_{S_2}×n}, R_2 ∈ R^{m_{R_2}×d}
1: U_C T_C ← S_2AR^⊤, T_D U_D ← SAR_2^⊤, with U_C, U_D orthogonal
2: G ← S_2AR_2^⊤
3: Z_L Z_R ← [U_C^⊤ G U_D^⊤]_k (in factored form)
4: Z'_L = T_C^{-1} Z_L, Z'_R = Z_R T_D^{-1}
5: Z = Z'_L Z'_R
6: return AR^⊤ Z SA in factored form P ∈ R^{n×k}, Q ∈ R^{k×d}

Sketch optimization. For the greedy stage, we optimize the sketches S and R individually. However, we do not use L_A(G((S, R, S_2, R_2), A)) = ‖X − A‖_F^2 with X from Alg. 5 as the objective, because the optimization for one sketch would then depend on the other sketches. Instead, we use a proxy objective. We observe that the proof that Alg. 5 is an ε-approximation uses the fact that the row space of SA and the column space of AR^⊤ both contain a good rank-k approximation to A. Thus, the proxy objectives for S and R are rank-k approximation in the row space of SA and the column space of AR^⊤, respectively. For example, the proxy objective for S is

L_A(G(S, A)) = ‖[AV]_k V^⊤ − A‖_F^2, where V is from the SVD SA = UΣV^⊤

and [·]_k takes the optimal rank-k approximation using truncated SVD. The proxy objective for R is defined analogously. For the gradient stage, we optimize all four sketches (S, R, S_2, R_2) simultaneously using L_A(G((S, R, S_2, R_2), A)) = ‖X − A‖_F^2 with X from Algorithm 5, since it can easily be implemented with differentiable functions.

Learned sketch algorithm. Just as for regression (Section 4.1), we can prove that our learned sketch algorithm achieves better accuracy than the classical one given the same runtime.
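The proxy objective for S can be written directly from its definition. The helper below (our naming) computes the error of the best rank-k approximation of A restricted to the row space of SA.

```python
import numpy as np

def lra_proxy(S, A, k):
    """||[A V]_k V^T - A||_F^2, where V is from the SVD  S A = U Sigma V^T.
    This is the best rank-k approximation error of A within rowspace(SA)."""
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    AV = A @ Vt.T
    U2, s2, V2t = np.linalg.svd(AV, full_matrices=False)
    AVk = (U2[:, :k] * s2[:k]) @ V2t[:k]          # [A V]_k via truncated SVD
    return np.linalg.norm(AVk @ Vt - A) ** 2

rng = np.random.default_rng(5)
A = rng.standard_normal((40, 30))
k = 3
# With S = I, rowspace(SA) = rowspace(A), so the proxy equals the optimal
# rank-k error ||A_k - A||_F^2 = sum of squared trailing singular values.
s = np.linalg.svd(A, compute_uv=False)
opt = (s[k:] ** 2).sum()
assert np.isclose(lra_proxy(np.eye(40), A, k), opt)
```

Because this proxy involves only one sketch at a time, it decouples the greedy optimization of S from that of R, as described above.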
Theorem 4.3 (Low-Rank Approximation). Given learned sparse sketching matrices S_L ∈ R^{poly(k/ε)×n}, R_L ∈ R^{poly(k/ε)×d} which attain a (1 + γ)-approximation for LRA on A ∈ R^{n×d}, Alg. 3 gives a (1 + β + min(γ, ε))-approximation for LRA on A in O(nnz(A) + (n + d)·poly(k/ε) + (k^4/β^4)·poly(k/ε)) time, where β is a trade-off parameter.

Remark 4.4. By setting the trade-off parameter β^{-4} = O(k(n + d)ε^{-4}), Alg. 3 has the same asymptotic runtime as the best (1 + ε)-approximation algorithm for LRA with classical sparse embedding matrices. Moreover, Alg. 3 for LRA achieves a strictly better approximation bound (1 + O(β + γ)) = (1 + o(ε)) when γ = o(ε). On the other hand, in the worst-case scenario, when the learned sketches perform poorly (i.e., γ = Ω(ε)), the approximation guarantee of Alg. 3 remains (1 + O(ε)).

Greedy stage offers strict improvement. Finally, we prove for LRA that including the greedy stage is strictly better than omitting it. We show this for two natural distributions (spiked covariance and Zipfian). The intuition is that the greedy algorithm separates heavy-norm rows (which are important "directions" in the row space) into different bins.

Theorem 4.5. Consider a matrix A from either the spiked covariance or the Zipfian distribution. Let S_L denote a sparse sketch that Algorithm 1 has computed by iterating through indices in order of non-increasing row norms of A, and let S_C denote a CountSketch matrix. Then, there is a fixed η > 0 such that

min_{rank-k X ∈ rowsp(S_LA)} ‖X − A‖_F^2 ≤ (1 − η)·min_{rank-k X ∈ rowsp(S_CA)} ‖X − A‖_F^2.

4.3. k-MEANS CLUSTERING

Let A ∈ R^{n×d} represent a set of n points, A_1, ..., A_n ∈ R^d. In k-means clustering, the goal is to find a partition of A_1, ..., A_n into k clusters C = {C_1, ..., C_k} to solve

min_C L_A(C) = min_C Σ_{i=1}^{k} min_{μ_i∈R^d} Σ_{j∈C_i} ‖A_j − μ_i‖_2^2,

where μ_i denotes the center of cluster C_i.

Approximation algorithm.
First, compress A into AV, where V is from the SVD SA = UΣV^⊤ and S ∈ R^{O(k^2/ε^2)×n} is a CountSketch. Then, we use an approximation algorithm for k-means, such as Lloyd's algorithm with a k-means++ initialization (Lloyd, 1982; Arthur & Vassilvitskii, 2007). Solving on AV gives a (1 + ε)-approximation (Cohen et al., 2015), and Lloyd's algorithm gives an O(log k)-approximation, so we have a (1 + ε)·O(log k)-approximation overall.

Sketch optimization. k-means is an interesting case study because it presents a challenge for both stages of optimization. For the greedy stage, we observe that k-means does not have a closed-form solution, which means that evaluating the objective requires a (relatively) time-consuming iterative process. In the gradient stage, it is possible to differentiate through this iterative computation, but propagating the gradient through such a nested expression is time-consuming. Thus, a proxy objective which is simple and quick to evaluate is useful for both stages. Cohen et al. showed that we get a good k-means solution if A is projected onto an approximate top singular vector space. This suggests that a suitable proxy objective is low-rank approximation: L_A(G(S, A)) = ‖[AV]_k V^⊤ − A‖_F^2, as in Section 4.2. We use this objective for both the greedy and gradient stages.

Learned sketch algorithm. For k-means, we can use a different method to obtain worst-case guarantees. We prove that by concatenating a classical sketch to our learned sketch, our sketched solution is a (1 + O(ε))-approximation.

Theorem 4.6 (Sketch monotonicity property for k-means). For a given classical CountSketch S_C ∈ R^{O(poly(k/ε))×n}, sketching with any extension of S_C (i.e., by a learned sparse sketch S_L) yields a better approximate solution for k-means than sketching with S_C itself.
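The whole pipeline fits in a few lines. In the sketch below, plain Lloyd iterations (without the k-means++ initialization used in the paper) stand in for the clustering subroutine, so this is a simplified illustration rather than the evaluated implementation.

```python
import numpy as np

def lloyd(P, k, n_iter=15, seed=0):
    """Basic Lloyd's algorithm (k-means++ initialization omitted for brevity)."""
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for i in range(k):
            if (labels == i).any():
                centers[i] = P[labels == i].mean(axis=0)
    return labels

def sketched_kmeans(A, k, m, rng):
    """Compress A to A V (V from the SVD of SA, S a CountSketch with m rows),
    then cluster the compressed points; labels transfer back to rows of A."""
    n = len(A)
    S = np.zeros((m, n))
    S[rng.integers(0, m, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    return lloyd(A @ Vt.T, k)

rng = np.random.default_rng(6)
A = np.vstack([rng.standard_normal((30, 10)),
               rng.standard_normal((30, 10)) + 25.0])   # two separated blobs
labels = sketched_kmeans(A, k=2, m=8, rng=rng)
assert labels.shape == (60,) and set(labels) <= {0, 1}
```

Clustering happens in the m-dimensional compressed space, which is where the (1 + ε) projection guarantee of Cohen et al. enters.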

5. EVALUATION

We implemented and compared the following sketches:

• Ours: a sparse sketch for which both the values and the positions of the non-zero entries have been optimized.
• GD only: a sparse sketch with optimized values and random positions for the non-zero entries.
• Random: a classical random CountSketch.

We also consider two "naïvely" learned sketches for LRA, which are computed on just one sample from the training set:

• Exact SVD: sketch as AV_m, where V_m contains the top m right singular vectors of a random sample A_i ∈ A_train.
• Column sampling: sketch as AR, where R is computed from a randomly sampled A_i ∈ A_train. Each column of R contains one entry; the location of this entry is sampled according to the squared column norms of A_i, and its value is the inverse of the selected norm.

Data sets. We used high-dimensional data sets of three different types (image, text, and graph) to test the performance and versatility of our method. Note that the regression data sets are formed from the LRA/k-means data sets by splitting each matrix into two parts.

Evaluation metric. To evaluate the quality of a sketch S, we compute the difference between the objective value of the sketched solution and that of the optimal solution (with no sketching). That is, the values in this section's tables denote Δ_S = L_A(G(S, A_test)) − L_A^*, averaged over 10 trials.

Analysis of results. For least-squares regression and LRA, our method is the best: it is significantly better than random, obtaining improvements of around 31% and 70% for MRR and LRA, respectively, compared to a classical sparse sketch. For k-means, "Column sampling" and "Exact SVD" dominated on several data sets. However, we note that our method was always a close second and, more importantly, the values for k-means differed only trivially (< 1%) between methods.

Ablation of greedy stage. We find empirically that including the greedy optimization stage is always better than not including it (compare the "Ours" vs.
"GD only" methods). For regression, LRA, and k-means, including this step offers around 12%, 20%, and < 1% improvement, respectively. We note that for k-means, the values generally do not differ much between methods.

APPENDIX

Lemma A.3. Let S ∈ R^{O(1/ε^2)×n} be a randomly chosen sparse embedding matrix (i.e., a random CS). Then, with constant probability, ‖SA‖_F^2 = (1 ± ε)‖A‖_F^2.

Lemma A.4 (Sarlos (2006); Clarkson & Woodruff (2017)). Given a classical CountSketch S ∈ R^{m×n}, SKETCH-REGRESSION(A, B, S) returns a (1 + ε)-approximation in time O(nnz(A) + nnz(B) + dd′m + min(d^2m, dm^2)).

Proof of Theorem 4.1. Since S is an affine ε-embedding matrix for (A, B), we have ‖SAX − SB‖_F^2 = (1 ± ε)‖AX − B‖_F^2 for all X. Next, by the normal equations, X_O := (SA)^†(SB) is a minimizer of min_X ‖SAX − SB‖_F^2. Hence,

‖AX_O − B‖_F^2 ≤ (1 + 3ε) min_X ‖AX − B‖_F^2.

Together with the assumption that the solution constructed from S_L (denoted X_L) over A_train is a (1 + γ)-approximate solution,

min(‖AX_L − B‖_F^2, ‖AX_O − B‖_F^2) ≤ (1 + min(3ε, γ)) min_X ‖AX − B‖_F^2.   (A.1)

Hence, it only remains to compute the minimum of ‖AX_L − B‖_F^2 and ‖AX_O − B‖_F^2 efficiently. Note that it takes Ω(n·d·d′) time to compute these values exactly. However, for our purpose it suffices to compare (1 + β)-estimates of these values and return the minimum of the estimates. To achieve this, we use two applications of Lemma A.3, with R ∈ R^{O(1/β^2)×d′} and S ∈ R^{O(1/β^2)×n}. For any X (in particular, both X_O and X_L), with constant probability,

‖S(AX − B)R^⊤‖_F^2 = (1 ± O(β))‖AX − B‖_F^2.   (A.2)

Let Γ_L and Γ_O respectively denote AX_L − B and AX_O − B, and let Γ_M = argmin(‖SΓ_LR^⊤‖_F, ‖SΓ_OR^⊤‖_F). By Eq. (A.2) and a union bound over X_O and X_L, with constant probability,

‖Γ_M‖_F^2 ≤ (1 + O(β))‖SΓ_MR^⊤‖_F^2   (by Lemma A.3)
  ≤ (1 + O(β))·min(‖Γ_O‖_F^2, ‖Γ_L‖_F^2)   (by Lemma A.3)
  ≤ (1 + O(β + min(ε, γ))) min_X ‖AX − B‖_F^2   (by Eq. (A.1)).

Runtime analysis. By Lemma A.4, X_O and X_L can be computed in O(nnz(A) + nnz(B) + d^5d′ε^{-4}) time.
Next, the time to compute Δ_L and Δ_O is O(nnz(A) + nnz(XR^⊤) + nnz(B) + d′β^{-4}) = O(nnz(A) + nnz(B) + dd′ + d′β^{-4}), where X is either X_O or X_L and we use the fact that nnz(X) ≤ d·d′ (i.e., the total number of cells in X). Thus, the total runtime of Algorithm 3 is O(nnz(A) + nnz(B) + d^5d′ε^{-4} + dd′β^{-4}).

Theorem A.5 (Least Squares Regression). Suppose there exists a learned, sparse, subspace embedding matrix S_L computed over A_train with poly(d/ε) rows that attains a (1 + β)-approximation over A_test. Then, there exists an algorithm that runs in time O(nnz(A) + poly(d/ε)) and outputs a (1 + min(β, ε))-approximation to the least squares regression problem.

Proof: Note that to solve min_{x∈R^d} ‖Ax − b‖_2, given a sparse embedding matrix S ∈ R^{m×n} for the columns of A together with the vector b, the problem can be solved within a (1 + ε)-approximation in time nnz(A) + poly(d/ε) (e.g., see Theorem 2.14 of Woodruff (2014)). The proof outline is similar to that of Theorem 4.1. First, we compute solutions x_L, x_O ∈ R^d to the given instance using, respectively, the learned sketching matrix S_L and a learning-free sketching algorithm. Then, we compare ‖Ax_L − b‖_2 and ‖Ax_O − b‖_2 and report the better. Note that since x_L, x_O are vectors, unlike the case of MRR, the naive comparison (i.e., without applying any sketching matrices) takes nnz(A) time. Hence, the algorithm runs in time nnz(A) + poly(d/ε) and returns a (1 + min(ε, β))-approximate solution to the least squares regression problem. Moreover, we can employ the best known sketching techniques for the least squares regression problem and achieve dependence d/ε^2 in the running time (e.g., see Section 2.5 of Woodruff (2014)).

A.1 CORRESPONDING TO SECTION 4.2: LOW-RANK APPROXIMATION

Lemma A.6. Suppose that S ∈ R^{m_S×n} and R ∈ R^{m_R×d} are sparse affine ε-embedding matrices for (A_k, A) and ((SA)^⊤, A^⊤).
Then, min rank-k X AR XSA -A 2 F ≤ (1 + ) A k -A 2 F Proof: Consider the following multiple-response regression problem: min rank-k X A k X -A 2 F . (A.3) Note that since X = I k is a feasible solution to Eq. (A.3), min rank-k X A k X -A 2 F = A k -A 2 F . Let S ∈ R m S ×n be a sketching matrix that satisfies the condition of Lemma A.9 for A := A k and B := A. By the normal equations, the rank-k minimizer of SA k X -SA 2 F is (SA k ) + SA. Hence, A k (SA k ) + SA -A 2 F ≤ (1 + ) A k -A 2 F , (A.4) which in particular shows that a (1 + ) rank-k approximation of A exists in the row space of SA. In other words, min rank-k X XSA -A 2 F ≤ (1 + ) A k -A 2 F . (A.5) Next, let R ∈ R m R ×d be a sketching matrix that satisfies the condition of Lemma A.9 for A := (SA) and B := A . Let Y denote the rank-k minimizer of R(SA) X -RA 2 F . Hence, (SA) Y -A 2 F ≤ (1 + ) min rank-k X XSA -A 2 F Lemma A.9 ≤ (1 + O( )) A k -A 2 F Eq. (A.5) (A.6) Note that by the normal equations, again rowsp(Y ) ⊆ rowsp(RA ) and we can write Y = AR Z where rank(Z) = k. Thus, min rank-k X AR XSA -A 2 F ≤ AR ZSA -A 2 F = (SA) Y -A 2 F Y = AR Z ≤ (1 + O( )) A k -A 2 F Eq. (A.6) Proof: Let U C and U D be orthogonal bases for colsp(C) and rowsp(D), respectively, so that for each Z, CZD = U C Z U D for some Z . Let P C and P D be the projection matrices onto the subspaces spanned by the rows of C and D , respectively: P C = U C U C and P D = U D U D . Then by the Pythagorean theorem, CZD -G 2 F = P C U C Z U D P D -G 2 F = P C U C Z U D P D -P C GP D 2 F + P C G(I -P D ) 2 F + (I -P C )G 2 F , where the first equality holds since P C U C = U C and U D P D = U D and the second equality follows from the Pythagorean theorem. Hence, arg min rank-k Z CZD -G 2 F = arg min rank-k Z P C U C ZU D P D -P C GP D 2 F . Moreover, Lemma A.8. Let S ∈ R poly(k/ )×d , R ∈ R poly(k/ )×d be CS matrices such that β 4 • poly(k/ )). The approximation guarantee follows from Eq. 
(A.8) and the fact that S₂ and R₂ are respectively affine β-embedding matrices of ARᵀ and SA (see Lemma A.2).
‖P_C U_C Z' U_D P_D − P_C G P_D‖²_F = ‖U_C Z' U_D − U_C U_Cᵀ G U_Dᵀ U_D‖²_F = ‖Z' − U_Cᵀ G U_Dᵀ‖²_F,
min_{rank-k X} ‖ARᵀ X SA − A‖²_F ≤ (1 + γ)‖A_k − A‖²_F.   (A.8)
Moreover, let S₂ ∈ R

Lemma A.9 (Avron et al. (2016); Lemma 25). Suppose that A ∈ R^{n×d} and B ∈ R^{n×d'}. Moreover, let S be an oblivious sparse affine ε-embedding matrix (i.e., a random CS) with (rank(A)²/ε²) rows. Then with constant probability, X* = arg min_{rank-k X} ‖SAX − SB‖²_F satisfies
‖AX* − B‖²_F ≤ (1 + ε) min_{rank-k X} ‖AX − B‖²_F.
In other words, in O(nnz(A) + nnz(B)) + (d + d')·(rank(A)²/ε²) time, we can reduce the problem to a smaller (multiple-response regression) problem with (rank(A)²/ε²) rows whose optimal solution is a (1 + ε)-approximate solution to the original problem.

Proof of Theorem 4.3. Let S_O and R_O be CountSketch matrices of size poly(k/ε) × n and poly(k/ε) × d. Note that since rank(A_k) = k and rank((S_O A)ᵀ) ≤ poly(k/ε), S_O and R_O are respectively affine ε-embedding matrices of (A_k, A) and ((S_O A)ᵀ, Aᵀ). Then, by an application of Lemma A.6,
min_{rank-k X} ‖AR_Oᵀ X S_O A − A‖²_F ≤ (1 + O(ε))‖A_k − A‖²_F.   (A.9)
Similarly, by the assumption that (S_L, R_L) finds a (1 + γ)-approximate solution of LRA over matrices A ∈ A_test,
min_{rank-k X} ‖AR_Lᵀ X S_L A − A‖²_F ≤ (1 + O(γ))‖A_k − A‖²_F.   (A.10)
Next, let (P_L, Q_L) and (P_O, Q_O) be respectively the rank-k approximations of A in factored form using (S_L, R_L) and (S_O, R_O) (see Algorithm 5). Then, Eq. (A.10) together with Eq. (A.9) implies that
min(‖P_L Q_L − A‖²_F, ‖P_O Q_O − A‖²_F) = (1 + O(min(ε, γ)))‖A_k − A‖²_F.   (A.11)
Hence, it only remains to compute the minimum of ‖P_L Q_L − A‖²_F and ‖P_O Q_O − A‖²_F efficiently, and we proceed similarly to the proof of Theorem 4.1. We use two applications of Lemma A.3 with R ∈ R^{O(1/β²)×d}, S ∈ R^{O(1/β²)×n}. Let Γ_L = P_L Q_L − A, Γ_O = P_O Q_O − A and Γ_M = arg min(‖SΓ_L Rᵀ‖_F, ‖SΓ_O Rᵀ‖_F).
Hence,
‖Γ_M‖²_F ≤ (1 + O(β))‖SΓ_M Rᵀ‖²_F   (by Lemma A.3)
≤ (1 + O(β)) · min(‖Γ_L‖²_F, ‖Γ_O‖²_F)   (by Lemma A.3)
≤ (1 + O(β + min(ε, γ)))‖A_k − A‖²_F   (by Eq. (A.11)).

Runtime analysis. By Lemma A.8, Algorithm 5 computes P_L, Q_L and P_O, Q_O in O(nnz(A) + (n + d) poly(k/ε) + (k⁴/β⁴)·poly(k/ε)) time. Next, it takes O(nnz(A) + (n + d)·k + k/β⁴) time to compute ∆_L and ∆_O. As an example, we bound the amount of time required to compute ‖S P_L Q_L Rᵀ − SARᵀ‖_F, corresponding to ∆_L. Since S and R are sparse sketching matrices, SP_L, Q_L Rᵀ and SARᵀ can be computed in nnz(SP_L) + nnz(Q_L Rᵀ) + nnz(A) time. Since SP_L and Q_L Rᵀ are respectively of size (1/β²)×k and k×(1/β²), in total it takes O(nnz(A) + k/β²) to compute these three matrices. Then, we can compute SP_L Q_L Rᵀ and ‖SP_L Q_L Rᵀ − SARᵀ‖_F in time O(k/β⁴). Hence, the total runtime of Algorithm 3 for LRA is O(nnz(A) + (n + d)·poly(k/ε) + (k⁴/β⁴)·poly(k/ε)).

A.2 CORRESPONDING TO SECTION 4.3: k-MEANS

We restate notation and the main result below for ease of reference.

Notation. We define A_m as the optimal rank-m approximation of A formed by truncated SVD: A_m = U_m Σ_m V_mᵀ. Given a matrix U with orthogonal columns, let π_U(A) = AUUᵀ, which is the projection of the rows of A onto col(U). Let C_U be the optimal k-means partition of π_U(A). Further, we let µ_{C_U,i} denote the i-th cluster's center in the optimal k-means clustering of π_U(A). We denote by dist²(A, µ) the k-means loss given cluster centers (and their corresponding partition):
dist²(A, µ) = Σ_{i∈[k]} Σ_{j∈C_i} ‖A_j − µ_i‖²₂.
Likewise, cost(C) is the k-means loss given a partition:
cost(C) = Σ_{i∈[k]} min_{µ_i} Σ_{j∈C_i} ‖A_j − µ_i‖²₂.

Definition A.10 (Projection-cost preserving sketch). Ã is a projection-cost preserving sketch of A if, for any low-rank projection matrix P and some c not dependent on P:
(1 − ε)‖A − PA‖²_F ≤ ‖Ã − PÃ‖²_F + c ≤ (1 + ε)‖A − PA‖²_F.

Theorem (4.6: Sketch monotonicity property for k-means). Assume we have A ∈ R^{n×d}.
We also have a random CountSketch S ∈ R^{O(poly(k/ε))×n} and define U ∈ R^{d×O(poly(k/ε))} with orthogonal columns such that colsp(U) = rowsp(SA). Then, any extension of S to S' (for example, concatenation with a learned CountSketch S_L) yields a better approximate k-means solution. Specifically, define W with orthogonal columns such that colsp(W) = rowsp(S'A). Let C* denote the optimal partition of A, C_U the optimal partition of π_U(A), and C_W the optimal partition of π_W(A). Then
cost(C_W) ≤ (1 + O(ε)) cost(C_U) ≤ (1 + O(ε)) cost(C*).

Proof:
cost(C_W) = Σ_{i∈[k]} min_{µ_i} Σ_{j∈C_{W,i}} ‖A_j − µ_i‖² ≤ Σ_{i∈[k]} Σ_{j∈C_{W,i}} ‖A_j − µ_{C_W,i}‖²
= Σ_{i∈[k]} Σ_{j∈C_{W,i}} ‖A_j − π_W(A_j)‖² + ‖π_W(A_j) − µ_{C_W,i}‖²   (A.12)

Theorem A.13. Assume we have A ∈ R^{n×d}, j ∈ Z⁺, ε ∈ (0, 1]. Define m = min(O(poly(j/ε)), d). Let Ã_m be a (1 + ε)-approximation to A_m of the form Ã_m = AVVᵀ, where SA = UΣVᵀ for a CountSketch S ∈ R^{m×n}. Then, for any non-empty set µ contained in a j-dimensional subspace, we have:
|dist²(Ã_m, µ) + ‖Ã_m − A‖²_F − dist²(A, µ)| ≤ ε·dist²(A, µ).

Proof: We follow the proof of Theorem 22 in Feldman et al. (2013), but substitute different analyses in place of Corollaries 16 and 20. The result of Feldman et al. (2013) involves the best rank-m approximation of A, namely A_m; we will show it for the approximate rank-m approximation Ã_m. Define X ∈ R^{d×j} with orthonormal columns such that colsp(X) = span(µ). Likewise, define Y ∈ R^{d×(d−j)} with orthonormal columns such that colsp(Y) = span(µ)^⊥. By the Pythagorean theorem (V has orthonormal columns, so AV is a projection-cost preserving sketch, which means Ã_m = AVVᵀ is too):
dist²(Ã_m, µ) = ‖Ã_m Y‖²_F + dist²(Ã_m XXᵀ, µ) and dist²(A, µ) = ‖AY‖²_F + dist²(AXXᵀ, µ).
(A.21) Hence,
dist²(Ã_m, µ) + ‖A − Ã_m‖²_F − dist²(A, µ)
= ‖Ã_m Y‖²_F + dist²(Ã_m XXᵀ, µ) + ‖A − Ã_m‖²_F − (‖AY‖²_F + dist²(AXXᵀ, µ))
≤ ‖Ã_m Y‖²_F + ‖A − Ã_m‖²_F − ‖AY‖²_F + dist²(Ã_m XXᵀ, µ) − dist²(AXXᵀ, µ)   (A.22)
≤ (ε²/8)·‖AY‖²_F + dist²(Ã_m XXᵀ, µ) − dist²(AXXᵀ, µ)   (A.23)
≤ (ε²/8)·dist²(A, µ) + dist²(Ã_m XXᵀ, µ) − dist²(AXXᵀ, µ)   (A.24)
Note that ‖(I − P)AV‖²_F = ‖(I − P)AVVᵀ‖²_F. By Corollary A.12, ‖Ã_m XXᵀ − AXXᵀ‖²_F ≤ (ε²/8)·‖AY‖²_F. Since span(µ) ⊆ colsp(X), we have ‖AY‖²_F ≤ dist²(A, µ). Finally, combining the last two inequalities with (A.24):
|dist²(Ã_m XXᵀ, µ) − dist²(AXXᵀ, µ)| ≤ (ε/4)·dist²(AXXᵀ, µ) + (1 + 4/ε)·‖Ã_m XXᵀ − AXXᵀ‖²_F.
By (A.21), dist²(AXXᵀ, µ) ≤ dist²(A, µ), so
dist²(Ã_m, µ) + ‖A − Ã_m‖²_F − dist²(A, µ) ≤ (ε²/8)·dist²(A, µ) + (ε/4)·dist²(A, µ) + (ε²/8)·(1 + 4/ε)·dist²(A, µ) ≤ ε·dist²(A, µ),
where in the last inequality we used the assumption ε ≤ 1.

Corollary A.14. Assume we have A ∈ R^{n×d} and a CountSketch S ∈ R^{O(poly(k/ε))×n}. Then, define U ∈ R^{d×O(poly(k/ε))} with orthogonal columns spanning rowsp(SA). Also define A_U = π_U(A) and µ_U as the set of optimal cluster centers found on A_U. Now, assume ε ∈ (0, 1/3]. Then, µ_U is a (1 + ε)-approximation to the optimal k-means clustering of A.
That is, defining µ*_U as the cluster centers which minimize the cost of partition C_U on A, we have:
dist²(A, µ_U) ≤ (1 + ε) dist²(A, µ*_U).

Proof: By using ε/3 in Theorem A.13 with j taken as k,
dist²(A_U, µ_U) + ‖A − A_U‖²_F − dist²(A, µ_U) ≤ (ε/3) dist²(A, µ_U),
which implies that
(1 − ε/3) dist²(A, µ_U) ≤ dist²(A_U, µ_U) + ‖A − A_U‖²_F.   (A.25)
Likewise, by Theorem A.13 on A_U and µ*_U (again taking j as k),
dist²(A_U, µ*_U) + ‖A − A_U‖²_F − dist²(A, µ*_U) ≤ (ε/3) dist²(A, µ*_U),
which implies that
dist²(A_U, µ*_U) + ‖A − A_U‖²_F ≤ (1 + ε/3) dist²(A, µ*_U).   (A.26)
By (A.25) and (A.26) together, we have:
(1 − ε/3) dist²(A, µ_U) ≤ dist²(A_U, µ_U) + ‖A − A_U‖²_F ≤ dist²(A_U, µ*_U) + ‖A − A_U‖²_F ≤ (1 + ε/3) dist²(A, µ*_U).
Now, (1 + ε/3)/(1 − ε/3) ≤ 1 + ε for ε ≤ 1, so we have dist²(A, µ_U) ≤ (1 + ε) dist²(A, µ*_U).
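The sketch-and-cluster pipeline used above (project the rows of A onto rowsp(SA) for a CountSketch S, then run any k-means solver on the projections) can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the dimensions n, d, k, m and the plain Lloyd's solver are arbitrary assumptions made for the example.

```python
import numpy as np

def countsketch(m, n, rng):
    """Random CountSketch: each column has a single ±1 entry in a uniform row."""
    S = np.zeros((m, n))
    S[rng.integers(0, m, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    return S

def lloyd(A, k, rng, iters=25):
    """Plain Lloyd's algorithm, standing in for any k-means solver."""
    C = A[rng.choice(len(A), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((A[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for i in range(k):
            if np.any(labels == i):
                C[i] = A[labels == i].mean(axis=0)
    return labels

def cost(A, labels, k):
    """k-means cost of a partition with its optimal (mean) centers."""
    return sum(((A[labels == i] - A[labels == i].mean(0)) ** 2).sum()
               for i in range(k) if np.any(labels == i))

rng = np.random.default_rng(0)
n, d, k, m = 300, 100, 5, 30                        # m plays the role of poly(k/eps)
A = np.repeat(rng.normal(size=(k, d)), n // k, 0) + 0.1 * rng.normal(size=(n, d))

S = countsketch(m, n, rng)
U = np.linalg.svd(S @ A, full_matrices=False)[2].T  # columns of U span rowsp(SA)
labels_sk = lloyd(A @ U @ U.T, k, rng)              # cluster pi_U(A)
print(cost(A, labels_sk, k), cost(A, lloyd(A, k, rng), k))
```

The partition found on π_U(A) is then evaluated on the original rows of A; Theorem 4.6 is about how this cost behaves as the sketch S is extended.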

B SPECTRAL NORM GUARANTEE FOR ZIPFIAN MATRICES

In this section, we show that if the singular values of the input matrix A follow a Zipfian distribution (i.e., σ_i ∝ i^{−α} for a constant α), we can find a (1 + ε) rank-k approximation of A with respect to the spectral norm. A key theorem in this section is the following.

Theorem B.1 (Cohen et al. (2015), Theorem 27). Given an input matrix A ∈ R^{n×d}, there exists an algorithm that runs in O(nnz(A) + (n + d) poly(k/ε)) time and returns a projection matrix P = QQᵀ such that with constant probability the following holds:
‖AP − A‖²₂ ≤ (1 + ε)‖A − A_k‖²₂ + O(ε/k)·‖A − A_k‖²_F.   (B.1)

Next, we prove the main claim of this section. There are various ways to prove this, using, for example, a bound on the stable rank of A; we give the following short proof for completeness.

Theorem B.2. Given a matrix A ∈ R^{n×d} whose singular values follow a Zipfian distribution (i.e., σ_i ∝ i^{−α}) with a constant α > 1/2, there exists an algorithm that computes a rank-k matrix B (in factored form) such that ‖A − B‖²₂ ≤ (1 + ε)‖A − A_k‖²₂.

Proof: Note that since the singular values of A follow a Zipfian distribution with parameter α, for any value of k,
‖A − A_k‖²_F = Σ_{i=k+1}^{rank(A)} σ²_i = C · Σ_{i=k+1}^{rank(A)} i^{−2α}   (σ_i = √C / i^α)
≤ C · ∫_k^{rank(A)} x^{−2α} dx ≤ (k/(2α − 1)) · (1 + 1/k)^{2α} · C/(k + 1)^{2α} = O(k·σ²_{k+1}) = O(k·‖A − A_k‖²₂).   (B.2)
By an application of Theorem B.1, we can compute a matrix B in factored form in time O(nnz(A) + (n + d) poly(k/ε)) such that, with constant probability,
‖B − A‖²₂ ≤ (1 + ε)‖A − A_k‖²₂ + O(ε/k)·‖A − A_k‖²_F   (by Eq. (B.1))
≤ (1 + O(ε))‖A − A_k‖²₂   (by Eq. (B.2)).
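The integral bound in Eq. (B.2) is easy to check numerically. The snippet below is a sanity check only; the normalization C = 1 and the chosen α and rank are arbitrary assumptions. It compares the true tail Σ_{i>k} σ_i² against the bound (k/(2α−1))·(1 + 1/k)^{2α}·σ²_{k+1}:

```python
import numpy as np

alpha, r = 1.0, 100_000                              # Zipf exponent, rank(A); C = 1
sigma = np.arange(1, r + 1, dtype=float) ** -alpha   # sigma_i = i^{-alpha}

for k in (5, 20, 100):
    tail = (sigma[k:] ** 2).sum()                    # ||A - A_k||_F^2
    bound = k / (2 * alpha - 1) * (1 + 1 / k) ** (2 * alpha) * sigma[k] ** 2
    assert tail <= bound                             # Eq. (B.2): tail = O(k * sigma_{k+1}^2)
    print(k, tail, bound)
```

The bound is within a constant factor of the tail for every k, which is exactly the O(k·σ²_{k+1}) = O(k·‖A − A_k‖²₂) behavior the proof relies on.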

C GREEDY INITIALIZATION

In this section, we analyze the performance of the greedy algorithm on the two distributions mentioned in Theorem 4.5.

Preliminaries and Notation. Left-multiplying A by a CountSketch S ∈ R^{m×n} is equivalent to hashing the rows of A to m bins with coefficients in {−1, 1}. The greedy algorithm proceeds through the rows of A (in some order) and decides which bin each row hashes to, denoting this by adding an entry to S. We will denote the bins by b_i and their summed contents by w_i.

C.1 SPIKED COVARIANCE MODEL WITH SPARSE LEFT SINGULAR VECTORS.

To recap, every matrix A ∈ R^{n×d} from the distribution A_sp(s, ℓ) has s < k "heavy" rows (A_{r_1}, ..., A_{r_s}) of norm ℓ > 1. The indices of the heavy rows can be arbitrary, but must be the same for all members of the distribution and are unknown to the algorithm. The remaining rows (called "light" rows) have unit norm. In other words: let R = {r_1, ..., r_s}. For all rows A_i, i ∈ [n]:
A_i = ℓ·v_i if i ∈ R, and A_i = v_i otherwise,
where v_i is a uniformly random unit vector. We also assume that S_r, S_g ∈ R^{k×n} and a non-increasing row norm ordering for the greedy algorithm.

Proof sketch. First, we show that the greedy algorithm using a non-increasing row norm ordering will isolate the heavy rows (i.e., each is alone in a bin). Then, we conclude by showing that this yields a better rank-k approximation error when d is sufficiently large compared to n.

We begin with some preliminary observations that will be of use later. It is well known that a set of uniformly random unit vectors is ε-almost orthogonal (i.e., the magnitudes of their pairwise inner products are at most ε).

Observation C.1. Let v_1, ..., v_n ∈ R^d be a set of random unit vectors. Then with high probability, |⟨v_i, v_j⟩| ≤ √(2 log n / d) for all i < j ≤ n. We define ε = √(2 log n / d).

Observation C.2. Let u_1, ..., u_t be a set of vectors such that for each pair i < j ≤ t, |⟨u_i/‖u_i‖, u_j/‖u_j‖⟩| ≤ ε, and let g_1, ..., g_t ∈ {−1, 1}.
Then,
Σ_{i=1}^t ‖u_i‖²₂ − 2ε Σ_{i<j≤t} ‖u_i‖₂‖u_j‖₂ ≤ ‖Σ_{i=1}^t g_i u_i‖²₂ ≤ Σ_{i=1}^t ‖u_i‖²₂ + 2ε Σ_{i<j≤t} ‖u_i‖₂‖u_j‖₂.   (C.1)

Next, a straightforward consequence of ε-almost orthogonality is that we can find a QR-factorization of the matrix of such vectors where R (an upper triangular matrix) has diagonal entries close to 1 and entries above the diagonal close to 0.

Lemma C.3. Let u_1, ..., u_t ∈ R^d be a set of unit vectors such that for any pair i < j ≤ t, |⟨u_i, u_j⟩| ≤ ε, where ε = O(t⁻²). There exists an orthonormal basis e_1, ..., e_t for the subspace spanned by u_1, ..., u_t such that for each i ≤ t, u_i = Σ_{j=1}^i a_{i,j} e_j, where a²_{i,i} ≥ 1 − Σ_{j=1}^{i−1} j²·ε² and, for each j < i, a²_{i,j} ≤ j²ε².

Proof: We follow the Gram–Schmidt process to construct the orthonormal basis e_1, ..., e_t of the space spanned by u_1, ..., u_t, first setting e_1 = u_1 and then processing u_2, ..., u_t one by one. The proof is by induction: we show that once the first j vectors u_1, ..., u_j are processed, the statement of the lemma holds for these vectors. The base case of the induction trivially holds, as u_1 = e_1. Next, suppose that the induction hypothesis holds for the first ℓ vectors u_1, ..., u_ℓ.

Claim C.4. For each j ≤ ℓ, a²_{ℓ+1,j} ≤ j²ε².

Proof: The proof of the claim is itself by induction. For j = 1, using the fact that |⟨u_1, u_{ℓ+1}⟩| ≤ ε, the statement holds and a²_{ℓ+1,1} ≤ ε². Next, suppose that the statement holds for all j ≤ i < ℓ. Then, by |⟨u_{i+1}, u_{ℓ+1}⟩| ≤ ε,
|a_{ℓ+1,i+1}| ≤ (|⟨u_{ℓ+1}, u_{i+1}⟩| + Σ_{j=1}^i |a_{ℓ+1,j}|·|a_{i+1,j}|) / |a_{i+1,i+1}|
≤ (ε + Σ_{j=1}^i j²ε²) / |a_{i+1,i+1}|   (by the induction hypothesis on a_{ℓ+1,j} for j ≤ i)
≤ (ε + Σ_{j=1}^i j²ε²) / (1 − Σ_{j=1}^i j²·ε²)^{1/2}   (by the induction hypothesis on a_{i+1,i+1})
≤ (ε + Σ_{j=1}^i j²ε²) · (1 + 2·Σ_{j=1}^i j²ε²)
≤ ε·((Σ_{j=1}^i j²ε)·(1 + 4ε·Σ_{j=1}^i j²) + 1)
≤ (i + 1)·ε   (by ε = O(t⁻²)).
Finally, since ‖u_{ℓ+1}‖²₂ = 1, a²_{ℓ+1,ℓ+1} ≥ 1 − Σ_{j=1}^ℓ j²ε².
Corollary C.5. Suppose that ε = O(t⁻²). There exists an orthonormal basis e_1, ..., e_t for the space spanned by randomly picked unit vectors v_1, ..., v_t such that for each i, v_i = Σ_{j=1}^i a_{i,j} e_j, where a²_{i,i} ≥ 1 − Σ_{j=1}^{i−1} j²·ε² and, for each j < i, a²_{i,j} ≤ j²·ε².

Proof: The proof follows from Lemma C.3 and the fact that the vectors v_1, ..., v_t are ε-almost orthogonal (by Observation C.1).

The first main step is to show that the greedy algorithm (with non-increasing row norm ordering) will isolate rows into their own bins until all bins are filled. In particular, this means that the heavy rows (the first to be processed) will all be isolated. We note that because we set rank(SA) = k, the rank-k approximation cost is the simplified expression ‖AVVᵀ − A‖²_F, where UΣVᵀ = SA, rather than ‖[AV]_k Vᵀ − A‖²_F. This is just the projection cost onto rowsp(SA). Also, we observe that minimizing this projection cost is the same as maximizing the sum of squared projection coefficients:
min_S ‖A − AVVᵀ‖²_F ∼ min_S Σ_{i∈[n]} ‖A_i − (⟨A_i, v_1⟩v_1 + ... + ⟨A_i, v_k⟩v_k)‖²₂ ∼ min_S Σ_{i∈[n]} (‖A_i‖²₂ − Σ_{j∈[k]} ⟨A_i, v_j⟩²) ∼ max_S Σ_{i∈[n]} Σ_{j∈[k]} ⟨A_i, v_j⟩².
In the following sections, we will prove that our greedy algorithm makes certain choices by showing that these choices maximize the sum of squared projection coefficients.

Lemma C.6. For any matrix A or batch of matrices A, at the end of iteration k, the learned CountSketch matrix S maps each row to an isolated bin. In particular, heavy rows are mapped to isolated bins.

Proof: For any iteration i ≤ k, we consider the choice of assigning A_i to an empty bin versus an occupied bin. Without loss of generality, let this occupied bin be b_{i−1}, which already contains A_{i−1}. We consider the difference in cost for empty versus occupied. We will do this cost comparison for A_j with j ≤ i − 2, j ≥ i + 1, and finally, j ∈ {i − 1, i}. First, we let {e_1, ..., e_i} be an orthonormal basis for {A_1, ...
, A i } such that for each r ≤ i, A r = r j=1 a r,j e j where a r,r > 0. This exists by Lemma C.3. Let {e 1 , . . . , e i-2 , e} be an orthonormal basis for {A 1 , . . . , A i+2 , A i-1 ± A i }. Now, e = c 0 e i-1 + c 1 e i for some c 0 , c 1 because (A i-1 ± A i ) -proj {e1,...,ei-2} (A i-1 ± A i ) ∈ span(e i-1 , e i ). We note that c 2 0 + c 2 1 = 1 because we let e be a unit vector. We can find c 0 , c 1 to be: c 0 = a i-1,i-1 + a i,i-1 (a i-1,i-1 + a i,i-1 ) 2 + a 2 i,i , c 1 = a i,i (a i-1,i-1 + a i,i-1 ) 2 + a 2 i,i 1. j ≤ i -2: The cost is zero for both cases because A j ∈ span({e 1 , . . . , e i-2 }). 2. j ≥ i + 1: We compare the rewards (sum of squared projection coefficients) and find that {e 1 , . . . , e i-2 , e} is no better than {e 1 , . . . , e i }. A j , e 2 = (c 0 A j , e i-1 + c 1 A j , e i ) 2 ≤ (c 2 1 + c 2 0 )( A j , e i-1 2 + A j , e i 2 ) Cauchy-Schwarz inequality = A j , e i-1 2 + A j , e i 2 3. j ∈ {i -1, i}: We compute the sum of squared projection coefficients of A i-1 and A i onto e: ( 1 (a i-1,i-1 + a i,i-1 ) 2 + a 2 i,i ) • (a 2 i-1,i-1 (a i-1,i-1 + a i,i-1 ) 2 + (a i,i-1 (a i-1,i-1 + a i,i-1 ) + a i,i a i,i ) 2 ) =( 1 (a i-1,i-1 + a i,i-1 ) 2 + a 2 i,i ) • ((a i-1,i-1 + a i,i-1 ) 2 (a 2 i-1,i-1 (C.2) + a 2 i,i-1 ) + a 4 i,i + 2a i,i-1 a 2 i,i (a i-1,i-1 + a i,i-1 )) (C.3) On the other hand, the sum of squared projection coefficients of A i-1 and A i onto e i-1 ∪e i is: ( (a i-1,i-1 + a i,i-1 ) 2 + a 2 i,i (a i-1,i-1 + a i,i-1 ) 2 + a 2 i,i ) • (a 2 i-1,i-1 + a 2 i,i-1 + a 2 i,i ) (C.4) Hence, the difference between the sum of squared projections of A i-1 and A i onto e and e i-1 ∪ e i is ((C.4) -(C.3)) a 2 i,i ((a i-1,i-1 + a i,i-1 ) 2 + a 2 i-1,i-1 + a 2 i,i-1 -2a i,i-1 (a i-1,i-1 + a i,i-1 )) ((a i-1,i-1 + a i,i-1 ) 2 + a 2 i,i ) = 2a 2 i,i a 2 i-1,i-1 ((a i-1,i-1 + a i,i-1 ) 2 + a 2 i,i ) > 0 Thus, we find that {e 1 , . . . , e i } is a strictly better basis than {e 1 , . . . , e i-2 , e}. 
This means the greedy algorithm will choose to place A_i in an empty bin.

Next, we show that none of the rows left to be processed (all light rows) will be assigned to the same bin as a heavy row. The main proof idea is to compare the cost of "colliding" with a heavy row to the cost of "avoiding" the heavy rows. Specifically, we compare the decrease (before and after bin assignment of a light row) in the sum of squared projection coefficients, lower-bounding it in the former case and upper-bounding it in the latter. We introduce some results that will be used in Lemma C.10.

For each j ≤ k − 2, b²_j ≤ β²j². Next, since |⟨w̄_k, v_{k+r}⟩| ≤ β,
|b_k| ≤ (1/|⟨w̄_k, e_k⟩|)·(|⟨w̄_k, v_{k+r}⟩| + Σ_{j=1}^{k−1} |b_j|·|⟨w̄_k, e_j⟩|)
≤ (1/(1 − Σ_{j=1}^{k−1} β²j²))·(β + Σ_{j=1}^{k−2} β²j² + (k − 1)β)   (|b_{k−1}| ≤ 1)
= (β + Σ_{j=1}^{k−2} β²j²)/(1 − Σ_{j=1}^{k−1} β²j²) + (k − 1)β
≤ 2(k − 1)β − β²(k − 1)²/(1 − Σ_{j=1}^{k−1} β²j²)   (similar to the proof of Lemma C.3)
< 2βk.
Hence, the squared projection of A_{k+r} onto e_k is at most 4β²k²·‖A_{k+r}‖². We assumed A_{k+r} is a light row, i.e., ‖A_{k+r}‖₂ = 1.

Claim C.9. We assume that the absolute values of the inner products of vectors in v_1, ..., v_n are at most ε < 1/(n² Σ_{A_i∈b} ‖A_i‖₂) and the absolute values of the inner products of the normalized vectors of w_1, ..., w_k are at most β = O(n⁻³k^{−3/2}). Suppose that bin b contains the row A_{k+r}. Then, the squared projection of A_{k+r} onto the direction of w orthogonal to span({w_1, ..., w_k} \ {w}) is at most ‖A_{k+r}‖⁴₂/‖w‖²₂ + O(n⁻²) and at least ‖A_{k+r}‖⁴₂/‖w‖²₂ − O(n⁻²).

Proof: Without loss of generality, we assume that A_{k+r} is mapped to b_k, so w = w_k. First, we provide upper and lower bounds for |⟨w̄_k, v_{k+r}⟩|, where for each i ≤ k we let w̄_i = w_i/‖w_i‖₂ denote the normalized vector of w_i. Recall that by definition v_{k+r} = A_{k+r}/‖A_{k+r}‖₂.
| w k , v k+r | ≤ A k+r 2 + Ai∈b k A i 2 w k 2 ≤ A k+r 2 + n -2 w k 2 by < n -2 Ai∈b k A i 2 ≤ A k+r 2 w k 2 + n -2 w k 2 ≥ 1 (C.5) | w k , v k+r | ≥ A k+r 2 -Ai∈b k A i 2 • w k 2 ≥ A k+r 2 w k 2 -n -2 (C.6) Now, let {e 1 , • • • , e k } be an orthonormal basis for the subspace spanned by {w 1 , • • • , w k } guaranteed by Lemma C.3. Next, we expand the orthonormal basis to include e k+1 so that we can write v k+r = k+1 j=1 b j e j . By a similar approach to the proof of Lemma C.3, we can show that for each j ≤ k -1, b 2 j ≤ β 2 j 2 . Moreover, |b k | ≤ 1 | w k , e k | • (| w k , v k+r | + k-1 j=1 |b j • w k , e j |) ≤ 1 1 - k-1 j=1 β 2 • j 2 • (| w k , v k+r | + k-1 j=1 β 2 • j 2 ) by Lemma C.3 ≤ 1 1 - k-1 j=1 β 2 • j 2 • (n -2 + A k+r 2 w k 2 + k-1 j=1 β 2 • j 2 ) by (C.5) < β • k + 1 1 -β 2 k 3 • (n -2 + A k+r 2 w k 2 ) similar to the proof of Lemma C.3 ≤ O(n -2 ) + (1 + O(n -2 )) A k+r 2 w k 2 by β = O(n -3 k -3 2 ) ≤ A k+r 2 w k 2 + O(n -2 ) A k+r 2 w k 2 ≤ 1 and, |b k | ≥ 1 | w k , e k | • (| w k , v k+r | - k-1 j=1 |b j • w k , e j |) ≥ | w k , v k+r | - k-1 j=1 β 2 • j 2 since | w k , e k | ≤ 1 ≥ A k+r 2 w k 2 -n -2 - k-1 j=1 β 2 • j 2 by (C.6) ≥ A k+r 2 w k 2 -O(n -2 ) by β = O(n -3 k -3 2 ) Hence, the squared projection of A k+r onto e k is at most A k+r 4 2 w k 2 2 +O(n -2 ) and is at least A k+r 4 2 w k 2 2 - O(n -2 ). Now, we show that at the end of the algorithm no light row will be assigned to the bins that contain heavy rows. Lemma C.10. We assume that the absolute values of the inner products of vectors in v 1 , • • • , v n are at most < min{n -2 k -5 3 , (n Ai∈w A i 2 ) -1 }. At each iteration k + r, the greedy algorithm will assign the light row A k+r to a bin that does not contain a heavy row. Proof: The proof is by induction. Lemma C.6 implies that no light row has been mapped to a bin that contains a heavy row for the first k iterations. 
Next, we assume that this holds for the first k + r − 1 iterations and show that it also must hold for the (k + r)-th iteration. To this end, we compare the sum of squared projection coefficients when A_{k+r} avoids, versus collides with, a heavy row. First, we upper bound β = max_{i≠j≤k} |⟨w_i, w_j⟩|/(‖w_i‖₂‖w_j‖₂). Let c_i and c_j respectively denote the number of rows assigned to b_i and b_j.
β = max_{i≠j≤k} |⟨w_i, w_j⟩|/(‖w_i‖₂‖w_j‖₂) ≤ ε·c_i·c_j / √((c_i − 2εc_i²)(c_j − 2εc_j²))   (by Observation C.2)
≤ 16ε√(c_i c_j) ≤ 16·n⁻²k^{−5/3}·n = O(n⁻¹k^{−5/3})   (using ε ≤ n⁻²k^{−5/3} and c_i, c_j ≤ n).
1. If A_{k+r} is assigned to a bin that contains c light rows and no heavy rows. In this case, the projection loss of the heavy rows A_1, ..., A_s onto rowsp(SA) remains zero. Thus, we only need to bound the change in the sum of squared projection coefficients of the light rows before and after iteration k + r. Without loss of generality, let w_k denote the bin that contains A_{k+r}. Since S_{k−1} = span({w_1, ..., w_{k−1}}) has not changed, we only need to bound the difference in cost between projecting onto the component of w_k − A_{k+r} orthogonal to S_{k−1} and the component of w_k orthogonal to S_{k−1}, respectively denoted e'_k and e_k.
I. By Claim C.7, for the light rows that are not yet processed (i.e., A_j for j > k + r), the squared projection of each onto e'_k is at most β²k². Hence, the total decrease in the squared projection is at most (n − k − r)·β²k².
II. By Claim C.8, for the processed light rows that are not mapped to the last bin, the squared projection of each onto e'_k is at most 4β²k². Hence, the total decrease in the squared projection cost is at most (r − 1)·4β²k².
III.
For each row A_i ≠ A_{k+r} that is mapped to the last bin, by Claim C.9 and the fact that ‖A_i‖₂ = 1, the total squared projection of the rows in the bin b_k changes by at most:
(Σ_{A_i∈b_k\{A_{r+k}}} ‖A_i‖⁴₂/‖w_k − A_{r+k}‖²₂ + O(n⁻²)) − (Σ_{A_i∈b_k} ‖A_i‖⁴₂/‖w_k‖²₂ − O(n⁻²))
≤ (‖w_k − A_{r+k}‖²₂ + O(n⁻¹))/‖w_k − A_{r+k}‖²₂ − (‖w_k‖²₂ − O(n⁻¹))/‖w_k‖²₂ + O(n⁻¹)   (by Observation C.2)
≤ O(n⁻¹).
Hence, summing up the bounds in items I to III above, the total decrease in the sum of squared projection coefficients is at most O(n⁻¹).
2. If A_{k+r} is assigned to a bin that contains a heavy row. Without loss of generality, we can assume that A_{k+r} is mapped to b_k, which contains the heavy row A_s. In this case, the distance of the heavy rows A_1, ..., A_{s−1} from the space spanned by the rows of SA is zero. Next, we bound the amount of change in the squared distances of A_s and the light rows from the space spanned by the rows of SA. Note that the (k−1)-dimensional space corresponding to w_1, ..., w_{k−1} has not changed. Hence, we only need to bound the decrease in the projection distance of A_{k+r} onto e_k compared to e'_k (where e_k, e'_k are defined similarly to the last part).
1. For the light rows other than A_{k+r}, the squared projection of each onto e_k is at most β²k². Hence, the total increase in the squared projection of the light rows onto e_k is at most (n − k)·β²k² = O(n⁻¹).

Proof (of Claim C.13): Let b denote the bin that contains the heavy rows A_{r_1}, ..., A_{r_c}, and suppose that it contains some light rows as well. Note that by Claim C.8 and Claim C.9, the squared projection of each row A_{r_i} onto the subspace spanned by the k bins is at most
Hence, the expected total squared loss of the heavy rows is at least:
ℓ²·(s − k(1 − e^{−s/(k−1)})) − s·n^{−O(1)}
≥ ℓ²·k(α − 1 + e^{−α}) − ℓ²α − n^{−O(1)}   (s = α·(k − 1) where 0.7 < α < 1)
≥ ℓ²k/(2e) − ℓ² − n^{−O(1)}   (α ≥ 0.7)
≥ ℓ²k/(4e) − O(n⁻¹)   (assuming k > 4e).
Next, we compute a lower bound on the expected squared loss of the light rows.
Note that Claim C.8 and Claim C.9 imply that when a light row collides with other rows, its contribution to the total squared loss (which accounts for the amount by which it decreases the squared projection of the other rows in the bin as well) is at least 1 − O(n⁻¹). Hence, the expected total squared loss of the light rows is at least:
(n − s − k)(1 − O(n⁻¹)) ≥ (n − (1 + α)·k) − O(n⁻¹).
Hence, the expected squared loss of a CountSketch whose sparsity is picked at random is at least
ℓ²k/(4e) − O(n⁻¹) + n − (1 + α)k − O(n⁻¹) ≥ n + ℓ²k/(4e) − (1 + α)k − O(n⁻¹).

Corollary C.14. Let s = α(k − 1) where 0.7 < α < 1 and let ℓ² ≥ (4e + 1)n/(αk). Let S_g be the CountSketch whose sparsity pattern is learned over a training set drawn from A_sp via the greedy approach. Let S_r be a CountSketch whose sparsity pattern is picked uniformly at random. Then, for an n × d matrix A ∼ A_sp where d = Ω(n⁶ℓ²), the expected loss of the best rank-k approximation of A returned by S_r is worse than the approximation loss of the best rank-k approximation of A returned by S_g by at least a constant factor.   (C.7)

2. If A_{k+r} is assigned to a bin that contains a heavy row. Without loss of generality and by the induction hypothesis, we assume that A_{k+r} is assigned to a bin b that only contains a heavy row A_j. Since the rows are orthogonal, we only need to bound the difference in the projection of A_{k+r} and A_j. In this case, the total squared loss corresponding to A_j and A_{k+r} before and after adding A_{k+1}



While learned CountSketch is data-dependent (it is optimized using sample input matrices), it is still considered input-independent because it is applied to unseen input matrices (test samples).



and is a (1 + ε)-approximate solution of min_X ‖AX − B‖²_F. To bound the runtime, note that since S is a CountSketch, we can compute SA and SB in time O(nnz(A) + nnz(B)) and reduce the problem to an instance of multiple-response regression with m rows. Then, we can solve the reduced-size problem in time O(dd'm² + min(d²m, dm²)): O(min(dm², d²m)) to compute (SA)⁺ and O(dd'm²) to compute (SA)⁺(SB).

Proof of Theorem 4.1. By Lemma A.2, a random CS S_O with O(d²/ε²) rows is an affine ε-embedding matrix of (A, B) with constant probability. Next, let X_L and X_O be respectively the solutions returned by SKETCH-REGRESSION(A, B, S_L) and SKETCH-REGRESSION(A, B, S_O). By Lemma A.4, with constant probability,

Lemma A.7 (Avron et al. (2016); Lemma 27). For C ∈ R^{p×m'}, D ∈ R^{m×p'}, G ∈ R^{p×p'}, the problem min_{rank-k Z} ‖CZD − G‖²_F (Eq. (A.7)) can be solved in O(pm'r_C + p'mr_D + pp'(r_D + r_C)) time, where r_C = rank(C) ≤ min{m', p} and r_D = rank(D) ≤ min{m, p'}.

where the first equality holds since U_Cᵀ U_C = I and U_D U_Dᵀ = I, and the second equality holds since U_C and U_D are orthonormal. Hence,
arg min_{rank-k Z'} ‖CZD − G‖²_F = arg min_{rank-k Z'} ‖Z' − U_Cᵀ G U_Dᵀ‖²_F.
Next, we can find Z' = [U_Cᵀ G U_Dᵀ]_k by computing the SVD of U_Cᵀ G U_Dᵀ in the form Z_L Z_R, where Z_L ∈ R^{m'×k} and Z_R ∈ R^{k×m}.
Runtime analysis. We can compute U_C and U_D by the Gram–Schmidt process in time O(pm'r_C + p'mr_D), and U_Cᵀ G U_Dᵀ in time O(min{r_C p'(p + r_D), r_D p(p' + r_C)}). Finally, the time to compute Z' (i.e., an SVD computation of U_Cᵀ G U_Dᵀ) is O(r_C r_D · min{r_C, r_D}). Since r_C ≤ min{p, m'} and r_D ≤ min{p', m}, the total runtime to minimize Z in Eq. (A.7) is O(pm'r_C + p'mr_D + pp'(r_C + r_D)).

be CS matrices. Then, Algorithm 5 runs in O(nnz(A) + (n + d) poly(k/ε)) time and with constant probability gives a (1 + O(β + γ))-approximate rank-k approximation of A.

Proof: The algorithm first computes C = S₂ARᵀ, D = SAR₂ᵀ, G = S₂AR₂ᵀ, which can be done in time O(nnz(A)). As an example, we bound the time to compute C = S₂ARᵀ. Note that since S₂ is a CS, S₂A can be computed in O(nnz(A)) time and the number of non-zero entries in the resulting matrix is at most nnz(A). Hence, since R is a CS as well, C can be computed in time O(nnz(A) + nnz(S₂A)) = O(nnz(A)). Then, it takes an extra poly(k/ε)·k²/β² time to store C, D and G in matrix form. Next, as we showed in Lemma A.7, the time to compute Z in Algorithm 5 is O((k⁴/β⁴)·poly(k/ε)). Finally, it takes (n + d) poly(k/ε) time to compute P = ARᵀZ_L and Q = Z_R SA and return the solution in the form P_{n×k} Q_{k×d}. Hence, the total runtime is O(nnz(A) + (n + d) poly(k/ε) + (k⁴/β⁴)·poly(k/ε)).

Take ε in Theorem 16 from Cohen et al. (2015) to be ε²/8. This theorem implies that Ã_m is a projection-cost preserving sketch with the c term equal to ‖A − Ã_m‖²_F.

≤ dist²(A, µ). Using Corollary 21 from Feldman et al. (2013), taking ε as ε/4, A as Ã_m XXᵀ, and B as AXXᵀ, yields

the squared projection of A_i onto e_k is at most ‖A_i‖⁴₂/‖w_k‖²₂ + O(n⁻²), and the squared projection of A_i onto e'_k is at least ‖A_i‖⁴₂/‖w_k − A_{k+r}‖²₂ − O(n⁻²); the squared projection of A_{k+r} onto e_k compared to e'_k increases by at least (

ℓ²·(1/c + O(n⁻¹)) + O(n⁻¹) ≤ ℓ²/c + O(n⁻¹). Hence, the total squared loss of these c heavy rows is at least cℓ² − c·ℓ²·(1/c + O(n⁻¹)) ≥ (c − 1)ℓ² − O(n⁻¹).

(1 + 1/α)(n − s) ≥ (1 + 1/α) min_{rank-k X ∈ rowsp(S_g A)} ‖X − A‖²_F, using ℓ² ≥ (4e + 1)n/(αk).



Table 2: Average errors for least-squares regression.

Table 4: Average errors for k-means clustering.

David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

Claim C.7. Let A_{k+r}, r ∈ [1, ..., n − k], be a light row not yet processed by the greedy algorithm. Let {e_1, ..., e_k} be the Gram–Schmidt basis for the current {w_1, ..., w_k}. Let β = O(n⁻¹k⁻³) upper bound the inner products of the normalized A_{k+r}, w_1, ..., w_k. Then, for any bin i, ⟨e_i, A_{k+r}⟩² ≤ β²·k².

Proof: This is a straightforward application of Lemma C.3. From that, we have ⟨A_{k+r}, e_i⟩² ≤ i²β² for i ∈ [1, ..., k], which means ⟨A_{k+r}, e_i⟩² ≤ k²β².

Claim C.8. Let A_{k+r} be a light row that has been processed by the greedy algorithm. Let {e_1, ..., e_k} be the Gram–Schmidt basis for the current {w_1, ..., w_k}. If A_{k+r} is assigned to bin b_{k−1} (w.l.o.g.), the squared projection coefficient of A_{k+r} onto e_i, i ≠ k − 1, is at most 4β²·k², where β = O(n⁻¹k⁻³) upper bounds the inner products of the normalized A_{k+r}, w_1, ..., w_k.

Proof: Without loss of generality, it suffices to bound the squared projection of A_{k+r} onto the direction of w_k that is orthogonal to the subspace spanned by w_1, ..., w_{k−1}. Let e_1, ..., e_k be an orthonormal basis of w_1, ..., w_k guaranteed by Lemma C.3. Next, we expand the orthonormal basis to include e_{k+1}, so that we can write the normalized vector of A_{k+r} as v_{k+r} = Σ_{j=1}^{k+1} b_j e_j. By a similar approach to the proof of Lemma C.3, for each j


(A.12): µ_{C_W} ∈ colsp(W), so we can apply the Pythagorean theorem.
(A.13): (µ_{C_W}, C_W) is an optimal k-means clustering of the projected points π_W(A).
(A.14): µ_{C_U} ∈ colsp(U) ⊂ colsp(W), so we can apply the Pythagorean theorem.
(A.15): We apply Corollary A.14.

Remark A.11. Our result shows that the "sketch monotonicity" property holds for sketching matrices that provide strong coresets for k-means clustering. Besides strong coresets, an alternate approach to showing that the clustering objective is approximately preserved on sketched inputs is to show a weaker property: the clustering cost is preserved for all possible partitions of the points into k groups (Makarychev et al., 2019). While dimension reduction mappings satisfying strong coresets require poly(k/ε) dimensions, Makarychev et al. (2019) show that O(log k/ε²) dimensions suffice to satisfy this "partition" guarantee. An interesting question for further research is whether the sketch monotonicity guarantee also applies to the construction of Makarychev et al. (2019).

Corollary A.12. Assume we have A ∈ R^{n×d}, j ∈ Z⁺, ε > 0. Define m = min(O(poly(j/ε)), d). Let Ã_m be a (1 + ε)-approximation to A_m of the form Ã_m = AVVᵀ, where SA = UΣVᵀ for a CountSketch S ∈ R^{m×n}. Let X ∈ R^{d×j} be a matrix whose columns are orthonormal, and let Y ∈ R^{d×(d−j)} be a matrix with orthonormal columns that spans the orthogonal complement of colsp(X). Then

Proof: Thus, for a sufficiently large value of ℓ, the greedy algorithm will assign A_{k+r} to a bin that only contains light rows. This completes the inductive proof and, in particular, implies that at the end of the algorithm, heavy rows are assigned to isolated bins.

Corollary C.11.
The approximation loss of the best rank-k approximate solution in the rowspace of S_g A, for A ∼ A_sp(s, ε) where A ∈ R^{n×d}, d = Ω(n⁴k⁴ log n), and S_g is the CountSketch constructed by the greedy algorithm with non-increasing order, is at most n^{−s}.

Proof: First, we need to show that the absolute values of the pairwise inner products of the normalized rows are small enough that we can apply Lemma C.10. To show this, note that by Observation C.1, these are at most √(2 log n / d) ≤ n^{−2}k^{−2}, since d = Ω(n⁴k⁴ log n). The proof follows from Lemma C.6 and Lemma C.10. Since all heavy rows are mapped to isolated bins, the projection loss of the light rows is at most n^{−s}.

Next, we bound the Frobenius norm error of the best rank-k approximate solution constructed by the standard CountSketch with a randomly chosen sparsity pattern.

Lemma C.12. Let s = αk, where 0.7 < α < 1. The expected squared loss of the best rank-k approximate solution in the rowspace of S_r A, for A ∈ R^{n×d} ∼ A_sp(s, ε) where d = Ω(n⁶ε²) and S_r is a CountSketch whose sparsity pattern is chosen uniformly at random, is at least (s − k(1 − e^{−s/(k−1)}))(ε − n^{−O(1)}).

Proof: We can interpret the randomized construction of the CountSketch as a "balls and bins" experiment. In particular, considering the heavy rows, we compute the expected number of bins (i.e., rows in S_r A) that contain a heavy row. Note that the expected number of rows in S_r A that do not contain any heavy row is at least k·e^{−s/(k−1)}. Hence, the number of rows in S_r A that contain a heavy row of A is at most k(1 − e^{−s/(k−1)}). Thus, at least s − k(1 − e^{−s/(k−1)}) heavy rows are not mapped to an isolated bin (i.e., they collide with some other heavy row). Then, it is straightforward to show that the squared loss of each such row is at least ε − n^{−O(1)}.

Claim C.13. Suppose that heavy rows A_{r1}, . . . , A_{rc} are mapped to the same bin via a CountSketch S. Then, the total squared distance of these rows from the subspace spanned by SA is at least (c − 1)ε − O(n^{−1}).
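The balls-and-bins step in the proof of Lemma C.12 can be checked numerically. The sketch below (with arbitrary example values of k and s, chosen so that s = αk with α < 1) compares the exact expectation k(1 − 1/k)^s of the number of heavy-free bins against the e^{−s/(k−1)} lower bound and a Monte Carlo simulation.

```python
import math
import random

def expected_heavy_free_bins(k, s):
    # each of the s heavy rows lands in a uniformly random bin;
    # a fixed bin avoids all of them with probability (1 - 1/k)^s
    return k * (1 - 1 / k) ** s

k, s = 60, 50                                  # example values, s = alpha*k with alpha < 1
exact = expected_heavy_free_bins(k, s)
lower = k * math.exp(-s / (k - 1))             # the e^{-s/(k-1)} bound from the proof

# Monte Carlo estimate of the same expectation
rng = random.Random(0)
trials = 2000
total = 0
for _ in range(trials):
    occupied = {rng.randrange(k) for _ in range(s)}
    total += k - len(occupied)                 # bins containing no heavy row
estimate = total / trials
```

The inequality k(1 − 1/k)^s ≥ k·e^{−s/(k−1)} holds because ln(1 − 1/k) ≥ −1/(k − 1).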

C.2 ZIPFIAN ON SQUARED ROW NORMS.

Each matrix A ∈ R^{n×d} ∼ A_zipf has rows which are uniformly random and orthogonal. Each A has 2^{i+1} rows of squared norm n²/2^{2i}, for i ∈ [1, . . . , O(log n)]. We also assume that each row has the same squared norm for all members of A_zipf.

In this section, the s rows with largest norm are called the heavy rows and the remaining ones are the light rows. For convenience, we number the heavy rows 1 through s; however, the heavy rows can appear at any indices, as long as any row of a given index has the same norm for all members of A_zipf. Also, we assume that s ≤ k/2 and, for simplicity, that s = Σ_{i=1}^{h_s} 2^{i+1} for some h_s ∈ Z^+. This means the minimum squared norm of a heavy row is n²/2^{2h_s} and the maximum squared norm of a light row is n²/2^{2h_s+2}.

The analysis of the greedy algorithm ordered by non-increasing row norms on this family of matrices is similar to our analysis for the spiked covariance model. Here we analyze the case in which the rows are orthogonal; by continuity, if the rows are close enough to being orthogonal, all decisions made by the greedy algorithm will be the same.

As a first step, by Lemma C.6, at the end of iteration k the first k rows are assigned to different bins. Then, via a similar inductive proof, we show that none of the light rows is mapped to a bin that contains one of the top s heavy rows.

Lemma C.15. At each iteration k + r, the greedy algorithm picks the position of the non-zero value in the (k + r)-th column of the CountSketch matrix S so that the light row A_{k+r} is mapped to a bin that does not contain any of the top s heavy rows.

Proof: We prove the statement by induction. The base case r = 0 trivially holds, as the first k rows are assigned to distinct bins. Next, we assume that in none of the first k + r − 1 iterations is a light row assigned to a bin that contains a heavy row. Now, we consider the following cases:

1. A_{k+r} is assigned to a bin that only contains light rows.
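To make the family concrete, here is one way to sample a matrix with the stated structure: mutually orthogonal rows, with 2^{i+1} rows of squared norm n²/2^{2i} at each level i. This is an illustrative sketch; drawing orthonormal rows from the QR factorization of a Gaussian matrix is one standard way to make them uniformly random.

```python
import numpy as np

def sample_zipf_like(levels, d, rng):
    """Orthogonal rows with Zipfian squared norms: level i in [1, levels]
    contributes 2^(i+1) rows of squared norm n^2 / 2^(2i)."""
    counts = [2 ** (i + 1) for i in range(1, levels + 1)]
    n = sum(counts)
    assert n <= d, "need d >= n so the rows can be orthogonal"
    # orthonormal rows via QR factorization of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.standard_normal((d, n)))
    rows = Q.T                                     # n orthonormal rows
    sq_norms = np.concatenate(
        [np.full(c, n ** 2 / 2 ** (2 * i))
         for i, c in zip(range(1, levels + 1), counts)])
    return rows * np.sqrt(sq_norms)[:, None], sq_norms

rng = np.random.default_rng(1)
A, sq_norms = sample_zipf_like(levels=4, d=256, rng=rng)   # n = 4+8+16+32 = 60 rows
```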
Without loss of generality, we can assume that A_{k+r} is assigned to b_k. Since the vectors are orthogonal, we only need to bound the difference in the projections of A_{k+r} and of the light rows assigned to b_k onto the direction of w_k, before and after adding A_{k+r} to b_k. In this case, the total squared losses corresponding to the rows in b_k and A_{k+r} before and after adding A_{k+r} to b_k are, respectively:

Hence, the greedy algorithm will map A_{k+r} to a bin that only contains light rows.

Corollary C.16. The squared loss of the best rank-k approximate solution in the rowspace of S_g A, for A ∈ R^{n×d} ∼ A_zipf, where S_g is the CountSketch constructed by the greedy algorithm with non-increasing order, is less than n²/2^{h_k−2}.

Proof: At the end of iteration k, the total squared loss is

After that, in each iteration k + r, by (C.7), the squared loss increases by at most ‖A_{k+r}‖₂². Hence, the total squared loss in the solution returned by S_g is at most

Next, we bound the squared loss of the best rank-k approximate solution constructed by the standard CountSketch with a randomly chosen sparsity pattern.

Observation C.17. Let us assume that the orthogonal rows A_{r1}, . . . , A_{rc} are mapped to the same bin, and for each

Hence, the total projection loss of

In particular, Observation C.17 implies that whenever two rows are mapped into the same bin, the squared norm of the row with smaller norm fully contributes to the total squared loss of the solution.

Lemma C.18. For k > 2^{10} − 2, the expected squared loss of the best rank-k approximate solution in the rowspace of S_r A, for A ∈ R^{n×d} ∼ A_zipf, where S_r is a CountSketch whose sparsity pattern is chosen uniformly at random, is at least 1.095 · n²/2^{h_k−2}.

Proof: In light of Observation C.17, we need to compute the expected number of collisions between rows with "large" norm.
We can interpret the randomized construction of the CountSketch as a "balls and bins" experiment. For each 0 ≤ j ≤ h_k, let A_j denote the set of rows with squared norm n²/2^{2j}. Next, for a row A_r in A_j (0 ≤ j < h_k), we compute the probability that at least one row in A_{>j} collides with A_r:

Pr[at least one row in A_{>j} collides with A_r] = 1 − (1 − 1/k)^{|A_{>j}|}
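This collision probability follows from each row being hashed to one of k bins uniformly and independently; a quick simulation with arbitrary example values of k and m = |A_{>j}| confirms the closed form.

```python
import random

def collision_prob_exact(k, m):
    # probability that at least one of m independently hashed rows
    # lands in the same bin as a fixed row A_r
    return 1 - (1 - 1 / k) ** m

def collision_prob_simulated(k, m, trials, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        target = rng.randrange(k)              # bin of A_r
        if any(rng.randrange(k) == target for _ in range(m)):
            hits += 1
    return hits / trials

k, m = 40, 24                                  # example: m = |A_{>j}| heavier rows, k bins
exact = collision_prob_exact(k, m)
approx = collision_prob_simulated(k, m, trials=20000)
```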

Hence, by Observation C.17, the contribution of the rows in A_j to the total squared loss is at least

Thus, the contribution of the rows with "large" squared norm, i.e., A_{>0}, to the total squared loss is at least

Corollary C.19. Let S_g be a CountSketch whose sparsity pattern is learned over a training set drawn from A_zipf via the greedy approach, and let S_r be a CountSketch whose sparsity pattern is picked uniformly at random. Then, for an n × d matrix A ∼ A_zipf and a sufficiently large value of k, the expected loss of the best rank-k approximation of A returned by S_r is worse than the approximation loss of the best rank-k approximation of A returned by S_g by at least a constant factor.

Proof: The proof follows from Lemma C.18 and Corollary C.16.

Remark C.20. We have provided evidence that the greedy algorithm that examines the rows of A in non-increasing order of their norms (i.e., greedy with non-increasing order) results in a better rank-k solution than the CountSketch whose sparsity pattern is chosen at random. However, other implementations of the greedy algorithm may still result in a better solution than greedy with non-increasing order. As an example, in the following simple instance the greedy algorithm that checks the rows of A in a random order (i.e., greedy with random order) achieves a rank-k solution whose cost is a constant factor better than that of greedy with non-increasing order.

Let A be a matrix with four orthogonal rows u, u, v, w, where ‖u‖₂ = 1 and ‖v‖₂ = ‖w‖₂ = 1 + ε, and suppose that the goal is to compute a rank-2 approximation of A. Note that in greedy with non-increasing order, v and w will be assigned to different bins, and by a simple calculation we can show that the copies of u will also be assigned to different bins. Hence, the squared loss of the computed rank-2 solution is 4(1 + ε)²/(2 + (1 + ε)²).
However, the optimal solution assigns v and w to one bin and the two copies of u to the other bin, which results in a squared loss of (1 + ε)²; this is a constant factor smaller than the loss of greedy with non-increasing order for sufficiently small values of ε. On the other hand, greedy with random order computes the same solution as the optimal one with constant probability (1/3 + 1/8); otherwise, it returns the same solution as greedy with non-increasing order. Hence, in expectation, the solution returned by greedy with random order is better than the solution returned by greedy with non-increasing order by a constant factor.
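This example can be verified numerically. The sketch below realizes the orthogonal rows as scaled standard basis vectors (an arbitrary but valid choice) and compares the squared loss of the two bin assignments.

```python
import numpy as np

def rank2_loss(A, S):
    """Squared loss of the best rank-2 approximation of A whose rows are
    restricted to the rowspace of SA (here SA always has rank 2)."""
    Q, _ = np.linalg.qr((S @ A).T)             # orthonormal basis of rowspace(SA)
    return np.linalg.norm(A - A @ Q @ Q.T, 'fro') ** 2

eps = 0.1
c = 1 + eps
u, e2, e3 = np.eye(3)
A = np.stack([u, u, c * e2, c * e3])           # rows u, u, v, w

S_paired = np.array([[1., 1., 0., 0.],         # optimal: bins {u, u} and {v, w}
                     [0., 0., 1., 1.]])
S_split = np.array([[1., 0., 1., 0.],          # bins {u, v} and {u, w}
                    [0., 1., 0., 1.]])

loss_paired = rank2_loss(A, S_paired)          # equals (1 + eps)^2
loss_split = rank2_loss(A, S_split)            # equals 4(1+eps)^2 / (2 + (1+eps)^2)
```

For small ε the split assignment loses roughly a 4/3 factor relative to the paired one.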

D EXPERIMENTS - APPENDIX

D.1 BASELINES

We comment on two of our baselines:

• Exact SVD: In the canonical, learning-free sketching setting (i.e., any matrix is equally probable), sketching using the top m singular vectors yields a (1 + ε)-approximation for both LRA and k-means (Cohen et al., 2015).
• Column sampling: In the canonical, learning-free sketching setting, sketching via column sampling yields a (1 + ε)-approximation for k-means (Cohen et al., 2015).
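As an illustration of the exact-SVD baseline for LRA (a sketch on synthetic data, with arbitrary k and m): if S consists of the top m left singular vectors of A, then the best rank-k solution whose rows lie in the rowspace of SA recovers the optimal rank-k approximation exactly whenever m ≥ k.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50)) @ rng.standard_normal((50, 40))
k, m = 5, 10                                  # target rank and sketch size, m >= k

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = U[:, :m].T                                # "exact SVD" sketch, m x n

# best rank-k approximation of A with rows in rowspace(SA)
Q, _ = np.linalg.qr((S @ A).T)                # orthonormal basis of rowspace(SA)
P = A @ Q @ Q.T                               # project rows of A onto that subspace
Uk, sk, Vkt = np.linalg.svd(P, full_matrices=False)
A_k = (Uk[:, :k] * sk[:k]) @ Vkt[:k]

err = np.linalg.norm(A - A_k, 'fro') ** 2
opt = (s[k:] ** 2).sum()                      # error of the true best rank-k approximation
```

Here rowspace(SA) is exactly the span of the top m right singular vectors of A, so projecting and then truncating to rank k reproduces A_k, and `err` matches the optimal error `opt`.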

D.2 EXPERIMENTAL PARAMETERS

For the tables in Section 5, we describe the experimental parameters. First, we provide some general implementation details.

We implemented both the greedy (Algorithm 1) and stochastic gradient descent (Algorithm 2) algorithms in PyTorch. In the first case, PyTorch allowed us to harness GPUs to speed up computation on large matrices; we used several Nvidia GeForce GTX 1080 Ti machines. In the second case, PyTorch allowed us to effortlessly compute gradients for each task's objective function. Specifically, PyTorch provides automatic differentiation, which is implemented by backpropagation through chained differentiable operators.

There are also two points of note in the greedy algorithm implementation. First, we noticed that for MRR and LRA, each iteration required computing the SVD of many rank-1 updates of the current S. Instead of computing the SVD from scratch for each of these variants, we first computed the SVD of S and then used fast rank-1 SVD updates (Brand, 2006). This greatly improved the runtime of Algorithm 1. Second, we set D_w (the set of candidate row weights) to 10 samples in [−2, 2], because we noticed that most weights were in this range after running Algorithm 2.

D.3 RUNNING TIME

We examine the runtimes of our various sketching algorithms. In Table D.1, the times are obtained for the LRA task with k = 30, m = 60 on the logo dataset; however, similar trends should be expected for other combinations of task, task parameters, and dataset.

We define the inference runtime as the time to apply the sketching algorithm. The training runtime is the time to train a sketch on A_train and only applies to learned sketches. Generally, the long training times are not problematic because training is only done once and can be completed offline. On the other hand, the inference runtime should be as fast as possible. Note that inference was timed using one matrix from A_test on an Nvidia GeForce GTX 1080 Ti GPU, and the values were averaged over 10 trials.

We observe that sparse sketches (such as the ones used in "GD only" and "Ours") have much lower inference runtimes than the dense sketches of exact SVD.
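The inference-time gap comes from sparsity: a CountSketch has a single nonzero per column, so S·A can be applied in time proportional to the size of A, without ever materializing the m × n sketch. A minimal sketch of this idea (the function name and data layout are our own, not the paper's implementation):

```python
import numpy as np

def apply_countsketch(positions, values, A, m):
    """Compute S @ A where column i of S has a single nonzero values[i]
    at row positions[i]; runs in O(n*d) instead of the dense O(m*n*d)."""
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, positions, values[:, None] * A)   # accumulate signed rows into bins
    return SA

rng = np.random.default_rng(0)
n, d, m = 500, 64, 20
A = rng.standard_normal((n, d))
positions = rng.integers(0, m, size=n)              # bin of each row of A
values = rng.choice([-1.0, 1.0], size=n)            # random signs

SA_fast = apply_countsketch(positions, values, A, m)

# dense reference computation for comparison
S_dense = np.zeros((m, n))
S_dense[positions, np.arange(n)] = values
```

The same one-nonzero-per-column layout applies to learned CountSketch matrices, so learned sparse sketches retain this fast application path.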

