LEARNING THE POSITIONS IN COUNTSKETCH

Abstract

We consider sketching algorithms which first compress data by multiplication with a random sketch matrix, and then apply the sketch to quickly solve an optimization problem, e.g., low-rank approximation and regression. In the learning-based sketching paradigm proposed by Indyk, Vakilian, and Yuan (2019), the sketch matrix is found by choosing a random sparse matrix, e.g., CountSketch, and then updating the values of its non-zero entries by running gradient descent on a training data set. Despite the growing body of work on this paradigm, a noticeable omission is that the locations of the non-zero entries of previous algorithms were fixed, and only their values were learned. In this work, we propose the first learning-based algorithms that also optimize the locations of the non-zero entries. Our first algorithm is based on a greedy search. However, one drawback of the greedy search is its slow training time. We fix this issue and propose approaches for learning a sketching matrix for both low-rank approximation and Hessian approximation for second-order optimization. The latter is helpful for a range of constrained optimization problems, such as LASSO and matrix estimation with a nuclear norm constraint. Both approaches achieve good accuracy with a fast running time. Moreover, our experiments suggest that our algorithm can still reduce the error significantly even if we only have a very limited number of training matrices.

1. INTRODUCTION

The work of Indyk et al. (2019) investigated learning-based sketching algorithms for low-rank approximation. A sketching algorithm is a method of constructing approximate solutions for optimization problems via summarizing the data. In particular, linear sketching algorithms compress data by multiplication with a sparse "sketch matrix" and then use just the compressed data to find an approximate solution. Generally, this technique results in much faster or more space-efficient algorithms for a fixed approximation error. The pioneering work of Indyk et al. (2019) shows it is possible to learn sketch matrices for low-rank approximation (LRA) with better average performance than classical sketches. In this model, we assume inputs come from an unknown distribution and learn a sketch matrix with strong expected performance over the distribution. This distributional assumption is often realistic: there are many situations where a sketching algorithm is applied to a large batch of related data. For example, genomics researchers might sketch DNA from different individuals, which is known to exhibit strong commonalities. The high-performance computing industry also uses sketching, e.g., researchers at NVIDIA have created standard implementations of sketching algorithms for CUDA, a widely used GPU computing platform. They investigated the (classical) sketched singular value decomposition (SVD), but found that the solutions were not accurate enough across a spectrum of inputs (Chien & Bernabeu, 2019). This is precisely the issue addressed by the learned sketch paradigm, where we optimize for "good" average performance across a range of inputs. While promising results have been shown using previous learned sketching techniques, notable gaps remain. In particular, all previous methods work by initializing the sketching matrix with a random sparse matrix, e.g., each column of the sketching matrix has a single non-zero value chosen at a uniformly random position.
Then, the values of the non-zero entries are updated by running gradient descent on a training data set, or via other methods. However, the locations of the non-zero entries are held fixed throughout the entire training process. Clearly this is sub-optimal. Indeed, suppose the input matrix A is an n × d matrix whose first d rows equal the d × d identity matrix and whose remaining rows equal 0. A random sketching matrix S with a single non-zero entry per column is known to require m = Ω(d^2) rows in order for SA to preserve the rank of A (Nelson & Nguyên, 2014); this follows by a birthday paradox argument. On the other hand, it is clear that if S is the d × n matrix whose first d columns form the identity matrix (and whose remaining columns are 0), then ∥SAx∥_2 = ∥Ax∥_2 for all vectors x, so S preserves not only the rank of A but all important spectral properties. A random matrix would be very unlikely to choose the non-zero entries in the first d columns of S so perfectly, whereas an algorithm trained to optimize the locations of the non-zero entries would notice and correct for this. This is precisely the gap in our understanding that we seek to fill.
Learned CountSketch Paradigm of Indyk et al. (2019). Throughout the paper, we assume our data A ∈ R^{n×d} is sampled from an unknown distribution D. Specifically, we have a training set Tr = {A_1, . . . , A_N} drawn from D. The generic form of our optimization problems is min_X f(A, X), where A ∈ R^{n×d} is the input matrix. For a given optimization problem and a set S of sketching matrices, define ALG(S, A) to be the output of the classical sketching algorithm resulting from using S; this uses the sketching matrices in S to map the given input A and construct an approximate solution X. We remark that the number of sketches used by an algorithm can vary: in the simplest case, S is a single sketch, but in more complicated sketching approaches we may need to apply sketching more than once, so S may also denote a set of more than one sketching matrix.
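The identity-matrix example above can be checked numerically. The following sketch (the sizes and variable names are our own, chosen for illustration) builds A = [I_d; 0] and compares a random CountSketch with only m = d rows, which almost surely suffers a bucket collision and loses rank, against a placement that routes the first d columns to distinct buckets:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 20, 20  # illustrative sizes; m = d is far below the Omega(d^2) bound

# A has its first d rows equal to I_d and the remaining rows equal to 0.
A = np.zeros((n, d))
A[:d, :d] = np.eye(d)

# Random CountSketch: one +/-1 entry per column at a uniformly random row.
S_rand = np.zeros((m, n))
positions = rng.integers(0, m, size=n)
values = rng.choice([-1.0, 1.0], size=n)
S_rand[positions, np.arange(n)] = values

# Optimized placement: route column i to bucket i for the first d columns.
S_opt = np.zeros((m, n))
S_opt[np.arange(d), np.arange(d)] = 1.0

# Since SA = S[:, :d] here, rank(SA) equals the number of distinct buckets
# hit by the first d columns; any collision destroys rank.
print(np.linalg.matrix_rank(S_rand @ A))  # almost surely < d by birthday paradox
print(np.linalg.matrix_rank(S_opt @ A))   # exactly d
```

With m = d buckets and d columns to place, a collision occurs with overwhelming probability, which is why a random placement needs Ω(d^2) rows while the optimized placement needs only d.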
The learned sketch framework has two parts: (1) offline sketch learning and (2) "online" sketching (i.e., applying the learned sketch and some sketching algorithm to possibly unseen data). In offline sketch learning, the goal is to construct a CountSketch matrix (abbreviated as CS matrix) with the minimum expected error for the problem of interest. Formally, that is,
argmin_{CS S} E_{A∈Tr} [f(A, ALG(S, A)) − f(A, X*)] = argmin_{CS S} E_{A∈Tr} f(A, ALG(S, A)),
where X* denotes the optimal solution and the minimum is taken over all possible constructions of CS. We remark that when ALG needs more than one CS to be learned (e.g., in the sketching algorithm we consider for LRA), we optimize each CS independently using a surrogate loss function. In the second part of the learned sketch paradigm, we take the sketch from part one and use it within a sketching algorithm. This learned sketch and sketching algorithm can be applied again and again to different inputs. Finally, we augment the sketching algorithm to provide worst-case guarantees when used with learned sketches. The goal is to have good performance on A ∈ D while the worst-case performance on A ∉ D remains comparable to the guarantees of classical sketches. We remark that the learned matrix S is trained offline only once using the training data. Hence, no additional computational cost is incurred when solving the optimization problem on the test data.
Our Results. In this work, in addition to learning the values of the non-zero entries, we learn the locations of the non-zero entries. Namely, we propose algorithms that learn the locations of the non-zero entries in CountSketch. Our first algorithm (Section 4) is based on a greedy search. The empirical results show that this approach can achieve good performance. Further, we show that the greedy algorithm is provably beneficial for LRA when inputs follow a certain input distribution (Section F).
However, one drawback of the greedy algorithm is its much slower training time. We fix this issue and propose two specific approaches for optimizing the positions in the sketches for low-rank approximation and second-order optimization, which run much faster than all previous algorithms while achieving better performance. For low-rank approximation, our approach first samples a small set of rows based on their ridge leverage scores, assigns each sampled row to a unique hash bucket, and then places each remaining (non-sampled) row in the hash bucket containing the sampled row to which it is most similar, i.e., with which it has the largest dot product. We also show that the worst-case guarantee of this approach is strictly better than that of the classical Count-Sketch (see Section 5). For sketch-based second-order optimization, where we focus on the case n ≫ d, we observe that the property of the sketch matrix we actually need is the subspace embedding property, and we optimize this property directly. We prove that the sketch matrix S needs fewer rows, with optimized positions of the non-zero entries, when the input matrix A has a small number of rows with heavy leverage scores. More precisely, while CountSketch takes O(d^2/(δϵ^2)) rows for failure probability δ, in our construction S requires only O((d polylog(1/ϵ) + log(1/δ))/ϵ^2) rows if A has at most d polylog(1/ϵ)/ϵ^2 rows with leverage score at least ϵ/d. This is a quadratic improvement in d and an exponential improvement in δ. In practice, it is not necessary to calculate the leverage scores; instead, we show in our experiments that the indices of the rows with heavy leverage scores can be learned and the induced S is still accurate.
We also consider a new learning objective: we directly optimize the subspace embedding property of the sketching matrix instead of optimizing the error in the objective function of the optimization problem at hand. This demonstrates a significant advantage over non-learned sketches and has a fast training time (Section 6). We show strong empirical results for real-world datasets. For low-rank approximation, our methods reduce the error by 70% compared to classical sketches of the same sketch size, and by 30% compared to previous learning-based sketches. For second-order optimization, we show that the convergence rate can be reduced by 87% over the non-learned CountSketch for the LASSO problem on a real-world dataset. We also evaluate our approaches in the few-shot learning setting, where we only have a limited amount of training data (Indyk et al., 2021), and show that our approach reduces the error significantly even if we only have one training matrix (Sections 7 and 8). This approach also runs faster than all previous methods.
Additional Related Work. In the last few years, there has been much work on leveraging machine learning techniques to improve classical algorithms; we mention only a few examples based on learned sketches. One related body of work is data-dependent dimensionality reduction, such as an approach for pair-wise/multi-wise similarity preservation for indexing big data (Wang et al., 2017), learned sketching for streaming problems (Indyk et al., 2019; Aamand et al., 2019; Jiang et al., 2020; Cohen et al., 2020; Eden et al., 2021; Indyk et al., 2021), learned algorithms for nearest neighbor search (Dong et al., 2020), and a method for learning linear projections for general applications (Hegde et al., 2015). While we also learn linear embeddings, our embeddings are optimized for the specific application of low-rank approximation.
In fact, one of our central challenges is that the theory and practice of learned sketches generally need to be tailored to each application. Our work builds on (Indyk et al., 2019), which introduced gradient descent optimization for LRA, but a major difference is that we also optimize the locations of the non-zero entries.

2. PRELIMINARIES

Notation. Denote the canonical basis vectors of R^n by e_1, . . . , e_n. Suppose that A has singular value decomposition (SVD) A = UΣV⊤. Define [A]_k = U_k Σ_k V_k⊤ to be the optimal rank-k approximation to A, computed by the truncated SVD. Also, define the Moore-Penrose pseudo-inverse of A to be A† = V Σ^{-1} U⊤, where Σ^{-1} is constructed by inverting the non-zero diagonal entries. Let row(A) and col(A) be the row space and the column space of A, respectively.
CountSketch. We define S_C ∈ R^{m×n} as a classical CountSketch (abbreviated as CS). It is a sparse matrix with one non-zero entry from {±1} per column. The position and value of this non-zero entry are chosen uniformly at random. CountSketch matrices can be succinctly represented by two vectors: we define p ∈ [m]^n and v ∈ R^n as the positions and values of the non-zero entries, respectively, and let CS(p, v) be the CountSketch constructed from the vectors p and v. Below we define the objective function f(·, ·) and a classical sketching algorithm ALG(S, A) for each individual problem.
Low-rank approximation (LRA). In LRA, we find a rank-k approximation of our data that minimizes the Frobenius norm of the approximation error. For A ∈ R^{n×d},
min_{rank-k B} f_LRA(A, B) = min_{rank-k B} ∥A − B∥_F^2.
Usually, instead of outputting the whole B ∈ R^{n×d}, the algorithm outputs two factors Y ∈ R^{n×k} and X ∈ R^{k×d} such that B = YX, for efficiency. Indyk et al. (2019) considered Algorithm 1, which compresses only one side of the input matrix A. However, in practice both dimensions of A are often large; hence, in this work we consider Algorithm 2, which compresses both sides of A.
Constrained regression. Given a vector b ∈ R^n, a matrix A ∈ R^{n×d} (n ≫ d) and a convex set C, we want to find x minimizing the squared error
min_{x∈C} f_REG([A b], x) = min_{x∈C} ∥Ax − b∥_2^2. (2.1)
Iterative Hessian Sketch.
The Iterative Hessian Sketch (IHS) method (Pilanci & Wainwright, 2016) solves the constrained least-squares problem by iteratively performing the update
x^{t+1} = argmin_{x∈C} { (1/2) ∥S^{t+1} A (x − x^t)∥_2^2 − ⟨A⊤(b − Ax^t), x − x^t⟩ }, (2.2)
where S^{t+1} is a sketching matrix. It is not difficult to see that for the unsketched version of (2.2) (where S^{t+1} is the identity matrix), the optimal solution x^{t+1} coincides with the optimal solution to the original constrained regression problem (2.1). The IHS approximates the Hessian A⊤A by a sketched version (S^{t+1}A)⊤(S^{t+1}A) to improve the runtime, as S^{t+1}A typically has very few rows.
Algorithm 1 Rank-k approximation of A using a sketch S (see (Clarkson & Woodruff, 2009, Sec. 4.1.1))
Input: A ∈ R^{n×d}, S ∈ R^{m×n}
1: U, Σ, V⊤ ← COMPACTSVD(SA) ▷ r = rank(SA), U ∈ R^{m×r}, V ∈ R^{d×r}
2: return [AV]_k V⊤
Algorithm 2 ALG_LRA (SKETCH-LOWRANK), following Sarlos (2006); Clarkson & Woodruff (2017); Avron et al. (2017)
Input: A ∈ R^{n×d}, S ∈ R^{m_S×n}, R ∈ R^{m_R×d}, V ∈ R^{m_V×n}, W ∈ R^{m_W×d}
1: U_C [T_C T′_C] ← VAR⊤ and [T_D T′_D]⊤ U_D⊤ ← SAW⊤, with U_C, U_D orthogonal
2: G ← VAW⊤, [Z′_L Z′_R] ← [U_C⊤ G U_D]_k
3: Z_L ← [Z′_L (T_D^{-1})⊤  0], Z_R ← [T_C^{-1} Z′_R; 0]
4: Z ← Z_L Z_R
5: return AR⊤ Z SA in the form P ∈ R^{n×k}, Q ∈ R^{k×d}
Learning-Based Algorithms in the Few-Shot Setting. Recently, Indyk et al. (2021) studied learning-based algorithms for LRA in the setting where we have access to limited data or computing resources. We provide a brief explanation of learning-based algorithms in the few-shot setting in Appendix A.3.
Leverage Scores and Ridge Leverage Scores. Given a matrix A, the leverage score of the i-th row a_i of A is defined to be τ_i := a_i (A⊤A)† a_i⊤, which is the squared ℓ_2-norm of the i-th row of U, where A = UΣV⊤ is the singular value decomposition of A. Given a regularization parameter λ, the ridge leverage score of the i-th row a_i of A is defined to be τ_i := a_i (A⊤A + λI)† a_i⊤.
Our learning-based algorithms employ the ridge leverage score sampling technique proposed in (Cohen et al., 2017), which shows that sampling proportionally to ridge leverage scores gives a good solution to LRA.
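The CS(p, v) representation above makes the hashing view concrete: applying the sketch never requires materializing the m × n matrix, since row i of A is simply added to bucket p_i with weight v_i. A minimal sketch of this (the function name is ours):

```python
import numpy as np

def countsketch_apply(p, v, A, m):
    """Compute CS(p, v) @ A without materializing the m x n sketch matrix.

    p: length-n array of bucket indices in [0, m)
    v: length-n array of non-zero values (classically +/-1)
    A: n x d data matrix
    """
    n, d = A.shape
    SA = np.zeros((m, d))
    for i in range(n):          # each row of A lands in exactly one bucket
        SA[p[i]] += v[i] * A[i]
    return SA

rng = np.random.default_rng(1)
n, d, m = 200, 10, 30
A = rng.standard_normal((n, d))
p = rng.integers(0, m, size=n)
v = rng.choice([-1.0, 1.0], size=n)

# Sanity check: matches explicit multiplication by the dense CS matrix.
S = np.zeros((m, n))
S[p, np.arange(n)] = v
print(np.allclose(countsketch_apply(p, v, A, m), S @ A))
```

This bucket-wise accumulation is why computing SA takes O(nnz(A)) time for any CountSketch, learned or classical.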

3. DESCRIPTION OF OUR APPROACH

We describe our contributions to the learning-based sketching paradigm, which, as mentioned, are to learn the locations of the non-zero values in the sketch matrix. To learn a CountSketch for the given training data set, we locally optimize
min_S E_{A∈D} [f(A, ALG(S, A))] (3.1)
in two stages: (1) compute the positions of the non-zero entries, then (2) fix the positions and optimize their values.
Stage 1: Optimizing Positions. In Section 4, we provide a greedy search algorithm for this stage, as our starting point. In Sections 5 and 6, we provide our specific approaches for optimizing the positions in the sketches for low-rank approximation and second-order optimization.
Stage 2: Optimizing Values. This stage is similar to the approach of Indyk et al. (2019). However, instead of the power method, we use an automatic differentiation package, PyTorch (Paszke et al., 2019), and pass it our objective min_{v∈R^n} E_{A∈D} [f(A, ALG(CS(p, v), A))], implemented as a chain of differentiable operations; the gradient is then computed automatically via the chain rule. We also consider new approaches to optimizing the values for LRA (proposed in Indyk et al. (2021); see Appendix A.3 for details) and for second-order optimization (proposed in Section 6).
Worst-Case Guarantees. In Appendix D, we show that both of our approaches for the above two problems perform no worse than a classical sketching matrix when A does not follow the distribution D. In particular, for LRA, we show that the sketch monotonicity property holds for the time-optimal sketching algorithm for low-rank approximation. For second-order optimization, we propose an algorithm which runs in input-sparsity time and can test for and use the better of a random sketch and a learned sketch.
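Stage 2 can be sketched in PyTorch roughly as follows, here instantiated for the one-sided LRA objective of Algorithm 1. This is our own minimal illustration, not the paper's implementation; the function and variable names are assumptions, and the training loop is deliberately short.

```python
import torch

def lra_objective(v, p, A, m, k):
    """||A - [AV]_k V^T||_F for S = CS(p, v); differentiable in the values v."""
    # Build SA directly: row i of A goes to bucket p[i], scaled by v[i].
    SA = torch.zeros(m, A.shape[1], dtype=A.dtype).index_add(0, p, v[:, None] * A)
    V = torch.linalg.svd(SA, full_matrices=False).Vh.T   # right singular vectors of SA
    AV = A @ V
    U, s, Wh = torch.linalg.svd(AV, full_matrices=False)
    X = (U[:, :k] * s[:k]) @ Wh[:k] @ V.T                # best rank-k approx in row(SA)
    return torch.linalg.norm(A - X)

torch.manual_seed(0)
n, d, m, k = 50, 8, 10, 3
train = [torch.randn(n, d) for _ in range(4)]
p = torch.randint(0, m, (n,))             # positions: fixed output of Stage 1
v = torch.randn(n, requires_grad=True)    # values: optimized in Stage 2
opt = torch.optim.Adam([v], lr=0.05)
for _ in range(30):
    opt.zero_grad()
    loss = sum(lra_objective(v, p, A, m, k) for A in train) / len(train)
    loss.backward()                       # autodiff through the SVD, per the chain rule
    opt.step()
```

Note that gradients through torch.linalg.svd can be unstable when singular values are nearly degenerate; on random data this is not an issue.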

4. SKETCH LEARNING: GREEDY SEARCH

Algorithm 3 POSITION OPTIMIZATION: GREEDY SEARCH
Input: f, ALG, Tr = {A_1, ..., A_N ∈ R^{n×d}}; sketch dimension m
1: initialize S_L = O^{m×n}
2: for i = 1 to n do
3:   (j, s) ← argmin_{j∈[m], s∈{±1}} Σ_{A∈Tr} f(A, ALG(S_L + s e_j e_i⊤, A))
4:   S_L ← S_L + s e_j e_i⊤
5: end for
6: return p for S_L = CS(p, v)
When S is a CountSketch, computing SA amounts to hashing the n rows of A into the m ≪ n rows of SA. The optimization is a combinatorial optimization problem with an empirical risk minimization (ERM) objective. The naïve solution is to compute the objective value of each of the exponentially many (m^n) possible placements, but this is clearly intractable. Instead, we iteratively construct a full placement in a greedy fashion. We start with S as a zero matrix. Then, we iterate through the columns of S in an order determined by the algorithm, adding a non-zero entry to each. The best position in each column is the one that minimizes Eq. (3.1) if an entry were added there. For each column, we evaluate Eq. (3.1) O(m) times, once for each prospective half-built sketch. While this greedy strategy is simple to state, additional tactics are required to make it tractable for each problem. Usually the objective evaluation (Algorithm 3, line 3) is too slow, so we must leverage our insight into the corresponding sketching algorithms to pick a proxy objective. Note that we can reuse these proxies for value optimization, since they may make gradient computation faster too.
Proxy objective for LRA. For the two-sided sketching algorithm, we can assume that the two factors Y, X have the form Y = AR⊤ Ỹ and X = X̃ SA, where S and R are both CS matrices, so we optimize the positions in both S and R. We cannot use f(A, ALG(S, R, A)) as our objective, because then we would have to consider combinations of placements between S and R. To find a proxy, we note that a prerequisite for good performance is for row(SA) and col(AR⊤) to both contain a good rank-k approximation to A (see the proof of Lemma C.5).
Thus, we can decouple the optimization of S and R. The proxy objective for S is ∥[AV]_k V⊤ − A∥_F^2, where SA = UΣV⊤. In this expression, X = [AV]_k V⊤ is the best rank-k approximation to A in row(SA). The proxy objective for R is defined analogously. In Appendix F, we show the greedy algorithm is provably beneficial for LRA when the inputs follow the spiked covariance or the Zipfian distribution. Despite the good empirical performance we present in Section 7, one drawback is its much slower training time. Also, for the iterative sketching method for second-order optimization, it is non-trivial to find a proxy objective, because the input to the i-th iteration depends on the solution of the (i − 1)-th iteration, for which the greedy approach sometimes does not give a good solution. In the next sections, we propose our specific approaches for optimizing the positions in the sketches for low-rank approximation and second-order optimization, both of which achieve very high accuracy and finish in a very short amount of time.
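Algorithm 3 with the LRA proxy objective can be sketched as follows. This is a minimal, unoptimized illustration with names of our own choosing: the paper accelerates line 3 with fast rank-1 SVD updates, while here the SVD is simply recomputed for every candidate placement.

```python
import numpy as np

def proxy_cost(S, A, k):
    """|| [AV]_k V^T - A ||_F, where SA = U Sigma V^T (proxy objective for S)."""
    Vh = np.linalg.svd(S @ A, full_matrices=False)[2]
    V = Vh.T
    AV = A @ V
    U, s, Wh = np.linalg.svd(AV, full_matrices=False)
    X = (U[:, :k] * s[:k]) @ Wh[:k] @ V.T   # best rank-k approx of A in row(SA)
    return np.linalg.norm(A - X)

def greedy_positions(train, m, k):
    """Algorithm 3 specialized to LRA: place one +/-1 entry per column greedily."""
    n = train[0].shape[0]
    S = np.zeros((m, n))
    for i in range(n):                       # one column (i.e., one row of A) at a time
        best = (np.inf, 0, 1.0)
        for j in range(m):                   # try every bucket for column i
            for sign in (1.0, -1.0):
                S[j, i] = sign
                cost = sum(proxy_cost(S, A, k) for A in train)
                if cost < best[0]:
                    best = (cost, j, sign)
                S[j, i] = 0.0                # undo the trial placement
        S[best[1], i] = best[2]              # commit the best placement
    return S

rng = np.random.default_rng(0)
m, n, k = 3, 8, 2
train = [rng.standard_normal((n, 5)) for _ in range(2)]
S = greedy_positions(train, m, k)
```

Each of the n columns triggers 2m proxy evaluations over the training set, which is the Ω(mn · T) cost discussed in Section 5.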

5. SKETCH LEARNING: LOW-RANK APPROXIMATION

Now we present a conceptually new algorithm which runs much faster and empirically achieves error bounds similar to the greedy search approach. Moreover, we show that this algorithm has strictly better guarantees than the classical Count-Sketch; to achieve this, we need a more careful analysis. To provide some intuition: if rank(SA) = k and SA = UΣV⊤, then the rank-k approximation cost is exactly ∥AVV⊤ − A∥_F^2, the cost of projecting A onto col(V). Minimizing it is equivalent to maximizing the sum of squared projection coefficients:
argmin_S ∥A − AVV⊤∥_F^2 = argmin_S Σ_{i∈[n]} (∥A_i∥_2^2 − Σ_{j∈[k]} ⟨A_i, v_j⟩^2) = argmax_S Σ_{i∈[n]} Σ_{j∈[k]} ⟨A_i, v_j⟩^2.
As mentioned, computing SA actually amounts to hashing the n rows of A to the m rows of SA. Hence, intuitively, if we can put similar rows into the same bucket, we may get a smaller error.
Algorithm 4 Position optimization: Inner Product
Input: A ∈ R^{n×d}: average of Tr; sketch dimension m
1: initialize S_1, S_2 = O^{m×n}
2: sample a set C = {C_1, ..., C_m} of m rows of A proportionally to their ridge leverage scores
3: for i = 1 to n do
4:   p_i, v_i ← argmax_{p∈[m], v∈{±1}} ⟨C_p/∥C_p∥_2, v A_i/∥A_i∥_2⟩
5:   S_1[p_i, i] ← v_i
6: end for
7: for i = 1 to m do
8:   I_i ← {j | p_j = i}
9:   A^{(i)} ← restriction of A to the rows in I_i
10:  u_i ← the top left singular vector of A^{(i)}
11:  S_1[i, I_i] ← u_i⊤
12: end for
13: for i = 1 to m do
14:  q_i ← index such that C_i is the q_i-th row of A
15:  S_2[i, q_i] ← 1
16: end for
17: return S_1 or [S_1; S_2]
Our algorithm is given in Algorithm 4. Suppose that we want to form a matrix S with m rows. At the beginning of the algorithm, we sample m rows according to the ridge leverage scores of A. By the properties of ridge leverage scores, the subspace spanned by this set of sampled rows contains an approximately optimal solution to the low-rank approximation problem. Hence, we map these rows to separate "buckets" of SA. Then, we need to decide the locations of the remaining rows (i.e., the non-sampled rows). Ideally, we want similar rows to be mapped into the same bucket.
To achieve this, we use the m sampled rows as reference points and assign each (non-sampled) row A_i to the p-th bucket of SA if the normalized rows A_i and C_p have the largest inner product (among all possible buckets). Once the locations of the non-zero entries are fixed, the next step is to determine the values of these entries. We follow the idea proposed in (Indyk et al., 2021): for each block A^{(i)}, one natural approach is to choose the unit vector s_i ∈ R^{|I_i|} that preserves as much of the Frobenius norm of A^{(i)} as possible, i.e., that maximizes ∥s_i⊤ A^{(i)}∥_2^2. Hence, we set s_i to be the top left singular vector of A^{(i)}. In our experiments, we observe that this step reduces the error of downstream value optimizations performed by SGD. To obtain a worst-case guarantee, we show that with high probability, the row span of the sampled rows C_i is a good subspace. We set the matrix S_2 to be the sampling matrix that samples the rows C_i. The final output of our algorithm is the vertical concatenation of S_1 and S_2. Here S_1 performs well empirically, while S_2 has a worst-case guarantee for any input. Combining Lemma E.2 and the sketch monotonicity property for low-rank approximation in Section D, we get that O(k log k + k/ϵ) rows suffice for a (1 ± ϵ)-approximation for the input matrix A induced by Tr, which is better than the Ω(k^2) rows required of a non-learned Count-Sketch, even if its non-zero values have been further improved by the previous learning-based algorithms in (Indyk et al., 2019; 2021). As a result, under the assumption on the input data, we may expect that S will still be good for the test data. We defer the proof to Appendix E.1. In Appendix A, we show that the assumptions we make in Theorem 5.1 are reasonable. We also provide an empirical comparison between Algorithm 4 and some of its variants, as well as some adaptive sketching methods, on the training sample.
The evaluation results show that only our algorithm yields a significant improvement on the test data, which suggests that both ridge leverage score sampling and row bucketing are essential.
Theorem 5.1. Let S ∈ R^{2m×n} be given by concatenating the sketching matrices S_1, S_2 computed by Algorithm 4 with input A induced by Tr, and let B ∈ R^{n×d}. Then with probability at least 1 − δ, we have
min_{rank-k X: row(X)⊆row(SB)} ∥B − X∥_F^2 ≤ (1 + ϵ) ∥B − B_k∥_F^2
if one of the following holds:
1. m = O(β · (k log k + k/ϵ)), δ = 0.1, and τ_i(B) ≥ (1/β) τ_i(A) for all i ∈ [n];
2. m = O(k log k + k/ϵ), δ = 0.1 + 1.1β, and the total variation distance d_tv(p, q) ≤ β, where p, q are sampling probabilities defined as p_i = τ_i(A)/Σ_i τ_i(A) and q_i = τ_i(B)/Σ_i τ_i(B).
Time Complexity. As mentioned, an advantage of our second approach is that it significantly reduces the training time. We now discuss the training times of the different algorithms. For the value-learning algorithms in (Indyk et al., 2019), each iteration requires computing a differentiable SVD to perform gradient descent, hence the runtime is at least Ω(n_it · T), where n_it is the number of iterations (usually set > 500) and T is the time to compute an SVD. For the greedy algorithm, there are m choices for each of the n columns, hence the runtime is at least Ω(mn · T). For our second approach, the most expensive steps are computing the ridge leverage scores of A and the SVD of each submatrix; hence, the total runtime is at most O(T). We note that the time complexities discussed here are all training times; there is no additional runtime cost for the test data.
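Algorithm 4 can be sketched as follows. This is our own simplified illustration (names are ours): it samples the m anchors without replacement, assumes A has no zero rows, and uses the standard regularization λ = ∥A − A_k∥_F^2 / k for the ridge leverage scores.

```python
import numpy as np

def ridge_leverage_scores(A, k):
    """tau_i = a_i (A^T A + lam I)^+ a_i^T with lam = ||A - A_k||_F^2 / k."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    lam = (s[k:] ** 2).sum() / k
    return (U ** 2 * (s ** 2 / (s ** 2 + lam))).sum(axis=1)

def inner_product_positions(A, m, k):
    """Algorithm 4 sketch: leverage-sampled anchors + inner-product bucketing."""
    n, d = A.shape
    tau = ridge_leverage_scores(A, k)
    rng = np.random.default_rng(0)
    anchors = rng.choice(n, size=m, replace=False, p=tau / tau.sum())
    C = A[anchors] / np.linalg.norm(A[anchors], axis=1, keepdims=True)
    An = A / np.linalg.norm(A, axis=1, keepdims=True)   # assumes no zero rows
    p = np.abs(An @ C.T).argmax(axis=1)   # bucket with largest |<C_p, A_i>|
    S1 = np.zeros((m, n))
    for b in range(m):                    # values: top left singular vector per block
        idx = np.where(p == b)[0]
        if len(idx) == 0:
            continue
        u = np.linalg.svd(A[idx], full_matrices=False)[0][:, 0]
        S1[b, idx] = u
    S2 = np.zeros((m, n))
    S2[np.arange(m), anchors] = 1.0       # sampling matrix: worst-case guarantee
    return np.vstack([S1, S2])

rng = np.random.default_rng(1)
n, d, m, k = 60, 8, 5, 3
A = rng.standard_normal((n, d))
S = inner_product_positions(A, m, k)
```

Each anchor has inner product 1 with itself after normalization, so the sampled rows land in their own buckets as the algorithm intends.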

6. SKETCH LEARNING: SECOND-ORDER OPTIMIZATION

In this section, we consider optimizing the sketch matrix in the context of second-order methods. The key observation is that for many sketching-based second-order methods, the crucial property of the sketching matrix is the so-called subspace embedding property: for a matrix A ∈ R^{n×d}, we say a matrix S ∈ R^{m×n} is a (1 ± ϵ)-subspace embedding for the column space of A if (1 − ϵ)∥Ax∥_2 ≤ ∥SAx∥_2 ≤ (1 + ϵ)∥Ax∥_2 for all x ∈ R^d. For example, consider the iterative Hessian sketch, which performs the update (2.2) to compute {x^t}_t. Pilanci & Wainwright (2016) showed that if S^1, . . . , S^{t+1} are (1 ± O(ρ))-subspace embeddings for A, then ∥A(x^t − x*)∥_2 ≤ ρ^t ∥Ax*∥_2. Thus, if each S^i is a good subspace embedding for A, we will have a good convergence guarantee. Therefore, unlike (Indyk et al., 2019), which treats the training objective in a black-box manner, we directly optimize the subspace embedding property of the sketch matrix for A.
Optimizing positions. We consider the case where A has a few rows of large leverage score, together with access to an oracle which reveals a superset of the indices of such rows. Formally, let τ_i(A) be the leverage score of the i-th row of A and let I* = {i : τ_i(A) ≥ ν} be the set of rows with large leverage score. Suppose that a superset I ⊇ I* is known to the algorithm; in our experiments, we train an oracle to predict such rows. We can maintain all rows in I explicitly and apply a Count-Sketch to the remaining rows, i.e., the rows in [n] \ I. Up to a permutation of the rows, we can write
A = [A_I; A_{I^c}] and S = [I 0; 0 S′], (6.1)
where S′ is a random Count-Sketch matrix with m rows. Clearly S has a single non-zero entry per column. We have the following theorem, whose proof is postponed to Section E.2. Intuitively, the proof for Count-Sketch in (Clarkson & Woodruff, 2017) handles rows of large leverage score and rows of small leverage score separately.
The rows of large leverage score are perfectly hashed, while the rows of small leverage score concentrate in the sketch by the Hanson-Wright inequality.
Theorem 6.1. Let ν = ϵ/d. Suppose that m = O((d/ϵ^2)(polylog(1/ϵ) + log(1/δ))), δ ∈ (0, 1/m] and d = Ω((1/ϵ) polylog(1/ϵ) log^2(1/δ)). Then there exists a distribution on S of the form in (6.1) with m + |I| rows such that
Pr[∃x ∈ col(A): | ∥Sx∥_2^2 − ∥x∥_2^2 | > ϵ ∥x∥_2^2] ≤ δ.
In particular, when δ = 1/m, the sketching matrix S has O((d/ϵ^2) polylog(d/ϵ)) rows. Hence, if there happen to be at most d polylog(1/ϵ)/ϵ^2 rows of leverage score at least ϵ/d, the overall sketch length for embedding col(A) can be reduced to O((d polylog(1/ϵ) + log(1/δ))/ϵ^2), a quadratic improvement in d and an exponential improvement in δ over the original sketch length of O(d^2/(ϵ^2 δ)) for Count-Sketch. In the worst case there could be O(d^2/ϵ) such rows, though empirically we do not observe this. In Section 8, we show it is possible to learn the indices of the heavy rows for real-world data.
Optimizing values. Once we fix the positions of the non-zero entries, we aim to optimize the values by gradient descent. Rather than the previous black-box way in (Indyk et al., 2019) that minimizes Σ_i f(A_i, ALG(S, A_i)), we propose the following objective loss function for the learning algorithm:
L(S, Tr) = Σ_{A_i∈Tr} ∥(A_i R_i)⊤ (A_i R_i) − I∥_F
over all the training data, where R_i comes from the QR decomposition SA_i = Q_i R_i^{-1}. The intuition for this loss function is given by the lemma below, whose proof is deferred to Section E.3.
Lemma 6.2. Suppose that ϵ ∈ (0, 1/2), S ∈ R^{m×n}, A ∈ R^{n×d} has full column rank, and SA = QR is the QR decomposition of SA. If ∥(AR^{-1})⊤ AR^{-1} − I∥_op ≤ ϵ, then S is a (1 ± ϵ)-subspace embedding for col(A).
Lemma 6.2 implies that if the loss function over A_train is small and the distribution of A_test is similar to that of A_train, it is reasonable to expect that S is a good subspace embedding for A_test. Here we use the Frobenius norm rather than the operator norm in the loss function because it makes the optimization problem easier to solve, and our empirical results also show that the Frobenius norm performs better than the operator norm.
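The loss in this section and the certificate of Lemma 6.2 can be sketched numerically as follows. This is our own illustration (names and sizes are ours); note that NumPy's QR returns the upper-triangular factor directly, which plays the role of R^{-1} in the text, so A R is computed as A times its inverse.

```python
import numpy as np

def embedding_loss(S, A):
    """||(A R)^T (A R) - I||_F, with S A = Q R^{-1} as in the text.

    A small value certifies (via Lemma 6.2, in operator norm) that S is a
    near-isometry on col(A).
    """
    Q, Rtri = np.linalg.qr(S @ A)      # Rtri is the triangular factor, i.e. R^{-1}
    AR = A @ np.linalg.inv(Rtri)       # = A R
    G = AR.T @ AR
    return np.linalg.norm(G - np.eye(G.shape[0]))

# Sanity check: a Gaussian sketch with many rows embeds col(A) well, so its
# loss is small; a sketch with barely d rows distorts the subspace badly.
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 5))
S_good = rng.standard_normal((400, 2000)) / np.sqrt(400)
S_bad = rng.standard_normal((6, 2000)) / np.sqrt(6)
print(embedding_loss(S_good, A), embedding_loss(S_bad, A))
```

In the paper's setting, this quantity is summed over the training matrices and minimized by gradient descent over the non-zero values of S (with the positions fixed as in (6.1)).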

7. EXPERIMENTS: LOW-RANK APPROXIMATION

In this section, we evaluate the empirical performance of our learning-based approach for LRA on three datasets. For each, we fix the sketch size and compare the approximation error ∥A − X∥_F − ∥A − A_k∥_F, averaged over 10 trials. To make position optimization more efficient in line 3 of Algorithm 3, instead of recomputing the SVD from scratch for each candidate, we use formulas for fast rank-1 SVD updates (Brand, 2006). For the greedy method, we used several Nvidia GeForce GTX 1080 Ti machines. For the maximum inner product method, the experiments are conducted on a laptop with a 1.90GHz CPU and 16GB RAM.
Datasets. We use the three datasets from (Indyk et al., 2019): (1, 2) Friends, Logo (image): frames from a short video of the TV show Friends and of a logo being painted; (3) Hyper (image): hyperspectral images from natural scenes. Additional details are in Table A.1.
Baselines. We compare our approach to the following baselines. Classical CS: a random Count-Sketch. IVY19: a sparse sketch with learned values and random positions for the non-zero entries. Ours (greedy): a sparse sketch where both the values and positions of the non-zero entries are learned; the positions are learned by Algorithm 3 and the values are learned similarly to (Indyk et al., 2019). Ours (inner product): a sparse sketch where both the values and positions of the non-zero entries are learned; the positions are learned by S_1 in Algorithm 4. IVY19 and the greedy algorithm use the full training set, while our Algorithm 4 takes as input the average of all training matrices. We also give a sensitivity analysis for our algorithm, comparing it with the following variants: Only row sampling (perform the projection by ridge leverage score sampling alone), ℓ_2 sampling (replace leverage score sampling with ℓ_2-norm row sampling and keep the same downstream step), and Random grouping (use ridge leverage score sampling but distribute the remaining rows randomly).
The results show that none of these variants outperforms non-learned sketching; we defer the details to Appendix A.1.

Result Summary. Our empirical results are provided in Table 7.1 for both Algorithm 2 and Algorithm 1, where the errors are averaged over 10 trials. We use the average of all training matrices from Tr as the input to Algorithm 4. We note that all the steps of our training algorithms are performed on the training data. Hence, no additional computational cost is incurred by the sketching algorithm on the test data. Experimental parameters (e.g., the learning rate for gradient descent) can be found in Appendix G. For both sketching algorithms, our two sketches are always the best of the four. They are significantly better than Classical CS, obtaining improvements of around 70%, and they obtain a roughly 30% improvement over IVY19.

Wall-Clock Times. The offline learning runtime, i.e., the time to train a sketch on A_train, is given in Table 7.2. Although the greedy method takes much longer (1h 45min), our second approach is much faster (5 seconds) than the previous algorithm of (Indyk et al., 2019) (3 min) and still achieves an error similar to that of the greedy algorithm. The reason is that Algorithm 4 only needs to compute the ridge leverage scores of the training matrix once, which is much cheaper than IVY19, which needs to compute a differentiable SVD many times during gradient descent. In Section A.4, we also study the performance of our approach in the few-shot learning setting, which has been studied in Indyk et al. (2021).
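For reference, the classical Count-Sketch baseline used throughout can be constructed as follows. This is a minimal sketch of the standard construction (one random ±1 entry per column); the function name `countsketch` is ours, not from the paper.

```python
import numpy as np

def countsketch(m, n, rng=None):
    """Build a classical CountSketch matrix S in R^{m x n}:
    each column has a single non-zero entry, equal to +1 or -1,
    placed at a uniformly random row."""
    rng = np.random.default_rng(rng)
    rows = rng.integers(0, m, size=n)        # one target row per column
    signs = rng.choice([-1.0, 1.0], size=n)  # random sign per column
    S = np.zeros((m, n))
    S[rows, np.arange(n)] = signs
    return S
```

In practice S is never materialized: S @ A is applied in O(nnz(A)) time by adding row i of A, multiplied by its sign, into bucket `rows[i]`. The learned sketches in this paper keep this sparsity pattern but choose the positions and values instead of drawing them at random.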

8. EXPERIMENTS: SECOND-ORDER OPTIMIZATION

In this section, we consider the IHS on the following instance of LASSO regression:

x* = argmin_{∥x∥_1 ≤ λ} f(x) = argmin_{∥x∥_1 ≤ λ} (1/2)∥Ax − b∥_2², (8.1)

where λ is a parameter. We also study the performance of the sketches on matrix estimation with a nuclear norm constraint, the fast regression solver of van den Brand et al. (2021), as well as the use of sketches for first-order methods. The results can be found in Appendix B. All of our experiments are conducted on a laptop with a 1.90GHz CPU and 16GB RAM. The offline training is done separately using a single GPU. The details of the implementation are deferred to Appendix G.

Dataset. We use the Electric dataset of residential electric load measurements. Each row of the matrix corresponds to a different residence, and the columns are consecutive measurements at different times. Here A_i ∈ R^{370×9} and b_i ∈ R^{370×1}.

Experiment Setting. We compare the learned sketch against the classical Count-Sketch. We choose m = 6d, 8d, 10d and consider the error f(x) − f(x*). For the heavy-row Count-Sketch, we allocate 30% of the sketch space to the rows of the heavy row candidates. For this dataset, each row represents a specific residence and hence there is a strong pattern in the distribution of the heavy rows. We select the heavy rows according to the number of times each row is heavy in the training data. We give a detailed discussion of this in Appendix B.1. We highlight that it is still possible to recognize the pattern of the rows even if the row order of the test data is permuted. We also consider optimizing the non-zero values after identifying the heavy rows, using our new approach in Section 6.

Results. We plot in Figure 7.1 the mean errors on a logarithmic scale. The average offline training time is 3.67s to find a superset of the heavy rows over the training data and 66s to optimize the values when m = 10d, both of which are faster than the runtime of Indyk et al. (2019) with the same parameters. Note that the learned matrix S is trained offline only once using the training data. Hence, no additional computational cost is incurred when solving the optimization problem on the test data. We see all methods display linear convergence; that is, letting e_k denote the error in the k-th iteration, we have e_k ≈ ρ^k e_1 for some convergence rate ρ. A smaller convergence rate implies faster convergence. We calculate an estimated rate of convergence ρ = (e_k/e_1)^{1/k} with k = 7. We can see that both learned sketches, especially the sketch that optimizes both the positions and values, show significant improvements. When the sketch size is small (6d), this sketch has a convergence rate that is just 13.2% of that of the classical Count-Sketch, and when the sketch size is large (10d), its convergence rate is just 12.1% of that of the classical Count-Sketch.
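To make the IHS iteration concrete, the following is a minimal unconstrained sketch (the LASSO experiments above additionally project onto the ℓ_1-ball, which we omit here for brevity). Each iteration draws a fresh CountSketch S_t and replaces the Hessian A^⊤A with the sketched Hessian (S_tA)^⊤(S_tA); the function name `ihs_least_squares` is ours.

```python
import numpy as np

def ihs_least_squares(A, b, m, iters=7, rng=0):
    """Iterative Hessian Sketch for min ||Ax - b||_2^2 (unconstrained
    illustration). Iteration t solves
        min_x 0.5*||S_t A (x - x_t)||^2 - <A^T (b - A x_t), x - x_t>
    with a fresh CountSketch S_t, i.e. a Newton-like step with the
    sketched Hessian (S_t A)^T (S_t A)."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    errs = []
    for _ in range(iters):
        rows = rng.integers(0, m, size=n)
        signs = rng.choice([-1.0, 1.0], size=n)
        SA = np.zeros((m, d))
        np.add.at(SA, rows, signs[:, None] * A)  # S_t A in O(nnz(A)) time
        g = A.T @ (b - A @ x)                    # exact gradient direction
        H = SA.T @ SA                            # sketched Hessian
        x = x + np.linalg.solve(H, g)
        errs.append(np.linalg.norm(A @ x - b))
    return x, errs
```

The returned error sequence `errs` is what the estimated convergence rate ρ = (e_k/e_1)^{1/k} above is computed from; a learned S_t would replace the random `rows`/`signs` with trained positions and values.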

A ADDITIONAL EXPERIMENTS: LOW-RANK APPROXIMATION

The details of the datasets (data dimensions, N_train, etc.) are presented in Table A.1.

A.1 SENSITIVITY ANALYSIS

In this section we explore how sensitive the performance of our Algorithm 4 is to the ridge leverage score sampling and the maximum inner product grouping process. We consider the following baselines: • ℓ_2 norm sampling: we sample the rows according to their squared length instead of their ridge leverage scores. • Only ridge leverage score sampling: we use the subspace spanned by only the sampled rows from ridge leverage score sampling. • Random grouping: we put the sampled rows into different buckets as before, but randomly divide the non-sampled rows into buckets. The results are shown in Table A.2. Here we set k = 30, m = 60 as an example. To show the difference between the initialization methods more clearly, we compare the error using the one-sided sketching Algorithm 1 and do not further optimize the non-zero values. From the table we can see that both the ridge leverage score sampling and the downstream grouping process are necessary; otherwise the error is similar to or even worse than that of the classical Count-Sketch.
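The ℓ_2 norm sampling baseline above admits a one-line implementation; this is a minimal sketch (the function name `l2_row_sample` is ours), shown to make precise what the baseline replaces.

```python
import numpy as np

def l2_row_sample(A, t, rng=0):
    """Sample t row indices of A (with replacement) with probability
    proportional to squared row norms -- the l2 sampling baseline that
    replaces ridge leverage score sampling in the ablation."""
    rng = np.random.default_rng(rng)
    p = (A * A).sum(axis=1)   # squared l2 norm of each row
    p = p / p.sum()           # normalize to a probability distribution
    return rng.choice(A.shape[0], size=t, p=p)
```

Unlike ridge leverage scores, these probabilities ignore the directions of the rows, which is one plausible reason the baseline underperforms in Table A.2.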

A.2 TOTAL VARIATION DISTANCE

As we have shown in Theorem 5.1, if the total variation distance between the row sampling probability distributions p and q is O(1), we have a worst-case guarantee of O(k log k + k/ϵ), which is strictly better than the Ω(k²) lower bound for the random CountSketch, even when its non-zero values have been optimized. We now study the total variation distance between the train and test matrices in our datasets. The result is shown in Figure A.1. From the figure we can see that for all three datasets, the total variation distance is bounded by a constant, which suggests that the assumptions are reasonable for real-world data.
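The quantity studied above can be computed directly: form the normalized ridge leverage score distributions of the train and test matrices and take half the ℓ_1 distance. The sketch below assumes the standard ridge leverage score definition τ_i(A) = a_i^⊤(A^⊤A + λI)^{-1}a_i with λ = ∥A − A_k∥_F²/k (cf. Lemma E.1); the function names are ours.

```python
import numpy as np

def ridge_leverage_scores(A, k):
    """tau_i(A) = a_i^T (A^T A + lam I)^{-1} a_i with
    lam = ||A - A_k||_F^2 / k, as in (Cohen et al., 2017)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    lam = (s[k:] ** 2).sum() / k                  # ||A - A_k||_F^2 / k
    M = Vt.T @ np.diag(1.0 / (s ** 2 + lam)) @ Vt  # (A^T A + lam I)^{-1}
    return np.einsum('ij,jk,ik->i', A, M, A)       # a_i^T M a_i per row

def tv_distance(A, B, k):
    """Total variation distance between the normalized ridge leverage
    score distributions of two row-aligned matrices A and B."""
    p = ridge_leverage_scores(A, k); p = p / p.sum()
    q = ridge_leverage_scores(B, k); q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()
```

A value well below 1 for train/test pairs is the empirical condition under which the guarantee of Theorem 5.1 applies.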

A.3 LEARNING-BASED ALGORITHMS FOR LOW-RANK APPROXIMATION IN THE FEW-SHOT SETTING

In this section, we briefly explain the two algorithms proposed in Indyk et al. (2021). Both algorithms aim to optimize the non-zero values of a Count-Sketch matrix with fixed locations of the non-zero entries.

One-shot closed-form algorithm. Given a sparsity pattern of a Count-Sketch matrix S ∈ R^{m×n}, it partitions the rows of A into m blocks A^(1), ..., A^(m) as follows: let I_i = {j : S_ij ≠ 0}. The block A^(i) ∈ R^{|I_i|×d} is the sub-matrix of A that contains the rows whose indices are in I_i. The goal is, for each block A^(i), to choose a (non-sparse) one-dimensional sketching vector s_i ∈ R^{|I_i|}. The first approach is to set s_i to be the top left singular vector of A^(i). Another approach is to set s_i to be a left singular vector of A^(i) chosen at random with probability proportional to its squared singular value. The main advantage of the latter approach over the former is that it endows the algorithm with provable guarantees on the LRA error. The 1Shot2Vec algorithm combines both, obtaining the benefits of both approaches. The advantage of these algorithms is that they extract a sketching matrix by an analytic computation, requiring neither GPU access nor auto-gradient functionality.

Few-shot SGD algorithm. In this algorithm, the authors propose a new loss function for LRA, namely,

min_{CS S} E_{A∈Tr} ∥U_k^⊤ S^⊤ S U − I_0∥_F²,

where A = UΣV^⊤ is the SVD of A, U_k ∈ R^{n×k} denotes the submatrix of U that contains its first k columns, and I_0 ∈ R^{k×d} denotes the identity matrix of order k augmented with d − k additional zero columns on the right. This loss function is motivated by the analysis of prior LRA algorithms that use random sketching matrices. It is faster to compute and differentiate than the previous empirical loss in Indyk et al. (2019).
In their experiments, the authors also show that this loss function can achieve a smaller error in a shorter amount of time, using a small number of randomly sampled training matrices, though the final error is larger than that of the previous algorithm of Indyk et al. (2019) if a longer training time and access to the whole training set Tr are allowed.
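The one-shot closed-form construction described above can be sketched as follows. This implements only the deterministic (top-singular-vector) variant for a given bucket assignment; the randomized variant would instead sample a left singular vector with probability proportional to its squared singular value. The function name `one_shot_sketch` is ours.

```python
import numpy as np

def one_shot_sketch(A, buckets, m):
    """Given a bucket assignment (position pattern) mapping each row of A
    to one of m buckets, fill in the non-zero values of the sketch:
    the values for bucket i are the entries of the top left singular
    vector of the block A^(i)."""
    n, d = A.shape
    S = np.zeros((m, n))
    for i in range(m):
        idx = np.where(buckets == i)[0]   # I_i = rows assigned to bucket i
        if idx.size == 0:
            continue
        U, s, Vt = np.linalg.svd(A[idx], full_matrices=False)
        S[i, idx] = U[:, 0]               # top left singular vector of A^(i)
    return S
```

Note that the sparsity pattern (here, `buckets`) is fixed in advance; the position-learning methods of this paper are precisely about choosing that pattern well.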

A.4 EXPERIMENTS: LRA IN THE FEW-SHOT SETTING

In the rest of this section, we study the performance of our second approach in the few-shot learning setting. We first consider the case where we have only one training matrix randomly sampled from Tr. Here, we compare our method with the 1Shot2Vec method proposed in (Indyk et al., 2021) in the same setting (k = 10, m = 40) as in their empirical evaluation. The result is shown in Table A.3. Compared to 1Shot2Vec, our method reduces the error by around 50%, and has an even slightly faster runtime. Indyk et al. (2021) also proposed a FewShotSGD algorithm which further improves the non-zero values of the sketches after different initialization methods. We compare the performance of this approach for different initialization methods: in all initialization methods, we use only one training matrix, and we use three training matrices for the FewShotSGD step. The results are shown in Table A.4. We report the minimum error over 50 iterations of FewShotSGD because we aim to compare the computational efficiency of the different methods. From the table we see that our approach plus the FewShotSGD method achieves a much smaller error, with around a 50% improvement over (Indyk et al., 2021). Moreover, even without further optimization by FewShotSGD, our initialization method for learning the non-zero locations in CountSketch obtains a smaller error than the other methods (even when they are optimized with 1Shot2Vec or FewShotSGD learning).

B ADDITIONAL EXPERIMENTS: SECOND-ORDER OPTIMIZATION

As we mentioned in Section 8, despite the number of problems that learned sketches have been applied to, they have not been applied to convex optimization or, more generally, to iterative sketching algorithms. To demonstrate the difficulty, we consider the Iterative Hessian Sketch (IHS) as an example. In that scheme, suppose that we run k iterations of the algorithm. Then we need k independent sketching matrices (otherwise the solution may diverge). A natural approach is to follow the method in (Indyk et al., 2019), which is to minimize the quantity

min_{S_1,...,S_k} E_{A∈D} [f(A, ALG(S_1, ..., S_k, A))],

where the minimization is taken over k Count-Sketch matrices S_1, ..., S_k. In this case, however, calculating the gradient with respect to S_1 would involve all iterations, and in each iteration we need to solve a constrained optimization problem. Hence, computing the gradients would be difficult and intractable. An alternative is to train the k sketching matrices sequentially, that is, learn the sketching matrix for the i-th iteration using a local loss function for the i-th iteration, and then use the learned matrix in the i-th iteration to generate the training data for the (i + 1)-st iteration. However, our empirical results suggest that this works for the first iteration only: the training data for the (i + 1)-st iteration depends on the solution of the i-th iteration and may drift farther away from the test data in later iterations. The core problem is that the method proposed in Indyk et al. (2019) treats the training process in a black-box way, which is difficult to extend to iterative methods.

B.1 THE DISTRIBUTION OF THE HEAVY ROWS

In our experiments, we hypothesize that in real-world data there may be an underlying pattern which can help us identify the heavy rows. In the Electric dataset, each row of the matrix corresponds to a specific residence and the heavy rows are concentrated on some specific rows. To exemplify this, we study the distribution of rows with heavy leverage scores over the Electric dataset. For a row i ∈ [370], let f_i denote the number of times that row i is heavy out of 320 training data points from the Electric dataset, where we say row i is heavy if ℓ_i ≥ 5d/n. There are 74 pairs (i, f_i) with f_i > 0; those with the largest f_i are: (195, 320), (278, 320), (361, 320), (207, 317), (227, 285), (240, 284), (219, 270), (275, 232), (156, 214), (322, 213), (193, 196), (190, 192), (160, 191). Observe that the heavy rows are concentrated on a set of specific row indices. There are only 30 rows i with f_i ≥ 50. We view this as strong evidence for our hypothesis.

Heavy Rows Distribution Under Permutation. We note that even if the order of the rows has been changed, we can still recognize the patterns of the rows. We continue to use the Electric dataset as an example. To address the concern that a permutation may break the sketch, we can measure the similarity between vectors; that is, after processing the training data, we can test similarity on the rows of the test matrix and use this to select the heavy rows, rather than relying on row indices, which may simply be permuted. To illustrate this method, we use the following example on the Electric dataset, based on locality-sensitive hashing. After processing the training data, we obtain a set I of indices of heavy rows. For each i ∈ I, we pick q = 3 independent standard Gaussian vectors g_1, g_2, g_3 ∈ R^d and compute f(r_i) = (g_1^⊤ r_i, g_2^⊤ r_i, g_3^⊤ r_i) ∈ R³, where r_i is the average of the i-th rows over all training matrices. Let A be the test matrix. For each i ∈ I, let j_i = argmin_j ∥f(A_j) − f(r_i)∥_2.
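The matching step just described can be sketched as follows (a minimal illustration; the function name `match_heavy_rows` is ours):

```python
import numpy as np

def match_heavy_rows(train_avg_rows, A_test, heavy_idx, q=3, rng=0):
    """Match trained heavy rows to rows of a (possibly permuted) test
    matrix via q random Gaussian projections.
    train_avg_rows: r_i, averaged training rows (n x d);
    heavy_idx: the trained heavy-row index set I."""
    rng = np.random.default_rng(rng)
    d = A_test.shape[1]
    G = rng.standard_normal((d, q))   # the q Gaussian vectors g_1..g_q
    F_test = A_test @ G               # f(A_j) for every test row j
    matches = {}
    for i in heavy_idx:
        f_ri = train_avg_rows[i] @ G  # f(r_i)
        j = np.argmin(np.linalg.norm(F_test - f_ri, axis=1))
        matches[i] = j                # j_i = argmin_j ||f(A_j) - f(r_i)||_2
    return matches
```

Computing `F_test` is one pass over A, so the overhead is negligible relative to the sketching algorithm itself.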
We take the j_i-th row to be a heavy row in our learned sketch. This method needs only O(1) additional passes over the entries of A and hence the extra time cost is negligible. To test the performance of the method, we randomly pick a matrix from the test set and permute its rows. The results show that when k is small, we can roughly recover 70% of the top-k heavy rows, and we plot below the regression error using the learned Count-Sketch matrix generated this way, where we set m = 90 and k = 0.3m = 27. We can see that the learned method still obtains a significant improvement.

B.2 MATRIX ESTIMATION WITH A NUCLEAR NORM CONSTRAINT

In the multi-response regression problem X* := argmin_{X∈R^{d_1×d_2}} ∥AX − B∥_F², it is reasonable to model the matrix X* as having low rank. Similar to ℓ_1-minimization for compressive sensing, a standard relaxation of the rank constraint is to minimize the nuclear norm of X, defined as ∥X∥_* := Σ_{j=1}^{min{d_1,d_2}} σ_j(X), where σ_j(X) is the j-th largest singular value of X. Hence, the matrix estimation problem we consider here is

X* := argmin_{X∈R^{d_1×d_2}} ∥AX − B∥_F² such that ∥X∥_* ≤ ρ,

where ρ > 0 is a user-defined radius serving as a regularization parameter. We conduct Iterative Hessian Sketch (IHS) experiments on the following dataset: • Tunnel: a time series of gas concentrations measured by eight sensors in a wind tunnel. Each (A, B) corresponds to a different data collection trial. A_i ∈ R^{13530×5}, B_i ∈ R^{13530×6}, |(A, B)|_train = 144, |(A, B)|_test = 36. In our nuclear norm constraint, we set ρ = 10.

Experiment Setting: We choose m = 7d, 10d for the Tunnel dataset. We consider the error (1/2)∥AX − B∥_F² − (1/2)∥AX* − B∥_F². The leverage scores of this dataset are very uniform. Hence, for this experiment we only consider optimizing the values of the non-zero entries.

Results of Our Experiments:

We plot on a logarithmic scale the mean errors on the dataset in Figure B.2. We can see that when m = 7d, the gradient-based sketch, based on the first 6 iterations, has a rate of convergence that is 48% of that of the random sketch, and when m = 10d, the gradient-based sketch has a rate of convergence that is 29% of that of the random sketch.

B.3 FAST REGRESSION SOLVER

Consider an unconstrained convex optimization problem min_x f(x), where f is smooth and strongly convex and its Hessian ∇²f is Lipschitz continuous. This problem can be solved by Newton's method, which iterates

x_{t+1} = x_t − argmin_z ∥(∇²f(x_t)^{1/2})^⊤(∇²f(x_t)^{1/2})z − ∇f(x_t)∥_2, (B.1)

provided it is given a good initial point x_0. In each step, it requires solving a regression problem of the form min_z ∥A^⊤Az − y∥_2, which, with access to A, can be solved with the fast regression solver of (van den Brand et al., 2021). The regression solver first computes a preconditioner R via a QR decomposition such that SAR has orthonormal columns, where S is a sketching matrix, then solves ẑ = argmin_{z′} ∥(AR)^⊤(AR)z′ − y∥_2 by gradient descent and returns Rẑ at the end. Here, the point of sketching is that the QR decomposition of SA can be computed much more efficiently than the QR decomposition of A, since S has only a small number of rows.

In this section, we consider the unconstrained least squares problem min_x f(x) with f(x) = (1/2)∥Ax − b∥_2² on the Electric dataset, using the above fast regression solver.

Training: Note that ∇²f(x) = A^⊤A, independent of x. In the t-th round of Newton's method, by (B.1), we need to solve a regression problem min_z ∥A^⊤Az − y∥_2² with y = ∇f(x_t). Hence, we can use the same methods as in the preceding subsection to optimize the learned sketch S_t. For a general problem where ∇²f(x) depends on x, one can take x_t to be the solution obtained from the learned sketch S_t, generate A and y for the (t + 1)-st round, train a learned sketch S_{t+1}, and repeat this process.

Experiment Setting: For the Electric dataset, we set m = 10d = 90. We observe that the classical Count-Sketch matrix makes the solution diverge badly in this setting. To make a clearer comparison, we consider the following sketch matrices: • Gaussian sketch: S = (1/√m)G, where G ∈ R^{m×n} has i.i.d. N(0, 1) entries.
• Sparse Johnson-Lindenstrauss Transform (SJLT): S is the vertical concatenation of s independent Count-Sketch matrices, each of dimension (m/s) × n. We note that the above sketching matrices require more time to compute SA but need fewer rows to be a subspace embedding than the classical Count-Sketch matrix. For the step length η in gradient descent, we set η = 1 in all iterations for the learned sketches. For the classical random sketches, we set η in the following two ways: (a) η = 1 in all iterations, and (b) η = 1 in the first iteration and η = 0.2 in all subsequent iterations.

Experimental Results: We examine the accuracy of the subproblem min_z ∥A^⊤Az − y∥_2² and define the error to be ∥A^⊤ARz_t − y∥_2 / ∥y∥_2. We consider the subproblems in the first three iterations of the global Newton method. The results are plotted in Figure B.3. Note that Count-Sketch causes a terrible divergence of the subroutine and is thus omitted from the plots. Still, we observe that in setting (a) of η, the other two classical sketches cause the subroutine to diverge. In setting (b) of η, the other two classical sketches lead to convergence, but their error is significantly larger than that of the learned sketches in each of the first three calls to the subroutine. The error of the learned sketch is less than 0.01 in all iterations of all three subroutine calls, in both settings (a) and (b) of η. We also plot a figure on the convergence of the global Newton method. Here, for each subroutine, we only run one iteration, and plot the error of the original least squares problem. The result is shown in Figure B.4, which clearly displays a significantly faster decay with learned sketches. The rate of convergence using heavy-row sketches is 80.6% of that using Gaussian or sparse JL sketches.

B.4 FIRST-ORDER OPTIMIZATION

In this section, we study the use of the sketch in first-order methods. In particular, we compute a QR decomposition of SA, where S is a sketch matrix, and let R be such that SAR has orthonormal columns. We use R as an (approximate) preconditioner and use gradient descent to solve the problem min_x ∥ARx − b∥_2. Here we use the Electric dataset, where A is 370 × 9, and we set S to have 90 rows. The result is shown in the following table, where the reported time includes the time to compute R. We can see that if we use a learned sketch matrix, the error converges very quickly when we set the learning rate to 1 or 0.1, while the classical Count-Sketch leads to divergence.
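The preconditioning scheme above can be sketched as follows. This is a minimal illustration with a random CountSketch in place of a learned one, and with a conservative step size; the function name `sketch_preconditioned_lsq` is ours.

```python
import numpy as np

def sketch_preconditioned_lsq(A, b, m, iters=80, lr=0.5, rng=0):
    """Solve min_x ||Ax - b||_2 by gradient descent after preconditioning:
    compute a QR decomposition SA = Q Rq (S a CountSketch), so that
    R = Rq^{-1} makes SAR orthonormal and A Rq^{-1} well-conditioned."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    rows = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    SA = np.zeros((m, d))
    np.add.at(SA, rows, signs[:, None] * A)   # S A in O(nnz(A)) time
    _, Rq = np.linalg.qr(SA)
    B = np.linalg.solve(Rq.T, A.T).T          # B = A Rq^{-1}, well-conditioned
    x = np.zeros(d)
    for _ in range(iters):
        x = x - lr * B.T @ (B @ x - b)        # step on 0.5*||Bx - b||^2
    return np.linalg.solve(Rq, x)             # map back: return Rq^{-1} x
```

Because B = A Rq^{-1} has singular values near 1 when S is a good subspace embedding, plain gradient descent converges linearly; a poor sketch (as observed for the classical Count-Sketch on the Electric data) yields an ill-conditioned B and divergence at large step sizes.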

C PRELIMINARIES: THEOREMS AND ADDITIONAL ALGORITHMS

In this section, we provide the full description of the time-optimal sketching algorithm for LRA in Algorithm 2. We also provide several definitions and lemmas that are used in the proofs of our results for LRA.

Definition C.1 (Affine Embedding). Given a pair of matrices A and B, a matrix S is an affine ϵ-embedding if, for all X of the appropriate shape, ∥S(AX − B)∥_F² = (1 ± ϵ)∥AX − B∥_F².

Lemma C.4. Let A ∈ R^{n×d} and B ∈ R^{n×d′}, and let S ∈ R^{m×n} be a CountSketch with m = O(rank(A)²/ϵ²). Let X̂ = argmin_{rank-k X} ∥SAX − SB∥_F². Then:
1. With constant probability, ∥AX̂ − B∥_F² ≤ (1 + ϵ) min_{rank-k X} ∥AX − B∥_F². In other words, in O(nnz(A) + nnz(B) + m(d + d′)) time, we can reduce the problem to a smaller (multi-response regression) problem with m rows whose optimal solution is a (1 + ϵ)-approximate solution to the original instance.
2. The (1 + ϵ)-approximate solution X̂ can be computed in time O(nnz(A) + nnz(B) + mdd′ + min(m²d, md²)).

Now we turn our attention to the time-optimal sketching algorithm for LRA. The next lemma is known, though we include it for completeness (Avron et al., 2017).

Lemma C.5. Suppose that S ∈ R^{m_S×n} and R ∈ R^{m_R×d} are sparse affine ϵ-embedding matrices for (A_k, A) and ((SA)^⊤, A^⊤), respectively. Then

min_{rank-k X} ∥AR^⊤XSA − A∥_F² ≤ (1 + ϵ)∥A_k − A∥_F².

Proof. Consider the following multiple-response regression problem:

min_{rank-k X} ∥A_kX − A∥_F². (C.1)

Note that since X = I_k is a feasible solution to Eq. (C.1), we have min_{rank-k X} ∥A_kX − A∥_F² = ∥A_k − A∥_F². Let S ∈ R^{m_S×n} be a sketching matrix that satisfies the condition of Lemma C.4 (Item 1) for A := A_k and B := A. By the normal equations, the rank-k minimizer of ∥SA_kX − SA∥_F² is (SA_k)⁺SA. Hence,

∥A_k(SA_k)⁺SA − A∥_F² ≤ (1 + ϵ)∥A_k − A∥_F², (C.2)

which in particular shows that a (1 + ϵ)-approximate rank-k approximation of A exists in the row space of SA. In other words,

min_{rank-k X} ∥XSA − A∥_F² ≤ (1 + ϵ)∥A_k − A∥_F².
(C.3)

Next, let R ∈ R^{m_R×d} be a sketching matrix which satisfies the condition of Lemma C.4 (Item 1) for A := (SA)^⊤ and B := A^⊤. Let Y denote the rank-k minimizer of ∥R(SA)^⊤X^⊤ − RA^⊤∥_F². Hence,

∥(SA)^⊤Y^⊤ − A^⊤∥_F² ≤ (1 + ϵ) min_{rank-k X} ∥XSA − A∥_F² ▷ Lemma C.4 (Item 1)
≤ (1 + O(ϵ))∥A_k − A∥_F² ▷ Eq. (C.3) (C.4)

Note that by the normal equations, again rowsp(Y^⊤) ⊆ rowsp(RA^⊤), and we can write Y = AR^⊤Z where rank(Z) = k. Thus,

min_{rank-k X} ∥AR^⊤XSA − A∥_F² ≤ ∥AR^⊤ZSA − A∥_F² = ∥(SA)^⊤Y^⊤ − A^⊤∥_F² ▷ Y = AR^⊤Z
≤ (1 + O(ϵ))∥A_k − A∥_F² ▷ Eq. (C.4)

Lemma C.6 (Avron et al. (2017), Lemma 27). For C ∈ R^{p×m′}, D ∈ R^{m×p′}, G ∈ R^{p×p′}, the problem

min_{rank-k Z} ∥CZD − G∥_F² (C.5)

can be solved in O(pm′r_C + p′mr_D + pp′(r_D + r_C)) time, where r_C = rank(C) ≤ min{m′, p} and r_D = rank(D) ≤ min{m, p′}.

Lemma C.7. Let S ∈ R^{m_S×n}, R ∈ R^{m_R×d} be CountSketch (CS) matrices such that

min_{rank-k X} ∥AR^⊤XSA − A∥_F² ≤ (1 + γ)∥A_k − A∥_F². (C.6)

Let V ∈ R^{(m_R²/β²)×n} and W ∈ R^{(m_S²/β²)×d} be CS matrices. Then, Algorithm 2 gives a (1 + O(β + γ))-approximation in time nnz(A) + O(m_S⁴/β² + m_R⁴/β² + m_S²m_R²(m_S + m_R)/β⁴ + k(nm_S + dm_R)) with constant probability.

Proof. The approximation guarantee follows from Eq. (C.6) and the fact that V and W are affine β-embedding matrices of AR^⊤ and SA, respectively (see Lemma C.3). The algorithm first computes C = VAR^⊤, D = SAW^⊤, G = VAW^⊤, which can be done in O(nnz(A)) time. As an example, we bound the time to compute C = VAR^⊤. Note that since V is a CS, VA can be computed in O(nnz(A)) time and the number of non-zero entries in the resulting matrix is at most nnz(A). Hence, since R is a CS as well, C can be computed in time O(nnz(A) + nnz(VA)) = O(nnz(A)). Then, it takes an extra O((m_S³ + m_R³ + m_S²m_R²)/β²) time to store C, D and G in matrix form.
Next, as we showed in Lemma C.6, the time to compute Z in Algorithm 2 is O(m_S⁴/β² + m_R⁴/β² + m_S²m_R²(m_S + m_R)/β⁴). Finally, it takes O(nnz(A) + k(nm_S + dm_R)) time to compute Q = AR^⊤Z_L and P = Z_RSA and to return the solution in the form of P_{n×k}Q_{k×d}. Hence, the total runtime is

O(nnz(A) + m_S⁴/β² + m_R⁴/β² + m_S²m_R²(m_S + m_R)/β⁴ + k(nm_S + dm_R)).

D ATTAINING WORST-CASE GUARANTEES

D.1 LOW-RANK APPROXIMATION

We provide the following two methods to achieve worst-case guarantees: MixedSketch, whose guarantee is via the sketch monotonicity property, and the approximate comparison method (a.k.a. ApproxCheck), which approximately evaluates the cost of two solutions and takes the better one. These methods asymptotically achieve the same worst-case guarantee. However, for any input matrix A and any pair of sketches S, T, the performance of the MixedSketch method on (A, S, T) is never worse than the performance of its corresponding ApproxCheck method on (A, S, T), and can be much better.

Remark D.1. Let A = diag(2, 2, √2, √2), and suppose the goal is to find a rank-2 approximation of A. Consider two sketches S and T such that SA and TA capture span(e_1, e_3) and span(e_2, e_4), respectively. Then for both SA and TA, the best solution in one of these two subspaces is a (3/2)-approximation: ∥A − A_2∥_F² = 4 and ∥A − P_SA∥_F² = ∥A − P_TA∥_F² = 6, where P_SA and P_TA respectively denote the best approximation of A in the space spanned by SA and TA. However, if we find the best rank-2 approximation Z of A inside the span of the union of SA and TA, then ∥A − Z∥_F² = 4. Since ApproxCheck just chooses the better of SA and TA by evaluating their costs, it misses out on the opportunity to do as well as MixedSketch.

Here, we show the sketch monotonicity property for LRA.

Theorem D.2. Let A ∈ R^{n×d} be an input matrix, V and W be η-affine embeddings, and S_1 ∈ R^{m_S×n}, R_1 ∈ R^{m_R×d} be arbitrary matrices.
Consider arbitrary extensions S, R of S_1, R_1 (e.g., S is a concatenation of S_1 with an arbitrary matrix with the same number of columns). Then,

∥A − ALG_LRA((S, R, V, W), A)∥_F² ≤ (1 + η)²∥A − ALG_LRA((S_1, R_1, V, W), A)∥_F².

Proof. We have

∥A − ALG_LRA((S, R, V, W), A)∥_F² ≤ (1 + η) min_{rank-k X} ∥ARXSA − A∥_F² = (1 + η) min_{rank-k X: X∈row(SA)∩col(AR)} ∥X − A∥_F²,

which is in turn at most

(1 + η) min_{rank-k X: X∈row(S_1A)∩col(AR_1)} ∥X − A∥_F² = (1 + η) min_{rank-k X} ∥AR_1XS_1A − A∥_F² ≤ (1 + η)²∥A − ALG_LRA((S_1, R_1, V, W), A)∥_F²,

where we use the fact that V, W are affine η-embeddings (Definition C.1), as well as the fact that (col(AR_1) ∩ row(S_1A)) ⊆ (col(AR) ∩ row(SA)).

ApproxCheck for LRA. We give the pseudocode for the ApproxCheck method and prove that its runtime for LRA is of the same order as that of the classical time-optimal sketching algorithm for LRA.

Algorithm 5 LRA APPROXCHECK
Input: learned sketches S_L, R_L, V_L, W_L; classical sketches S_C, R_C, V_C, W_C; β; A ∈ R^{n×d}
1: (P_L, Q_L) ← ALG_LRA(S_L, R_L, V_L, W_L, A), (P_C, Q_C) ← ALG_LRA(S_C, R_C, V_C, W_C, A)
2: Let S ∈ R^{O(1/β²)×n}, R ∈ R^{O(1/β²)×d} be classical CountSketch matrices
3: ∆_L ← ∥S(P_LQ_L − A)R^⊤∥_F², ∆_C ← ∥S(P_CQ_C − A)R^⊤∥_F²
4: if ∆_L ≤ ∆_C then
5:   return P_LQ_L
6: end if
7: return P_CQ_C

Theorem D.3. Assume we have data A ∈ R^{n×d}, learned sketches S_L ∈ R^{poly(k/ϵ)×n}, R_L ∈ R^{poly(k/ϵ)×d}, V_L ∈ R^{poly(k/ϵ)×n}, W_L ∈ R^{poly(k/ϵ)×d} which attain a (1 + O(γ))-approximation, classical sketches of the same size, S_C, R_C, V_C, W_C, which attain a (1 + O(ϵ))-approximation, and a trade-off parameter β. Then, Algorithm 5 attains a (1 + β + min(γ, ϵ))-approximation in O(nnz(A) + (n + d)·poly(k/ϵ) + (k⁴/β⁴)·poly(k/ϵ)) time.

Proof. Let (P_L, Q_L), (P_C, Q_C) be the approximate rank-k approximations of A in factored form using (S_L, R_L) and (S_C, R_C).
Then, clearly,

min(∥P_LQ_L − A∥_F², ∥P_CQ_C − A∥_F²) = (1 + O(min(ϵ, γ)))∥A_k − A∥_F². (D.1)

Let Γ_L = P_LQ_L − A, Γ_C = P_CQ_C − A, and let Γ_M be the one of Γ_L, Γ_C attaining min(∥SΓ_LR∥_F, ∥SΓ_CR∥_F). Then,

∥Γ_M∥_F² ≤ (1 + O(β))∥SΓ_MR∥_F² ▷ by Lemma C.2
≤ (1 + O(β))·min(∥Γ_L∥_F², ∥Γ_C∥_F²)
≤ (1 + O(β + min(ϵ, γ)))∥A_k − A∥_F². ▷ by Eq. (D.1)

Runtime analysis. By Lemma C.7, Algorithm 2 computes P_L, Q_L and P_C, Q_C in time O(nnz(A) + k¹⁶(β² + ϵ²)/(ϵ²⁴β⁴) + (k³/ϵ²)(n + dk²/ϵ⁴)). Next, once we have P_L, Q_L and P_C, Q_C, it takes O(nnz(A) + k/β⁴) time to compute ∆_L and ∆_C. In total,

O(nnz(A) + k¹⁶(β² + ϵ²)/(ϵ²⁴β⁴) + (k³/ϵ²)(n + dk²/ϵ⁴) + k/β⁴) = O(nnz(A) + (n + d + k⁴/β⁴)·poly(k/ϵ)).

To interpret the above theorem, note that when ϵ ≫ k(n + d)^{-4}, we can set β⁻⁴ = O(k(n + d)^{-4}) so that Algorithm 5 has the same asymptotic runtime as the best (1 + ϵ)-approximation algorithm for LRA with the classical CountSketch. Moreover, Algorithm 5 is a (1 + o(ϵ))-approximation when the learned sketch outperforms classical sketches, i.e., γ = o(ϵ). On the other hand, when the learned sketches perform poorly, i.e., γ = Ω(ϵ), the worst-case guarantee of Algorithm 5 remains (1 + O(ϵ)).
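The selection step of Algorithm 5 can be sketched numerically as follows. This is a minimal illustration assuming the two candidates are already given in factored form; the function name `approx_check` and the inner helper are ours.

```python
import numpy as np

def approx_check(A, PQ_learned, PQ_classical, beta=0.25, rng=0):
    """ApproxCheck selection step (Algorithm 5): estimate the cost of two
    candidate low-rank factorizations with small CountSketches S, R of
    O(1/beta^2) rows, and return the apparently better candidate.
    Each candidate is a pair (P, Q) with A ~ P @ Q."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    t = max(1, int(1.0 / beta ** 2))

    def cs(m, dim):  # small classical CountSketch
        rows = rng.integers(0, m, size=dim)
        signs = rng.choice([-1.0, 1.0], size=dim)
        S = np.zeros((m, dim))
        S[rows, np.arange(dim)] = signs
        return S

    S, R = cs(t, n), cs(t, d)
    costs = []
    for P, Q in (PQ_learned, PQ_classical):
        # Delta = ||S (PQ - A) R^T||_F^2, without forming PQ densely
        costs.append(np.linalg.norm((S @ P) @ (Q @ R.T) - S @ A @ R.T) ** 2)
    return PQ_learned if costs[0] <= costs[1] else PQ_classical
```

Since both costs are evaluated through the same pair (S, R), the comparison only needs the sketches to preserve each Frobenius cost up to (1 ± O(β)), which is what drives the (1 + β + min(γ, ϵ)) guarantee of Theorem D.3.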

D.2 SECOND-ORDER OPTIMIZATION

For the sketches for second-order optimization, the monotonicity property does not hold. Below we provide an input-sparsity-time algorithm which can test for and use the better of a random sketch and a learned sketch. Our theorem is as follows.

Theorem D.4. Let ϵ ∈ (0, 0.09) be a constant and S_1 a learned Count-Sketch matrix. Suppose that A is of full rank. There is an algorithm whose output is a solution x̂ which, with probability at least 0.98, satisfies

∥A(x̂ − x*)∥_2 ≤ O(min{Z_2(S_1)/Z_1(S_1), ϵ})∥Ax*∥_2,

where x* = argmin_{x∈C} ∥Ax − b∥_2 is the constrained least-squares solution. Furthermore, the algorithm runs in O(nnz(A) log(1/ϵ) + poly(d/ϵ)) time.

Algorithm 6 Solver for (D.2)
1: S_1 ← learned sketch, S_2 ← random sketch with Θ(d²/ϵ²) rows
2: (Ẑ_{i,1}, Ẑ_{i,2}) ← ESTIMATE(S_i, A), i = 1, 2
3: i* ← argmin_{i=1,2} (Ẑ_{i,2}/Ẑ_{i,1})
4: x̂ ← solution of (D.2) with S = S_{i*}
5: return x̂
6: function ESTIMATE(S, A)
7:   T ← sparse (1 ± η)-subspace embedding matrix for d-dimensional subspaces
8:   (Q, R) ← QR(TA)
9:   Ẑ_1 ← σ_min(SAR⁻¹)
10:  Ẑ_2 ← (1 ± η)-approximation to ∥(SAR⁻¹)^⊤(SAR⁻¹) − I∥_op
11:  return (Ẑ_1, Ẑ_2)

Consider the minimization problem

min_{x∈C} { (1/2)∥SAx∥_2² − ⟨A^⊤y, x⟩ }, (D.2)

which is used as a subroutine for the IHS (cf. (2.2)). We note that in this subroutine, if we let x ← x − x_{i−1}, b ← b − Ax_{i−1}, and C ← C − x_{i−1}, we recover the guarantee of the i-th iteration of the original IHS. To analyze the performance of the learned sketch, we define the following quantities (corresponding exactly to the unconstrained case in (Pilanci & Wainwright, 2016)):

Z_1(S) = inf_{v∈colsp(A)∩S^{n−1}} ∥Sv∥_2², Z_2(S) = sup_{u,v∈colsp(A)∩S^{n−1}} ⟨u, (S^⊤S − I_n)v⟩.

When S is a (1 + ϵ)-subspace embedding of colsp(A), we have Z_1(S) ≥ 1 − ϵ and Z_2(S) ≤ 2ϵ. For a general sketching matrix S, the following lemma gives the approximation guarantee of Ẑ_1 and Ẑ_2, which are estimates of Z_1(S) and Z_2(S), respectively.
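A simplified numerical sketch of the ESTIMATE routine is given below. For simplicity it replaces the inner sketch T by exact orthogonalization of A (i.e., η = 0) and computes the operator norm directly rather than by the power method; the function name is ours.

```python
import numpy as np

def estimate_embedding_quality(S, A):
    """Simplified ESTIMATE (Algorithm 6, with T = I): precondition A with
    R^{-1} from a QR decomposition, then return
      Z1_hat = sigma_min(S A R^{-1}) and
      Z2_hat = ||(S A R^{-1})^T (S A R^{-1}) - I||_op."""
    _, R = np.linalg.qr(A)                  # A = Q R, so A R^{-1} is orthonormal
    B = S @ np.linalg.solve(R.T, A.T).T     # B = S A R^{-1}
    z1 = np.linalg.svd(B, compute_uv=False).min()
    M = B.T @ B - np.eye(A.shape[1])
    z2 = np.linalg.norm(M, 2)               # spectral norm (power-method target)
    return z1, z2
```

A good subspace embedding yields z1 ≈ 1 and small z2; the selection rule in Algorithm 6 simply keeps the sketch with the smaller ratio z2/z1.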
The main idea is that AR⁻¹ is well-conditioned, where R is as calculated in Algorithm 6.

Lemma D.5. Suppose that η ∈ (0, 1/3) is a small constant, A is of full rank, and S has poly(d/η) rows. The function ESTIMATE(S, A) returns, in O(nnz(A) log(1/η) + poly(d/η)) time, estimates Ẑ_1, Ẑ_2 which with probability at least 0.99 satisfy

Z_1(S)/(1 + η) ≤ Ẑ_1 ≤ Z_1(S)/(1 − η) and Z_2(S)/(1 + η)² − 3η ≤ Ẑ_2 ≤ Z_2(S)/(1 − η)² + 3η.

Proof. Suppose that AR⁻¹ = UW, where U ∈ R^{n×d} has orthonormal columns, which form an orthonormal basis of the column space of A. Since T is a subspace embedding of the column space of A with probability 0.99, it holds for all x ∈ R^d that

(1/(1 + η))∥TAR⁻¹x∥_2 ≤ ∥AR⁻¹x∥_2 ≤ (1/(1 − η))∥TAR⁻¹x∥_2.

Since

∥TAR⁻¹x∥_2 = ∥Qx∥_2 = ∥x∥_2 and ∥Wx∥_2 = ∥UWx∥_2 = ∥AR⁻¹x∥_2, (D.3)

we have that

(1/(1 + η))∥x∥_2 ≤ ∥Wx∥_2 ≤ (1/(1 − η))∥x∥_2 for all x ∈ R^d. (D.4)

It is easy to see that Z_1(S) = min_{x∈S^{d−1}} ∥SUx∥_2 = min_{y≠0} ∥SUWy∥_2/∥Wy∥_2, and thus,

min_{y≠0} (1 − η)∥SUWy∥_2/∥y∥_2 ≤ Z_1(S) ≤ min_{y≠0} (1 + η)∥SUWy∥_2/∥y∥_2.

Recall that SUW = SAR⁻¹. We see that

(1 − η)σ_min(SAR⁻¹) ≤ Z_1(S) ≤ (1 + η)σ_min(SAR⁻¹).

By definition, Z_2(S) = ∥U^⊤(S^⊤S − I_n)U∥_op. It follows from (D.4) that

(1 − η)²∥W^⊤U^⊤(S^⊤S − I_n)UW∥_op ≤ Z_2(S) ≤ (1 + η)²∥W^⊤U^⊤(S^⊤S − I_n)UW∥_op,

and from (D.4), (D.3) and Lemma 5.36 of Vershynin (2012) that ∥(AR⁻¹)^⊤(AR⁻¹) − I∥_op ≤ 3η. Since ∥W^⊤U^⊤(S^⊤S − I_n)UW∥_op = ∥(AR⁻¹)^⊤(S^⊤S − I_n)AR⁻¹∥_op and

∥(AR⁻¹)^⊤S^⊤SAR⁻¹ − I∥_op − ∥(AR⁻¹)^⊤(AR⁻¹) − I∥_op ≤ ∥(AR⁻¹)^⊤(S^⊤S − I_n)AR⁻¹∥_op ≤ ∥(AR⁻¹)^⊤S^⊤SAR⁻¹ − I∥_op + ∥(AR⁻¹)^⊤(AR⁻¹) − I∥_op,

it follows that

(1 − η)²∥(SAR⁻¹)^⊤SAR⁻¹ − I∥_op − 3(1 − η)²η ≤ Z_2(S) ≤ (1 + η)²∥(SAR⁻¹)^⊤SAR⁻¹ − I∥_op + 3(1 + η)²η.

We have so far proved the correctness of the approximation; we next analyze the runtime. Since S and T are sparse, computing SA and TA takes O(nnz(A)) time.
The QR decomposition of TA, which is a matrix of size poly(d/η) × d, can be computed in poly(d/η) time. The matrix SAR^{-1} can be computed in poly(d) time. Since it has size poly(d/η) × d, its smallest singular value can be computed in poly(d/η) time. To approximate Z_2(S), we can use the power method to estimate ∥(SAR^{-1})^⊤SAR^{-1} - I∥_op up to a (1 ± η)-factor in O((nnz(A) + poly(d/η)) log(1/η)) time.

Now we are ready to prove Theorem D.4.

Proof of Theorem D.4. By Lemma D.5 (applied with η = ϵ), we have with probability at least 0.99 that

Ẑ_2/Ẑ_1 ≥ (Z_2(S)/(1 + ϵ)^2 - 3ϵ)/(Z_1(S)/(1 - ϵ)) ≥ ((1 - ϵ)/(1 + ϵ)^2) · Z_2(S)/Z_1(S) - 3ϵ(1 - ϵ)/Z_1(S),

and similarly,

Ẑ_2/Ẑ_1 ≤ (Z_2(S)/(1 - ϵ)^2 + 3ϵ)/(Z_1(S)/(1 + ϵ)) ≤ ((1 + ϵ)/(1 - ϵ)^2) · Z_2(S)/Z_1(S) + 3ϵ(1 + ϵ)/Z_1(S).

Note that since S_2 is an ϵ-subspace embedding with probability at least 0.99, we have that Z_1(S_2) ≥ 1 - ϵ and Z_2(S_2)/Z_1(S_2) ≤ 2.2ϵ. Consider Z_1(S_1). First, we consider the case where Z_1(S_1) < 1/2. Observe that Z_2(S) ≥ 1 - Z_1(S). We have in this case Ẑ_{1,2}/Ẑ_{1,1} > 1/5 ≥ 2.2ϵ ≥ Z_2(S_2)/Z_1(S_2), and so our algorithm will choose S_2 correctly. Next, assume that Z_1(S_1) ≥ 1/2. Now we have with probability at least 0.98 that

(1 - 3ϵ) · Z_2(S_i)/Z_1(S_i) - 3ϵ ≤ Ẑ_{i,2}/Ẑ_{i,1} ≤ (1 + 4ϵ) · Z_2(S_i)/Z_1(S_i) + 4ϵ,  i = 1, 2.

Therefore, when Z_2(S_1)/Z_1(S_1) ≤ c_1 · Z_2(S_2)/Z_1(S_2) for some small absolute constant c_1 > 0, we will have Ẑ_{1,2}/Ẑ_{1,1} < Ẑ_{2,2}/Ẑ_{2,1}, and our algorithm will choose S_1 correctly. If Z_2(S_1)/Z_1(S_1) ≥ C_1 ϵ for some absolute constant C_1 > 0, our algorithm will choose S_2 correctly. In the remaining case, both ratios Z_2(S_2)/Z_1(S_2) and Z_2(S_1)/Z_1(S_1) are at most max{C_1, 3}ϵ, and the guarantee of the theorem holds automatically. The correctness of our claim then follows from Proposition 1 of Pilanci & Wainwright (2016), together with the fact that S_2 is a random subspace embedding.
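The two estimators above are easy to state in code. Below is a minimal sketch of ESTIMATE from Algorithm 6 together with the power method used for Ẑ_2. For clarity it takes a QR decomposition of A itself rather than of the sketched matrix TA (an assumption for illustration), and the function names `estimate` and `op_norm_power` are ours, not from the paper.

```python
import numpy as np

def op_norm_power(M, iters=200, seed=0):
    """Power method estimate of the operator norm of a symmetric matrix M."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = M @ v
        nw = np.linalg.norm(w)
        if nw == 0.0:
            return 0.0
        v = w / nw
    return float(abs(v @ (M @ v)))

def estimate(S, A):
    """Sketch of ESTIMATE(S, A): Z1_hat = sigma_min(S A R^{-1}) and
    Z2_hat ~ ||(S A R^{-1})^T (S A R^{-1}) - I||_op, with R from a QR of A
    (the paper uses a QR of the sketched matrix T A instead)."""
    d = A.shape[1]
    _, R = np.linalg.qr(A)
    B = S @ A @ np.linalg.inv(R)   # = S (A R^{-1}); A R^{-1} has orthonormal columns
    Z1_hat = float(np.linalg.svd(B, compute_uv=False).min())
    Z2_hat = op_norm_power(B.T @ B - np.eye(d))
    return Z1_hat, Z2_hat
```

On the identity "sketch" S = I the estimates collapse to Z_1 = 1 and Z_2 = 0, which is a convenient sanity check.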
The runtime follows from Lemma D.5 and Theorem 2.2 of Cormode & Dickens (2019).

E SKETCH LEARNING: OMITTED PROOFS

E.1 PROOF OF THEOREM 5.1

We need the following lemmas for the ridge leverage score sampling in (Cohen et al., 2017).

Lemma E.1 ((Cohen et al., 2017, Lemma 4)). Let λ = ∥A - A_k∥_F^2/k. Then we have Σ_i τ_i(A) ≤ 2k.

Lemma E.2 ((Cohen et al., 2017, Theorem 7)). Let λ = ∥A - A_k∥_F^2/k and let τ̃_i ≥ τ_i(A) be an overestimate of the i-th ridge leverage score of A. Let p_i = τ̃_i/Σ_i τ̃_i. If C is a matrix constructed by sampling t = O((log k + log(1/δ)/ϵ) · Σ_i τ̃_i) rows of A, each set to a_i with probability p_i, then with probability at least 1 - δ we have

min_{rank-k X: row(X) ⊆ row(C)} ∥A - X∥_F^2 ≤ (1 + ϵ)∥A - A_k∥_F^2.

Recall that sketch monotonicity for low-rank approximation says that concatenating two sketching matrices S_1 and S_2 will not increase the error compared to the single sketch matrix S_1 or S_2, no matter how S_1 and S_2 are constructed (see Section D.1 and Section 4 in (Indyk et al., 2019)).

Proof. We first consider the first condition. From the condition that τ_i(B) ≥ (1/β)τ_i(A), we know that if we sample m = O(β · (k log k + k/ϵ)) rows according to τ_i(A), then the probability that the i-th row of B gets sampled is 1 - (1 - τ_i(A))^m = O(m · τ_i(A)) = O((k log k + k/ϵ) · τ_i(B)). From Σ_i τ_i(B) ≤ 2k and Lemma E.2, we have that with probability at least 9/10, S_2 is a matrix such that

min_{rank-k X: row(X) ⊆ row(S_2 B)} ∥B - X∥_F^2 ≤ (1 + ϵ)∥B - B_k∥_F^2.

Hence, since S = [S_1; S_2], from the sketch monotonicity property for LRA we have that

min_{rank-k X: row(X) ⊆ row(SB)} ∥B - X∥_F^2 ≤ (1 + ϵ)∥B - B_k∥_F^2.

Now we consider the second condition. Suppose that {X_i}_{i ≤ m} and {Y_i}_{i ≤ m} are sequences of m = O(k log k + k/ϵ) samples from [n] according to the sampling probability distributions p and q, where p_i = τ_i(A)/Σ_i τ_i(A) and q_i = τ_i(B)/Σ_i τ_i(B).
Let S be the set of indices i such that X_i ≠ Y_i. From the property of the total variation distance, we can couple the samples so that Pr[X_i ≠ Y_i] ≤ d_tv(p, q) = β, and thus E[|S|] = Σ_i Pr[X_i ≠ Y_i] ≤ βm. From Markov's inequality, we get that with probability at least 1 - 1.1β, |S| ≤ (1/(1.1β)) · βm = (10/11)m. Let T be the set of indices i such that X_i = Y_i. We have that with probability at least 1 - 1.1β, |T| ≥ m - (10/11)m = Ω(k log k + k/ϵ). Note that {Y_i}_{i ∈ T} is a set of i.i.d. samples according to q, while the actual sample we take is {X_i}_{i ∈ T}. From Lemma E.2, we get that with probability at least 9/10, the row space of B_T satisfies

min_{rank-k X: row(X) ⊆ row(B_T)} ∥B - X∥_F^2 ≤ (1 + ϵ)∥B - B_k∥_F^2.

Similarly, from the sketch monotonicity property we have that with probability at least 0.9 - 1.1β,

min_{rank-k X: row(X) ⊆ row(SB)} ∥B - X∥_F^2 ≤ (1 + ϵ)∥B - B_k∥_F^2.

E.2 PROOF OF THEOREM 6.1

First we prove the following lemma.

Lemma E.3. Let δ ∈ (0, 1/m]. It holds with probability at least 1 - δ that

sup_{x ∈ colsp(A)} |∥Sx∥_2^2 - ∥x∥_2^2| ≤ ϵ∥x∥_2^2,

provided that

m ≳ ϵ^{-2}((d + log m) min{log^2(d/ϵ), log^2 m} + d log(1/δ)),
1 ≳ ϵ^{-2} ν((log m) min{log^2(d/ϵ), log^2 m} + log(1/δ)) log(1/δ).

Proof. We shall adapt the proof of Theorem 5 in (Bourgain et al., 2015) to our setting. Let T denote the unit sphere in colsp(A) and set the sparsity parameter s = 1. Observe that ∥Sx∥_2^2 = ∥x_I∥_2^2 + ∥S′x_{I^c}∥_2^2, and so it suffices to show that

Pr[|∥S′x_{I^c}∥_2^2 - ∥x_{I^c}∥_2^2| > ϵ] ≤ δ  for x ∈ T.

We make the following definition, as in (2.6) of (Bourgain et al., 2015):

A_{δ,x} := Σ_{i=1}^m Σ_{j ∈ I^c} δ_{ij} x_j e_i ⊗ e_j,

and thus S′x_{I^c} = A_{δ,x}σ. Also, by E∥S′x_{I^c}∥_2^2 = ∥x_{I^c}∥_2^2, one has

sup_{x ∈ T} |∥S′x_{I^c}∥_2^2 - ∥x_{I^c}∥_2^2| = sup_{x ∈ T} |∥A_{δ,x}σ∥_2^2 - E∥A_{δ,x}σ∥_2^2|.   (E.1)

Now, in (2.7) of (Bourgain et al., 2015), we instead define a semi-norm

∥x∥_δ = max_{1 ≤ i ≤ m} (Σ_{j ∈ I^c} δ_{ij} x_j^2)^{1/2}.
Then (2.8) continues to hold, and (2.9) as well as (2.10) continue to hold if the supremum on the left-hand side is replaced with the left-hand side of (E.1). At the beginning of the proof of Theorem 5, we define U^{(i)} to be U, but with each row j ∈ I^c multiplied by δ_{ij} and each row j ∈ I zeroed out. Then we have in the first step of (4.5) that

Σ_{j ∈ I^c} δ_{ij} (Σ_{k=1}^d g_k ⟨f_k, e_j⟩)^2 ≤ ∥U^{(i)}g∥_2^2,

instead of equality. One can verify that the rest of (4.5) goes through. It remains true that ∥·∥_δ ≤ (1/√s)∥·∥_2, and thus (4.6) holds. One can verify that the rest of the proof of Theorem 5 in (Bourgain et al., 2015) continues to hold if we replace Σ_{j=1}^n with Σ_{j ∈ I^c} and max_{1 ≤ j ≤ n} with max_{j ∈ I^c}, noting that

E Σ_{j ∈ I^c} δ_{ij} ∥P_E e_j∥_2^2 = (s/m) Σ_{j ∈ I^c} ⟨P_E e_j, e_j⟩ ≤ (s/m)d  and  E (U^{(i)})^* U^{(i)} = Σ_{j ∈ I^c} (E δ_{ij}) u_j u_j^* ⪯ (1/m) I.

Thus, the symmetrization inequalities on ∥Σ_{j ∈ I^c} δ_{ij} ∥P_E e_j∥_2^2∥_{L^p_δ} and ∥Σ_{j ∈ I^c} δ_{ij} u_j u_j^*∥_{L^p_δ} continue to hold. The result then follows, observing that max_{j ∈ I^c} ∥P_E e_j∥_2 ≤ ν.

The subspace embedding guarantee now follows as a corollary.

Theorem 6.1. Let ν = ϵ/d. Suppose that m = Ω((d/ϵ^2)(polylog(1/ϵ) + log(1/δ))), δ ∈ (0, 1/m) and d = Ω((1/ϵ) polylog(1/ϵ) log^2(1/δ)). Then there exists a distribution on S with m + |I| rows such that

Pr[∃x ∈ colsp(A): |∥Sx∥_2^2 - ∥x∥_2^2| > ϵ∥x∥_2^2] ≤ δ.

Proof. One can verify that the two conditions in Lemma E.3 are satisfied if

m ≳ (d/ϵ^2)(polylog(d/ϵ) + log(1/δ)),   d ≳ (1/ϵ) log(1/δ) (polylog(d/ϵ) + log(1/δ)).

The last condition is satisfied if d ≳ (1/ϵ) log^2(1/δ) polylog(1/ϵ).

E.3 PROOF OF LEMMA 6.2

Proof. On the one hand, since Q = SAR has orthonormal columns, we have

∥x∥_2 = ∥Qx∥_2 = ∥SARx∥_2.   (E.2)

On the other hand, the assumption implies that |(ARx)^⊤(ARx) - x^⊤x| ≤ ϵ∥x∥_2^2, that is,

(1 - ϵ)∥x∥_2^2 ≤ ∥ARx∥_2^2 ≤ (1 + ϵ)∥x∥_2^2.   (E.3)

Combining (E.2) and (E.3) leads to

√(1 - ϵ) ∥SARx∥_2 ≤ ∥ARx∥_2 ≤ √(1 + ϵ) ∥SARx∥_2,  ∀x ∈ R^d.
Equivalently, it can be written as

(1/√(1 + ϵ)) ∥SAy∥_2 ≤ ∥Ay∥_2 ≤ (1/√(1 - ϵ)) ∥SAy∥_2,  ∀y ∈ R^d.

The claimed result follows from the fact that 1/√(1 + ϵ) ≥ 1 - ϵ and 1/√(1 - ϵ) ≤ 1 + ϵ whenever ϵ ∈ (0, (√5 - 1)/2].
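A quick numerical illustration of Lemma 6.2: if R is taken to be the inverse of the triangular factor of a QR decomposition of SA, then SAR has orthonormal columns by construction, and the lemma predicts that ∥SAy∥_2/∥Ay∥_2 stays close to 1 for every y whenever AR is a near-isometry. The snippet below checks this with a dense Gaussian matrix as a stand-in for the sketch (an assumption for illustration; the S in the paper is sparse).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 5, 300
A = rng.standard_normal((n, d))
S = rng.standard_normal((m, n)) / np.sqrt(m)  # dense Gaussian stand-in for a sketch
Q, R_t = np.linalg.qr(S @ A)
R = np.linalg.inv(R_t)                        # now S A R = Q has orthonormal columns
# Lemma 6.2 (conclusion): ||S A y|| / ||A y|| should be within roughly [1 - eps, 1 + eps]
ratios = np.array([
    np.linalg.norm(S @ A @ y) / np.linalg.norm(A @ y)
    for y in rng.standard_normal((50, d))
])
```

The check that SAR is orthonormal is exact; the ratio bound reflects the embedding quality of S on the 5-dimensional column space of A.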

F LOCATION OPTIMIZATION IN COUNTSKETCH: GREEDY SEARCH

While the position optimization idea is simple, one particularly interesting aspect is that it is provably better than a random placement in some scenarios (Theorem F.1). Specifically, it is provably beneficial for LRA when inputs follow the spiked covariance model or Zipfian distributions, which are common for real data.

Spiked covariance model. Every matrix A ∈ R^{n×d} from the distribution A_sp(s, ℓ) has s < k "heavy" rows A_{r_1}, …, A_{r_s} of norm ℓ > 1. The indices of the heavy rows can be arbitrary, but must be the same for all members of A_sp(s, ℓ) and are unknown to the algorithm. The remaining ("light") rows have unit norm. In other words, let R = {r_1, …, r_s}. For all rows A_i, i ∈ [n], A_i = ℓ · v_i if i ∈ R and A_i = v_i otherwise, where v_i is a uniformly random unit vector.

Zipfian on squared row norms. Every A ∈ R^{n×d} ∼ A_zipf has rows which are uniformly random and orthogonal. Each A has 2^{i+1} rows of squared norm n^2/2^{2i} for i ∈ [1, …, O(log n)]. We also assume that each row has the same squared norm for all members of A_zipf.

Theorem F.1. Consider a matrix A from either the spiked covariance model or a Zipfian distribution. Let S_L denote a CountSketch constructed by Algorithm 3, which optimizes the positions of the non-zero values with respect to A. Let S_C denote a classical CountSketch matrix. Then there is a fixed η > 0 such that

min_{rank-k X ∈ rowsp(S_L A)} ∥X - A∥_F^2 ≤ (1 - η) min_{rank-k X ∈ rowsp(S_C A)} ∥X - A∥_F^2.

Remark F.2. Note that the above theorem implicitly provides an upper bound on the generalization error of the greedy placement method on the two distributions considered in this paper. More precisely, for each of these two distributions, if Π is learned via our greedy approach over a set of sampled training matrices, the solution returned by the sketching algorithm using Π over any (test) matrix A sampled from the distribution has error at most (1 - η) min_{rank-k X ∈ rowsp(S_C A)} ∥X - A∥_F^2.
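For concreteness, samplers for the two distributions might look as follows. These are our helper names, not from the paper; the Zipfian sampler uses random unit rows, which for large d are only approximately orthogonal rather than exactly orthogonal as in the model.

```python
import numpy as np

def sample_spiked(n, d, heavy_idx, ell, rng):
    """One draw from A_sp(s, ell): rows are uniform random unit vectors;
    the rows in heavy_idx (fixed across the distribution) get norm ell."""
    A = rng.standard_normal((n, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    A[list(heavy_idx)] *= ell
    return A

def sample_zipf(levels, d, rng):
    """A Zipfian-style draw: 2^(i+1) rows of squared norm n^2 / 2^(2i) for
    i = 1..levels, where n is the total number of rows produced."""
    counts = [2 ** (i + 1) for i in range(1, levels + 1)]
    n = sum(counts)
    rows = []
    for i, c in zip(range(1, levels + 1), counts):
        V = rng.standard_normal((c, d))
        V /= np.linalg.norm(V, axis=1, keepdims=True)
        rows.append(V * (n / 2 ** i))  # row norm n / 2^i, squared norm n^2 / 2^(2i)
    return np.vstack(rows)
```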
A key structural property of the matrices from these two distributions that is crucial in our analysis is the ϵ-almost orthogonality of their rows (i.e., (normalized) pairwise inner products are at most ϵ). Hence, we can find a QR-factorization of the matrix of such vectors where the upper diagonal matrix R has diagonal entries close to 1 and entries above the diagonal are close to 0. To state our result, we first provide an interpretation of the location optimization task as a selection of hash function for the rows of A. Note that left-multiplying A by CountSketch S ∈ R m×n is equivalent to hashing the rows of A to m bins with coefficients in {±1}. The greedy algorithm proceeds through the rows of A (in some order) and decides which bin to hash to, denoting this by adding an entry to S. The intuition is that our greedy approach separates heavy-norm rows (which are important "directions" in the row space) into different bins. Proof Sketch of Theorem F.1 The first step is to observe that in the greedy algorithm, when rows are examined according to a non-decreasing order of squared norms, the algorithm will isolate rows into their singleton bins until all bins are filled. In particular, this means that the heavy norm rows will all be isolated-e.g., for the spiked covariance model, Lemma F.8 presents the formal statement. Next, we show that none of the rows left to be processed (all light rows) will be assigned to the same bin as a heavy row. The main proof idea is to compare the cost of "colliding" with a heavy row to the cost of "avoiding" the heavy rows. This is the main place we use the properties of the aforementioned distributions and the fact that each heavy row is already mapped to a singleton bin. Overall, we show that at the end of the algorithm no light row will be assigned to the bins that contain heavy rows-the formal statement and proof for the spiked covariance model is in Lemma F.12. 
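The greedy position search described above can be sketched as follows, in a deliberately simplified form: all sign coefficients are fixed to +1, a single input matrix A stands in for a training set, and each row tries all k bins exhaustively. The names `greedy_countsketch` and `proj_cost` are ours, not the paper's Algorithm 3.

```python
import numpy as np

def proj_cost(A, SA):
    """Squared Frobenius cost of projecting A onto the row space of SA."""
    _, s, Vt = np.linalg.svd(SA, full_matrices=False)
    V = Vt[s > 1e-10]                 # orthonormal basis of row(SA)
    return np.linalg.norm(A - A @ V.T @ V, 'fro') ** 2

def greedy_countsketch(A, k):
    """Greedy position search: process rows of A in non-increasing norm
    order; each row is hashed to the bin minimizing the projection cost."""
    n = A.shape[0]
    order = np.argsort(-np.linalg.norm(A, axis=1), kind='stable')
    S = np.zeros((k, n))
    for i in order:
        best_bin, best_cost = 0, np.inf
        for b in range(k):
            S[b, i] = 1.0
            cost = proj_cost(A, S @ A)
            S[b, i] = 0.0
            if cost < best_cost:
                best_bin, best_cost = b, cost
        S[best_bin, i] = 1.0
    return S
```

On a toy matrix with two heavy orthogonal rows, this placement isolates the heavy rows in their own bins and keeps light rows away from them, matching the behavior established in Lemmas F.8 and F.12.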
Finally, we can interpret the randomized construction of CountSketch as a "balls and bins" experiment. In particular, considering the heavy rows, we compute the expected number of bins (i.e., rows in S_C A) that contain a heavy row. Note that the expected number of rows in S_C A that do not contain any heavy row is k · (1 - 1/k)^s ≥ k · e^{-s/(k-1)}. Hence, the expected number of rows in S_C A that contain a heavy row of A is at most k(1 - e^{-s/(k-1)}). Thus, in expectation, at least s - k(1 - e^{-s/(k-1)}) heavy rows are not mapped to an isolated bin (i.e., they collide with some other heavy rows). Then, it is straightforward to show that the squared loss of the solution corresponding to S_C is larger than the squared loss of the solution corresponding to S_L, the CountSketch constructed by Algorithm 3; see Lemma F.14 for the formal statement and its proof.

Preliminaries and notation. Left-multiplying A by a CountSketch S ∈ R^{m×n} is equivalent to hashing the rows of A to m bins with coefficients in {±1}. The greedy algorithm proceeds through the rows of A (in some order) and decides which bin to hash to, which we can think of as adding an entry to S. We will denote the bins by b_i and their summed contents by w_i.

F.1 SPIKED COVARIANCE MODEL WITH SPARSE LEFT SINGULAR VECTORS

To recap, every matrix A ∈ R^{n×d} from the distribution A_sp(s, ℓ) has s < k "heavy" rows (A_{r_1}, …, A_{r_s}) of norm ℓ > 1. The indices of the heavy rows can be arbitrary, but must be the same for all members of the distribution and are unknown to the algorithm. The remaining rows (called "light" rows) have unit norm. In other words, let R = {r_1, …, r_s}. For all rows A_i, i ∈ [n], A_i = ℓ · v_i if i ∈ R and A_i = v_i otherwise, where v_i is a uniformly random unit vector. We also assume that S_r, S_g ∈ R^{k×n} and that the greedy algorithm proceeds in a non-increasing row norm order.

Proof sketch.
First, we show that the greedy algorithm using a non-increasing row norm ordering will isolate heavy rows (i.e., each is alone in a bin). Then, we conclude by showing that this yields a better rank-k approximation error when d is sufficiently large compared to n. We begin with some preliminary observations that will be of use later. It is well-known that a set of uniformly random vectors is ϵ-almost orthogonal (i.e., the magnitudes of their pairwise inner products are at most ϵ).

Observation F.3. Let v_1, …, v_n ∈ R^d be a set of random unit vectors. Then with probability 1 - 1/poly(n), we have |⟨v_i, v_j⟩| ≤ √(2 log n / d) for all i < j ≤ n.

We define ϵ = √(2 log n / d).

Observation F.4. Let u_1, …, u_t be a set of vectors such that for each pair i < j ≤ t, |⟨u_i/∥u_i∥, u_j/∥u_j∥⟩| ≤ ϵ, and let g_1, …, g_t ∈ {-1, 1}. Then,

Σ_{i=1}^t ∥u_i∥_2^2 - 2ϵ Σ_{i<j≤t} ∥u_i∥_2 ∥u_j∥_2 ≤ ∥Σ_{i=1}^t g_i u_i∥_2^2 ≤ Σ_{i=1}^t ∥u_i∥_2^2 + 2ϵ Σ_{i<j≤t} ∥u_i∥_2 ∥u_j∥_2.   (F.1)

Next, a straightforward consequence of ϵ-almost orthogonality is that we can find a QR-factorization of the matrix of such vectors where R (an upper triangular matrix) has diagonal entries close to 1 and entries above the diagonal close to 0.

Lemma F.5. Let u_1, …, u_t ∈ R^d be a set of unit vectors such that for any pair i < j ≤ t, |⟨u_i, u_j⟩| ≤ ϵ, where ϵ = O(t^{-2}). There exists an orthonormal basis e_1, …, e_t for the subspace spanned by u_1, …, u_t such that for each i ≤ t, u_i = Σ_{j=1}^i a_{i,j} e_j, where a_{i,i}^2 ≥ 1 - Σ_{j=1}^{i-1} j^2 · ϵ^2 and, for each j < i, a_{i,j}^2 ≤ j^2 ϵ^2.

Proof. We follow the Gram–Schmidt process to construct the orthonormal basis e_1, …, e_t of the space spanned by u_1, …, u_t, by first setting e_1 = u_1 and then processing u_2, …, u_t one by one. The proof is by induction. We show that once the first j vectors u_1, …, u_j are processed, the statement of the lemma holds for these vectors.
Note that the base case of the induction trivially holds as u 1 = e 1 . Next, suppose that the induction hypothesis holds for the first ℓ vectors u 1 , • • • , u ℓ . Claim F.6. For each j ≤ ℓ, a 2 ℓ+1,j ≤ j 2 ϵ 2 . Proof. The proof of the claim is itself by induction. Note that, for j = 1 and using the fact that |⟨u 1 , u ℓ+1 ⟩| ≤ ϵ, the statement holds and a 2 ℓ+1,1 ≤ ϵ 2 . Next, suppose that the statement holds for all j ≤ i < ℓ. Then using that |⟨u i+1 , u ℓ+1 ⟩| ≤ ϵ, |a ℓ+1,i+1 | ≤ (|⟨u ℓ+1 , u i+1 | + i j=1 |a ℓ+1,j | • |a i+1,j |)/|a i+1,i+1 | ≤ (ϵ + i j=1 j 2 ϵ 2 )/|a i+1,i+1 | ▷ by the induction hypothesis on a ℓ+1,j for j ≤ i ≤ (ϵ + i j=1 j 2 ϵ 2 )/(1 - i j=1 j 2 • ϵ 2 ) 1/2 ▷ by the induction hypothesis on a i+1,i+1 ≤ (ϵ + i j=1 j 2 ϵ 2 ) • (1 - i j=1 j 2 • ϵ 2 ) 1/2 • (1 + 2 • i j=1 j 2 ϵ 2 ) ≤ (ϵ + i j=1 j 2 ϵ 2 ) • (1 + 2 • i j=1 j 2 ϵ 2 ) ≤ ϵ(( i j=1 j 2 ϵ) • (1 + 4ϵ • i j=1 j 2 ϵ) + 1) ≤ ϵ(i + 1) ▷ by ϵ = O(t -2 ) Finally, since ∥u ℓ+1 ∥ 2 2 = 1, a 2 ℓ+1,ℓ+1 ≥ 1 - ℓ j=1 j 2 ϵ 2 . Corollary F.7. Suppose that ϵ = O(t -2 ). There exists an orthonormal basis e 1 , • • • , e t for the space spanned by the randomly picked vectors v 1 , • • • , v t , of unit norm, so that for each i, v i = i j=1 a i,j e j where a 2 i,i ≥ 1 -i-1 j=1 j 2 • ϵ 2 and for each j < i, a 2 i,j ≤ j 2 • ϵ 2 . Proof. The proof follows from Lemma F.5 and the fact that the set of vectors v 1 , • • • , v t is ϵ-almost orthogonal (by Observation F.3). The first main step is to show that the greedy algorithm (with non-increasing row norm ordering) will isolate rows into their own bins until all bins are filled. In particular, this means that the heavy rows (the first to be processed) will all be isolated. We note that because we set rank(SA) = k, the k-rank approximation cost is the simplified expression AV V ⊤ -A 2 F , where U ΣV ⊤ = SA, rather than [AV ] k V ⊤ -A F . This is just the projection cost onto row(SA). 
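Both ingredients above are easy to check numerically: random unit vectors in high dimension have small pairwise inner products and a near-identity triangular QR factor (Observation F.3, Lemma F.5, Corollary F.7), and the projection cost onto row(SA) plus the sum of squared projection coefficients equals ∥A∥_F^2 exactly. A small illustration (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# (i) eps-almost orthogonality and the near-identity QR factor.
t, d = 5, 10000
V = rng.standard_normal((t, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
eps = np.abs((V @ V.T)[np.triu_indices(t, 1)]).max()  # max pairwise |inner product|
Q, R = np.linalg.qr(V.T)                              # v_i = sum_j R[j, i] e_j

# (ii) projection cost vs. sum of squared projection coefficients:
# ||A - A W^T W||_F^2 + sum_{i,j} <A_i, w_j>^2 = ||A||_F^2 for orthonormal rows W.
A = rng.standard_normal((8, 6))
S = np.zeros((3, 8))
S[rng.integers(0, 3, size=8), np.arange(8)] = 1.0     # a CountSketch placement
_, s, Wt = np.linalg.svd(S @ A, full_matrices=False)
W = Wt[s > 1e-10]
cost = np.linalg.norm(A - A @ W.T @ W, 'fro') ** 2
captured = ((A @ W.T) ** 2).sum()
```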
Also, we observe that minimizing this projection cost is the same as maximizing the sum of squared projection coefficients:

arg min_S ∥A - AVV^⊤∥_F^2 = arg min_S Σ_{i∈[n]} ∥A_i - (⟨A_i, v_1⟩v_1 + … + ⟨A_i, v_k⟩v_k)∥_2^2
= arg min_S Σ_{i∈[n]} (∥A_i∥_2^2 - Σ_{j∈[k]} ⟨A_i, v_j⟩^2)
= arg max_S Σ_{i∈[n]} Σ_{j∈[k]} ⟨A_i, v_j⟩^2.

In the following sections, we will prove that our greedy algorithm makes certain choices by showing that these choices maximize the sum of squared projection coefficients.

Lemma F.8. For any matrix A or batch of matrices A, at the end of iteration k, the learned CountSketch matrix S maps each row to an isolated bin. In particular, heavy rows are mapped to isolated bins.

Proof. For any iteration i ≤ k, we consider the choice of assigning A_i to an empty bin versus an occupied bin. Without loss of generality, let this occupied bin be b_{i-1}, which already contains A_{i-1}. We consider the difference in cost for empty versus occupied. We will do this cost comparison for A_j with j ≤ i - 2, j ≥ i + 1, and finally j ∈ {i - 1, i}. First, we let {e_1, …, e_i} be an orthonormal basis for {A_1, …, A_i} such that for each r ≤ i, A_r = Σ_{j=1}^r a_{r,j} e_j, where a_{r,r} > 0. This exists by Lemma F.5. Let {e_1, …, e_{i-2}, e} be an orthonormal basis for {A_1, …, A_{i-2}, A_{i-1} ± A_i}. Now, e = c_0 e_{i-1} + c_1 e_i for some c_0, c_1, because (A_{i-1} ± A_i) - proj_{{e_1,…,e_{i-2}}}(A_{i-1} ± A_i) ∈ span(e_{i-1}, e_i). We note that c_0^2 + c_1^2 = 1 because we let e be a unit vector. We can find c_0, c_1 to be:

c_0 = (a_{i-1,i-1} + a_{i,i-1})/√((a_{i-1,i-1} + a_{i,i-1})^2 + a_{i,i}^2),   c_1 = a_{i,i}/√((a_{i-1,i-1} + a_{i,i-1})^2 + a_{i,i}^2).

1. j ≤ i - 2: The cost is zero for both cases because A_j ∈ span({e_1, …, e_{i-2}}).
2. j ≥ i + 1: We compare the rewards (sums of squared projection coefficients) and find that {e_1, …, e_{i-2}, e} is no better than {e_1, …, e_i}.
Indeed, by the Cauchy–Schwarz inequality,

⟨A_j, e⟩^2 = (c_0⟨A_j, e_{i-1}⟩ + c_1⟨A_j, e_i⟩)^2 ≤ (c_0^2 + c_1^2)(⟨A_j, e_{i-1}⟩^2 + ⟨A_j, e_i⟩^2) = ⟨A_j, e_{i-1}⟩^2 + ⟨A_j, e_i⟩^2.

3. j ∈ {i - 1, i}: We compute the sum of squared projection coefficients of A_{i-1} and A_i onto e:

(1/((a_{i-1,i-1} + a_{i,i-1})^2 + a_{i,i}^2)) · (a_{i-1,i-1}^2 (a_{i-1,i-1} + a_{i,i-1})^2 + (a_{i,i-1}(a_{i-1,i-1} + a_{i,i-1}) + a_{i,i} a_{i,i})^2)
= (1/((a_{i-1,i-1} + a_{i,i-1})^2 + a_{i,i}^2)) · ((a_{i-1,i-1} + a_{i,i-1})^2 (a_{i-1,i-1}^2 + a_{i,i-1}^2) + a_{i,i}^4 + 2a_{i,i-1} a_{i,i}^2 (a_{i-1,i-1} + a_{i,i-1})).   (F.2)

On the other hand, the sum of squared projection coefficients of A_{i-1} and A_i onto {e_{i-1}, e_i} is:

(((a_{i-1,i-1} + a_{i,i-1})^2 + a_{i,i}^2)/((a_{i-1,i-1} + a_{i,i-1})^2 + a_{i,i}^2)) · (a_{i-1,i-1}^2 + a_{i,i-1}^2 + a_{i,i}^2).   (F.3)

Hence, the difference between the sums of squared projections of A_{i-1} and A_i onto {e_{i-1}, e_i} and onto e is ((F.3) - (F.2))

(a_{i,i}^2 ((a_{i-1,i-1} + a_{i,i-1})^2 + a_{i-1,i-1}^2 + a_{i,i-1}^2 - 2a_{i,i-1}(a_{i-1,i-1} + a_{i,i-1})))/((a_{i-1,i-1} + a_{i,i-1})^2 + a_{i,i}^2) = (2a_{i,i}^2 a_{i-1,i-1}^2)/((a_{i-1,i-1} + a_{i,i-1})^2 + a_{i,i}^2) > 0.

Thus, we find that {e_1, …, e_i} is a strictly better basis than {e_1, …, e_{i-2}, e}. This means the greedy algorithm will choose to place A_i in an empty bin.

Next, we show that none of the rows left to be processed (all light rows) will be assigned to the same bin as a heavy row. The main proof idea is to compare the cost of "colliding" with a heavy row to the cost of "avoiding" the heavy rows. Specifically, we compare the decrease (before and after bin assignment of a light row) in the sum of squared projection coefficients, lower-bounding it in the former case and upper-bounding it in the latter. We introduce some results that will be used in Lemma F.12.

Claim F.9. Let A_{k+r}, r ∈ [1, …, n - k], be a light row not yet processed by the greedy algorithm. Let {e_1, …, e_k} be the Gram–Schmidt basis for the current {w_1, …, w_k}.
Let β = O(n -1 k -3 ) upper bound the inner products of the normalized A k+r , w 1 , . . . , w k . Then, for any bin i, ⟨e i , A k+r ⟩ 2 ≤ β 2 • k 2 . Proof. This is a straightforward application of Lemma F.5. From that, we have ⟨A k+r , e i ⟩ 2 ≤ i 2 β 2 , for i ∈ [1, . . . , k], which means ⟨A k+r , e i ⟩ 2 ≤ k 2 β 2 . Claim F.10. Let A k+r be a light row that has been processed by the greedy algorithm. Let {e 1 , . . . , e k } be the Gram-Schmidt basis for the current {w 1 , . . . , w k }. If A k+r is assigned to bin b k-1 (w.l.o.g.), the squared projection coefficient of A k+r onto e i , i ̸ = k -1 is at most 4β 2 • k 2 , where β = O(n -1 k -3 ) upper bounds the inner products of normalized A k+r , w 1 , • • • , w k . Proof. Without loss of generality, it suffices to bound the squared projection of A k+r onto the direction of w k that is orthogonal to the subspace spanned by w 1 , • • • , w k-1 . Let e 1 , • • • , e k be an orthonormal basis of w 1 , • • • , w k guaranteed by Lemma F.5. Next, we expand the orthonormal basis to include e k+1 so that we can write the normalized vector of A k+r as v k+r = k+1 j=1 b j e j . By a similar approach to the proof of Lemma F.5, for each j ≤ k -2, b j ≤ β 2 j 2 . Next, since |⟨w k , v k+r ⟩| ≤ β, |b k | ≤ 1 |⟨w k , e k ⟩| • (|⟨w k , v k+r ⟩| + k-1 j=1 |b j • ⟨w k , e j ⟩|) ≤ 1 1 - k-1 j=1 β 2 • j 2 • (β + k-2 j=1 β 2 • j 2 + (k -1) • β) ▷ |b k-1 | ≤ 1 = β + k-2 j=1 β 2 • j 2 1 - k-1 j=1 β 2 • j 2 + (k -1)β ≤ 2(k -1)β - β 2 (k -1) 2 1 - k-1 j=1 β 2 • j 2 ▷ similar to the proof of Lemma F.5 < 2β • k Hence, the squared projection of A k+r onto e k is at most 4β 2 • k 2 • ∥A k+r ∥ 2 2 . We assumed ∥A k+r ∥ = 1; hence, the squared projection of A k+r onto e k is at most 4β 2 • k 2 . Claim F.11. 
We assume that the absolute values of the inner products of vectors in v 1 , • • • , v n are at most ϵ < 1/(n 2 Ai∈b ∥A i ∥ 2 ) and the absolute values of the inner products of the normalized vectors of w 1 , • • • , w k are at most β = O(n -3 k -3 2 ). Suppose that bin b contains the row A k+r . Then, the squared projection of A k+r onto the direction of w orthogonal to span({w 1 , • • • , w k } \ {w}) is at most ∥A k+r ∥ 4 2 ∥w∥ 2 2 + O(n -2 ) and is at least ∥A k+r ∥ 4 2 ∥w∥ 2 2 -O(n -2 ). Proof. Without loss of generality, we assume that A k+r is mapped to b k ; w = w k . First, we provide an upper and a lower bound for |⟨v k+r , w k ⟩| where for each i ≤ k, we let w i = wi ∥wi∥ 2 denote the normalized vector of w i . Recall that by definition v k+r = A k+r ∥A k+r ∥ 2 . |⟨w k , v k+r ⟩| ≤ ∥A k+r ∥ 2 + Ai∈b k ϵ ∥A i ∥ 2 ∥w k ∥ 2 ≤ ∥A k+r ∥ 2 + n -2 ∥w k ∥ 2 ▷ by ϵ < n -2 Ai∈b k ∥A i ∥ 2 ≤ ∥A k+r ∥ 2 ∥w k ∥ 2 + n -2 ▷ ∥w k ∥ 2 ≥ 1 (F.4) |⟨w k , v k+r ⟩| ≥ ∥A k+r ∥ 2 -Ai∈b k ∥A i ∥ 2 • ϵ ∥w k ∥ 2 ≥ ∥A k+r ∥ 2 ∥w k ∥ 2 -n -2 (F.5) Now, let {e 1 , • • • , e k } be an orthonormal basis for the subspace spanned by {w 1 , • • • , w k } guaranteed by Lemma F.5. Next, we expand the orthonormal basis to include e k+1 so that we can write v k+r = k+1 j=1 b j e j . By a similar approach to the proof of Lemma F.5, we can show that for each j ≤ k -1, b 2 j ≤ β 2 j 2 . 
Moreover, |b k | ≤ 1 |⟨w k , e k ⟩| • (|⟨w k , v k+r ⟩| + k-1 j=1 |b j • ⟨w k , e j ⟩|) ≤ 1 1 - k-1 j=1 β 2 • j 2 • (|⟨w k , v k+r ⟩| + k-1 j=1 β 2 • j 2 ) ▷ by Lemma F.5 ≤ 1 1 - k-1 j=1 β 2 • j 2 • (n -2 + ∥A k+r ∥ 2 ∥w k ∥ 2 + k-1 j=1 β 2 • j 2 ) ▷ by (F.4) < β • k + 1 1 -β 2 k 3 • (n -2 + ∥A k+r ∥ 2 ∥w k ∥ 2 ) ▷ similar to the proof of Lemma F.5 ≤ O(n -2 ) + (1 + O(n -2 )) ∥A k+r ∥ 2 ∥w k ∥ 2 ▷ by β = O(n -3 k -3 2 ) ≤ ∥A k+r ∥ 2 ∥w k ∥ 2 + O(n -2 ) ▷ ∥A k+r ∥ 2 ∥w k ∥ 2 ≤ 1 and, |b k | ≥ 1 |⟨w k , e k ⟩| • (|⟨w k , v k+r ⟩| - k-1 j=1 |b j • ⟨w k , e j ⟩|) ≥ |⟨w k , v k+r ⟩| - k-1 j=1 β 2 • j 2 ▷ since |⟨w k , e k ⟩| ≤ 1 ≥ ∥A k+r ∥ 2 ∥w k ∥ 2 -n -2 - k-1 j=1 β 2 • j 2 ▷ by (F.5) ≥ ∥A k+r ∥ 2 ∥w k ∥ 2 -O(n -2 ) ▷ by β = O(n -3 k -3 2 ) Hence, the squared projection of A k+r onto e k is at most ∥A k+r ∥ 4 2 ∥w k ∥ 2 2 +O(n -2 ) and is at least ∥A k+r ∥ 4 2 ∥w k ∥ 2 2 - O(n -2 ). Now, we show that at the end of the algorithm no light row will be assigned to the bins that contain heavy rows. Lemma F.12. We assume that the absolute values of the inner products of vectors in v 1 , • • • , v n are at most ϵ < min{n -2 k -5 3 , (n Ai∈w ∥A i ∥ 2 ) -1 }. At iteration k + r, the greedy algorithm will assign the light row A k+r to a bin that does not contain a heavy row. Proof. The proof is by induction. Lemma F.8 implies that no light row has been mapped to a bin that contains a heavy row for the first k iterations. Next, we assume that this holds for the first k + r -1 iterations and show that is also must hold for the (k + r)-th iteration. To this end, we compare the sum of squared projection coefficients when A k+r avoids and collides with a heavy row. First, we upper bound β = max i̸ =j≤k |⟨w i , w j ⟩|/(∥w i ∥ 2 ∥w j ∥ 2 ). Let c i and c j respectively denote the number of rows assigned to b i and b j . 
β = max i̸ =j≤k |⟨w i , w j ⟩| ∥w i ∥ 2 ∥w j ∥ 2 ≤ c i • c j • ϵ c i -2ϵc 2 i • c j -2ϵc 2 j ▷ Observation F.4 ≤ 16ϵ √ c i c j ▷ϵ ≤ n -2 k -5/3 ≤ n -1 k -5 3 ▷ϵ ≤ n -2 k -5/3 1. If A k+r is assigned to a bin that contains c light rows and no heavy rows. In this case, the projection loss of the heavy rows A 1 , • • • , A s onto row(SA) remains zero. Thus, we only need to bound the change in the sum of squared projection coefficients of the light rows before and after iteration k + r. Without loss of generality, let w k denote the bin that contains A k+r . Since S k-1 = span({w 1 , • • • , w }) has not changed, we only need to bound the difference in cost between projecting onto the component of w k -A k+r orthogonal to S k-1 and the component of w k orthogonal to S k-1 , respectively denoted as e k and e k . 1. By Claim F.9, for the light rows that are not yet processed (i.e., A j for j > k + r), the squared projection of each onto e k is at most β 2 k 2 . Hence, the total decrease in the squared projection is at most (n -k -r) • β 2 k 2 . 2. By Claim F.10, for the processed light rows that are not mapped to the last bin, the squared projection of each onto e k is at most 4β 2 k 2 . Hence, the total decrease in the squared projection cost is at most (r -1) • 4β 2 k 2 . 3. For each row A i ̸ = A k+r that is mapped to the last bin, by Claim F.11 and the fact ∥A i ∥ Hence, the total squared projection of the rows in the bin b k decreases by at least: ( Ai∈w k /{A r+k } ∥A i ∥ 2 2 ∥w k -A r+k ∥ 2 2 + O(n -2 )) -( Ai∈w k ∥A i ∥ 2 2 ∥w k ∥ 2 2 -O(n -2 )) ≤ ∥w k -A r+k ∥ 2 2 + O(n -1 ) ∥w k -A r+k ∥ 2 2 - ∥w k ∥ 2 2 -O(n -1 ) ∥w k ∥ 2 2 + O(n -1 ) ▷ by Observation F.4 ≤O(n -1 ) Hence, summing up the bounds in items 1 to 3 above, the total decrease in the sum of squared projection coefficients is at most O(n -1 ). 2. If A k+r is assigned to a bin that contains a heavy row. 
Without loss of generality, we can assume that A k+r is mapped to b k that contains the heavy row A s . In this case, the distance of heavy rows A 1 , • • • , A s-1 onto the space spanned by the rows of SA is zero. Next, we bound the amount of change in the squared distance of A s and light rows onto the space spanned by the rows of SA. Note that the (k -1)-dimensional space corresponding to w 1 , • • • , w k-1 has not changed. Hence, we only need to bound the decrease in the projection distance of A k+r onto e k compared to e k (where e k , e k are defined similarly as in the last part). 1. For the light rows other than A k+r , the squared projection of each onto e k is at most β 2 k 2 . Hence, the total increase in the squared projection of light rows onto e k is at most (n-k)•β 2 k 2 = O(n -1 ). Claim F.15. Suppose that heavy rows A r1 , • • • , A rc are mapped to the same bin via a CountSketch S. Then, the total squared distances of these rows from the subspace spanned by SA is at least (c -1)ℓ -O(n -1 ). Proof. Let b denote the bin that contains the rows A r1 , • • • , A rc and suppose that it has c ′ light rows as well. Note that by Claim F.10 and Claim F.11, the squared projection of each row A ri onto the subspace spanned by the k bins is at most ∥A hi ∥ 4 2 ∥w∥ 2 2 + O(n -1 ) ≤ ℓ 2 cℓ + c ′ -2ϵ(c 2 ℓ + cc ′ √ ℓ + c ′2 ) + O(n -1 ) ≤ ℓ 2 cℓ -n -O(1) + n -O(1) ▷ by ϵ ≤ n -3 ℓ -1 ≤ ℓ 2 c 2 ℓ 2 • (cℓ + O(n -1 ) + O(n -1 ) ≤ ℓ c + O(n -1 ) Hence, the total squared loss of these c heavy rows is at least cℓ -c • ( ℓ c + O(n -1 )) ≥ (c -1)ℓ -O(n -1 ). Thus, the expected total squared loss of the heavy rows is at least: Next, we compute a lower bound on the expected squared loss of the light rows. 
Note that Claim F.10 and Claim F.11 imply that when a light row collides with other rows, its contribution to the total squared loss (note that the loss accounts for the amount it decreases from the squared projection of the other rows in the bin as well) is at least 1 -O(n -1 ). Hence, the expected total squared loss of the light rows is at least: (n -s -k)(1 -O(n -1 )) ≥ (n -(1 + α) • k) -O(n -1 ) Hence, the expected squared loss of a CountSketch whose sparsity is picked at random is at least . Let S g be the CountSketch whose sparsity pattern is learned over a training set drawn from A sp via the greedy approach. Let S r be a CountSketch whose sparsity pattern is picked uniformly at random. Then, for an n × d matrix A ∼ A sp where d = Ω(n 6 ℓ 2 ), the expected loss of the best rank-k approximation of A returned by S r is worse than the approximation loss of the best rank-k approximation of A returned by S g by at least a constant factor. Recall that all heavy rows have squared norm at least n 2 2 2hs . There must be a bin b that only contains light rows and has squared norm at most ∥w∥ 2 2 = Ai∈b ∥A i ∥ 2 2 ≤ n 2 2 2(hs+1) + hn i=h k +1 2 i+1 n 2 2 2i k -s ≤ n 2 2 2(hs+1) + 2n 2 2 h k (k -s) ≤ n 2 2 2(hs+1) + n 2 2 2h k ▷ s ≤ k/2 and k > 2 h k +1 ≤ n 2 2 2hs+1 ▷ h k ≥ h s + 1 < ∥A s ∥ 2 2 Hence, the greedy algorithm will map A k+r to a bin that only contains light rows. Corollary F.18. The squared loss of the best rank-k approximate solution in the rowspace of S g A for A ∈ R n×d ∼ A zipf where A ∈ R n×d and S g is the CountSketch constructed by the greedy algorithm with non-increasing order, is < n 2 2 h k -2 . Proof. At the end of iteration k, the total squared loss is hn i=h k +1 2 i+1 • n 2 2 2i . After that, in each iteration k + r, by (F.6), the squared loss increases by at most ∥A k+r ∥ 2 2 . Hence, the total squared



Footnotes:
- https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
- The framework of Indyk et al. (2019) does not apply to the iterative sketching methods in a straightforward manner, so here we only compare with the classical CountSketch. For more details, please refer to Section B.
- https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+exposed+to+turbulent+gas+mixtures
- The numerical calculation is computed using WolframAlpha.



Figure 7.1: Test error of LASSO on the Electric dataset.

and |(A, b)_train| = 320, |(A, b)_test| = 80. We set λ = 15.

Figure A.1: Total variation distance between train and test matrices. Left: Logo; middle: Friends; right: Hyper.

Figure B.1: Test error of LASSO on the Electric dataset.

B.2 MATRIX ESTIMATION WITH A NUCLEAR NORM CONSTRAINT

In many applications, for the problem

Figure B.2: Test error of matrix estimation with a nuclear norm constraint on the Tunnel dataset

Lemma C.2 (Clarkson & Woodruff (2017), Lemma 40). Let A be an n × d matrix and let S ∈ R^{O(1/ϵ^2)×n} be a CountSketch matrix. Then, with constant probability, ∥SA∥_F^2 = (1 ± ϵ)∥A∥_F^2.

The following result is shown in Clarkson & Woodruff (2017) and sharpened in Nelson & Nguyên (2013); Meng & Mahoney (2013).

Lemma C.3. Given matrices A, B with n rows, a CountSketch with O(rank(A)^2/ϵ^2) rows is an affine ϵ-embedding matrix with constant probability. Moreover, the matrix product SA can be computed in O(nnz(A)) time, where nnz(A) denotes the number of non-zero entries of matrix A.
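The O(nnz(A)) running time in Lemma C.3 comes from the fact that each column of a CountSketch has a single ±1 entry, so SA is formed by adding each row of A, with a random sign, into one of m hashed buckets. A minimal numpy sketch (a hypothetical helper, not the paper's code):

```python
import numpy as np

def countsketch_apply(A, m, seed=0):
    """Apply an m-row CountSketch S to A, touching only the non-zeros of A.

    Column i of S has one non-zero entry sigma[i] in {-1, +1} at a
    uniformly random row h[i], so (SA)[h[i]] accumulates sigma[i] * A[i]."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    h = rng.integers(0, m, size=n)        # bucket of each row of A
    sigma = rng.choice([-1, 1], size=n)   # random sign of each row
    SA = np.zeros((m, d))
    rows, cols = np.nonzero(A)            # iterate only over nnz(A) entries
    np.add.at(SA, (h[rows], cols), sigma[rows] * A[rows, cols])
    return SA
```

The result agrees with explicitly materializing S (an m × n matrix with S[h[i], i] = sigma[i]) and computing S @ A, but without forming S.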

Lemma C.4 (Sarlos (2006); Clarkson & Woodruff (2017)). Suppose that A ∈ R n×d and B

the squared projection of A_i onto e_k is at most ∥A_i∥_2^2/∥w_k − A_{k+r}∥_2^2 + O(n^{−2}), and the squared projection of A_i onto e_k is at least ...; the squared projection of A_{k+r} onto e_k compared to e_k increases by at least (

ℓ · (s − k(1 − e^{−s/(k−1)})) − s · n^{−O(1)} ≥ ℓ · k(α − 1 + e^{−α}) − ℓα − n^{−O(1)}   ▷ s = α · (k − 1) where 0.7 < α < 1

− O(n^{−1}) + n − (1 + α)k − O(n^{−1}) ≥ n + ℓk/(4e) − (1 + α)k − O(n^{−1}).

Corollary F.16. Let s = α(k − 1), where 0.7 < α < 1, and let ℓ ≥ (4e + 1)n/(αk)

is larger than (F.6) if ∥A_j∥_2^2 ≥ Σ_{A_i ∈ b_k} ∥A_i∥_2^2. Next, we show that at every inductive iteration, there exists a bin b which only contains light rows and whose squared norm is smaller than the squared norm of any heavy row. For each value m, define h_m so that m = Σ_{i=1}^{h_m} 2^{i+1} = 2^{h_m+2} − 4.

Table A.1.

ACKNOWLEDGEMENTS

Yi Li would like to thank for the partial support from the Singapore Ministry of Education under Tier 1 grant RG75/21. Honghao Lin and David Woodruff were supported in part by an Office of Naval Research (ONR) grant N00014-18-1-2562. Ali Vakilian was supported by NSF award CCF-1934843. 


Thus, for a sufficiently large value of ℓ, the greedy algorithm will assign A_{k+r} to a bin that only contains light rows. This completes the inductive proof and, in particular, implies that at the end of the algorithm all heavy rows are assigned to isolated bins.

Corollary F.13. The approximation loss of the best rank-k approximate solution in the rowspace of S_g A for A ∼ A_sp(s, ℓ), where A ∈ R^{n×d} for d = Ω(n^4 k^4 log n) and S_g is the CountSketch constructed by the greedy algorithm with non-increasing order, is at most n − s.

Proof. First, we need to show that the absolute values of the inner products of the vectors v_1, ..., v_n are at most ϵ < min{n^{−2}k^{−2}, (n Σ_{A_i ∈ w} ∥A_i∥_2)^{−1}}, so that we can apply Lemma F.12. To show this, note that by Observation F.3, ϵ ≤ 2√(log n / d) ≤ n^{−2}k^{−2} since d = Ω(n^4 k^4 log n). The proof then follows from Lemma F.8 and Lemma F.12: since all heavy rows are mapped to isolated bins, the projection loss of the light rows is at most n − s.

Next, we bound the Frobenius-norm error of the best rank-k approximate solution constructed by a standard CountSketch with a randomly chosen sparsity pattern.

Lemma F.14. Let s = αk, where 0.7 < α < 1. The expected squared loss of the best rank-k approximate solution in the rowspace of S_r A for A ∈ R^{n×d} ∼ A_sp(s, ℓ), where d = Ω(n^6 ℓ^2) and the sparsity pattern of the CountSketch S_r is chosen uniformly at random, is at least n + ℓk/(4e) − (1 + α)k − O(n^{−1}).

Proof. We can interpret the randomized construction of the CountSketch as a "balls and bins" experiment. In particular, considering the heavy rows, we compute the expected number of bins (i.e., rows in S_r A) that contain a heavy row. Note that the expected number of rows in S_r A that do not contain any heavy row is k(1 − 1/k)^s ≥ k · e^{−s/(k−1)}. Hence, the number of rows in S_r A that contain a heavy row of A is at most k(1 − e^{−s/(k−1)}). Thus, at least s − k(1 − e^{−s/(k−1)}) heavy rows are not mapped to an isolated bin (i.e., they collide with some other heavy rows).
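The balls-and-bins count in the proof of Lemma F.14 can be checked empirically. The simulation below (a hypothetical helper, not from the paper) hashes s heavy rows uniformly into k bins and averages the number of heavy rows that are not isolated; this concentrates near the exact value s − s(1 − 1/k)^{s−1} and stays above the proof's lower bound s − k(1 − e^{−s/(k−1)}).

```python
import math
import random

def heavy_collisions(s, k, trials=20000, seed=1):
    """Average number of heavy rows that share a bin with another heavy
    row when s heavy rows hash uniformly into k bins."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * k
        for _ in range(s):
            counts[rng.randrange(k)] += 1
        # rows landing in bins with >= 2 heavy rows are not isolated
        total += sum(c for c in counts if c >= 2)
    return total / trials
```

For instance, with s = 7 and k = 10 the exact expectation is 7 − 7·0.9^6 ≈ 3.28, while the lower bound 7 − 10(1 − e^{−7/9}) ≈ 1.59 is comfortably satisfied.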
Then it is straightforward to show that the squared loss of each such row is at least ℓ − n^{−O(1)}. This completes the proof of Lemma F.14.

F.2 ZIPFIAN ON SQUARED ROW NORMS

Each matrix A ∈ R^{n×d} ∼ A_zipf has rows which are uniformly random and orthogonal. Each A has 2^{i+1} rows of squared norm n^2/2^{2i} for i ∈ {1, ..., O(log n)}. We also assume that each row has the same squared norm for all members of A_zipf.

In this section, the s rows with largest norm are called the heavy rows and the remaining ones are the light rows. For convenience, we number the heavy rows 1, ..., s; however, the heavy rows can appear at any indices, as long as any row of a given index has the same norm for all members of A_zipf. Also, we assume that s ≤ k/2 and, for simplicity, that s = Σ_{i=1}^{h_s} 2^{i+1} for some h_s ∈ Z^+. This means the minimum squared norm of a heavy row is n^2/2^{2h_s} and the maximum squared norm of a light row is n^2/2^{2h_s+2}.

The analysis of the greedy algorithm ordered by non-increasing row norms on this family of matrices is similar to our analysis for the spiked covariance model. Here we analyze the case in which the rows are orthogonal; by continuity, if the rows are close enough to being orthogonal, all decisions made by the greedy algorithm will be the same.

As a first step, by Lemma F.8, at the end of iteration k the first k rows are assigned to different bins. Then, via a similar inductive proof, we show that none of the light rows is mapped to a bin that contains one of the top s heavy rows.

Lemma F.17. At each iteration k + r, the greedy algorithm picks the position of the non-zero value in the (k + r)-th column of the CountSketch matrix S so that the light row A_{k+r} is mapped to a bin that does not contain any of the top s heavy rows.

Proof. We prove the statement by induction. The base case r = 0 trivially holds, as the first k rows are assigned to distinct bins. Next, we assume that in none of the first k + r − 1 iterations is a light row assigned to a bin that contains a heavy row.
Now, we consider the following cases:

1. A_{k+r} is assigned to a bin that only contains light rows. Without loss of generality, we can assume that A_{k+r} is assigned to b_k. Since the vectors are orthogonal, we only need to bound the difference in the projections of A_{k+r} and of the light rows already assigned to b_k onto the direction of w_k, before and after adding A_{k+r} to b_k. In this case, the total squared losses corresponding to the rows in b_k and A_{k+r} before and after adding A_{k+r} to b_k are, respectively, ...; the loss in the solution returned by S_g is at most ...

Next, we bound the squared loss of the best rank-k approximate solution constructed by the standard CountSketch with a randomly chosen sparsity pattern.

Observation F.19. Assume that the orthogonal rows A_{r_1}, ..., A_{r_c} are mapped to the same bin and that for each ... Hence, the total projection loss of ...

In particular, Observation F.19 implies that whenever two rows are mapped into the same bin, the squared norm of the row with smaller norm fully contributes to the total squared loss of the solution.

Lemma F.20. For k > 2^{10} − 2, the expected squared loss of the best rank-k approximate solution in the rowspace of S_r A for A ∈ R^{n×d} ∼ A_zipf, where the sparsity pattern of the CountSketch S_r is chosen uniformly at random, is at least 1.095 · n^2/2^{h_k−2}.

Proof. In light of Observation F.19, we need to compute the expected number of collisions between rows with "large" norm. We can interpret the randomized construction of the CountSketch as a "balls and bins" experiment. For each 0 ≤ j ≤ h_k, let A_j denote the set of rows with squared norm ... Next, for a row A_r in A_j (0 ≤ j < h_k), we compute the probability that at least one row in A_{>j} collides with A_r.

Pr[at least one row in A_{>j} collides with A_r]

Hence, by Observation F.19, the contribution of rows in A_j to the total squared loss is at least ... Thus, the contribution of rows with "large" squared norm, i.e., A_{>0}, to the total squared loss is at least ...

Corollary F.21. Let S_g be a CountSketch whose sparsity pattern is learned over a training set drawn from A_zipf via the greedy approach, and let S_r be a CountSketch whose sparsity pattern is picked uniformly at random. Then, for an n × d matrix A ∼ A_zipf and a sufficiently large value of k, the expected loss of the best rank-k approximation of A returned by S_r is worse than the approximation loss of the best rank-k approximation of A returned by S_g by at least a constant factor.

Proof. The proof follows from Lemma F.20 and Corollary F.18.

Remark F.22. We have provided evidence that the greedy algorithm that examines the rows of A in non-increasing order of their norms (i.e., greedy with non-increasing order) results in a better rank-k solution than the CountSketch whose sparsity pattern is chosen at random. However, other implementations of the greedy algorithm may still result in a better solution than greedy with non-increasing order. As an example, on the following simple instance the greedy algorithm that checks the rows of A in a random order (i.e., greedy with random order) achieves a rank-k solution whose cost is a constant factor better than the solution returned by greedy with non-increasing order.

Let A be a matrix with four orthogonal rows u, u, v, w, where ∥u∥_2 = 1 and ∥v∥_2 = ∥w∥_2 = 1 + ϵ, and suppose that the goal is to compute a rank-2 approximation of A. Note that in the greedy algorithm with non-increasing order, v and w will be assigned to different bins, and by a simple calculation we can show that the copies of u will also be assigned to different bins. Hence, the squared loss of the computed rank-2 solution is 1 + (1 + ϵ)^2/(2 + (1 + ϵ)^2).
However, the optimal solution assigns v and w to one bin and the two copies of u to the other bin, which results in a squared loss of (1 + ϵ)^2; for sufficiently small ϵ, this is a constant factor smaller than the loss of the solution returned by the greedy algorithm with non-increasing order. On the other hand, the greedy algorithm with a random order computes the optimal solution with constant probability (1/3 + 1/8); otherwise, it returns the same solution as the greedy algorithm with non-increasing order. Hence, in expectation, the solution returned by greedy with random order is better than the solution returned by greedy with non-increasing order by a constant factor.
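To make the greedy procedure analyzed in this section concrete, here is a minimal Python sketch (a hypothetical helper, not the paper's implementation) of greedy with non-increasing order, valid only for exactly orthogonal rows. In that case a bin whose squared row norms sum to T and whose fourth powers sum to Q incurs projection loss T − Q/T, because each row A_i retains ∥A_i∥^4/T of its squared norm when projected onto the bin's direction; each row after the first k is assigned to the bin whose loss increases the least.

```python
def greedy_bins(sq_norms, k):
    """Greedy CountSketch sparsity pattern for exactly orthogonal rows,
    processed in non-increasing order of squared norm.

    Returns (assign, total_loss): assign[i] is the bin of row i, and
    total_loss is the sum over bins of T - Q/T (orthogonal-rows formula)."""
    order = sorted(range(len(sq_norms)), key=lambda i: -sq_norms[i])

    def loss(rows):
        T = sum(sq_norms[i] for i in rows)        # sum of squared norms
        Q = sum(sq_norms[i] ** 2 for i in rows)   # sum of fourth powers
        return 0.0 if T == 0 else T - Q / T

    bins = [[] for _ in range(k)]
    assign = {}
    for t, i in enumerate(order):
        if t < k:
            j = t  # first k rows go to distinct bins (as in Lemma F.8)
        else:      # afterwards: cheapest marginal increase in loss
            j = min(range(k), key=lambda b: loss(bins[b] + [i]) - loss(bins[b]))
        bins[j].append(i)
        assign[i] = j
    return assign, sum(loss(b) for b in bins)
```

For Zipfian-like squared norms [16, 8, 4, 2, 1, 0.5] and k = 3, the three largest rows get isolated bins and all remaining rows pile into the third bin, giving total loss 7.5 − 21.25/7.5.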

G EXPERIMENT DETAILS

G.1 LOW-RANK APPROXIMATION

In this section, we describe the parameters used in our experiments. We first introduce some parameters of Stage 2 of the approach proposed in Section 3:

• bs: batch size, the number of training samples used in one iteration.
• lr: learning rate of gradient descent.
• iter: the number of iterations of gradient descent.

For a given m, the dimensions of the four sketches were:

Parameters of the algorithm: bs = 1, lr = 1.0 and 10.0 for hyper and video respectively, iter = 1000. For our Algorithm 4, we use the average of all training matrices as the input to the algorithm.

G.2 SECOND-ORDER OPTIMIZATION

As we stated in Section 6, when we fix the positions of the non-zero entries (uniformly chosen in each column, or sampled according to the heavy leverage score distribution), we optimize the values by gradient descent, as described in Section 3. Here the loss function is given in Section 6. In our implementation, we use PyTorch (Paszke et al. (2019)), which computes the gradient automatically (we can use torch.qr() and torch.svd() to define our loss function). For a more nuanced loss function, which may be beneficial, one can use the package released by Agrawal et al. (2019), where the authors studied the problem of computing the gradient of functions that involve the solution to certain convex optimization problems.

As mentioned in Section 2, each column of the sketch matrix S has exactly one non-zero entry. Hence, the i-th coordinate of p can be seen as the non-zero position of the i-th column of S. In the implementation, to sample p randomly, we sample a random integer in {1, ..., m} for each coordinate of p. For the heavy rows mentioned in Section 6, we allocate positions 1, ..., k to the k heavy rows, and for the other rows we sample a random integer in {k + 1, ..., m}. We note that once the vector p, which contains the non-zero positions of the columns of S, is chosen, it is not changed during the optimization process of Section 3.

Next, we introduce the parameters of our experiments:

• bs: batch size, the number of training samples used in one iteration.
• lr: learning rate of gradient descent.
• iter: the number of iterations of gradient descent.

In our experiments, we set bs = 20 and iter = 1000 for all datasets. We set lr = 0.1 for the Electric dataset.
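The sampling of the position vector p described above can be sketched as follows. This is a minimal illustration with hypothetical helper names (the paper's 1-indexed positions {1, ..., m} become 0-indexed rows here): heavy columns get dedicated rows 0, ..., k−1, the rest hash uniformly into the remaining rows, and p stays fixed while only the values are later optimized.

```python
import numpy as np

def sample_positions(n, m, k_heavy=0, seed=0):
    """Sample p: p[i] is the row index of the single non-zero entry in
    column i of the m x n CountSketch S.  The first k_heavy columns
    (the heavy rows) are pinned to dedicated rows 0..k_heavy-1; all
    other columns hash uniformly into rows k_heavy..m-1."""
    rng = np.random.default_rng(seed)
    p = np.empty(n, dtype=int)
    p[:k_heavy] = np.arange(k_heavy)
    p[k_heavy:] = rng.integers(k_heavy, m, size=n - k_heavy)
    return p

def build_sketch(p, values, m):
    """Materialize S from the fixed positions p and the learnable values."""
    n = len(p)
    S = np.zeros((m, n))
    S[p, np.arange(n)] = values
    return S
```

During Stage 2, only `values` would be updated by gradient descent; `p` is sampled once and then frozen, matching the description above.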

