LEARNING-AUGMENTED SKETCHES FOR HESSIANS

Abstract

Sketching is a dimensionality reduction technique in which one compresses a matrix by taking linear combinations of its rows or columns, often at random. A line of work has shown how to sketch the Hessian to speed up each iteration in a second order method, but such sketches usually depend only on the matrix at hand, and in a number of cases are even oblivious to the input matrix. One could instead hope to learn a distribution on sketching matrices that is optimized for the specific distribution of input matrices. We show how to design learned sketches for the Hessian in the context of second order methods, where we learn potentially different sketches for the different iterations of an optimization procedure. We show empirically that learned sketches, compared with their "non-learned" counterparts, improve the approximation accuracy for important problems, including LASSO, SVM, and matrix estimation with nuclear norm constraints. Several of our schemes can be proven to perform no worse than their unlearned counterparts.

1. INTRODUCTION

Large-scale optimization problems are abundant, and solving them efficiently requires powerful tools to make the computation practical. This is especially true of second order methods, which are often less practical than first order ones. Although second order methods may need many fewer iterations, each iteration could involve inverting a large Hessian, which takes cubic time; in contrast, first order methods such as stochastic gradient descent take linear time per iteration. To make each iteration of a second order method faster, a large body of work has looked at dimensionality reduction techniques, such as sampling, sketching, or approximating the Hessian by a low rank matrix. See, for example, (Gower et al., 2016; Xu et al., 2016; Pilanci & Wainwright, 2016; 2017; Doikov & Richtárik, 2018; Gower et al., 2018; Roosta-Khorasani & Mahoney, 2019; Gower et al., 2019; Kylasa et al., 2019; Xu et al., 2020; Li et al., 2020). Sketching has a long history in theoretical computer science (see, e.g., (Woodruff, 2014) for a survey), and we describe such methods more below. A special case of sketching is sampling, which in practice is often uniform sampling, and hence oblivious to properties of the actual matrix. Other times the sampling is non-uniform, and based on squared norms of submatrices of the Hessian or on the so-called leverage scores of the Hessian. Our focus is on sketching techniques, which often consist of multiplying the Hessian by a random matrix chosen independently of the Hessian; in particular, we follow the framework of (Pilanci & Wainwright, 2016; 2017), which introduced the iterative Hessian sketch and the Newton sketch, as well as the high accuracy refinement given in (van den Brand et al., 2020). If one runs Newton's method to find a point where the gradient is zero, then in each iteration one needs to solve an equation involving the current Hessian and gradient to find the update direction.
When the Hessian can be decomposed as A^T A for an n × d matrix A with n ≫ d, sketching is particularly suitable. The iterative Hessian sketch was proposed in (Pilanci & Wainwright, 2016), where A is replaced with S·A for a random matrix S, which could be i.i.d. Gaussian or drawn from a more structured family of random matrices such as Subsampled Randomized Hadamard Transforms or COUNT-SKETCH matrices; the latter was done in (Cormode & Dickens, 2019). The Newton sketch was proposed in (Pilanci & Wainwright, 2017), which extended sketching methods beyond constrained least-squares problems to any twice differentiable function subject to a closed convex constraint set. Using this sketch inside of interior point updates has led to much faster algorithms for an extensive body of convex optimization problems (Pilanci & Wainwright, 2017). By instead using sketching as a preconditioner, an application of the work of (van den Brand et al., 2020) (see Appendix E) was able to improve the dependence on the accuracy parameter ε to logarithmic. In general, the idea behind sketching is the following. One chooses a random matrix S, drawn from a certain family of random matrices, and computes SA. If A is tall-and-skinny, then S is short-and-fat, and thus SA is a small, roughly square matrix. Moreover, SA preserves important properties of A. One typically desired property is that S is a subspace embedding, meaning that simultaneously for all x, one has ||SAx||_2 = (1 ± ε)||Ax||_2. An observation exploited in (Cormode & Dickens, 2019), building off of the COUNT-SKETCH random matrices S introduced in randomized linear algebra in (Clarkson & Woodruff, 2017), is that if S contains O(1) non-zero entries per column, then SA can be computed in O(nnz(A)) time, where nnz(A) denotes the number of nonzero entries of A. This is sometimes referred to as input-sparsity running time.
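To make the input-sparsity claim concrete, here is a minimal numpy sketch (function names are ours) that applies a COUNT-SKETCH without ever materializing S, and checks the subspace embedding property on a random direction x:

```python
import numpy as np

def countsketch(m, n, rng):
    """Sample a COUNT-SKETCH S in R^{m x n}: one +/-1 per column at a
    uniformly random row, stored as (rows, signs) so that S @ A can be
    applied in O(nnz(A)) time without materializing S."""
    rows = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    return rows, signs

def apply_countsketch(rows, signs, m, A):
    """Compute S @ A by scattering the sign-flipped rows of A."""
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, rows, signs[:, None] * A)  # unbuffered scatter-add
    return SA

rng = np.random.default_rng(0)
n, d, m = 5000, 10, 400            # tall-and-skinny A, short-and-fat S
A = rng.standard_normal((n, d))
rows, signs = countsketch(m, n, rng)
SA = apply_countsketch(rows, signs, m, A)

# Subspace-embedding check: ||SAx||_2 should be (1 +/- eps)||Ax||_2.
x = rng.standard_normal(d)
ratio = np.linalg.norm(SA @ x) / np.linalg.norm(A @ x)
```

With m on the order of d^2, the ratio concentrates around 1, while the sketch application touches each nonzero of A exactly once.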
Each iteration of a second order method often involves solving an equation of the form A^T Ax = A^T b, where A^T A is the Hessian and b is the gradient. For a number of problems, one has access to a matrix A ∈ R^{n×d} with n ≫ d, which is also an assumption made in Pilanci & Wainwright (2017). Therefore, the solution x is the minimizer of a constrained least-squares regression problem:

min_{x∈C} (1/2)||Ax - b||_2^2,    (1)

where C is a convex constraint set in R^d. For the unconstrained case (C = R^d), various classical sketches that attain the subspace embedding property can provably yield high-accuracy approximate solutions (see, e.g., (Sarlos, 2006; Nelson & Nguyên, 2013; Cohen, 2016; Clarkson & Woodruff, 2017)); for the general constrained case, the Iterative Hessian Sketch (IHS) was proposed by Pilanci & Wainwright (2016) as an effective approach, and Cormode & Dickens (2019) employed sparse sketches to achieve input-sparsity running time for the IHS. All sketches used in these results are data-oblivious random sketches.

Learned Sketching. In the last few years, an exciting new notion of learned sketching has emerged. Here the idea is that one often sees independent samples of matrices A from a distribution D, and can train a model to learn the entries of a sketching matrix S on these samples. When given a future sample B, also drawn from D, the learned sketching matrix S will be such that S·B is a much more accurate compression of B than if S had the same number of rows and were instead drawn without knowledge of D. Moreover, the learned sketch S is often sparse, allowing S·B to be applied very quickly. For large datasets B this is particularly important, and distinguishes this approach from other transfer learning approaches, e.g., (Andrychowicz et al., 2016), which can be considerably slower in this context.
Learned sketches were first used in the data stream context for finding frequent items (Hsu et al., 2019) and have subsequently been applied to a number of other problems on large data. For example, Indyk et al. (2019) showed that learned sketches yield significantly smaller errors for low rank approximation. In (Dong et al., 2020), significant improvements to nearest neighbor search were obtained via learned sketches. More recently, Liu et al. (2020) extended learned sketches to several problems in numerical linear algebra, including least-squares and robust regression, as well as k-means clustering. Despite the number of problems that learned sketches have been applied to, they have not been applied to convex optimization in general. Given that such methods often require solving a large overdetermined least-squares problem in each iteration, one may hope to speed up each iteration using learned sketches. However, a number of natural questions arise: (1) how should we learn the sketch? (2) should we apply the same learned sketch in each iteration, or learn the sketch for the next iteration by training on a data set involving previously learned sketches from prior iterations?

Our Contributions. In this work we answer the above questions and derive the first learned sketches for a wide number of problems in convex optimization. Namely, we apply learned sketches to constrained least-squares problems, including LASSO, support vector machines (SVM), and matrix regression with nuclear norm constraints. We show empirically that learned sketches demonstrate superior accuracy over random oblivious sketches for each of these problems.
Specifically, compared with three classical sketches (Gaussian, COUNT-SKETCH and Sparse Johnson-Lindenstrauss Transforms; see definitions in Section 2), the learned sketches in each of the first few iterations:

• improve the LASSO error f(x) - f(x*) by 80% to 87% on two real-world datasets, where f(x) = (1/2)||Ax - b||_2^2 + ||x||_1;
• improve the dual SVM error f(x) - f(x*) by 10%-30% on a synthetic and a real-world dataset, and by 30%-40% on another real-world dataset, where f(x) = ||Bx||_2^2;
• improve the matrix estimation error f(X) - f(X*) by at least 30% on a synthetic dataset and at least 95% on a real-world dataset, where f(X) = ||AX - B||_F^2.

Therefore, the learned sketches attain a smaller error within the same number of iterations, and in fact within the same limit on the maximum runtime, since our sketches are extremely sparse (see below). We also study the general framework of convex optimization in (van den Brand et al., 2020), and show that for sketching-based preconditioning as well, learned sketches demonstrate considerable advantages. More precisely, by using a learned sketch with the same number of rows as an oblivious sketch, we are able to obtain a much better preconditioner with the same overall running time. All of our learned sketches S are extremely sparse: they contain a single non-zero entry per column. Following the previous work of (Indyk et al., 2019), we choose the position of the nonzero entry in each column uniformly at random, while the value of the nonzero entry is learned. This already demonstrates a significant advantage over non-learned sketches, and has a fast training time. Importantly, because of this sparsity, our sketches can be applied in input-sparsity time given a new optimization problem. We also provide several theoretical results, showing how to algorithmically use learned sketches in conjunction with random sketches so as to do no worse than random sketches.
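The sparse sketch family just described (formalized as COUNT-SKETCH-type sketches in Section 2) can be materialized in a few lines; here is a minimal numpy illustration (function name ours) of the (p, v) parameterization, where only v would be trained:

```python
import numpy as np

def sketch_from_pv(m, n, p, v):
    """Materialize the m x n COUNT-SKETCH-type sketch given by (p, v):
    column i has its single nonzero v[i] in row p[i]."""
    S = np.zeros((m, n))
    S[p, np.arange(n)] = v
    return S

rng = np.random.default_rng(1)
m, n = 6, 12
p = rng.integers(0, m, size=n)        # nonzero positions: fixed at random
v = rng.choice([-1.0, 1.0], size=n)   # classical COUNT-SKETCH values;
                                      # a learned sketch trains v instead
S = sketch_from_pv(m, n, p, v)
ones_per_col = np.count_nonzero(S, axis=0)
```

Keeping p fixed and learning only v is what allows the learned sketch to retain the one-nonzero-per-column structure, and hence the O(nnz(A)) application time.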

2. PRELIMINARIES

Classical Sketches. Below we review several classical sketches that have been used for solving optimization problems.

• Gaussian sketch: S = (1/√m) G, where G is an m × n Gaussian random matrix.
• COUNT-SKETCH: Each column of S has only a single non-zero entry. The position of the non-zero entry is chosen uniformly over the m entries in the column, and the value of the entry is either +1 or -1, each with probability 1/2. The columns are chosen independently.
• Sparse Johnson-Lindenstrauss Transform (SJLT): S is the vertical concatenation of s independent COUNT-SKETCH matrices, each of dimension m/s × n.

COUNT-SKETCH-type Sketch. A COUNT-SKETCH-type sketch is characterized by a tuple (m, n, p, v), where m, n are positive integers and p, v are n-dimensional real vectors, defined as follows. The sketching matrix S has dimensions m × n and S_{p_i, i} = v_i for all 1 ≤ i ≤ n, while all other entries of S are 0. When m and n are clear from context, we may characterize such a sketching matrix by (p, v) only.

Subspace Embeddings. For a matrix A ∈ R^{n×d}, we say a matrix S ∈ R^{m×n} is a (1 ± ε)-subspace embedding for the column span of A if (1 - ε)||Ax||_2 ≤ ||SAx||_2 ≤ (1 + ε)||Ax||_2 for all x ∈ R^d. The classical sketches above, with appropriate parameters, are all subspace embedding matrices with at least constant probability; our focus is on COUNT-SKETCH, which can be applied in input-sparsity time. We summarize the parameters needed for a subspace embedding below:

• Gaussian sketch: m = O(d/ε^2). It is a dense matrix, and computing SA costs O(m·nnz(A)) = O(nnz(A)d/ε^2) time.
• COUNT-SKETCH: m = O(d^2/ε^2) (Clarkson & Woodruff, 2017). Although the number of rows is quadratic in d/ε, the sketch matrix S is sparse and computing SA takes only O(nnz(A)) time.
• SJLT: m = O(d/ε^2) with s = O(1/ε) non-zeros per column (Nelson & Nguyên, 2013; Cohen, 2016). Computing SA takes O(s·nnz(A)) = O(nnz(A)/ε) time.

Iterative Hessian Sketch.
The Iterative Hessian Sketching (IHS) method (Pilanci & Wainwright, 2016) solves the constrained least-squares problem (1) by iteratively performing the update

x_{t+1} = arg min_{x∈C} (1/2)||S_{t+1} A(x - x_t)||_2^2 - ⟨A^T(b - Ax_t), x - x_t⟩,    (2)

where S_{t+1} is a sketching matrix. It is not difficult to see that for the unsketched version of the minimization above (S_{t+1} the identity matrix), the optimal solution x_{t+1} coincides with the optimal solution to the constrained least-squares problem (1). The IHS approximates the Hessian A^T A by a sketched version (S_{t+1}A)^T (S_{t+1}A) to improve the runtime, as S_{t+1}A typically has very few rows.

Unconstrained Convex Optimization. Consider an unconstrained convex optimization problem min_x f(x), where f is smooth and strongly convex and its Hessian ∇^2 f is Lipschitz continuous. This problem can be solved by Newton's method, which iteratively performs the update

x_{t+1} = x_t - arg min_z ||(∇^2 f(x_t)^{1/2})^T (∇^2 f(x_t)^{1/2}) z - ∇f(x_t)||_2,

provided it is given a good initial point x_0. In each step, it requires solving a regression problem of the form min_z ||A^T Az - y||_2, which, with access to A, can be solved with the fast regression solver of (van den Brand et al., 2020). The regression solver first computes a preconditioner R via a QR decomposition such that SAR^{-1} has orthonormal columns, where S is a sketching matrix, then solves z* = arg min_z ||(AR^{-1})^T (AR^{-1})z - y||_2 by gradient descent, and returns R^{-1}z* in the end. Here, the point of sketching is that the QR decomposition of SA can be computed much more efficiently than that of A, since SA has only a small number of rows.

Algorithm 1 LEARN-SKETCH: Gradient descent algorithm for learning the sketch values
Require: A_train = {A_1, ..., A_N} (A_i ∈ R^{n×d}), learning rate α
1: Randomly initialize p, v for a COUNT-SKETCH-type sketch
2: for t = 0 to steps do
3:   Form S using p, v
4:   Sample batch A_batch from A_train
5:   v ← v - α · ∂L(S, A_batch)/∂v
6: end for

Learning a Sketch.
We use the same learning algorithm as in (Liu et al., 2020), given in Algorithm 1. The algorithm aims to minimize the mean loss L(S, A) = (1/N) Σ_{i=1}^N L(S, A_i), where S is the learned sketch, L(S, A_i) is the loss of S applied to a data matrix A_i, and A = {A_1, . . . , A_N} is a (random) subset of the training data.
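A minimal numpy rendering of Algorithm 1 follows. Two stand-ins are ours and should be read as assumptions: the gradient in step 5 is computed by finite differences rather than autodiff, and the loss is an illustrative surrogate (how well the sketched Hessian (SA)^T(SA) matches A^T A) rather than the paper's per-task losses:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 30, 3, 6
train = [rng.standard_normal((n, d)) for _ in range(4)]  # samples from D

p = rng.integers(0, m, size=n)        # nonzero positions: random and fixed

def loss(v, batch):
    # Surrogate loss (our choice, for illustration): average Frobenius
    # mismatch between the sketched Hessian and the true Hessian.
    S = np.zeros((m, n))
    S[p, np.arange(n)] = v
    return np.mean([np.linalg.norm((S @ A).T @ (S @ A) - A.T @ A)
                    for A in batch])

v = rng.choice([-1.0, 1.0], size=n)   # COUNT-SKETCH initialization
alpha, eps = 1e-3, 1e-5
loss_before = loss(v, train)
for _ in range(60):                   # Algorithm 1's descent loop (full batch)
    grad = np.array([(loss(v + eps * e, train) - loss(v - eps * e, train))
                     / (2 * eps) for e in np.eye(n)])
    v = v - alpha * grad
loss_after = loss(v, train)
```

In practice one would backpropagate through the sketched solve with an autodiff framework and sample mini-batches, exactly as Algorithm 1 prescribes; the finite-difference loop above only makes the update v ← v - α ∂L/∂v concrete.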

3. HESSIAN SKETCH

In this section, we consider the minimization problem

min_{x∈C} (1/2)||SAx||_2^2 - ⟨A^T y, x⟩,    (3)

which is used as a subroutine for the IHS (cf. (2)). We present an algorithm with the learned sketch in Algorithm 2.

Algorithm 2 Solver for (3)
1: S_1 ← learned sketch, S_2 ← random sketch
2: (Ẑ_{i,1}, Ẑ_{i,2}) ← ESTIMATE(S_i, A), i = 1, 2
3: if Ẑ_{1,2}/Ẑ_{1,1} < Ẑ_{2,2}/Ẑ_{2,1} then
4:   x̂ ← solution of (3) with S = S_1
5: else
6:   x̂ ← solution of (3) with S = S_2
7: end if
8: return x̂
9: function ESTIMATE(S, A)
10:   T ← sparse (1 ± η)-subspace embedding matrix for d-dimensional subspaces
11:   (Q, R) ← QR(TA)
12:   Ẑ_1 ← σ_min(SAR^{-1})
13:   Ẑ_2 ← (1 ± η)-approximation to ||(SAR^{-1})^T(SAR^{-1}) - I||_op
14:   return (Ẑ_1, Ẑ_2)
15: end function

To analyze its performance, we let R denote the column space of A ∈ R^{n×d} and define the following quantities (corresponding exactly to the unconstrained case in Pilanci & Wainwright (2016)):

Z_1(S) = inf_{v ∈ R ∩ S^{n-1}} ||Sv||_2^2,    Z_2(S) = sup_{u,v ∈ R ∩ S^{n-1}} ⟨u, (S^T S - I_n)v⟩,

where S^{n-1} denotes the Euclidean unit sphere in R^n. The following is the estimation guarantee for Ẑ_1 and Ẑ_2. The proof is postponed to Appendix A.

Lemma 3.1. Suppose that η ∈ (0, 1/3) is a small constant, A is of full rank and S has O(d^2) rows. The function ESTIMATE(S, A) returns, in O(nnz(A) log(1/η) + poly(d/η)) time, values Ẑ_1, Ẑ_2 which with probability at least 0.99 satisfy Z_1(S)/(1+η) ≤ Ẑ_1 ≤ Z_1(S)/(1-η) and Z_2(S)/(1+η)^2 - 3η ≤ Ẑ_2 ≤ Z_2(S)/(1-η)^2 + 3η.
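The ESTIMATE subroutine and the selection rule of Algorithm 2 can be sketched in numpy as follows. Two simplifications are ours: the embedding matrices are dense Gaussian stand-ins, and Ẑ_2 is computed exactly rather than (1 ± η)-approximated:

```python
import numpy as np

def estimate(S, A, T):
    # Follows ESTIMATE in Algorithm 2: QR of the sketched matrix TA gives
    # the preconditioner R; then Z1-hat = sigma_min(S A R^{-1}) and Z2-hat
    # estimates ||(S A R^{-1})^T (S A R^{-1}) - I||_op (computed exactly
    # here for simplicity).
    _, R = np.linalg.qr(T @ A)
    B = (S @ A) @ np.linalg.inv(R)
    z1 = np.linalg.svd(B, compute_uv=False).min()
    z2 = np.linalg.norm(B.T @ B - np.eye(B.shape[1]), 2)
    return z1, z2

rng = np.random.default_rng(4)
n, d = 1500, 6
A = rng.standard_normal((n, d))
T = rng.standard_normal((200, n)) / np.sqrt(200)  # stand-in embedding
S1 = rng.standard_normal((300, n)) / np.sqrt(300)
S2 = rng.standard_normal((300, n)) / np.sqrt(300)
z11, z12 = estimate(S1, A, T)
z21, z22 = estimate(S2, A, T)
better = S1 if z12 / z11 < z22 / z21 else S2      # Algorithm 2's selection
```

The ratio Ẑ_2/Ẑ_1 is exactly the quantity that controls the IHS contraction factor, which is why comparing it across the learned and random sketches lets the algorithm do no worse than the random one.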

Note that for a matrix A, ||A||_op = sup_{x≠0} ||Ax||_2/||x||_2 denotes its operator norm. Similar to (Pilanci & Wainwright, 2016, Proposition 1), we have the following guarantee. The proof is postponed to Appendix B.

Theorem 3.2. Let η ∈ (0, 1/3) be a small constant. Suppose that A is of full rank and S_1 and S_2 are both COUNT-SKETCH-type sketches with O(d^2) rows. Algorithm 2 returns a solution x̂ which, with probability at least 0.98, satisfies

||A(x̂ - x*)||_2 ≤ (1+η)^4 (min{Ẑ_{1,2}/Ẑ_{1,1}, Ẑ_{2,2}/Ẑ_{2,1}} + 4η) ||Ax*||_2

in O(nnz(A) log(1/η) + poly(d/η)) time, where x* = arg min_{x∈C} ||Ax - b||_2 is the least-squares solution.

4. HESSIAN REGRESSION

Algorithm 3 Fast Regression Solver for (4)
1: S_1 ← learned sketch, S_2 ← random sketch
2: (Q_i, R_i) ← QR(S_i A), i = 1, 2
3: (σ̄_i, σ_i) ← EIG(AR_i^{-1}), i = 1, 2    ▷ EIG(B) returns the max and min singular values of B
4: if σ̄_1/σ_1 < σ̄_2/σ_2 then
5:   P ← R_1^{-1}, η ← 1/(σ̄_1^2 + σ_1^2)
6: else
7:   P ← R_2^{-1}, η ← 1/(σ̄_2^2 + σ_2^2)
8: end if
9: z_0 ← 0
10: while ||A^T AP z_t - y||_2 ≥ ε||y||_2 do
11:   z_{t+1} ← z_t - η(P^T A^T AP)(P^T A^T AP z_t - P^T y)
12: end while
13: return P z_t

In this section, we consider the minimization problem

min_z ||A^T Az - y||_2,    (4)

which is used as a subroutine for the unconstrained convex optimization problem min_x f(x) with A^T A = ∇^2 f(x) being the Hessian matrix (see Section 2). Here A ∈ R^{n×d} and y ∈ R^d. Theorem 4.1 states that Algorithm 3 returns a solution x̂ such that ||A^T A x̂ - y||_2 ≤ ε||y||_2 in O(nnz(A)) + O(nd · (min{σ̄_1/σ_1, σ̄_2/σ_2})^2 · log(κ(A)/ε) + poly(d)) time.

Remark 4.2. In Algorithm 3, S_2 can be chosen to be a subspace embedding matrix for d-dimensional subspaces, in which case AR_2^{-1} has condition number close to 1 (see, e.g., (Woodruff, 2014, p. 38)) and the full algorithm runs faster than the trivial O(nd^2)-time solver for (4).

Remark 4.3. For the original unconstrained convex optimization problem min_x f(x), one can run the entire optimization procedure with learned sketches as well as with random sketches, compare the objective values at the end, and choose the better of the two. For least-squares f(x) = (1/2)||Ax - b||_2^2, the value of f(x) can be approximated efficiently by a sparse subspace embedding matrix in O(nnz(A) + nnz(b) + poly(d)) time.
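The preconditioned iteration at the heart of Algorithm 3 can be sketched in numpy as follows. For brevity this version uses a single sketch (rather than choosing between a learned and a random one) and computes the extreme singular values of AR^{-1} exactly instead of via the sketched EIG subroutine; both simplifications are ours:

```python
import numpy as np

def fast_regression(A, y, S, eps=1e-10, max_iter=500):
    # Precondition with R from QR(SA), set the step size eta from the
    # extreme singular values of A R^{-1}, and iterate
    # z <- z - eta * M (M z - P^T y) with M = P^T A^T A P, as in
    # lines 9-13 of Algorithm 3.
    _, R = np.linalg.qr(S @ A)
    P = np.linalg.inv(R)
    sv = np.linalg.svd(A @ P, compute_uv=False)
    eta = 1.0 / (sv.max() ** 2 + sv.min() ** 2)
    M = P.T @ (A.T @ (A @ P))
    c = P.T @ y
    z = np.zeros(A.shape[1])
    for _ in range(max_iter):
        if np.linalg.norm(A.T @ (A @ (P @ z)) - y) < eps * np.linalg.norm(y):
            break
        z = z - eta * (M @ (M @ z - c))
    return P @ z

rng = np.random.default_rng(5)
n, d = 2000, 5
A = rng.standard_normal((n, d))
y = rng.standard_normal(d)
S = rng.standard_normal((100, n)) / np.sqrt(100)  # stand-in random sketch
x = fast_regression(A, y, S)
rel_res = np.linalg.norm(A.T @ (A @ x) - y) / np.linalg.norm(y)
```

The better the preconditioner (i.e., the smaller κ(AR^{-1})), the larger the admissible step size and the fewer gradient descent iterations are needed, which is exactly where a learned sketch helps.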

5. IHS EXPERIMENTS

Training. We learn the sketching matrix in each iteration (also called a round) separately. Recall that in the (t+1)-st round of the IHS, the optimization problem (2) we need to solve depends on x_t. An issue is that we do not know x_t, which is needed to generate the training data for the (t+1)-st round. Our solution is to use the sketching matrix from the previous round to obtain x_t by solving (2), and then use it to generate the training data for the next round. That is, in the first round, we train the sketching matrix S_1 to solve the problem x_1 = arg min_{x∈C} (1/2)||S_1 Ax||_2^2 - ⟨A^T b, x⟩ and use x_1 to generate the training data for the optimization problem for x_2, and so on. The loss function we use is the unsketched objective function in the (t+1)-st iteration, i.e., L(S_{t+1}, A) = (1/2)||A(x_{t+1} - x_t)||_2^2 - ⟨A^T(b - Ax_t), x_{t+1} - x_t⟩, where x_{t+1} is the solution to (2) and thus depends on S_{t+1}.

Comparison. We compare the learned sketch against three classical sketches: Gaussian, COUNT-SKETCH, and SJLT (see Section 2) in all experiments. The quantity we compare is a certain error, defined individually for each problem, in each round of the IHS or as a function of the runtime of the algorithm. All of our experiments are conducted on a laptop with a 1.90GHz CPU and 16GB RAM.

5.1. LASSO

We define an instance of LASSO regression as

x* = arg min_{x∈R^d} (1/2)||Ax - b||_2^2 + λ||x||_1,    (5)

where λ is a parameter. We use two real-world datasets:

• CO emission: the dataset contains 9 sensor measures aggregated over one hour (by means of average or sum), which can help to predict the CO emission. We divide the raw data into 120 pairs (A_i, b_i) such that A_i ∈ R^{300×9}, b_i ∈ R^{300×1}.

We choose λ = 1 in the LASSO regression (5). We choose m = 5d and m = 3d for the CO emission dataset, and m = 6d and m = 3.5d for the greenhouse gas dataset. We consider the error ((1/2)||Ax - b||_2^2 + ||x||_1) - ((1/2)||Ax* - b||_2^2 + ||x*||_1) and take an average over five independent trials.
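As a concrete reference for the IHS rounds used throughout this section, here is a minimal numpy version for the unconstrained case C = R^d (our simplification; the LASSO experiments solve the constrained subproblem instead, and use learned rather than Gaussian sketches):

```python
import numpy as np

def ihs(A, b, m, rounds, rng):
    # One IHS round (unconstrained case): minimize
    # (1/2)||S A (x - x_t)||^2 - <A^T(b - A x_t), x - x_t>,
    # whose closed-form solution is
    # x_{t+1} = x_t + (A^T S^T S A)^{-1} A^T (b - A x_t).
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(rounds):
        S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketch
        SA = S @ A
        x = x + np.linalg.solve(SA.T @ SA, A.T @ (b - A @ x))
    return x

rng = np.random.default_rng(2)
n, d = 2000, 8
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)
x_ihs = ihs(A, b, m=200, rounds=5, rng=rng)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
rel_err = np.linalg.norm(A @ (x_ihs - x_ls)) / np.linalg.norm(A @ x_ls)
```

Each round contracts the error toward the exact least-squares solution, so even a handful of rounds with a small sketch reaches high accuracy; the per-round error is exactly the quantity plotted in the figures of this section.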
We plot the mean errors on the two datasets on a (natural) logarithmic scale in Figures 1 to 4. We observe that the learned sketches consistently outperform the classical random sketches in all cases. Note that our sketch size is much smaller than the theoretical bound. For the smaller sketch size, classical random sketches converge slowly or do not yield stable results, while the learned sketches converge to a small error quickly. For the larger sketch sizes, all three classical sketches have approximately the same order of error, and the learned sketches reduce the error by 5/6 to 7/8 in all iterations.

5.2. SUPPORT VECTOR MACHINE

In the context of binary classification, a labeled sample is a pair (a_i, z_i), where a_i ∈ R^n is a vector representing a collection of features and z_i ∈ {-1, +1} is the associated class label. Given a set of labeled patterns {(a_i, z_i)}_{i=1}^d, the support vector machine (SVM) estimates the weight vector w* by minimizing

w* = arg min_{w∈R^n} (C/2) Σ_{i=1}^d g(z_i, w, a_i) + (1/2)||w||_2^2,

where C is a parameter. Here we use the squared hinge loss g(z_i, w, a_i) := (1 - z_i⟨w, a_i⟩)_+^2. The dual of this problem can be written as a constrained minimization problem (see, e.g., (Li et al., 2009; Pilanci & Wainwright, 2015)), x* := arg min_{x∈∆_d} ||Bx||_2^2, where ∆_d and B are defined from the data as described below.

We choose m = 10d in all experiments and define the error as ||Bx||_2^2 - ||Bx*||_2^2. For random sketches, we take the average error over five independent trials, and for learned sketches, over three independent trials. For the Gisette dataset, we use the learned sketch in all rounds. We plot the mean errors on the three datasets on a (natural) logarithmic scale in Figures 5 to 7. For the Gisette and random Gaussian datasets, using the learned sketches reduced the error by 10%-30%, and for the Swarm Behavior dataset, the learned sketches reduce the error by about 30%-40%.

5.3. MATRIX ESTIMATION WITH NUCLEAR NORM CONSTRAINT

In many applications, for the problem X* := arg min_{X∈R^{d_1×d_2}} ||AX - B||_F^2, it is reasonable to model the matrix X* as having low rank. Similar to ℓ_1-minimization for compressive sensing, a standard relaxation of the rank constraint is to constrain the nuclear norm of X, defined as ||X||_* := Σ_{j=1}^{min{d_1,d_2}} σ_j(X), where σ_j(X) is the j-th largest singular value of X. Hence, the matrix estimation problem we consider here is

X* := arg min_{X∈R^{d_1×d_2}} ||AX - B||_F^2  such that ||X||_* ≤ ρ,

where ρ > 0 is a user-defined radius serving as a regularization parameter. We conduct experiments on the following two datasets:

• Synthetic Dataset: We generate pairs (A_i, B_i) as B_i = A_i X*_i + W_i, where A_i ∈ R^{n×d_1} has i.i.d. N(0, 1) entries and X*_i ∈ R^{d_1×d_2}.
• Tunnel: A_i ∈ R^{13530×5}, B_i ∈ R^{13530×6}, |(A, B)_train| = 144, |(A, B)_test| = 36. The same dataset and parameters were also used in (Liu et al., 2020) for regression tasks.

In our nuclear norm constraint, we set ρ = 10. We choose m = 40, 50 for the synthetic dataset and m = 10, 50 for the Tunnel dataset, and define the error to be (1/2)(||AX - B||_F^2 - ||AX* - B||_F^2). For each data point, we take the average error over five independent trials. The mean errors on the two datasets when m = 50 are plotted on a (natural) logarithmic scale in Figures 8 and 9. We observe that the classical sketches yield approximately the same order of error, while the learned sketches improve the error by at least 30% for the synthetic dataset and, surprisingly, by at least 95% for the Tunnel dataset. The huge improvement on the Tunnel dataset may be due to the fact that the matrices A_i have many duplicate rows. We defer the results for m = 10 and 40 to Appendix D; they show that the learned sketches yield much smaller errors than the random sketches, and that the random sketches can converge significantly more slowly, with considerably larger errors in the first several rounds.
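Solving the constrained subproblem requires enforcing ||X||_* ≤ ρ. One standard way to do this (our illustration; the paper does not specify this particular solver) is to project onto the nuclear-norm ball by projecting the singular values onto the ℓ_1 ball:

```python
import numpy as np

def project_l1(s, rho):
    """Euclidean projection of a nonnegative vector s onto the l1 ball of
    radius rho (the standard simplex-projection routine)."""
    if s.sum() <= rho:
        return s
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(u) + 1) > css - rho)[0][-1]
    theta = (css[k] - rho) / (k + 1.0)
    return np.maximum(s - theta, 0.0)

def project_nuclear(X, rho):
    """Project X onto {X : ||X||_* <= rho} by soft-thresholding its
    singular values; usable inside a projected gradient solver for the
    constrained subproblem."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(project_l1(s, rho)) @ Vt

rng = np.random.default_rng(6)
X = rng.standard_normal((8, 6))       # nuclear norm well above rho
Y = project_nuclear(X, rho=3.0)
nuc = np.linalg.svd(Y, compute_uv=False).sum()
```

Since the projection only shrinks singular values, it tends to zero out the smallest ones, which is how the nuclear-norm constraint promotes low-rank solutions.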
As stated in Section 2, our learned sketch matrices S are all COUNT-SKETCH-type matrices (each column contains a single nonzero entry), so the product SA can be computed in O(nnz(A)) time and the overall algorithm is expected to be fast. To verify this, we plot error versus runtime for the SVM and nuclear-norm-constrained matrix estimation tasks in Figures 10 and 11 (corresponding to the datasets in Figures 7 and 9). The runtime consists only of the time for sketching and solving the optimization problem and does not include the time for loading the data. We run the same experiment three times, each time taking an average over all test data. From the plots we observe that the learned sketch and COUNT-SKETCH have the fastest runtimes, slightly faster than the SJLT and significantly faster than the Gaussian sketch.

6. FAST REGRESSION EXPERIMENT

We consider the unconstrained least-squares problem, i.e., (5) with λ = 0, using the CO emission and greenhouse gas datasets, and the following Census dataset:

• Census data: this dataset consists of annual salary and related features on people who reported that they worked 40 or more weeks in the previous year and worked 35 or more hours per week.

Training. We optimize the learned sketch S_1 by gradient descent (Algorithm 1) with L(S, A) = κ(AR_1^{-1}), where R_1 is computed as in Algorithm 3 and κ(M) denotes the condition number of a matrix M. Next we discuss how to generate the training data. Since we use Newton's method to solve an unconstrained convex optimization problem (see Section 2), in the t-th round we need to solve a regression problem min_z ||(∇^2 f(x_t)^{1/2})^T (∇^2 f(x_t)^{1/2}) z - ∇f(x_t)||_2. Reformulating it as min_z ||A^T Az - y||_2, we see that A and y depend on the previous solution x_t. Hence, we take x_t to be the solution obtained from Algorithm 3 using the learned sketch S_t, and this generates A and y for the (t+1)-st round.

Experiment. For the CO emission dataset, we set m = 70, and for the Census dataset, we set m = 500. For the η in Algorithm 3, we set η = 1 for the first round and η = 0.2 for the subsequent rounds for the CO emission dataset, and η = 1 for all rounds for the Census dataset. We leave the settings and results for the greenhouse gas dataset to Appendix E. We examine the accuracy of the subproblem (4) and define the error to be ||A^T A R^{-1} z_t - y||_2 / ||y||_2. We run the first three subroutines of solving the subproblem for the CO emission dataset and the Census dataset. The average error over three independent trials is plotted in Figures 12, 13 and 16.
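The training loss κ(AR^{-1}) used here is easy to evaluate; the following numpy sketch (function name ours) computes it for a given sketch, and contrasts a comfortably sized random sketch with a very small one to show why minimizing this loss matters:

```python
import numpy as np

def precond_cond(S, A):
    # The training loss of this section: kappa(A R^{-1}) with R from
    # QR(SA). A value close to 1 means the sketch captured the geometry
    # of A's column space, so Algorithm 3 converges in few iterations.
    _, R = np.linalg.qr(S @ A)
    sv = np.linalg.svd(A @ np.linalg.inv(R), compute_uv=False)
    return sv.max() / sv.min()

rng = np.random.default_rng(7)
A = rng.standard_normal((3000, 6))
S_big = rng.standard_normal((600, 3000)) / np.sqrt(600)    # comfortable m
S_small = rng.standard_normal((12, 3000)) / np.sqrt(12)    # tiny m
k_big = precond_cond(S_big, A)
k_small = precond_cond(S_small, A)
```

A learned sketch aims to match the small-m row budget of `S_small` while achieving a condition number closer to that of `S_big`, which is precisely what Algorithm 1 optimizes for in this section.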
We observe that for the CO emission dataset, the classical sketches have similar performance, while the learned sketches lead to fast convergence in the subroutine, with the first-round error at least 80% smaller; for the Census dataset, the learned sketch achieves the smallest error in all three rounds, reducing the error by about 60% in the first round and about 50% in the third round. Note that the learned sketch considerably outperforms COUNT-SKETCH in all cases.

We demonstrated the superiority of learned sketches over classical random sketches in the Iterative Hessian Sketching method and in fast regression solvers for unconstrained least-squares. Compared with random sketches, our learned sketches of the same size can considerably reduce the error in the loss function (i.e., f(x) - f(x*), where x is the output of the sketched algorithm and x* the optimal solution of the unsketched problem) for a given threshold on the maximum number of iterations or maximum runtime. Learned sketches also admit a smaller sketch size: when the sketch size is small, the algorithm with random sketches may fail to converge, or converge slowly, while the algorithm with learned sketches converges quickly.

A PROOF OF LEMMA 3.1

Suppose that AR^{-1} = UW, where U ∈ R^{n×d} has orthonormal columns, which form an orthonormal basis of the column space of A. Since T is a subspace embedding of the column space of A with probability 0.99, it holds for all x ∈ R^d that

(1/(1+η))||TAR^{-1}x||_2 ≤ ||AR^{-1}x||_2 ≤ (1/(1-η))||TAR^{-1}x||_2.

Since

||TAR^{-1}x||_2 = ||Qx||_2 = ||x||_2 and ||Wx||_2 = ||UWx||_2 = ||AR^{-1}x||_2,    (6)

we have that

(1/(1+η))||x||_2 ≤ ||Wx||_2 ≤ (1/(1-η))||x||_2 for all x ∈ R^d.    (7)

It is easy to see that

Z_1(S) = min_{x∈S^{d-1}} ||SUx||_2 = min_{y≠0} ||SUWy||_2/||Wy||_2,

and thus

min_{y≠0} (1-η)||SUWy||_2/||y||_2 ≤ Z_1(S) ≤ min_{y≠0} (1+η)||SUWy||_2/||y||_2.

Recall that SUW = SAR^{-1}. We see that

(1-η)σ_min(SAR^{-1}) ≤ Z_1(S) ≤ (1+η)σ_min(SAR^{-1}).

By definition, Z_2(S) = ||U^T(S^T S - I_n)U||_op.
It follows from (7) that

(1-η)^2 ||W^T U^T (S^T S - I_n) U W||_op ≤ Z_2(S) ≤ (1+η)^2 ||W^T U^T (S^T S - I_n) U W||_op,

and from (7), (6) and (Vershynin, 2012, Lemma 5.36) that ||(AR^{-1})^T (AR^{-1}) - I||_op ≤ 3η. Since ||W^T U^T (S^T S - I_n) U W||_op = ||(AR^{-1})^T (S^T S - I_n) AR^{-1}||_op and

||(AR^{-1})^T S^T S AR^{-1} - I||_op - ||(AR^{-1})^T (AR^{-1}) - I||_op ≤ ||(AR^{-1})^T (S^T S - I_n) AR^{-1}||_op ≤ ||(AR^{-1})^T S^T S AR^{-1} - I||_op + ||(AR^{-1})^T (AR^{-1}) - I||_op,

it follows that

(1-η)^2 ||(SAR^{-1})^T SAR^{-1} - I||_op - 3(1-η)^2 η ≤ Z_2(S) ≤ (1+η)^2 ||(SAR^{-1})^T SAR^{-1} - I||_op + 3(1+η)^2 η.

We have so far proved the correctness of the approximation; we analyze the runtime below.

B PROOF OF THEOREM 3.2

By Lemma 3.1, with probability at least 0.99,

Ẑ_2/Ẑ_1 ≥ (Z_2(S)/(1+η)^2 - 3η) / (Z_1(S)/(1-η)) ≥ ((1-η)/(1+η)^2) · Z_2(S)/Z_1(S) - 3η/Z_1(S).

When S is a random subspace embedding, it holds with probability at least 0.99 that Z_1(S) ≥ 3/4, and so, by a union bound, it holds with probability at least 0.98 that

Ẑ_2/Ẑ_1 ≥ (1/(1+η)^4) · Z_2(S)/Z_1(S) - 4η, or, Z_2(S)/Z_1(S) ≤ (1+η)^4 (Ẑ_2/Ẑ_1 + 4η).

The correctness of our claim then follows from (Pilanci & Wainwright, 2016, Proposition 1), together with the fact that S_2 is a random subspace embedding. The runtime follows from Lemma 3.1 and (Cormode & Dickens, 2019, Theorem 2.2).

C PROOF OF THEOREM 4.1

The proof follows an almost identical argument to that of (van den Brand et al., 2020, Lemma B.1). In (van den Brand et al., 2020), it is assumed (in our notation) that 3/4 ≤ σ_min(AP) ≤ σ_max(AP) ≤ 5/4, and thus one can set η = 1 in Algorithm 3 and achieve linear convergence. The only difference is that here we estimate σ_min(AP) and σ_max(AP) and set the step size η in the gradient descent algorithm accordingly.
By standard bounds for gradient descent (see, e.g., (Boyd & Vandenberghe, 2004, p. 468)), with a choice of step size η = 2/(σ_max^2(AP) + σ_min^2(AP)), after O((σ_max(AP)/σ_min(AP))^2 log(1/ε)) iterations we can find z_t such that

||P^T A^T AP (z_t - z*)||_2 ≤ ε ||P^T A^T AP (z_0 - z*)||_2,

where z* = arg min_z ||P^T A^T AP z - P^T y||_2 is the optimal least-squares solution. This establishes Eq. (11) in the proof in (van den Brand et al., 2020), and the rest of the proof follows as there.



https://archive.ics.uci.edu/ml/datasets/Gas+Turbine+CO+and+NOx+Emission+Data+Set
https://archive.ics.uci.edu/ml/datasets/Greenhouse+Gas+Observing+Network
https://archive.ics.uci.edu/ml/datasets/Swarm+Behaviour
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#gisette
https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+exposed+to+turbulent+gas+mixtures
https://github.com/chocjy/randomized-quantile-regression-solvers/tree/master/matlab/data



and we have access to A. We incorporate a learned sketch into the fast regression solver in (van den Brand et al., 2020) and present the algorithm in Algorithm 3. Here the subroutine EIG(B) applies a (1 + η)-subspace embedding sketch T to B for some small constant η and returns the maximum and minimum singular values of TB. Since B has the form AR^{-1}, the sketched matrix TB can be calculated as (TA)R^{-1} and thus can be computed in O(nnz(A) + poly(d)) time if T is a COUNT-SKETCH matrix with O(d^2) rows. The extreme singular values of TB can be found by an SVD or by the Lanczos algorithm. Similar to Lemma 4.2 in (van den Brand et al., 2020), we have the following guarantee for Algorithm 3. The proof parallels the proof in (van den Brand et al., 2020) and is postponed to Appendix C. Theorem 4.1. Suppose that S_1 and S_2 are both COUNT-SKETCH-type sketches with O(d^2) rows.

Figure 1: Test error of LASSO on CO emissions dataset, m = 5d

Figure 3: Test error of LASSO on greenhouse gas dataset, m = 6d

Figure 4: Test error of LASSO on greenhouse gas dataset, m = 3.5d

The data in each matrix is sorted in chronological order. |(A, b)_train| = 96, |(A, b)_test| = 24.
• Greenhouse gas²: a time series of measured greenhouse gas concentrations in the California atmosphere. Each (A, b) corresponds to a different measurement location. A_i ∈ R^{327×14}, b_i ∈ R^{327×1}, and |(A, b)_train| = 400, |(A, b)_test| = 100. (This dataset was also used in (Liu et al., 2020).)

Figure 5: Test error of SVM on random Gaussian dataset
Figure 6: Test error of SVM on swarm behavior dataset
Figure 7: Test error of SVM on Gisette dataset

∆_d = {x ∈ R^d : x ≥ 0 and ‖x‖₁ = 1}, the positive simplex in R^d. Here B = [AD; (1/√C) I_d] ∈ R^{(n+d)×d} (the matrix AD stacked on top of (1/√C) I_d), where A is an n × d matrix with a_i ∈ R^n as its i-th column and D = diag(z) is a d × d diagonal matrix. We conduct experiments on the following three datasets:
• Random Gaussian (synthetic): We follow the same construction as in Pilanci & Wainwright (2016). We generate a two-component Gaussian mixture model, based on the component distributions N(µ₀, I) and N(µ₁, I), where µ₀ and µ₁ are uniformly distributed in [−3, 3]. Placing equal weights on each component, we draw d samples from this mixture distribution.
• Swarm behavior³: Each instance in the dataset has n = 2400 features, and the task is to predict whether the instance is flocking or not flocking. We use only the first 6000 instances of the raw data and divide them into 200 smaller groups of instances. Each group contains d = 30 instances, corresponding to a B_i of size 2430 × 30. The training data consists of 160 groups and the test data consists of 40 groups.
• Gisette⁴: Gisette is a handwritten digit recognition problem which asks to separate the highly confusable digits '4' and '9'. Each instance has n = 5000 features. The raw data is divided into 200 smaller groups, where each contains d = 30 instances and corresponds to a B_i of size 5030 × 30. The training data consists of 160 groups and the test data consists of 40 groups.
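Under the reading of the (garbled) definition above as B = [AD; (1/√C) I_d] stacked vertically, the matrix used in the SVM experiments could be assembled as follows; the function name and the stacked form are our assumptions, not code from the paper:

```python
import numpy as np

def build_B(A, z, C):
    """Assemble B = [A @ diag(z); (1/sqrt(C)) I_d], an (n + d) x d matrix,
    from the n x d data matrix A, the weight vector z, and the
    regularization parameter C."""
    n, d = A.shape
    return np.vstack([A @ np.diag(z), np.eye(d) / np.sqrt(C)])
```

Note that A @ diag(z) simply rescales the i-th column of A by z_i, so in practice one would broadcast (A * z) instead of forming the dense diagonal matrix.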

Figure 8: Test error of matrix estimation on synthetic data, m = 50

is a matrix with rank at most r, and W_i is noise with i.i.d. N(0, σ²) entries. Here we set n = 500, d₁ = d₂ = 7, r = 3, ρ = 30, and |(A, B)_train| = 270, |(A, B)_test| = 30.
• Tunnel⁵: The dataset is a time series of gas concentrations measured by eight sensors in a wind tunnel. Each (A, B) corresponds to a different data collection trial.
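One way the synthetic instances described above could be generated is sketched below; the Gaussian design for A is our assumption (the text specifies only the noise model and the rank constraint), and all names are illustrative:

```python
import numpy as np

def make_pair(n, d1, d2, r, sigma, rng):
    """One synthetic (A, B) pair: B = A X* + W, where X* in R^{d1 x d2}
    has rank at most r and W has i.i.d. N(0, sigma^2) entries.
    Drawing A with Gaussian entries is an assumption made here."""
    A = rng.standard_normal((n, d1))
    # A product of d1 x r and r x d2 factors has rank at most r.
    X_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
    W = sigma * rng.standard_normal((n, d2))
    return A, A @ X_star + W, X_star
```

With the paper's parameters (n = 500, d₁ = d₂ = 7, r = 3), each pair is a 500 × 7 design matrix and a 500 × 7 noisy response.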

Figure 10: Test error of SVM on Gisette dataset

randomly sample 5000 instances to create (A_i, b_i), where A_i ∈ R^{5000×11} and b_i ∈ R^{5000×1}, with |(A, b)_train| = 160, |(A, b)_test| = 40.

Figure 12: Test error of fast regression for the CO emission dataset, first three calls in solving an unconstrained least-squares

Figure 13: Test error of fast regression for the Census dataset, first three calls in solving an unconstrained least-squares

S and T are sparse, computing SA and T A takes O(nnz(A)) time. The QR decomposition of T A, which is a matrix of size poly(d/η) × d, can be computed in poly(d/η) time. The matrix SAR⁻¹ can be computed in poly(d) time; since it has size poly(d) × d, its smallest singular value can be computed in poly(d) time. To approximate Z₂(S), we can use the power method to estimate ‖(SAR⁻¹)ᵀSAR⁻¹ − I‖_op up to a (1 ± η)-factor in O((nnz(A) + poly(d)) log(1/η)) time.
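The power-method step above can be sketched as follows (an illustrative implementation with our own names): since M = SAR⁻¹ has few columns, each iteration needs only two matrix-vector products with M and never forms MᵀM, and the operator norm of the symmetric matrix MᵀM − I is the largest eigenvalue magnitude that power iteration converges to:

```python
import numpy as np

def opnorm_est(M, n_iter=300, seed=0):
    """Power iteration estimating ||M^T M - I||_op using only
    matrix-vector products with M and M^T."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(M.shape[1])
    x /= np.linalg.norm(x)
    est = 0.0
    for _ in range(n_iter):
        y = M.T @ (M @ x) - x        # apply G = M^T M - I to x
        est = np.linalg.norm(y)      # ||G x|| -> |lambda_max(G)| = ||G||_op
        if est == 0.0:
            return 0.0
        x = y / est                  # renormalize for the next iteration
    return est
```

A fixed iteration count is used here for simplicity; to match the stated O(log(1/η)) dependence one would instead stop once successive estimates agree to within a (1 ± η) factor.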

IHS EXPERIMENT: MATRIX ESTIMATION WITH NUCLEAR NORM CONSTRAINT

As stated in Section 5.3, the mean errors for the two datasets, with m = 40 for the synthetic dataset and m = 10 for the Tunnel dataset, are plotted on a (natural) logarithmic scale in Figures 14 and 15.

Figure 14: Test error of matrix estimation on synthetic dataset, m = 40

