LEARNING-AUGMENTED SKETCHES FOR HESSIANS

Abstract

Sketching is a dimensionality reduction technique in which one compresses a matrix by taking a small number of (often random) linear combinations of its rows. A line of work has shown how to sketch the Hessian to speed up each iteration of a second order method, but such sketches usually depend only on the matrix at hand, and in a number of cases are even oblivious to the input matrix. One could instead hope to learn a distribution on sketching matrices that is optimized for the specific distribution of input matrices. We show how to design learned sketches for the Hessian in the context of second order methods, where we learn potentially different sketches for the different iterations of an optimization procedure. We show empirically that learned sketches, compared with their "non-learned" counterparts, improve the approximation accuracy for important problems, including LASSO, SVM, and matrix estimation with nuclear norm constraints. Several of our schemes can be proven to perform no worse than their unlearned counterparts.

1. INTRODUCTION

Large-scale optimization problems are abundant, and solving them efficiently requires powerful tools to make the computation practical. This is especially true of second order methods, which are often less practical than first order ones. Although second order methods may require many fewer iterations, each iteration could involve inverting a large Hessian, which takes cubic time; in contrast, first order methods such as stochastic gradient descent take linear time per iteration. To make each iteration of a second order method faster, a large body of work has looked at dimensionality reduction techniques, such as sampling, sketching, or approximating the Hessian by a low rank matrix. See, for example, (Gower et al., 2016; Xu et al., 2016; Pilanci & Wainwright, 2016; 2017; Doikov & Richtárik, 2018; Gower et al., 2018; Roosta-Khorasani & Mahoney, 2019; Gower et al., 2019; Kylasa et al., 2019; Xu et al., 2020; Li et al., 2020). Our focus is on sketching techniques, which often consist of multiplying the Hessian by a random matrix chosen independently of the Hessian. Sketching has a long history in theoretical computer science (see, e.g., (Woodruff, 2014) for a survey), and we describe such methods in more detail below. A special case of sketching is sampling, which in practice is often uniform sampling, and hence oblivious to properties of the actual matrix. Other times the sampling is non-uniform, based on squared norms of submatrices of the Hessian or on the so-called leverage scores of the Hessian. In particular, we follow the framework of (Pilanci & Wainwright, 2016; 2017), which introduced the Iterative Hessian Sketch and the Newton Sketch, as well as the high-accuracy refinement given in (van den Brand et al., 2020). If one runs Newton's method to find a point where the gradient is zero, then in each iteration one needs to solve an equation involving the current Hessian and gradient to find the update direction.
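To make the per-iteration cost concrete, the following minimal sketch (illustrative only, not from the paper; numpy-based, with a toy quadratic objective) shows a plain Newton iteration whose per-step bottleneck is the dense Hessian solve:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Undamped Newton's method: each step solves H(x) p = -g(x)."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        H = hess(x)
        # This dense d x d solve is the O(d^3) step that sketching aims to speed up.
        p = np.linalg.solve(H, -g)
        x = x + p
    return x

# Toy example: minimize f(x) = (1/2)||Ax - b||^2, whose Hessian is A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)
grad = lambda x: A.T @ (A @ x - b)
hess = lambda x: A.T @ A
x_star = newton(grad, hess, np.zeros(5))
```

On this quadratic objective, a single Newton step recovers the least-squares solution exactly, which is why the iteration terminates almost immediately.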
When the Hessian can be decomposed as AᵀA for an n × d matrix A with n ≫ d, sketching is particularly suitable. The Iterative Hessian Sketch was proposed in (Pilanci & Wainwright, 2016), where A is replaced with S · A for a random matrix S, which could be i.i.d. Gaussian or drawn from a more structured family of random matrices such as Subsampled Randomized Hadamard Transforms or COUNT-SKETCH matrices; the latter was done in (Cormode & Dickens, 2019). The Newton Sketch was proposed in (Pilanci & Wainwright, 2017), which extended sketching methods beyond constrained least-squares problems to any twice differentiable function subject to a closed convex constraint set. Using this sketch inside of interior point updates has led to much faster algorithms for an extensive body of convex optimization problems (Pilanci & Wainwright, 2017). By instead using sketching as a preconditioner, an application of the work of (van den Brand et al., 2020) (see Appendix E) improved the dependence on the accuracy parameter ε to logarithmic. In general, the idea behind sketching is the following. One chooses a random matrix S, drawn from a certain family of random matrices, and computes SA. If A is tall-and-skinny, then S is short-and-fat, and thus SA is a small, roughly square matrix. Moreover, SA preserves important properties of A. One typically desired property is that S is a subspace embedding, meaning that simultaneously for all x, one has ‖SAx‖₂ = (1 ± ε)‖Ax‖₂. An observation exploited in (Cormode & Dickens, 2019), building off of the COUNT-SKETCH random matrices S introduced in randomized linear algebra by (Clarkson & Woodruff, 2017), is that if S contains O(1) non-zero entries per column, then SA can be computed in O(nnz(A)) time, where nnz(A) denotes the number of nonzero entries of A. This is sometimes referred to as input-sparsity running time.
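As an illustration of input-sparsity-time sketching, here is a minimal COUNT-SKETCH implementation (a simplified sketch, not from the paper: each row of A is hashed to one of m buckets with a random sign, using dense numpy arrays), which applies S to A in a single pass over the rows:

```python
import numpy as np

def countsketch(A, m, seed=0):
    """Apply a COUNT-SKETCH matrix S (one +/-1 entry per column of S) to A.

    Since S has a single nonzero per column, computing S @ A reduces to a
    signed scatter-add of the rows of A, i.e., O(nnz(A)) work overall.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    rows = rng.integers(0, m, size=n)            # hash each row of A to a bucket
    signs = rng.choice([-1.0, 1.0], size=n)      # random sign per row
    SA = np.zeros((m, d))
    np.add.at(SA, rows, signs[:, None] * A)      # one pass over the rows of A
    return SA

rng = np.random.default_rng(1)
A = rng.standard_normal((10000, 20))
SA = countsketch(A, m=500)                       # 10000 x 20 compressed to 500 x 20
x = rng.standard_normal(20)
# For m sufficiently large, ||SAx||_2 concentrates around ||Ax||_2.
ratio = np.linalg.norm(SA @ x) / np.linalg.norm(A @ x)
```

The choice m = 500 here is arbitrary; the subspace embedding guarantee for COUNT-SKETCH requires m to grow with d²/ε².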
Each iteration of a second order method often involves solving an equation of the form AᵀAx = Aᵀb, where AᵀA is the Hessian and b is the gradient. For a number of problems, one has access to a matrix A ∈ ℝ^{n×d} with n ≫ d, which is also an assumption made in (Pilanci & Wainwright, 2017). Therefore, the solution x is the minimizer of a constrained least squares regression problem: min_{x∈C} (1/2)‖Ax − b‖₂², where C is a convex constraint set in ℝ^d. For the unconstrained case (C = ℝ^d), various classical sketches that attain the subspace embedding property provably yield high-accuracy approximate solutions (see, e.g., (Sarlos, 2006; Nelson & Nguyên, 2013; Cohen, 2016; Clarkson & Woodruff, 2017)); for the general constrained case, the Iterative Hessian Sketch (IHS) was proposed by (Pilanci & Wainwright, 2016) as an effective approach, and (Cormode & Dickens, 2019) employed sparse sketches to achieve input-sparsity running time for IHS. All sketches used in these results are data-oblivious random sketches.

Learned Sketching. In the last few years, an exciting new notion of learned sketching has emerged. Here the idea is that one often sees independent samples of matrices A from a distribution D, and can train a model on these samples to learn the entries of a sketching matrix S. When given a future sample B, also drawn from D, the learned sketching matrix S will be such that S · B is a much more accurate compression of B than if S had the same number of rows but were instead drawn without knowledge of D. Moreover, the learned sketch S is often sparse, allowing S · B to be applied very quickly. For large datasets B this is particularly important, and it distinguishes this approach from other transfer learning approaches, e.g., (Andrychowicz et al., 2016), which can be considerably slower in this context.
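For the unconstrained case, the IHS iteration described above can be sketched as follows (a simplified Gaussian-sketch instance for illustration only; the sketch size and iteration count are arbitrary choices, not tuned values from the paper):

```python
import numpy as np

def ihs_lstsq(A, b, m, iters=15, seed=0):
    """Iterative Hessian Sketch for unconstrained least squares.

    Each iteration draws a fresh m x n Gaussian sketch S_t, forms the sketched
    Hessian (S_t A)^T (S_t A), and solves a small d x d system against the
    exact gradient A^T (b - A x).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SA = S @ A                              # m x d, with m << n
        g = A.T @ (b - A @ x)                   # exact gradient of the residual
        u = np.linalg.solve(SA.T @ SA, g)       # sketched Hessian solve, only d x d
        x = x + u
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((2000, 10))
b = rng.standard_normal(2000)
x_ihs = ihs_lstsq(A, b, m=200)
```

Because the error contracts geometrically at a rate governed by roughly √(d/m) per iteration, a modest number of iterations already brings x close to the exact least-squares solution.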
Learned sketches were first used in the data stream context for finding frequent items (Hsu et al., 2019) and have subsequently been applied to a number of other problems on large data. For example, Indyk et al. (2019) showed that learned sketches yield significantly smaller errors for low rank approximation. In (Dong et al., 2020), significant improvements to nearest neighbor search were obtained via learned sketches. More recently, Liu et al. (2020) extended learned sketches to several problems in numerical linear algebra, including least-squares and robust regression, as well as k-means clustering. Despite the number of problems that learned sketches have been applied to, they have not been applied to convex optimization in general. Given that such methods often require solving a large overdetermined least squares problem in each iteration, one can hope to speed up each iteration using learned sketches. However, a number of natural questions arise: (1) how should we learn the sketch? (2) should we apply the same learned sketch in each iteration, or learn it in the next iteration by training on a data set involving previously learned sketches from prior iterations?

Our Contributions. In this work we answer the above questions and derive the first learned sketches for a wide range of problems in convex optimization. Namely, we apply learned sketches to constrained least-squares problems, including LASSO, support vector machines (SVM), and matrix regression with nuclear norm constraints. We show empirically that learned sketches demonstrate superior accuracy over random oblivious sketches for each of these problems. Specifically, compared with three classical sketches (Gaussian, COUNT-SKETCH and Sparse Johnson-Lindenstrauss Transforms; see definitions in Section 2), the learned sketches in each of the first few iterations

