ADAPTIVE SINGLE-PASS STOCHASTIC GRADIENT DESCENT IN INPUT SPARSITY TIME

Abstract

We study sampling algorithms for variance reduction methods in stochastic optimization. Although stochastic gradient descent (SGD) is widely used for large-scale machine learning, it sometimes experiences slow convergence rates due to the high variance of uniform sampling. In this paper, we introduce an algorithm that approximately samples a gradient from the optimal distribution for a common finite-sum form with $n$ terms, while making just a single pass over the data, using input sparsity time and $\widetilde{O}(Td)$ space. Our algorithm can be implemented in big data models such as the streaming and distributed models. Moreover, we show that our algorithm can be generalized to approximately sample Hessians and thus provides variance reduction for second-order methods as well. We demonstrate the efficiency of our algorithm on large-scale datasets.

1. INTRODUCTION

There has recently been tremendous progress in variance reduction methods for stochastic gradient descent (SGD) for the standard convex finite-sum optimization problem $\min_{x\in\mathbb{R}^d} F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$, where $f_1,\ldots,f_n:\mathbb{R}^d\to\mathbb{R}$ are convex functions that commonly represent loss functions. Whereas gradient descent (GD) performs the update rule $x_{t+1} = x_t - \eta_t\nabla F(x_t)$ on the iterate $x_t$ at iterations $t = 1, 2, \ldots$, SGD (Robbins & Monro, 1951; Nemirovsky & Yudin, 1983; Nemirovski et al., 2009) picks $i_t\in[n]$ in iteration $t$ with probability $p_{i_t}$ and performs the update rule $x_{t+1} = x_t - \frac{\eta_t}{n p_{i_t}}\nabla f_{i_t}(x_t)$, where $\nabla f_{i_t}$ is the gradient (or a subgradient) of $f_{i_t}$ and $\eta_t$ is a predetermined learning rate. Effectively, training example $i_t$ is sampled with probability $p_{i_t}$ and the model parameters are updated using the selected example. The SGD update rule requires the computation of only a single gradient at each iteration and provides an unbiased estimator of the full gradient, whereas GD evaluates $n$ gradients at each iteration, which is prohibitively expensive for large $n$. However, since SGD is often performed with uniform sampling, so that the probability $p_{i,t}$ of choosing index $i\in[n]$ at iteration $t$ is $p_{i,t} = \frac{1}{n}$ at all times, the variance introduced by the randomness of sampling a specific vector function can be a bottleneck for the convergence rate of the iterative process. Thus the subject of variance reduction beyond uniform sampling has been well-studied in recent years (Roux et al., 2012; Johnson & Zhang, 2013; Defazio et al., 2014; Reddi et al., 2015; Zhao & Zhang, 2015; Daneshmand et al., 2016; Needell et al., 2016; Stich et al., 2017; Johnson & Guestrin, 2018; Katharopoulos & Fleuret, 2018; Salehi et al., 2018; Qian et al., 2019).
A common technique to reduce variance is importance sampling, where the probabilities $p_{i,t}$ are chosen so that vector functions with larger gradients are more likely to be sampled. For a random vector $v$, let $\mathrm{Var}(v) := \mathbb{E}\left[\|v\|_2^2\right] - \|\mathbb{E}[v]\|_2^2$. Then $p_{i,t} = \frac{1}{n}$ for uniform sampling implies
$$\sigma_t^2 = \mathrm{Var}\left(\frac{1}{n p_{i_t,t}}\nabla f_{i_t}\right) = \frac{1}{n^2}\left(n\sum_{i=1}^n \|\nabla f_i(x_t)\|^2 - n^2\cdot\|\nabla F(x_t)\|^2\right),$$
whereas importance sampling with $p_{i,t} = \frac{\|\nabla f_i(x_t)\|}{\sum_{j=1}^n \|\nabla f_j(x_t)\|}$ gives
$$\sigma_t^2 = \mathrm{Var}\left(\frac{1}{n p_{i_t,t}}\nabla f_{i_t}\right) = \frac{1}{n^2}\left(\left(\sum_{i=1}^n \|\nabla f_i(x_t)\|\right)^2 - n^2\cdot\|\nabla F(x_t)\|^2\right),$$
which is at most $\frac{1}{n^2}\left(n\sum_{i=1}^n \|\nabla f_i(x_t)\|^2 - n^2\cdot\|\nabla F(x_t)\|^2\right)$ by the Root Mean Square-Arithmetic Mean Inequality, and can be significantly less. Hence the variance at each step is reduced, possibly substantially (e.g., Example 1.3 and Example 1.4), by performing importance sampling instead of uniform sampling. In fact, it follows from the Cauchy-Schwarz inequality that the above importance sampling distribution is the optimal distribution for variance reduction. However, computing this probability distribution requires computing all $n$ gradients in each round, which is too expensive in the first place.

Second-Order Methods. Although first-order methods such as SGD are widely used, they do sometimes have issues such as sensitivity to the choice of hyperparameters, stagnation at high training errors, and difficulty in escaping saddle points. By considering second-order information such as curvature, second-order optimization methods are known to be robust to several of these issues, such as ill-conditioning. For example, Newton's method can achieve a locally superlinear convergence rate under certain conditions, independent of the condition number of the problem.
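The uniform- versus importance-sampling variance formulas above can be checked numerically. The following is a small sketch (not the paper's data structure) for gradients of the form $\nabla f_i = \langle a_i, x\rangle\cdot a_i$; all names are illustrative.

```python
import numpy as np

# Numerical check of the variance formulas above, for gradients of the
# form grad_i = <a_i, x> * a_i; this brute-force version computes every
# gradient, which is exactly the cost the paper's sketch avoids.
rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))
x = rng.standard_normal(d)

grads = (A @ x)[:, None] * A          # row i holds grad_i = <a_i, x> * a_i
norms = np.linalg.norm(grads, axis=1)
full_grad = grads.mean(axis=0)        # grad F = (1/n) sum_i grad_i

def estimator_variance(p):
    """Var of (1/(n p_i)) grad_i when index i is drawn with probability p_i."""
    second_moment = np.sum(p * (norms / (n * p)) ** 2)
    return second_moment - np.dot(full_grad, full_grad)

var_uniform = estimator_variance(np.full(n, 1.0 / n))
var_importance = estimator_variance(norms / norms.sum())

# Importance sampling matches (1/n^2)[(sum_i ||grad_i||)^2 - n^2 ||grad F||^2].
closed_form = (norms.sum() ** 2 - n ** 2 * np.dot(full_grad, full_grad)) / n ** 2
assert np.isclose(var_importance, closed_form)
assert var_importance <= var_uniform + 1e-9   # Cauchy-Schwarz
```

The final assertion is exactly the Root Mean Square-Arithmetic Mean comparison in the text.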
Although naïve second-order methods are generally too slow compared to common first-order methods, stochastic Newton-type methods such as Gauss-Newton have been shown to be scalable in the scientific computing community (Roosta-Khorasani et al., 2014; Roosta-Khorasani & Mahoney, 2016a;b; Xu et al., 2019; 2020).

Our Contributions. We give a time-efficient algorithm that provably approximates the optimal importance sampling distribution using a small-space data structure. Remarkably, our data structure can be implemented in big data models such as the streaming model, in which it takes just a single pass over the data, and the distributed model, in which it requires just a single round of communication between parties holding each loss function. For $\nabla F = \frac{1}{n}\sum\nabla f_i(x)$, where each $\nabla f_i = f(\langle a_i, x\rangle)\cdot a_i$ for some polynomial $f$ and vector $a_i\in\mathbb{R}^d$, let $\mathrm{nnz}(A)$ be the number of nonzero entries of $A := a_1\bullet\cdots\bullet a_n$.¹ Thus for $T$ iterations, where $d \ll T \ll n$, GD has runtime $\widetilde{O}(T\cdot\mathrm{nnz}(A))$ while our algorithm has runtime $T\cdot\mathrm{poly}(d,\log n) + \widetilde{O}(\mathrm{nnz}(A))$, where we use $\widetilde{O}(\cdot)$ to suppress polylogarithmic factors.

Theorem 1.1 Let $\nabla F = \frac{1}{n}\sum\nabla f_i(x)$, where each $\nabla f_i = f(\langle a_i, x\rangle)\cdot a_i$ for some polynomial $f$ and vector $a_i\in\mathbb{R}^d$, and let $\mathrm{nnz}(A)$ be the number of nonzero entries of $A := a_1\bullet\cdots\bullet a_n$. For $d \ll T \ll n$, there exists an algorithm that performs $T$ steps of SGD and at each step samples a gradient within a constant factor of the optimal probability distribution. The algorithm requires a single pass over $A$ and uses $\widetilde{O}(\mathrm{nnz}(A))$ pre-processing time and $\widetilde{O}(Td)$ space.

Theorem 1.1 can be used to immediately obtain improved convergence guarantees for classes of functions whose convergence rate depends on the variance $\sigma_t^2$, such as $\mu$-smooth functions or strongly convex functions. Recall that SGD offers the following convergence guarantees for smooth functions:

Theorem 1.2 (Nemirovski et al., 2009; Meka, 2017) Let $F$ be a $\mu$-smooth convex function and $x_{opt} = \mathrm{argmin}_x F(x)$.
Let $\sigma^2$ be an upper bound on the variance of the unbiased estimator across all iterations and let $\overline{x}_k = \frac{x_1 + \cdots + x_k}{k}$. Let each step-size $\eta_t$ be $\eta \le \frac{1}{\mu}$. Then for SGD with initial position $x_0$,
$$\mathbb{E}\left[F(\overline{x}_k) - F(x_{opt})\right] \le \frac{1}{2\eta k}\|x_0 - x_{opt}\|_2^2 + \frac{\eta\sigma^2}{2},$$
so that $k = O\left(\frac{\sigma^2}{\epsilon^2} + \frac{\mu\|x_0 - x_{opt}\|_2^2}{\epsilon}\right)$ iterations suffice to obtain an $\epsilon$-approximate optimal value by setting $\eta = \frac{1}{\sqrt{k}}$.

In the convergence guarantees of Theorem 1.2, we obtain a constant factor approximation to the variance $\sigma = \sigma_{opt}$ from optimal importance sampling, which can be significantly better than the variance $\sigma = \sigma_{uniform}$ from uniform sampling in standard SGD. We first show straightforward examples where uniformly sampling an index performs significantly worse than importance sampling. For example, if $\nabla f_i(x) = \langle a_i, x\rangle\cdot a_i$, then for $A = a_1\bullet\cdots\bullet a_n$:

Example 1.3 When the nonzero entries of the input $A$ are concentrated in a small number of vectors $a_i$, uniform sampling will frequently sample gradients that are small and make little progress, whereas importance sampling will rarely do so. In an extreme case, $A$ can contain exactly one nonzero vector $a_i$; importance sampling will always output the full gradient, whereas uniform sampling will only find the nonzero row with probability $\frac{1}{n}$.

Example 1.4 It may be that all rows of $A$ have large magnitude, but $x$ is nearly orthogonal to most of the rows of $A$ and heavily in the direction of row $a_r$. Then $\langle a_i, x\rangle\cdot a_i$ is small in magnitude for most $i$, but $\langle a_r, x\rangle\cdot a_r$ is large, so uniform sampling will often output small gradients while importance sampling will output $\langle a_r, x\rangle\cdot a_r$ with high probability.

Thus Example 1.3 shows that naïve SGD with uniform sampling can suffer up to a multiplicative $n$ factor loss in the convergence rate of Theorem 1.2 compared to that of SGD with importance sampling, whereas Example 1.4 shows a possible additive $n$ factor loss.

¹We use the notation $a\bullet b$ to denote the vertical concatenation $\begin{bmatrix} a \\ b \end{bmatrix}$.
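The extreme case of Example 1.3 can be simulated directly; the following toy check (not the paper's data structure) uses illustrative names throughout.

```python
import numpy as np

# Illustration of Example 1.3: when A has exactly one nonzero row, uniform
# sampling returns a zero gradient except with probability 1/n, while
# sampling proportional to gradient norms always returns the one
# informative row.
rng = np.random.default_rng(1)
n, d = 500, 10
A = np.zeros((n, d))
A[7] = rng.standard_normal(d)         # the only nonzero row
x = rng.standard_normal(d)

norms = np.abs(A @ x) * np.linalg.norm(A, axis=1)   # ||<a_i,x> a_i||_2
p_importance = norms / norms.sum()
assert p_importance[7] == 1.0         # importance sampling is deterministic here

uniform_hits = sum(rng.integers(n) == 7 for _ in range(10_000))
print(uniform_hits / 10_000)          # close to 1/n = 0.002
```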
Unlike a number of previous variance reduction methods, we do not require distributional assumptions (Bouchard et al., 2015; Frostig et al., 2015; Gopal, 2016; Jothimurugesan et al., 2018) or offline access to the data (Roux et al., 2012; Johnson & Zhang, 2013; Defazio et al., 2014; Reddi et al., 2015; Zhao & Zhang, 2015; Daneshmand et al., 2016; Needell et al., 2016; Stich et al., 2017; Johnson & Guestrin, 2018; Katharopoulos & Fleuret, 2018; Salehi et al., 2018; Qian et al., 2019). On the other hand, for applications such as neural networks in which the parameters in the loss function can change, we can use a second-order approximation for a number of iterations, then reread the data to build a new second-order approximation when necessary. We complement our main theoretical result with empirical evaluations comparing our algorithm to SGD with uniform sampling for logistic regression on the a9a Adult dataset collected by UCI and retrieved from LibSVM (Chang & Lin, 2011). Our evaluations demonstrate that for various step-sizes, our algorithm has significantly better performance than uniform sampling, across both the number of SGD iterations and, surprisingly, wall-clock time.

We then show that our same framework can also be reworked to approximate importance sampling for the Hessian, thereby performing variance reduction for second-order optimization methods. Xu et al. (2016) reduce the bottleneck of many second-order optimization methods to the task of sampling $s$ rows of $A = a_1\bullet\cdots\bullet a_n$ so that row $a_i$ is sampled with probability
$$\frac{\|f(\langle a_i, x\rangle)\cdot a_i a_i^\top\|_F^2}{\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i a_i^\top\|_F^2},$$
for some fixed function $f$, so that the Hessian $H$ has the form $H := \nabla^2 F = \frac{1}{n}\sum_{i=1}^n f(\langle a_i, x\rangle)\, a_i a_i^\top$. Xu et al. (2016) show that this finite-sum form arises frequently in machine learning problems such as logistic regression with least squares loss.
Theorem 1.5 Let $\nabla^2 F = \frac{1}{n}\sum\nabla^2 f_i(x)$, where each $\nabla^2 f_i = f(\langle a_i, x\rangle)\cdot a_i a_i^\top$ for some polynomial $f$ and vector $a_i\in\mathbb{R}^d$, and let $\mathrm{nnz}(A)$ be the number of nonzero entries of $A := a_1\bullet\cdots\bullet a_n$. For $d \ll T \ll n$, there exists an algorithm that subsamples $T$ Hessians within a constant factor of the optimal probability distribution. The algorithm requires a single pass over $A$ and uses $\widetilde{O}(\mathrm{nnz}(A))$ pre-processing time and $\widetilde{O}(Td)$ space.

2. SGD ALGORITHM

We first introduce a number of algorithms that will be used in our final SGD algorithm, along with their guarantees. We defer all formal proofs to the appendix.

$L_2$ polynomial inner product sketch. For a fixed polynomial $f$, we first require a constant-factor approximation to $\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i\|_2^2$ for any query $x\in\mathbb{R}^d$; we call such an algorithm an $L_2$ polynomial inner product sketch. Our algorithm ESTIMATOR generalizes Alon et al. (1999); Mahabadi et al. (2020) and is simple to implement. For intuition, observe that for $d = 1$ and the identity function $f$, the matrix $A\in\mathbb{R}^{n\times d}$ reduces to a vector of length $n$, so that estimating $\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i\|_2^2$ is just estimating the squared norm of a vector in sublinear space. For a degree-$p$ polynomial $f$, ESTIMATOR generates random sign matrices $S_0, S_1, \ldots, S_p$ with $\widetilde{O}\left(\frac{1}{\epsilon^2}\right)$ rows and maintains $S_0 A, \ldots, S_p A$. To estimate $\sum_{i=1}^n \|\alpha_q\cdot(\langle a_i, x\rangle)^q\cdot a_i\|_2^2$ for an integer $q\in[0,p]$ and scalar $\alpha_q$ on a given query $x$, ESTIMATOR creates the $q$-fold tensor $Y = y^{\otimes q}$ for each row $y$ of $S_q A$ and the $(q-1)$-fold tensor $X = x^{\otimes(q-1)}$. Note that $X$ and $Y$ can be refolded into dimensions $\mathbb{R}^{d^{q-1}}$ and $\mathbb{R}^{d\times d^{q-1}}$, respectively, so that $YX\in\mathbb{R}^d$ and $\|\alpha_q\cdot YX\|_2^2$ is an unbiased estimator of $\sum_{i=1}^n \|\alpha_q\cdot(\langle a_i, x\rangle)^q\cdot a_i\|_2^2$. We give this algorithm in full in Algorithm 1. Thus, taking the average over $O\left(\frac{1}{\epsilon^2}\right)$ instances of the sums of the tensor products for rows $y$ across the sketches $S_0 A, \ldots, S_p A$ gives a $(1+\epsilon)$-approximation to $\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i\|_2^2$ with constant probability. The success probability can then be boosted to $1 - \frac{1}{\mathrm{poly}(n)}$ by taking the median of $O(\log n)$ such outputs.

Algorithm 1 Basic algorithm ESTIMATOR that outputs a $(1+\epsilon)$-approximation to $\sum_{i=1}^n \|(\langle a_i, x\rangle)^p\cdot a_i\|_2^2$, where $x$ is a post-processing vector
Input: Matrix $A = a_1\bullet\cdots\bullet a_n\in\mathbb{R}^{n\times d}$, post-processing vector $x\in\mathbb{R}^d$, integer $p\ge 0$, constant parameter $\epsilon > 0$.
Output: $(1+\epsilon)$-approximation to $\sum_{i=1}^n \|(\langle a_i, x\rangle)^p\cdot a_i\|_2^2$.
1: $r\leftarrow\Theta(\log n)$ with a sufficiently large constant.
2: $b\leftarrow\Omega\left(\frac{1}{\epsilon^2}\right)$ with a sufficiently large constant.
3: Let $T$ be an $r\times b$ table of buckets, where each bucket stores a vector in $\mathbb{R}^d$, initialized to the zeros vector.
4: Let $s_i\in\{-1,+1\}$ be 4-wise independent for $i\in[n]$.
5: Let $h_i:[n]\to[b]$ be 4-wise independent for $i\in[r]$.
6: Process $A$:
7: for each $j = 1$ to $n$ do
8: &nbsp;&nbsp;for each $i = 1$ to $r$ do
9: &nbsp;&nbsp;&nbsp;&nbsp;Add $s_j a_j$ to the vector in bucket $h_i(j)$ of row $i$.
10: Let $v_{i,j}$ be the vector in row $i$, bucket $j$ of $T$ for $i\in[r]$, $j\in[b]$.
11: Process $x$:
12: for $i\in[r]$, $j\in[b]$ do
13: &nbsp;&nbsp;$u_{i,j}\leftarrow \left\|v_{i,j}^{\otimes p}\, x^{\otimes(p-1)}\right\|_2^2$
14: return $\mathrm{median}_{i\in[r]}\ \frac{1}{b}\sum_{j\in[b]} u_{i,j}$

$L_2$ polynomial inner product sampler. Given a matrix $A = a_1\bullet\cdots\bullet a_n\in\mathbb{R}^{n\times d}$ and a fixed function $f$, a data structure that takes a query $x\in\mathbb{R}^d$ and outputs an index $i\in[n]$ with probability roughly $\frac{\|f(\langle a_i, x\rangle)\cdot a_i\|_2^2}{\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i\|_2^2}$ is called an $L_2$ polynomial inner product sampler. We give such a data structure in Section A.1:

Theorem 2.2 For a fixed $\epsilon > 0$ and polynomial $f$, there exists a data structure SAMPLER that takes any query $x\in\mathbb{R}^d$ and outputs an index $i\in[n]$ with probability $(1\pm\epsilon)\cdot\frac{\|f(\langle a_i, x\rangle)\cdot a_i\|_2^2}{\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i\|_2^2} + \frac{1}{\mathrm{poly}(n)}$, along with a vector $u := f(\langle a_i, x\rangle)\cdot a_i + v$, where $\mathbb{E}[v] = 0$ and $\|v\|_2 \le \epsilon\cdot\|f(\langle a_i, x\rangle)\cdot a_i\|_2$.

We remark that $T$ independent instances of SAMPLER provide an oracle for $T$ steps of SGD with importance sampling, but the overall runtime would be $T\cdot\mathrm{nnz}(A)$, so it would be just as efficient to run $T$ iterations of GD. The subroutine SAMPLER is significantly more challenging to describe and analyze, so we defer its discussion to Section A.1, though it can be seen as a combination of ESTIMATOR and a generalized CountSketch (Charikar et al., 2004; Nelson & Nguyen, 2013; Mahabadi et al., 2020) variant and is nevertheless relatively straightforward to implement.

Leverage score sampler.
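The sign-bucket structure of ESTIMATOR can be sketched for the degree-0 term alone (where no tensoring with $x$ is needed): summing the squared bucket norms gives an unbiased estimate of $\sum_i \|a_i\|_2^2$. The following is a hedged simplification; fully random signs and hashes stand in for the 4-wise independent families, and all names are illustrative.

```python
import numpy as np

# Simplified sketch of the sign-bucket table for the degree-0 term only:
# each of r rows holds b buckets of R^d vectors; summing squared bucket
# norms per row estimates sum_i ||a_i||_2^2, and a median over rows
# concentrates the estimate.
rng = np.random.default_rng(2)
n, d, r, b = 2000, 15, 9, 400
A = rng.standard_normal((n, d))

signs = rng.choice([-1.0, 1.0], size=(r, n))     # stand-in for 4-wise independence
hashes = rng.integers(b, size=(r, n))

# Process A.
T = np.zeros((r, b, d))
for i in range(r):
    for j in range(n):
        T[i, hashes[i, j]] += signs[i, j] * A[j]

estimates = (T ** 2).sum(axis=(1, 2))            # sum_j ||bucket_{i,j}||_2^2 per row
estimate = np.median(estimates)

truth = (A ** 2).sum()
assert abs(estimate - truth) / truth < 0.25      # crude accuracy for this sketch size
```

Cross terms cancel in expectation because of the random signs, which is the same mechanism the full algorithm relies on after tensoring with $x$.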
Although SAMPLER outputs a (noisy) vector according to the desired probability distribution, we also require an algorithm that explicitly handles indices $i\in[n]$ that are likely to be sampled multiple times across the $T$ iterations. Equivalently, we require explicitly storing the rows with high leverage scores, but we defer the formal discussion and algorithmic presentation to Section A.2. For our purposes, the following suffices:

Theorem 2.3 There exists an algorithm LEVERAGE that returns all indices $i\in[n]$ such that $(1\pm\epsilon)\cdot\frac{\|f(\langle a_i, x\rangle)\cdot a_i\|_2^2}{\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i\|_2^2} \ge \frac{1}{200Td}$ for some $x\in\mathbb{R}^d$, along with a vector $u_i := f(\langle a_i, x\rangle)\cdot a_i + v_i$, where $\|v_i\|_2 \le \epsilon\cdot\|f(\langle a_i, x\rangle)\cdot a_i\|_2$. The algorithm requires a single pass over $A = a_1\bullet\cdots\bullet a_n$ (possibly through turnstile updates), uses $\widetilde{O}\left(\mathrm{nnz}(A) + \frac{d^\omega}{\epsilon^2}\right)$ runtime (where $\omega$ denotes the exponent of square matrix multiplication) and $\widetilde{O}(d^2)$ space, and succeeds with probability $1 - \frac{1}{\mathrm{poly}(n)}$.

2.1. SGD ALGORITHM AND ANALYSIS

For the finite-sum optimization problem $\min_{x\in\mathbb{R}^d} F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$, where each $\nabla f_i = f(\langle a_i, x\rangle)\cdot a_i$, recall that we could simply use an instance of SAMPLER as an oracle for SGD with importance sampling. However, naïvely running $T$ SGD steps requires $T$ independent instances, which uses $T\cdot\mathrm{nnz}(A)$ runtime by Theorem 2.2. Thus we use a two-level data structure: we first implicitly partition the rows of the matrix $A = a_1\bullet\cdots\bullet a_n$ into $\beta := \Theta(Td)$ buckets $B_1, \ldots, B_\beta$ and create an instance of ESTIMATOR and SAMPLER for each bucket. The idea is that for a given query $x_t$ in SGD iteration $t\in[T]$, we first query $x_t$ to each of the ESTIMATOR data structures to estimate $\sum_{i\in B_j} \|f(\langle a_i, x_t\rangle)\cdot a_i\|_2^2$ for each $j\in[\beta]$. We then sample an index $j\in[\beta]$ among the buckets $B_1, \ldots, B_\beta$ with probability roughly $\frac{\sum_{i\in B_j} \|f(\langle a_i, x_t\rangle)\cdot a_i\|_2^2}{\sum_{i=1}^n \|f(\langle a_i, x_t\rangle)\cdot a_i\|_2^2}$. Once we have sampled index $j$, it would seem that querying the instance of SAMPLER corresponding to $B_j$ simulates SGD, since SAMPLER now performs importance sampling on the rows in $B_j$, which gives the correct overall probability distribution for each row $i\in[n]$. Moreover, SAMPLER has runtime proportional to the sparsity of $B_j$, so the total runtime across the $\beta$ instances of SAMPLER is $\widetilde{O}(\mathrm{nnz}(A))$. However, an issue arises when the same bucket $B_j$ is sampled multiple times, as we only create a single instance of SAMPLER for each bucket. We avoid this issue by explicitly accounting for the buckets that are likely to be sampled multiple times. Namely, we show that if $\frac{\|f(\langle a_i, x_t\rangle)\cdot a_i\|_2^2}{\sum_{i=1}^n \|f(\langle a_i, x_t\rangle)\cdot a_i\|_2^2} < \frac{1}{200Td}$ for all $t\in[T]$ and $i\in[n]$, then by Bernstein's inequality, the probability that no bucket $B_j$ is sampled multiple times is at least $\frac{99}{100}$. Thus we use LEVERAGE to separate all rows $a_i$ that violate this property from their respective buckets and explicitly track the SGD steps in which these rows are sampled. We give the algorithm in full in Algorithm 2.
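The two-level scheme can be sketched with exact quantities standing in for the sketches: exact per-bucket norms replace ESTIMATOR and exact within-bucket sampling replaces SAMPLER. All names here are illustrative, and this brute-force version is only meant to show that the composed distribution equals direct importance sampling.

```python
import numpy as np

# Two-level sampling sketch: rows are hashed into beta buckets; a bucket
# is drawn proportional to its total mass (level 1), then a row is drawn
# inside the bucket proportional to its own mass (level 2).
rng = np.random.default_rng(3)
n, d, beta = 1000, 10, 50
A = rng.standard_normal((n, d))
x = rng.standard_normal(d)
bucket_of = rng.integers(beta, size=n)           # implicit partition of rows

weights = (A @ x) ** 2 * (A ** 2).sum(axis=1)    # ||<a_i,x> a_i||_2^2
bucket_mass = np.bincount(bucket_of, weights=weights, minlength=beta)
bucket_p = bucket_mass / bucket_mass.sum()

def sample_row():
    j = rng.choice(beta, p=bucket_p)                          # level 1
    rows = np.flatnonzero(bucket_of == j)
    return rng.choice(rows, p=weights[rows] / weights[rows].sum())  # level 2

# Empirically, the composed two-level distribution tracks weights/sum(weights).
counts = np.bincount([sample_row() for _ in range(20_000)], minlength=n)
corr = np.corrcoef(counts, weights)[0, 1]
assert corr > 0.9
```

The product of the two sampling probabilities telescopes to exactly $\frac{w_i}{\sum_i w_i}$, which is why the overall distribution is correct.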
The key property achieved by Algorithm 2 in partitioning the rows and removing the rows that are likely to be sampled multiple times is that each of the SAMPLER instances is queried at most once.

Lemma 2.4 With probability at least $\frac{98}{100}$, each $t\in[T]$ uses a different instance of SAMPLER$_j$.

Proof of Theorem 1.1: Consider Algorithm 2. By Lemma 2.4, each time $t\in[T]$ uses a fresh instance of SAMPLER$_j$, so that independent randomness is used. A possible concern is that each instance ESTIMATOR$_j$ is not using fresh randomness, but we observe that the ESTIMATOR procedures

Algorithm 2 Approximate SGD with Importance Sampling
Input: Matrix $A = a_1\bullet\cdots\bullet a_n\in\mathbb{R}^{n\times d}$,
10: Let $q_i$ be the output of ESTIMATOR$_i$ on query $x_{t-1}$ for each $i\in[\beta]$.
11: Sample $j\in[\beta]$ with probability $p_j = \frac{q_j}{\sum_{i\in[\beta]} q_i}$.
12: if there exists $i\in L_0$ with $h(i) = j$ then
13: &nbsp;&nbsp;Use ESTIMATOR$_j$, LEVERAGE, and SAMPLER$_j$ to sample gradient $w_t = \nabla f_{i_t}(x_t)$
14: else
15: &nbsp;&nbsp;Use SAMPLER$_j$ to sample gradient $w_t = \nabla f_{i_t}(x_t)$
16: $\widehat{p}_{i,t}\leftarrow \frac{\|w_t\|_2^2}{\sum_{j\in[\beta]} q_j}$
17: $x_{t+1}\leftarrow x_t - \frac{\eta_t}{n\,\widehat{p}_{i,t}}\cdot w_t$

3. SECOND-ORDER OPTIMIZATION

In this section, we repurpose our data structure that performs importance sampling for SGD to instead perform importance sampling for second-order optimization. Given a second-order optimization algorithm that requires a sampled Hessian $H_t$, possibly along with additional inputs such as the current iterate $x_t$ and the gradient $g_t$ of $F$, we model the update rule by an oracle $O(H_t)$, suppressing the other inputs to the oracle in the notation. For example, the oracle $O$ corresponding to the canonical second-order algorithm, Newton's method, can be formulated as $x_{t+1} = O(H_t) := x_t - [H_t]^{-1} g_t$. By black-boxing the update rule of any second-order optimization algorithm into the oracle, we can focus our attention on the running time of sampling a Hessian with nearly the optimal probability distribution. Thus we prove generalizations of the $L_2$ polynomial inner product sketch, the $L_2$ polynomial inner product sampler, and the leverage score sampler for Hessians.

Theorem 3.1 For a fixed $\epsilon > 0$ and polynomial $f$, there exists a data structure HESTIMATOR that outputs a $(1+\epsilon)$-approximation to $\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i a_i^\top\|_F^2$ for any query $x\in\mathbb{R}^d$.

Similarly, there exists a data structure HSAMPLER that takes any query $x\in\mathbb{R}^d$ and outputs an index $i\in[n]$ with probability $(1\pm\epsilon)\cdot\frac{\|f(\langle a_i, x\rangle)\cdot a_i a_i^\top\|_F^2}{\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i a_i^\top\|_F^2} + \frac{1}{\mathrm{poly}(n)}$, along with a matrix $U := f(\langle a_i, x\rangle)\cdot a_i a_i^\top + V$, where $\mathbb{E}[V] = 0$ and $\|V\|_F \le \epsilon\cdot\|f(\langle a_i, x\rangle)\cdot a_i a_i^\top\|_F$, and an algorithm HLEVERAGE that returns all indices $i\in[n]$ such that $(1\pm\epsilon)\cdot\frac{\|f(\langle a_i, x\rangle)\cdot a_i a_i^\top\|_F^2}{\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i a_i^\top\|_F^2} \ge \frac{1}{200Td}$ for some $x\in\mathbb{R}^d$, along with a matrix $U_i := f(\langle a_i, x\rangle)\cdot a_i a_i^\top + V_i$, where $\|V_i\|_F \le \epsilon\cdot\|f(\langle a_i, x\rangle)\cdot a_i a_i^\top\|_F$. We remark that HESTIMATOR, HSAMPLER, and HLEVERAGE are generalizations of ESTIMATOR, SAMPLER, and LEVERAGE that simply return an outer product of a noisy vector rather than the noisy vector itself. As before, observe that we could simply run an instance of HSAMPLER to sample a Hessian through importance sampling, but sampling $T$ Hessians requires $T$ independent instances, significantly increasing the total runtime.
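The oracle abstraction $O(H_t)$ described above can be sketched for Newton's method; the quadratic objective below is an illustrative assumption, chosen so the result of one step is easy to verify.

```python
import numpy as np

# The oracle consumes a (subsampled) Hessian H_t and gradient g_t and is
# otherwise a black box to the sampling machinery:
#   x_{t+1} = O(H_t) := x_t - H_t^{-1} g_t  (via a linear solve, not an inverse).
def newton_oracle(x_t, H_t, g_t):
    return x_t - np.linalg.solve(H_t, g_t)

# Toy check on F(x) = 0.5 x^T Q x - c^T x: one exact Newton step lands on
# the optimum, which satisfies the first-order condition Q x* = c.
rng = np.random.default_rng(4)
d = 5
M = rng.standard_normal((d, d))
Q = M @ M.T + np.eye(d)                  # positive definite Hessian
c = rng.standard_normal(d)
x0 = rng.standard_normal(d)
x1 = newton_oracle(x0, Q, Q @ x0 - c)
assert np.allclose(Q @ x1, c)
```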
We thus use the same two-level data structure that partitions the rows of the matrix $A = a_1\bullet\cdots\bullet a_n$ into $\beta := \Theta(Td)$ buckets $B_1, \ldots, B_\beta$. We then create an instance of HESTIMATOR and HSAMPLER for each bucket. For an iterate $x_t$, we sample $j\in[\beta]$ among the buckets $B_1, \ldots, B_\beta$ with probability roughly $\frac{\sum_{i\in B_j} \|f(\langle a_i, x_t\rangle)\cdot a_i a_i^\top\|_F^2}{\sum_{i=1}^n \|f(\langle a_i, x_t\rangle)\cdot a_i a_i^\top\|_F^2}$ using HESTIMATOR, and then query HSAMPLER$_j$ at $x_t$ to sample a Hessian among the indices partitioned into bucket $B_j$. As before, this argument fails when the same bucket $B_j$ is sampled multiple times, due to dependencies in randomness, but this issue can be avoided by using HLEVERAGE to decrease the probability that each bucket is sampled multiple times. We give the algorithm in full in Algorithm 3. We remark that Algorithm 3 can be generalized to handle oracles $O$ corresponding to second-order methods that require batches of subsampled Hessians in each iteration. For example, if we want to run $T$ iterations of a second-order method that requires $s$ subsampled Hessians in each batch, we can simply modify Algorithm 3 to sample $s$ Hessians in each iteration as input to $O$, and thus $Ts$ Hessians in total.
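A useful identity for the Hessian sampling weights is $\|c\cdot a a^\top\|_F = |c|\cdot\|a\|_2^2$, so the weights never require forming the $d\times d$ outer products. The following toy check assumes $f(t) = t$ purely for illustration; all names are illustrative.

```python
import numpy as np

# The Hessian sampling weight ||f(<a_i,x>) a_i a_i^T||_F equals
# |f(<a_i,x>)| * ||a_i||_2^2, computable in O(nnz(a_i)) time per row.
rng = np.random.default_rng(5)
n, d = 300, 8
A = rng.standard_normal((n, d))
x = rng.standard_normal(d)

w_cheap = np.abs(A @ x) * (A ** 2).sum(axis=1)    # |f(<a_i,x>)| * ||a_i||_2^2
w_direct = np.array([np.linalg.norm((A[i] @ x) * np.outer(A[i], A[i]))
                     for i in range(n)])          # Frobenius norm, formed naively
assert np.allclose(w_cheap, w_direct)

p = w_cheap / w_cheap.sum()
i = rng.choice(n, p=p)                            # sampled Hessian index
H_i = (A[i] @ x) * np.outer(A[i], A[i])           # f(<a_i,x>) a_i a_i^T
```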

4. EMPIRICAL EVALUATIONS

Our primary contribution is the theoretical design of a nearly input sparsity time algorithm that approximates optimal importance sampling SGD. In this section we implement a scaled-down version of our algorithm and compare its performance on large-scale real-world datasets to SGD with uniform sampling on logistic regression. We also consider both linear regression and support-vector machines (SVMs) in the supplementary material. Because most rows have roughly uniformly small leverage scores in real-world data, we assume that no bucket contains a row with a significantly large leverage score, and thus the implementation of our importance sampling algorithm does not create multiple samplers for any bucket. By similar reasoning, our implementation uniformly samples a number of indices $i$ and estimates $\sum_{i=1}^n \|f(\langle a_i, x\rangle)\cdot a_i\|_2^2$ by rescaling. Observe that although these simplifications decrease the wall-clock running time and the total space used by our algorithm, they can only decrease the quality of our solution at each SGD iteration. We also consider two hybrid SGD sampling algorithms; the first takes the better gradient obtained at each iteration from both uniform sampling and importance sampling, while the second performs 25 iterations of importance sampling before using uniform sampling for the remaining iterations. Surprisingly, our SGD importance sampling implementation not only significantly improves upon SGD with uniform sampling, but is also competitive with the two hybrid algorithms. We do not consider other SGD variants due to either their distributional assumptions or lack of known flexibility to big data models.

Algorithm 3 Second-Order Optimization with Importance Sampling
Input: Matrix $A = a_1\bullet\cdots\bullet a_n\in\mathbb{R}^{n\times d}$,
Let $q_i$ be the output of HESTIMATOR$_i$ on query $x_{t-1}$ for each $i\in[\beta]$.
Sample $j\in[\beta]$ with probability $p_j = \frac{q_j}{\sum_{i\in[\beta]} q_i}$.
if there exists $i\in L_0$ with $h(i) = j$ then
&nbsp;&nbsp;Use HESTIMATOR$_j$, HLEVERAGE, and HSAMPLER$_j$ to sample Hessian $H_t$.
else
&nbsp;&nbsp;Use HSAMPLER$_j$ to sample Hessian $H_t = \nabla^2 f_{i_t}(x_t)$.

Fig. 1: Comparison of objective values and runtimes for importance sampling (in blue squares), uniform sampling (in red triangles), hybrid sampling that chooses the better gradient at each step (in purple circles), and hybrid sampling that performs 25 steps of importance sampling followed by uniform sampling (in teal X's) over various step-sizes for logistic regression on the a9a Adult dataset from UCI, across 250 iterations, averaged over 10 repetitions.

Fig. 2: Comparison of objective values and wall-clock time for importance sampling (in blue squares), uniform sampling (in red triangles), and hybrid sampling that chooses the better gradient at each step (in purple circles) over step-size 0.1 for logistic regression on the a9a Adult dataset from UCI, averaged over 3 repetitions across approximately 15 minutes of total computation time.

The experiments were performed in Python 3.6.9 on an Intel Core i7-8700K 3.70 GHz CPU with 12 cores and 64GB DDR4 memory, using an Nvidia Geforce GTX 1080 Ti 11GB GPU. Our code is publicly available at https://github.com/SGD-adaptive-importance/code.

Logistic Regression. We performed logistic regression on the a9a Adult data set collected by UCI and retrieved from LibSVM (Chang & Lin, 2011). The features correspond to responses from the 1994 Census database and the prediction task is to determine whether a person makes over 50K USD a year. We trained using a data batch of 32581 points and 123 features and tested the performance on a separate batch of 16281 data points.
For each evaluation, we generated 10 random initial positions shared by importance sampling and uniform sampling. We then ran 250 iterations of SGD for each of the four algorithms, creating only 250 buckets for the importance sampling algorithm, and computed the average performance at each iteration across these 10 separate instances. The relative average performance of all algorithms was relatively robust to the step-size. Although uniform sampling used significantly less time overall, our importance sampling SGD algorithm actually had better performance when considering either the number of iterations or wall-clock time, across all tested step-sizes. For example, uniform sampling had average objective value 20680 at iteration 250 using 0.0307 seconds with step-size 0.1, but importance sampling had average objective value 12917 at iteration 5 using 0.025 seconds. We give our results for logistic regression in Figure 1. We repeat our experiments in Figure 2 to explicitly compare the objective value of each algorithm with respect to wall-clock time, rather than SGD iterations. Thus our results in Figure 2 empirically demonstrate the advantages of our algorithm across the most natural metrics. For additional experiments, see Section B.

5. CONCLUSION AND FUTURE WORK

We have given variance reduction methods for both first-order and second-order stochastic optimization. Our algorithms require a single pass over the data, which may even arrive implicitly in the form of turnstile updates, and use input sparsity time and Õ (T d) space. Our algorithms are also amenable to big data models such as the streaming and distributed models and are supported by empirical evaluations on large-scale datasets. We believe there are many interesting future directions to explore. For example, can we generalize our techniques to show provable guarantees for other SGD variants and accelerated methods? A very large-scale empirical study of these methods would also be quite interesting.

A DISCUSSION, FULL ALGORITHMS, AND PROOFS

For the sake of presentation, we consider the case where $p = 2$; higher degrees follow from the same approach, using tensor representations instead of matrix representations. Instead of viewing the input matrix $A = a_1\bullet\cdots\bullet a_n\in\mathbb{R}^{n\times d}$ as a collection of rows, we view the matrix $A = A_1\bullet\cdots\bullet A_n\in\mathbb{R}^{nd\times d}$, where each matrix $A_i = a_i\otimes a_i$ is the outer product of the row $a_i$ with itself.

A.1 $L_2$ POLYNOMIAL INNER PRODUCT SAMPLER

For ease of discussion, we describe in this section a data structure that allows sampling an index $i\in[n]$ with probability approximately $\frac{\|A_i x\|_2}{\|Ax\|_{1,2,d}}$ in linear time and sublinear space, where for a matrix $A\in\mathbb{R}^{nd\times d}$, we use $\|Ax\|_{1,2,d}$ to denote $\sum_{i=1}^n \|A_i x\|_2$, where each $A_i\in\mathbb{R}^{d\times d}$ and $A = A_1\bullet\cdots\bullet A_n$. The generalization to an $L_2$ polynomial inner product sampler follows immediately. Notably, our data structure can be built given access to $A$ alone, and will still sample from the correct distribution when $x$ is given as a post-processing vector. We first describe in Section A.1.1 some necessary subroutines that our sampler requires. These subroutines are natural generalizations of the well-known frequency moment estimation algorithm of Alon et al. (1999) and the heavy hitter detection algorithm of Charikar et al. (2004). We then give the $L_{1,2,d}$ sampler in full in Section A.1.2.
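The $L_{1,2,d}$ norm just defined can be computed directly on small instances; the sketch below also checks the closed form it takes on the outer-product blocks $A_i = a_i\otimes a_i$ used in this section. Function and variable names are illustrative.

```python
import numpy as np

# ||Ax||_{1,2,d} for A stacked from n blocks A_i in R^{d x d}:
# the sum over blocks of the Euclidean norms ||A_i x||_2.
def l12d_norm(blocks, x):
    return sum(np.linalg.norm(Ai @ x) for Ai in blocks)

rng = np.random.default_rng(6)
n, d = 50, 4
rows = rng.standard_normal((n, d))
blocks = [np.outer(a, a) for a in rows]          # A_i = a_i (x) a_i
x = rng.standard_normal(d)

# For outer-product blocks, ||A_i x||_2 = |<a_i, x>| * ||a_i||_2.
direct = l12d_norm(blocks, x)
closed = np.sum(np.abs(rows @ x) * np.linalg.norm(rows, axis=1))
assert np.isclose(direct, closed)
```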

A.1.1 FREQUENCY MOMENT AND HEAVY HITTER GENERALIZATIONS

We first recall a generalization of the frequency moment estimation algorithm of Alon et al. (1999) that also supports post-processing multiplication by any vector $x\in\mathbb{R}^d$.

Lemma A.1 (Mahabadi et al., 2020) Given a constant $\epsilon > 0$, there exists a one-pass streaming algorithm AMS that takes updates to entries of a matrix $A\in\mathbb{R}^{n\times d}$, as well as query access to post-processing vectors $x\in\mathbb{R}^d$ and $v\in\mathbb{R}^d$ that arrive after the stream, and outputs a quantity $F$ such that $(1-\epsilon)\|Ax - v\|_2 \le F \le (1+\epsilon)\|Ax - v\|_2$. The algorithm uses $O\left(\frac{d^2}{\epsilon^2}\left(\log^2 n + \log\frac{1}{\delta}\right)\right)$ bits of space and succeeds with probability at least $1-\delta$.

Algorithm 4 Basic algorithm COUNTSKETCH that outputs heavy submatrices of $\|Ax\|_{1,2,d}$, where $x$ is a post-processing vector
Input: Matrix $A\in\mathbb{R}^{nd\times d}$, post-processing vector $x\in\mathbb{R}^d$, constant parameter $\epsilon > 0$.
Output: Slight perturbations of the vectors $A_i x$ for which $\|A_i x\|_2 \ge \epsilon\|Ax\|_{1,2,d}$.
1: $r\leftarrow\Theta(\log n)$ with a sufficiently large constant.
2: $b\leftarrow\Omega\left(\frac{1}{\epsilon^2}\right)$ with a sufficiently large constant.
3: Let $T$ be an $r\times b$ table of buckets, where each bucket stores a matrix in $\mathbb{R}^{d\times d}$, initialized to the zeros matrix.
4: Let $s_i\in\{-1,+1\}$ be 4-wise independent for $i\in[n]$.
5: Let $h_i:[n]\to[b]$ be 4-wise independent for $i\in[r]$.
6: Process $A$:
7: Let $A = A_1\bullet\cdots\bullet A_n$, where each $A_i\in\mathbb{R}^{d\times d}$.
8: for each $j = 1$ to $n$ do
9: &nbsp;&nbsp;for each $i = 1$ to $r$ do
10: &nbsp;&nbsp;&nbsp;&nbsp;Add $s_j A_j$ to the matrix in bucket $h_i(j)$ of row $i$.
11: Let $M_{i,j}$ be the matrix in row $i$, bucket $j$ of $T$ for $i\in[r]$, $j\in[b]$.
12: Process $x$:
13: for $i\in[r]$, $j\in[b]$ do
14: &nbsp;&nbsp;$M_{i,j}\leftarrow M_{i,j}x$
15: On query $k\in[n]$, report $\mathrm{median}_{i\in[r]} \|M_{i,h_i(k)}\|_2$.

Let $A_1, \ldots, A_n\in\mathbb{R}^{d\times d}$ and $A = A_1\bullet\cdots\bullet A_n\in\mathbb{R}^{nd\times d}$. Let $x\in\mathbb{R}^{d\times 1}$ be a post-processing vector that is revealed only after $A$ has been completely processed. For a given $\epsilon > 0$, we say a block $A_i$ with $i\in[n]$ is heavy if $\|A_i x\|_2 \ge \epsilon\|Ax\|_{1,2,d}$. We show in Algorithm 4 an algorithm that processes $A$ into a sublinear space data structure and identifies the heavy blocks of $A$ once $x$ is given.
Moreover, for each heavy block $A_i$, the algorithm outputs a vector $y$ that is a good approximation to $A_i x$. The algorithm is a natural generalization of the CountSketch heavy-hitter algorithm introduced by Charikar et al. (2004). For a vector $v\in\mathbb{R}^{nd\times 1}$, we use $v_{\mathrm{tail}(b)}$ to denote $v$ with the $b$ blocks of $d$ rows of $v$ with the largest $\ell_2$ norm set to zeros.

Lemma A.2 There exists an algorithm that uses $O\left(\frac{1}{\epsilon^2}d^2\left(\log^2 n + \log\frac{1}{\delta}\right)\right)$ space and outputs a vector $y_i$ for each index $i\in[n]$ so that $\left|\|y_i\|_2 - \|A_i x\|_2\right| \le \epsilon\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2 \le \epsilon\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_{1,2,d}$ with probability at least $1-\delta$. Moreover, if $Y = y_1\bullet\cdots\bullet y_n$, then $\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2 \le \|Ax - Y\|_2 \le \sqrt{2}\,\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2$.

Proof: For $b$ with a sufficiently large constant, $E_1$ occurs with probability at least $\frac{11}{12}$ by a union bound. Let $v$ be the sum of the vectors representing the blocks that are hashed to bucket $h_\alpha(i)$, excluding $A_i x$, so that $v$ is the noise for the estimate of $A_i x$ in row $\alpha$. Conditioned on $E_1$, we can bound the expected squared norm of the noise in bucket $h_\alpha(i)$ for sufficiently large $b$ by $\mathbb{E}\left[\|v\|_2^2\right] \le \frac{\epsilon^2}{9}\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2^2$. Hence we have $\mathrm{Var}(\|v\|_2) \le \frac{\epsilon^2}{9}\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2^2$. Thus from Jensen's inequality, Chebyshev's inequality, and conditioning on $E_1$,
$$\Pr\left[\|v\|_2 \ge \epsilon\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2\right] \le \frac{1}{4} + \frac{1}{12} = \frac{1}{3}.$$
The first claim then follows from the observation that $\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2 \le \|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_{1,2,d}$ and by noting that we can boost the probability of success to $1-\frac{1}{\mathrm{poly}(n)}$ by repeating for each of the $r = \Theta(\log n)$ rows and taking the median. Finally, observe that $\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2 \le \|Ax - Y\|_2$, since $Y$ has at most $\frac{2}{\epsilon^2}$ nonzero blocks, while $(Ax)_{\mathrm{tail}(2/\epsilon^2)}$ has all zeros in the $\frac{2}{\epsilon^2}$ blocks of $Ax$ with the largest $\ell_2$ norm. Since $Ax - Y$ alters at most $\frac{2}{\epsilon^2}$ blocks of $Ax$, each by at most $\epsilon\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2$, then
$$\|Ax - Y\|_2 \le \sqrt{\sum_{i=1}^{2/\epsilon^2} \epsilon^2\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2^2} = \sqrt{2}\,\|(Ax)_{\mathrm{tail}(2/\epsilon^2)}\|_2.$$

A.1.2 SAMPLING ALGORITHM

Our approach is similar to $\ell_p$ sampling techniques in Andoni et al. (2011); Jowhari et al.
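A simplified version of Algorithm 4 can be run directly; in the sketch below, fully random signs and hashes stand in for the 4-wise independent families, and the accuracy check uses a CountSketch-style additive error relative to the total $L_{1,2,d}$ mass. All names are illustrative.

```python
import numpy as np

# Simplified sketch of Algorithm 4: the table stores d x d matrices,
# x is applied only after A is fully processed, and a query k is answered
# by a median over the r rows.
rng = np.random.default_rng(7)
n, d, r, b = 400, 6, 11, 200
rows = rng.standard_normal((n, d))
rows[0] *= 50.0                                   # make block A_0 heavy
blocks = np.einsum('ij,ik->ijk', rows, rows)      # A_i = a_i (x) a_i

signs = rng.choice([-1.0, 1.0], size=(r, n))
hashes = rng.integers(b, size=(r, n))

T = np.zeros((r, b, d, d))                        # Process A
for i in range(r):
    for j in range(n):
        T[i, hashes[i, j]] += signs[i, j] * blocks[j]

x = rng.standard_normal(d)
Tx = T @ x                                        # Process x: each bucket -> R^d

def query(k):
    # median over rows of the bucket estimate's norm approximates ||A_k x||_2
    return np.median([np.linalg.norm(Tx[i, hashes[i, k]]) for i in range(r)])

total = sum(np.linalg.norm(blocks[i] @ x) for i in range(n))   # ||Ax||_{1,2,d}
for k in (0, 5):
    err = abs(query(k) - np.linalg.norm(blocks[k] @ x))
    assert err < 0.05 * total                     # small additive error per block
```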
(2011), who consider sampling indices of vectors, and follows almost verbatim Mahabadi et al. (2020), who consider sampling rows of matrices given post-processing multiplication by a vector. The high-level idea is to note that if $t_i \in [0,1]$ is chosen uniformly at random, then $\Pr\left[\frac{\|A_i\|_{1,2,d}}{t_i} \ge \frac{1}{\varepsilon}\|A\|_{1,2,d}\right] = \frac{\varepsilon\|A_i\|_{1,2,d}}{\|A\|_{1,2,d}}$. Thus if $B_i = \frac{A_i}{t_i}$ and there exists exactly one index $i$ such that $\|B_i\|_{1,2,d} \ge \frac{1}{\varepsilon}\|A\|_{1,2,d}$, then the task would reduce to outputting the $B_j$ that maximizes $\|B_j\|_{1,2,d}$ over all $j \in [n]$. In fact, we can show that $B_i$ is an $O(\varepsilon)$-heavy hitter of $B$ with respect to the $L_{1,2,d}$ norm. Hence, we use a generalization of COUNTSKETCH to identify the heavy hitters of $B$, approximate the maximizing index $i$, and check whether $\|B_i\|_{1,2,d}$ is at least $\frac{1}{\varepsilon}$ times (an estimate of) $\|A\|_{1,2,d}$. Unfortunately, this argument might fail for several reasons. First, there might exist zero or multiple indices $i$ such that $\|B_i\|_{1,2,d} \ge \frac{1}{\varepsilon}\|A\|_{1,2,d}$. Then the probability that an index $i$ satisfies $\|B_i\|_{1,2,d} \ge \frac{1}{\varepsilon}\|A\|_{1,2,d}$ and that $\|B_i\|_{1,2,d} > \|B_j\|_{1,2,d}$ for all other $j \in [n]$ is not the same as the desired distribution. Fortunately, we show that this happens only with small probability, slightly perturbing the probability of returning each $i \in [n]$. Another possibility is that the error in COUNTSKETCH is large enough to misidentify whether $\|B_i\|_{1,2,d} \ge \frac{1}{\varepsilon}\|A\|_{1,2,d}$. Using a statistical test, this case can usually be identified, and so the algorithm is prevented from outputting a sample in this case. Crucially, the probability that the algorithm is aborted by the statistical test is roughly independent of which index achieves the maximum. As a result, the probability of returning each $i \in [n]$, conditioned on not aborting, is within a $(1 \pm O(\varepsilon))$ factor of $\frac{\|A_i\|_{1,2,d}}{\|A\|_{1,2,d}}$. We show that the probability that the algorithm succeeds is $\Theta(\varepsilon)$, so running $O\!\left(\frac{1}{\varepsilon}\log\frac{1}{\varepsilon}\right)$ instances of the algorithm suffices to output some index from the desired distribution with constant probability, or abort otherwise.
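The scaling trick is easy to simulate against exact block norms. The toy sketch below (the helper names `one_round` and `sample` are ours, and we use exact norms where the streaming algorithm only has sketched estimates, ignoring the abort test) checks that accepted indices are distributed roughly proportionally to the block norms:

```python
import numpy as np

def one_round(w, eps, rng):
    """One attempt of the scaling trick. w[i] plays the role of ||A_i x||_2,
    so w.sum() is ||Ax||_{1,2,d}. Accept argmax_i w_i / t_i only when it
    clears the threshold (1/eps) * ||Ax||_{1,2,d}."""
    W = w.sum()
    t = np.maximum(rng.random(len(w)), 1e-12)  # uniform scaling factors t_i
    B = w / t                                  # scaled norms ||B_i x||_2
    j = int(np.argmax(B))
    return j if B[j] >= W / eps else None

def sample(w, eps, rng, max_rounds=10000):
    """Repeat rounds until one succeeds; success probability is Theta(eps)."""
    for _ in range(max_rounds):
        j = one_round(w, eps, rng)
        if j is not None:
            return j
    return None
```

Each round succeeds with probability $\Theta(\varepsilon)$, and conditioned on success the returned index is (up to the small collision perturbation discussed above) distributed proportionally to the norms.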
Because the underlying data structure is a linear sketch, it is also robust to post-processing multiplication by any vector $x \in \mathbb{R}^d$. Finally, we note that although our presentation refers to the scaling factors $t_i$ as independent random variables, our analysis shows that they only need to be $O(1)$-wise independent, and thus we can generate the scaling factors in small space in the streaming model. We give the $L_{1,2,d}$ sampler in Algorithm 5.

Algorithm 5 $L_{1,2,d}$ Sampler
Input: Matrix $A \in \mathbb{R}^{nd \times d}$ with $A = A_1 \circ \cdots \circ A_n$, where each $A_i \in \mathbb{R}^{d \times d}$; vector $x \in \mathbb{R}^{d}$ that arrives after processing $A$; constant parameter $\varepsilon > 0$.
Output: Noisy block $A_i x$ of $Ax$, sampled roughly proportional to $\|A_i x\|_2$.
1: Pre-processing Stage:
2: $b \leftarrow \Omega\!\left(\frac{1}{\varepsilon^2}\right)$, $r \leftarrow \Theta(\log n)$ with sufficiently large constants.
3: For $i \in [n]$, generate independent scaling factors $t_i \in [0,1]$ uniformly at random.
4: Let $B$ be the matrix consisting of the matrices $B_i = \frac{1}{t_i}A_i$.
5: Let ESTIMATOR and AMS track the $L_{1,2,d}$ norm of $Ax$ and the Frobenius norm of $Bx$, respectively.
6: Let COUNTSKETCH be an $r \times b$ table, where each entry is a matrix in $\mathbb{R}^{d \times d}$.
7: Process $A$: for each submatrix $A_i$ do
9:   Update COUNTSKETCH with $B_i = \frac{1}{t_i}A_i$.
10:  Update linear sketch ESTIMATOR with $A_i$.
11:  Update linear sketch AMS with $B_i = \frac{1}{t_i}A_i$.
12: Process $x$: Post-process $x$ in AMS, COUNTSKETCH, and ESTIMATOR.
13: Sample a submatrix:
14: Use ESTIMATOR to compute $F$ with $\|Ax\|_{1,2,d} \le F \le 2\|Ax\|_{1,2,d}$.
  ...
20: Return $r = t_i r_i$.

We first show that the probability that Algorithm 5 returns FAIL is independent of which index $i \in [n]$ achieves $\mathrm{argmax}_{i \in [n]} \frac{1}{t_i}\|A_i x\|_2$.

Lemma A.3 Let $i \in [n]$ and fix a value of $t_i \in [0,1]$ chosen uniformly at random. Then conditioned on the value of $t_i$, $\Pr\left[S > F\sqrt{\log\frac{1}{\varepsilon}}\right] = O(\varepsilon) + \frac{1}{\mathrm{poly}(n)}$.

Proof: We first observe that if we upper bound $S$ by $4\|(Bx)_{\mathrm{tail}(2/\varepsilon^2)}\|_2$ and lower bound $F$ by $\|Ax\|_{1,2,d}$, then it suffices to show that the probability of $4\|(Bx)_{\mathrm{tail}(2/\varepsilon^2)}\|_2 > \sqrt{\log\frac{1}{\varepsilon}}\,\|Ax\|_{1,2,d}$ is small.
Thus we define $E_1$ as the event that:
(1) $\|Ax\|_{1,2,d} \le F \le 2\|Ax\|_{1,2,d}$
(2) $\|Bx - M\|_2 \le S \le 2\|Bx - M\|_2$
(3) $\|(Bx)_{\mathrm{tail}(2/\varepsilon^2)}\|_2 \le \|Bx - M\|_2 \le 2\|(Bx)_{\mathrm{tail}(2/\varepsilon^2)}\|_2$
Note that by Theorem 2.1, Lemma A.2, and Lemma A.1, $E_1$ holds with high probability. Let $U = \|Ax\|_{1,2,d}$. For each block $A_j x$, we define $y_j$ to be the indicator variable for whether the scaled block $B_j x$ is heavy, so that $y_j = 1$ if $\|B_j x\|_2 > U$ and $y_j = 0$ otherwise. We also define $z_j \in [0,1]$ as a scaled random variable measuring whether $B_j x$ is light and how much squared mass it contributes, $z_j = \frac{1}{U^2}\|B_j x\|_2^2(1 - y_j)$. Let $Y = \sum_{j \neq i} y_j$ be the total number of heavy blocks besides $B_i x$ and $Z = \sum_{j \neq i} z_j$ be the total scaled squared mass of the light blocks. Let $h \in \mathbb{R}^{nd}$ be the vector that contains the heavy blocks, so that coordinates $(j-1)d+1$ through $jd$ of $h$ correspond to $B_j x$ if $y_j = 1$ and are all zeros otherwise. Hence, $h$ contains at most $Y + 1$ nonzero blocks and thus at most $(Y+1)d$ nonzero entries. Moreover, $U^2 Z = \|Bx - h\|_2^2$ and $\|(Bx)_{\mathrm{tail}(2/\varepsilon^2)}\|_2 \le U\sqrt{Z}$ unless $Y \ge \frac{2}{\varepsilon^2}$. Thus if we define $E_2$ to be the event that $Y \ge \frac{2}{\varepsilon^2}$ and $E_3$ to be the event that $Z \ge \frac{1}{16U^2}\log\left(\frac{1}{\varepsilon}\right)\|Ax\|_{1,2,d}^2$, then $\neg E_2 \wedge \neg E_3$ implies $4\|(Bx)_{\mathrm{tail}(2/\varepsilon^2)}\|_2 \le \sqrt{\log\frac{1}{\varepsilon}}\,\|Ax\|_{1,2,d}$, so it suffices to bound the probability of each of the events $E_2$ and $E_3$ by $O(\varepsilon)$. Intuitively, if the number of heavy blocks is small ($\neg E_2$) and the total contribution of the light blocks is small ($\neg E_3$), then the tail estimator is small, so the probability of failure due to the tail estimator is small.

To analyze $E_2$, note that $y_j = 1$ if and only if $\frac{1}{t_j}\|A_j x\|_2 > U$, so $\mathbb{E}[y_j] = \frac{\|A_j x\|_2}{U}$ and thus $\mathbb{E}[Y] \le 1$, since $Y = \sum_{j \neq i} y_j$ and $U = \|Ax\|_{1,2,d} = \sum_j \|A_j x\|_2$. We also have $\mathrm{Var}(Y) \le 1$, so that $\Pr[E_2] = O(\varepsilon)$ for sufficiently small $\varepsilon$, by Chebyshev's inequality. To analyze $E_3$, recall that $z_j = \frac{1}{U^2}\|B_j x\|_2^2(1 - y_j)$. Thus $z_j > 0$ only if $y_j = 0$ or, equivalently, $\|B_j x\|_2 \le U$. Since $B_j x = \frac{1}{t_j}A_j x$, then $z_j > 0$ only if $t_j \ge \frac{\|A_j x\|_2}{\|Ax\|_{1,2,d}}$.
Therefore, $\mathbb{E}[z_j] \le \int_{\|A_j x\|_2/\|Ax\|_{1,2,d}}^{\infty} z_j\, dt_j = \int_{\|A_j x\|_2/\|Ax\|_{1,2,d}}^{\infty} \frac{1}{t_j^2}\cdot\frac{1}{U^2}\|A_j x\|_2^2\, dt_j \le \frac{\|A_j x\|_2}{\|Ax\|_{1,2,d}}$. Since $Z = \sum_{j \neq i} z_j$, then $\mathbb{E}[Z] \le 1$ and similarly $\mathrm{Var}(Z) \le 1$. Hence by Bernstein's inequality, $\Pr\left[Z > \frac{1}{16}\log\frac{1}{\varepsilon}\right] = O(\varepsilon)$, so then $\Pr[E_3] = O(\varepsilon)$. Thus $\Pr[\neg E_1 \vee E_2 \vee E_3] = O(\varepsilon) + \frac{1}{\mathrm{poly}(n)}$, as desired. $\Box$

We now show that Algorithm 5 outputs a noisy approximation to $A_i x$, where $i \in [n]$ is drawn from approximately the correct distribution, i.e., the probability of failure does not correlate with the index that achieves the maximum value.

Lemma A.4 For a fixed value of $F$, the probability that Algorithm 5 outputs (noisy) submatrix $A_i x$ is $(1 \pm O(\varepsilon))\frac{\varepsilon\|A_i x\|_2}{F} + \frac{1}{\mathrm{poly}(n)}$.

Given an input matrix $A$, Nelson & Nguyen (2013) randomly sample a sparse matrix $\Pi_1$ with $\tilde{O}(d^2)$ rows and $\tilde{O}(1)$ nonzero signs per column, setting the remaining entries to zero. The algorithm maintains $\Pi_1 A$ and, in post-processing, computes $R^{-1}$ so that $\Pi_1 A R^{-1}$ has orthonormal columns. Previous work of Drineas et al. (2012) had shown that the squared row norms of $AR^{-1}$ are $(1+\varepsilon)$-approximations to the leverage scores of $A$. Hence, for a JL matrix $\Pi_2$ that gives $(1+\varepsilon)$-approximations to the row norms of $AR^{-1}$, we can compute $A(R^{-1}\Pi_2)$ and output the row norms of $AR^{-1}\Pi_2$ as the approximate leverage scores for each row. Due to the sparsity of $\Pi_1$ and $\Pi_2$, the total runtime is $\tilde{O}\!\left(\frac{1}{\varepsilon^2}\cdot\mathrm{nnz}(A)\right)$. Computing $R^{-1}$ takes additional $\tilde{O}\!\left(\frac{d^\omega}{\varepsilon^2}\right)$ runtime. Now since the squared row norms of $AR^{-1}$ are $(1+\varepsilon)$-approximations to the leverage scores of $A$, it suffices to take the rows of $AR^{-1}$ with large squared norms. To that end, we randomly sample a CountSketch matrix $T$ and maintain $TA$. Once $R^{-1}$ is computed, we can right multiply in post-processing to obtain $TAR^{-1}$, similar to Algorithm 5. It follows that any row of $TAR^{-1}$ that is at least $\frac{1}{200Td}$-heavy (with respect to squared Frobenius norm) has leverage score at least $\frac{1}{100Td}$.
Thus we can obtain these rows by querying the CountSketch data structure while using space $\tilde{O}(Td)$. Due to the sparsity of the CountSketch matrix, the total runtime is $\tilde{O}\!\left(\mathrm{nnz}(A) + \frac{d^\omega}{\varepsilon^2}\right)$. Finally, Mahabadi et al. (2020) show that the error guarantee on each reported heavy row satisfies the requirement of Theorem 2.3. By reporting the outer products of each of the heavy rows rather than the heavy rows themselves, we obtain Theorem 3. Otherwise, for each $j \in [\beta]$ such that every index $i \in [n]$ with $h(i) = j$ satisfies $\|A_i x_t\|_2^2 < \frac{1}{100Td}\|Ax_t\|_F^2$, we have $\sum_{i: h(i)=j}\|A_i x_t\|_2^2 \le \frac{1}{100T}\|Ax_t\|_F^2$ with probability at least $\frac{99}{100}$, by Bernstein's inequality and a union bound over $j \in [\beta]$, for $\beta = \Theta(Td)$ with a sufficiently large constant. Intuitively, by excluding the hash indices containing "heavy" matrices, each remaining hash index contains only a small fraction of the mass with high probability. Then the probability that any $j \in [\beta]$ with $\sum_{i: h(i)=j}\|A_i x_t\|_2 \le \frac{1}{10T}\|Ax_t\|_{1,2,d}$ is sampled more than once is at most $\frac{1}{100T}$ for any $t \in [T]$, provided there is no row in any $A_i$ with $h(i) = j$ whose $\ell_2$ leverage score is at least $\frac{1}{100Td}$. Thus, the probability that some bucket $j \in [\beta]$ is sampled twice across the $T$ steps is at most $\frac{\beta}{(100T)^2} \le \frac{1}{100}$. In summary, we maintain $T$ separate instances of $L_{1,2,d}$ samplers for the heavy matrices and one $L_{1,2,d}$ sampler for each hash index that does not contain a heavy matrix. With probability at least $\frac{98}{100}$, any hash index not containing a heavy matrix is sampled at most once, so each time $t \in [T]$ has access to a fresh $L_{1,2,d}$ sampler. $\Box$
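The leverage-score recipe above ($\Pi_1 A \to R$, then row norms of $AR^{-1}\Pi_2$) can be sketched in numpy. This is a toy version, not input sparsity time: dense Gaussian matrices stand in for the sparse sketches of Nelson & Nguyen (2013), and the function name `approx_leverage_scores` is ours:

```python
import numpy as np

def approx_leverage_scores(A, sketch_rows=None, jl_cols=20, seed=0):
    """Sketch-based approximate leverage scores: compute R from a QR of
    Pi1 @ A, then estimate the squared row norms of A @ R^{-1} with a
    JL matrix Pi2, so only A @ (R^{-1} Pi2) is ever formed."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = sketch_rows or 4 * d * d
    Pi1 = rng.normal(size=(m, n)) / np.sqrt(m)   # subspace embedding (dense stand-in)
    _, R = np.linalg.qr(Pi1 @ A)                 # Pi1 A = QR, so A R^{-1} has ~orthonormal cols
    Pi2 = rng.normal(size=(d, jl_cols)) / np.sqrt(jl_cols)  # JL row-norm estimator
    S = A @ np.linalg.solve(R, Pi2)              # n x jl_cols, equals (A R^{-1}) Pi2
    return (S * S).sum(axis=1)                   # approximate squared row norms
```

The estimates concentrate around the true leverage scores (the squared row norms of the orthonormal factor of $A$), with per-row distortion governed by `jl_cols` and the quality of the subspace embedding.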

B EMPIRICAL EVALUATIONS

We again emphasize that our primary contribution is the theoretical design of a nearly input sparsity time streaming algorithm that simulates the optimal importance sampling distribution for variance reduction in stochastic gradient descent without computing the full gradient. Thus our theory is optimized to minimize the number of SGD iterations without asymptotic wall-clock time penalties; we do not attempt to further optimize wall-clock runtimes. Nevertheless, in this section we implement a scaled-down version of our algorithm and compare its performance across multiple iterations on large-scale real-world data sets against SGD with uniform sampling, on both linear regression and support-vector machines (SVMs). Because most rows of real-world data have roughly uniformly small leverage scores, we assume that no bucket contains a row with a significantly large leverage score, and thus our implementation of the importance sampling algorithm does not create multiple samplers for any bucket. By similar reasoning, our implementation uniformly samples a number of indices $i$ and estimates $\|Ax\|_{1,2,d} = \sum_j \|A_j x\|_2$ by scaling up the sampled values $\|A_i x\|_2$. Observe that although these simplifications decrease the wall-clock running time and the total space used by our algorithm, they can only decrease the quality of our solution at each SGD iteration. Nevertheless, our implementations significantly improve upon SGD with uniform sampling. The experiments in this section were performed on a Dell Inspiron 15-7579 with an Intel Core i7-7500U dual-core processor, clocked at 2.70 GHz and 2.90 GHz, in contrast to the logistic regression experiments, which were performed on a GPU.

Linear Regression. We performed linear regression on the CIFAR-10 dataset to compare the performance of our importance sampling algorithm to the uniform sampling SGD algorithm. We trained using a data batch of 10000 points and 3072 features and tested the performance on a separate batch of data points.
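The uniform-sampling estimator used in this simplified implementation is a one-liner; a minimal sketch (the helper name `estimate_block_norm_sum` is ours):

```python
import numpy as np

def estimate_block_norm_sum(blocks, x, num_samples=200, seed=0):
    """Estimate ||Ax||_{1,2,d} = sum_j ||A_j x||_2 by uniformly sampling
    block indices and scaling the sample mean up by n, as in the
    simplified implementation described above."""
    rng = np.random.default_rng(seed)
    n = len(blocks)
    idx = rng.integers(0, n, size=num_samples)
    return n * float(np.mean([np.linalg.norm(blocks[i] @ x) for i in idx]))
```

When the block norms are roughly uniform, as assumed above for real-world data, a small number of samples already gives a low-variance estimate of the normalizing constant.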
We aggregated the objective values across 10 separate instances. Each instance generated a random starting location as an initial position for both importance sampling and uniform sampling. We then ran 40 iterations of SGD for each algorithm and observed the objective value on the test data at each iteration. Finally, we computed the average performance at each iteration across the 10 instances. As we ran our algorithm for 40 iterations, we created 1600 buckets that partitioned the data values for the importance sampling algorithm. The sampled gradients were generally large in magnitude for both importance sampling and uniform sampling, and thus we required small step-sizes. For step-sizes $\eta = 1\times10^{-13}$, $\eta = 5\times10^{-12}$, and $\eta = 1\times10^{-12}$, the objective value of the solution output by our importance sampling algorithm quickly and significantly improved over that of uniform sampling. Our algorithm's performance is much more sensitive to the choice of larger step-sizes: step-sizes larger than $5\times10^{-11}$ generally caused the importance sampling algorithm to diverge, while the uniform sampling algorithm still slowly converged. We give our results in Figure 3.

Support-Vector Machines. We also compared the performance of our importance sampling algorithm to the uniform sampling SGD algorithm using support-vector machines (SVMs) on the a9a Adult data set collected by UCI and retrieved from LibSVM (Chang & Lin, 2011). The features correspond to responses from the 1994 Census database, and the prediction task is to determine whether a person makes over 50K USD a year. We trained using a data batch of 32561 points and 123 features and tested the performance on a separate batch of 16281 data points.
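A toy version of the uniform-versus-importance comparison on least squares can be written in a few lines. For illustration we compute exact per-example gradient norms, which is precisely the expensive step the streaming sketch avoids, so this only demonstrates the sampling distribution and the $\frac{\eta_t}{np_{i_t}}$ reweighting from the introduction; the function name `sgd` is ours:

```python
import numpy as np

def sgd(A, b, steps, eta, importance, seed=0):
    """SGD on F(x) = (1/n) sum_i (a_i . x - b_i)^2 / 2 with either uniform
    sampling or importance sampling proportional to ||grad f_i(x)||_2.
    Exact gradient norms are used for illustration only."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        grad_norms = np.abs(A @ x - b) * np.linalg.norm(A, axis=1)
        if importance and grad_norms.sum() > 0:
            p = grad_norms / grad_norms.sum()
        else:
            p = np.full(n, 1.0 / n)
        i = rng.choice(n, p=p)
        g = (A[i] @ x - b[i]) * A[i]          # grad f_i(x)
        x -= eta / (n * p[i]) * g             # unbiased: E[g / (n p_i)] = grad F(x)
    return x
```

Note that with $p_i \propto \|\nabla f_i\|_2$, every importance-sampled step has norm exactly $\eta$ times the mean gradient norm, whereas uniform sampling occasionally takes very large steps; this is the variance reduction the sampler buys.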
We assume the data is not linearly separable and thus use the hinge loss, so that we aim to minimize $\frac{1}{n}\sum_{i=1}^n \max\!\left(0,\, 1 - y_i(w \cdot X_i - b)\right) + \lambda\|w\|_2^2$, where $X$ is the data matrix, $y_i$ is the corresponding label, and $w$ is the desired maximum-margin hyperplane. For each evaluation, we generated 10 random initial positions shared by both importance sampling and uniform sampling. We then ran 75 iterations of SGD for each algorithm, creating 1125 buckets for the importance sampling algorithm, and computed the average performance at each iteration across the 10 instances. The sampled gradients were generally smaller than those from linear regression on CIFAR-10, and thus we were able to choose significantly larger step-sizes. Nevertheless, our algorithm's performance was sensitive to both the step-size and the regularization parameter. For step-sizes $\eta = 0.25$, $\eta = 0.5$ and regularization parameters $\lambda = 0$, $\lambda = 0.001$, and $\lambda = 0.0001$, the objective value of the solution output by our importance sampling algorithm quickly and significantly improved over that of uniform sampling. We give our results in Figure 5. Our algorithm's performance degraded with larger values of $\lambda$, as well as with step-sizes larger than $\eta = 1$. We also compared step-size $\eta = 1$ and regularization parameters $\lambda = 0$, $\lambda = 0.001$, and $\lambda = 0.0001$ with a hybrid sampling scheme that selects the better gradient between importance sampling and uniform sampling at each step, as well as a hybrid sampling scheme that uses a few steps of importance sampling followed by uniform sampling in the remaining steps. Our experiments show that the hybrid sampling algorithms perform better at the beginning, and thus our importance sampling algorithm may be used in conjunction with existing techniques in offline settings to accelerate SGD.
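The hinge-loss subgradient used in these SVM experiments can be sketched as follows (uniform sampling shown; the importance-sampling variant only changes how $i$ is drawn and adds the $\frac{1}{np_i}$ scaling; the helper names are ours):

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    """(1/n) sum_i max(0, 1 - y_i (w . X_i - b)) + lam ||w||_2^2"""
    margins = 1.0 - y * (X @ w - b)
    return float(np.mean(np.maximum(0.0, margins)) + lam * (w @ w))

def hinge_subgradient(w, b, X, y, i, lam):
    """Subgradient at example i: the hinge term contributes only when the
    margin constraint y_i (w . X_i - b) >= 1 is violated."""
    if y[i] * (X[i] @ w - b) < 1.0:
        gw = -y[i] * X[i] + 2.0 * lam * w
        gb = y[i]
    else:
        gw = 2.0 * lam * w
        gb = 0.0
    return gw, gb
```

Running subgradient descent with these updates on separable toy data drives the hinge term to zero, leaving only the small regularization term.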
Surprisingly, the hybrid sampling algorithms do not necessarily remain better than our importance sampling algorithm, indicating that even if uniform sampling were run for a significantly larger number of iterations, its performance might not exceed that of our importance sampling algorithm. We give our results in Figure 6. Finally, we compare wall-clock times of each of the aforementioned sampling schemes with step-size $\eta = 1$ and regularization $0$ across 100 iterations. Our results in Figure 4 show that, as expected, uniform sampling has the fastest running time. However, each iteration of importance sampling takes about as long as 15 iterations of uniform sampling, which empirically shows that even when comparing wall-clock times, rather than total numbers of SGD iterations, the performance of our importance sampling algorithm still surpasses that of uniform sampling. Moreover, the runtime experiments reveal the main bottleneck of our experiments: each of the 100 iterations took approximately 70 seconds on average, after including the evaluation of the objective on each gradient step.

Fig. 4: Runtimes comparison for SVM on the a9a Adult dataset from LibSVM/UCI with step size 1.0 and regularization 0, averaged over 100 iterations: importance sampling (in blue squares), uniform sampling (in red triangles), hybrid sampling that chooses the better gradient at each step (in purple circles), and hybrid sampling that performs 25 steps of importance sampling followed by uniform sampling (in teal X's). By comparison, the average total runtime over 100 iterations was 72.5504 seconds, including the computation of the scores.
Fig. 6: Comparison of importance sampling (in blue squares), uniform sampling (in red triangles), hybrid sampling that chooses the better gradient at each step (in purple circles), and hybrid sampling that performs 25 steps of importance sampling followed by uniform sampling (in teal X's) over various step-sizes for SVM on the a9a Adult dataset from LibSVM/UCI, averaged over 10 iterations.



for any query $x \in \mathbb{R}^d$. Turnstile updates are defined as sequential updates to the entries of $A$.



5: Let $h_i : [n] \to [b]$ be 4-wise independent for $i \in [r]$.
6: Let $u_{i,j}$ be the all-zeros vector for each $i \in [r]$, $j \in [b]$.
7: for each $j = 1$ to $n$ do
8:

The algorithm requires a single pass over $A = a_1 \circ \cdots \circ a_n$ (possibly through turnstile updates), uses $\tilde{O}\!\left(\mathrm{nnz}(A) + \frac{d^\omega}{\varepsilon^2}\right)$ runtime (where $\omega$ denotes the exponent of square matrix multiplication) and $\tilde{O}(d^2)$ space, and succeeds with probability $1 - \frac{1}{\mathrm{poly}(n)}$.

$j \in [\beta]$ with probability $p_j = \frac{q_j}{\sum_{i \in [\beta]} q_i}$. 12:

Fig. 3: Comparison of importance sampling (in blue squares) and uniform sampling (in red triangles) over various step-sizes for linear regression on CIFAR-10.


Fig. 5: Comparison of importance sampling (in blue squares) and uniform sampling (in red triangles) over various step-sizes for SVM on the a9a Adult dataset from LibSVM/UCI, averaged over 10 iterations.

Theorem 2.1 For a fixed $\varepsilon > 0$ and polynomial $f$, there exists a data structure ESTIMATOR that outputs a $(1+\varepsilon)$-approximation to $\sum_{i=1}^n f(\langle a_i, x\rangle) \cdot a_i$, requires a single pass over $A = a_1 \circ \cdots \circ a_n$ (possibly through turnstile updates), and can be built in $\tilde{O}\!\left(\mathrm{nnz}(A) + \frac{d^\omega}{\varepsilon^2}\right)$ time.

The data structure requires a single pass over $A = a_1 \circ \cdots \circ a_n$ (possibly through turnstile updates) and can be built in $\tilde{O}\!\left(\mathrm{nnz}(A) + \frac{d^\omega}{\varepsilon^2}\right)$ time.

parameter $T$ for the number of SGD steps.
6: Run LEVERAGE to find a set $L_0$ of row indices and corresponding (noisy) vectors.
7: Gradient Descent Stage:
8: Randomly pick starting location $x_0$.
9: for $t = 1$ to $T$ do


There exists an algorithm HLEVERAGE that returns all indices $i \in [n]$ such that

parameter $T$ for the number of sampled Hessians, oracle $O$ that performs the update rule.
Output: $T$ approximate Hessians.
4: Let $B_j$ be the matrix formed by the rows $a_i$ of $A$ with $h(i) = j$, for each $j \in [\beta]$.
5: Create an instance HESTIMATOR$_j$ and HSAMPLER$_j$ for each $B_j$ with $j \in [\beta]$, with $\varepsilon = \frac{1}{2}$.
6: Run HLEVERAGE to find a set $L_0$ of row indices and corresponding (noisy) outer products.
7: Second-Order Optimization Stage:
8: Randomly pick starting location $x_0$.
9: for $t = 1$ to $T$ do

$\|(Ax)_{\mathrm{tail}(2/\varepsilon^2)}\|_2$ with probability at least $1-\delta$, where $\bar{Y} = Y - Y_{\mathrm{tail}(2/\varepsilon^2)}$ denotes the top $\frac{2}{\varepsilon^2}$ blocks of $Y$ by $\ell_2$ norm. Consider the estimate of $\|A_i x\|_2$ in row $\alpha$ of the CountSketch table $T$. Then $h_\alpha(i)$ is the bucket of $T$ in row $\alpha$ to which $A_i x$ hashes. Let $E_1$ be the event that the $\frac{2}{\varepsilon^2}$

$\le 2\|Ax\|_{1,2,d}$.
15: Extract the $\frac{2}{\varepsilon^2}$ (noisy) blocks of $d$ rows of $Bx$ with the largest estimated $\ell_2$ norms from COUNTSKETCH.
16: Let $M \in \mathbb{R}^{nd \times 1}$ be the $\frac{2}{\varepsilon^2}$-block sparse vector consisting of these top (noisy) blocks.
17: Use AMS to compute $S$ with $\|Bx - M\|_2 \le S \le 2\|Bx - M\|_2$.
18: Let $r_i$ be the (noisy) block of $d$ rows in COUNTSKETCH with the largest norm.
19: if $S > F\sqrt{\log\frac{1}{\varepsilon}}$ or $\|r_i\|_2 < \frac{1}{\varepsilon}F$ then return FAIL.

A.3 APPROXIMATE SGD WITH IMPORTANCE SAMPLING

Proof of Lemma 2.4: For any $t \in [T]$ and $i \in [n]$, there are at most $d$ rows in $A_i$ whose leverage score is at least $\frac{1}{100Td}$, since there are $d$ rows in $A_i$. Algorithm 2 calculates a 2-approximation to each leverage score and maintains $T$ separate instances of the $L_{1,2,d}$ samplers for any matrix containing a row with approximate leverage score at least $\frac{1}{100Td}$. Thus for these indices $i \in [n]$, we maintain $T$ separate instances of the $L_{1,2,d}$ samplers for $A_i$ by explicitly maintaining the heavy row.


Proof: Let $E$ be the event that $t_i < \frac{\varepsilon\|A_i x\|_2}{F}$, so that $\Pr[E] = \frac{\varepsilon\|A_i x\|_2}{F}$. Let $E_1$ be the event that COUNTSKETCH, AMS, or ESTIMATOR fails, so that $\Pr[E_1] = \frac{1}{\mathrm{poly}(n)}$ by Lemma A.2, Lemma A.1, and Theorem 2.1. Let $E_2$ be the event that $S > F\sqrt{\log\frac{1}{\varepsilon}}$, so that $\Pr[E_2] = O(\varepsilon)$ by Lemma A.3. Let $E_3$ be the event that multiple blocks $B_j x$ exceeding the threshold are observed in the CountSketch data structure and $E_4$ be the event that $\|B_i x\|_2$ exceeds the threshold but is not reported due to noise in the CountSketch data structure. Observe that $E_3$ and $E_4$ are essentially two sides of the same coin, where error is incurred due to the inaccuracies of CountSketch.

To analyze $E_3$, note that a block $j \neq i$ can be reported as exceeding the threshold only if $\|B_j x\|_2 \ge \frac{1}{\varepsilon}F - F\sqrt{\log\frac{1}{\varepsilon}}$, which occurs with probability at most $O\!\left(\frac{\varepsilon\|A_j x\|_2}{F}\right)$. The claim for $E_3$ then follows by a union bound over all $j \neq i$. To analyze $E_4$, we first condition on $\neg E_1$ and $\neg E_2$, so that $\|Bx - M\|_2 \le S \le F\sqrt{\log\frac{1}{\varepsilon}}$. Then by Lemma A.2, the estimate of $B_i x$ output by the sampler satisfies the error guarantee of Lemma A.2. Hence, $E_4$ can only occur for $t_i$ in a narrow range near the threshold, which occurs with probability at most $O(\varepsilon^2)$. To put things together, $E$ occurs with probability $\frac{\varepsilon\|A_i x\|_2}{F}$, in which case the $L_{1,2,d}$ sampler should output $A_i x$. However, this may not happen due to any of the events $E_1$, $E_2$, $E_3$, $E_4$; the claim follows by a union bound over these events. $\Box$

Thus we have the following full guarantees for our $L_{1,2,d}$ sampler.

Theorem A.5 Given $\varepsilon > 0$, there exists an algorithm that takes a matrix $A \in \mathbb{R}^{nd \times d}$, which can be written as $A = A_1 \circ \cdots \circ A_n$, where each $A_i \in \mathbb{R}^{d \times d}$. After $A$ is processed, the algorithm is given a query vector $x \in \mathbb{R}^d$ and outputs a (noisy) vector $A_i x$ with probability $(1 \pm O(\varepsilon))\frac{\|A_i x\|_2}{\|Ax\|_{1,2,d}} + \frac{1}{\mathrm{poly}(n)}$, uses $\tilde{O}\!\left(\frac{d^2}{\varepsilon^2}\left(\log^2 n + \log\frac{1}{\delta}\right)\right)$ bits of space, and succeeds with probability at least $1-\delta$.

Proof: By Lemma A.4 and Theorem 2.1, $\|Ax\|_{1,2,d} \le F \le 2\|Ax\|_{1,2,d}$ with high probability, and so each vector $A_i x$ is sampled with probability $(1 \pm O(\varepsilon))\frac{\|A_i x\|_2}{\|Ax\|_{1,2,d}} + \frac{1}{\mathrm{poly}(n)}$, conditioned on the sampler succeeding. The probability that the sampler succeeds is $\Theta(\varepsilon)$, so the sampler can be repeated $O\!\left(\frac{1}{\varepsilon}\log n\right)$ times to obtain probability of success at least $1 - \frac{1}{\mathrm{poly}(n)}$.

A.2 LEVERAGE SCORE SAMPLER

Our starting point is the input sparsity time algorithm of (Nelson & Nguyen, 2013) for approximating the leverage scores, which is in turn a modification of (Drineas et al., 2012; Clarkson & Woodruff, 2013).

