ADAPTIVE SINGLE-PASS STOCHASTIC GRADIENT DESCENT IN INPUT SPARSITY TIME

Abstract

We study sampling algorithms for variance reduction methods in stochastic optimization. Although stochastic gradient descent (SGD) is widely used for large-scale machine learning, it sometimes experiences slow convergence rates due to the high variance of uniform sampling. In this paper, we introduce an algorithm that approximately samples a gradient from the optimal distribution for a common finite-sum form with n terms, while making just a single pass over the data, using input-sparsity time and Õ(Td) space. Our algorithm can be implemented in big data models such as the streaming and distributed models. Moreover, we show that our algorithm generalizes to approximately sampling Hessians and thus provides variance reduction for second-order methods as well. We demonstrate the efficiency of our algorithm on large-scale datasets.

1. INTRODUCTION

There has recently been tremendous progress in variance reduction methods for stochastic gradient descent (SGD) for the standard convex finite-sum optimization problem
$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),$$
where $f_1, \ldots, f_n : \mathbb{R}^d \to \mathbb{R}$ are convex functions that commonly represent loss functions. Whereas gradient descent (GD) performs the update rule $x_{t+1} = x_t - \eta_t \nabla F(x_t)$ on the iterate $x_t$ at iterations $t = 1, 2, \ldots$, SGD (Robbins & Monro, 1951; Nemirovsky & Yudin, 1983; Nemirovski et al., 2009) picks $i_t \in [n]$ in iteration $t$ with probability $p_{i_t}$ and performs the update rule $x_{t+1} = x_t - \frac{\eta_t}{n p_{i_t}} \nabla f_{i_t}(x_t)$, where $\nabla f_{i_t}$ is the gradient (or a subgradient) of $f_{i_t}$ and $\eta_t$ is a predetermined learning rate. Effectively, training example $i_t$ is sampled with probability $p_{i_t}$ and the model parameters are updated using the selected example. The SGD update rule requires computing only a single gradient at each iteration and provides an unbiased estimator of the full gradient, whereas GD evaluates $n$ gradients at each iteration, which is prohibitively expensive for large $n$. However, since SGD is often performed with uniform sampling, so that the probability $p_{i,t}$ of choosing index $i \in [n]$ at iteration $t$ is $p_{i,t} = \frac{1}{n}$ at all times, the variance introduced by the randomness of sampling a specific vector function can be a bottleneck for the convergence rate of the iterative process. Thus the subject of variance reduction beyond uniform sampling has been well-studied in recent years (Roux et al., 2012; Johnson & Zhang, 2013; Defazio et al., 2014; Reddi et al., 2015; Zhao & Zhang, 2015; Daneshmand et al., 2016; Needell et al., 2016; Stich et al., 2017; Johnson & Guestrin, 2018; Katharopoulos & Fleuret, 2018; Salehi et al., 2018; Qian et al., 2019).
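As an illustration of the update rule above, here is a minimal NumPy sketch of SGD with an arbitrary sampling distribution $p$; the least-squares instance, step size, and iteration count are illustrative choices, not taken from the paper:

```python
import numpy as np

def sgd(grads, x0, probs, lr, steps, rng=None):
    """Run SGD where component i is drawn with probability probs[i].

    grads[i](x) returns the gradient of f_i at x. The reweighted update
    x <- x - (lr / (n * p_i)) * grad_i(x) keeps each step an unbiased
    estimate of the full-gradient step, for any valid distribution p.
    """
    rng = rng or np.random.default_rng(0)
    n = len(grads)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        i = rng.choice(n, p=probs)
        x = x - (lr / (n * probs[i])) * grads[i](x)
    return x

# Toy least-squares instance: f_i(x) = 0.5 * (<a_i, x> - b_i)^2,
# so grad_i(x) = (<a_i, x> - b_i) * a_i.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 2.0])
grads = [lambda x, i=i: (A[i] @ x - b[i]) * A[i] for i in range(3)]

# Uniform sampling (p_i = 1/n); this instance has minimizer (1, 1).
x = sgd(grads, np.zeros(2), np.full(3, 1.0 / 3.0), lr=0.1, steps=2000)
```

Passing a non-uniform `probs` array leaves the estimator unbiased because of the $\frac{1}{n p_i}$ reweighting; only its variance changes.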
A common technique to reduce variance is importance sampling, where the probabilities $p_{i,t}$ are chosen so that vector functions with larger gradients are more likely to be sampled. For $\mathrm{Var}(v) := \mathbb{E}\left[\|v\|_2^2\right] - \|\mathbb{E}[v]\|_2^2$ for a random vector $v$, uniform sampling with $p_{i,t} = \frac{1}{n}$ gives
$$\sigma_t^2 = \mathrm{Var}\left(\frac{1}{n p_{i_t,t}} \nabla f_{i_t}\right) = \frac{1}{n^2}\left(n \sum_{i=1}^n \|\nabla f_i(x_t)\|^2 - n^2 \cdot \|\nabla F(x_t)\|^2\right),$$
whereas importance sampling with $p_{i,t} = \frac{\|\nabla f_i(x_t)\|}{\sum_{j=1}^n \|\nabla f_j(x_t)\|}$ gives
$$\sigma_t^2 = \mathrm{Var}\left(\frac{1}{n p_{i_t,t}} \nabla f_{i_t}\right) = \frac{1}{n^2}\left(\left(\sum_{i=1}^n \|\nabla f_i(x_t)\|\right)^2 - n^2 \cdot \|\nabla F(x_t)\|^2\right),$$
which is at most $\frac{1}{n^2}\left(n \sum_{i=1}^n \|\nabla f_i(x_t)\|^2 - n^2 \cdot \|\nabla F(x_t)\|^2\right)$ by the Root-Mean-Square-Arithmetic-Mean Inequality, and can be significantly less. Hence the variance at each step is reduced, possibly substantially, e.g., Example 1.3 and Example 1.4, by performing importance sampling instead of uniform sampling. In fact, it follows from the Cauchy-Schwarz inequality that the above importance sampling distribution is the optimal distribution for variance reduction. However, computing this distribution requires computing all $n$ gradients in each round, which is exactly the expense SGD is meant to avoid.
Second-Order Methods. Although first-order methods such as SGD are widely used, they can be sensitive to the choice of hyperparameters, stagnate at high training errors, and have difficulty escaping saddle points. By incorporating second-order information such as curvature, second-order optimization methods are robust to several of these issues, such as ill-conditioning. For example, Newton's method can achieve a locally superlinear convergence rate under certain conditions, independent of the problem's condition number.
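The two variance formulas above can be checked numerically. The sketch below evaluates the estimator variance $\mathrm{Var}\left(\frac{1}{n p_{i_t}} \nabla f_{i_t}\right)$ for uniform and norm-proportional sampling on random gradient vectors; the dimensions and data distribution are illustrative:

```python
import numpy as np

# Rows of G play the role of the gradients grad_i at a fixed iterate.
rng = np.random.default_rng(1)
n, d = 50, 5
G = rng.normal(size=(n, d)) * rng.exponential(scale=3.0, size=(n, 1))
grad_F = G.mean(axis=0)
norms = np.linalg.norm(G, axis=1)

def estimator_variance(p):
    # E[||g||^2] for g = grad_i / (n * p_i) with i ~ p:
    # sum_i p_i * ||grad_i||^2 / (n p_i)^2 = (1/n^2) sum_i ||grad_i||^2 / p_i.
    second_moment = (norms**2 / p).sum() / n**2
    # Var(g) = E[||g||^2] - ||E[g]||^2, and E[g] = grad_F for any valid p.
    return second_moment - np.linalg.norm(grad_F) ** 2

var_uniform = estimator_variance(np.full(n, 1.0 / n))
var_importance = estimator_variance(norms / norms.sum())
```

By Cauchy-Schwarz, `var_importance` never exceeds `var_uniform`, and the gap widens when the gradient norms are highly non-uniform.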
Although naïve second-order methods are generally too slow compared to common first-order methods, stochastic Newton-type methods such as Gauss-Newton have been shown to be scalable in the scientific computing community (Roosta-Khorasani et al., 2014; Roosta-Khorasani & Mahoney, 2016a;b; Xu et al., 2019; 2020).
Our Contributions. We give a time-efficient algorithm that provably approximates the optimal importance sampling distribution using a small-space data structure. Remarkably, our data structure can be implemented in big data models such as the streaming model, in which it takes just a single pass over the data, and the distributed model, in which it requires just a single round of communication between the parties holding each loss function. For $\nabla F = \frac{1}{n} \sum_{i=1}^n \nabla f_i(x)$, where each $\nabla f_i = f(\langle a_i, x \rangle) \cdot a_i$ for some polynomial $f$ and vector $a_i \in \mathbb{R}^d$, let $\mathrm{nnz}(A)$ be the number of nonzero entries of $A := a_1 \bullet \ldots \bullet a_n$.¹ Thus for $T$ iterations, where $d \ll T \ll n$, GD has runtime Õ(T · nnz(A)) while our algorithm has runtime T · poly(d, log n) + Õ(nnz(A)), where we use Õ(·) to suppress polylogarithmic factors.
Theorem 1.1 Let $\nabla F = \frac{1}{n} \sum_{i=1}^n \nabla f_i(x)$, where each $\nabla f_i = f(\langle a_i, x \rangle) \cdot a_i$ for some polynomial $f$ and vector $a_i \in \mathbb{R}^d$, and let $\mathrm{nnz}(A)$ be the number of nonzero entries of $A := a_1 \bullet \ldots \bullet a_n$. For $d \ll T \ll n$, there exists an algorithm that performs $T$ steps of SGD and at each step samples a gradient within a constant factor of the optimal probability distribution. The algorithm requires a single pass over $A$ and uses Õ(nnz(A)) pre-processing time and Õ(Td) space.
Theorem 1.1 can be used to immediately obtain improved convergence guarantees for classes of functions whose convergence rate depends on the variance $\sigma_t^2$, such as $\mu$-smooth or strongly convex functions.
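For gradients of the product form in Theorem 1.1, the optimal sampling weight factors as $\|\nabla f_i\| = |f(\langle a_i, x \rangle)| \cdot \|a_i\|$. The sketch below computes this exact distribution directly, at O(nnz(A)) cost per iteration; this is the expensive baseline that the paper's single-pass data structure approximates up to a constant factor. The data, dimensions, and the choice of $f$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 10
A = rng.normal(size=(n, d))   # row i is a_i
x = rng.normal(size=d)        # current iterate

# Example: f_i(x) = <a_i, x>^2 has grad_i = 2 <a_i, x> * a_i, i.e. f(s) = 2s.
f = lambda s: 2.0 * s

# ||grad_i|| = |f(<a_i, x>)| * ||a_i||; normalizing gives the optimal p.
grad_norms = np.abs(f(A @ x)) * np.linalg.norm(A, axis=1)
p_opt = grad_norms / grad_norms.sum()
```

Recomputing `p_opt` at every iterate is what makes exact importance sampling too expensive; the paper's contribution is doing this approximately in a single pass with Õ(Td) space.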
Recall that SGD offers the following convergence guarantee for smooth functions (Theorem 1.2). In this guarantee, we obtain a constant-factor approximation to the variance $\sigma = \sigma_{\mathrm{opt}}$ from optimal importance sampling, which can be significantly better than the



¹We use the notation $a \bullet b$ to denote the vertical concatenation $[a; b]$, i.e., stacking $b$ below $a$.



Theorem 1.2 (Nemirovski et al., 2009; Meka, 2017) Let $F$ be a $\mu$-smooth convex function and $x_{\mathrm{opt}} = \mathrm{argmin}_x F(x)$. Let $\sigma^2$ be an upper bound on the variance of the unbiased estimator across all iterations, and let $\bar{x}_k = \frac{x_1 + \cdots + x_k}{k}$. Let each step size $\eta_t = \eta \le \frac{1}{\mu}$. Then for SGD with initial position $x_0$,
$$\mathbb{E}\left[F(\bar{x}_k) - F(x_{\mathrm{opt}})\right] \le \frac{\|x_0 - x_{\mathrm{opt}}\|_2^2}{2\eta k} + \eta\sigma^2,$$
so one can obtain an $\varepsilon$-approximate optimal value by setting $\eta = \frac{1}{\sqrt{k}}$.
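The $\varepsilon$-approximation claim follows by balancing the two terms of the standard smooth-convex SGD bound; a sketch of the calculation, assuming the bound takes the form $\mathbb{E}[F(\bar{x}_k) - F(x_{\mathrm{opt}})] \le \|x_0 - x_{\mathrm{opt}}\|_2^2 / (2\eta k) + \eta\sigma^2$:

```latex
% Setting \eta = 1/\sqrt{k}:
\mathbb{E}\left[F(\bar{x}_k) - F(x_{\mathrm{opt}})\right]
  \le \frac{\|x_0 - x_{\mathrm{opt}}\|_2^2}{2\eta k} + \eta\sigma^2
  = \frac{\|x_0 - x_{\mathrm{opt}}\|_2^2 + 2\sigma^2}{2\sqrt{k}},
% so k = O\!\left(\bigl(\|x_0 - x_{\mathrm{opt}}\|_2^2 + 2\sigma^2\bigr)^2 / \varepsilon^2\right)
% iterations suffice for an \varepsilon-approximate optimum.
```

Since the required $k$ grows with $\sigma^2$, a constant-factor approximation to the optimal-importance-sampling variance $\sigma_{\mathrm{opt}}^2$ translates directly into fewer iterations.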

