ADAPTIVE SINGLE-PASS STOCHASTIC GRADIENT DESCENT IN INPUT SPARSITY TIME

Abstract

We study sampling algorithms for variance reduction methods for stochastic optimization. Although stochastic gradient descent (SGD) is widely used for large-scale machine learning, it sometimes experiences slow convergence rates due to the high variance from uniform sampling. In this paper, we introduce an algorithm that approximately samples a gradient from the optimal distribution for a common finite-sum form with $n$ terms, while making just a single pass over the data, using input sparsity time and $\tilde{O}(Td)$ space. Our algorithm can be implemented in big data models such as the streaming and distributed models. Moreover, we show that our algorithm can be generalized to approximately sample Hessians and thus provides variance reduction for second-order methods as well. We demonstrate the efficiency of our algorithm on large-scale datasets.

1. INTRODUCTION

There has recently been tremendous progress in variance reduction methods for stochastic gradient descent (SGD) for the standard convex finite-sum optimization problem
$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),$$
where $f_1, \ldots, f_n : \mathbb{R}^d \to \mathbb{R}$ are convex functions that commonly represent loss functions. Whereas gradient descent (GD) performs the update rule $x_{t+1} = x_t - \eta_t \nabla F(x_t)$ on the iterate $x_t$ at iterations $t = 1, 2, \ldots$, SGD (Robbins & Monro, 1951; Nemirovsky & Yudin, 1983; Nemirovski et al., 2009) picks $i_t \in [n]$ in iteration $t$ with probability $p_{i_t}$ and performs the update rule $x_{t+1} = x_t - \frac{\eta_t}{n p_{i_t}} \nabla f_{i_t}(x_t)$, where $\nabla f_{i_t}$ is the gradient (or a subgradient) of $f_{i_t}$ and $\eta_t$ is a predetermined learning rate. Effectively, training example $i_t$ is sampled with probability $p_{i_t}$ and the model parameters are updated using the selected example. The SGD update rule requires the computation of only a single gradient at each iteration and provides an unbiased estimator of the full gradient, whereas GD evaluates $n$ gradients at each iteration, which is prohibitively expensive for large $n$. However, SGD is often performed with uniform sampling, so that the probability $p_{i,t}$ of choosing index $i \in [n]$ at iteration $t$ is $p_{i,t} = \frac{1}{n}$ at all times; the variance introduced by the randomness of sampling a specific vector function can then be a bottleneck for the convergence rate of the iterative process. Thus the subject of variance reduction beyond uniform sampling has been well-studied in recent years (Roux et al., 2012; Johnson & Zhang, 2013; Defazio et al., 2014; Reddi et al., 2015; Zhao & Zhang, 2015; Daneshmand et al., 2016; Needell et al., 2016; Stich et al., 2017; Johnson & Guestrin, 2018; Katharopoulos & Fleuret, 2018; Salehi et al., 2018; Qian et al., 2019).
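As a concrete illustration (not part of the paper), the SGD update rule with general sampling probabilities $p_i$ can be sketched on a toy least-squares finite sum. All names and problem data below are hypothetical, chosen only to make the update $x_{t+1} = x_t - \frac{\eta_t}{n p_{i_t}} \nabla f_{i_t}(x_t)$ executable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum problem: f_i(x) = 0.5 * (a_i . x - b_i)^2, F(x) = (1/n) sum_i f_i(x).
n, d = 200, 5
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star  # noiseless targets, so F is minimized at x_star

def grad_i(x, i):
    # Gradient of the i-th term f_i at x.
    return (A[i] @ x - b[i]) * A[i]

def sgd(probs, eta=0.01, iters=2000):
    # SGD step with importance weights: x <- x - (eta / (n * p_i)) * grad f_i(x),
    # where i is drawn from the distribution `probs`. The 1/(n p_i) factor makes
    # the sampled gradient an unbiased estimator of the full gradient of F.
    x = np.zeros(d)
    for _ in range(iters):
        i = rng.choice(n, p=probs)
        x = x - eta / (n * probs[i]) * grad_i(x, i)
    return x

uniform = np.full(n, 1.0 / n)  # p_{i,t} = 1/n recovers plain uniform SGD
x_hat = sgd(uniform)
```

With uniform probabilities the importance weight $\frac{1}{n p_i}$ equals one, recovering the textbook SGD step; any other valid distribution keeps the estimator unbiased while changing its variance.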
A common technique to reduce variance is importance sampling, where the probabilities $p_{i,t}$ are chosen so that vector functions with larger gradients are more likely to be sampled. For a random vector $v$, define $\mathrm{Var}(v) := \mathbb{E}\left[\|v\|_2^2\right] - \|\mathbb{E}[v]\|_2^2$. Then $p_{i,t} = \frac{1}{n}$ for uniform sampling implies
$$\sigma_t^2 = \mathrm{Var}\left(\frac{1}{n p_{i_t,t}} \nabla f_{i_t}(x_t)\right) = \frac{1}{n^2}\left(n \sum_{i=1}^n \|\nabla f_i(x_t)\|_2^2 - n^2 \|\nabla F(x_t)\|_2^2\right),
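The uniform-sampling variance above is a special case of the standard identity $\mathrm{Var}\big(\frac{1}{n p_{i_t,t}} \nabla f_{i_t}(x_t)\big) = \frac{1}{n^2} \sum_{i=1}^n \frac{\|\nabla f_i(x_t)\|_2^2}{p_{i,t}} - \|\nabla F(x_t)\|_2^2$, which is minimized (by Cauchy–Schwarz) at $p_{i,t} \propto \|\nabla f_i(x_t)\|_2$. A minimal numeric check of this identity, on entirely synthetic gradient vectors, comparing uniform and norm-proportional probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
# Synthetic per-example gradients with uneven norms (rows play the role of grad f_i(x_t)).
G = rng.normal(size=(n, d)) * rng.exponential(size=(n, 1))
full = G.mean(axis=0)  # plays the role of the full gradient grad F(x_t)

def estimator_variance(p):
    # Var of g = (1/(n p_i)) grad f_i, i ~ p:  (1/n^2) sum_i ||grad f_i||^2 / p_i - ||grad F||^2.
    return np.sum((G ** 2).sum(axis=1) / p) / n ** 2 - full @ full

uniform = np.full(n, 1.0 / n)
norms = np.linalg.norm(G, axis=1)
optimal = norms / norms.sum()  # importance sampling proportional to gradient norms
```

Here `estimator_variance(uniform)` reproduces the displayed formula, and `estimator_variance(optimal)` is never larger, which is exactly the variance reduction that importance sampling targets.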

