GRADIENT DESCENT CONVERGES LINEARLY FOR LOGISTIC REGRESSION ON SEPARABLE DATA

Anonymous authors
Paper under double-blind review

Abstract

We show that running gradient descent on the logistic regression objective guarantees loss $f(\mathbf{x}) \le 1.1 \cdot f(\mathbf{x}^*) + \varepsilon$, where the error $\varepsilon$ decays exponentially with the number of iterations. This is in contrast to the common intuition that the absence of strong convexity precludes linear convergence of first-order methods, and highlights the importance of variable learning rates for gradient descent. For separable data, our analysis proves that the error between the predictor returned by gradient descent and the hard SVM predictor decays as $\mathrm{poly}(1/t)$, exponentially faster than the previously known bound of $O(\log\log t / \log t)$. Our key observation is a property of the logistic loss that we call multiplicative smoothness, which is (surprisingly) little explored: as the loss decreases, the objective becomes (locally) smoother, and therefore the learning rate can increase. Our results also extend to sparse logistic regression, where they lead to an exponential improvement of the sparsity-error tradeoff.

1. INTRODUCTION

Logistic regression is one of the most widely used classification methods because of its simplicity, interpretability, and good practical performance. Yet the convergence behavior of first-order methods on this task is not well understood: in practice, gradient descent performs much better than the theory predicts. In particular, a general analysis of gradient descent for smooth functions implies convergence with the error in function value decaying as $O(1/T)$. Analyses with stronger, linear convergence guarantees generally require the function to be strongly convex, a property which, in contrast to other losses such as the $\ell_2$ loss, the logistic loss only satisfies in a bounded set of solutions around zero. As a result, such analyses introduce an exponential runtime dependency on the magnitude of the optimal solution (Rätsch et al. (2001); Freund et al. (2018)), which is undesirable in practice. This poses a serious obstacle to obtaining favorable error rates for logistic regression that lead to high-precision solutions.

A deeper study into the structure of the exponential and logistic losses was done in Telgarsky & Singer (2012), who showed that, for linearly separable data, greedy coordinate descent achieves linear convergence with a rate that depends on the maximum linear classification margin (i.e. the hard SVM margin). Unfortunately, for logistic regression, the bound also has a $2^m$ dependence on the number of examples $m$, making it inefficient for any real-world task. The significance of the separability of the data for convergence has also been observed in Telgarsky (2013) and Freund et al. (2018), who present convergence results based on quantitative measures of separability. Telgarsky (2013) also refines the results of Telgarsky & Singer (2012) for the exponential loss, but still suffers from an exponential overhead originating from the multiplicative discrepancy between the exponential and the logistic loss.
Interestingly, Telgarsky (2013) points out that logistic regression experiments paint a much more favorable picture than the theory predicts. For separable data, Soudry et al. (2018) showed that the gradient descent logistic regression estimator converges to the maximum margin estimator at a rate of $O(\log\log T / \log T)$, which implies function value convergence at a rate of $O(1/T)$. Nacson et al. (2019) experimentally observed that these rates seem to be exponentially improvable if one uses variable step sizes, both for logistic regression and for shallow neural networks. However, as shown in Ji & Telgarsky (2018), the separability assumption is important, and the $\mathrm{poly}(1/T)$ bound on function value convergence is tight for gradient descent on arbitrary data.

Another approach to obtaining high-precision solutions is to use second order methods, which, in addition to first order (gradient) information, use second order (Hessian) information about the function. These make use of second order stability properties, such as quasi-self-concordance (Bach (2010)) combined with Newton's method (Karimireddy et al. (2018)), or ball oracles (Carmon et al. (2020); Adil et al. (2021)). Such approaches are generally not suitable for large-scale applications because of their reliance on repeated calls to large linear system solvers.

Our work. In this paper, we show that (under appropriate assumptions) we can get the best of both worlds of first and second order methods, thus giving a partial explanation for the excellent performance that first-order methods have on logistic regression in practice. In particular, given a binary classification instance $(\mathbf{A} \in \{-1, 1\}^{m \times n}, \mathbf{b} \in \{-1, 1\}^m)$ with associated logistic loss $f(\mathbf{x}) = \sum_i \log(1 + \exp(-b_i (\mathbf{A}\mathbf{x})_i))$, we show that simple variants of gradient descent return a solution with $f(\mathbf{x}) \le (1 + \delta) \cdot f(\mathbf{x}^*) + \varepsilon$ after $O\left(K \left(\frac{1}{\delta} + \log \frac{f(\mathbf{0})}{\varepsilon}\right)\right)$ iterations, where $K = \mathrm{poly}(n, \|\mathbf{x}^*\|)$.
Even though the error still decays as $1/T$ in the worst case because of the $\frac{1}{\delta}$ dependence, the additive error is now $\delta f(\mathbf{x}^*)$ instead of $\delta f(\mathbf{0})$, allowing for much faster convergence when the optimal loss $f(\mathbf{x}^*)$ is small (which is our measure of linear separability of the data). For linearly separable data, i.e. as $f(\mathbf{x}^*)$ approaches $0$, the convergence becomes linear. We also show that the distance to the maximum margin estimator, $\left\| \frac{\mathbf{x}}{\|\mathbf{x}\|_2} - \frac{\mathbf{x}^*}{\|\mathbf{x}^*\|_2} \right\|_2$, decays as $1/T$, exponentially improving over the $\log\log T / \log T$ bound of Soudry et al. (2018).

Instead of properties like Lipschitzness, smoothness, and strong convexity that are commonly used in the study of first order methods, we find that two other properties are more relevant to the structure of the logistic regression problem. The first one is second order robustness, which means that the Hessian is stable (in a spectral sense) in any small enough norm ball (Cohen et al. (2017)). This is closely related to quasi-self-concordance, a property that has previously been used in the analysis of second order algorithms (Bach (2010)). The second property is what we call multiplicative smoothness, which means that the function is locally smooth, with the smoothness constant being proportional to the function value (loss). Together, these properties show that, as the loss decreases, the objective becomes (locally) smoother and therefore the learning rate can increase. This motivates a variable step size schedule that is inversely proportional to the loss, thus making larger steps as the solution approaches optimality. This agrees with the observations of Soudry et al. (2018) and Nacson et al. (2019) on the importance of a variable learning rate. As can be seen in the toy example from Soudry et al. (2018) in Figure 1, simply replacing the fixed learning rate $\eta$ used in Soudry et al. (2018) by an increasing learning rate $\eta \cdot f(\mathbf{x}^0)/f(\mathbf{x}^T)$ yields an exponential improvement, both in loss and in distance to the maximum margin estimator.

Figure 1: Comparison between fixed and increasing step sizes in the toy example from Figure 1 of Soudry et al. (2018). The fixed step size is set to $\beta^{-1} := \|\mathbf{A}\|_2^{-2}$, and the increasing one to $\beta^{-1} f(\mathbf{x}^0)/f(\mathbf{x}^T)$. The estimator error is defined as $\left\| \mathbf{x}^t / \|\mathbf{x}^t\|_2 - \mathbf{x}^* / \|\mathbf{x}^*\|_2 \right\|_2$.

In practice, it is often important to force the solution of a logistic regression problem to be sparse, i.e. to have only a few non-zero entries, which is a form of feature selection. This is because most of the features might be only marginally useful, and thus one can drastically reduce the size of the model without significantly sacrificing predictive performance. Apart from computational efficiency, feature selection is also important for improving interpretability and avoiding overfitting.

Algorithm | Order | Guarantee | Runtime error dependence
Gradient descent | First | $f(\mathbf{x}) \le f(\mathbf{x}^*) + \varepsilon$ | $m/\varepsilon$
Accelerated gradient descent | First | $f(\mathbf{x}) \le f(\mathbf{x}^*) + \varepsilon$ | $m/\varepsilon$
Newton/Trust region | Second | $f(\mathbf{x}) \le f(\mathbf{x}^*) + \varepsilon$ | $\log(m/\varepsilon)$
This paper | First | $f(\mathbf{x}) \le (1 + \delta) \cdot f(\mathbf{x}^*) + \varepsilon$ | $\delta^{-1} + \log(m/\varepsilon)$

[…] 2010) | $f(\mathbf{x}) \le f(\mathbf{x}^*) + \varepsilon$ | $\|\mathbf{x}^*\|_1^2 \, m/\varepsilon$ | First
This paper | $f(\mathbf{x}) \le (1 + \delta) \cdot f(\mathbf{x}^*) + \varepsilon$ | $\|\mathbf{x}^*\|_1^2 \, (\delta^{-1} + \log(m/\varepsilon))$ | First

Most progress in sparse optimization has focused on objective functions with condition number bounded by some $\kappa > 0$. Results in this line of work guarantee a solution with relaxed sparsity $s' \ge s$, where $s$ is the target sparsity, and algorithms include lasso, orthogonal matching pursuit (OMP), and iterative hard thresholding (IHT) (Natarajan (1995); Blumensath & Davies (2009); Shalev-Shwartz et al. (2010); Jain et al. (2011; 2014); Axiotis & Sviridenko (2021; 2022)). The state of the art result by Axiotis & Sviridenko (2022) gives a sparsity of $s' = O(\kappa) \cdot s$ using a variant of the IHT algorithm.
However, the condition number of the logistic loss is unbounded, because it is not strongly convex. Therefore, these results do not directly apply, although they do apply to $\ell_2$-regularized logistic regression. Some works Van de Geer (2008); Bunea (2008) […] to achieve a loss of $f(\mathbf{x}) \le f(\mathbf{x}^*) + \varepsilon$. The most practical of these is a forward greedy selection algorithm, which is also known as greedy coordinate descent.

Our work. Using the second order stability and multiplicative smoothness properties, we show that a slight variation of greedy coordinate descent gives a sparsity of $O\left(\|\mathbf{x}^*\|_1^2 \left(\delta^{-1} + \log(m/\varepsilon)\right)\right)$ and a loss of $f(\mathbf{x}) \le (1 + \delta) \cdot f(\mathbf{x}^*) + \varepsilon$. As long as the $1 + \delta$ approximation factor in front of $f(\mathbf{x}^*)$ is tolerated, as is the case when $f(\mathbf{x}^*) \ll m$, this implies an exponential improvement in the $\varepsilon$ dependence, from $\frac{m}{\varepsilon}$ to $\log \frac{m}{\varepsilon}$. In addition, our analysis does not require (but is also not affected by) fully corrective steps, in which the function is fully re-optimized over the support of the current solution.

2. PRELIMINARIES

Notation. We denote $[n] = \{1, 2, \dots, n\}$. We will use bold to refer to vectors or matrices. We denote by $\mathbf{0}$ the all-zero vector, by $\mathbf{1}$ the all-one vector, by $\mathbf{O}$ the all-zero matrix, and by $\mathbf{I}$ the identity matrix (with dimensions understood from the context). Additionally, we denote by $\mathbf{1}_i$ the $i$-th basis vector, i.e. the vector that is $0$ everywhere except at position $i$, where it is $1$. In order to ease notation, and where not ambiguous, for two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ we denote by $\mathbf{x}\mathbf{y} \in \mathbb{R}^n$ the vector with elements $(\mathbf{x}\mathbf{y})_i = x_i y_i$, i.e. the element-wise multiplication of $\mathbf{x}$ and $\mathbf{y}$. In contrast, we denote their inner product by $\langle \mathbf{x}, \mathbf{y} \rangle$ or $\mathbf{x}^\top \mathbf{y}$. Similarly, $\mathbf{x}^2 \in \mathbb{R}^n$ is the element-wise square of the vector $\mathbf{x}$. For any vector $\mathbf{x} \in \mathbb{R}^n$ and set $S \subseteq [n]$, we denote by $\mathbf{x}_S$ the vector that results from $\mathbf{x}$ after zeroing out all the entries except those in positions given by indices in $S$. We also use the notation $\nabla_S f(\mathbf{x}) := (\nabla f(\mathbf{x}))_S$ to denote the restriction of a gradient to $S$. We use the notation $\widetilde{O}(\cdot)$ to hide $\mathrm{poly}\log(n, m)$ factors in $O$-notation, where $n$ is the dimension of the problem and $m$ is the number of examples.

Norms. For any $p \in (0, \infty)$ and weight vector $\mathbf{w} \ge \mathbf{0}$, we define the weighted $\ell_p$ norm of a vector $\mathbf{x} \in \mathbb{R}^n$ as $\|\mathbf{x}\|_{p,\mathbf{w}} = \left( \sum_i w_i |x_i|^p \right)^{1/p}$. For $p = 0$, we denote by $\|\mathbf{x}\|_0 = |\{i \mid x_i \ne 0\}|$ the sparsity of $\mathbf{x}$. For $p = \infty$, we denote by $\|\mathbf{x}\|_\infty = \max_i |x_i|$ the maximum absolute value of an entry of $\mathbf{x}$.

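Smoothness and convexity. A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is called convex if for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ we have $f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$. Furthermore, $f$ is called $\beta$-smooth (with respect to some norm $\|\cdot\|$) for some real number $\beta > 0$ if for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ we have $f(\mathbf{y}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + (\beta/2) \|\mathbf{y} - \mathbf{x}\|^2$. If $f$ is only $\beta$-smooth along $s$-sparse directions (i.e. only for $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ such that $\|\mathbf{y} - \mathbf{x}\|_0 \le s$), then we call $f$ $\beta$-smooth at sparsity level $s$. We denote the smallest such $\beta$ by $\beta_s$ and call it the restricted smoothness constant (at sparsity level $s$).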

3. LOGISTIC REGRESSION ANALYSIS VIA MULTIPLICATIVE SMOOTHNESS

In the logistic regression problem, our goal is to minimize the function $f(\mathbf{x}) = \sum_{i=1}^m \log(1 + e^{-(\mathbf{A}\mathbf{x})_i})$, where $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a data matrix. Our starting point, as is usually the case with first-order methods, is the second order Taylor expansion of $f$:

$f(\mathbf{x} + \tilde{\mathbf{x}}) = f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \tilde{\mathbf{x}} \rangle + \frac{1}{2} \langle \tilde{\mathbf{x}}, \nabla^2 f(\bar{\mathbf{x}}) \tilde{\mathbf{x}} \rangle$, (1)

where, by the mean value theorem for twice continuously differentiable functions, $\bar{\mathbf{x}}$ is entry-wise between $\mathbf{x}$ and $\mathbf{x} + \tilde{\mathbf{x}}$, and $\nabla^2 f(\bar{\mathbf{x}})$ is the Hessian of $f$ at $\bar{\mathbf{x}}$. In fact, as long as the step $\tilde{\mathbf{x}}$ is not too large, the Hessian at $\bar{\mathbf{x}}$ will not differ much (spectrally) from the Hessian at $\mathbf{x}$. This is because of the following property of the logistic function, called second order robustness (Cohen et al. (2017)), which is also very closely related to quasi-self-concordance (Bach (2010)).

Definition 3.1 (Second-order robustness). A twice differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is called $q$-second order robust with respect to a norm $\|\cdot\|$ if its Hessian is stable in any $(1/q)$-sized $\|\cdot\|$-ball, i.e. for any $\mathbf{x}, \bar{\mathbf{x}} \in \mathbb{R}^n$ such that $\|\bar{\mathbf{x}} - \mathbf{x}\| \le 1/q$, we have $\frac{1}{2} \nabla^2 f(\mathbf{x}) \preceq \nabla^2 f(\bar{\mathbf{x}}) \preceq 2 \nabla^2 f(\mathbf{x})$.

It is not hard to see that $f$ is $2M$-second order robust with respect to the $\ell_1$ norm, where $M$ is an upper bound on the entries of $\mathbf{A}$ in absolute value. Because of this, (1) implies the much simpler

$f(\mathbf{x} + \tilde{\mathbf{x}}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \tilde{\mathbf{x}} \rangle + \langle \tilde{\mathbf{x}}, \nabla^2 f(\mathbf{x}) \tilde{\mathbf{x}} \rangle$, (2)

as long as $\|\tilde{\mathbf{x}}\|_1 \le 1/(2M)$. We can easily calculate that $\nabla f(\mathbf{x}) = -\mathbf{A}^\top (\mathbf{1} - \sigma(\mathbf{A}\mathbf{x}))$, where $\sigma(t) = 1/(1 + e^{-t})$ is the sigmoid function (applied entry-wise), and $\nabla^2 f(\mathbf{x}) = \mathbf{A}^\top \mathrm{diag}(\mathbf{w}(\mathbf{x})) \mathbf{A}$, where $\mathbf{w}(\mathbf{x}) = \sigma(\mathbf{A}\mathbf{x})(\mathbf{1} - \sigma(\mathbf{A}\mathbf{x}))$ are diagonal weights. Note that the second order term of (1) can be re-written as $\frac{1}{2} \langle \mathbf{w}(\bar{\mathbf{x}}), (\mathbf{A}\tilde{\mathbf{x}})^2 \rangle$. This term, whose magnitude determines the step size of the algorithm and in turn the bound on the total number of iterations, becomes smaller as the weights $\mathbf{w}(\bar{\mathbf{x}})$ become smaller. The crucial observation is that these weights are bounded in a way that depends on the loss at $\mathbf{x}$, concretely:

$\sum_{i=1}^m (\mathbf{w}(\mathbf{x}))_i \le f(\mathbf{x})$. (3)

In other words, as the loss decreases, $f$ becomes smoother (in an appropriate sense). This is the main observation on which our analysis is based, and is what allows the algorithm to employ a step size that is inversely proportional to the loss.

Multiplicative smoothness. The above discussion motivates the following definition of multiplicative smoothness. It is related to the usual definition of smoothness, but also incorporates the property that the function becomes smoother as the loss decreases.

Definition 3.2 (Multiplicative smoothness). We call a twice differentiable function $f : \mathbb{R}^n \to \mathbb{R}_{>0}$ $\mu$-multiplicatively smooth with respect to a norm $\|\cdot\|$ if for any $\mathbf{x}, \tilde{\mathbf{x}} \in \mathbb{R}^n$ we have $\frac{\tilde{\mathbf{x}}^\top \nabla^2 f(\mathbf{x}) \tilde{\mathbf{x}}}{f(\mathbf{x})} \le \mu \|\tilde{\mathbf{x}}\|^2$.

Our use of a general norm is not an over-generalization since, as we will see, the $\ell_1$ norm is more suitable for sparse logistic regression, and the $\ell_2$ norm is more suitable for the unrestricted case. In fact, it can be proved that $f$ is $M^2$-multiplicatively smooth with respect to the $\ell_1$ norm, where we recall that $M$ is a bound on the entries of $\mathbf{A}$ in absolute value. In the following sections, we will see how the second order robustness and multiplicative smoothness properties play into the design and analysis of algorithms for sparse and general logistic regression.
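The bound (3) can be checked numerically. One way to see why it is plausible: writing $u = 1 - \sigma(t) \in (0, 1)$, the per-example loss is $-\log(1 - u) \ge u \ge u(1 - u)$, and $u(1 - u)$ is exactly the Hessian weight. The following is a minimal sketch of such a check, not part of the paper's experiments; the grid of margins is a hypothetical choice made only for illustration, and all function names are ours.

```python
import math

def sigmoid(t):
    """Numerically stable logistic sigmoid 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def loss_term(t):
    """Per-example logistic loss log(1 + exp(-t)), computed stably."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))

def hessian_weight(t):
    """Diagonal Hessian weight sigma(t) * (1 - sigma(t)) = sigma(t) * sigma(-t)."""
    return sigmoid(t) * sigmoid(-t)

# Hypothetical grid of margins (Ax)_i in [-10, 10], for illustration only.
margins = [k / 4.0 for k in range(-40, 41)]

# Pointwise gap between the loss term and the weight; (3) follows by summing.
gaps = [loss_term(t) - hessian_weight(t) for t in margins]
total_weight = sum(hessian_weight(t) for t in margins)
total_loss = sum(loss_term(t) for t in margins)

print("pointwise bound holds:", min(gaps) >= 0.0)
print("sum of weights <= loss:", total_weight <= total_loss)
```

The weight is computed as $\sigma(t)\sigma(-t)$ rather than $\sigma(t)(1 - \sigma(t))$ to avoid cancellation for large positive margins, where the gap between the two sides of (3) becomes very small.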

4. SPARSE LOGISTIC REGRESSION

As we saw, the logistic loss is $2M$-second order robust and $M^2$-multiplicatively smooth with respect to the $\ell_1$ norm. This is an ideal norm for sparse logistic regression, where in addition to minimizing the loss we want to restrict the solution to have few non-zero entries. In particular, it yields a variant of the $\ell_1$ gradient descent algorithm (also known as greedy coordinate descent), which is presented in Algorithm 1.

Algorithm 1 Greedy Coordinate Descent
1: procedure GREEDYCOORDINATEDESCENT($\mathbf{x}^0$, $T$, $M$, $B$)
2:   Let $f(\mathbf{x}) := \sum_{i=1}^m \log(1 + e^{-b_i (\mathbf{A}\mathbf{x})_i})$
3:   for $t = 0, \dots, T - 1$ do
4:     For all $i \in [n]$ define $\zeta_i = \begin{cases} \lambda_t & \text{if } x^t_i = 0 \\ 0 & \text{if } \|\mathbf{x}^t\|_1 \ge B \text{ and } \nabla_i f(\mathbf{x}^t) \cdot x^t_i < 0 \\ 1 & \text{otherwise} \end{cases}$
5:     $i \leftarrow \mathrm{argmax}_i \{\zeta_i |\nabla_i f(\mathbf{x}^t)|\}$
6:     $\eta \leftarrow (2M \max\{M f(\mathbf{x}^t), |\nabla_i f(\mathbf{x}^t)|\})^{-1}$
7:     $x^{t+1}_i \leftarrow x^t_i - \eta \nabla_i f(\mathbf{x}^t)$
   return $\mathbf{x}^T$

The first thing to note about this algorithm is the crucial role of the parameters $\lambda_t$. These parameters offer a quantitative tradeoff between sparsity and speed of convergence. In particular, when $\lambda_t$ is $1$, all entries (regardless of whether they are zero or not) are treated the same. When $\lambda_t \ll 1$, on the other hand, the gradient entries corresponding to zero entries are discounted by the factor $\lambda_t \ll 1$, thus making the algorithm less eager to update these as opposed to non-zero entries, whose update does not increase sparsity.

A practical consideration about Algorithm 1 is in order. The second condition in line 4 makes sure that the $\ell_1$ norm of the solution never exceeds a given bound on the $\ell_1$ norm of the optimal solution. This check is useful for the theoretical analysis but should likely be removed in any practical implementation.

We are ready for the main theorem of this section. In the proof, which can be found in Appendix A.2.2, we present an analysis of Algorithm 1 for sparse logistic regression. In addition to an upper bound $B \ge \|\mathbf{x}^*\|_\infty$, it also requires an approximation $B_1$ of $\|\mathbf{x}^*\|_1$.
One possible approach is to approximate it by $B$, but in practice this would be a learning rate hyperparameter to be tuned.

Theorem 4.1 (Sparse logistic regression). Given a binary classification instance $(\mathbf{A} \in [-M, M]^{m \times n}, \mathbf{b} \in \{1, -1\}^m)$ and for any solution $\mathbf{x}^* \in [-B, B]^n$ with $M \ge \max\{\|\mathbf{x}^*\|_\infty^{-1}, B^{-1}\}$ and a known parameter $B_1 \in \left[\frac{1}{C} \|\mathbf{x}^*\|_1, \|\mathbf{x}^*\|_1\right]$ for some $C \ge 1$, Algorithm 1 with $\lambda_t = \min\{B_1 / \|\mathbf{x}^t\|_1, 1\}$, initial solution $\mathbf{x}^0 \in \mathbb{R}^n$, and error tolerance $0 < \varepsilon < m/2$ returns a solution $\mathbf{x}$ with $f(\mathbf{x}) \le (1 + \delta) \cdot f(\mathbf{x}^*) + \varepsilon$ and sparsity

$s := \|\mathbf{x}\|_0 = O\left(\|\mathbf{x}^*\|_1^2 M^2 \left(\frac{1}{\delta} + \log \frac{f(\mathbf{x}^0) - f(\mathbf{x}^*)}{\varepsilon}\right)\right)$

in

$T = O\left(\left(\|\mathbf{x}\|_0^2 + \|\mathbf{x}^*\|_0^2\right) M^2 B^2 C^2 \left(\frac{1}{\delta} + \log \frac{f(\mathbf{x}^0) - f(\mathbf{x}^*)}{\varepsilon}\right)\right)$

iterations, […] $s := \|\mathbf{x}\|_0 = O\left(s^2 \log \frac{1}{\varepsilon}\right)$ in $T = O\left(s^4 \log^3 \frac{1}{\varepsilon}\right)$ iterations.

It is useful to compare these results to the results of Shalev-Shwartz et al. (2010) for sparse optimization of general smooth convex functions. Even though they achieve the stronger error bound of $f(\mathbf{x}) \le f(\mathbf{x}^*) + \varepsilon$, the sparsity of the final solution is on the order of $s^2 \frac{m}{\varepsilon}$, which has an exponentially worse error dependence than $s^2 \log \frac{m}{\varepsilon}$. Therefore, if the approximation factor $(1 + \delta)$ in front of $f(\mathbf{x}^*)$ is tolerable, then one can obtain exponentially better sparsity and convergence.

If we are willing to perform fully corrective steps as described in Algorithm 2, then we can get a cleaner and slightly simpler analysis. This is presented in Theorem 4.3 and proved in Appendix A.2.3. Fully corrective steps can be useful when there is an efficient (dense) optimization algorithm and one wishes to use it as a black box for sparse optimization. In practice, one does not need to perform a full correction, but only a small number of corrective (usually gradient) steps over the current support of the solution.

Algorithm 2 Greedy coordinate descent with fully corrective steps
1: procedure FULLYCORRECTIVEGREEDYCOORDINATEDESCENT($\mathbf{x}^0$, $T$, $M$, $B$)
2:   Let $f(\mathbf{x}) := \sum_{i=1}^m \log(1 + e^{-b_i (\mathbf{A}\mathbf{x})_i})$
3:   $S^0 \leftarrow \mathrm{supp}(\mathbf{x}^0)$
4:   for $t = 0, \dots, T - 1$ do
5:     $i \leftarrow \mathrm{argmax}_i \{|\nabla_i f(\mathbf{x}^t)|\}$
6:     $S^{t+1} \leftarrow S^t \cup \{i\}$
7:     $\mathbf{x}^{t+1} \leftarrow \mathrm{argmin}_{\mathbf{x} : \mathrm{supp}(\mathbf{x}) \subseteq S^{t+1}} f(\mathbf{x})$
   return $\mathbf{x}^T$

Theorem 4.3 (Sparse logistic regression with fully corrective steps). Given a binary classification instance $(\mathbf{A} \in [-M, M]^{m \times n}, \mathbf{b} \in \{1, -1\}^m)$ and for any solution $\mathbf{x}^* \in \mathbb{R}^n$, Algorithm 2 with error tolerance $0 < \varepsilon < m/2$ and initial solution $\mathbf{x}^0$ returns a solution $\mathbf{x}$ with $f(\mathbf{x}) \le (1 + \delta) \cdot f(\mathbf{x}^*) + \varepsilon$ and sparsity

$s := \|\mathbf{x}\|_0 = \|\mathbf{x}^0\|_0 + O\left(\|\mathbf{x}^*\|_1^2 M^2 \left(\frac{1}{\delta} + \log \frac{f(\mathbf{x}^0) - f(\mathbf{x}^*)}{\varepsilon}\right)\right)$

in $T = \|\mathbf{x}\|_0$ iterations, for any choice of $\delta \in (0, 1)$. Each iteration consists of evaluating the logistic regression gradient $\nabla f$ and solving a logistic regression problem on $s$ variables, plus $O(m + n)$ additional time.
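To make the greedy coordinate descent loop of Algorithm 1 concrete, the following is a minimal pure-Python sketch under stated assumptions. It follows the pseudocode as printed (including the $\ell_1$-norm check in line 4) and uses $\lambda_t = \min\{B_1 / \|\mathbf{x}^t\|_1, 1\}$ from Theorem 4.1. The toy data, the parameter values, the early exit when every candidate coordinate is blocked, and all function names are our own illustrative choices, not the authors' implementation.

```python
import math

def loss_and_grad(A, b, x):
    """Logistic loss f(x) = sum_i log(1 + exp(-b_i (Ax)_i)) and its gradient."""
    f, g = 0.0, [0.0] * len(x)
    for row, label in zip(A, b):
        z = label * sum(a * v for a, v in zip(row, x))
        if z >= 0:
            e = math.exp(-z)
            f += math.log1p(e)
            p = e / (1.0 + e)        # 1 - sigmoid(z), stable for z >= 0
        else:
            e = math.exp(z)
            f += -z + math.log1p(e)
            p = 1.0 / (1.0 + e)      # 1 - sigmoid(z), stable for z < 0
        for j, a in enumerate(row):
            g[j] -= label * a * p
    return f, g

def greedy_coordinate_descent(A, b, x0, T, M, B, B1):
    """Sketch of Algorithm 1; returns the final iterate and the loss history."""
    x, losses = list(x0), []
    for _ in range(T):
        f, g = loss_and_grad(A, b, x)
        losses.append(f)
        l1 = sum(abs(v) for v in x)
        lam = min(B1 / l1, 1.0) if l1 > 0 else 1.0

        def zeta(j):
            if x[j] == 0:
                return lam           # discount coordinates that add sparsity
            if l1 >= B and g[j] * x[j] < 0:
                return 0.0           # block updates that would grow |x_j|
            return 1.0

        i = max(range(len(x)), key=lambda j: zeta(j) * abs(g[j]))
        if zeta(i) * abs(g[i]) == 0.0:
            break                    # assumption: stop if nothing can move
        eta = 1.0 / (2.0 * M * max(M * f, abs(g[i])))
        x[i] -= eta * g[i]
    losses.append(loss_and_grad(A, b, x)[0])
    return x, losses

# Hypothetical separable toy instance with entries in [-1, 1], so M = 1.
A = [[1.0, 0.5, 0.0], [0.8, -0.2, 0.1], [-1.0, 0.3, 0.0], [-0.7, -0.4, 0.2]]
b = [1, 1, -1, -1]
x, losses = greedy_coordinate_descent(A, b, [0.0, 0.0, 0.0],
                                      T=50, M=1.0, B=100.0, B1=5.0)
print("initial loss:", losses[0], "final loss:", losses[-1])
print("non-zero entries:", sum(1 for v in x if v != 0))
```

The step size in line 6 equals $\min\{1/(2M^2 f(\mathbf{x}^t)), 1/(2M |\nabla_i f(\mathbf{x}^t)|)\}$, so the step grows as the loss shrinks while each update stays within the $\ell_1$ ball of radius $1/(2M)$ in which the Hessian is stable.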

5. DENSE LOGISTIC REGRESSION

In this section, our goal is to minimize the logistic function $f$ without any constraint on the sparsity of the solution. The results of Section 4 applied to a full sparsity of $n$ already imply Corollary 5.1.

Corollary 5.1 (Dense logistic regression). Given a binary classification instance $(\mathbf{A} \in [-M, M]^{m \times n}, \mathbf{b} \in \{-1, 1\}^m)$ and for any solution $\mathbf{x}^* \in [-B, B]^n$ with $M \ge \max\{\|\mathbf{x}^*\|_\infty^{-1}, B^{-1}\}$, Algorithm 1 with $\lambda_t = 1$ for all $t$, initial solution $\mathbf{x}^0 \in \mathbb{R}^n$, and error tolerance $0 < \varepsilon < m/2$ returns a solution $\mathbf{x}$ with $f(\mathbf{x}) \le (1 + \delta) \cdot f(\mathbf{x}^*) + \varepsilon$ in

$T = O\left(n^2 M^2 B^2 \left(\frac{1}{\delta} + \log \frac{f(\mathbf{x}^0) - f(\mathbf{x}^*)}{\varepsilon}\right)\right)$

iterations, for any choice of $\delta \in (0, 1)$. Additionally, $\|\mathbf{x}\|_\infty \le B + \frac{1}{2M}$. Each iteration consists of evaluating the logistic regression gradient $\nabla f$ plus $O(m + n)$ additional time.

Even though Corollary 5.1 has the same favorable convergence in terms of $\delta$ and $\varepsilon$ as Theorem 4.1, based on practical intuition we would expect ($\ell_2$-based) gradient descent to perform better than greedy coordinate descent, which only updates one coordinate at a time while having access to the full gradient. In fact, we can verify that the logistic loss does satisfy the multiplicative smoothness condition with respect to the $\ell_2$ norm, albeit in an almost trivial sense:

$\langle \mathbf{w}(\mathbf{x}), (\mathbf{A}\tilde{\mathbf{x}})^2 \rangle \le \|\mathbf{w}(\mathbf{x})\|_1 \|\mathbf{A}\tilde{\mathbf{x}}\|_\infty^2 \le f(\mathbf{x}) \|\mathbf{A}\|_{2 \to \infty}^2 \|\tilde{\mathbf{x}}\|_2^2 \le f(\mathbf{x}) \beta \|\tilde{\mathbf{x}}\|_2^2$.

Here, the inequality $\|\mathbf{A}\|_{2 \to \infty}^2 \le \|\mathbf{A}\|_2^2 := \beta$ implies $\beta$-multiplicative smoothness with respect to the $\ell_2$ norm. Unfortunately, this is not significantly better than the $\ell_1$ case: the number of iterations will be proportional to $\beta \|\mathbf{x}^*\|_2^2$, which can be on the order of $m$. Interestingly, real logistic regression instances exhibit the $\ell_2$ multiplicative smoothness property with significantly better constants.
In our experiments we found that, along the path of gradients encountered by gradient descent in a variety of instances, the following property was true:

$\langle \mathbf{w}(\mathbf{x}), (\mathbf{A} \nabla f(\mathbf{x}))^2 \rangle \le f(\mathbf{x}) \beta m^{-1} \|\nabla f(\mathbf{x})\|_2^2$.

This is an effective $\beta m^{-1}$-multiplicative smoothness property, because it is only assumed to be true for the $\mathbf{x}$'s encountered by the gradient descent algorithm. As such, it is an empirical property. In order to check our hypothesis, we have run the gradient descent algorithm with the step sizes implied by Theorem 5.2, which we will see later. For each of the 15 experiments, we have run gradient descent for 1000 iterations and calculated the maximum of the following quantity over all iterations:

$\frac{\langle \mathbf{w}(\mathbf{x}), (\mathbf{A} \nabla f(\mathbf{x}))^2 \rangle}{f(\mathbf{x}) m^{-1} \|\mathbf{A} \nabla f(\mathbf{x})\|_2^2}$.

If this is bounded by $1$, then, using the fact that $\|\mathbf{A} \nabla f(\mathbf{x})\|_2^2 \le \beta \|\nabla f(\mathbf{x})\|_2^2$, it follows that $f$ is effectively $\beta m^{-1}$-multiplicatively smooth with respect to the $\ell_2$ norm. As we can see in Table 3, these values are indeed less than $1$ for all datasets and all iterations.

In the following, our plan is to prove convergence, assuming that $f$ has the multiplicative smoothness property with the constants in our hypothesis above. Under this assumption, we can prove a much stronger convergence theorem (here we are also using the fact that $M^2 \le \beta$ to replace $2M$- by $2\sqrt{\beta}$-second order robustness):

Theorem 5.2. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function that is $2\sqrt{\beta}$-second order robust with respect to the $\ell_1$ norm and $\beta m^{-1}$-multiplicatively smooth with respect to the $\ell_2$ norm. Let $\mathbf{x}^0 \in \mathbb{R}^n$ be an initial solution and $\mathbf{x}^* \in \mathbb{R}^n$ be an arbitrary solution, where $R := \|\mathbf{x}^0 - \mathbf{x}^*\|_2$ and $R \ge \sqrt{n}$. Then, gradient descent with step size

$\eta_t = 0.5 \min\left\{\frac{1}{\beta m^{-1} f(\mathbf{x}^t)}, \frac{1}{\sqrt{\beta} \|\nabla f(\mathbf{x}^t)\|_1}\right\}$

returns a solution with $f(\mathbf{x}) \le (1 + \delta) f(\mathbf{x}^*) + \varepsilon$ after

$T = O\left(\frac{\beta R^2}{m} \left(\frac{1}{\delta} + \log \frac{f(\mathbf{x}^0) - f(\mathbf{x}^*)}{\varepsilon}\right)\right)$

iterations.
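The step size rule of Theorem 5.2 can be sketched in a few lines of pure Python. In this illustration the multiplicative smoothness constant is passed in as a parameter `mu`: the run below uses the provable constant $\mu = \beta$ from the trivial bound in this section, whereas passing `mu = beta / m` would correspond to the empirical hypothesis and carries no guarantee on arbitrary data. As a further assumption, $\beta$ is replaced by the squared Frobenius norm of $\mathbf{A}$, which upper-bounds $\|\mathbf{A}\|_2^2$ and avoids an eigenvalue computation. The toy data and all names are hypothetical.

```python
import math

def loss_and_grad(A, b, x):
    """Logistic loss f(x) = sum_i log(1 + exp(-b_i (Ax)_i)) and its gradient."""
    f, g = 0.0, [0.0] * len(x)
    for row, label in zip(A, b):
        z = label * sum(a * v for a, v in zip(row, x))
        if z >= 0:
            e = math.exp(-z)
            f += math.log1p(e)
            p = e / (1.0 + e)        # 1 - sigmoid(z), stable for z >= 0
        else:
            e = math.exp(z)
            f += -z + math.log1p(e)
            p = 1.0 / (1.0 + e)      # 1 - sigmoid(z), stable for z < 0
        for j, a in enumerate(row):
            g[j] -= label * a * p
    return f, g

def gradient_descent(A, b, x0, T, beta, mu):
    """Gradient descent with eta_t = 0.5 * min(1/(mu f), 1/(sqrt(beta) ||grad||_1))."""
    x, losses = list(x0), []
    for _ in range(T):
        f, g = loss_and_grad(A, b, x)
        losses.append(f)
        g1 = sum(abs(v) for v in g)
        if g1 == 0.0:
            break                    # stationary point reached
        eta = 0.5 * min(1.0 / (mu * f), 1.0 / (math.sqrt(beta) * g1))
        x = [v - eta * gv for v, gv in zip(x, g)]
    losses.append(loss_and_grad(A, b, x)[0])
    return x, losses

# Hypothetical separable toy instance.
A = [[1.0, 0.5, 0.0], [0.8, -0.2, 0.1], [-1.0, 0.3, 0.0], [-0.7, -0.4, 0.2]]
b = [1, 1, -1, -1]

# Assumption: squared Frobenius norm as an upper bound on beta = ||A||_2^2.
beta = sum(a * a for row in A for a in row)

x, losses = gradient_descent(A, b, [0.0, 0.0, 0.0], T=100, beta=beta, mu=beta)
print("initial loss:", losses[0], "final loss:", losses[-1])
```

Because the first term of the minimum is inversely proportional to the current loss, the step size increases automatically as the iterates approach optimality, which is the mechanism the section describes.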

6. MAXIMUM MARGIN SOLUTIONS

It is known that running gradient descent on the logistic loss on linearly separable data converges to the hard SVM (maximum margin) classifier (Soudry et al. (2018)), yet at the slow rate of

$\left\| \frac{\mathbf{x}^t}{\|\mathbf{x}^t\|_2} - \frac{\mathbf{x}^*}{\|\mathbf{x}^*\|_2} \right\|_2 \le O\left(\frac{\log\log t}{\log t}\right)$.

As all our results work best when the data is separable, it is natural to ask what they imply for margin maximization. We consider the constrained logistic regression problem

$\min_{\|\mathbf{x}\|_2 \le 1} f_p(\mathbf{x}) := \sum_i \log(1 + e^{-p b_i (\mathbf{A}\mathbf{x})_i})$. (4)

We start by observing that Corollary 5.1 and Theorem 5.2 can be modified to solve (4), with a blowup of $p^2$ in the number of iterations. In particular, the number of iterations will be $O\left(p^2 X \left(\frac{1}{\delta} + \log \frac{1}{\varepsilon}\right)\right)$, where $X$ depends on whether we use Corollary 5.1 or Theorem 5.2; its exact form is beside the point of this section, since here we are interested in the error dependence. Picking $\delta = 1$, $\varepsilon = m e^{-p\alpha}$, and $p = \frac{\log(6m)}{\alpha \hat{\varepsilon}}$ for some target error $\hat{\varepsilon} \in (0, 1)$, we get the following theorem:

Theorem 6.1. Consider a linearly separable binary classification instance $(\mathbf{A} \in \mathbb{R}^{m \times n}, \mathbf{b} \in \{1, -1\}^m)$, and a solution $\mathbf{x}^*$ that maximizes $\min_i \frac{b_i (\mathbf{A}\mathbf{x}^*)_i}{\|\mathbf{x}^*\|_2} =: \alpha$. Then, we can obtain a solution $\mathbf{x}$ with $\|\mathbf{x}\|_2 \le 1$ and $f_p(\mathbf{x}) \le 3 f_p(\mathbf{x}^*)$ in $\widetilde{O}\left(X \frac{1}{\alpha^2 \hat{\varepsilon}^3}\right)$ iterations of gradient descent, where $p = \frac{\log(6m)}{\alpha \hat{\varepsilon}}$. Furthermore, $\mathbf{x}$ has $(1 - \hat{\varepsilon})$-optimal margins:

$\min_i \frac{b_i (\mathbf{A}\mathbf{x})_i}{\|\mathbf{x}\|_2} \ge \alpha (1 - \hat{\varepsilon})$

and is close to the maximum margin classifier:

$\left\| \frac{\mathbf{x}}{\|\mathbf{x}\|_2} - \frac{\mathbf{x}^*}{\|\mathbf{x}^*\|_2} \right\|_2 \le 2\sqrt{\hat{\varepsilon}}$.

It is not hard to see that Theorem 6.1 gives an exponential improvement in the error dependence compared to Soudry et al. (2018).
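The two quantities that Theorem 6.1 controls, the normalized margin and the distance between normalized predictors, are straightforward to compute. The following is a small illustrative sketch; the function names and the toy instance are our own and are not taken from the paper.

```python
import math

def l2_norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in x))

def normalized_margin(A, b, x):
    """min_i b_i (Ax)_i / ||x||_2, the margin of the classifier direction x."""
    raw = min(label * sum(a * v for a, v in zip(row, x))
              for row, label in zip(A, b))
    return raw / l2_norm(x)

def direction_error(x, x_star):
    """|| x / ||x||_2 - x_star / ||x_star||_2 ||_2, the estimator error."""
    nx, ns = l2_norm(x), l2_norm(x_star)
    return math.sqrt(sum((u / nx - v / ns) ** 2 for u, v in zip(x, x_star)))

# Hypothetical separable instance whose maximum margin direction is (1, 1).
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b = [1, 1, -1, -1]
x_star = [1.0, 1.0]

alpha = normalized_margin(A, b, x_star)
print("margin of x_star:", alpha)
print("margin of (1, 0):", normalized_margin(A, b, [1.0, 0.0]))
print("error of (2, 2) vs x_star:", direction_error([2.0, 2.0], x_star))
```

Both quantities are invariant to rescaling of the predictor, which is why the comparison with the hard SVM solution is stated in terms of normalized vectors.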

7. NUMERICAL EXAMPLE

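In order to numerically validate our algorithm, we run logistic regression on the well-known UCI adult binary classification dataset. In order to simulate a separable dataset, we first run gradient descent on the whole data and then discard the misclassified data points. This gives us a separable dataset. Then, we run two variants of gradient descent: one with constant step size given by $\beta^{-1}$, and one with increasing step size given by $\eta_t = \beta^{-1} f(\mathbf{x}^0)/f(\mathbf{x}^t)$, with no other change. This is motivated by our findings, which suggest that the step size should increase proportionally to the decrease of the loss. As we can see in Figure 2, the error in the case of fixed step size decays as $\mathrm{poly}(1/t)$, while in the case of increasing step size we have linear convergence (albeit with a low rate, because the margins are on the order of $10^{-6}$).

[…] we define

$\zeta_i = \begin{cases} \lambda & \text{if } x_i = 0 \\ 0 & \text{if } |x_i| \ge B \text{ and } \nabla_i f(\mathbf{x}) \cdot x_i < 0 \\ 1 & \text{otherwise} \end{cases}$

where $0 < \lambda \le 1$, and let $i^* = \mathrm{argmax}_i \{\zeta_i |\nabla_i f(\mathbf{x})|\}$. Then, at least one of the following is true:
• $|\nabla_{i^*} f(\mathbf{x})| \ge \frac{f(\mathbf{x}) - f(\mathbf{x}^*)}{\|\mathbf{x}^*\|_1 + \lambda \|\mathbf{x}\|_1}$
• $|\nabla_{i^*} f(\mathbf{x})| \ge \frac{f(\mathbf{x}) - f(\mathbf{x}^*)}{\lambda^{-1} \|\mathbf{x}^*\|_1 + \|\mathbf{x}\|_1}$ and $x_{i^*} \ne 0$.

Proof. Let $S = \{i \mid x_i \ne 0\}$ and $F = \{i \mid |x_i| < B \text{ or } \nabla_i f(\mathbf{x}) \cdot x_i \ge 0\}$. By convexity of $f$, we have

$f(\mathbf{x}^*) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{x}^* - \mathbf{x} \rangle \ge f(\mathbf{x}) + \langle \nabla_F f(\mathbf{x}), \mathbf{x}^* - \mathbf{x} \rangle = f(\mathbf{x}) + \langle \nabla_F f(\mathbf{x}), \mathbf{x}^* \rangle - \langle \nabla_{S \cap F} f(\mathbf{x}), \mathbf{x} \rangle \ge f(\mathbf{x}) - \|\nabla_F f(\mathbf{x})\|_\infty \|\mathbf{x}^*\|_1 - \|\nabla_{S \cap F} f(\mathbf{x})\|_\infty \|\mathbf{x}\|_1$,

therefore

$\|\nabla_F f(\mathbf{x})\|_\infty \|\mathbf{x}^*\|_1 + \|\nabla_{S \cap F} f(\mathbf{x})\|_\infty \|\mathbf{x}\|_1 \ge f(\mathbf{x}) - f(\mathbf{x}^*)$. (5)

Now, if $i^* \notin S$, by definition of the $\zeta_i$'s and $i^*$ we have $\lambda \|\nabla_F f(\mathbf{x})\|_\infty = \lambda |\nabla_{i^*} f(\mathbf{x})| \ge \|\nabla_{S \cap F} f(\mathbf{x})\|_\infty$, and so (5) implies

$|\nabla_{i^*} f(\mathbf{x})| \|\mathbf{x}^*\|_1 + \lambda |\nabla_{i^*} f(\mathbf{x})| \|\mathbf{x}\|_1 \ge f(\mathbf{x}) - f(\mathbf{x}^*) \Rightarrow |\nabla_{i^*} f(\mathbf{x})| \ge \frac{f(\mathbf{x}) - f(\mathbf{x}^*)}{\|\mathbf{x}^*\|_1 + \lambda \|\mathbf{x}\|_1}$.

Otherwise, if $i^* \in S$, we have $\|\nabla_{S \cap F} f(\mathbf{x})\|_\infty = |\nabla_{i^*} f(\mathbf{x})| \ge \lambda \|\nabla_F f(\mathbf{x})\|_\infty$, and so (5) implies

$\lambda^{-1} |\nabla_{i^*} f(\mathbf{x})| \|\mathbf{x}^*\|_1 + |\nabla_{i^*} f(\mathbf{x})| \|\mathbf{x}\|_1 \ge f(\mathbf{x}) - f(\mathbf{x}^*) \Rightarrow |\nabla_{i^*} f(\mathbf{x})| \ge \frac{f(\mathbf{x}) - f(\mathbf{x}^*)}{\lambda^{-1} \|\mathbf{x}^*\|_1 + \|\mathbf{x}\|_1}$.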
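Lemma A.2 (Coordinate update). Let $f : \mathbb{R}^n \to \mathbb{R}_{\ge 0}$ be a twice continuously differentiable convex function that is $2\gamma$-second order robust and $\gamma^2$-multiplicatively smooth with respect to the $\ell_1$ norm, for some $\gamma > 0$. Let $\mathbf{x} \in [-B', B']^n$ be a suboptimal solution such that $f(\mathbf{x}) \ge f(\mathbf{x}^*)$, where $\mathbf{x}^* \in [-B, B]^n$ is some unknown solution with $\gamma \|\mathbf{x}^*\|_1 \ge 1$, and $B' \ge B > 0$ are some parameters. We make the update $\mathbf{x}' = \mathbf{x} - \eta \nabla_i f(\mathbf{x}) \mathbf{1}_i$, where $i$ is picked as in Lemma A.1 for some parameter $\lambda \in (0, 1)$ and $\eta = 0.5 \min\left\{\frac{1}{\gamma^2 f(\mathbf{x})}, \frac{1}{\gamma |\nabla_i f(\mathbf{x})|}\right\}$ is a step size. Then, at least one of the following is true about the progress in decreasing $f$:
• $f(\mathbf{x}) - f(\mathbf{x}') \ge \frac{(f(\mathbf{x}) - f(\mathbf{x}^*))^2}{4\gamma^2 f(\mathbf{x}) (\|\mathbf{x}^*\|_1 + \lambda \|\mathbf{x}\|_1)^2}$
• $f(\mathbf{x}) - f(\mathbf{x}') \ge \frac{(f(\mathbf{x}) - f(\mathbf{x}^*))^2}{4\gamma^2 f(\mathbf{x}) (\lambda^{-1} \|\mathbf{x}^*\|_1 + \|\mathbf{x}\|_1)^2}$ and $x_i \ne 0$,
and the norm of the new solution is bounded as $\|\mathbf{x}'\|_\infty \le \max\left\{B', B + \frac{1}{2\gamma}\right\}$. In the case that $f(\mathbf{x}) < f(\mathbf{x}^*)$ we have $f(\mathbf{x}') \le f(\mathbf{x})$.

Proof. We first consider a generic update $\mathbf{x}' = \mathbf{x} + \tilde{\mathbf{x}}$. By Taylor's theorem and the fact that $f$ is twice continuously differentiable, we have

$f(\mathbf{x}') = f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \tilde{\mathbf{x}} \rangle + \frac{1}{2} \langle \tilde{\mathbf{x}}, \nabla^2 f(\bar{\mathbf{x}}) \tilde{\mathbf{x}} \rangle$,

for some $\bar{\mathbf{x}}$ that is entrywise between $\mathbf{x}$ and $\mathbf{x}'$. Since $f$ is $2\gamma$-second-order-robust and $\gamma^2$-multiplicatively-smooth with respect to the $\ell_1$ norm, as long as the update is bounded in $\ell_1$ norm as

$\|\tilde{\mathbf{x}}\|_1 \le 1/(2\gamma)$ (6)

we have

$f(\mathbf{x}') \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \tilde{\mathbf{x}} \rangle + \langle \tilde{\mathbf{x}}, \nabla^2 f(\mathbf{x}) \tilde{\mathbf{x}} \rangle \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \tilde{\mathbf{x}} \rangle + \gamma^2 f(\mathbf{x}) \|\tilde{\mathbf{x}}\|_1^2$.

Note that the right hand side is minimized for $\tilde{\mathbf{x}} = -\frac{H_1(\nabla f(\mathbf{x}))}{2\gamma^2 f(\mathbf{x})}$, where $H_1$ is the hard thresholding operator that zeroes out all but the top entry in absolute value. This is a coordinate descent step. Our step will be slightly more careful so that it doesn't unnecessarily increase the sparsity of $\mathbf{x}$. We consider the following coordinate step:

$\tilde{\mathbf{x}} = -\eta \nabla_i f(\mathbf{x}) \mathbf{1}_i$,

where $\eta > 0$ and $i$ are as defined in the lemma statement. We now have

$f(\mathbf{x}) - f(\mathbf{x}') \ge \left(\eta - \eta^2 \gamma^2 f(\mathbf{x})\right) (\nabla_i f(\mathbf{x}))^2$.

The term $\eta - \eta^2 \gamma^2 f(\mathbf{x})$ is maximized at $\eta = \frac{1}{2\gamma^2 f(\mathbf{x})}$.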
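In addition, to stay in the $\ell_1$ neighborhood where the Hessian is stable, we need to satisfy (6) by making sure that $\eta \le \frac{1}{2\gamma |\nabla_i f(\mathbf{x})|}$. Based on these requirements, we pick

$\eta = \min\left\{\frac{1}{2\gamma^2 f(\mathbf{x})}, \frac{1}{2\gamma |\nabla_i f(\mathbf{x})|}\right\}$

and conclude that

$f(\mathbf{x}) - f(\mathbf{x}') \ge \min\left\{\frac{1}{4\gamma^2 f(\mathbf{x})}, \frac{1}{4\gamma |\nabla_i f(\mathbf{x})|}\right\} (\nabla_i f(\mathbf{x}))^2 = \min\left\{\frac{(\nabla_i f(\mathbf{x}))^2}{4\gamma^2 f(\mathbf{x})}, \frac{|\nabla_i f(\mathbf{x})|}{4\gamma}\right\}$.

Note that this is always $\ge 0$, and so we have $f(\mathbf{x}') \le f(\mathbf{x})$ even if $f(\mathbf{x}) < f(\mathbf{x}^*)$. We now take two cases and use the two bullets of Lemma A.1 accordingly.

Case 1: $x_i = 0$. The first bullet of Lemma A.1 has to be true, i.e. $|\nabla_i f(\mathbf{x})| \ge \frac{f(\mathbf{x}) - f(\mathbf{x}^*)}{\|\mathbf{x}^*\|_1 + \lambda \|\mathbf{x}\|_1}$. Therefore,

$f(\mathbf{x}) - f(\mathbf{x}') \ge \min\left\{\frac{(f(\mathbf{x}) - f(\mathbf{x}^*))^2}{4\gamma^2 f(\mathbf{x}) (\|\mathbf{x}^*\|_1 + \lambda \|\mathbf{x}\|_1)^2}, \frac{f(\mathbf{x}) - f(\mathbf{x}^*)}{4\gamma (\|\mathbf{x}^*\|_1 + \lambda \|\mathbf{x}\|_1)}\right\} = \frac{(f(\mathbf{x}) - f(\mathbf{x}^*))^2}{4\gamma^2 f(\mathbf{x}) (\|\mathbf{x}^*\|_1 + \lambda \|\mathbf{x}\|_1)^2}$,

where we used the facts that $f(\mathbf{x}) - f(\mathbf{x}^*) \le f(\mathbf{x})$ and $\gamma \|\mathbf{x}^*\|_1 \ge 1$.

Case 2: $x_i \ne 0$. If the first bullet of Lemma A.1 is true, we can proceed as in the previous case. Otherwise, we use the second bullet of Lemma A.1 and similarly get

$f(\mathbf{x}) - f(\mathbf{x}') \ge \frac{(f(\mathbf{x}) - f(\mathbf{x}^*))^2}{4\gamma^2 f(\mathbf{x}) (\lambda^{-1} \|\mathbf{x}^*\|_1 + \|\mathbf{x}\|_1)^2}$.

Finally, in order to bound $\|\mathbf{x}'\|_\infty$, we first note that $\|\mathbf{x}\|_\infty \le B'$. Now, by our choice of $i$ we have that either $|x_i| < B$, or $\nabla_i f(\mathbf{x}) \cdot x_i > 0$. In the first case, we have $|x'_i| \le |x_i| + |\tilde{x}_i| < B + \frac{1}{2\gamma}$, where we used (6). Otherwise, we have that $|x_i| \ge B$ and $\nabla_i f(\mathbf{x}) \cdot x_i > 0$. This implies that $x_i$ and $\tilde{x}_i$ have different signs, so $|x'_i| = |x_i + \tilde{x}_i| \le \max\{|x_i|, |\tilde{x}_i|\} \le \max\left\{B', \frac{1}{2\gamma}\right\}$. Therefore, in any case we have $|x'_i| \le \max\left\{B', B + \frac{1}{2\gamma}\right\}$.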
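For readers who want the step-size choice in the proof of Lemma A.2 spelled out, the following worked derivation fills in the elementary calculus. It introduces the shorthand $g := \nabla_i f(\mathbf{x})$ and $\phi(\eta)$, which are our own notation; everything else follows the quantities used in the proof.

```latex
\begin{align*}
% Plug the coordinate step \tilde{x} = -\eta g 1_i into the upper bound:
f(\mathbf{x}')
  &\le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \tilde{\mathbf{x}} \rangle
       + \gamma^2 f(\mathbf{x}) \|\tilde{\mathbf{x}}\|_1^2
   = f(\mathbf{x}) - \eta g^2 + \eta^2 \gamma^2 f(\mathbf{x}) g^2, \\
% so the guaranteed progress is \phi(\eta) g^2 with
\phi(\eta) &:= \eta - \eta^2 \gamma^2 f(\mathbf{x}), \\
% a concave quadratic, maximized where its derivative vanishes:
\phi'(\eta) &= 1 - 2 \eta \gamma^2 f(\mathbf{x}) = 0
  \iff \eta = \frac{1}{2 \gamma^2 f(\mathbf{x})},
  \qquad \phi\!\left(\frac{1}{2 \gamma^2 f(\mathbf{x})}\right)
  = \frac{1}{4 \gamma^2 f(\mathbf{x})}, \\
% and any smaller positive step still makes at least half-rate progress:
0 < \eta \le \frac{1}{2 \gamma^2 f(\mathbf{x})}
  &\implies \phi(\eta) = \eta \left(1 - \eta \gamma^2 f(\mathbf{x})\right)
  \ge \frac{\eta}{2}.
\end{align*}
```

The last line is what justifies taking the minimum of the two candidate step sizes: when the $\ell_1$ constraint (6) forces $\eta = \frac{1}{2\gamma |g|}$, the progress is still at least $\frac{\eta}{2} g^2 = \frac{|g|}{4\gamma}$, which is the second term in the minimum.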

A.2 PROOFS OF THEOREMS

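A.2.1 PROOF OF COROLLARY 5.1

Proof. We will apply Lemma A.2 for $T$ iterations to obtain solutions $\mathbf{x}^0, \dots, \mathbf{x}^T$, for some $T$ that will be defined later. The logistic function $f$ is $2M$-second order robust and $M^2$-multiplicatively smooth with respect to the $\ell_1$ norm, so Lemma A.2 can be applied with $\gamma = M$ and $B' = B + \frac{1}{2M}$. Based on the guarantee of Lemma A.2, we get the following bound on the $\ell_1$ norm of $\mathbf{x}^t$ at all times:

$\|\mathbf{x}^t\|_1 \le n \|\mathbf{x}^t\|_\infty \le n \left(B + \frac{1}{2M}\right) \le (3/2) n B$.

Let $\hat{t}$ be the smallest $t \ge 0$ for which $f(\mathbf{x}^t) \le 2 f(\mathbf{x}^*)$ or $f(\mathbf{x}^t) \le f(\mathbf{x}^*) + \varepsilon$, and let $\hat{t} = \infty$ if this never happens. Therefore, for all $t < \hat{t}$ we have $f(\mathbf{x}^t) \ge 2 f(\mathbf{x}^*) \Rightarrow \frac{f(\mathbf{x}^t) - f(\mathbf{x}^*)}{f(\mathbf{x}^t)} \ge \frac{1}{2}$, and so the statement of Lemma A.2 gives:

$f(\mathbf{x}^t) - f(\mathbf{x}^{t+1}) \ge \frac{f(\mathbf{x}^t) - f(\mathbf{x}^*)}{8 M^2 (\|\mathbf{x}^*\|_1 + \|\mathbf{x}^t\|_1)^2} \ge \frac{f(\mathbf{x}^t) - f(\mathbf{x}^*)}{8 n^2 M^2 (B + (3/2) B)^2} \ge \frac{f(\mathbf{x}^t) - f(\mathbf{x}^*)}{50 n^2 M^2 B^2}$,

where we used the fact that $\|\mathbf{x}^*\|_1 \le n \|\mathbf{x}^*\|_\infty \le n B$. Equivalently,

$f(\mathbf{x}^{t+1}) - f(\mathbf{x}^*) \le \left(1 - \frac{1}{50 n^2 M^2 B^2}\right) (f(\mathbf{x}^t) - f(\mathbf{x}^*))$,

and chaining these inequalities for $t \in \{0, 1, \dots, \hat{t} - 1\}$, we get

$f(\mathbf{x}^{\hat{t}}) - f(\mathbf{x}^*) \le \left(1 - \frac{1}{50 n^2 M^2 B^2}\right)^{\hat{t}} (f(\mathbf{x}^0) - f(\mathbf{x}^*)) \le \varepsilon$,

as long as $\hat{t} \ge 50 n^2 M^2 B^2 \log \frac{f(\mathbf{x}^0) - f(\mathbf{x}^*)}{\varepsilon}$; therefore we conclude that $\hat{t}$ is at most this quantity.

Now we consider the iterations $t \ge \hat{t}$. If $f(\mathbf{x}^{\hat{t}}) \le f(\mathbf{x}^*) + \varepsilon$ there are no such iterations and we are done. Otherwise, we have that $f(\mathbf{x}^{\hat{t}}) \le 2 f(\mathbf{x}^*)$. We again use Lemma A.2 for all such $t$, which gives

$f(\mathbf{x}^t) - f(\mathbf{x}^{t+1}) \ge \frac{(f(\mathbf{x}^t) - f(\mathbf{x}^*))^2}{4 M^2 f(\mathbf{x}^t) (\|\mathbf{x}^*\|_1 + \|\mathbf{x}^t\|_1)^2} \ge \frac{(f(\mathbf{x}^t) - f(\mathbf{x}^*))^2}{25 f(\mathbf{x}^t) n^2 M^2 B^2} \ge \frac{(f(\mathbf{x}^t) - f(\mathbf{x}^*))^2}{50 f(\mathbf{x}^*) n^2 M^2 B^2}$.

By known convergence results, this recurrence leads to the bound

$f(\mathbf{x}^T) \le f(\mathbf{x}^*) + \frac{100 f(\mathbf{x}^*) n^2 M^2 B^2}{T - \hat{t}} = f(\mathbf{x}^*) \left(1 + \frac{100 n^2 M^2 B^2}{T - \hat{t}}\right)$,

implying that $f(\mathbf{x}^T) \le (1 + \delta) f(\mathbf{x}^*)$ after $T - \hat{t} = O\left(n^2 M^2 B^2 \frac{1}{\delta}\right)$ additional iterations after $\hat{t}$.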
Therefore, the total number of iterations to achieve $f(x^T) \le (1 + \delta) \cdot f(x^*) + \varepsilon$ is
$$O\left(n^2 M^2 B^2 \left(\frac{1}{\delta} + \log \frac{f(x^0) - f(x^*)}{\varepsilon}\right)\right).$$

A.2.2 PROOF OF THEOREM 4.1

Proof. Similarly to the proof of Corollary 5.1, we apply Lemma A.2 for $T$ iterations to obtain solutions $x^0, \dots, x^T$, but now we also have to account for the sparsity increase of $x^t$. For this reason, we use $\lambda_t < 1$, which disincentivizes updating zero entries of the solution vector. Compared to Corollary 5.1, we have the differences that $\lambda_t^{-1} \|x^*\|_1 = \max\left\{c^{-1} \|x^t\|_1, \|x^*\|_1\right\}$, and that we have the following tighter bounds because of sparsity:
$$\|x^*\|_1 \le s B, \qquad \|x^t\|_1 \le \|x^t\|_0 \|x^t\|_\infty \le \|x^t\|_0 (3/2) B.$$
We first bound the sparsity. Note that the sparsity increases by at most 1 every time the first bullet of Lemma A.2 is true, and does not increase when the second bullet is true. Therefore, the progress in each sparsity-increasing iteration is
$$f(x^t) - f(x^{t+1}) \ge \frac{(f(x^t) - f(x^*))^2}{4 f(x^*) M^2 (\|x^*\|_1 + \lambda_t \|x^t\|_1)^2} \ge \frac{(f(x^t) - f(x^*))^2}{4 (1 + c)^2 f(x^*) M^2 \|x^*\|_1^2}.$$

B MISSING PROOFS FROM SECTION 5

B.1 GRADIENT UPDATE LEMMA

Lemma B.1 (Gradient update). Let $f : \mathbb{R}^n \to \mathbb{R}_{>0}$ be a twice continuously differentiable convex function that is $2\gamma$-second order robust with respect to the $\ell_1$ norm and $\mu$-multiplicatively smooth with respect to the $\ell_2$ norm for some $\gamma, \mu > 0$. Let $x \in \mathbb{R}^n$ be a solution such that $f(x) > f(x^*)$, where $x^* \in \mathbb{R}^n$ is an unknown solution with $\|x - x^*\|_2 \le \|x^*\|_2$. We make the update
$$x' = x - \eta \nabla f(x),$$
where $\eta = 0.5 \min\left\{\frac{1}{\mu f(x)}, \frac{1}{\gamma \|\nabla f(x)\|_1}\right\}$ is a step size. Then, the progress in decreasing $f$ is:
$$f(x) - f(x') \ge \min\left\{\frac{(f(x) - f(x^*))^2}{4 \mu f(x) \|x^*\|_2^2}, \frac{f(x) - f(x^*)}{4 \gamma \sqrt{n} \|x^*\|_2}\right\}.$$
Additionally, as long as $x'$ is still suboptimal with respect to $x^*$, i.e. $f(x') > f(x^*)$, the distance to $x^*$ decreases: $\|x' - x^*\|_2 \le \|x - x^*\|_2$. Finally, if $f(x) \le f(x^*)$, then $f(x') \le f(x)$.

Proof. We first consider a generic update $x' = x + \tilde{x}$.
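By Taylor's theorem and the fact that $f$ is twice continuously differentiable, we have
$$f(x') = f(x) + \langle \nabla f(x), \tilde{x} \rangle + \frac{1}{2} \langle \tilde{x}, \nabla^2 f(\bar{x}) \tilde{x} \rangle,$$
for some $\bar{x}$ that is entrywise between $x$ and $x'$. Since $f$ is $2\gamma$-second-order-robust with respect to the $\ell_1$ norm and $\mu$-multiplicatively-smooth with respect to the $\ell_2$ norm, as long as the update is bounded in $\ell_1$ norm as $\|\tilde{x}\|_1 \le 1/(2\gamma)$ we have
$$f(x') \le f(x) + \langle \nabla f(x), \tilde{x} \rangle + \langle \tilde{x}, \nabla^2 f(x) \tilde{x} \rangle \le f(x) + \langle \nabla f(x), \tilde{x} \rangle + \mu f(x) \|\tilde{x}\|_2^2.$$
Note that the right hand side is minimized for $\tilde{x} = -\frac{1}{2 \mu f(x)} \nabla f(x)$. In addition, to stay in the $\ell_1$ neighborhood where the Hessian is stable, we need to satisfy (8). Based on these requirements, we make the update $\tilde{x} = -\eta \nabla f(x)$, where $\eta = \min\left\{\frac{1}{2 \mu f(x)}, \frac{1}{2 \gamma \|\nabla f(x)\|_1}\right\}$. We thus have
$$f(x) - f(x') \ge \left(\eta - \eta^2 \mu f(x)\right) \|\nabla f(x)\|_2^2 \ge \frac{\eta}{2} \|\nabla f(x)\|_2^2$$
and so
$$f(x) - f(x') \ge \min\left\{\frac{1}{4 \mu f(x)}, \frac{1}{4 \gamma \|\nabla f(x)\|_1}\right\} \|\nabla f(x)\|_2^2 \ge \min\left\{\frac{\|\nabla f(x)\|_2^2}{4 \mu f(x)}, \frac{\|\nabla f(x)\|_2}{4 \gamma \sqrt{n}}\right\}.$$
This takes care of the case $f(x) \le f(x^*)$, since it shows that $f(x') \le f(x)$. Now we deal with the case $f(x) > f(x^*)$. By convexity we have
$$f(x^*) \ge f(x) + \langle \nabla f(x), x^* - x \rangle \ge f(x) - \|\nabla f(x)\|_2 \|x^* - x\|_2 \ge f(x) - \|\nabla f(x)\|_2 \|x^*\|_2,$$
which gives $\|\nabla f(x)\|_2^2 \ge \frac{(f(x) - f(x^*))^2}{\|x^*\|_2^2}$, and so
$$f(x) - f(x') \ge \min\left\{\frac{(f(x) - f(x^*))^2}{4 \mu f(x) \|x^*\|_2^2}, \frac{f(x) - f(x^*)}{4 \gamma \sqrt{n} \|x^*\|_2}\right\}.$$
For the norm bound, we suppose that $f(x') > f(x^*)$ (otherwise we are done). We have
$$\|x' - x^*\|_2^2 - \|x - x^*\|_2^2 = \|x' - x\|_2^2 + 2 \langle x - x^*, x' - x \rangle = \eta^2 \|\nabla f(x)\|_2^2 - 2 \eta \langle x - x^*, \nabla f(x) \rangle.$$
Now, note that $\frac{\eta}{2} \|\nabla f(x)\|_2^2 \le f(x) - f(x') \le f(x) - f(x^*)$ and by convexity $\langle x - x^*, \nabla f(x) \rangle \ge f(x) - f(x^*)$, so
$$\|x' - x^*\|_2^2 - \|x - x^*\|_2^2 = \eta^2 \|\nabla f(x)\|_2^2 - 2 \eta \langle x - x^*, \nabla f(x) \rangle \le 0.$$

Proof. We repeatedly use Lemma B.1 to obtain iterates $x^0, x^1, \dots, x^T$. Note that as long as $f(x^t) > f(x^*)$, we have $\|x^t - x^*\|_2 \le \|x^0 - x^*\|_2 := R$. Let $\bar{t}$ be the smallest $t \ge 0$ for which $f(x^t) \le 2 f(x^*)$ or $f(x^t) \le f(x^*) + \varepsilon$, and let $\bar{t} = \infty$ if this never happens. Therefore, for all $t < \bar{t}$ we have $f(x^t) \ge 2 f(x^*) \Rightarrow \frac{f(x^t) - f(x^*)}{f(x^t)} \ge \frac{1}{2}$, and so the statement of Lemma B.1 gives:
$$f(x^t) - f(x^{t+1}) \ge \min\left\{\frac{1}{8 \mu \|x^*\|_2^2}, \frac{1}{4 \gamma \sqrt{n} \|x^*\|_2}\right\} \cdot (f(x^t) - f(x^*)) \ge \frac{1}{8 \mu \|x^*\|_2^2 + 4 \gamma \sqrt{n} \|x^*\|_2} \cdot (f(x^t) - f(x^*)).$$
Equivalently,
$$f(x^{t+1}) - f(x^*) \le \left(1 - \frac{1}{8 \mu \|x^*\|_2^2 + 4 \gamma \sqrt{n} \|x^*\|_2}\right) (f(x^t) - f(x^*)),$$
and chaining these inequalities for $t \in \{0, 1, \dots, \bar{t} - 1\}$, we get
$$f(x^{\bar{t}}) - f(x^*) \le \left(1 - \frac{1}{8 \mu \|x^*\|_2^2 + 4 \gamma \sqrt{n} \|x^*\|_2}\right)^{\bar{t}} (f(x^0) - f(x^*)) \le \varepsilon,$$
as long as $\bar{t} \ge \left(8 \mu R^2 + 4 \gamma \sqrt{n} R\right) \log \frac{f(x^0) - f(x^*)}{\varepsilon}$; therefore we conclude that $\bar{t}$ is at most this quantity.

Now we consider the iterations $t \ge \bar{t}$. If $f(x^{\bar{t}}) \le f(x^*) + \varepsilon$ there are no such iterations and we are done. Otherwise, we have that $f(x^{\bar{t}}) \le 2 f(x^*)$. We again use Lemma B.1 for all such $t$, which gives
$$f(x^t) - f(x^{t+1}) \ge \frac{(f(x^t) - f(x^*))^2}{4 \mu f(x^t) R^2} \ge \frac{(f(x^t) - f(x^*))^2}{8 \mu f(x^*) R^2}.$$
By known convergence results, this recurrence leads to the bound
$$f(x^T) \le f(x^*) + \frac{16 \mu f(x^*) R^2}{T - \bar{t}} = f(x^*) \left(1 + \frac{16 \mu R^2}{T - \bar{t}}\right),$$
implying that $f(x^T) \le (1 + \delta) f(x^*)$ after $T - \bar{t} = O\left(\mu R^2 \frac{1}{\delta}\right)$ additional iterations after $\bar{t}$. Therefore, the total number of iterations to achieve $f(x^T) \le (1 + \delta) \cdot f(x^*) + \varepsilon$ is
$$O\left(\left(\mu R^2 + \gamma \sqrt{n} R\right) \left(\frac{1}{\delta} + \log \frac{f(x^0) - f(x^*)}{\varepsilon}\right)\right).$$
For $\mu = \beta m^{-1}$, $\gamma = \sqrt{\beta}$, and using the fact that $R \ge \sqrt{n}$, we get

Now, re-arranging and using the fact that $3 m e^{-\lambda \alpha} \le 1$ implies $e^{3 m e^{-\lambda \alpha}} \le 1 + 6 m e^{-\lambda \alpha}$, we have that
$$b_i (Ax)_i \ge -\lambda^{-1} \log\left(e^{3 m e^{-\lambda \alpha}} - 1\right) \ge -\lambda^{-1} \log\left(6 m e^{-\lambda \alpha}\right) = \alpha - \lambda^{-1} \log(6m) \ge \alpha (1 - \varepsilon),$$
where the last inequality follows by our setting of $\lambda \ge \frac{\log(6m)}{\alpha \varepsilon}$. Therefore, the number of iterations is $O(\lambda^3 \alpha) \le O\left(\frac{1}{\alpha^2 \varepsilon^3}\right)$. Additionally, $\|x\|_2 \le \|x^*\|_2$. To bound the distance from the classifier, we note that
$$E^2 := \left\| \frac{x}{\|x\|_2} - x^* \right\|_2^2 = \left\| \frac{x}{\|x\|_2} \right\|_2^2 + \|x^*\|_2^2 - 2 \left\langle \frac{x}{\|x\|_2}, x^* \right\rangle = 2 - 2 \left\langle \frac{x}{\|x\|_2}, x^* \right\rangle.$$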
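On the other hand, we let $\hat{x} = \frac{1}{2} \frac{x}{\|x\|_2} + \frac{1}{2} x^*$ and compute its smallest margin as
$$\alpha \ge \min_i \frac{b_i (A \hat{x})_i}{\|\hat{x}\|_2} \ge \frac{\frac{\alpha (1 - \varepsilon)}{2 \|x\|_2} + \frac{\alpha}{2}}{\sqrt{\frac{1}{4} + \frac{1}{4} + \frac{1}{2} \left\langle \frac{x}{\|x\|_2}, x^* \right\rangle}} \ge \frac{\frac{\alpha (1 - \varepsilon)}{2} + \frac{\alpha}{2}}{\sqrt{1 - E^2/4}},$$
where the last step uses $\|x\|_2 \le \|x^*\|_2 = 1$ and $\left\langle \frac{x}{\|x\|_2}, x^* \right\rangle = 1 - E^2/2$. Re-arranging, we get that
$$1 - E^2/4 \ge \frac{1}{4} (2 - \varepsilon)^2 = 1 - \varepsilon + \varepsilon^2/4 \quad \Rightarrow \quad E^2 \le 4 \varepsilon - \varepsilon^2 \le 4 \varepsilon.$$
Therefore, we have $\left\| \frac{x}{\|x\|_2} - x^* \right\|_2 \le 2 \sqrt{\varepsilon}$, or in other words $\left\| \frac{x}{\|x\|_2} - x^* \right\|_2 \le E$ after $O\left(\frac{1}{\alpha^2 E^6}\right)$ iterations.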
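The gradient update of Lemma B.1, which drives the results above, can be sketched on a small separable instance. This is a minimal illustration and not the paper's experimental code: it assumes the averaged logistic loss with labels folded into `A`, and it uses the conservative constants `mu = max_i ||a_i||_2^2` and `gamma = max_ij |A_ij|`, which are choices made for this sketch and need not match the paper's $\mu = \beta m^{-1}$, $\gamma = \sqrt{\beta}$. The synthetic data and all function names are hypothetical.

```python
import numpy as np

def logistic_loss(A, x):
    # Averaged logistic loss; labels are assumed to be folded into the rows of A.
    return float(np.mean(np.logaddexp(0.0, -(A @ x))))

def logistic_grad(A, x):
    z = A @ x
    s = np.exp(-np.logaddexp(0.0, z))  # sigma(-z), computed stably
    return -(A.T @ s) / A.shape[0]

def gd_variable_step(A, x0, num_iters):
    """Gradient descent with eta = 0.5 * min{1/(mu f(x)), 1/(gamma ||grad||_1)}."""
    mu = float(np.max(np.sum(A * A, axis=1)))  # ASSUMPTION: conservative smoothness constant
    gamma = float(np.max(np.abs(A)))           # ASSUMPTION: conservative robustness constant
    x = x0.copy()
    losses, etas, grad_sq = [], [], []
    for _ in range(num_iters):
        f = logistic_loss(A, x)
        g = logistic_grad(A, x)
        eta = 0.5 * min(1.0 / (mu * f), 1.0 / (gamma * np.linalg.norm(g, 1)))
        losses.append(f)
        etas.append(eta)
        grad_sq.append(float(g @ g))
        x = x - eta * g
    losses.append(logistic_loss(A, x))
    return x, losses, etas, grad_sq

# Hypothetical linearly separable instance: labels come from a hidden direction w.
rng = np.random.default_rng(0)
m, n = 60, 4
raw = rng.normal(size=(m, n))
w = rng.normal(size=n)
labels = np.sign(raw @ w)
A = labels[:, None] * raw  # fold the +-1 labels into the data matrix

x_final, losses, etas, grad_sq = gd_variable_step(A, np.zeros(n), 200)
```

Here `losses[t] - losses[t + 1]` can be compared against `0.5 * etas[t] * grad_sq[t]`, the per-step progress derived in the proof of Lemma B.1, and `etas` shows how the step size evolves as the loss falls.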



This formulation is without loss of generality, because we can incorporate the binary $\pm 1$ labels into the matrix $A$ and assume that all the labels are positive.

The theorem can be stated without this additional constraint, but we include it because it makes the bounds considerably simpler.

The assumption $R \ge \sqrt{n}$ is not necessary but simplifies the bounds.



for any choice of $\delta \in (0, 1)$ and parameter $c > 0$. Each iteration consists of evaluating the logistic regression gradient $\nabla f$ plus $O(m + n)$ additional time.

Corollary 4.2. If $M, B, C \le O(1)$ and $x^*$ is $s$-sparse, then Algorithm 1 with $\lambda_t = \min\left\{1/\|x^t\|_1, 1\right\}$ returns a solution $x$ with $f(x) \le 1.1 \cdot f(x^*) + \varepsilon$ and sparsity

Figure 2: Comparison of fixed vs increasing step size on logistic regression on adult dataset

$\ldots - 2 \eta \langle x - x^*, \nabla f(x) \rangle \le 0$.

B.2 PROOF OF THEOREM 5.2

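Let us consider a classifier $x^*$ with $\|x^*\|_2 = 1$ and margins $\ge \alpha$, i.e. $b_i (A x^*)_i \ge \alpha$ for all $i \in [m]$. Now, Corollary 5.1 and Theorem 5.2 imply that we can compute a solution $x$ with $f_\lambda(x) \le 2 f_\lambda(x^*) + \varepsilon$ after $T = O\left(\lambda^2 X \log \frac{m}{\varepsilon}\right)$ iterations. Now, note that
$$\sum_i \log\left(1 + e^{-\lambda b_i (Ax)_i}\right) \le 2 \sum_i \log\left(1 + e^{-\lambda b_i (A x^*)_i}\right) + \varepsilon \le 2 m \log\left(1 + e^{-\lambda \alpha}\right) + \varepsilon \le 2 m e^{-\lambda \alpha} + \varepsilon \le 3 m e^{-\lambda \alpha},$$
after setting $\varepsilon = m e^{-\lambda \alpha}$.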
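The passage from a small total loss to a margin guarantee can be checked numerically. The sketch below assumes that the total loss $L = \sum_i \log(1 + e^{-\lambda z_i})$ is at most 1, a condition chosen here so that $e^L - 1 \le 2L$ holds. Under that assumption every margin $z_i$ is at least $-\lambda^{-1} \log(2L)$, which for $L = 3 m e^{-\lambda \alpha}$ equals $\alpha - \lambda^{-1} \log(6m)$. The function names and the synthetic margins are illustrative assumptions.

```python
import numpy as np

def total_scaled_loss(margins, lam):
    # L = sum_i log(1 + exp(-lam * z_i)), where z_i stands for b_i (A x)_i.
    return float(np.sum(np.logaddexp(0.0, -lam * margins)))

def margin_lower_bound(total_loss, lam):
    """Lower bound on min_i z_i implied by a total loss L <= 1.

    log(1 + exp(-lam z_min)) <= L gives exp(-lam z_min) <= e^L - 1 <= 2 L,
    hence z_min >= -log(2 L) / lam.
    """
    if not 0.0 < total_loss <= 1.0:
        raise ValueError("this sketch assumes 0 < total_loss <= 1")
    return -np.log(2.0 * total_loss) / lam

# Hypothetical margins and scaling parameter.
rng = np.random.default_rng(1)
margins = rng.uniform(0.5, 2.0, size=50)
lam = 20.0

L = total_scaled_loss(margins, lam)
bound = margin_lower_bound(L, lam)
smallest_margin = float(margins.min())
```

Comparing `smallest_margin` with `bound` shows how tight the implication is on a given instance; larger `lam` makes the logarithmic correction term smaller.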

Algorithms for logistic regression and dependence on $m/\varepsilon$ (omitting extra $\mathrm{polylog}(m, n)$ factors). Algorithms with exponential dependence on any problem parameter are omitted.



Upper bounds on the quantity $\langle w(x), (A \nabla f(x))^2 \rangle / \left(f(x) m^{-1} \|A \nabla f(x)\|_2^2\right)$. Shown here is the maximum of this quantity over $x$ ranging over the first 1000 iterates.


Completely analogously to the proof of Corollary 5.1, this implies that the total number of such iterations (and thus total sparsity) is

Now it remains to bound the total number of iterations. We have

As a result, the progress bound of Lemma A.2 becomes

and, analogously to the proof of Corollary 5.1 and using the fact that $\|x^*\|_1 \le \|x^*\|_0 B$, the total number of iterations is bounded by

A.2.3 PROOF OF THEOREM 4.3

Proof. We proceed similarly to the proof of Theorem 4.1, but now we can strengthen Lemma A.2 because $x^t$ is fully corrected for all $t$, i.e. $\nabla_i f(x^t) = 0$ for all $i \in \mathrm{supp}(x^t)$. As in the proof of Lemma A.2, we can lower bound the amount of progress as a function of $\|\nabla f(x^t)\|_\infty$ as follows:

Now, by convexity of $f$ we have

Because of fully corrective steps we have $\langle \nabla f(x^t), x^t \rangle = 0$, and so the left hand side of (7) is upper bounded by $\|\nabla f(x^t)\|_\infty \|x^*\|_1$. As a result, we have

and so we get the progress bound of

Similarly to the proof of Theorem 4.1, this progress bound leads to a sparsity of

and the same number of iterations.

