GRADIENT DESCENT CONVERGES LINEARLY FOR LOGISTIC REGRESSION ON SEPARABLE DATA

Anonymous authors
Paper under double-blind review

Abstract

We show that running gradient descent on the logistic regression objective guarantees loss f(x) ≤ 1.1 · f(x*) + ε, where the error ε decays exponentially with the number of iterations. This is in contrast to the common intuition that the absence of strong convexity precludes linear convergence of first-order methods, and highlights the importance of variable learning rates for gradient descent. For separable data, our analysis proves that the error between the predictor returned by gradient descent and the hard SVM predictor decays as poly(1/t), exponentially faster than the previously known bound of O(log log t / log t). Our key observation is a property of the logistic loss that we call multiplicative smoothness, which is (surprisingly) little-explored: as the loss decreases, the objective becomes (locally) smoother, and therefore the learning rate can increase. Our results also extend to sparse logistic regression, where they lead to an exponential improvement of the sparsity-error tradeoff.

1. INTRODUCTION

Logistic regression is one of the most widely used classification methods because of its simplicity, interpretability, and good practical performance. Yet, the convergence behavior of first-order methods on this task is not well understood: in practice, gradient descent performs much better than the theory predicts. In particular, the general analysis of gradient descent for smooth functions implies convergence with the error in function value decaying as O(1/T). Analyses with stronger, linear convergence guarantees generally require the function to satisfy strong convexity, which, in contrast to other losses such as the ℓ2 loss, the logistic loss satisfies only on a bounded set of solutions around zero. As a result, this introduces an exponential runtime dependence on the magnitude of the optimal solution (Rätsch et al., 2001; Freund et al., 2018), which is undesirable in practice. This poses a serious obstacle to obtaining favorable error rates for logistic regression that lead to high-precision solutions.

A deeper study of the structure of the exponential and logistic losses was carried out by Telgarsky & Singer (2012), who showed that, for linearly separable data, greedy coordinate descent achieves linear convergence with a rate that depends on the maximum linear classification margin (i.e., the hard SVM margin). Unfortunately, for logistic regression, the rate also has a 2^m dependence on the number of examples m, making it inefficient for any real-world task. The significance of the separability of the data for convergence has also been observed by Telgarsky (2013); Freund et al. (2018), who present convergence results based on quantitative measures of separability. Telgarsky (2013) also refines the results of Telgarsky & Singer (2012) for the exponential loss, but still suffers from an exponential overhead originating from the multiplicative discrepancy between the exponential and the logistic loss.
Interestingly, Telgarsky (2013) points out that logistic regression experiments paint a much more favorable picture than the theory predicts. For separable data, Soudry et al. (2018) showed that the gradient descent logistic regression estimator converges to the maximum margin estimator at a rate of O(log log T / log T), which implies function value convergence at a rate of O(1/T). Interestingly, Nacson et al. (2019) experimentally observed that these rates seem to be exponentially improvable if one uses variable step sizes, in the case of logistic regression and shallow neural networks. However, as shown in Ji & Telgarsky (2018), the separability assumption is important, and the poly(1/T) bound on function value convergence is tight for gradient descent on arbitrary data.

Another approach to obtaining high-precision solutions is to use second-order methods, which, in addition to first-order (gradient) information, use second-order (Hessian) information about the function. These make use of second-order stability properties, such as quasi-self-concordance (Bach, 2010) combined with Newton's method (Karimireddy et al., 2018), or ball oracles (Carmon et al., 2020; Adil et al., 2021). Such approaches are generally not suitable for large-scale applications because of their reliance on repeated calls to large linear system solvers.

Our work. In this paper, we show that (under appropriate assumptions) we can get the best of both worlds of first- and second-order methods, thus giving a partial explanation for the excellent performance of first-order methods on logistic regression in practice. In particular, given a binary classification instance (A ∈ {-1, 1}^{m×n}, b ∈ {-1, 1}^m) with associated logistic loss f(x) = Σ_i log(1 + exp(-b_i (Ax)_i)), we show that simple variants of gradient descent return a solution with f(x) ≤ (1 + δ) · f(x*) + ε after O(K(1/δ + log(f(0)/ε))) iterations, where K = poly(n, ‖x*‖).
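To make the objective concrete, here is a minimal NumPy sketch of plain gradient descent on the logistic loss f(x) = Σ_i log(1 + exp(-b_i (Ax)_i)). The toy instance, step size, and iteration count are illustrative choices, not taken from the paper; this is the baseline method, not the paper's variable-step variant.

```python
import numpy as np

def logistic_loss(A, b, x):
    # f(x) = sum_i log(1 + exp(-b_i (Ax)_i)), computed stably via logaddexp
    return np.sum(np.logaddexp(0.0, -b * (A @ x)))

def logistic_grad(A, b, x):
    # With margins m_i = b_i (Ax)_i:  d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))
    return -(A.T @ (b * s))

def gradient_descent(A, b, steps=1000, eta=0.1):
    # Fixed-step gradient descent; eta is chosen well below 2 / smoothness here
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x -= eta * logistic_grad(A, b, x)
    return x

# Toy linearly separable instance (hypothetical data, for illustration only)
A = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
b = np.array([1.0, 1.0, -1.0, -1.0])
x = gradient_descent(A, b)
```

On separable data like this, the loss keeps decreasing toward 0 while ‖x‖ grows, which is exactly the regime the paper's analysis targets.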
Even though the error still decays as 1/T in the worst case because of the 1/δ dependence, the additive error is now δ · f(x*) instead of δ · f(0), allowing for much faster convergence when the optimal loss f(x*) is small (which is our measure of linear separability of the data). For linearly separable data, i.e., as f(x*) approaches 0, the convergence becomes linear. We also show that the distance to the maximum margin estimator, ‖x/‖x‖_2 - x*/‖x*‖_2‖_2, decays as 1/T, exponentially improving over the log log T / log T bound of Soudry et al. (2018).

Instead of properties like Lipschitzness, smoothness, and strong convexity that are commonly used in the study of first-order methods, we find that two other properties are more relevant to the structure of the logistic regression problem. The first is second-order robustness, which means that the Hessian is stable (in a spectral sense) in any small enough norm ball (Cohen et al., 2017). This is closely related to quasi-self-concordance, a property that has previously been used in the analysis of second-order algorithms (Bach, 2010). The second property is what we call multiplicative smoothness, which means that the function is locally smooth, with the smoothness constant proportional to the function value (loss). Together, these properties show that, as the loss decreases, the objective becomes (locally) smoother, and therefore the learning rate can increase. This motivates a variable step size schedule that is inversely proportional to the loss, thus making larger steps as the solution approaches optimality. This in fact agrees with the observations of Soudry et al. (2018); Nacson et al. (2019) on the importance of a variable learning rate. As can be seen in the toy example from Soudry et al. (2018) in Figure 1, simply replacing the fixed learning rate η used in Soudry et al. (2018) by an increasing learning rate η · f(x_0)/f(x_T) yields an exponential improvement, both in loss and in distance to the maximum margin estimator.

Figure 1: Comparison between fixed and increasing step sizes in the toy example from Figure 1 of Soudry et al. (2018). The fixed step size is set to β^{-1} := ‖A‖_2^{-2}, and the increasing one to β^{-1} · f(x_0)/f(x_T). The estimator error is defined as ‖x_t/‖x_t‖_2 - x*/‖x*‖_2‖_2.
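The variable schedule can be sketched as follows. Rescaling the step at every iteration by f(x_0)/f(x_t) is one natural reading of "inversely proportional to the loss"; the separable toy data, base step size, and iteration count below are hypothetical, not the instance from Soudry et al. (2018).

```python
import numpy as np

def logistic_loss(A, b, x):
    return np.sum(np.logaddexp(0.0, -b * (A @ x)))

def logistic_grad(A, b, x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))
    return -(A.T @ (b * s))

def gd(A, b, steps, eta, variable=False):
    x = np.zeros(A.shape[1])
    f0 = logistic_loss(A, b, x)
    for _ in range(steps):
        # Variable schedule: step grows as the loss shrinks, exploiting
        # multiplicative smoothness (local smoothness proportional to the loss)
        step = eta * f0 / logistic_loss(A, b, x) if variable else eta
        x -= step * logistic_grad(A, b, x)
    return x

# Separable toy data (hypothetical, for illustration only)
A = np.array([[1.0, 0.5], [0.5, 1.5], [-1.0, -0.5], [-0.5, -1.5]])
b = np.array([1.0, 1.0, -1.0, -1.0])
x_fixed = gd(A, b, steps=500, eta=0.05)
x_var = gd(A, b, steps=500, eta=0.05, variable=True)
```

The fixed schedule drives the loss down polynomially in the iteration count, while the variable schedule takes increasingly large steps as the loss shrinks, mirroring the comparison in Figure 1.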

