GRADIENT DESCENT CONVERGES LINEARLY FOR LOGISTIC REGRESSION ON SEPARABLE DATA

Anonymous authors
Paper under double-blind review

Abstract

We show that running gradient descent on the logistic regression objective guarantees loss f(x) ≤ 1.1 · f(x*) + ε, where the error ε decays exponentially with the number of iterations. This is in contrast to the common intuition that the absence of strong convexity precludes linear convergence of first-order methods, and highlights the importance of variable learning rates for gradient descent. For separable data, our analysis proves that the error between the predictor returned by gradient descent and the hard SVM predictor decays as poly(1/t), exponentially faster than the previously known bound of O(log log t / log t). Our key observation is a property of the logistic loss that we call multiplicative smoothness and that is (surprisingly) little-explored: as the loss decreases, the objective becomes (locally) smoother, and therefore the learning rate can increase. Our results also extend to sparse logistic regression, where they lead to an exponential improvement of the sparsity-error tradeoff.
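Informally, and at the level of a single example only (the back-of-the-envelope bound below illustrates the intuition and is not the formal definition used in our analysis), the pointwise logistic loss ℓ(z) = log(1 + e^{-z}) satisfies

    ℓ''(z) = σ(z) σ(-z) ≤ σ(-z) = e^{-z} / (1 + e^{-z}) ≤ log(1 + e^{-z}) = ℓ(z),

where σ(z) = 1/(1 + e^{-z}) and the last step uses u/(1 + u) ≤ log(1 + u) for u ≥ 0. The local curvature of each term is thus bounded by its own loss value, so a smaller loss permits (locally) larger step sizes, even though the worst-case smoothness constant 1/4 never improves.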

1. INTRODUCTION

Logistic regression is one of the most widely used classification methods because of its simplicity, interpretability, and good practical performance. Yet the convergence behavior of first-order methods on this task is not well understood: in practice, gradient descent performs much better than the theory predicts. In particular, the general analysis of gradient descent for smooth functions implies convergence with the error in function value decaying as O(1/T). Analyses with stronger, linear convergence guarantees generally require strong convexity, a property that, in contrast to other losses such as the ℓ2 loss, the logistic loss satisfies only on a bounded set of solutions around zero. As a result, such analyses introduce an exponential runtime dependence on the magnitude of the optimal solution (Rätsch et al., 2001; Freund et al., 2018), which is undesirable in practice. This poses a serious obstacle to obtaining favorable error rates that lead to high-precision solutions for logistic regression.

A deeper study of the structure of the exponential and logistic losses was carried out by Telgarsky & Singer (2012), who showed that, for linearly separable data, greedy coordinate descent achieves linear convergence with a rate that depends on the maximum linear classification margin (i.e., the hard SVM margin). Unfortunately, for logistic regression the bound also carries a 2^m dependence on the number of examples m, making it inefficient for any real-world task. The significance of the separability of the data for convergence has also been observed by Telgarsky (2013) and Freund et al. (2018), who present convergence results based on quantitative measures of separability. Telgarsky (2013) also refines the results of Telgarsky & Singer (2012) for the exponential loss, but still suffers from an exponential overhead originating from the multiplicative discrepancy between the exponential and the logistic loss. Interestingly, Telgarsky (2013) points out that logistic regression experiments paint a much more favorable picture than the theory predicts.

For separable data, Soudry et al. (2018) showed that the gradient descent logistic regression estimator converges to the maximum margin estimator at a rate of O(log log T / log T), which implies function value convergence at a rate of O(1/T). Interestingly, Nacson et al. (2019) experimentally observed that these rates seem to be exponentially improvable if one uses variable step sizes, both for logistic regression and for shallow neural networks. However, as shown by Ji & Telgarsky (2018), the separability assumption is important: the poly(1/T) bound on function value convergence is tight for gradient descent on arbitrary data.
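To make the role of variable step sizes concrete, the following Python sketch (illustrative only; the adaptive schedule, its constants, and the synthetic data are assumptions made for this example, not the algorithm analyzed in this paper) runs gradient descent on a small separable dataset with (i) the standard fixed step size 1/L, where L = max_i ||x_i||^2 / 4 upper-bounds the global smoothness, and (ii) a heuristic step size inversely proportional to the current loss, in the spirit of the multiplicative-smoothness intuition.

    import numpy as np

    # Small linearly separable dataset: rows of X are examples, labels y in {-1, +1}.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=+2.0, scale=0.5, size=(20, 2)),
                   rng.normal(loc=-2.0, scale=0.5, size=(20, 2))])
    y = np.concatenate([np.ones(20), -np.ones(20)])

    def loss_and_grad(w):
        # f(w) = mean_i log(1 + exp(-y_i <x_i, w>)) and its gradient, computed stably.
        margins = y * (X @ w)
        losses = np.logaddexp(0.0, -margins)          # log(1 + exp(-margin))
        probs = np.exp(-np.logaddexp(0.0, margins))   # sigma(-margin)
        grad = -(X.T @ (y * probs)) / len(y)
        return losses.mean(), grad

    def run_gd(step_rule, T=2000):
        # Plain gradient descent; step_rule maps the current loss to a step size.
        w = np.zeros(X.shape[1])
        for _ in range(T):
            f, g = loss_and_grad(w)
            w = w - step_rule(f) * g
        return loss_and_grad(w)[0]

    B = np.max(np.sum(X ** 2, axis=1))   # max_i ||x_i||^2; global smoothness <= B / 4
    fixed = run_gd(lambda f: 4.0 / B)                        # classical 1/L step
    adaptive = run_gd(lambda f: 0.5 / (B * max(f, 1e-12)))   # heuristic: step grows as the loss shrinks

    print(f"final loss, fixed step 1/L:     {fixed:.3e}")
    print(f"final loss, loss-adaptive step: {adaptive:.3e}")

The schedule 0.5 / (B · f) is purely illustrative: it is motivated by the pointwise bound ℓ'' ≤ ℓ noted after the abstract, which suggests that the local smoothness of the objective scales with the current loss; comparing the two printed losses gives a quick empirical sense of how much variable step sizes help on a given separable dataset.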

