FAST CONVERGENCE OF STOCHASTIC SUBGRADIENT METHOD UNDER INTERPOLATION

Abstract

This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to over-parameterized nonsmooth optimization problems that satisfy an interpolation condition. By leveraging the composite structure of the empirical risk minimization problem, we prove that SSGD converges, respectively, with rates O(1/ε) and O(log(1/ε)) for convex and strongly-convex objectives when interpolation holds. These rates coincide with established rates for the stochastic gradient descent (SGD) method applied to smooth problems that also satisfy an interpolation condition. Our analysis provides a partial explanation for the empirical observation that SGD and SSGD sometimes behave similarly when training smooth and nonsmooth machine learning models. We also prove that the rate O(1/ε) is optimal for the subgradient method in the convex and interpolation setting.

1. INTRODUCTION

Gradient descent (GD) and subgradient descent (subGD) methods are simple and effective first-order optimization algorithms for training machine learning models. The convergence-rate analyses for these methods depend crucially on the smoothness of the objective function. It is well understood that there is a fundamental gap between the convergence rates of gradient-like methods for smooth and nonsmooth problems (Shor, 1984; Nemirovski & Yudin, 1983; Nesterov, 2005; Bubeck, 2015; Beck, 2017). Table 1 summarizes the rates for the main problem classes. In the practice of training machine learning models, however, the nonsmoothness of the model causes little trouble (Glorot et al., 2011; Goodfellow et al., 2016). Neural networks with a nonsmooth activation function such as ReLU can usually be trained as fast as those with a smooth activation function such as softplus; see Figure 1.1 for a motivating experiment that compares the convergence of gradient- and subgradient-based methods for smooth and nonsmooth neural networks on the MNIST dataset, classifying digits 0 and 1. The gradient- and subgradient-based methods, in both batch and stochastic versions, exhibit similar convergence behaviour for smooth and nonsmooth neural networks. There is therefore a discrepancy between theory and practice. The success of over-parameterized models, such as deep and wide neural networks, has instigated a trend in the analysis of stochastic variants of gradient descent in the interpolation setting, in which the model achieves zero training loss and therefore fits all of the training data (Schmidt & Le Roux, 2013; Bassily et al., 2018; Ma et al., 2018b; Jain et al., 2018; Vaswani et al., 2019a;b; Wu et al., 2019; Liu & Belkin, 2020). This recent series of papers offers insight into the fast convergence of SGD and new approaches for algorithm design.
This line of analysis, however, focuses exclusively on smooth objective functions, and cannot explain the effectiveness of SSGD for training nonsmooth neural networks.

f(x)            | smooth       | nonsmooth
----------------|--------------|----------
convex          | O(1/ε)       | O(1/ε²)
strongly convex | O(log(1/ε))  | O(1/ε)

Table 1: Worst-case iteration complexity of batch gradient and subgradient methods.

• Proof that the iteration bound O(1/ε) is optimal in the convex and interpolation setting. In contrast to the case of a smooth objective function, subgradient-based methods cannot be further accelerated for nonsmooth models, even under the interpolation assumption.
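To make the setting concrete, the following is a minimal sketch (not the paper's algorithm or notation) of stochastic subgradient descent on an over-parameterized nonsmooth problem where interpolation holds by construction. The least-absolute-deviation objective, the planted solution `x_star`, and the Polyak-type step size, which presumes knowing that each term's minimum value is zero under interpolation, are all illustrative assumptions.

```python
import numpy as np

# Minimize f(x) = (1/n) * sum_i |a_i^T x - b_i|, a nonsmooth convex objective.
# We generate b = A x_star so that every term attains its minimum value 0 at
# x_star: the interpolation condition holds by construction.
rng = np.random.default_rng(0)
n, d = 20, 50                      # n < d: over-parameterized regime
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                     # zero residuals at x_star => interpolation

def objective(x):
    return np.mean(np.abs(A @ x - b))

x = np.zeros(d)
for _ in range(10000):
    i = rng.integers(n)                    # sample one term uniformly
    r = A[i] @ x - b[i]
    if r != 0.0:
        g = np.sign(r) * A[i]              # subgradient of |a_i^T x - b_i|
        step = abs(r) / (A[i] @ A[i])      # Polyak step: uses min f_i = 0
        x = x - step * g
```

With this step size the update reduces to a randomized-Kaczmarz-style projection onto the sampled constraint, and the training loss is driven rapidly toward zero, consistent with the fast rates discussed above; with a generic diminishing step size, the classic slower nonsmooth rates would apply instead.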

2.1. FIRST-ORDER METHODS FOR NONSMOOTH OPTIMIZATION

The subgradient method and its convergence analysis for general convex and nonsmooth problems were first described by Shor in the late 1960s and 1970s; see Shor (1984). Nemirovski & Yudin (1983) subsequently established that the iteration complexity O(1/ε²) described by Shor (1984) is optimal for methods that can only access a subgradient oracle. Subsequent works on minimizing general convex and nonsmooth problems include stochastic subgradient methods (Polyak, 1987; Shalev-Shwartz et al., 2007; Shamir & Zhang, 2013), dual averaging (Nesterov, 2009), and acceleration via smoothing (Nesterov, 2005; Beck & Teboulle, 2012). More recently, Zhang et al. (2020) and Shamir (2020), among others, described the tractability and complexity of computing an approximate stationary point for general nonsmooth and nonconvex problems. A related line of work involves the analysis of optimization algorithms for partially nonsmooth objectives of the form f(x) + g(x), where the function f is smooth, and the function g, which usually represents a regularizer, is convex and nonsmooth. Many models in feature selection and compressed sensing fall into this category of problems, which are usually solved by variations of the proximal-gradient method (Nesterov, 2007; Beck & Teboulle, 2009; Xiao, 2010; Parikh & Boyd, 2014; Defazio et al., 2014; Allen-Zhu, 2017). These approaches typically use special properties of the regularization function g, which generally do not apply in the context we consider in this paper. Another related line of work focuses on compositional optimization (Drusvyatskiy & Paquette, 2019; Davis & Drusvyatskiy, 2018; Duchi & Ruan, 2018). These works treat the objective as a composition of convex functions and smooth maps, which differs from the objective formulation in this work; see Section 3 for details.
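To illustrate the f(x) + g(x) setting discussed above, the following is a hedged sketch of the standard proximal-gradient (ISTA-style) iteration for the l1-regularized least-squares instance; the problem data, the regularization weight `lam`, and the iteration count are hypothetical choices, not drawn from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_gradient(A, b, lam, iters=500):
    """Minimize (1/2)||Ax - b||^2 + lam*||x||_1 by proximal gradient."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz const of grad
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)             # gradient of the smooth part f
        x = soft_threshold(x - step * grad, step * lam)  # prox of nonsmooth g
    return x

# Tiny demo instance (hypothetical data)
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
x_hat = prox_gradient(A, b, lam=0.5)
```

The key point is that the method exploits the special structure of g: its proximal operator has a closed form (soft-thresholding), which is exactly the kind of property that is unavailable for the general nonsmooth objectives considered in this paper.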

2.2. INTERPOLATION HELPS OPTIMIZATION

The interpolation condition implies that the residual between the prediction of the model and data vanishes. In the context of nonlinear least-squares, for example, it is known that interpolation



http://yann.lecun.com/exdb/mnist/




Figure 1.1: The convergence of gradient-based (GD and SGD) and subgradient-based (subGD and SSGD) methods for smooth and nonsmooth neural networks.

• A description of a semi-smoothness property of a function useful for the iteration-complexity analysis of convex objectives in the interpolation context. Under mild conditions, semi-smoothness under interpolation allows us to prove that SSGD has iteration complexity O(1/ε) for convex objectives, and O(log(1/ε)) for strongly-convex objectives. These rates improve on the classic bounds O(1/ε²) and Õ(1/ε) for convex and strongly-convex objectives, and match the convergence rates of SGD for convex and smooth objectives under interpolation.
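For concreteness, one common way to formalize the interpolation condition for empirical risk minimization (in the spirit of, e.g., Vaswani et al., 2019a) is the following; the notation here is a sketch and may differ from the formulation in Section 3 of the paper.

```latex
% Empirical risk over n training examples
\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)

% Interpolation: a single minimizer x^\star of f simultaneously minimizes
% every individual term, i.e., the over-parameterized model fits all data:
\exists\, x^\star \in \operatorname*{arg\,min}_x f(x)
\quad \text{such that} \quad
f_i(x^\star) = \min_x f_i(x) \quad \text{for all } i = 1, \dots, n.
```

When each loss is nonnegative and attains zero, this reduces to the familiar statement that the training loss of every example vanishes at x*.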

