ON THE CONVERGENCE OF SGD UNDER THE OVER-PARAMETER SETTING

Abstract

With the improvement of computing power, over-parameterized models have become increasingly popular in machine learning. Such models typically have a complicated, non-smooth, and non-convex loss landscape. Nevertheless, training them with a simple first-order optimization algorithm such as stochastic gradient descent (SGD) often yields good results, in both training and testing, even though SGD is known not to guarantee convergence in the non-smooth, non-convex case. On the theoretical side, it was previously proved that in training SGD converges to the global optimum with probability 1 − ϵ, but only for certain models, and with ϵ depending on the model complexity. It has also been observed that SGD tends to select flat minima, which preserve its training performance at test time. In this paper, we first prove that, from an arbitrary initial value and under mild assumptions on the loss function, SGD iterates to the global optimum almost surely. We then prove that if the learning rate exceeds a threshold determined by the local structure of a global minimum, the probability of converging to that global minimum is zero. Finally, we derive the asymptotic convergence rate from the local structure of the global optimum.
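As a toy illustration of the learning-rate threshold phenomenon described above (our own one-dimensional sketch, not the paper's construction or proof), consider gradient descent on f(x) = λx²/2, whose minimum at 0 has curvature λ. The update x ← (1 − ηλ)x contracts if and only if η < 2/λ, so a sharp minimum (large λ) repels the iterates whenever the learning rate exceeds 2/λ:

```python
def gd_distance(lam: float, eta: float, x0: float = 1.0, steps: int = 50) -> float:
    """Distance to the minimum 0 of f(x) = 0.5*lam*x**2 after gradient descent."""
    x = x0
    for _ in range(steps):
        x -= eta * lam * x  # gradient step: f'(x) = lam * x
    return abs(x)

# Same learning rate eta = 0.3 in both runs:
# sharp minimum, lam = 10: eta > 2/lam = 0.2, so the iterates diverge.
# flat minimum,  lam = 1:  eta < 2/lam = 2.0, so the iterates converge.
print(gd_distance(lam=10.0, eta=0.3))  # grows without bound
print(gd_distance(lam=1.0, eta=0.3))   # shrinks toward 0
```

The same learning rate thus excludes the sharp minimum while permitting the flat one, which is the intuition behind the non-convergence result for learning rates above the curvature-dependent threshold.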

1. INTRODUCTION

With the improvement of the computing power of computer hardware, an increasing number of over-parameterized models are deployed in the domain of machine learning. One of the most representative and successful models is the deep neural network (LeCun et al. (2015); Amodei et al. (2015); Graves et al. (2013); He et al. (2016); Silver et al. (2017)), which has achieved great empirical success in various application areas (Wu et al. (2016); Krizhevsky et al. (2017); Silver et al. (2017); Halla et al. (2022)). Meanwhile, deep neural networks are large in scale and have an optimization landscape that is in general non-smooth and non-convex (Wu et al., 2019; Brutzkus & Globerson, 2017). Training such a model should therefore be problematic. In practice, however, very good results are usually obtained simply by using first-order methods such as stochastic gradient descent (SGD). A large theoretical gap persists in understanding this process, and two main questions arise.

1. Due to over-parametrization and the highly complex loss landscape of deep neural networks, optimizing a deep network to the global optimum is likely NP-hard (Brutzkus & Globerson, 2017; Blum & Rivest, 1992). Nevertheless, in practice, simple first-order methods, which do not have a convergence guarantee in the non-smooth and non-convex case (Liu et al., 2022a;b), are capable of finding a global optimum of the training data (Zhang et al., 2021; Brutzkus & Globerson, 2017; Wu et al., 2019). It has been an open problem (Goodfellow et al., 2014) whether, in this setting, SGD provably finds the global optimum, and whether the result generalizes to model structures beyond neural networks.

2. In general, over-parametrized models admit many global optima. These global optima share the same training loss of zero, yet have drastically different test performance (Wu et al., 2018; Feng & Tu, 2021). Interestingly, studies find that SGD tends to converge to the generalizable ones (Zhang et al., 2021). In fact, it is observed empirically that SGD usually finds flat minima, which subsequently enjoy better generalization (Kramers, 1940; Dziugaite & Roy, 2017; Arpit et al., 2017; Kleinberg et al., 2018; Hochreiter & Schmidhuber, 1997; 1994). Why and how does SGD find a flat global minimum? This empirical finding has yet to be theoretically validated.
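The phenomenon in question 1 can be reproduced on a minimal toy problem (our own sketch, not one of the cited experiments): over-parameterized least squares with more parameters than samples, where interpolating global minima with zero training loss exist, and plain constant-step SGD finds one.

```python
import random

# Toy over-parameterized regression: d = 20 parameters, n = 5 samples,
# so the linear system Xw = y is (almost surely) consistent and
# zero-training-loss global minima form an affine subspace.
random.seed(0)
n, d = 5, 20
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

w = [0.0] * d
eta = 0.01  # constant learning rate, well below the stability threshold

for _ in range(5000):
    i = random.randrange(n)                                # sample one data point
    r = sum(X[i][j] * w[j] for j in range(d)) - y[i]       # its residual
    for j in range(d):
        w[j] -= eta * r * X[i][j]                          # SGD step on 0.5 * r**2

# Mean squared training loss after SGD: numerically zero, i.e. SGD has
# reached an interpolating global minimum despite no convexity assumption
# being used in this check.
loss = sum(
    (sum(X[i][j] * w[j] for j in range(d)) - y[i]) ** 2 for i in range(n)
) / n
print(loss)
```

This is of course far easier than a deep network's landscape, but it illustrates the interpolation regime the questions above concern: every per-sample gradient vanishes simultaneously at the global minimum, so constant-step SGD can converge to it exactly.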

