ON THE CONVERGENCE OF SGD UNDER THE OVER-PARAMETER SETTING

Abstract

With the improvement of computing power, over-parameterized models have become increasingly popular in machine learning. Such models typically have a complicated, non-smooth, and non-convex loss landscape. Nevertheless, training them with a simple first-order algorithm such as stochastic gradient descent (SGD) often yields good results in both training and testing, even though SGD is not guaranteed to converge in the non-smooth and non-convex case. On the theoretical side, prior work proved that in training SGD converges to the global optimum with probability 1 − ϵ, but only for certain models, and with ϵ depending on the model complexity. It has also been observed that SGD tends to select flat minima, which preserve its training performance at test time. In this paper, we first prove that, under some mild assumptions on the loss function, SGD converges to the global optimum almost surely from an arbitrary initial value. We then prove that if the learning rate exceeds a threshold determined by the local structure of a global minimum, the probability of converging to that global minimum is zero. Finally, we derive the asymptotic convergence rate based on the local structure of the global optimum.

1. INTRODUCTION

With the growth of computing power, an increasing number of over-parameterized models are deployed in machine learning. One of the most representative and successful is the deep neural network (LeCun et al., 2015; Amodei et al., 2015; Graves et al., 2013; He et al., 2016; Silver et al., 2017), which has achieved great empirical success in various application areas (Wu et al., 2016; Krizhevsky et al., 2017; Silver et al., 2017; Halla et al., 2022). Meanwhile, deep neural networks are large in scale and have an optimization landscape that is in general non-smooth and non-convex (Wu et al., 2019; Brutzkus & Globerson, 2017). Training such a model should, in principle, be difficult. In practice, however, very good results are usually obtained simply by running first-order methods such as stochastic gradient descent (SGD). A large theoretical gap persists in understanding this process. Two main questions arise.

1. Due to over-parametrization and the highly complex loss landscape of deep neural networks, optimizing a deep network to the global optimum is likely NP-hard (Brutzkus & Globerson, 2017; Blum & Rivest, 1992). Nevertheless, in practice, simple first-order methods, which carry no convergence guarantee in the non-smooth and non-convex case (Liu et al., 2022a;b), routinely find a global optimum of the training loss (Zhang et al., 2021; Brutzkus & Globerson, 2017; Wu et al., 2019). It has been an open problem (Goodfellow et al., 2014) whether SGD provably finds the global optimum in this setting, and whether such a result generalizes to model structures beyond neural networks.

2. In general, over-parametrized models admit many global optima. These global optima share the same training loss of zero, yet exhibit drastically different test performance (Wu et al., 2018; Feng & Tu, 2021).
Interestingly, studies find that SGD tends to converge to the generalizable ones (Zhang et al., 2021). In fact, it is observed empirically that SGD usually finds flat minima, which subsequently enjoy better generalization (Kramers, 1940; Dziugaite & Roy, 2017; Arpit et al., 2017; Kleinberg et al., 2018; Hochreiter & Schmidhuber, 1997; 1994). Why and how does SGD find a flat global minimum? This empirical finding has yet to be theoretically validated.

Related Works

For the first question, a number of theoretical results in recent years aim to explain this phenomenon. Many of them focus on concrete neural network models, such as two-layer networks with linear activation functions (Bartlett et al., 2018; Hardt & Ma, 2016). Several works require the inputs to be random Gaussian variables (Ge et al., 2018; Tian, 2017; Du et al., 2017; Zhong et al., 2017). The authors of Wu et al. (2019) and Allen-Zhu et al. (2019) consider the non-smooth case, but their techniques depend on the structure of the network: they prove that when the number of nodes is sufficiently large, the objective is "almost convex" and "semi-smooth". These techniques unfortunately do not extend to more general models. Another commonly used approach is to ignore the non-smoothness and apply the chain rule anyway at the non-smooth points (Bartlett et al., 2018). The resulting derivations provide some intuition but offer no rigorous guarantees, as the chain rule does not hold there (Liu et al., 2022a;b). Even with these restrictions, existing works (Ge et al., 2018; Tian, 2017; Du et al., 2017; Bartlett et al., 2018; Vaswani et al., 2019; Chizat & Bach, 2018) only manage to establish convergence to the global optimum with high probability. The gap between this probability and 1 can depend on the structure of the model, such as the number of nodes in the neural network, which raises further concerns about the tightness of the probability bound.
An analysis of SGD for general models that yields almost-sure convergence to the global optimum is currently lacking. For the second question, most works investigate flat minima in a qualitative way. A recent work by Xie et al. (2020) views the SGD process as a stochastic differential equation (SDE) and uses the SDE to describe how the iterates escape from a sharp minimum. Similar techniques are used by Wu et al. (2019) and Feng & Tu (2021). Unfortunately, SGD can be approximated by an SDE only when the learning rate is sufficiently small; for a normal learning rate, the trajectories generated by SGD and by the SDE can be arbitrarily different. Another technique used to study this problem is linear stability (Wu et al., 2018; Feng & Tu, 2021), which considers a linearized system near a global minimum. The behavior of SGD near a global minimum is then characterized through that linear system. However, unlike a deterministic system, whose behavior near a point can be quantitatively determined by the linearization at that point, the behavior of a stochastic system near a point is determined by the dynamics over all of R^d. Using the linearized system to fully represent SGD near a global minimum is therefore not a rigorous argument.
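The linear-stability viewpoint discussed here can be made concrete in one dimension. The following is a minimal sketch (not any model from the cited works; the curvatures and learning rate are hypothetical): near an interpolating minimum at 0, each per-sample loss is taken to be g_i(θ) = a_i θ²/2, so one SGD step multiplies θ by (1 − η a_i), and the minimum is mean-square linearly stable iff E[(1 − η a_i)²] ≤ 1. Two datasets with the same mean curvature but different spread then behave very differently under the same learning rate, even though full-batch gradient descent is stable on both.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(curvatures, lr, theta0=1e-3, steps=2000):
    """Run 1-D SGD on per-sample losses g_i(theta) = a_i * theta**2 / 2."""
    theta = theta0
    for _ in range(steps):
        a = rng.choice(curvatures)   # sample one data point
        theta *= 1.0 - lr * a        # exact SGD step on g_i
        if abs(theta) > 1e6:         # escaped the neighborhood of 0
            return float("inf")
    return abs(theta)

# Both datasets have mean curvature 5, so full-batch GD with lr = 0.35
# (< 2/5 = 0.4) is stable on both.  Their per-sample spread differs:
sharp = [0.5, 9.5]  # E[(1 - lr*a)^2] ≈ 3.04 > 1  -> mean-square unstable
flat = [4.5, 5.5]   # E[(1 - lr*a)^2] ≈ 0.59 < 1  -> mean-square stable
lr = 0.35

print(np.mean([(1 - lr * a) ** 2 for a in sharp]))  # > 1
print(np.mean([(1 - lr * a) ** 2 for a in flat]))   # < 1
print(run_sgd(flat, lr))   # stays near 0
print(run_sgd(sharp, lr))  # escapes (inf)
```

Note that the escape from the "sharp" minimum is driven purely by the variance of the per-sample curvatures, not by the mean curvature, which matches the intuition that SGD noise destabilizes sharp minima that deterministic gradient descent would tolerate.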

Contributions

1. Under several mild assumptions on the non-smooth and non-convex loss function, we provide the first proof that, from an arbitrary initialization, SGD drives the iterates to the global optimum almost surely, i.e., P(θ_n converges to a global optimum) = 1.

2. Under the same set of assumptions and the same setting of SGD, we prove that if the learning rate is larger than a threshold that depends on the sharpness of a global minimum, the probability that the iterates converge to this global minimum is exactly 0.

3. With similar assumptions and the same setting, we derive the asymptotic rate at which the iterates converge to the global optimum. This result implies that SGD achieves arbitrary accuracy in polynomial time.

Technical Insight

The basic intuition is as follows. We first view SGD as a Markov chain with a continuous state space. We then aim to prove that the global optimum is the only absorbing state of this Markov chain. Concretely, the sampling noise has zero variance when the optimization variable θ reaches the global optimum (Claim 2.1), i.e., E_{ξ_n} ‖∇g(θ, ξ_n) − ∇g(θ)‖² = 0 (notation is defined in the next section), which guarantees that once θ_n reaches the global optimum, it never escapes. Meanwhile, at every other local optimum, the positive variance makes θ_n jump out. In addition, because the chain has a continuous state space, a candidate absorbing state of measure 0 cannot act as a genuine absorbing state: the probability that θ_n lands exactly on it at any given step is 0. Building on this, we require the absorbing state to have a sufficiently flat neighborhood (Assumption 2.2), which implies that iterates θ_n falling into this neighborhood tend to move closer to the absorbing state. Combining the absorbing-state and neighborhood arguments, we can prove that the distribution of θ_n concentrates on the global optimum as the iteration proceeds. Finally, this distribution degenerates to a point mass at the global optimum; that is, θ_n converges to the global optimum.
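The vanishing-noise property underlying the absorbing-state argument can be checked numerically on a toy problem. The sketch below uses a hypothetical over-parameterized least-squares setup (not the paper's model): with more parameters than samples, some θ* interpolates all data exactly, so every per-sample gradient vanishes there and the sampling noise E_ξ ‖∇g(θ, ξ) − ∇g(θ)‖² is zero at the global optimum while remaining positive elsewhere.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 20  # fewer samples than parameters: over-parameterized regime
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm interpolating solution: X @ theta_star == y (up to rounding),
# i.e., a global optimum with zero training loss on every sample.
theta_star = np.linalg.pinv(X) @ y

def noise_variance(theta):
    """E_xi || grad g(theta, xi) - grad g(theta) ||^2 under uniform sampling,
    where g(theta, i) = 0.5 * (x_i . theta - y_i)^2."""
    per_sample = (X @ theta - y)[:, None] * X  # gradient of each sample's loss
    full = per_sample.mean(axis=0)             # full-batch gradient
    return np.mean(np.sum((per_sample - full) ** 2, axis=1))

print(noise_variance(theta_star))    # ~0 (up to floating-point error)
print(noise_variance(np.zeros(d)))   # strictly positive away from the optimum
```

Because the noise variance vanishes exactly at θ*, an SGD iterate that reaches θ* receives zero gradient and zero noise and therefore stays put, which is the absorbing-state property used in the argument above; at any other stationary point the positive variance keeps perturbing the iterate.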

