ON THE CONVERGENCE OF SGD UNDER THE OVER-PARAMETER SETTING

Abstract

With the growth of computing power, over-parameterized models have become increasingly popular in machine learning. Such models typically have a complicated, non-smooth, and non-convex loss landscape. Nevertheless, when training them, simply running a first-order optimization algorithm such as stochastic gradient descent (SGD) often yields good results, both in training and in testing, even though SGD is not guaranteed to converge in the non-smooth and non-convex case. Previous theoretical work proved that, in training, SGD converges to the global optimum with probability 1 - ϵ, but only for certain models, with ϵ depending on the model complexity. It has also been observed that SGD tends to choose a flat minimum, which preserves its training performance at test time. In this paper, we first prove that SGD iterates to the global optimum almost surely from an arbitrary initial value, under some mild assumptions on the loss function. We then prove that if the learning rate is larger than a threshold depending on the structure of a global minimum, the probability of converging to this global optimum is zero. Finally, we derive the asymptotic convergence rate based on the local structure of the global optimum.

1. INTRODUCTION

With the growth of hardware computing power, an increasing number of over-parameterized models are deployed in machine learning. One of the most representative and successful is the deep neural network (LeCun et al. (2015); Amodei et al. (2015); Graves et al. (2013); He et al. (2016); Silver et al. (2017)), which has achieved great empirical success in various application areas (Wu et al. (2016); Krizhevsky et al. (2017); Silver et al. (2017); Halla et al. (2022)). Meanwhile, deep neural networks are large in scale and have an optimization landscape that is in general non-smooth and non-convex (Wu et al., 2019; Brutzkus & Globerson, 2017). Training such a model would seem daunting. In practice, however, one can usually obtain very good results simply by using first-order methods such as stochastic gradient descent (SGD). A large theoretical gap persists in understanding this process. Two main questions arise. 1. Due to over-parametrization and the highly complex loss landscape of deep neural networks, optimizing deep networks to the global optimum is likely NP-hard (Brutzkus & Globerson, 2017; Blum & Rivest, 1992). Nevertheless, in practice, simple first-order methods, which do not have a convergence guarantee in the non-smooth and non-convex case (Liu et al., 2022a;b), are capable of finding a global optimum, and they do so routinely on the training data (Zhang et al., 2021; Brutzkus & Globerson, 2017; Wu et al., 2019). It has been an open problem (Goodfellow et al., 2014) whether, in this setting, SGD provably finds the global optimum, and whether the result generalizes to model structures beyond neural networks. 2. In general, over-parametrized models admit many global optima. These global optima share the same training loss of zero, yet have drastically different test performance (Wu et al., 2018; Feng & Tu, 2021).
Interestingly, studies find that SGD tends to converge to the generalizable ones (Zhang et al., 2021). In fact, it is observed empirically that SGD usually finds flat minima, which subsequently enjoy better generalization (Kramers, 1940; Dziugaite & Roy, 2017; Arpit et al., 2017; Kleinberg et al., 2018; Hochreiter & Schmidhuber, 1997; 1994). Why and how does SGD find a flat global minimum? This empirical finding has yet to be theoretically validated. Related Works For the first question, a number of theoretical results in recent years aim to explain this phenomenon. Many of them focus on concrete neural network models, such as two-layer networks with linear activation functions (Bartlett et al., 2018; Hardt & Ma, 2016). Several works require the inputs to be random Gaussian variables (Ge et al., 2018; Tian, 2017; Du et al., 2017; Zhong et al., 2017). The authors of Wu et al. (2019); Allen-Zhu et al. (2019) consider the non-smooth case, but their techniques depend on the structure of the network: they prove that when the number of nodes is large enough, the objective is "almost convex" and "semi-smooth". These techniques unfortunately do not extend to more general models. Another commonly used technique is to ignore the non-smoothness and apply the chain rule anyway at the non-smooth points (Bartlett et al., 2018). The derivation does provide some intuition, but it offers no rigorous guarantee, as the chain rule does not hold (Liu et al., 2022a;b). Even with these kinds of restrictions, existing works (Ge et al., 2018; Tian, 2017; Du et al., 2017; Bartlett et al., 2018; Vaswani et al., 2019; Chizat & Bach, 2018) only obtain convergence to the global optimum with high probability. The gap between this probability and 1 can depend on the structure of the model, such as the number of nodes in the neural network, which raises further concerns about the tightness of the probability bound.
An analysis of SGD for general models that obtains almost sure convergence to the global optimum is currently lacking. For the second question, most works investigate flat minima in a qualitative way. A recent work by Xie et al. (2020) views the SGD process as a stochastic differential equation (SDE) and uses the SDE to describe how the iterate escapes from a sharp minimum. Similar techniques are used in the works of Wu et al. (2019); Feng & Tu (2021). Unfortunately, SGD can be viewed as an SDE only when the learning rate is sufficiently small; for a normal learning rate, the trajectories formed by SGD and the SDE can be arbitrarily different. Another technique used to study this problem is linear stability (Wu et al., 2018; Feng & Tu, 2021), which considers a linear system near a global minimum. The behavior of SGD near a global minimum is then characterized by the linear system at that minimum. However, unlike a deterministic system, where the behavior near a point can be quantitatively determined by the linearized system at that point, the behavior of a stochastic system near a point is determined by all points in R^d. Using this linearized function to fully represent SGD near a global minimum is thus not a rigorous argument.

Contributions

1. Under several mild assumptions on the non-smooth and non-convex loss function, we provide the first proof that, from an arbitrary initialization, SGD converges to the global optimum almost surely, i.e., P(θ_n converges to a global optimum) = 1. 2. Under the same set of assumptions and the same setting of SGD, we prove that if the learning rate is larger than a threshold, which depends on the sharpness of a global minimum, the probability that the iterate converges to this global optimum is exactly 0. 3. With similar assumptions and the same setting, we derive the asymptotic convergence rate of the iterate toward the global optimum. From this result, we know that SGD achieves an arbitrary accuracy in polynomial time.

Technical Insight

The basic intuition is as follows. We first view SGD as a Markov chain with a continuous state space. We then aim to prove that the global optimum is the only absorbing state of this Markov chain. Concretely, due to the property of the sampling noise, the noise has zero variance when the optimization variable θ reaches the global optimum (Claim 2.1), i.e., E_{ξ_n}∥∇g(θ, ξ_n) - ∇g(θ)∥² = 0 (notations are defined in the next section), which guarantees that once θ_n reaches the global optimum, it does not escape from it. Meanwhile, at other local optima, the positive variance makes θ_n jump out of the local optimum. Moreover, since this Markov chain has a continuous state space, an absorbing state of measure 0 cannot act as a real absorbing state: the probability of θ_n hitting it in any single epoch is 0. Based on this, we need the absorbing state to have a flat-enough neighborhood (Assumption 2.2), which ensures that iterates falling in this neighborhood tend to move closer to the absorbing state. Combining the absorbing-state and neighborhood arguments, we can prove that the distribution of θ_n concentrates on the global optimum as the iteration proceeds. Finally, this distribution degenerates to the global optimum; that is, θ_n converges to the global optimum. This neighborhood is the key insight in proving the convergence of SGD. The neighborhood cannot be very sharp (it has at most quadratic growth), which is the reason for Assumption 2.2, item 1. This is reflected in Equation (8): a flat enough neighborhood makes the coefficient of the third term of (8) negative, which in turn makes R(θ_n) (the Lyapunov function) decrease with high probability (θ_n moves closer to the global optimum). Otherwise, if the neighborhood is sharp, this coefficient becomes positive, which makes R(θ_n) increase (θ_n moves away from the global optimum).
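The zero-variance property at the heart of this intuition can be seen in a toy over-parameterized least-squares problem. The sketch below is illustrative only (a linear model f(θ, x) = θᵀx with d > N, not the paper's general setting): the sampling-noise variance of size-1 mini-batch gradients vanishes exactly at an interpolating optimum and is positive elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 20                        # over-parameterized: d > N
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# Minimum-norm interpolating solution: f(theta*, x_i) = y_i for all i.
theta_star = np.linalg.pinv(X) @ y

def per_sample_grads(theta):
    # gradient of g(theta, x_i) = (x_i^T theta - y_i)^2 for each i
    r = X @ theta - y
    return 2.0 * r[:, None] * X

def noise_variance(theta):
    # E_xi || grad g(theta, xi) - grad g(theta) ||^2 for size-1 mini-batches
    grads = per_sample_grads(theta)
    return np.mean(np.sum((grads - grads.mean(axis=0)) ** 2, axis=1))

assert noise_variance(theta_star) < 1e-12       # zero noise at the optimum
assert noise_variance(theta_star + 0.1) > 1e-6  # positive noise elsewhere
```

Because every residual f(θ*, x_i) - y_i is zero at an interpolating optimum, every per-sample gradient is zero there, so the optimum is a common stationary point of all mini-batches and the iterate cannot be kicked away from it.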

2. PROBLEM FORMULATION

We investigate SGD in the over-parametrization setting, under a few mild assumptions on the objective function. The setting and the assumptions, as well as some preliminaries relevant to the results, are provided in Section 2.1. We then present the sampling schemes in Section 2.2.

2.1. OPTIMIZATION UNDER OVER-PARAMETRIZATION

In this paper, given a dataset D = {(x_i, y_i)}, x_i, y_i ∈ R^d, we consider a model ŷ_i = f(θ, x_i) and the mean-square error (MSE) loss, i.e.,

g(θ) = (1/N) Σ_{i=1}^{N} g(θ, x_i),  g(θ, x_i) = ∥f(θ, x_i) - y_i∥².  (1)

The goal of an optimization method such as SGD is to obtain an optimum θ ∈ J*, where J* = argmin_{θ∈R^d} g(θ). In the over-parametrization setting, this optimal value is zero. To handle the non-smoothness, we recall the definition of the Clarke subdifferential (Clarke, 1990), which is an important tool for designing and analyzing SGD algorithms.

Definition 1 (Clarke subdifferential (Clarke, 1990)). Let x̄ ∈ Ω be given. The Clarke subdifferential of f at x̄ is defined by ∂f(x̄) = co{ lim_{x→x̄} ∇f(x) : f is smooth at x }, where co represents the convex hull. If f is furthermore smooth, it holds that ∂f(x̄) = {∇f(x̄)}.

We use ∇f(x) to denote an arbitrary element of ∂f(x) and, for convenience, call it a subgradient. The Clarke subdifferential does not obey the chain rule, so several techniques for the regular gradient cannot be reused in our case; we provide a counterexample to illustrate this in Claim A.1. This property, along with a few assumptions that rule out pathological cases, is stated in the assumption below.

Assumption 2.1. The loss function g(θ) satisfies the following conditions: 1. g(θ) is continuous and smooth almost everywhere; 2. the global optimal value of g(θ) is 0; 3. the set of global optimum points J* is composed of countably many connected components J_i, i.e., J* = ∪_{i=1}^{+∞} J_i (J_i ∩ J_j = ∅ for i ≠ j); 4. there is a scalar c > 0 such that, whenever g is smooth at θ_1 and θ_2, for any data point (x_i, y_i), ∥∇g(θ_1, x_i) - ∇g(θ_2, x_i)∥ ≤ c max{∥θ_1 - θ_2∥, 1}.

This assumption describes the overall structure of the loss function g(θ). All four items are quite mild and are commonly used in optimization and learning. Items 1 and 2 hold under the MSE loss and the over-parametrization setting.
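As a concrete illustration of Definition 1 (not part of the paper's formal development), the Clarke subdifferential of ReLU at 0 can be recovered numerically by taking limits of gradients from the smooth regions on either side:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def central_diff(f, x, h=1e-6):
    # numerical derivative at a point where f is smooth
    return (f(x + h) - f(x - h)) / (2.0 * h)

# ReLU is smooth everywhere except at 0; approach 0 from both smooth sides:
left  = central_diff(relu, -1e-3)   # gradient on the left branch
right = central_diff(relu, +1e-3)   # gradient on the right branch

# The Clarke subdifferential at 0 is the convex hull of the limiting
# gradients: co{0, 1} = [0, 1]; any element of it is a valid subgradient.
assert abs(left - 0.0) < 1e-9 and abs(right - 1.0) < 1e-6
```

Any value in [0, 1] may thus be returned as ∇relu(0), which is exactly the freedom the notation ∇f allows.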
Item 3 states that the optimum set is composed of countably many connected components; this holds for almost all functions unless one delicately constructs a pathological counterexample (Jin et al., 2022). In this paper, to keep the presentation clear, we work with the countably-many-points assumption J* = ∪_{i=1}^{+∞} {θ*_i} to avoid tedious arguments about a continuum of optima. Item 4 can be seen as a non-smooth extension of the traditional L-smooth condition, i.e., ∥∇g(x) - ∇g(y)∥ ≤ L∥x - y∥. It is satisfied by many non-smooth functions, such as ReLU and leaky-ReLU. As with the regular gradient, the subgradient is zero at the optimum.

Claim 2.1. For the MSE loss function (1), if the global optimal value is 0, i.e., min_{θ∈R^d} g(θ) = 0, then the subgradient at the optimum points J* is 0.

Proof. For any θ_0 ∈ {θ | g is smooth at θ}, we have ∇g(θ_0) = (2/N) Σ_{i=1}^{N} (f(θ_0, x_i) - y_i) ∇f(θ_0, x_i). Then for any θ* ∈ J*, taking θ_0 → θ* along points where g is smooth,

lim_{θ_0→θ*} ∇g(θ_0) = lim_{θ_0→θ*} (2/N) Σ_{i=1}^{N} (f(θ_0, x_i) - y_i) ∇f(θ_0, x_i) = 0.

Hence ∂g(θ*) = co{ lim_{θ_0→θ*} ∇g(θ_0) : g is smooth at θ_0 } = co{0}, which concludes that ∇g(θ*) = 0.

Notice that although g is non-smooth in general, in our setting it is smooth at the optimum, as described in the above claim. This distinguishes our setting from the line of literature on non-smooth optimization. To obtain global convergence, we need at least one θ* ∈ J* that is not too "sharp". That is, in a δ_{θ*}-neighborhood of θ*, the loss function satisfies an L-smooth-type condition with coefficient β_{θ*} and the following assumption:

Assumption 2.2. There exist θ* ∈ J*, r_{θ*} ≥ 1, δ_{θ*} > 0, and a neighborhood U(θ*, δ_{θ*}) of θ* such that, for those θ ∈ U(θ*, δ_{θ*}) at which ∇g(θ) exists, 1. for any mini-batch C_i, g_{C_i}(θ) satisfies the local one-point L-smooth condition, i.e., ∥∇g_{C_i}(θ)∥ < β_{θ*} ∥θ - θ*∥ (∀ θ ∈ U(θ*, δ_{θ*}));

2. The loss function satisfies

∇g(θ)^T (θ - θ*) > α_{θ*} ∥θ - θ*∥^{r_{θ*}+1} (∀ θ ∈ U(θ*, δ_{θ*})), for some constant α_{θ*} > 0.

The first item of this assumption is very mild. By Claim 2.1, g(θ) is smooth at θ*, i.e., lim_{θ→θ*} ∇g(θ) = ∇g(θ*) = 0. Item 1 merely requires that the subgradient tend to 0 no more slowly than a linear function (not as sharply as O(√∥θ - θ*∥) or O(∥θ - θ*∥^{0.9})). The second item is closely related to the local Kurdyka-Łojasiewicz condition, i.e., ∥∇g(θ)∥^{2r} ≥ g(θ) - g(θ*) (r ≥ 1, θ ∈ U(θ*, δ_{θ*})).
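Both items of Assumption 2.2 can be checked numerically on a simple smooth quadratic g(θ) = ½(θ - θ*)ᵀH(θ - θ*). This is an illustrative stand-in (with r_{θ*} = 1), not the paper's non-smooth MSE setting; the smallest and largest eigenvalues of H play the roles of α_{θ*} and β_{θ*}.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)              # positive-definite Hessian
theta_star = rng.normal(size=d)      # the global minimum

grad_g = lambda th: H @ (th - theta_star)

eigs = np.linalg.eigvalsh(H)         # ascending eigenvalues
alpha, beta = eigs[0], eigs[-1]      # candidate alpha_{theta*}, beta_{theta*}

for _ in range(1000):
    th = theta_star + 0.1 * rng.standard_normal(d)  # theta in U(theta*, delta)
    diff = th - theta_star
    # Item 1: ||grad g(theta)|| <= beta * ||theta - theta*||
    assert np.linalg.norm(grad_g(th)) <= beta * np.linalg.norm(diff) + 1e-9
    # Item 2 with r_{theta*} = 1:
    #   grad g(theta)^T (theta - theta*) >= alpha * ||theta - theta*||^2
    assert grad_g(th) @ diff >= alpha * (diff @ diff) - 1e-9
```

A flatter minimum corresponds to a smaller β_{θ*} relative to α_{θ*}, which is exactly the ratio that controls the admissible learning rates in Theorems 3.2 and 3.3.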

2.2. TWO TYPES OF NOISE OF SGD

In the rest of this section we describe two types of SGD algorithms, distinguished by their sampling noise. The first type uses the traditional sampling noise, while the second uses sampling noise with a global stability guarantee. They involve slightly different assumptions, and the analysis of SGD also varies by the type of noise. Nevertheless, they yield similar results, as we present in the next section.

2.2.1. REGULAR SAMPLING NOISE

We start with the iteration of the (regular) SGD algorithm:

v_n = ϵ_0 ∇g(θ_n, ξ_n),  θ_{n+1} = θ_n - v_n,  (2)

where {ξ_n} represents the sampling noise. That is, we have the noised sampling ∇g(θ, ξ_n) = (1/|C_i|) Σ_{(x̄,ȳ)∈C_i} ∇∥f(θ, x̄) - ȳ∥², where C_i is a randomly selected mini-batch from the original dataset. The next statement assumes that the subgradient can be sampled without the sampling error being too large; this is necessary for any algorithm that uses the gradient.

Assumption 2.3. Let ξ_n be the sampling noise involved in the n-th iteration of SGD and ∇g(θ, ξ_n) be the noised sampling of the subgradient. It holds that lim inf_{θ→∞} ∥∇g(θ)∥ > 0 and lim sup_{θ→+∞} E_{ξ_n}∥∇g(θ, ξ_n)∥² / ∥∇g(θ)∥² < M_0, where M_0 ≥ 0 is a constant determined by g. Meanwhile, we need lim inf_{θ→∞} ∥∇g(θ)∥ > max{4c√M_0, 4c√K_0}.

The first part of this assumption is milder than the widely used bounded variance assumption, i.e., E_{ξ_n}∥∇g(θ, ξ_n) - ∇g(θ)∥² ≤ a (Li & Yuan, 2017; Kleinberg et al., 2018). The second part handles the case where {θ_n} tends to ∞. For example, the very simple loss function g(θ) = (1/3)(∥θ - θ_1∥² + ∥θ - θ_2∥² + ∥θ - θ_3∥²) satisfies our Assumption 2.3 but not the bounded variance assumption. Meanwhile, this sampling scheme immediately implies the bound below.

Claim 2.2. For any bounded set Q that includes J*, it holds that E_{ξ_n}∥∇g(θ, ξ_n)∥² ≤ G_Q g(θ) (∀ θ ∈ Q), where G_Q is a constant determined by Q.

Proof. For any smooth point in Q, the mini-batch gradient norm satisfies

∥∇g_{C_i}(θ)∥² = (4/N_0²) ∥Σ_{x_c∈C_i} (f(θ, x_c) - y_c) ∇f(θ, x_c)∥²
≤ (4/N_0²) N_0 Σ_{x_c∈C_i} ∥f(θ, x_c) - y_c∥² ∥∇f(θ, x_c)∥²
≤ (4N Σ_{i=1}^{N} ∥∇f(θ, x_i)∥² / N_0²) g(θ),

where N_0 is the size of the mini-batch. Define h_{C_i}(θ) = 4N Σ_{i=1}^{N} ∥∇f(θ, x_i)∥² / N_0². Through Assumption 2.1, we know that h_{C_i}(θ) is bounded on the smooth points of Q. Then we have ∥∇g_{C_i}(θ, ξ_n)∥² ≤ (4N G_Q / N_0²) g(θ) (when g is smooth at θ).
Then,

E_{ξ_n}∥∇g(θ, ξ_n)∥² = (C_{N-1}^{N_0-1} / C_N^{N_0}) Σ_{all C_i} ∥∇g_{C_i}(θ)∥² ≤ (4N G_Q C_{N-1}^{N_0-1} / (N_0² C_N^{N_0})) g(θ) =: G_Q g(θ).

For a non-smooth point θ, we can prove, for any sequence θ_0 → θ with g smooth at θ_0, through Equation (4),

lim_{θ_0→θ} ∥∇g_{C_i}(θ_0, ξ_n)∥² ≤ (4N G_Q / N_0²) lim_{θ_0→θ} g(θ_0) = (4N G_Q / N_0²) g(θ).

Recall the following fact: if ∥a_1∥² < s_0, ∥a_2∥² < s_0, ..., ∥a_n∥² < s_0, then any convex combination ā := Σ_{i=1}^{n} λ_i a_i satisfies ∥ā∥² ≤ (Σ_{i=1}^{n} λ_i ∥a_i∥)² < s_0. We thus obtain ∥∇g_{C_i}(θ)∥² ≤ (4N G_Q / N_0²) g(θ), which concludes that E_{ξ_n}∥∇g(θ, ξ_n)∥² ≤ G_Q g(θ).

We observe that the noise variance E_{ξ_n}∥∇g(θ, ξ_n) - ∇g(θ)∥² = 0 at the global optimum (Claim 2.1). Intuitively, the zero variance keeps θ_n stable at the global optimum, while at a local minimum or a saddle point the variance is nonzero in general; this is intuitively how SGD escapes local minima and saddle points. Note that the global optimum set is only a subset of the set where the noise variance equals 0. It is easy to prove that J* ⊆ {θ | E_{ξ_n}∥∇g(θ, ξ_n) - ∇g(θ)∥² = 0} =: J**, where J** is equivalent to J** = ∩_{C_i} {θ | ∇g_{C_i}(θ) = 0}. Our techniques will eventually prove that SGD with regular sampling noise converges to J**. In theory this could differ from J*, but intuitively, for an over-parameterized model and a large amount of data, the model f(θ, x) is complex enough that non-global stationary points are sensitive to the mini-batch selection. Making a point that is not a global optimum stationary for all batches simultaneously is therefore almost impossible, i.e., J** \ J* = ∅. Nevertheless, to ensure the rigor of the theory, we make an additional assumption, needed only for the regular sampling noise; it is lifted for the sampling noise with the global stability guarantee.
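The mechanism by which positive noise variance rules out spurious absorbing states can be illustrated on the three-anchor loss g(θ) = (1/3) Σ_i ∥θ - θ_i∥² used above in the discussion of Assumption 2.3 (an illustration only, not the over-parameterized setting): the mean of the anchors is stationary for the averaged loss, yet no single-sample gradient vanishes there, so the sampling-noise variance is strictly positive at that stationary point.

```python
import numpy as np

# Loss g(theta) = (1/3) * sum_i ||theta - theta_i||^2 with three anchors.
anchors = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])

full_grad = lambda th: 2.0 * np.mean(th - anchors, axis=0)
per_sample_grad = lambda th: 2.0 * (th - anchors)

theta_bar = anchors.mean(axis=0)              # stationary point of g
assert np.allclose(full_grad(theta_bar), 0.0)

# Size-1 mini-batch gradients do NOT vanish at theta_bar, so the
# sampling-noise variance there is strictly positive: theta_bar is a
# stationary point of g, but not a common stationary point of all batches.
g_i = per_sample_grad(theta_bar)
variance = np.mean(np.sum((g_i - full_grad(theta_bar)) ** 2, axis=1))
assert variance > 1.0
```

A point like this cannot be absorbing for the SGD Markov chain: individual batch updates keep pushing the iterate around, which is the behavior the set J** is designed to capture.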
Additional assumption for regular sampling noise. For the sampling noise {ξ_n}, points that are stationary for all mini-batches must be in J*, i.e., J* = J**. Meanwhile, for every mini-batch loss function g_{C_i}, the stationary point set of g_{C_i} is countable. If one slightly modifies SGD by adding Gaussian noise, we can prove that the resulting sampling noise enjoys a global stability guarantee. With this variant of SGD, the above assumption can be lifted. We now present our proposed variant of SGD.

2.2.2. SAMPLING NOISE WITH GLOBAL STABLE GUARANTEE

The sampling noise we propose in this section is the regular noise in SGD plus extra Gaussian noise:

v_n = ϵ_0 (∇g(θ_n, ξ_n) + min{g(θ_n), K_0} τ_n N_n),  θ_{n+1} = θ_n - v_n,  (5)

where {ξ_n} again represents the sampling noise, K_0 is a constant that prevents the noise from approaching infinity, {N_n} are mutually independent standard Gaussian variables, {τ_n} are mutually independent Bernoulli variables with P(τ_n = 0) = p_0, P(τ_n = 1) = 1 - p_0, and {τ_n}, {ξ_n}, {N_n} are mutually independent of each other. The coefficient min{g(θ_n), K_0} ensures that the algorithm has a positive noise variance E_{ξ_n,τ_n,N_n}∥v_n∥² > 0 at non-optimal stationary points. We use {τ_n} to reduce the scale of the perturbation, so that the new noise matches the scale of the mini-batch gradient ∇g(θ_n, ξ_n) and the original sampling noise. For example, if the batch size is 100 and the size of the original dataset is 10000, then we can set p_0 = 1 - 0.01 = 0.99, so that the noise min{g(θ_n), K_0} τ_n N_n fires at the same 1/100 rate at which any given sample is touched. The term min{g(θ_n), K_0} τ_n N_n guarantees that this algorithm has positive variance on R^d \ J*.
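A minimal sketch of the proposed update (5) on a toy interpolating least-squares problem. The linear model f(θ, x) = θᵀx, the size-1 mini-batch, and the values ϵ_0 = 0.01, K_0 = 1, p_0 = 0.99 are illustrative assumptions, not prescribed by the paper; because the injected noise is scaled by min{g(θ_n), K_0}, it shrinks as the loss shrinks and does not prevent convergence to an interpolating optimum.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 20                        # over-parameterized interpolation regime
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

g = lambda th: np.mean((X @ th - y) ** 2)   # MSE loss (1)

def perturbed_sgd_step(theta, eps0=0.01, K0=1.0, p0=0.99):
    # mini-batch (size 1) subgradient sample of g
    i = rng.integers(N)
    g_hat = 2.0 * (X[i] @ theta - y[i]) * X[i]
    # extra noise min{g, K0} * tau * N_n with tau ~ Bernoulli(1 - p0)
    tau = float(rng.random() > p0)
    noise = min(g(theta), K0) * tau * rng.standard_normal(d)
    return theta - eps0 * (g_hat + noise)    # v_n = eps0 * (g_hat + noise)

theta = np.zeros(d)
for _ in range(5000):
    theta = perturbed_sgd_step(theta)

assert g(theta) < 1e-6   # reaches an interpolating global optimum
```

Away from J*, the Gaussian term keeps the variance of v_n strictly positive, which is what makes the resulting Markov process irreducible in the analysis of Section 3.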

3. MAIN RESULTS

Our first main result states that SGD converges to a global optimum with probability 1. This is a large improvement over previous results, which only gave probability 1 - δ, where δ depends on the model. Our theorem answers the question raised in the introduction affirmatively: SGD can indeed reach a global optimum even in this non-smooth, non-convex, over-parameterized setting. The next two theorems discuss the cases r_{θ*} > 1 (higher than second-order local structure) and r_{θ*} = 1 (second-order local structure), respectively.

Theorem 3.1. Consider the SGD iteration in Equation (5), or alternatively Equation (2) with J* = J**, and the MSE loss function (1). If Assumptions 2.1 and 2.3 hold, and Assumption 2.2 holds with r_{θ*} > 1, then for any 0 < ϵ_0 < min{1/(2cM_0), 1/(4cK_0(1 - p_0))} and any initialization θ_1 ∈ R^d, {θ_n} converges to the set J* almost surely, i.e., lim_{n→∞} d(θ_n, J*) = 0 a.s., where d(x, J*) = inf_y{∥x - y∥ : y ∈ J*} denotes the distance between the point x and the set J*. Meanwhile, the value of the loss function converges to 0 almost surely, i.e., lim_{n→∞} g(θ_n) = 0 a.s.

For each main result, we provide a proof sketch to illustrate the idea behind the derivation; the rigorous arguments are deferred to the appendix.

Proof sketch. Our proof relies mainly on two techniques. The first is the Lyapunov method, which transfers the convergence of the high-dimensional vector θ_n to that of a one-dimensional Lyapunov function R(θ_n). The second is the machinery of Markov processes. We sketch these two steps and an additional step as follows.

Step 1: We aim to prove that there exists at least one bounded set S_0 that almost surely contains no limit point of {θ_n}. By the Borel-Cantelli lemma, it suffices to prove

Σ_{n=1}^{+∞} P(θ_n ∈ S_0) < +∞.  (6)
In order to prove Equation (6), we use the Lyapunov method, constructing a Lyapunov function R(θ) with a unique zero R(θ*) = 0 and an open set Ŝ_0 containing θ* (the exact forms of R(θ) and Ŝ_0 are given in the appendix). Let I_n^{(Ŝ_0)} be the characteristic function of the event {θ_n ∈ Ŝ_0}. We then obtain the inequality

I_{n+1}^{(Ŝ_0)} R(θ_{n+1}) - I_n^{(Ŝ_0)} R(θ_n) ≤ -I_n^{(Ŝ_0)} R^{2r/(r+1)}(θ_n) + u_n,  (7)

where u_n is defined in (12) with Σ_{n=1}^{+∞} E(u_n) < +∞. Summing up (7) yields

Σ_{n=1}^{+∞} E[I_n^{(Ŝ_0)} R^{2r/(r+1)}(θ_n)] < E[I_1^{(Ŝ_0)} R^{2r/(r+1)}(θ_1)] + Σ_{n=1}^{+∞} E(u_n) < +∞.

Subsequently we construct S_0 := Ŝ_0 \ U(θ*, δ'_0) for some small enough δ'_0, so that Σ_{n=1}^{+∞} E[I_n^{(S_0)} R^{2r/(r+1)}(θ_n)] ≤ Σ_{n=1}^{+∞} E[I_n^{(Ŝ_0)} R^{2r/(r+1)}(θ_n)] < +∞. Then, since R^{2r/(r+1)}(θ) > ε whenever θ ∈ Ŝ_0 \ U(θ*, δ'_0), we have

Σ_{n=1}^{+∞} P(θ_n ∈ S_0) < (1/ε) Σ_{n=1}^{+∞} E[I_n^{(S_0)} R^{2r/(r+1)}(θ_n)] < +∞.

This establishes Equation (6), and through the Borel-Cantelli lemma we conclude that S_0 almost surely contains no limit point.

Step 2: We aim to prove that any bounded set S with no intersection with J* contains no limit point. The proof differs for the two types of noise (2) and (5). Handling the sampling noise with the global stability guarantee (5) is relatively simple: the Gaussian noise in (5) guarantees that the iteration forms an irreducible Markov process, and the statement follows directly from the properties of irreducible Markov processes. For (2), the situation is more complicated: the regular sampling noise does not yield an irreducible Markov process. We prove the statement by a delicate argument: we first show that a maximal positively invariant bounded set D must have a boundary satisfying ∂D ∩ J* ≠ ∅, and that every trajectory started from this set almost surely converges to some global optimum.
Here a set D is maximal positively invariant if any trajectory started in D does not escape D, and for any point θ'' ∉ J* ∪ D, the set {θ''} ∪ D is not positively invariant. That means for any point, either almost every trajectory started from it converges to J*, or it has a positive probability of transferring to S_0. In the first situation the statement is satisfied. In the second situation, we can choose a small enough set of positive measure such that for any θ ∈ S there exist δ'_0 and some large enough k with P(θ_{n+k} ∈ S_0 | θ_n = θ) > υ. Then we obtain, as desired,

υ Σ_{n=1}^{+∞} P(θ_n ∈ S) = υ Σ_{n=k+1}^{+∞} ∫_S P^{n-k}(dθ) ≤ Σ_{n=k+1}^{+∞} ∫_S P(θ_{n+k} ∈ S_0 | θ_n = θ) P^{n-k}(dθ) = Σ_{n=k+1}^{+∞} ∫_{S_0} P^n(dθ) < +∞.

Step 3: In the previous step we actually proved that almost surely either θ_n → J* or θ_n → +∞. Through the Kolmogorov 0-1 law, we know that {θ_n converges} is a tail event, and hence P(θ_n → J*) ∈ {0, 1}. Meanwhile, since P(θ_n → ∞) = 1 is impossible, P(θ_n → J*) can only take the value 1.

Regarding step 3, we suspect that P(θ_n → ∞) = 1 remains impossible even without the assumption lim inf_{θ→∞} ∥∇g(θ)∥ > max{4c√M_0, 4c√K_0}. In fact, as long as θ_n converges to J* for any initialization θ_1 in some neighborhood of the optimum, it converges for all initializations. This is because for every initialization, the iterate either converges to the optimum or has a positive probability of transferring to an arbitrary set with positive measure. As the neighborhood can be arbitrarily small, it is likely to exist.

Theorem 3.2. Consider the SGD iteration in Equation (5), or alternatively Equation (2) with J* = J**, and the MSE loss function (1). If Assumptions 2.1 and 2.3 hold, and Assumption 2.2 holds with r_{θ*} = 1, then for any 0 < ϵ_0 < min{1/(2cM_0), α_{θ*}/(2(2 - p_0)β²_{θ*}), 1/(4cK_0(1 - p_0))} (regular sampling noise (2) corresponds to p_0 = 0) and any θ_1 ∈ R^d, θ_n converges to J* almost surely, i.e., lim_{n→∞} d(θ_n, J*) = 0 a.s.
, where d(x, J*) = inf_y{∥x - y∥ : y ∈ J*} denotes the distance between the point x and the set J*. Meanwhile, the value of the loss function converges to 0 almost surely, i.e., lim_{n→∞} g(θ_n) = 0 a.s.

Proof sketch. The proof is similar to that of Theorem 3.1. The difference is that when r_{θ*} = 1, convergence toward a global optimum with second-order local structure is conditional on the choice of the initial learning rate ϵ_0. The reason is that the inequality

I_{n+1}^{(Ŝ_0)} R(θ_{n+1}) - I_n^{(Ŝ_0)} R(θ_n) ≤ -(α_{θ*} ϵ_0 - 2(2 - p_0) ϵ_0² β²_{θ*}) I_n^{(Ŝ_0)} R(θ_n) + u_n  (8)

holds only when the coefficient α_{θ*} ϵ_0 - 2(2 - p_0) ϵ_0² β²_{θ*} > 0. With ϵ_0 set as in the theorem, the inequality and the other arguments remain valid. This proof also agrees with our intuition that SGD does not converge to a sharper global optimum as easily as to a flat one (r_{θ*} > 1).

Recall the second question raised in the introduction, conjecturing that SGD tends to choose flat minima (and thereby enjoys better generalization). At the end of the above proof we found that SGD does not converge to a sharp global optimum as easily as to a flat one. That observation, however, is only a positive result. We wonder whether the converse is also true: if a global minimum is not flat, is SGD unlikely to converge to it? The theorem below answers the converse affirmatively: it proves that if ϵ_0 is large enough, the iteration almost surely does not converge to this optimum.

Theorem 3.3. Consider the SGD iteration in Equation (5), or alternatively Equation (2) with J* = J**, and the MSE loss function (1). If Assumptions 2.1 and 2.3 hold, and Assumption 2.2 holds with r_{θ*} = 1, then for any θ_1 ∈ R^d, if ϵ_0 > β_{θ*}/(2(2 - p_0)α²_{θ*}) (regular sampling noise (2) corresponds to p_0 = 0), the probability that θ_n converges to θ* is 0, i.e., P(lim_{n→∞} ∥θ_n - θ*∥ = 0) = 0.

Proof sketch.
The main idea is to prove that if the iterate always stays in a neighborhood of θ*, then the probability that it converges to θ* is zero. The Lyapunov method is again helpful here.

Step 1: We aim to establish a reverse inequality of (7). We first construct a Lyapunov function R(θ), a neighborhood S_1 of θ*, the event A_{i,n} = {θ_{n_0} ∈ S_1, n_0 ∈ [i, n]}, and its characteristic function I_{i,n}. We then obtain the inequality

I_{i,n}(R(θ_{n+1}) - R(θ_n)) ≥ (2(2 - p_0) ϵ_0² α²_{θ*} - ϵ_0 β_{θ*}) I_{i,n} R(θ_n) + I_{i,n} ζ_n,  (9)

where ζ_n is defined in Equation (33). Notice that if 2(2 - p_0) ϵ_0² α²_{θ*} - ϵ_0 β_{θ*} > 0, then this inequality is a variant of a diffusion process.

Step 2: We aim to prove that as n approaches infinity, the iteration transfers a fixed fraction of its mass out of S_1. Through (9), we get

E[I_{i,n+1} R(θ_{n+1})] ≥ (1 + p_0 - E[R(θ_{n+1})(I_{i,n} - I_{i,n+1})] / E[I_{i,n} R(θ_n)]) E[I_{i,n} R(θ_n)].

If E[R(θ_{n+1})(I_{i,n} - I_{i,n+1})] / E[I_{i,n} R(θ_n)] < p_0, then E[I_{i,n+1} R(θ_{n+1})] diverges to infinity, which is impossible. As such, it must hold that E[R(θ_{n+1})(I_{i,n} - I_{i,n+1})] / E[I_{i,n} R(θ_n)] ≥ p_0.

Step 3: We show that if P(I_{i,+∞} ≠ 0) > 0, then I_{i,n} R(θ_n) does not converge to 0 almost surely. We prove this by contradiction, assuming P(lim_{n→+∞} I_{i,n} R(θ_n) = 0) = 1. That means for any ϵ'_0 > 0, P(I_{i,n} R(θ_n) > ϵ'_0) → 0, which implies P(I_{i,n} R(θ_n) ≤ ϵ'_0) → 1. Then E[R(θ_{n+1})(I_{i,n} - I_{i,n+1})] / E[I_{i,n} R(θ_n)] → k' ϵ'_0, which forms a contradiction.

Step 4: Finally, we prove P(lim_{n→+∞} θ_n = θ*) = 0. We inspect the event {θ_n → θ*}. If P(I_{i,+∞} ≠ 0) > 0, then since lim_{n→+∞} (I_{i,n} g(θ_n) - I_{i,+∞} g(θ_n)) = 0 a.s., we get P(lim_{n→+∞} I_{i,+∞} R(θ_n) = 0) = 0. Then P({θ_n → θ*} ∩ A_{i,+∞}) = P(lim_{n→+∞} I_{i,+∞} R(θ_n) = 0) = 0. Otherwise, if P(I_{i,+∞} ≠ 0) = 0, we have P({θ_n → θ*} ∩ A_{i,+∞}) ≤ P(I_{i,+∞} ≠ 0) = 0.
Finally, we have {θ_n → θ*} ⊂ ∪_{i=1}^{+∞} A_{i,+∞}. Subsequently,

P(θ_n → θ*) = P({θ_n → θ*} ∩ ∪_{i=1}^{+∞} A_{i,+∞}) = P(∪_{i=1}^{+∞} ({θ_n → θ*} ∩ A_{i,+∞})) ≤ Σ_{i=1}^{+∞} P({θ_n → θ*} ∩ A_{i,+∞}) = 0.

Having shown the asymptotic convergence of SGD, the natural question is how fast it converges. To provide the convergence rate, we need a slightly stronger version of Assumption 2.2: instead of just one θ*, we need all θ* to satisfy the order-(r + 1) expansion. In this case, the supremum of the expansion order among all optimum points is denoted r = max_{θ*∈J*_∞} r_{θ*}, where J*_∞ := {θ* ∈ J* | P(θ_n → θ*) > 0}. Our next theorem provides the convergence rate of SGD.

Theorem 3.4. Consider the SGD iteration in Equation (5), or alternatively Equation (2) with J* = J**, and the MSE loss function (1). If Assumptions 2.1 and 2.3 hold, and the variant of Assumption 2.2 described immediately above holds with order r + 1, then for any θ_1 ∈ R^d, θ_n has the asymptotic convergence rate

g(θ_n) = O(p_0^n) a.s. if r = 1, and g(θ_n) = O(n^{-2/(r-1)}) a.s. if r > 1,

where p_0 < 1 is a constant determined by the learning rate ϵ_0.

Proof sketch. The proof builds on that of Theorem 3.1. We asymptotically bound the martingale difference (Lemma A.1) and, with this bound, apply the martingale convergence theorem; the asymptotic convergence rate follows.

As an immediate consequence of the convergence rate, the SGD algorithm can reach arbitrary accuracy in polynomial time, which validates the efficiency of SGD.

Corollary 3.1. Consider the same setting as Theorem 3.4. For any θ_1 ∈ R^d, the computational time for g(θ_n) to reach accuracy η is O(N_0 d · log(1/η)) a.s. if r = 1, and O(N_0 d · (1/η)^{(r-1)/2}) a.s. if r > 1, where N_0 is the mini-batch size.

A.3 PROOF OF LEMMA A.4

Lemma A.4. Consider the SGD updates specified in Equation (5) (or Equation (2) with J* = J**) and the MSE loss function (1).
If Assumptions 2.1 and 2.3 hold, then for any ϵ_0 < min{1/(2cM_0), 1/(4cK_0(1 - p_0))} (regular sampling noise (2) corresponds to p_0 = 0) and any θ_1 ∈ R^d, the probability that θ_n diverges to infinity is less than 1, i.e., P(θ_n → ∞) < 1.

Proof. We prove this lemma by contradiction. Assume P(θ_n → ∞) = 1, i.e., θ_n → ∞ almost surely. By Lagrange's mean value theorem, we have g(θ_{n+1}) - g(θ_n) = ∇g(θ_{ζ_n})^T (θ_{n+1} - θ_n), where θ_{ζ_n} is a point between θ_n and θ_{n+1}; if θ_{ζ_n} is a non-smooth point, we can take any element of ∂g(θ_{ζ_n}). Therefore,

g(θ_{n+1}) - g(θ_n) = ∇g(θ_{ζ_n})^T (θ_{n+1} - θ_n)
= -ϵ_0 ∇g(θ_n)^T ∇g(θ_n, ξ_n) + (∇g(θ_{ζ_n}) - ∇g(θ_n))^T (θ_{n+1} - θ_n)
≤ -ϵ_0 ∇g(θ_n)^T ∇g(θ_n, ξ_n) + ∥∇g(θ_{ζ_n}) - ∇g(θ_n)∥ ∥θ_{n+1} - θ_n∥
≤ -ϵ_0 ∇g(θ_n)^T ∇g(θ_n, ξ_n) + max{c, c ϵ_0 ∥∇g(θ_n, ξ_n)∥} ϵ_0 ∥∇g(θ_n, ξ_n)∥
< -ϵ_0 ∇g(θ_n)^T ∇g(θ_n, ξ_n) + c ϵ_0 ∥∇g(θ_n, ξ_n)∥ + c ϵ_0² ∥∇g(θ_n, ξ_n)∥².

Through Assumption 2.3, we know that E_{ξ_n}∥∇g(θ_n, ξ_n)∥² ≤ M_0 ∥∇g(θ_n)∥² when θ_n → ∞. Taking expectation over the sampling noise, we have

E g(θ_{n+1}) - E g(θ_n) < -ϵ_0 E∥∇g(θ_n)∥² + c ϵ_0 M_0 E∥∇g(θ_n)∥ + c M_0 ϵ_0² E∥∇g(θ_n)∥² + c(1 - p_0) K_0 ϵ_0² + c √((1 - p_0) K_0) ϵ_0
< -(ϵ_0 - c M_0 ϵ_0²) E∥∇g(θ_n)∥² + c ϵ_0 M_0 E∥∇g(θ_n)∥ + c(1 - p_0) K_0 ϵ_0² + c √((1 - p_0) K_0) ϵ_0.

Since (1/2)ϵ_0 - c M_0 ϵ_0² > 0 and ∥∇g(θ_n)∥ > max{4c√M_0, 4c√((1 - p_0)K_0)} when θ_n → ∞, we get P(∥∇g(θ_n)∥ > max{4c√M_0, 4c√((1 - p_0)K_0)}) → 1, so the negative quadratic term dominates the remaining terms for n large enough. With this, we have

E g(θ_{n+1}) ≤ E g(θ_1) - k'_1 ϵ_0 Σ_{k=1}^{n} E∥∇g(θ_k)∥² → -∞,

which is impossible. We thus conclude that {θ_n} cannot tend to infinity almost surely, i.e., P(θ_n → ∞) < 1.

A.4 PROOF OF THEOREM 3.1

Proof. For convenience, we abbreviate r := r_{θ*}.
Then we let

l_0 := min{ ((β_{θ*}+1) / (2r(r+1) G_0^{(r_{θ*})} α_{θ*} ϵ_0^r))^{(r+1)/(r-1)}, δ_{θ*} },

and construct a function

R(θ) = ∥θ−θ*∥^{r+1}, if ∥θ−θ*∥ ≤ max{1, δ_{θ*}};
R(θ) = ∥θ−θ*∥², if ∥θ−θ*∥ > K̃_0;
R(θ) = k(∥θ−θ*∥), if max{1, δ_{θ*}} < ∥θ−θ*∥ ≤ K̃_0,

where k(∥θ−θ*∥) is the smooth connection between ∥θ−θ*∥² (for ∥θ−θ*∥ > K̃_0) and ∥θ−θ*∥^{r+1} (for ∥θ−θ*∥ ≤ max{1, δ_{θ*}}). By choosing feasible k and K̃_0, we can ensure that the Hessian matrix of R(θ) is bounded on R^d; let its upper bound be r(r+1), i.e., x^T H_{θθ} x ≤ r(r+1)∥x∥² for all x, θ ∈ R^d.

Next, we construct the set S^{(l_0)} = {θ : 0 ≤ ∥θ−θ*∥ < l_0}, define the event A^{(l_0)}_n = {θ_n ∈ S^{(l_0)}}, and denote its characteristic function by I^{(l_0)}_n. By Lagrange's mean value theorem, we obtain

I^{(l_0)}_n (R(θ_{n+1}) − R(θ_n)) = I^{(l_0)}_n ∇R(θ_{ζ_n})^T (θ_{n+1} − θ_n),

where θ_{ζ_n} ∈ [θ_n, θ_{n+1}]. Writing ∇R(θ_{ζ_n}) = ∇R(θ_n) + (∇R(θ_{ζ_n}) − ∇R(θ_n)), we get

I^{(l_0)}_n (R(θ_{n+1}) − R(θ_n)) ≤ −I^{(l_0)}_n ∇R(θ_n)^T v_n + I^{(l_0)}_n ∥∇R(θ_{ζ_n}) − ∇R(θ_n)∥ ∥θ_{n+1} − θ_n∥.

For any θ with ∥θ−θ*∥ ≤ max{1, δ_{θ*}} we have

∇R(θ) = ∇(∥θ−θ*∥^{r+1}) = (r+1)∥θ−θ*∥^{r-1}(θ−θ*).

Moreover, if ∥θ_{ζ_n} − θ_n∥ < max{1, δ_{θ*}}, we also have

∥∇R(θ_{ζ_n}) − ∇R(θ_n)∥ ≤ r(r+1)∥θ_{n+1} − θ_n∥^r,

and if ∥θ_{ζ_n} − θ_n∥ ≥ max{l_0, 1}, we have

∥∇R(θ_{ζ_n}) − ∇R(θ_n)∥ ≤ r(r+1)∥θ_{ζ_n} − θ_n∥ = r(r+1)∥θ_{ζ_n} − θ_n∥^{1−r}∥θ_{ζ_n} − θ_n∥^r ≤ r(r+1)·1·∥θ_{n+1} − θ_n∥^r.

In either case,

∥∇R(θ_{ζ_n}) − ∇R(θ_n)∥ ≤ r(r+1)∥θ_{n+1} − θ_n∥^r = r(r+1)∥v_n∥^r,

so

I^{(l_0)}_n (R(θ_{n+1}) − R(θ_n)) ≤ −I^{(l_0)}_n ∇R(θ_n)^T v_n + I^{(l_0)}_n r(r+1)∥v_n∥^{r+1},

I^{(l_0)}_{n+1} R(θ_{n+1}) − I^{(l_0)}_n R(θ_n) ≤ −I^{(l_0)}_n ∇R(θ_n)^T v_n + I^{(l_0)}_n r(r+1)∥v_n∥^{r+1} − (I^{(l_0)}_n − I^{(l_0)}_{n+1}) R(θ_{n+1}).  (11)
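The auxiliary function R glues ∥θ−θ*∥^{r+1} near the optimum to a quadratic tail so that the Hessian stays bounded by r(r+1). A one-dimensional sketch of such a gluing (our own explicit connector, not the paper's construction) that matches value, slope, and curvature at the junction, with the second derivative checked numerically against the r(r+1) bound:

```python
def R(x, r=3):
    """Piecewise auxiliary function: |x|**(r+1) near 0, a quadratic tail
    beyond |x| = 1, glued C^2 at the junction (our own explicit choice)."""
    ax = abs(x)
    if ax <= 1.0:
        return ax ** (r + 1)
    # quadratic matching value, slope, and curvature of ax**(r+1) at ax = 1
    v, s, c = 1.0, r + 1.0, r * (r + 1.0)
    return v + s * (ax - 1.0) + 0.5 * c * (ax - 1.0) ** 2

def second_diff(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

bound = 3 * 4  # r*(r+1) with r = 3
vals = [second_diff(R, 0.1 * k) for k in range(1, 40)]
```

On the inner piece the second derivative is r(r+1)|x|^{r-1} ≤ r(r+1), and on the tail it is exactly r(r+1), so the sampled values never exceed the bound, which is the property the proof needs for the mean-value estimates.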
Taking the expectation of equation 11 (writing I_n := I^{(l_0)}_n for brevity), we have

E[I_n ∇R(θ_n)^T v_n] = E[I_n E(∇R(θ_n)^T v_n | F_n)] = ϵ_0 E[I_n E(∇R(θ_n)^T ∇g(θ_n, ξ_n) | F_n)] + ϵ_0 E[I_n E(∇R(θ_n)^T min{g(θ_n), K_0} τ_n N_n | F_n)] = ϵ_0 E[I_n ∇R(θ_n)^T ∇g(θ_n)].

Define Ŝ to be the set of θ′ at which g is not smooth. By Assumption 2.1, E_{θ_n∈Ŝ}[h(θ_n)] = 0 for any measurable function h. Hence, for θ_n ∈ R^d \ Ŝ,

I_n ∇R(θ_n)^T ∇g(θ_n) = I_n (r+1)∥θ_n−θ*∥^{r-1}(θ_n−θ*)^T ∇g(θ_n) ≥ I_n (r+1)∥θ_n−θ*∥^{r-1} α_{θ*} ∥θ_n−θ*∥^{r+1} = I_n α_{θ*}(r+1) R^{2r/(r+1)}(θ_n).

Therefore,

E[I_n ∇R(θ_n)^T ∇g(θ_n)] = E_{θ_n∈R^d\Ŝ}[I_n ∇R(θ_n)^T ∇g(θ_n)] ≥ E[I_n α_{θ*}(r+1) R^{2r/(r+1)}(θ_n)],

and through Assumption 2.2 we get

E[I_n r(r+1)∥v_n∥^{r+1}] = r(r+1)ϵ_0^{r+1} E[I_n E(∥∇g(θ_n, ξ_n)∥^{r+1} | F_n)] + r(r+1)ϵ_0^{r+1} E[I_n E(∥min{g(θ_n), K_0} τ_n N_n∥^{r+1} | F_n)] + r(r+1)ϵ_0^{r+1} E[I_n E(∇g(θ_n, ξ_n)^T min{g(θ_n), K_0} τ_n N_n | F_n)]
= r(r+1)ϵ_0^{r+1} E[I_n E(∥∇g(θ_n, ξ_n)∥^{r+1} | F_n)] + r(r+1)ϵ_0^{r+1} E[I_n E(∥min{g(θ_n), K_0} τ_n N_n∥^{r+1} | F_n)]
≤ r(r+1)(2−p_0)ϵ_0^{r+1}(β_{θ*}+1) G_0^{(r_{θ*})} E[I_n R(θ_n)],

where G_0^{(r_{θ*})} is defined in Claim 2.1, and the case of equation 2 can be seen as the situation p_0 = 0. Then,

E[I_{n+1} R(θ_{n+1})] − E[I_n R(θ_n)] ≤ −α_{θ*} ϵ_0 E[I_n R^{2r/(r+1)}(θ_n)] + r(r+1)(2−p_0)(β_{θ*}+1)ϵ_0^{r+1} G_0^{(r_{θ*})} E[I_n R(θ_n)] − E[(I_n − I_{n+1}) R(θ_{n+1})].

Since θ_n ∈ S^{(l_0)}, we know

R(θ_n) < ((β_{θ*}+1) / (2r(r+1)(2−p_0) G_0^{(r_{θ*})} α_{θ*} ϵ_0^r))^{(r+1)/(r-1)}.

That means

α_{θ*} ϵ_0 I_n R^{2r/(r+1)}(θ_n) > 2r(r+1)(2−p_0)(β_{θ*}+1) ϵ_0^{r+1} G_0^{(r_{θ*})} I_n R(θ_n).

Hence,

E[I_{n+1} R(θ_{n+1})] − E[I_n R(θ_n)] ≤ −(α_{θ*} ϵ_0 / 2) E[I_n R^{2r/(r+1)}(θ_n)] − E[(I_n − I_{n+1}) R(θ_{n+1})].  (12)
For the term E[(I_n − I_{n+1}) R(θ_{n+1})] (with I_n := I^{(l_0)}_n), we observe that

E[(I_n − I_{n+1}) R(θ_{n+1})] = E[(I_n − I_n I_{n+1}) R(θ_{n+1}) − (I_{n+1} − I_n I_{n+1}) R(θ_{n+1})],

and

(I_n − I_n I_{n+1}) R(θ_{n+1}) ≥ l_0 (I_n − I_n I_{n+1}),  (I_{n+1} − I_n I_{n+1}) R(θ_{n+1}) ≤ l_0 (I_{n+1} − I_n I_{n+1}).

Taking these into equation 12, we obtain

E[(I_n − I_{n+1}) R(θ_{n+1})] ≥ l_0 E[I_n − I_{n+1}].  (13)

Taking equation 13 into equation 12, we have

E[I_{n+1} R(θ_{n+1})] − E[I_n R(θ_n)] ≤ −(α_{θ*} ϵ_0 / 2) E[I_n R^{2r/(r+1)}(θ_n)] − l_0 E[I_n − I_{n+1}].  (14)

Summing equation 14 over n, we have

E[I_{n+1} R(θ_{n+1})] − E[I_1 R(θ_1)] ≤ −(α_{θ*} ϵ_0 / 2) Σ_{k=1}^{n} E[I_k R^{2r/(r+1)}(θ_k)] − l_0 E[I_1 − I_{n+1}].  (15)

Rearranging, we have

Σ_{k=1}^{n} E[I_k R^{2r/(r+1)}(θ_k)] ≤ 2(l_0 + g(θ_1)) / (α_{θ*} ϵ_0) < +∞.

Next we construct a subset of S^{(l_0)}: S^{(δ_0,l_0)} = {θ : 0 < δ_0 ≤ ∥θ−θ*∥ < l_0}. Define the event A^{(δ_0,l_0)}_n = {θ_n ∈ S^{(δ_0,l_0)}} and let its characteristic function be I^{(δ_0,l_0)}_n. Obviously,

Σ_{k=1}^{n} E[I^{(δ_0,l_0)}_k R^{2r/(r+1)}(θ_k)] < Σ_{k=1}^{n} E[I_k R^{2r/(r+1)}(θ_k)] ≤ 2(l_0 + g(θ_1)) / (α_{θ*} ϵ_0) < +∞.

Let r̃_0 := inf_{θ∈S^{(δ_0,l_0)}} R^{2r/(r+1)}(θ) > 0; then

r̃_0 Σ_{k=1}^{n} E[I^{(δ_0,l_0)}_k] < 2(l_0 + g(θ_1)) / (α_{θ*} ϵ_0) < +∞,

that is,

Σ_{k=1}^{+∞} P(θ_k ∈ S^{(δ_0,l_0)}) = Σ_{k=1}^{+∞} E[I^{(δ_0,l_0)}_k] < 2(l_0 + g(θ_1)) / (α_{θ*} ϵ_0 r̃_0) < +∞.  (16)

Then we can obtain

P({θ_n} ∈ S^{(δ_0,l_0)} i.o.) = P(⋂_{n=1}^{+∞} ⋃_{k=n}^{+∞} {θ_k ∈ S^{(δ_0,l_0)}})  (17)
= lim_{n→+∞} P(⋃_{k=n}^{+∞} {θ_k ∈ S^{(δ_0,l_0)}})  (18)
≤ lim_{n→+∞} Σ_{k=n}^{+∞} P(θ_k ∈ S^{(δ_0,l_0)}) = 0.  (19)

Note that equation 17 means the set S^{(δ_0,l_0)} contains no limit point of {θ_n} almost surely. Then, if we use the SGD update rule of equation 5, since the noise is Gaussian, for any θ ∈ R^d \ J* and any k > 0 there is P(θ_{n+k} ∈ S^{(δ_0,l_0)} | θ_n = θ) = δ̃_0 > 0.
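The step from the summable series (16) to "no limit point in S^{(δ_0,l_0)} almost surely" in (17)-(19) is the first Borel-Cantelli lemma: if Σ_k P(A_k) < +∞, then with probability one only finitely many of the A_k occur. A quick simulation of this fact (our own illustration with P(A_k) = 1/k², unrelated to the SGD iterates):

```python
import random

def count_occurrences(seed, n_events=10000):
    """Simulate independent events A_k with P(A_k) = 1/k**2 and count
    how many occur along one sample path."""
    rng = random.Random(seed)
    return sum(1 for k in range(1, n_events + 1) if rng.random() < 1.0 / k**2)

# Sum of 1/k**2 converges (to pi**2/6 ~ 1.645), so each sample path sees
# only finitely many events, and the average count stays near that sum.
counts = [count_occurrences(s) for s in range(200)]
avg = sum(counts) / len(counts)
```

Every run produces a small finite count even though there are 10000 candidate events, which is exactly the "only finitely many visits to S^{(δ_0,l_0)}" conclusion the proof draws from the summability in (16).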
If we use the SGD update rule of equation 2, then for any maximal positive invariant set D \ J*, there must exist a boundary set ∂D. Moreover, for all θ′ ∈ ∂D, if θ′ ∈ R^d \ D, then for any mini-batch C_i we have ∇g_{C_i}(θ′) = 0; otherwise we could find a sequence {θ′′ → θ′, θ′′ ∉ D} whose trajectories approach the trajectory started from θ′, which forms a contradiction. Then, since J** = J*, we know θ′ ∈ J*, which means D ∩ J* ≠ ∅. If θ′ ∈ D, we conclude that all trajectories started from θ′ remain a subset of ∂D. On the other hand, ∂D is a closed set, so by the Heine-Borel theorem there exists a finite open cover ⋃_{n=1}^{M} O_n ⊃ ∂D with every O_n of arbitrarily small diameter. Let θ′ ∈ O_1, and let T_n denote the length of the time interval for a trajectory started from θ′ to return to O_1. If T_n → +∞, the trajectory must stay an infinite amount of time in some O_k; then O_k contains a global optimum, and the trajectory naturally converges to it. If T_n is bounded, the trajectory enters O_1 infinitely many times. Because there are many different mini-batches, the diameters are small enough, and f(θ) := P(θ_{n+k} ∈ R^d \ D | θ_n = θ) = δ̃_0 > 0 is a continuous function, we get P(θ_{n+k} ∈ R^d \ D | θ_n ∈ O_1) = δ̃_0 > 0, contradicting that D is a positive invariant set. Hence, for any θ ∈ R^d \ J*, either the trajectory started from θ converges to some global optimum, or it has a positive probability of transferring to S^{(δ_0,l_0)} after k steps. Then, for any bounded set Ŝ_0 with no intersection with J*, we first remove the points whose trajectories converge to J*. Since f(θ) := P(θ_{n+k} ∈ S^{(δ_0,l_0)} | θ_n = θ) is continuous and positive, for any bounded closed set Ŝ_0 satisfying Ŝ_0 ∩ J* = ∅ there is min_{θ∈Ŝ_0} P(θ_{n+k} ∈ S^{(δ_0,l_0)} | θ_n = θ) = δ̃_1 > 0. We now aim to prove, by contradiction, that Ŝ_0 contains no limit point almost surely.
We assume Σ_{n=1}^{+∞} P(θ_n ∈ Ŝ_0) = +∞. Then,

Σ_{n=k+1}^{+∞} P(θ_n ∈ S^{(δ_0,l_0)}) = Σ_{n=k+1}^{+∞} ∫_{S^{(δ_0,l_0)}} P_n(dθ) = Σ_{n=k+1}^{+∞} ∫_{R^d} P(θ_n ∈ S^{(δ_0,l_0)} | θ_{n-k} = θ) P_{n-k}(dθ) ≥ Σ_{n=k+1}^{+∞} ∫_{Ŝ_0} P(θ_n ∈ S^{(δ_0,l_0)} | θ_{n-k} = θ) P_{n-k}(dθ) ≥ δ̃_1 Σ_{n=k+1}^{+∞} ∫_{Ŝ_0} P_{n-k}(dθ) = δ̃_1 Σ_{n=1}^{+∞} P(θ_n ∈ Ŝ_0) = +∞.

Note that this contradicts equation 16, and thus Σ_{n=1}^{+∞} P(θ_n ∈ Ŝ_0) < +∞. Then,

P({θ_n} ∈ Ŝ_0 i.o.) = P(⋂_{n=1}^{+∞} ⋃_{k=n}^{+∞} {θ_k ∈ Ŝ_0}) = lim_{n→+∞} P(⋃_{k=n}^{+∞} {θ_k ∈ Ŝ_0}) ≤ lim_{n→+∞} Σ_{k=n}^{+∞} P(θ_k ∈ Ŝ_0) = 0.  (20)

Combining equation 20 with equation 16, we see that any bounded set which does not include J* = {θ | g(θ) = 0} contains no limit point of {θ_n} almost surely. This implies θ_n → J* or θ_n → ∞. Since {{θ_n} converges} is a tail event, the zero-one law gives P({θ_n} converges) = 0 or 1. That means {θ_n} either converges to J* almost surely, or diverges to infinity almost surely. Through Lemma A.4, we know P(θ_n → ∞) < 1, thus {θ_n} can only converge to J* almost surely.

A.5 PROOF OF THEOREM 3.2

Proof. We define R(θ) = ∥θ−θ*∥² and the set S^{(l_0)} = {θ : 0 ≤ ∥θ−θ*∥ < l_0 := δ_{θ*}}. We also define the event A^{(l_0)}_n = {θ_n ∈ S^{(l_0)}} and its characteristic function I^{(l_0)}_n. By Lagrange's mean value theorem, we have

I^{(l_0)}_n (R(θ_{n+1}) − R(θ_n)) = I^{(l_0)}_n ∇R(θ_{ζ_n})^T (θ_{n+1} − θ_n),

where θ_{ζ_n} ∈ [θ_n, θ_{n+1}]. Since ∇R(θ_{ζ_n}) = ∇R(θ_n) + (∇R(θ_{ζ_n}) − ∇R(θ_n)), we have

I^{(l_0)}_n (R(θ_{n+1}) − R(θ_n)) ≤ −I^{(l_0)}_n ∇R(θ_n)^T v_n + I^{(l_0)}_n ∥∇R(θ_{ζ_n}) − ∇R(θ_n)∥ ∥θ_{n+1} − θ_n∥.

Moreover, we also have ∥∇R(θ_{ζ_n}) − ∇R(θ_n)∥ ≤ 2∥θ_{n+1} − θ_n∥ = 2∥v_n∥, so

I^{(l_0)}_n (R(θ_{n+1}) − R(θ_n)) ≤ −I^{(l_0)}_n ∇R(θ_n)^T v_n + 2 I^{(l_0)}_n ∥v_n∥²,

I^{(l_0)}_{n+1} R(θ_{n+1}) − I^{(l_0)}_n R(θ_n) ≤ −I^{(l_0)}_n ∇R(θ_n)^T v_n + 2 I^{(l_0)}_n ∥v_n∥² − (I^{(l_0)}_n − I^{(l_0)}_{n+1}) R(θ_{n+1}).  (21)
Taking the expectation of equation 21 (writing I_n := I^{(l_0)}_n), we have

E[I_n ∇R(θ_n)^T v_n] = E[I_n E(∇R(θ_n)^T v_n | F_n)] = ϵ_0 E[I_n E(∇R(θ_n)^T ∇g(θ_n, ξ_n) | F_n)] + ϵ_0 E[I_n E(∇R(θ_n)^T min{g(θ_n), K_0} τ_n N_n | F_n)] = ϵ_0 E[I_n ∇R(θ_n)^T ∇g(θ_n)].

We define Ŝ = {θ′ | ∇g is not continuous at θ′}. Through Assumption 2.1, and noting that E_{θ_n∈Ŝ}[h(θ_n)] = 0 for any measurable function h, we have the following for θ_n ∈ R^d \ Ŝ:

I_n ∇R(θ_n)^T ∇g(θ_n) = 2 I_n (θ_n − θ*)^T ∇g(θ_n) ≥ 2 I_n α_{θ*} ∥θ_n − θ*∥² = 2 I_n α_{θ*} R(θ_n).

Therefore,

E[I_n ∇R(θ_n)^T ∇g(θ_n)] = E_{θ_n∈R^d\Ŝ}[I_n ∇R(θ_n)^T ∇g(θ_n)] ≥ 2 E[I_n α_{θ*} R(θ_n)],

and through Assumption 2.2 we get

E[2 I_n ∥v_n∥²] = 2ϵ_0² E[I_n E(∥∇g(θ_n, ξ_n)∥² | F_n)] + 2ϵ_0² E[I_n E(∥min{g(θ_n), K_0} τ_n N_n∥² | F_n)] + 4ϵ_0² E[I_n E(∇g(θ_n, ξ_n)^T min{g(θ_n), K_0} τ_n N_n | F_n)]
= 2ϵ_0² E[I_n E(∥∇g(θ_n, ξ_n)∥² | F_n)] + 2ϵ_0² E[I_n E(∥min{g(θ_n), K_0} τ_n N_n∥² | F_n)]
≤ 2(2−p_0)ϵ_0² β²_{θ*} E[I_n R(θ_n)],

where the situation of equation 2 can be seen as p_0 = 0. Then we have

E[I_{n+1} R(θ_{n+1})] − E[I_n R(θ_n)] ≤ −2α_{θ*} ϵ_0 E[I_n R(θ_n)] + 2(2−p_0)ϵ_0² β²_{θ*} E[I_n R(θ_n)] − E[(I_n − I_{n+1}) R(θ_{n+1})],

and thus

E[I_{n+1} R(θ_{n+1})] − E[I_n R(θ_n)] ≤ −(α_{θ*} ϵ_0 − 2(2−p_0)ϵ_0² β²_{θ*}) E[I_n R(θ_n)] − E[(I_n − I_{n+1}) R(θ_{n+1})].  (23)

For the term E[(I_n − I_{n+1}) R(θ_{n+1})], we first observe that

E[(I_n − I_{n+1}) R(θ_{n+1})] = E[(I_n − I_n I_{n+1}) R(θ_{n+1}) − (I_{n+1} − I_n I_{n+1}) R(θ_{n+1})],

and

(I_n − I_n I_{n+1}) R(θ_{n+1}) ≥ l_0 (I_n − I_n I_{n+1}),  (I_{n+1} − I_n I_{n+1}) R(θ_{n+1}) ≤ l_0 (I_{n+1} − I_n I_{n+1}).
Taking these into equation 23, we have

E[(I_n − I_{n+1}) R(θ_{n+1})] ≥ l_0 E[I_n − I_{n+1}].  (24)

Substituting equation 24 into equation 23, we get

E[I_{n+1} R(θ_{n+1})] − E[I_n R(θ_n)] ≤ −(α_{θ*} ϵ_0 − 2(2−p_0)ϵ_0² β²_{θ*}) E[I_n R(θ_n)] − l_0 E[I_n − I_{n+1}].  (25)

Summing equation 25 over n, we have

E[I_{n+1} R(θ_{n+1})] − E[I_1 R(θ_1)] ≤ −(α_{θ*} ϵ_0 − 2(2−p_0)ϵ_0² β²_{θ*}) Σ_{k=1}^{n} E[I_k R(θ_k)] − l_0 E[I_1 − I_{n+1}].

As ϵ_0 < α_{θ*} / (2(2−p_0)β²_{θ*}), we have

Σ_{k=1}^{n} E[I_k R(θ_k)] ≤ (l_0 + g(θ_1)) / (α_{θ*} ϵ_0 − 2(2−p_0)ϵ_0² β²_{θ*}) < +∞.

Next, we construct a subset of S^{(l_0)}: S^{(δ_0,l_0)} = {θ : 0 < δ_0 ≤ ∥θ−θ*∥ < l_0}. We also define A^{(δ_0,l_0)}_n = {θ_n ∈ S^{(δ_0,l_0)}} and let its characteristic function be I^{(δ_0,l_0)}_n. Notice that

Σ_{k=1}^{n} E[I^{(δ_0,l_0)}_k R(θ_k)] < Σ_{k=1}^{n} E[I_k R(θ_k)] ≤ (l_0 + g(θ_1)) / (α_{θ*} ϵ_0 − 2(2−p_0)ϵ_0² β²_{θ*}) < +∞.

Denote r̃_0 := inf_{θ∈S^{(δ_0,l_0)}} R(θ) > 0; then

r̃_0 Σ_{k=1}^{n} E[I^{(δ_0,l_0)}_k] < (l_0 + g(θ_1)) / (α_{θ*} ϵ_0 − 2(2−p_0)ϵ_0² β²_{θ*}) < +∞,

that is,

Σ_{k=1}^{+∞} P(θ_k ∈ S^{(δ_0,l_0)}) = Σ_{k=1}^{+∞} E[I^{(δ_0,l_0)}_k] < (l_0 + g(θ_1)) / ((α_{θ*} ϵ_0 − 2(2−p_0)ϵ_0² β²_{θ*}) r̃_0) < +∞.  (27)

With this, we have

P({θ_n} ∈ S^{(δ_0,l_0)} i.o.) = P(⋂_{n=1}^{+∞} ⋃_{k=n}^{+∞} {θ_k ∈ S^{(δ_0,l_0)}}) = 0.  (28)

We remark that equation 28 implies the set S^{(δ_0,l_0)} contains no limit point of {θ_n} almost surely. Then, if we use the SGD update rule of equation 5, as the noise is Gaussian, for any θ ∈ R^d \ J* and any k > 0 there is P(θ_{n+k} ∈ S^{(δ_0,l_0)} | θ_n = θ) = δ̃_0 > 0. If we use the SGD update rule of equation 2, then for any maximal positive invariant set D \ J*, there must exist a boundary set ∂D. Moreover, for all θ′ ∈ ∂D, if θ′ ∈ R^d \ D, then for any mini-batch C_i we have ∇g_{C_i}(θ′) = 0; otherwise we could find a sequence {θ′′ → θ′, θ′′ ∉ D} whose trajectories approach the trajectory started from θ′, which forms a contradiction. Then, since J** = J*, we know θ′ ∈ J*.
That means D ∩ J* ≠ ∅. If θ′ ∈ D, we conclude that all trajectories started from θ′ remain a subset of ∂D. On the other hand, ∂D is a closed set, so by the Heine-Borel theorem there exists a finite open cover ⋃_{n=1}^{M} O_n ⊃ ∂D with every O_n of arbitrarily small diameter. Let θ′ ∈ O_1, and let T_n denote the length of the time interval for a trajectory started from θ′ to return to O_1. If T_n → +∞, the trajectory must stay an infinite amount of time in some O_k; then O_k contains a global optimum, and the trajectory naturally converges to it. If T_n is bounded, the trajectory enters O_1 infinitely many times. Because there are many different mini-batches, the diameters are small enough, and f(θ) := P(θ_{n+k} ∈ R^d \ D | θ_n = θ) = δ̃_0 > 0 is a continuous function, we get P(θ_{n+k} ∈ R^d \ D | θ_n ∈ O_1) = δ̃_0 > 0, contradicting that D is a positive invariant set. Hence, for any θ ∈ R^d \ J*, either the trajectory started from θ converges to some global optimum, or it has a positive probability of transferring to S^{(δ_0,l_0)} after k steps. Then, for any bounded set Ŝ_0 with no intersection with J*, we first remove the points whose trajectories converge to J*. Since f(θ) := P(θ_{n+k} ∈ S^{(δ_0,l_0)} | θ_n = θ) is continuous and positive, for any bounded closed set Ŝ_0 satisfying Ŝ_0 ∩ J* = ∅ there is min_{θ∈Ŝ_0} P(θ_{n+k} ∈ S^{(δ_0,l_0)} | θ_n = θ) = δ̃_1 > 0. We now aim to prove, by contradiction, that Ŝ_0 contains no limit point almost surely. We assume Σ_{n=1}^{+∞} P(θ_n ∈ Ŝ_0) = +∞. Then we can get

Σ_{n=k+1}^{+∞} P(θ_n ∈ S^{(δ_0,l_0)}) = Σ_{n=k+1}^{+∞} ∫_{S^{(δ_0,l_0)}} P_n(dθ) = Σ_{n=k+1}^{+∞} ∫_{R^d} P(θ_n ∈ S^{(δ_0,l_0)} | θ_{n-k} = θ) P_{n-k}(dθ) ≥ Σ_{n=k+1}^{+∞} ∫_{Ŝ_0} P(θ_n ∈ S^{(δ_0,l_0)} | θ_{n-k} = θ) P_{n-k}(dθ) ≥ δ̃_1 Σ_{n=k+1}^{+∞} ∫_{Ŝ_0} P_{n-k}(dθ) = δ̃_1 Σ_{n=1}^{+∞} P(θ_n ∈ Ŝ_0) = +∞.

This contradicts equation 27, which implies Σ_{n=1}^{+∞} P(θ_n ∈ Ŝ_0) < +∞. Hence, we can obtain P({θ_n} ∈ Ŝ_0 i.o.)
= P(⋂_{n=1}^{+∞} ⋃_{k=n}^{+∞} {θ_k ∈ Ŝ_0}) = lim_{n→+∞} P(⋃_{k=n}^{+∞} {θ_k ∈ Ŝ_0}) ≤ lim_{n→+∞} Σ_{k=n}^{+∞} P(θ_k ∈ Ŝ_0) = 0.  (29)

A.6 PROOF OF THEOREM 3.3

Proof. We define R(θ) = ∥θ−θ*∥². We can get that

R(θ_{n+1}) − R(θ_n) = ∥θ_{n+1}−θ*∥² − ∥θ_n−θ*∥² = (θ_{n+1}−θ_n)^T (θ_{n+1}+θ_n−2θ*) = 2(θ_n−θ*)^T (θ_{n+1}−θ_n) + ∥θ_{n+1}−θ_n∥² = −2(θ_n−θ*)^T v_n + ∥v_n∥²
= −2(θ_n−θ*)^T (ϵ_0 ∇g(θ_n, ξ_n) + ϵ_0 min{g(θ_n), K_0} τ_n N_n) + ∥ϵ_0 ∇g(θ_n, ξ_n) + ϵ_0 min{g(θ_n), K_0} τ_n N_n∥².  (30)

For the term 2(θ_n−θ*)^T (ϵ_0 ∇g(θ_n, ξ_n) + ϵ_0 min{g(θ_n), K_0} τ_n N_n), we use the following transformation:

2(θ_n−θ*)^T (ϵ_0 ∇g(θ_n, ξ_n) + ϵ_0 min{g(θ_n), K_0} τ_n N_n) = 2ϵ_0 (θ_n−θ*)^T ∇g(θ_n) + 2ϵ_0 (θ_n−θ*)^T (∇g(θ_n, ξ_n) − ∇g(θ_n)) + 2ϵ_0 min{g(θ_n), K_0} τ_n (θ_n−θ*)^T N_n.  (31)

For the term ∥ϵ_0 ∇g(θ_n, ξ_n) + ϵ_0 min{g(θ_n), K_0} τ_n N_n∥², we can obtain

∥ϵ_0 ∇g(θ_n, ξ_n) + ϵ_0 min{g(θ_n), K_0} τ_n N_n∥² = ϵ_0² ∥∇g(θ_n, ξ_n)∥² + 2ϵ_0² τ_n min{g(θ_n), K_0} ∇g(θ_n, ξ_n)^T N_n + ϵ_0² τ_n² N_n² min{g(θ_n), K_0}
= ϵ_0² E(∥∇g(θ_n, ξ_n)∥² | F_n) + ϵ_0² p_0 min{g(θ_n), K_0} + ϵ_0² (∥∇g(θ_n, ξ_n)∥² − E(∥∇g(θ_n, ξ_n)∥² | F_n)) + ϵ_0² (τ_n² N_n² min{g(θ_n), K_0} − p_0 min{g(θ_n), K_0}) + 2ϵ_0² τ_n min{g(θ_n), K_0} ∇g(θ_n, ξ_n)^T N_n
≥ ϵ_0² ∥∇g(θ_n)∥² + ϵ_0² (∥∇g(θ_n, ξ_n)∥² − E(∥∇g(θ_n, ξ_n)∥² | F_n)) + 2ϵ_0² τ_n min{g(θ_n), K_0} ∇g(θ_n, ξ_n)^T N_n.  (32)

Then we construct the set S^{(l̃_0)} = {θ : ∥θ−θ*∥ < l̃_0 := δ_{θ*}} \ {θ*}. We also define the event Ã_{i,n} = {θ_{n_0} ∈ S^{(l̃_0)}, n_0 ∈ [i, n]} and its characteristic function Ĩ_{i,n}. We substitute equation 32 and equation 31 into equation 30 and multiply by Ĩ_{i,n}, getting

Ĩ_{i,n} (R(θ_{n+1}) − R(θ_n)) ≥ (2(2−p_0)ϵ_0² α²_{θ*} − ϵ_0 β_{θ*}) Ĩ_{i,n} R(θ_n) + Ĩ_{i,n} ζ_n,

where

ζ_n := 2ϵ_0 (θ_n−θ*)^T (∇g(θ_n, ξ_n) − ∇g(θ_n)) + 2ϵ_0 min{g(θ_n), K_0} τ_n (θ_n−θ*)^T N_n + ϵ_0² (∥∇g(θ_n, ξ_n)∥² − E(∥∇g(θ_n, ξ_n)∥² | F_n)) + ϵ_0² (τ_n² N_n² min{g(θ_n), K_0} − p_0 min{g(θ_n), K_0})  (33)

is a martingale difference.
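That ζ_n in (33) is a martingale difference means exactly that its conditional mean given F_n vanishes: every term is a random quantity minus its conditional expectation. A minimal Monte Carlo check of this structure on a simplified analogue of the third term (our own toy version, with scalar state and Gaussian multiplicative noise):

```python
import random

def martingale_difference_mean(theta, trials=50000, seed=2):
    """For zeta = ||noisy_grad||**2 - E[||noisy_grad||**2 | F_n], the
    conditional mean is zero by construction; estimate it by Monte Carlo."""
    rng = random.Random(seed)
    grad = theta                       # gradient of g(x) = x**2 / 2 at theta
    var = 0.3 ** 2
    cond_mean = grad**2 * (1 + var)    # E[(grad*(1 + N(0, 0.3)))**2]
    total = 0.0
    for _ in range(trials):
        noisy = grad * (1.0 + rng.gauss(0, 0.3))
        total += noisy**2 - cond_mean
    return total / trials

m = martingale_difference_mean(theta=1.5)
```

The empirical mean of the centered quantity is statistically indistinguishable from zero, which is the property that lets the proof apply martingale arguments to the sum of the ζ_n.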
Denote p̃_0 := 2(2−p_0)ϵ_0² α²_{θ*} − ϵ_0 β_{θ*}. We have

Ĩ_{i,n+1} R(θ_{n+1}) − Ĩ_{i,n} R(θ_n) ≥ p̃_0 Ĩ_{i,n} R(θ_n) + Ĩ_{i,n} ζ_n − R(θ_{n+1})(Ĩ_{i,n} − Ĩ_{i,n+1}).

Then,

E[Ĩ_{i,n+1} R(θ_{n+1})] − E[Ĩ_{i,n} R(θ_n)] ≥ p̃_0 E[Ĩ_{i,n} R(θ_n)] − E[R(θ_{n+1})(Ĩ_{i,n} − Ĩ_{i,n+1})],

which implies

E[Ĩ_{i,n+1} R(θ_{n+1})] ≥ (1 + p̃_0 − E[R(θ_{n+1})(Ĩ_{i,n} − Ĩ_{i,n+1})] / E[Ĩ_{i,n} R(θ_n)]) E[Ĩ_{i,n} R(θ_n)].

Assuming lim sup_{n→+∞} E[R(θ_{n+1})(Ĩ_{i,n} − Ĩ_{i,n+1})] / E[Ĩ_{i,n} R(θ_n)] < p̃_0, we would have E[Ĩ_{i,n+1} R(θ_{n+1})] → +∞, which contradicts E[Ĩ_{i,n+1} R(θ_{n+1})] ≤ l̃_0. Hence,

lim sup_{n→+∞} E[R(θ_{n+1})(Ĩ_{i,n} − Ĩ_{i,n+1})] / E[Ĩ_{i,n} R(θ_n)] ≥ p̃_0.  (34)

Define the event Ã_{i,+∞} := {θ_{n_0} ∈ S^{(l̃_0)}, n_0 ≥ i} and its characteristic function Ĩ_{i,+∞}. We next prove P(lim_{n→+∞} Ĩ_{i,+∞} R(θ_n) = 0) = 0. Assume P(lim_{n→+∞} Ĩ_{i,+∞} R(θ_n) = 0) = 1; then P(lim_{n→+∞} Ĩ_{i,n} R(θ_n) = 0) = 1. That means for any ϵ′_0 > 0, P(Ĩ_{i,n} R(θ_n) > ϵ′_0) → 0, concluding P(Ĩ_{i,n} R(θ_n) ≤ ϵ′_0) → 1. Then we get

lim sup_{n→+∞} E[R(θ_{n+1})(Ĩ_{i,n} − Ĩ_{i,n+1})] / E[Ĩ_{i,n} R(θ_n)] = lim sup_{n→+∞} ∫_{R(θ)≤ϵ′_0} E(R(θ_{n+1}); R(θ_{n+1}) > l̃_0 | θ_n = θ) P_{i,n}(dθ) / (∫_{R(θ)≤ϵ′_0} R(θ) P_{i,n}(dθ) + ∫_{R(θ)>ϵ′_0} R(θ) P_{i,n}(dθ)) < p̃_0 / 2.

Note that this contradicts equation 34, which implies P(lim_{n→+∞} Ĩ_{i,+∞} R(θ_n) = 0) = 0. Through inspecting the event {θ_n → θ*}, we can get {θ_n → θ*} ⊂ ⋃_{i=1}^{+∞} Ã_{i,+∞}. That means

P(θ_n → θ*) = P({θ_n → θ*} ∩ ⋃_{i=1}^{+∞} Ã_{i,+∞}) = P(⋃_{i=1}^{+∞} ({θ_n → θ*} ∩ Ã_{i,+∞})) = P(⋃_{i=1}^{+∞} {lim_{n→+∞} Ĩ_{i,+∞} R(θ_n) = 0}) ≤ Σ_{i=1}^{+∞} P(lim_{n→+∞} Ĩ_{i,+∞} R(θ_n) = 0) = 0.

A.7 PROOF OF THEOREM 3.4

First we order J*_∞ as {θ*_i}. Then Assumption 2.2 implies that for all θ* ∈ J*, ∥∇g(θ)∥ > 0 wherever g is smooth at θ and θ ∈ U(θ*, δ_{θ*}) \ {θ*}. That means for any θ*_i ≠ θ*_j ∈ J*, there is ∥θ*_i − θ*_j∥ ≥ inf_{θ*_i≠θ*_j} ∥θ*_i − θ*_j∥ := δ̃_0 ≠ 0 and U(θ*_i, δ_{θ*_i}) ∩ U(θ*_j, δ_{θ*_j}) = ∅. Furthermore, it means there are at most finitely many {θ*_i}.
We denote this number by m. Since lim inf_{θ→+∞} ∥∇g∥ > 0, we know {δ_{θ*_i}} is bounded. Then we construct the functions

R̃_{θ*_i}(θ) = ∥θ − θ*_i∥^{r_{θ*_i}+1}.

We then seek a function R̃(θ) satisfying:

1. For any θ ∈ R^d, there exists H_{θθ}(R̃) such that θ^T H_{θθ}(R̃) θ ≤ max_{θ*_i ∈ J*_∞} r_{θ*_i}(r_{θ*_i}+1) ∥θ∥².
2. R̃(θ) = ∥θ − θ*_i∥^{r_{θ*_i}+1} when θ is near θ*_i.
3. R̃(θ) is bounded.

We define indicator functions

Î^{(r_i)}_{θ*_i} := 1 if ∥θ − θ*_i∥ ≤ r_i, and 0 if ∥θ − θ*_i∥ > r_i,

where r_i is an undetermined coefficient. Clearly, the function Î^{(r_i)}_{θ*_i} R̃_{θ*_i}(θ) can be seen as a univariate function f_{θ*_i}(x) = x^{r_{θ*_i}+1} (0 < x < r_i) of the variable ∥θ − θ*_i∥. Then, for any δ̃_0 > 0, we can always find

h_{θ*_i}(x) = r_i^{r_{θ*_i}+1} + ((r_{θ*_i}+1)² r_i^{2r_{θ*_i}}) / 2

to ensure there is a smooth connection (a parabola) between f_{θ*_i}(x) and h_{θ*_i}(x). Denote the function obtained after adding this smooth connection by j_{θ*_i}(x); it satisfies j″(x) ≤ 1, and the connection point on h_{θ*_i}(x) is r̃_i(r_i) := r_i + (r_{θ*_i}+1) r_i^{r_{θ*_i}}. Then

R̃(θ) = Î^{(r̃_i(h^{-1}_{θ*_i}(K̃_0)))}_{θ*_i} j_{θ*_i}(∥θ − θ*_i∥), if θ ∈ ⋃_{i=1}^{m} U(θ*_i, r̃_i(h^{-1}_{θ*_i}(K̃_0)));  R̃(θ) = K̃_0, otherwise,

is what we need. We next discuss the problem case by case according to the value of r. The first case is r = 1 (treated from here to equation 38). We define the event A^{(l̃_0)}_{n,θ*_i} = {θ_n ∈ U(θ*_i, h^{-1}_{θ*_i}(K̃_0))} and its characteristic function I^{(l̃_0)}_{n,θ*_i}, where {ζ̃_n} is a martingale difference sequence defined as

ζ̃_n := ϵ_0 ∥∇R̃(θ_n)∥² − ∇R̃(θ_n)^T v_n + 2M_0 ∥v_n∥² − 2M_0 E(∥v_n∥² | F_n),

and k_0, k̃_1 are two constants. We also define I^{(-l̃_0)}_n for the complementary region, obtaining

I^{(-l̃_0)}_n (R̃(θ_{n+1}) − R̃(θ_n)) ≤ I^{(-l̃_0)}_n K̃_0 ≤ I^{(-l̃_0)}_n R̃(θ_n) · (K̃_0 / R̃(θ_n)) ≤ 3 I^{(-l̃_0)}_n R̃(θ_n).
(37)

Through calculating the sum of equation 36 and equation 37, we obtain

R̃(θ_{n+1}) − R̃(θ_n) ≤ −(k_0 k̃_1 ϵ_0 / 2) R̃(θ_n) + 3 I^{(-l̃_0)}_n R̃(θ_n) + ζ̃′_n,

E(R̃(θ_{n+1}) | F_n) ≤ (1 − k_0 k̃_1 ϵ_0 / 2 + 3 I^{(-l̃_0)}_n) R̃(θ_n).

A.8 PROOF OF COROLLARY 3.1

When the loss function g(θ_n) attains the ε′ accuracy, according to Theorem 3.4 the overall number of SGD iterations is

n = O(log(1/ε′)) almost surely, if r = 1;  n = O((1/ε′)^{(r-1)/2}) almost surely, if r > 1.

Then we consider the computational time of a single step of SGD. Generally, the main time-consuming part of one step is computing the gradient of the loss function on a batch of data, which can be decomposed into N_0 numerical differentiations, where N_0 is the mini-batch size. We assume the time to compute one function value is O(1). When a specific numerical differentiation scheme is given, such as

∂f(θ^{(1)}, …, θ^{(d)}, x)/∂θ_i |_{θ=θ_0} ≈ (f(θ^{(1)}_0, …, θ^{(i)}_0 + h, …, θ^{(d)}_0, x) − f(θ_0, x)) / h,

the computation time of the numerical gradient is clearly O(d). In summary, the whole computation time is

O(N_0 d · log(1/ε′)) almost surely, if r = 1;  O(N_0 d · (1/ε′)^{(r-1)/2}) almost surely, if r > 1,

which is bounded by a polynomial.
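The O(N_0 d) per-step cost comes from the forward-difference scheme quoted above: one base evaluation plus one extra function evaluation per coordinate, i.e., d + 1 evaluations in total. A minimal sketch of that scheme (our own illustrative helper, with a sum-of-squares test function):

```python
def numerical_gradient(f, theta, h=1e-6):
    """Forward-difference gradient: d + 1 evaluations of f, hence O(d) cost
    per example, and O(N_0 * d) over a mini-batch of size N_0."""
    base = f(theta)
    grad = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += h          # perturb one coordinate at a time
        grad.append((f(shifted) - base) / h)
    return grad

# Example: f(theta) = sum of squares, whose true gradient is 2*theta.
g = numerical_gradient(lambda t: sum(x * x for x in t), [1.0, -2.0, 0.5])
```

The recovered gradient agrees with 2θ up to the O(h) truncation error of the forward difference, confirming the O(d) evaluation count that the corollary's runtime bound relies on.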




Let h_{θ*_i}(x) be an arbitrary constant value M; for different r_{θ*_i}, we can always get an inverse solution r_i := h^{-1}_{θ*_i}(M). There must exist K̃_0 < K_0 such that the sets {U(θ*_i, r̃_i(h^{-1}_{θ*_i}(K̃_0)))} are pairwise disjoint.

I^{(l̃_0)}_{n+1,θ*_i} R̃(θ_{n+1}) − I^{(l̃_0)}_{n,θ*_i} R̃(θ_n) ≤ −I^{(l̃_0)}_{n,θ*_i} k_0 k̃_1 ϵ_0 R̃(θ_n) + ζ̃_n.

Denote k′ := k_0 k̃_1 ϵ_0 / 2. We get

E[R̃(θ_{n+1})] ≤ ∏_{k=1}^{n} (1 − k′ ϵ_0) · E[R̃(θ_1)] + 3 I^{(-l̃_0)}_n (…) < +∞ almost surely,

which means R̃(θ_n) = O((1 − k′ ϵ_0)^n) almost surely. Denote p_0 := 1 − k′ ϵ_0 < 1, and the set {θ : 0 ≤ ∥θ − θ*_i∥ < l̃_0}.

Combining equation 29 with equation 27, we can say that any bounded set which does not include J* = {θ | g(θ) = 0} has no limit point of {θ_n} almost surely. That means θ_n → J* or θ_n → ∞ almost surely. The event {{θ_n} converges} is a tail event, so by the zero-one law we have P({θ_n} converges) = 0 or 1. That means {θ_n} either converges to J* almost surely, or diverges to infinity almost surely. Through Lemma A.4, we know P(θ_n → ∞) < 1. That proves {θ_n} can only converge to J* almost surely.


A.1 COUNTER-EXAMPLE

Claim A.1. The chain rule does not hold for the Clarke subdifferential.

Proof. For a composite non-smooth function, the chain rule may fail at a non-smooth point (Liu et al. (2022b)). We introduce an example as follows. Consider the minimization over w_1 ∈ R, w_2 ∈ R, b_1 ∈ R, b_2 ∈ R in equation 10; one can easily see that the SGD method gets stuck at the corresponding point for some small positive number ϵ. Then, observe that this point is not a local minimizer of equation 10. Moreover, one can see that (1, 2, −1, −1) is a global minimizer of equation 10, at which the function value is 0.

We also define the event A^{(l̃_0)}_{n,θ*_i} = {θ_n ∈ S^{(l̃_0)}} and let its characteristic function be I^{(l̃_0)}_{n,θ*_i}. Then we can get the corresponding bound, where {ζ̃_n} is a martingale difference sequence and k_0, k̃_1 are two constants. Defining I as before, we get the analogous inequality, where â_0 is a constant. Through calculating the sum of equation 39 and equation 40, and using the inequality (1+x)^{r_0} ≥ 1 + r_0 x (r_0 < 0), then summing over n, the claimed rate follows almost surely.


