RESTRICTED STRONG CONVEXITY OF DEEP LEARNING MODELS WITH SMOOTH ACTIVATIONS

Abstract

We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the "near initialization" perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with $L$ layers, width $m$, and initialization variance $\sigma_0^2$. First, for suitable $\sigma_0^2$, we establish an $O(\frac{\mathrm{poly}(L)}{\sqrt{m}})$ upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC), which holds as long as the squared norm of the average gradient of predictors is $\Omega(\frac{\mathrm{poly}(L)}{\sqrt{m}})$ for the square loss. We also present results for more general losses. The RSC-based analysis does not need the "near initialization" perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result establishing geometric convergence of GD based on RSC for deep learning models, thus providing an alternative sufficient condition for convergence that does not depend on the widely-used Neural Tangent Kernel (NTK). We share preliminary experimental results supporting our theoretical advances.

1. INTRODUCTION

Recent years have seen advances in understanding the convergence of gradient descent (GD) and its variants for deep learning models (Du et al., 2019; Allen-Zhu et al., 2019; Zou & Gu, 2019; Zou et al., 2020; Liu et al., 2022; Ji & Telgarsky, 2019; Oymak & Soltanolkotabi, 2020; Nguyen, 2021). Despite the fact that such optimization problems are non-convex, a series of recent results have shown that GD converges geometrically and finds a near-global solution "near initialization" for wide networks. Such analysis is typically based on the Neural Tangent Kernel (NTK) (Jacot et al., 2018), in particular on showing that the NTK is positive definite "near initialization," which implies that the optimization problem satisfies a condition closely related to the Polyak-Łojasiewicz (PL) condition, which in turn implies geometric convergence to a global minimum (Liu et al., 2022; Nguyen, 2021). Such results have been generalized to more flexible forms of "lazy learning" where similar guarantees hold (Chizat et al., 2019). However, there are concerns regarding whether such "near initialization" or "lazy learning" truly explains the optimization behavior of realistic deep learning models (Geiger et al., 2020; Yang & Hu, 2020; Fort et al., 2020; Chizat et al., 2019). Our work focuses on the optimization of deep models with smooth activation functions, which have become increasingly popular in recent years (Du et al., 2019; Liu et al., 2022; Huang & Yau, 2020). Much of the theoretical convergence analysis of GD has focused on ReLU networks (Allen-Zhu et al., 2019; Nguyen, 2021). Some progress has also been made for deep models with smooth activations, but existing results are based on variants of the NTK analysis, and the requirements on the width of such models are high (Du et al., 2019; Liu et al., 2022).
Based on this background and context, the motivating question behind our work is: Are there other (meaningful) sufficient conditions beyond the NTK which lead to (geometric) convergence of GD for deep learning optimization? With this motivation, we make two technical contributions in this paper which shed light on the optimization of deep learning models with smooth activations, $L$ layers, width $m$, and initialization variance $\sigma_0^2$. First, for suitable $\sigma_0^2$, we establish an $O(\frac{\mathrm{poly}(L)}{\sqrt{m}})$ upper bound on the spectral norm of the Hessian of such models (Section 4). The bound holds over a large layerwise spectral norm (instead of Frobenius norm) ball $B^{Spec}_{\rho,\rho_1}(\theta_0)$ around the random initialization $\theta_0$, where the radius $\rho < \sqrt{m}$, arguably much bigger than what real-world deep models need. Our analysis builds on and sharpens recent prior work on the topic (Liu et al., 2020). While our analysis holds for Gaussian random initialization of the weights with any variance $\sigma_0^2$, the $\mathrm{poly}(L)$ dependence arises when $\sigma_0^2 \le \frac{1}{4+o(1)}\cdot\frac{1}{m}$ (we handle the $\frac{1}{m}$ scaling explicitly). Second, based on our Hessian spectral norm bound, we introduce a new approach to the analysis of optimization of deep models with smooth activations based on the concept of Restricted Strong Convexity (RSC) (Section 5) (Wainwright, 2019; Negahban et al., 2012; Negahban & Wainwright, 2012; Banerjee et al., 2014; Chen & Banerjee, 2015). While RSC has been a core theme in high-dimensional statistics, especially for linear models and convex losses (Wainwright, 2019), to the best of our knowledge RSC has not been considered in the context of non-convex optimization of overparameterized deep models.
For a normalized total loss function $\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \hat{y}_i)$, $\hat{y}_i = f(\theta; x_i)$, with predictor or neural network model $f$ parameterized by a vector $\theta$ and data points $\{x_i, y_i\}_{i=1}^n$, when $\ell$ is the square loss we show that the total loss satisfies RSC on a suitable restricted set $Q^t_\kappa \subset \mathbb{R}^p$ (Definition 5.2 in Section 5) at step $t$ as long as $\|\frac{1}{n}\sum_{i=1}^n \nabla_\theta f(\theta_t; x_i)\|_2^2 = \Omega(\frac{1}{\sqrt{m}})$. We also present similar results for general losses, for which additional assumptions are needed. We show that the RSC property implies a Restricted Polyak-Łojasiewicz (RPL) condition on $Q^t_\kappa$, in turn implying a geometric one-step decrease of the loss towards the minimum in $Q^t_\kappa$, and subsequently a geometric decrease of the loss towards the minimum in the large (layerwise spectral norm) ball $B^{Spec}_{\rho,\rho_1}(\theta_0)$. Geometric convergence via RSC is a novel approach in the context of deep learning optimization which does not depend on properties of the NTK. Thus, the RSC condition provides an alternative sufficient condition for geometric convergence in deep learning optimization to the widely-used NTK condition. The rest of the paper is organized as follows. We briefly present related work in Section 2 and discuss the problem setup in Section 3. We establish the Hessian spectral norm bound in Section 4 and introduce the RSC-based optimization analysis in Section 5. We present experimental results corresponding to the RSC condition in Section 6 and conclude in Section 7. All technical proofs are in the Appendix.

2. RELATED WORK

The literature on gradient descent and its variants for deep learning is increasingly large, and we refer readers to the following surveys for an overview of the field (Fan et al., 2021; Bartlett et al., 2021). Among theoretical works, we consider (Du et al., 2019; Allen-Zhu et al., 2019; Zou & Gu, 2019; Zou et al., 2020; Liu et al., 2022) the closest to our work in terms of their study of convergence for multi-layer neural networks. For a literature review on shallow and/or linear networks, we refer to the recent survey (Fang et al., 2021). Due to the rapidly growing related work, we mostly cite only the most related or recent work. Du et al. (2019); Zou & Gu (2019); Allen-Zhu et al. (2019); Liu et al. (2022) considered optimization of the square loss, which we also consider for our main results, and we also present extensions to a more general class of loss functions. Zou & Gu (2019); Zou et al. (2020); Allen-Zhu et al. (2019); Nguyen & Mondelli (2020); Nguyen (2021); Nguyen et al. (2021) analyzed deep ReLU networks. In contrast, we consider smooth activation functions, similar to (Du et al., 2019; Liu et al., 2022). The convergence analysis of gradient descent in (Du et al., 2019; Allen-Zhu et al., 2019; Zou & Gu, 2019; Zou et al., 2020; Liu et al., 2022) relied on the near constancy of the NTK for wide neural networks (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019; Liu et al., 2020), which yields certain desirable properties for training with gradient descent based methods. One such property is related to the PL condition (Karimi et al., 2016; Nguyen, 2021), formulated as the PL* condition in (Liu et al., 2022). Our work uses a different optimization analysis based on RSC (Wainwright, 2019; Negahban et al., 2012; Negahban & Wainwright, 2012), related to a restricted version of the PL condition. Furthermore, Du et al. (2019); Allen-Zhu et al. (2019); Zou & Gu (2019); Zou et al. (2020) showed convergence in value to a global minimizer of the total loss, as we also do.

3. PROBLEM SETUP: DEEP LEARNING WITH SMOOTH ACTIVATIONS

Consider a training set $\mathcal{D} = \{x_i, y_i\}_{i=1}^n$, $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$, $y_i \in \mathcal{Y} \subseteq \mathbb{R}$. We denote by $X \in \mathbb{R}^{n\times d}$ the matrix whose $i$th row is $x_i^\top$. For a suitable loss function $\ell$, the goal is to minimize the empirical loss
$\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \hat{y}_i) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(\theta; x_i))$,
where the prediction $\hat{y}_i := f(\theta; x_i)$ is from a deep model with parameter vector $\theta \in \mathbb{R}^p$. In our setting $f$ is a feed-forward multi-layer (fully-connected) neural network with depth $L$ and widths $m_l$, $l \in [L] := \{1,\dots,L\}$, given by
$\alpha^{(0)}(x) = x$, $\quad \alpha^{(l)}(x) = \phi\big(\tfrac{1}{\sqrt{m_{l-1}}} W^{(l)} \alpha^{(l-1)}(x)\big)$, $l = 1,\dots,L$, $\quad f(\theta; x) = \alpha^{(L+1)}(x) = \tfrac{1}{\sqrt{m_L}} v^\top \alpha^{(L)}(x)$,
where $W^{(l)} \in \mathbb{R}^{m_l \times m_{l-1}}$, $l \in [L]$, are the layer-wise weight matrices, $v \in \mathbb{R}^{m_L}$ is the last-layer vector, $\phi(\cdot)$ is the smooth (pointwise) activation function, and the full parameter vector is $\theta := (\mathrm{vec}(W^{(1)})^\top, \dots, \mathrm{vec}(W^{(L)})^\top, v^\top)^\top \in \mathbb{R}^{\sum_{k=1}^L m_k m_{k-1} + m_L}$, with $m_0 = d$. For simplicity, we assume all layers have the same width, i.e., $m_l = m$, $l \in [L]$, so that $\theta \in \mathbb{R}^{Lm^2 + m}$. For simplicity, we also consider deep models with a single output, i.e., $f(\theta; x) \in \mathbb{R}$ as in (Du et al., 2019), but our results can be extended to multi-dimensional outputs as in (Zou & Gu, 2019), using $V \in \mathbb{R}^{m_L \times k}$ for $k$ outputs at the last layer; see Appendix C. Define the pointwise loss $\ell_i := \ell(y_i, \cdot): \mathbb{R} \to \mathbb{R}_+$ and denote its first and second derivatives by $\ell_i' := \frac{d\ell(y_i,\hat{y}_i)}{d\hat{y}_i}$ and $\ell_i'' := \frac{d^2\ell(y_i,\hat{y}_i)}{d\hat{y}_i^2}$. The particular case of the square loss is $\ell(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. We denote the gradient and Hessian of $f(\cdot; x_i): \mathbb{R}^p \to \mathbb{R}$ by $\nabla_i f := \frac{\partial f(\theta; x_i)}{\partial\theta}$ and $\nabla_i^2 f := \frac{\partial^2 f(\theta; x_i)}{\partial\theta^2}$. The neural tangent kernel (NTK) $K_{\mathrm{ntk}}(\cdot;\theta) \in \mathbb{R}^{n\times n}$ corresponding to parameter $\theta$ is defined by $K_{\mathrm{ntk}}(x_i, x_j; \theta) = \langle \nabla_i f, \nabla_j f \rangle$. By the chain rule, the gradient and Hessian of the empirical loss w.r.t.
$\theta$ are given by $\frac{\partial\mathcal{L}(\theta)}{\partial\theta} = \frac{1}{n}\sum_{i=1}^n \ell_i' \nabla_i f$ and $\frac{\partial^2\mathcal{L}(\theta)}{\partial\theta^2} = \frac{1}{n}\sum_{i=1}^n \big(\ell_i'' \nabla_i f \nabla_i f^\top + \ell_i' \nabla_i^2 f\big)$. Let $\|\cdot\|_2$ denote the spectral norm for matrices and the $L_2$-norm for vectors. We make the following assumption regarding the activation function $\phi$.

Assumption 1 (Activation function). The activation $\phi$ is 1-Lipschitz, i.e., $|\phi'| \le 1$, and $\beta_\phi$-smooth, i.e., $|\phi''| \le \beta_\phi$.

Remark 3.1. Our analysis holds for any $\varsigma_\phi$-Lipschitz smooth activation, with a dependence on $\varsigma_\phi$ in most key results. The main (qualitative) conclusions stay true if $\varsigma_\phi \le 1 + o(1)$ or $\varsigma_\phi = \mathrm{poly}(L)$, which is typically satisfied for commonly used smooth activations and moderate values of $L$.

We define two types of balls over parameters that will be used throughout our analysis.

Definition 3.1 (Norm balls). Given $\bar\theta \in \mathbb{R}^p$ of the form (2) with parameters $\bar{W}^{(l)}$, $l \in [L]$, and $\bar{v}$, we define
$B^{Spec}_{\rho,\rho_1}(\bar\theta) := \{\theta \in \mathbb{R}^p \text{ as in (2)} \mid \|W^{(l)} - \bar{W}^{(l)}\|_2 \le \rho,\ l \in [L],\ \|v - \bar{v}\|_2 \le \rho_1\}$,
$B^{Euc}_{\rho}(\bar\theta) := \{\theta \in \mathbb{R}^p \text{ as in (2)} \mid \|\theta - \bar\theta\|_2 \le \rho\}$.

Remark 3.2. The layerwise spectral norm ball $B^{Spec}_{\rho,\rho_1}$ plays a key role in our analysis. The last-layer radius $\rho_1$ gives more flexibility, and we will usually assume $\rho_1 \le \rho$; e.g., we could choose the desirable operating regime of $\rho < \sqrt{m}$ and $\rho_1 = O(1)$. Our analysis in fact goes through for any choice of $\rho, \rho_1$, and the detailed results indicate the specific dependencies on both $\rho$ and $\rho_1$.
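As an illustration, the forward recursion above can be sketched in a few lines of NumPy. The $1/\sqrt{m_{l-1}}$ scaling, unit-norm last layer, and input normalization follow the setup; the choice of tanh as the smooth activation and the specific sizes are our own assumptions for the example.

```python
import numpy as np

def init_params(d, m, L, sigma0, rng):
    """Gaussian init W^(l)_ij ~ N(0, sigma0^2); random unit-norm last-layer v."""
    Ws = [rng.normal(0.0, sigma0, size=(m, d if l == 0 else m)) for l in range(L)]
    v = rng.normal(size=m)
    v /= np.linalg.norm(v)  # ||v_0||_2 = 1 as in Assumption 2
    return Ws, v

def forward(Ws, v, x, phi=np.tanh):
    """alpha^(l) = phi(W^(l) alpha^(l-1) / sqrt(m_{l-1})); f = v^T alpha^(L) / sqrt(m)."""
    alpha = x
    for W in Ws:
        alpha = phi(W @ alpha / np.sqrt(alpha.shape[0]))
    return v @ alpha / np.sqrt(alpha.shape[0])

rng = np.random.default_rng(0)
d, m, L, sigma1 = 8, 64, 3, 1.0
sigma0 = sigma1 / (2.0 * (1.0 + np.sqrt(np.log(m) / (2.0 * m))))  # Assumption 2 scale
Ws, v = init_params(d, m, L, sigma0, rng)
x = rng.normal(size=d)
x *= np.sqrt(d) / np.linalg.norm(x)  # normalize so ||x||_2 = sqrt(d)
y = forward(Ws, v, x)
print(float(y))
```

Note that with a bounded activation such as tanh, $|f| \le \|v\|_2 \|\alpha^{(L)}\|_2/\sqrt{m} \le 1$, consistent with the layerwise norm bounds of Appendix A.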

4. SPECTRAL NORM OF THE HESSIAN OF THE MODEL

We start with the following assumption regarding the random initialization of the weights.

Assumption 2 (Initialization weights and data normalization). The initialization weights satisfy $w^{(l)}_{0,ij} \sim N(0, \sigma_0^2)$ for $l \in [L]$, where $\sigma_0 = \frac{\sigma_1}{2\left(1+\sqrt{\frac{\log m}{2m}}\right)}$, $\sigma_1 > 0$, and $v_0$ is a random unit vector with $\|v_0\|_2 = 1$. Further, we assume the input data satisfies $\|x_i\|_2 = \sqrt{d}$, $i \in [n]$.

We focus on bounding the spectral norm of the Hessian, $\|\nabla^2_\theta f(\theta; x)\|_2$, for $\theta \in B^{Spec}_{\rho,\rho_1}(\theta_0)$ and any input $x \in \mathbb{R}^d$ with $\|x\|_2 = \sqrt{d}$. The assumption $\|x\|_2 = \sqrt{d}$ is for convenient scaling; such assumptions are common in the literature (Allen-Zhu et al., 2019; Oymak & Soltanolkotabi, 2020; Nguyen et al., 2021). Prior work (Liu et al., 2020) considered a similar analysis for $\theta \in B^{Euc}_\rho(\theta_0)$, effectively the layerwise Frobenius norm ball, which is much smaller than the layerwise spectral norm ball $B^{Spec}_{\rho,\rho_1}(\theta_0)$. We choose a unit value for the last layer's weight norm for convenience, since our results hold under appropriate scaling for any other $O(1)$ constant. All missing proofs are in Appendix A.

Theorem 4.1 (Hessian Spectral Norm Bound). Under Assumptions 1 and 2, for $\theta \in B^{Spec}_{\rho,\rho_1}(\theta_0)$, with probability at least $1 - \frac{2(L+1)}{m}$, for any $x_i$, $i \in [n]$, we have $\|\nabla^2_\theta f(\theta; x_i)\|_2 \le \frac{c_H}{\sqrt{m}}$, with $c_H = O(L^5(1+\gamma^{6L})(1+\rho_1))$, where $\gamma := \sigma_1 + \frac{\rho}{\sqrt{m}}$.

Remark 4.1 (Desirable operating regimes). The constant $\gamma$ needs careful scrutiny since $c_H$ depends on $\gamma^{6L}$. Let us choose $\rho_1 = O(\mathrm{poly}(L))$. For any choice of the spectral norm radius $\rho < \sqrt{m}$, we can choose $\sigma_1 \le 1 - \frac{\rho}{\sqrt{m}}$, ensuring $\gamma \le 1$ and hence $c_H = O(\mathrm{poly}(L))$. If $\rho = O(1)$, we can keep $\sigma_1 = 1$ so that $\gamma = 1 + \frac{O(1)}{\sqrt{m}}$, and $c_H = O(\mathrm{poly}(L))$ as long as $L < \sqrt{m}$, which is common. Both of these give good choices of $\sigma_1$ and desirable operating regimes for the result. If we choose $\sigma_1 > 1$, an undesirable operating regime, then $c_H = O(c^{\Theta(L)})$, $c > 1$, and we need $m = \Omega(c^{\Theta(L)})$ for the result to be of interest.
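The role of $\sigma_0$ in Assumption 2 can be checked empirically: the scale is calibrated so that $\|W^{(l)}_0\|_2 \le \sigma_1\sqrt{m}$ with high probability (Lemma A.1 in the Appendix). A minimal NumPy sketch, with illustrative sizes and seed of our own choosing:

```python
import numpy as np

m, sigma1 = 512, 1.0
sigma0 = sigma1 / (2.0 * (1.0 + np.sqrt(np.log(m) / (2.0 * m))))

rng = np.random.default_rng(1)
# spectral norms of i.i.d. N(0, sigma0^2) square weight matrices over a few draws
norms = [np.linalg.norm(rng.normal(0.0, sigma0, size=(m, m)), ord=2)
         for _ in range(5)]
print(max(norms), sigma1 * np.sqrt(m))  # empirical max vs. the Lemma A.1 bound
```

At this width the spectral norm concentrates tightly near $2\sigma_0\sqrt{m} \approx 0.93\,\sigma_1\sqrt{m}$, comfortably below the bound.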
Remark 4.2 (Recent Related Work). In recent work, Liu et al. (2020) analyzed the Hessian spectral norm and showed that $c_H = \tilde{O}(\rho^{3L})$ for $\theta \in B^{Euc}_\rho(\theta_0)$ (logarithmic terms hidden in $\tilde{O}(\cdot)$). Our analysis builds on and sharpens the result in (Liu et al., 2020) in three respects: (a) we have $c_H = O(\mathrm{poly}(L)(1+\gamma^{6L}))$ for $\rho_1 = O(\mathrm{poly}(L))$, where we can choose $\sigma_1$ to make $\gamma \le 1$ and thus obtain $c_H = O(\mathrm{poly}(L))$, instead of the worse $c_H = \tilde{O}(\rho^{3L})$ in Liu et al. (2020)[foot_0]; (b) even for the same $\rho$, our results hold for a much larger spectral norm ball $B^{Spec}_{\rho,\rho_1}(\theta_0)$ compared to the Euclidean norm ball $B^{Euc}_\rho(\theta_0)$ in (Liu et al., 2020); and (c) to avoid an exponential term, the bound in (Liu et al., 2020) needs $\rho \le 1$, whereas our result can use radius $\rho < \sqrt{m}$ for all intermediate layer matrices and $\rho_1 = O(\mathrm{poly}(L))$ for the last-layer vector. Moreover, as a consequence of (b) and (c), our results hold for a larger (spectral norm) ball whose radius can increase with $m$, unlike the results in Liu et al. (2020), which hold for a smaller (Euclidean) ball with constant radius, i.e., "near initialization."

Remark 4.3 (Exact constant $c_H$). For completeness, we give the exact expression of the constant $c_H$ in Theorem 4.1, so that the dependencies on the different factors are clear. Let $h(l) := \gamma^{l-1} + |\phi(0)|\sum_{i=1}^{l-1}\gamma^{i-1}$. Then,
$c_H = 2L(L^2\gamma^{2L} + L\gamma^L + 1)\cdot(1+\rho_1)\cdot\psi_H\cdot\max_{l\in[L]}\gamma^{L-l} + 2L\gamma^L\max_{l\in[L]}h(l)$,
where
$\psi_H = \max_{1\le l_1<l_2\le L}\Big\{\beta_\phi (h(l_1))^2,\ h(l_1)\tfrac{\beta_\phi}{2}(\gamma^2 + (h(l_2))^2) + 1,\ \beta_\phi \gamma^2 h(l_1)h(l_2)\Big\}$.
The source of these terms will be discussed shortly. Note the dependence on $\rho_1$, the radius for the last layer in $B^{Spec}_{\rho,\rho_1}(\theta_0)$, and why $\rho_1 = O(\mathrm{poly}(L))$ is a desirable operating regime. Next, we give a high-level outline of the proof of Theorem 4.1.

Proof sketch. Our analysis follows the structure developed in Liu et al. (2020), but is considerably sharper, as discussed in Remark 4.2.
We start by defining the following quantities:
$Q_\infty(f) := \max_{1\le l\le L} \big\|\frac{\partial f}{\partial \alpha^{(l)}}\big\|_\infty$, with $\frac{\partial f}{\partial \alpha^{(l)}} \in \mathbb{R}^m$;
$Q_2(f) := \max_{1\le l\le L} \big\|\frac{\partial \alpha^{(l)}}{\partial w^{(l)}}\big\|_2$, with $w^{(l)} := \mathrm{vec}(W^{(l)})$, $\frac{\partial \alpha^{(l)}}{\partial w^{(l)}} \in \mathbb{R}^{m\times m^2}$;
and $Q_{2,2,1}(f)$, the maximum over $1\le l_1 < l_2 < l_3 \le L$ of the three quantities
$\big\|\frac{\partial^2 \alpha^{(l_1)}}{\partial (w^{(l_1)})^2}\big\|_{2,2,1}$, $\quad \big\|\frac{\partial \alpha^{(l_1)}}{\partial w^{(l_1)}}\big\|_2 \big\|\frac{\partial^2 \alpha^{(l_2)}}{\partial \alpha^{(l_2-1)}\partial w^{(l_2)}}\big\|_{2,2,1}$, $\quad \big\|\frac{\partial \alpha^{(l_1)}}{\partial w^{(l_1)}}\big\|_2 \big\|\frac{\partial \alpha^{(l_2)}}{\partial w^{(l_2)}}\big\|_2 \big\|\frac{\partial^2 \alpha^{(l_3)}}{\partial (\alpha^{(l_3-1)})^2}\big\|_{2,2,1}$,
where for an order-3 tensor $T \in \mathbb{R}^{d_1\times d_2\times d_3}$ the $(2,2,1)$-norm is
$\|T\|_{2,2,1} := \sup_{\|x\|_2=\|z\|_2=1} \sum_{k=1}^{d_3}\Big|\sum_{i=1}^{d_1}\sum_{j=1}^{d_2} T_{ijk}\, x_i z_j\Big|$, $\quad x \in \mathbb{R}^{d_1}$, $z \in \mathbb{R}^{d_2}$.
The following result from (Liu et al., 2020) provides an upper bound on the spectral norm of the Hessian.

Theorem 4.2 (Liu et al. (2020), Theorem 3.1). Under Assumption 1, assuming there is a $\delta$ such that $\big\|\frac{\partial \alpha^{(l)}}{\partial \alpha^{(l-1)}}\big\|_2 \le \delta$, with $C_1 \le L^2\delta^{2L} + L\delta^L + L$ and $C_2 \le L\delta^L$, we have
$\|\nabla^2_\theta f(\theta; x)\|_2 \le 2C_1 Q_{2,2,1}(f)\, Q_\infty(f) + \frac{2}{\sqrt{m}} C_2\, Q_2(f)$.

In order to prove Theorem 4.1, we show that Theorem 4.2 holds with high probability with $\delta = \gamma$, $Q_2(f) = O(L(1+\gamma^L))$, $Q_{2,2,1}(f) = O(L^3(1+\gamma^{3L}))$, and $Q_\infty(f) = O\big(\frac{(1+\gamma^L)(1+\rho_1)}{\sqrt{m}}\big)$. Thus the upper bound in Theorem 4.2 becomes $O\big(\frac{\mathrm{poly}(L)(1+\gamma^{6L})(1+\rho_1)}{\sqrt{m}}\big)$, a benign polynomial dependence on $L$ when $\gamma \le 1$, rather than an exponential dependence on the radius $\rho$ as in (Liu et al., 2020). The analysis for bounding the spectral norm of the Hessian can be used to establish additional bounds, which we believe are of independent interest, some of which will be used later in Section 5. First, we bound the norms of the gradient of the predictor and of the loss w.r.t. the weight vector $\theta$ and the input data $x$.

Lemma 4.1 (Predictor gradient bounds).
Under Assumptions 1 and 2, for $\theta \in B^{Spec}_{\rho,\rho_1}(\theta_0)$, with probability at least $1 - \frac{2(L+1)}{m}$, we have $\|\nabla_\theta f(\theta; x)\|_2 \le \varrho$ and $\|\nabla_x f(\theta; x)\|_2 \le \frac{\gamma^L}{\sqrt{m}}(1 + \rho_1)$, with
$\varrho^2 = (h(L+1))^2 + \frac{1}{m}(1+\rho_1)^2 \sum_{l=1}^L (h(l))^2 \gamma^{2(L-l)}$, $\quad \gamma = \sigma_1 + \frac{\rho}{\sqrt{m}}$, $\quad h(l) = \gamma^{l-1} + |\phi(0)|\sum_{i=1}^{l-1}\gamma^{i-1}$.

Remark 4.4. Our analysis in Lemma 4.1 provides a bound on the Lipschitz constant of the predictor, a quantity which has generated interest in recent work on robust training (Salman et al., 2019; Cohen et al., 2020; Bubeck & Sellke, 2021). Under the assumption of square loss, further bounds can be obtained.
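The $(2,2,1)$-tensor norm used in the proof sketch above can be made concrete numerically. The sketch below lower-bounds $\|T\|_{2,2,1}$ by random search over unit vectors $x, z$; the tensor, sizes, and trial count are illustrative assumptions, and this is a Monte-Carlo lower bound rather than an exact computation.

```python
import numpy as np

def norm_221_lower_bound(T, trials=2000, rng=None):
    """Monte-Carlo lower bound on ||T||_{2,2,1} = sup_{||x||=||z||=1} sum_k |x^T T[:,:,k] z|."""
    rng = rng or np.random.default_rng(0)
    d1, d2, d3 = T.shape
    best = 0.0
    for _ in range(trials):
        x = rng.normal(size=d1); x /= np.linalg.norm(x)
        z = rng.normal(size=d2); z /= np.linalg.norm(z)
        val = sum(abs(x @ T[:, :, k] @ z) for k in range(d3))
        best = max(best, val)
    return best

rng = np.random.default_rng(0)
T = rng.normal(size=(4, 5, 3))
lb = norm_221_lower_bound(T, rng=rng)
# sum of slice spectral norms is always an upper bound on ||T||_{2,2,1}
ub = sum(np.linalg.norm(T[:, :, k], ord=2) for k in range(T.shape[2]))
print(lb, ub)
```

The slice-wise upper bound follows since the supremum over a common pair $(x, z)$ cannot exceed the sum of per-slice suprema.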

5. OPTIMIZATION GUARANTEES WITH RESTRICTED STRONG CONVEXITY

We focus on minimizing the empirical loss $\mathcal{L}(\theta)$ over $\theta \in B^{Spec}_{\rho,\rho_1}(\theta_0)$, the layerwise spectral norm ball in (3). Our analysis is based on Restricted Strong Convexity (RSC) (Negahban et al., 2012; Banerjee et al., 2014; Chen & Banerjee, 2015; Wainwright, 2019), which relaxes the definition of strong convexity by requiring strong convexity only in certain directions or over a subset of the ambient space. We introduce the following specific definition of RSC with respect to a tuple $(S, \theta)$.

Definition 5.1 (Restricted Strong Convexity (RSC)). A function $\mathcal{L}$ is said to satisfy $\alpha$-restricted strong convexity ($\alpha$-RSC) with respect to the tuple $(S, \theta)$ if for any $\theta' \in S \subseteq \mathbb{R}^p$ and some fixed $\theta \in \mathbb{R}^p$, we have $\mathcal{L}(\theta') \ge \mathcal{L}(\theta) + \langle \theta' - \theta, \nabla_\theta \mathcal{L}(\theta)\rangle + \frac{\alpha}{2}\|\theta' - \theta\|_2^2$, with $\alpha > 0$.

Note that $\mathcal{L}$ being $\alpha$-RSC w.r.t. $(S, \theta)$ does not require $\mathcal{L}$ to be convex on $\mathbb{R}^p$. Let us consider a sequence of iterates $\{\theta_t\}_{t\ge 0} \subset \mathbb{R}^p$. Our RSC analysis relies on the following $Q^t_\kappa$-sets at step $t$, which avoid directions almost orthogonal to the average gradient of the predictor. For two vectors $\pi$ and $\bar\pi$, $\cos(\pi, \bar\pi)$ denotes the cosine of the angle between $\pi$ and $\bar\pi$.

Definition 5.2 ($Q^t_\kappa$ sets). For iterate $\theta_t \in \mathbb{R}^p$, let $\bar{g}_t = \frac{1}{n}\sum_{i=1}^n \nabla_\theta f(\theta_t; x_i)$. For any $\kappa \in (0, 1]$, define $Q^t_\kappa := \{\theta \in \mathbb{R}^p \mid |\cos(\theta - \theta_t, \bar{g}_t)| \ge \kappa\}$.

We define the set $B_t := Q^t_\kappa \cap B^{Spec}_{\rho,\rho_1}(\theta_0) \cap B^{Euc}_{\rho_2}(\theta_t)$. For general losses, we make the following assumption.

Assumption 3 (Loss function). The pointwise loss $\ell_i$, $i \in [n]$, is (i) strongly convex, i.e., $\ell_i'' \ge a > 0$, and (ii) smooth, i.e., $\ell_i'' \le b$.

Assumption 3 is satisfied by commonly used loss functions such as the square loss, for which $a = b = 2$. We state the RSC result for the square loss; the result for other losses and the proofs of all technical results in this section are in Appendix B.

Theorem 5.1 (RSC for Square Loss).
For the square loss, under Assumptions 1 and 2, with probability at least $1 - \frac{2(L+1)}{m}$, for all $\theta' \in Q^t_\kappa \cap B^{Spec}_{\rho,\rho_1}(\theta_0) \cap B^{Euc}_{\rho_2}(\theta_t)$ with $\theta_t \in B^{Spec}_{\rho,\rho_1}(\theta_0)$,
$\mathcal{L}(\theta') \ge \mathcal{L}(\theta_t) + \langle \theta' - \theta_t, \nabla_\theta \mathcal{L}(\theta_t)\rangle + \frac{\alpha_t}{2}\|\theta' - \theta_t\|_2^2$, with $\alpha_t = c_1 \|\bar{g}_t\|_2^2 - \frac{c_2}{\sqrt{m}}$,
where $\bar{g}_t = \frac{1}{n}\sum_{i=1}^n \nabla_\theta f(\theta_t; x_i)$, $c_1 = 2\kappa^2$, and $c_2$ is an explicit constant given in the proof. In other words, $\mathcal{L}$ satisfies $\alpha_t$-RSC w.r.t. the tuple $(Q^t_\kappa \cap B^{Spec}_{\rho,\rho_1}(\theta_0) \cap B^{Euc}_{\rho_2}(\theta_t), \theta_t)$ whenever $\alpha_t > 0$.

Remark 5.1. The RSC condition $\alpha_t > 0$ is satisfied at iteration $t$ as long as $\|\bar{g}_t\|_2^2 > \frac{c_2}{c_1\sqrt{m}}$, where $c_1, c_2$ are as specified in Theorem 5.1. Indeed, if $\gamma$ (and so $\sigma_1$ and $\rho$) is chosen according to the desirable operating regimes (see Remark 4.1), $\rho_1 = O(\mathrm{poly}(L))$ and $\rho_2 = O(\mathrm{poly}(L))$, then we can use the bounds from Lemma 4.2 and obtain that the RSC condition is satisfied when $\|\bar{g}_t\|_2^2 > \frac{O(\mathrm{poly}(L))}{\sqrt{m}}$. The condition is arguably mild, does not need the NTK condition $\lambda_{\min}(K_{\mathrm{ntk}}(\cdot;\theta_t)) > 0$, and is expected to hold until convergence (see Remark 5.3). Moreover, it is a local condition at step $t$ and has no dependence on being "near initialization" in the sense of $\theta_t \in B^{Euc}_\rho(\theta_0)$ for $\rho = O(1)$ as in (Liu et al., 2020; 2022). For the convergence analysis, we also need to establish a smoothness property of the total loss.

Theorem 5.2 (Local Smoothness for Square Loss). For the square loss, under Assumptions 1 and 2, with probability at least $1 - \frac{2(L+1)}{m}$, for all $\theta, \theta' \in B^{Spec}_{\rho,\rho_1}(\theta_0)$,
$\mathcal{L}(\theta') \le \mathcal{L}(\theta) + \langle \theta' - \theta, \nabla_\theta \mathcal{L}(\theta)\rangle + \frac{\beta}{2}\|\theta' - \theta\|_2^2$, with $\beta = 2\varrho^2 + \frac{2 c_H \sqrt{c_{\rho_1,\gamma}}}{\sqrt{m}}$,
with $c_H$ as in Theorem 4.1, $\varrho$ as in Lemma 4.1, and $c_{\rho_1,\gamma}$ as in Lemma 4.2. Consequently, $\mathcal{L}$ is locally $\beta$-smooth. Moreover, if $\gamma$ (and so $\sigma_1$ and $\rho$) is chosen according to the desirable operating regimes (see Remark 4.1) and $\rho_1 = O(\mathrm{poly}(L))$, then $\beta = O(\mathrm{poly}(L))$.

Remark 5.2. As with standard strong convexity and smoothness, the RSC and smoothness parameters in Theorems 5.1 and 5.2 respectively satisfy $\alpha_t < \beta$.
To see this, note that $\alpha_t < 2\kappa^2\|\bar{g}_t\|_2^2 \le 2\varrho^2 \le \beta$, where the second inequality follows since $\kappa \le 1$ and $\|\bar{g}_t\|_2^2 \le \varrho^2$ by Lemma 4.1. Next, we show that the RSC condition w.r.t. the tuple $(B_t, \theta_t)$ implies a restricted Polyak-Łojasiewicz (RPL) condition w.r.t. the tuple $(B_t, \theta_t)$, unlike the standard PL condition, which holds without restrictions (Karimi et al., 2016).

Lemma 5.1 (RSC ⇒ RPL). Let $B_t := Q^t_\kappa \cap B^{Spec}_{\rho,\rho_1}(\theta_0) \cap B^{Euc}_{\rho_2}(\theta_t)$. In the setting of Theorem 5.1, if $\alpha_t > 0$, then the tuple $(B_t, \theta_t)$ satisfies the Restricted Polyak-Łojasiewicz (RPL) condition, i.e.,
$\mathcal{L}(\theta_t) - \inf_{\theta\in B_t}\mathcal{L}(\theta) \le \frac{1}{2\alpha_t}\|\nabla_\theta \mathcal{L}(\theta_t)\|_2^2$,
with probability at least $1 - \frac{2(L+1)}{m}$.

For the rest of the convergence analysis, we make the following assumption, where $T$ can be viewed as the stopping time, so that the convergence analysis holds as long as the assumptions are satisfied.

Assumption 4 (Iterates' conditions). For iterates $\{\theta_t\}_{t=0,1,\dots,T}$: (A4.1) $\alpha_t > 0$; (A4.2) $\theta_t \in B^{Spec}_{\rho,\rho_1}(\theta_0)$.

Remark 5.3 (Assumption (A4.1)). From Remark 5.1, (A4.1) is satisfied as long as $\|\bar{g}_t\|_2^2 > \frac{c_2}{c_1\sqrt{m}}$, where $c_1, c_2$ are as in Theorem 5.1, which is arguably a mild condition. In Section 6 we present empirical findings showing that this condition on $\|\bar{g}_t\|_2^2$ behaves well in practice.

We now consider the particular case of gradient descent (GD) for the iterates: $\theta_{t+1} = \theta_t - \eta_t \nabla\mathcal{L}(\theta_t)$, where $\eta_t$ is chosen so that $\theta_{t+1} \in B^{Spec}_{\rho,\rho_1}(\theta_0)$ and $\rho_2$ is chosen so that $\theta_{t+1} \in B^{Euc}_{\rho_2}(\theta_t)$, which are sufficient for the analysis of Theorem 5.1; we specify suitable choices in the sequel (see Remark 5.4). Given RPL w.r.t. $(B_t, \theta_t)$, gradient descent leads to a strict decrease of the loss in $B_t$.

Lemma 5.2 (Local Loss Reduction in $B_t$). Let $\alpha_t, \beta$ be as in Theorems 5.1 and 5.2 respectively, and $B_t := Q^t_\kappa \cap B^{Spec}_{\rho,\rho_1}(\theta_0) \cap B^{Euc}_{\rho_2}(\theta_t)$. Consider Assumptions 1, 2, and 4, and gradient descent with step size $\eta_t = \frac{\omega_t}{\beta}$, $\omega_t \in (0, 2)$.
Then, for any $\bar\theta_{t+1} \in \arg\inf_{\theta\in B_t}\mathcal{L}(\theta)$, we have with probability at least $1 - \frac{2(L+1)}{m}$,
$\mathcal{L}(\theta_{t+1}) - \mathcal{L}(\bar\theta_{t+1}) \le \Big(1 - \frac{\alpha_t\omega_t}{\beta}(2 - \omega_t)\Big)\big(\mathcal{L}(\theta_t) - \mathcal{L}(\bar\theta_{t+1})\big)$.

Building on Lemma 5.2, we show that GD in fact leads to a geometric decrease of the loss relative to the minimum value of $\mathcal{L}(\cdot)$ in the set $B^{Spec}_{\rho,\rho_1}(\theta_0)$.

Theorem 5.3 (Global Loss Reduction in $B^{Spec}_{\rho,\rho_1}(\theta_0)$). Let $\alpha_t, \beta$ be as in Theorems 5.1 and 5.2 respectively, and $B_t := Q^t_\kappa \cap B^{Spec}_{\rho,\rho_1}(\theta_0) \cap B^{Euc}_{\rho_2}(\theta_t)$. Let $\theta^* \in \arg\inf_{\theta\in B^{Spec}_{\rho,\rho_1}(\theta_0)}\mathcal{L}(\theta)$, $\bar\theta_{t+1} \in \arg\inf_{\theta\in B_t}\mathcal{L}(\theta)$, and $\gamma_t := \frac{\mathcal{L}(\bar\theta_{t+1}) - \mathcal{L}(\theta^*)}{\mathcal{L}(\theta_t) - \mathcal{L}(\theta^*)}$. Consider Assumptions 1, 2, and 4, and gradient descent with step size $\eta_t = \frac{\omega_t}{\beta}$, $\omega_t \in (0, 2)$. Then, with probability at least $1 - \frac{2(L+1)}{m}$, we have $\gamma_t \in [0, 1)$ and
$\mathcal{L}(\theta_{t+1}) - \mathcal{L}(\theta^*) \le \Big(1 - \frac{\alpha_t\omega_t}{\beta}(1 - \gamma_t)(2 - \omega_t)\Big)\big(\mathcal{L}(\theta_t) - \mathcal{L}(\theta^*)\big)$.

As long as the conditions of Theorem 5.3 hold across iterations, the loss decreases geometrically. For Assumption 4, we have discussed (A4.1) in Remark 5.1, and we discuss (A4.2) next.

Remark 5.4 (Assumption (A4.2)). Suppose we run gradient descent until some stopping time $T > 0$. Given radius $\rho < \sqrt{m}$, Assumption (A4.2), i.e., $\theta_t \in B^{Spec}_{\rho,\rho_1}(\theta_0)$, $t = 0,\dots,T$, can be verified empirically. Alternatively, we can choose suitable step sizes $\eta_t$ to ensure the property using the geometric convergence from Theorem 5.3. Assume our goal is to reach $\mathcal{L}(\theta_T) - \mathcal{L}(\theta^*) \le \epsilon$. Then, with $\chi_T := \min_{t\in[T]} \frac{\alpha_t\omega_t}{\beta}(1 - \gamma_t)(2 - \omega_t)$, Assumption (A4.1) along with Remark 5.2 ensures $\chi_T < 1$. Then it suffices to have $T = \lceil \log(\frac{\mathcal{L}(\theta_0) - \mathcal{L}(\theta^*)}{\epsilon}) / \log\frac{1}{1-\chi_T}\rceil = \Theta(\log\frac{1}{\epsilon})$. Then, to ensure $\theta_t \in B^{Spec}_{\rho,\rho_1}(\theta_0)$, $t \in [T]$, in the case of the square loss, since $\|\nabla\mathcal{L}(\theta_t)\|_2 \le c$ for some constant $c$ (see Corollary 4.1), it suffices to have $\eta_t \le \frac{\min\{\rho,\rho_1\}}{\Theta(\log\frac{1}{\epsilon})}$.
Moreover, having $\rho_2 \ge \eta_t c$ ensures $\|\theta_{t+1} - \theta_t\|_2 \le \rho_2$, i.e., $\theta_{t+1} \in B^{Euc}_{\rho_2}(\theta_t)$, which in this case can be guaranteed if $\rho_2 \ge \frac{\min\{\rho,\rho_1\}}{\Theta(\log\frac{1}{\epsilon})}$. The argument above is informal, but it illustrates that Assumption (A4.1) along with suitable constant step sizes $\eta_t$ ensures (A4.2). Thus Assumption (A4.1), which ensures the RSC condition, is the main assumption behind the analysis. The conditions in Assumption 4 (see Remarks 5.1 and 5.4), along with Theorem 5.3, imply that the RSC-based convergence analysis holds for the much larger layerwise spectral norm ball $B^{Spec}_{\rho,\rho_1}(\theta_0)$ with any radius $\rho < \sqrt{m}$ and $\rho_1 = O(\mathrm{poly}(L))$.

Remark 5.5 (RSC and NTK). In the context of the square loss, the NTK condition for geometric convergence needs $\lambda_{\min}(K_{\mathrm{ntk}}(\cdot;\theta_t)) \ge c_0 > 0$ for every $t$, i.e., uniformly bounded away from 0 by a constant $c_0 > 0$. The NTK condition can also be written as
$\inf_{v:\|v\|_2=1} \Big\|\sum_{i=1}^n v_i \nabla_\theta f(\theta_t; x_i)\Big\|_2^2 \ge c_0 > 0$. (15)
In contrast, the proposed RSC condition (Theorem 5.1) needs
$\Big\|\frac{1}{n}\sum_{i=1}^n \nabla_\theta f(\theta_t; x_i)\Big\|_2^2 \ge \frac{\bar{c}_0}{\sqrt{m}}$,
where $m$ is the width and $\bar{c}_0 = \frac{c_2}{c_1}$, with $c_1, c_2$ the constants defined in Theorem 5.1. As a quadratic form on the NTK, the RSC condition can be viewed as using the specific $v$ in (15) with $v_i = \frac{1}{\sqrt{n}}$ for $i \in [n]$, since the RSC condition is $\big\|\sum_{i=1}^n \frac{1}{\sqrt{n}}\nabla_\theta f(\theta_t; x_i)\big\|_2^2 \ge \frac{\bar{c}_0 n}{\sqrt{m}}$. For $m = \Omega(n^2)$, the RSC condition is more general, since NTK ⇒ RSC, but the converse is not necessarily true.

Remark 5.6 (RSC covers different settings than NTK). The NTK condition may be violated in certain settings, e.g., when $\nabla_\theta f(\theta_t; x_i)$, $i = 1,\dots,n$, are linearly dependent, when $x_i \approx x_j$ for some $i \ne j$, when layer widths are small ($m_l < n$), etc., yet the optimization may still work in practice. The RSC condition provides a way to analyze convergence in such settings.
The RSC condition is violated when $\frac{1}{n}\sum_{i=1}^n \nabla_\theta f(\theta_t; x_i) \approx 0$, which does not seem to happen in practice (see Section 6); future work will focus on understanding this phenomenon. Finally, note that it is possible to construct a set of gradient vectors which satisfies the NTK condition but violates the RSC condition. Our perspective is to view the NTK and the RSC as two different sufficient conditions: geometric convergence of gradient descent (GD) is guaranteed as long as one of them is satisfied at every step.
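To see the mechanism behind the RSC condition in the simplest possible case, consider a linear predictor $f(\theta; x) = \theta^\top x$ with square loss in the overparameterized regime $n < p$. The Hessian $\frac{2}{n}X^\top X$ is rank-deficient, so $\mathcal{L}$ is not strongly convex; yet along any unit direction $v$ with $|\cos(v, \bar{g})| \ge \kappa$ (here $\bar{g} = \bar{x}$, the average input), Cauchy-Schwarz gives $v^\top \nabla^2\mathcal{L}\, v \ge 2(\bar{x}^\top v)^2 \ge 2\kappa^2\|\bar{g}\|_2^2\|v\|_2^2$, mirroring the $c_1 = 2\kappa^2$ factor of Theorem 5.1. A NumPy sketch (sizes, seed, and sampling scheme are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, kappa = 10, 50, 0.3
X = rng.normal(size=(n, p))            # rows x_i; linear predictor f(theta; x) = theta @ x
H = (2.0 / n) * X.T @ X                # Hessian of L(theta) = (1/n) sum_i (y_i - theta @ x_i)^2
g_bar = X.mean(axis=0)                 # average predictor gradient (here constant in theta)
g_hat = g_bar / np.linalg.norm(g_bar)

alpha = 2.0 * kappa**2 * (g_bar @ g_bar)   # restricted curvature level, c_1 = 2 kappa^2
ok = True
for _ in range(200):
    # build a unit vector v with cos(v, g_bar) = c >= kappa, i.e., v in Q_kappa
    u = rng.normal(size=p)
    u -= (u @ g_hat) * g_hat           # component orthogonal to g_bar
    u /= np.linalg.norm(u)
    c = kappa + (1.0 - kappa) * rng.random()
    v = c * g_hat + np.sqrt(1.0 - c**2) * u
    ok = ok and (v @ H @ v >= alpha - 1e-9)  # curvature along v dominates alpha
print(ok)
```

The restricted curvature bound holds for every direction in the cone, even though $H$ has rank at most $n < p$ and hence no unrestricted strong convexity.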

6. RSC CONDITION: EXPERIMENTAL RESULTS

In this section, we present experimental results verifying the RSC condition $\|\frac{1}{n}\sum_{i=1}^n \nabla_\theta f(\theta_t; x_i)\|_2^2 = \Omega\big(\frac{\mathrm{poly}(L)}{\sqrt{m}}\big)$, $t = 1,\dots,T$, on standard benchmarks: CIFAR-10, MNIST, and Fashion-MNIST. For simplicity, as before, we use $\bar{g}_t = \frac{1}{n}\sum_{i=1}^n \nabla_\theta f(\theta_t; x_i)$. In Figure 1(a), we consider CIFAR-10 and show the trajectory of $\|\bar{g}_t\|_2$ over iterations $t$ for different values of the network width $m$. For every width, the value of $\|\bar{g}_t\|_2$ stabilizes to a constant over iterations, empirically validating the RSC condition $\|\bar{g}_t\|_2^2 = \Omega(\mathrm{poly}(L)/\sqrt{m})$. Interestingly, the smallest value of $\|\bar{g}_t\|_2$ seems to increase with the width. To study the width dependence further, in Figure 1
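A miniature version of this experiment can be run on synthetic data: train a one-hidden-layer tanh network with GD and track $\|\bar{g}_t\|_2$, the norm of the average predictor gradient. All sizes, data, and hyperparameters below are illustrative assumptions of ours, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, n = 5, 32, 20
X = rng.normal(size=(n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)  # ||x_i||_2 = sqrt(d)
y = rng.normal(size=n)

W = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))
v = rng.normal(size=m); v /= np.linalg.norm(v)

def predictor_grad(W, v, x):
    """Gradient of f(theta; x) = v^T tanh(Wx/sqrt(d)) / sqrt(m) w.r.t. (W, v)."""
    z = W @ x / np.sqrt(d)
    a = np.tanh(z)
    gW = np.outer(v * (1.0 - a**2), x) / (np.sqrt(m) * np.sqrt(d))
    gv = a / np.sqrt(m)
    return gW, gv, v @ a / np.sqrt(m)

eta, gbar_norms = 0.5, []
for t in range(50):
    gW_bar = np.zeros_like(W); gv_bar = np.zeros_like(v)   # average predictor grad
    dW = np.zeros_like(W); dv = np.zeros_like(v)           # loss gradient
    for xi, yi in zip(X, y):
        gW, gv, fi = predictor_grad(W, v, xi)
        gW_bar += gW / n; gv_bar += gv / n
        dW += 2.0 * (fi - yi) * gW / n; dv += 2.0 * (fi - yi) * gv / n
    gbar_norms.append(np.sqrt(np.sum(gW_bar**2) + np.sum(gv_bar**2)))
    W -= eta * dW; v -= eta * dv
print(min(gbar_norms), gbar_norms[-1])
```

In runs like this, $\|\bar{g}_t\|_2$ stays bounded away from zero across iterations, consistent with the behavior reported in Figure 1.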

7. CONCLUSIONS

In this paper, we revisit deep learning optimization for feedforward models with smooth activations and make two technical contributions. First, we bound the spectral norm of the Hessian over a large layerwise spectral norm ball, highlighting the role of initialization in such analysis. Second, we introduce a new approach to showing geometric convergence in deep learning optimization using restricted strong convexity (RSC). Our analysis sheds considerable new light on deep learning optimization problems, underscores the importance of the initialization variance, and introduces an RSC-based alternative to the prevailing NTK-based analysis, which may fuel future work.

A SPECTRAL NORM OF THE HESSIAN

We establish the main theorem from Section 4 in this appendix.

Theorem 4.1 (Hessian Spectral Norm Bound). Under Assumptions 1 and 2, for $\theta \in B^{Spec}_{\rho,\rho_1}(\theta_0)$, with probability at least $1 - \frac{2(L+1)}{m}$, for any $x_i$, $i \in [n]$, we have $\|\nabla^2_\theta f(\theta; x_i)\|_2 \le \frac{c_H}{\sqrt{m}}$, with $c_H = O(L^5(1+\gamma^{6L})(1+\rho_1))$, where $\gamma := \sigma_1 + \frac{\rho}{\sqrt{m}}$.

A.1 ANALYSIS OUTLINE

Our analysis follows that of Liu et al. (2020) and sharpens it to get a better dependence on the depth $L$ of the neural network. We start by defining the following quantities:
$Q_\infty(f) := \max_{1\le l\le L} \big\|\frac{\partial f}{\partial \alpha^{(l)}}\big\|_\infty$, with $\frac{\partial f}{\partial \alpha^{(l)}} \in \mathbb{R}^m$;
$Q_2(f) := \max_{1\le l\le L} \big\|\frac{\partial \alpha^{(l)}}{\partial w^{(l)}}\big\|_2$, with $w^{(l)} := \mathrm{vec}(W^{(l)})$, $\frac{\partial \alpha^{(l)}}{\partial w^{(l)}} \in \mathbb{R}^{m\times m^2}$;
$Q_{2,2,1}(f) := \max_{1\le l_1<l_2<l_3\le L}\Big\{\big\|\frac{\partial^2 \alpha^{(l_1)}}{\partial (w^{(l_1)})^2}\big\|_{2,2,1},\ \big\|\frac{\partial \alpha^{(l_1)}}{\partial w^{(l_1)}}\big\|_2\big\|\frac{\partial^2 \alpha^{(l_2)}}{\partial \alpha^{(l_2-1)}\partial w^{(l_2)}}\big\|_{2,2,1},\ \big\|\frac{\partial \alpha^{(l_1)}}{\partial w^{(l_1)}}\big\|_2\big\|\frac{\partial \alpha^{(l_2)}}{\partial w^{(l_2)}}\big\|_2\big\|\frac{\partial^2 \alpha^{(l_3)}}{\partial (\alpha^{(l_3-1)})^2}\big\|_{2,2,1}\Big\}$,
where for an order-3 tensor $T \in \mathbb{R}^{d_1\times d_2\times d_3}$ the $(2,2,1)$-norm is defined as
$\|T\|_{2,2,1} := \sup_{\|x\|_2=\|z\|_2=1}\sum_{k=1}^{d_3}\Big|\sum_{i=1}^{d_1}\sum_{j=1}^{d_2} T_{ijk}\, x_i z_j\Big|$, $\quad x \in \mathbb{R}^{d_1}$, $z \in \mathbb{R}^{d_2}$.
We will also use the notation $W^{(L+1)} := v$. A key result established in Liu et al. (2020) provides an upper bound on the spectral norm of the Hessian:

Theorem 4.2 (Liu et al. (2020), Theorem 3.1). Under Assumption 1, assuming there is a $\delta$ such that $\big\|\frac{\partial \alpha^{(l)}}{\partial \alpha^{(l-1)}}\big\|_2 \le \delta$, with $C_1 \le L^2\delta^{2L} + L\delta^L + L$ and $C_2 \le L\delta^L$, we have
$\|\nabla^2_\theta f(\theta; x)\|_2 \le 2C_1 Q_{2,2,1}(f)\, Q_\infty(f) + \frac{2}{\sqrt{m}} C_2\, Q_2(f)$.

In order to prove Theorem 4.1, we show that Theorem 4.2 holds with high probability, where
• $\delta = \gamma$ follows from Lemma A.3,
• $Q_2(f) = O(L(1+\gamma^L))$ follows from Lemma A.4,
• $Q_{2,2,1}(f) = O(L^3(1+\gamma^{3L}))$ follows from Lemmas A.4 and A.5, and
• $Q_\infty(f) = O\big(\frac{(1+\gamma^L)(1+\rho_1)}{\sqrt{m}}\big)$ follows from Lemma A.7,
while also establishing precise constants to obtain the explicit form of $c_H$ in Theorem 4.1. As a result, $\|\nabla^2_\theta f(\theta; x)\|_2 = O\big(\frac{L^5(1+\gamma^{6L})(1+\rho_1)}{\sqrt{m}}\big)$, i.e., $c_H = O(L^5(1+\gamma^{6L})(1+\rho_1))$.
A.2 SPECTRAL NORMS OF $W^{(l)}$ AND $L_2$ NORMS OF $\alpha^{(l)}$

We start by bounding the spectral norm of the layer-wise matrices at initialization.

Lemma A.1. If $w^{(l)}_{0,ij} \sim N(0, \sigma_0^2)$, where $\sigma_0 = \frac{\sigma_1}{2\left(1+\sqrt{\frac{\log m}{2m}}\right)}$ as in Assumption 2, then with probability at least $1 - \frac{2}{m}$, we have $\|W^{(l)}_0\|_2 \le \sigma_1\sqrt{m}$.

Proof. For an $(m_l \times m_{l-1})$ random matrix $W^{(l)}_0$ with i.i.d. entries $w^{(l)}_{0,ij} \sim N(0, \sigma_0^2)$, with probability at least $1 - 2\exp(-t^2/2\sigma_0^2)$, the largest singular value of $W^{(l)}_0$ is bounded as
$\sigma_{\max}(W^{(l)}_0) \le \sigma_0(\sqrt{m_l} + \sqrt{m_{l-1}}) + t$.
This concentration result can be derived as follows: notice that $W^{(l)}_0 = \sigma_0 \widetilde{W}^{(l)}_0$, where $\widetilde{w}^{(l)}_{0,ij} \sim N(0, 1)$; thus we can use the expectation bound $\mathbb{E}[\|W^{(l)}_0\|_2] = \sigma_0\,\mathbb{E}[\|\widetilde{W}^{(l)}_0\|_2] \le \sigma_0(\sqrt{m_l} + \sqrt{m_{l-1}})$ from Gordon's Theorem for Gaussian matrices (Vershynin, 2012, Theorem 5.32) in the Gaussian concentration result for Lipschitz functions (Vershynin, 2012, Proposition 3.4), considering that $B \mapsto \|\sigma_0 B\|_2$ is a $\sigma_0$-Lipschitz function when the matrix $B$ is treated as a vector. Let us choose $t = \sigma_0\sqrt{2\log m}$, so that the bound holds with probability at least $1 - \frac{2}{m}$. Then:
Case 1: $l = 1$. With $m_0 = d$ and $m_1 = m$, $\|W^{(1)}_0\|_2 \le \sigma_0(\sqrt{d} + \sqrt{m} + \sqrt{2\log m}) \le \sigma_0(2\sqrt{m} + \sqrt{2\log m})$, since we are in the over-parameterized regime $m \ge d$.
Case 2: $2 \le l \le L$. With $m_l = m_{l-1} = m$, $\|W^{(l)}_0\|_2 \le \sigma_0(2\sqrt{m} + \sqrt{2\log m})$.
Now, using $\sigma_0 = \frac{\sigma_1}{2\left(1+\sqrt{\frac{\log m}{2m}}\right)}$ in both cases completes the proof, since $\sigma_0(2\sqrt{m} + \sqrt{2\log m}) = 2\sigma_0\sqrt{m}\big(1 + \sqrt{\tfrac{\log m}{2m}}\big) = \sigma_1\sqrt{m}$.

Next, we bound the spectral norm of the layerwise matrices over the ball.

Proposition A.1. Under Assumption 2, for $\theta \in B^{Spec}_{\rho,\rho_1}(\theta_0)$, with probability at least $1 - \frac{2}{m}$, $\|W^{(l)}\|_2 \le \big(\sigma_1 + \frac{\rho}{\sqrt{m}}\big)\sqrt{m}$, $l \in [L]$.

Proof. By the triangle inequality, for $l \in [L]$, $\|W^{(l)}\|_2 \le \|W^{(l)}_0\|_2 + \|W^{(l)} - W^{(l)}_0\|_2 \overset{(a)}{\le} \sigma_1\sqrt{m} + \rho$, where (a) follows from Lemma A.1 and the definition of $B^{Spec}_{\rho,\rho_1}(\theta_0)$. This completes the proof.

Next, we show that the output $\alpha^{(l)}$ of layer $l$ has $L_2$ norm bounded by $O(\sqrt{m})$.

Lemma A.2. Consider any $l \in [L]$.
Under Assumptions 1 and 2, for θ ∈ B Spec ρ,ρ1 (θ 0 ), with probability at least 1 -2l m , we have ∥α (l) ∥ 2 ≤ √ m σ 1 + ρ √ m l + √ m l i=1 σ 1 + ρ √ m i-1 |ϕ(0)| = γ l + |ϕ(0)| l i=1 γ i-1 √ m . Proof. Following Allen-Zhu et al. ( 2019); Liu et al. (2020) , we prove the result by recursion. First, recall that since ∥x∥ 2 2 = d, we have ∥α (0) ∥ 2 = √ d. Then, since m 0 = d and ϕ is 1-Lipschitz, ϕ 1 √ d W (1) α (0) 2 -∥ϕ(0)∥ 2 ≤ ϕ 1 √ d W (1) α (0) -ϕ(0) 2 ≤ 1 √ d W (1) α (0) 2 , so that ∥α (1) ∥ 2 = ϕ 1 √ d W (1) α (0) 2 ≤ 1 √ d W (1) α (0) 2 + ∥ϕ(0)∥ 2 ≤ 1 √ d ∥W (1) ∥ 2 ∥α (0) ∥ 2 + |ϕ(0)| √ m ≤ σ 1 + ρ √ m √ m + |ϕ(0)| √ m , where we used Proposition A.1 in the last inequality, which holds with probability at least 1 -2 m . For the inductive step, we assume that for some l -1, we have ∥α (l-1) ∥ 2 ≤ √ m σ 1 + ρ √ m l-1 + √ m l-1 i=1 σ 1 + ρ √ m i-1 |ϕ(0)|, which holds with the probability at least 1 -2(l-1) m . Since ϕ is 1-Lipschitz, for layer l, we have ϕ 1 √ m W (l) α (l-1) 2 -∥ϕ(0)∥ 2 ≤ ϕ 1 √ m W (l) α (l-1) -ϕ(0) 2 ≤ 1 √ m W (l) α (l-1) 2 , so that ∥α (l) ∥ 2 = ϕ 1 √ m W (l) α (l-1) 2 ≤ 1 √ m W (l) α (l-1) 2 + ∥ϕ(0)∥ 2 ≤ 1 √ m ∥W (l) ∥ 2 ∥α (l-1) ∥ 2 + √ m|ϕ(0)| (a) ≤ σ 1 + ρ √ m ∥α (l-1) ∥ 2 + √ m|ϕ(0)| (b) = √ m σ 1 + ρ √ m l + √ m l i=1 σ 1 + ρ √ m i-1 |ϕ(0)|, where (a) follows from Proposition A.1 and (b) from the inductive step. Since we have used Proposition A.1 l times, after a union bound, our result would hold with probability at least 1 -2l m . This completes the proof.
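The bounds in Lemma A.1 and Lemma A.2 can be checked numerically at initialization (where ρ = 0, so γ = σ_1). The sketch below uses softplus, a smooth 1-Lipschitz activation with ϕ(0) = log 2 ≠ 0, so that both terms of the Lemma A.2 bound are exercised; all constants are illustrative choices, not values from the paper:

```python
import numpy as np

# Check ||W^(l)_0||_2 <= sigma_1*sqrt(m) (Lemma A.1) and
# ||alpha^(l)||_2 <= sqrt(m)*(sigma_1^l + |phi(0)| * sum_{i=1}^{l} sigma_1^{i-1}) (Lemma A.2).
rng = np.random.default_rng(1)
d, m, L, sigma1 = 16, 512, 5, 1.0
sigma0 = sigma1 / (2 * (1 + np.sqrt(np.log(m) / (2 * m))))   # Assumption 2
phi = lambda z: np.log1p(np.exp(z))                          # softplus, phi(0) = log 2
phi0 = np.log(2.0)
x = rng.choice([-1.0, 1.0], size=d)                          # ||x||_2 = sqrt(d)
a, fan_in = x, d
for l in range(1, L + 1):
    W0 = sigma0 * rng.standard_normal((m, fan_in))
    assert np.linalg.norm(W0, 2) <= sigma1 * np.sqrt(m)      # Lemma A.1 (high prob.)
    a = phi(W0 @ a / np.sqrt(fan_in))
    fan_in = m
    bound = np.sqrt(m) * (sigma1**l + phi0 * sum(sigma1**(i - 1) for i in range(1, l + 1)))
    assert np.linalg.norm(a) <= bound                        # Lemma A.2 (high prob.)
print("Lemma A.1 and Lemma A.2 bounds hold at every layer")
```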

A.3 SPECTRAL NORMS OF ∂α^(l)/∂w^(l) AND ∂α^(l)/∂α^(l-1)

Recall that in our setup, the layerwise outputs and pre-activations are respectively given by:

α^(l) = ϕ(α̃^(l)) , α̃^(l) := (1/√m_{l-1}) W^(l) α^(l-1) . (26)

Lemma A.3. Consider any l ∈ {2, . . . , L}. Under Assumptions 1 and 2, for θ ∈ B^Spec_{ρ,ρ1}(θ_0), with probability at least 1 - 2/m,

∥∂α^(l)/∂α^(l-1)∥_2² ≤ (σ_1 + ρ/√m)² = γ² .

Proof. By definition, we have [∂α^(l)/∂α^(l-1)]_{i,j} = (1/√m) ϕ′(α̃^(l)_i) W^(l)_{ij}. Since ∥A∥_2 = sup_{∥v∥_2=1} ∥Av∥_2, so that ∥A∥_2² = sup_{∥v∥_2=1} Σ_i ⟨a_i, v⟩², we have that for 2 ≤ l ≤ L,

∥∂α^(l)/∂α^(l-1)∥_2² = sup_{∥v∥_2=1} (1/m) Σ_{i=1}^m ( ϕ′(α̃^(l)_i) Σ_{j=1}^m W^(l)_{ij} v_j )² ≤_(a) sup_{∥v∥_2=1} (1/m) ∥W^(l) v∥_2² = (1/m) ∥W^(l)∥_2² ≤_(b) γ² ,

where (a) follows from ϕ being 1-Lipschitz by Assumption 1 and (b) from Proposition A.1. This completes the proof.

Lemma A.4. Consider any l ∈ [L]. Under Assumptions 1 and 2, for θ ∈ B^Spec_{ρ,ρ1}(θ_0), with probability at least 1 - 2l/m,

∥∂α^(l)/∂w^(l)∥_2² ≤ (1/m) ( √m (σ_1 + ρ/√m)^{l-1} + √m Σ_{i=1}^{l-1} (σ_1 + ρ/√m)^{i-1} |ϕ(0)| )² = ( γ^{l-1} + |ϕ(0)| Σ_{i=1}^{l-1} γ^{i-1} )² .

Proof. Note that the parameter vector w^(l) = vec(W^(l)) and can be indexed with j ∈ [m] and j′ ∈ [d] when l = 1 and j′ ∈ [m] when l ≥ 2. Then, we have

[∂α^(l)/∂w^(l)]_{i,jj′} = [∂α^(l)/∂W^(l)]_{i,jj′} = (1/√m) ϕ′(α̃^(l)_i) α^(l-1)_{j′} 1_[i=j] .

For l ∈ {2, . . . , L}, noting that ∂α^(l)/∂w^(l) ∈ R^{m×m²} and ∥V∥_F = ∥vec(V)∥_2 for any matrix V, we have

∥∂α^(l)/∂w^(l)∥_2² = sup_{∥V∥_F=1} (1/m) Σ_{i=1}^m ( ϕ′(α̃^(l)_i) Σ_{j,j′=1}^m α^(l-1)_{j′} 1_[i=j] V_{jj′} )² ≤ sup_{∥V∥_F=1} (1/m) ∥V α^(l-1)∥_2² ≤ (1/m) sup_{∥V∥_F=1} ∥V∥_2² ∥α^(l-1)∥_2² ≤_(a) (1/m) ∥α^(l-1)∥_2² ≤_(b) ( γ^{l-1} + |ϕ(0)| Σ_{i=1}^{l-1} γ^{i-1} )² ,

where (a) follows from ∥V∥_2² ≤ ∥V∥_F² for any matrix V, and (b) from Lemma A.2. The l = 1 case follows in a similar manner: ∥∂α^(1)/∂w^(1)∥_2² ≤ (1/d) ∥α^(0)∥_2² = (1/d) ∥x∥_2² = 1, which satisfies the form for l = 1. That completes the proof.
A.4 (2, 2, 1)-NORMS OF ORDER 3 TENSORS Lemma A.5. Under Assumptions 1 and 2, for θ ∈ B Spec ρ,ρ1 (θ 0 ), each of the following inequalities hold with probability at least 1 -2l m , ∂ 2 α (l) (∂α (l-1) ) 2 2,2,1 ≤ β ϕ γ 2 , ( ) ∂ 2 α (l) ∂α (l-1) ∂W (l) 2,2,1 ≤ β ϕ 2   γ 2 + γ l-1 + |ϕ(0)| l-1 i=1 γ i-1 2   + 1, for l = 2, . . . , L; and ∂ 2 α (l) (∂W (l) ) 2 2,2,1 ≤ β ϕ γ l-1 + |ϕ(0)| l-1 i=1 γ i-1 2 , ( ) for l ∈ [L]. Proof. For the inequality (29), note that from (26) we obtain ∂ 2 α (l) (∂α (l-1) ) 2 i,j,k = 1 m ϕ ′′ (α (l) i )W (l) ik W (l) ij , and so ∂ 2 α (l) (∂α (l-1) ) 2 2,2,1 = sup ∥v1∥ 2 =∥v2∥ 2 =1 1 m m i=1 ϕ ′′ (α (l) i )(W (l) v 1 ) i (W (l) v 2 ) i ≤ sup ∥v1∥ 2 =∥v2∥ 2 =1 1 m β ϕ m i=1 (W (l) v 1 ) i (W (l) v 2 ) i (a) ≤ sup ∥v1∥ 2 =∥v2∥ 2 =1 1 2m β ϕ m i=1 (W (l) v 1 ) 2 i + (W (l) v 2 ) 2 i ≤ 1 2m β ϕ sup ∥v1∥ 2 =∥v2∥ 2 =1 (∥W (l) v 1 ∥ 2 2 + ∥W (l) v 2 ∥ 2 2 ) ≤ 1 2m β ϕ (∥W (l) ∥ 2 2 + ∥W (l) ∥ 2 2 ) (b) ≤ β ϕ (σ 1 + ρ/ √ m) 2 = β ϕ γ 2 , where (a) follows from 2ab ≤ a 2 + b 2 for a, b ∈ R, and (b) from Proposition A.1, with probability at least 1 -2 m . For the inequality (30), carefully following the chain rule in (28) we obtain ∂ 2 α (l) ∂α (l-1) ∂W (l) i,jj ′ ,k = 1 m ϕ ′′ (α (l) i )W (l) ik α (l-1) j ′ 1 [j=i] + 1 √ m ϕ ′ (α (l) i )1 [i=j] 1 [j ′ =k] . 
Then, we have ∂ 2 α (l) ∂α (l-1) ∂W (l) 2,2,1 = sup ∥v1∥ 2 =∥V2∥ F =1 m i=1 m k=1 m j=1 m j ′ =1 1 m ϕ ′′ (α (l) i )W (l) ik α (l-1) j ′ 1 [j=i] + 1 √ m ϕ ′ (α (l) i )1 [i=j] 1 [j ′ =k] v 1,k V 2,jj ′ = sup ∥v1∥ 2 =∥V2∥ F =1 m i=1 1 m m j ′ =1 ϕ ′′ (α (l) i )α (l-1) j ′ V 2,ij ′ m k=1 W (l) ik v 1,k + 1 √ m m k=1 ϕ ′ ( α(l) i )v 1,k V 2,ik ≤ sup ∥v1∥ 2 =∥V2∥ F =1 1 m β ϕ m i=1 (W (l) v 1 ) i (V 2 α (l-1) ) i + 1 √ m m i=1 m k=1 |v 1,k V 2,ik | ≤ sup ∥v1∥ 2 =∥v2∥ F =1 1 2m β ϕ m i=1 (W (l) v 1 ) 2 i + (V 2 α (l-1) ) 2 i + 1 √ m m i=1 ∥v 1 ∥ 2 V 2,i,: 2 = sup ∥v1∥ 2 =∥V2∥ F =1 1 2m β ϕ ( W (l) v 1 2 2 + V 2 α (l-1) 2 2 ) + 1 √ m m i=1 V 2,i,: 2 (a) ≤ 1 2m β ϕ ( W (l) 2 2 + α (l-1) 2 2 ) + ∥V 2 ∥ F (b) ≤ β ϕ 2   γ 2 + γ l-1 + |ϕ(0)| l-1 i=1 γ i-1 2   + 1 where (a) follows from V 2 α (l-1) 2 ≤ ∥V 2 ∥ 2 α l-1 ≤ ∥V 2 ∥ F α l-1 2 = α l-1 2 and m i=1 V 2,i,: 2 ≤ √ m m i=1 V 2,i,: 2 , and (b) follows from Proposition A.1 and Lemma A.2, with altogether holds with probability at least 1 -2l m . For the last inequality (31), we start with the analysis for l ≥ 2. Carefully following the chain rule in (28) we obtain ∂ 2 α (l) (∂W (l) ) 2 i,jj ′ ,kk ′ = 1 m ϕ ′′ (α (l) i )α (l-1) k ′ α (l-1) j ′ 1 [j=i] 1 [k=i] . Then, we have ∂ 2 α (l) (∂W (l) ) 2 2,2,1 = sup ∥V1∥ F =∥V2∥ F =1 m i=1 m j,j ′ =1 m k,k ′ =1 1 m ϕ ′′ ( α(l) i )α (l-1) k ′ α (l-1) j ′ 1 [j=i] 1 [k=i] V 1,jj ′ V 2,kk ′ = sup ∥V1∥ F =∥V2∥ F =1 m i=1 ϕ ′′ (α (l) i ) m m j ′ =1 α (l-1) j ′ V 1,ij ′ m k ′ =1 α (l-1) k ′ V 2,ik ′ ≤ sup ∥V1∥ F =∥V2∥ F =1 1 m β ϕ m i=1 (V 1 α (l-1) ) i (V 2 α (l-1) ) i ≤ sup ∥V1∥ F =∥v2∥ F =1 1 2m β ϕ m i=1 (V 2 α (l-1) ) 2 i + (V 2 α (l-1) ) 2 i = sup ∥V1∥ F =∥V2∥ F =1 1 2m β ϕ ( V 2 α (l-1) 2 2 + V 2 α (l-1) 2 2 ) ≤ 1 2m β ϕ ( α (l-1) 2 2 + α (l-1) 2 2 ) ≤ β ϕ γ l-1 + |ϕ(0)| l-1 i=1 γ i-1 2 , which holds with probability at least 1 -2(l-1) m . 
For the case l = 1, it is easy to show that [∂²α^(1)/(∂W^(1))²]_{i,jj′,kk′} = (1/d) ϕ′′(α̃^(1)_i) x_{k′} x_{j′} 1_[j=i] 1_[k=i], and so ∥∂²α^(1)/(∂W^(1))²∥_{2,2,1} ≤ β_ϕ.

A.5 NORMS OF b^(l)

Recall that b^(l) denotes the gradient of the output with respect to the layerwise output α^(l):

b^(l) = ∂f/∂α^(l) = ( Π_{l′=l+1}^L ∂α^(l′)/∂α^(l′-1) ) ∂f/∂α^(L) = ( Π_{l′=l+1}^L (1/√m) (W^(l′))^⊤ D^(l′) ) (1/√m) v ,

where D^(l′) is a diagonal matrix of the gradient of activations, i.e., D^(l′)_{ii} = ϕ′(α̃^(l′)_i). Note that we also have the following recursion:

b^(l) = ∂f/∂α^(l) = (∂α^(l+1)/∂α^(l)) ∂f/∂α^(l+1) = (1/√m) (W^(l+1))^⊤ D^(l+1) b^(l+1) .

Lemma A.6. Consider any l ∈ [L]. Under Assumptions 1 and 2, for θ ∈ B^Spec_{ρ,ρ1}(θ_0), with probability at least 1 - 2(L-l+1)/m,

∥b^(l)∥_2 ≤ (1/√m) (σ_1 + ρ/√m)^{L-l} (1 + ρ_1) and ∥b^(l)_0∥_2 ≤ σ_1^{L-l}/√m ≤ γ^{L-l}/√m .

Proof. First, note that

∥b^(L)∥_2 = (1/√m) ∥v∥_2 ≤ (1/√m) (∥v_0∥_2 + ∥v - v_0∥_2) ≤ (1/√m)(1 + ρ_1) ,

where the inequality follows from the triangle inequality, ∥v_0∥_2 = 1 (Assumption 2), and ∥v - v_0∥_2 ≤ ρ_1. Now, for the inductive step, assume ∥b^(l)∥_2 ≤ (σ_1 + ρ/√m)^{L-l} (1/√m)(1 + ρ_1), which holds with probability at least 1 - 2(L-l)/m. Then,

∥b^(l-1)∥_2 = ∥(∂α^(l)/∂α^(l-1)) b^(l)∥_2 ≤ ∥∂α^(l)/∂α^(l-1)∥_2 ∥b^(l)∥_2 ≤ (σ_1 + ρ/√m) (σ_1 + ρ/√m)^{L-l} (1/√m)(1 + ρ_1) = (σ_1 + ρ/√m)^{L-l+1} (1/√m)(1 + ρ_1) ,

where the last inequality follows from Lemma A.3. Since we use Proposition A.1 once at layer L and then Lemma A.3 (L - l) times down to layer l, a union bound gives that everything holds altogether with probability at least 1 - 2(L-l+1)/m. This finishes the proof by induction.

Lemma A.7. Consider any l ∈ [L]. Under Assumptions 1 and 2, for θ ∈ B^Spec_{ρ,ρ1}(θ_0), with probability at least 1 - 2(L-l)/m,

∥b^(l)∥_∞ ≤ (γ^{L-l}/√m) (1 + ρ_1) .

Proof. For any l ∈ [L], by definition the i-th component of b^(l), i.e., b^(l)_i, takes the form

b^(l)_i = (∂α^(L)/∂α^(l)_i)^⊤ ∂f/∂α^(L) = (∂α^(L)/∂α^(l)_i)^⊤ (1/√m) v .
Then, with W (l) :,i denoting the i-th column of the matrix W (l) , ∂α (L) ∂α (l) i 2 (a) = ϕ ′ ( α(l) i ) √ m W (l) :,i ⊤ L l ′ =l+2 ∂α (l ′ ) ∂α (l ′ -1) 2 (b) ≤ 1 √ m W (l) :,i 2 L l ′ =l+2 ∂α (l ′ ) ∂α (l ′ -1) 2 (c) ≤ 1 √ m W (l) :,i 2 γ L-l-1 (d) ≤ γ γ L-l-1 = γ L-l (36) where (a) follows from ∂α (l+1) ∂α (l) i = 1 √ m ϕ ′ (α (l) i )(W (l) :,i ) ⊤ , (b) from ϕ being 1-Lipschitz, (c) from Lemma A.3, and (d) from W (l) :,i 2 ≤ W (l) 2 and Proposition A.1, which altogether holds with probability 1 -2 m (L -l). Therefore, for every i ∈ [m], b (l) i ≤ 1 √ m ∂α (L) ∂α (l) i v ≤ 1 √ m ∂α (L) ∂α (l) i 2 ∥v∥ 2 ≤ 1 √ m γ L-l (1 + ρ 1 ) , where the last inequality follows from (36) and ∥v∥ 2 ≤ ∥v 0 ∥ 2 +∥v -v 0 ∥ 2 ≤ 1+ρ 1 . This completes the proof. A.6 USEFUL BOUNDS Lemma 4.1 (Predictor gradient bounds). Under Assumptions 1 and 2, for θ ∈ B Spec ρ,ρ1 (θ 0 ), with probability at least 1 -2(L+1) m , we have ∥∇ θ f (θ; x)∥ 2 ≤ ϱ and ∥∇ x f (θ; x)∥ 2 ≤ γ L √ m (1 + ρ 1 ) , with ϱ 2 = (h(L + 1)) 2 + 1 m (1 + ρ 1 ) 2 L l=1 (h(l)) 2 γ 2(L-l) , γ = σ 1 + ρ √ m , h(l) = γ l-1 + |ϕ(0)| l-1 i=1 γ i-1 . Proof. We first prove the bound on the gradient with respect to the weights. Using the chain rule, ∂f ∂w (l) = ∂α (l) ∂w (l) L l ′ =l+1 ∂α (l ′ ) ∂α (l ′ -1) ∂f ∂α (L) and so ∂f ∂w (l) 2 2 ≤ ∂α (l) ∂w (l) 2 2 L l ′ =l+1 ∂α (l ′ ) ∂α (l ′ -1) ∂f ∂α (L) 2 2 (a) ≤ ∂α (l) ∂w (l) 2 2 γ 2(L-l) • 1 m (1 + ρ 1 ) 2 (b) ≤ γ l-1 + |ϕ(0)| l-1 i=1 γ i-1 2 γ 2(L-l) • 1 m (1 + ρ 1 ) 2 for l ∈ [L] , where (a) follows from Lemma A.6, (b) follows from Lemma A.4. Similarly, ∂f ∂w (L+1) 2 2 = 1 m α (L) 2 2 ≤ γ L + |ϕ(0)| L i=1 γ i-1 2 , where we used Lemma A.2 for the inequality. Now, ∥∇ θ f ∥ 2 2 = L+1 l=1 ∂f ∂w (l) 2 2 (a) ≤ γ L + |ϕ(0)| L i=1 γ i-1 2 + 1 m (1 + ρ 1 ) 2 L l=1 γ l-1 + |ϕ(0)| l-1 i=1 γ i-1 2 γ 2(L-l) = ϱ 2 , where (a) follows with probability 1 -2 m (L + 1) using a union bound from all the previously used results. We now prove the bound on the gradient with respect to the input data. 
Using the chain rule, ∂f/∂x = ∂f/∂α^(0) = (∂α^(1)/∂α^(0)) ( Π_{l′=2}^L ∂α^(l′)/∂α^(l′-1) ) ∂f/∂α^(L), and so

∥∂f/∂x∥_2 ≤ ∥∂α^(1)/∂α^(0)∥_2 ( Π_{l′=2}^L ∥∂α^(l′)/∂α^(l′-1)∥_2 ) ∥∂f/∂α^(L)∥_2 ≤_(a) γ · γ^{L-1} · (1/√m)(1 + ρ_1) = (γ^L/√m)(1 + ρ_1) ,

where (a) follows from Lemma A.3 and Lemma A.6 with probability at least 1 - 2L/m due to a union bound. This completes the proof.

Proof (of Lemma 4.2). We start by noticing that for θ ∈ B^Spec_{ρ,ρ1}(θ_0),

L(θ) = (1/n) Σ_{i=1}^n (y_i - f(θ; x_i))² ≤ (1/n) Σ_{i=1}^n (2y_i² + 2|f(θ; x_i)|²) . (37)

Now, let us consider the particular case θ = θ_0 and a generic ∥x∥_2 = √d. Let α^(l)_o be the layerwise output of layer l at initialization. Then,

|f(θ_0; x)| = |(1/√m) v_0^⊤ α^(L)_o(x)| ≤ (1/√m) ∥v_0∥_2 ∥α^(L)_o(x)∥_2 ≤_(a) (1/√m) · 1 · ∥α^(L)_o(x)∥_2 ≤_(b) (1/√m) (σ_1^L + |ϕ(0)| Σ_{i=1}^L σ_1^{i-1}) √m = g(σ_1) ,

where (a) follows by assumption and (b) follows from the same proof as in Lemma A.2, with the difference that we consider the weights at initialization. Replacing this result back in (37), we obtain L(θ_0) ≤ c_{0,σ1}, i.e., c_{0,σ1} = (2/n) Σ_{i=1}^n y_i² + 2g(σ_1)². Now, let us consider the general case of θ ∈ B^Spec_{ρ,ρ1}(θ_0):

|f(θ; x)| = |(1/√m) v^⊤ α^(L)(x)| ≤ (1/√m) ∥v∥_2 ∥α^(L)(x)∥_2 ≤_(a) (1/√m) (1 + ρ_1) ∥α^(L)(x)∥_2 ≤_(b) (1/√m) (1 + ρ_1) (γ^L + |ϕ(0)| Σ_{i=1}^L γ^{i-1}) √m = (1 + ρ_1) g(γ) ,

where (a) follows from ∥v∥_2 ≤ ∥v_0∥_2 + ∥v - v_0∥_2 ≤ 1 + ρ_1, and (b) follows from Lemma A.2. Replacing this result back in (37), we obtain L(θ) ≤ c_{ρ1,γ}, i.e., c_{ρ1,γ} = (2/n) Σ_{i=1}^n y_i² + 2(1 + ρ_1)² g(γ)². In either case, a union bound gives the probability with which the results hold. This finishes the proof.

Corollary 4.1 (Loss gradient bound). Consider the square loss. Under Assumptions 1 and 2, for θ ∈ B^Spec_{ρ,ρ1}(θ_0), with probability at least 1 - 2(L+1)/m, we have ∥∇_θ L(θ)∥_2 ≤ 2√(L(θ)) ϱ ≤ 2√(c_{ρ1,γ}) ϱ, with ϱ as in Lemma 4.1 and c_{ρ1,γ} as in Lemma 4.2.

Proof.
We have that

∥∇_θ L(θ)∥_2 = ∥(1/n) Σ_{i=1}^n ℓ′_i ∇_θ f∥_2 ≤ (1/n) Σ_{i=1}^n |ℓ′_i| ∥∇_θ f∥_2 ≤_(a) (2ϱ/n) Σ_{i=1}^n |y_i - ŷ_i| ≤ 2ϱ √(L(θ)) ≤_(b) 2√(c_{ρ1,γ}) ϱ ,

where (a) follows from Lemma 4.1 and |ℓ′_i| = 2|y_i - ŷ_i| for the square loss, the next-to-last step follows from Jensen's inequality (the average of |y_i - ŷ_i| is at most their root-mean-square), and (b) follows from Lemma 4.2.
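The backward recursion b^(l) = (1/√m)(W^(l+1))^⊤ D^(l+1) b^(l+1) underlying Lemmas A.6 and A.7 is easy to check numerically at initialization (ρ = ρ_1 = 0, so γ = σ_1 and the Lemma A.6 bound becomes σ_1^{L-l}/√m). This is a sanity-check sketch with tanh standing in for ϕ and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, L, sigma1 = 8, 512, 5, 1.0
sigma0 = sigma1 / (2 * (1 + np.sqrt(np.log(m) / (2 * m))))   # Assumption 2
x = rng.choice([-1.0, 1.0], size=d)
Ws = [sigma0 * rng.standard_normal((m, d))] + \
     [sigma0 * rng.standard_normal((m, m)) for _ in range(L - 1)]
v = rng.standard_normal(m); v /= np.linalg.norm(v)           # ||v_0||_2 = 1
# forward pass, storing pre-activations alpha_tilde^(l)
a, zs = x, []
for W in Ws:
    z = W @ a / np.sqrt(len(a)); zs.append(z); a = np.tanh(z)
# backward recursion b^(L) -> b^(1)
b = v / np.sqrt(m)                                           # b^(L) = v/sqrt(m)
assert np.linalg.norm(b) <= 1 / np.sqrt(m) + 1e-12
for l in range(L - 1, 0, -1):
    D = 1 - np.tanh(zs[l])**2                                # phi'(alpha_tilde^(l+1))
    b = Ws[l].T @ (D * b) / np.sqrt(m)                       # b^(l)
    assert np.linalg.norm(b) <= sigma1**(L - l) / np.sqrt(m) # Lemma A.6 (high prob.)
print("Lemma A.6 bound holds at every layer")
```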

A.7 REGARDING THE NETWORK ARCHITECTURES IN OUR WORK AND LIU ET AL. (2020)'S

A difference between the neural network used in our work and the one in Liu et al. (2020) is that we normalize the norm of the last layer's weights at initialization, whereas Liu et al. (2020) do not. However, if we did not normalize our last layer, then our result on the Hessian spectral norm bound would still hold with Õ (hiding logarithmic factors) instead of O; consequently, our comparison with Liu et al. (2020) on the dependence on the network's depth L (our polynomial dependence against their exponential dependence) would still hold as stated in Remark 4.2.

B RESTRICTED STRONG CONVEXITY

We establish the results from Section 5 in this appendix.

B.1 RESTRICTED STRONG CONVEXITY AND SMOOTHNESS

Theorem 5.1 (RSC for Square Loss). For square loss, under Assumptions 1 and 2, with probability at least (1 - 2(L+1)/m), ∀θ′ ∈ Q^t_κ ∩ B^Spec_{ρ,ρ1}(θ_0) ∩ B^Euc_{ρ2}(θ_t) with θ_t ∈ B^Spec_{ρ,ρ1}(θ_0),

L(θ′) ≥ L(θ_t) + ⟨θ′ - θ_t, ∇L(θ_t)⟩ + (α_t/2) ∥θ′ - θ_t∥_2² , with α_t = c_1 ∥ḡ_t∥_2² - c_2/√m , (38)

where ḡ_t = (1/n) Σ_{i=1}^n ∇_θ f(θ_t; x_i), c_1 = 2κ², and c_2 = 2c_H (2ϱρ_2 + √(c_{ρ1,γ})). Consequently, L satisfies RSC w.r.t. (Q^t_κ ∩ B^Spec_{ρ,ρ1}(θ_0) ∩ B^Euc_{ρ2}(θ_t), θ_t) whenever α_t > 0.

Proof. For any θ′ ∈ Q^t_κ ∩ B^Spec_{ρ,ρ1}(θ_0) ∩ B^Euc_{ρ2}(θ_t), by the second order Taylor expansion around θ_t, we have

L(θ′) = L(θ_t) + ⟨θ′ - θ_t, ∇_θ L(θ_t)⟩ + (1/2) (θ′ - θ_t)^⊤ [∂²L(θ̃_t)/∂θ²] (θ′ - θ_t) ,

where θ̃_t = ξθ′ + (1 - ξ)θ_t for some ξ ∈ [0, 1]. We note that θ̃_t ∈ B^Spec_{ρ,ρ1}(θ_0) since:

• ∥W̃^(l)_t - W^(l)_0∥_2 = ∥ξW′^(l) - ξW^(l)_0 + (1 - ξ)W^(l)_t - (1 - ξ)W^(l)_0∥_2 ≤ ξ∥W′^(l) - W^(l)_0∥_2 + (1 - ξ)∥W^(l)_t - W^(l)_0∥_2 ≤ ρ, for any l ∈ [L], where the last inequality follows from our assumption θ′, θ_t ∈ B^Spec_{ρ,ρ1}(θ_0); and

• ∥W̃^(L+1)_t - W^(L+1)_0∥_2 = ∥ṽ - v_0∥_2 ≤ ρ_1, by following a similar derivation as in the previous point.

Focusing on the quadratic form in the Taylor expansion and recalling the form of the Hessian, we get

(θ′ - θ_t)^⊤ [∂²L(θ̃_t)/∂θ²] (θ′ - θ_t) = (θ′ - θ_t)^⊤ [ (1/n) Σ_{i=1}^n ( ℓ′′_i (∂f(θ̃_t; x_i)/∂θ)(∂f(θ̃_t; x_i)/∂θ)^⊤ + ℓ′_i ∂²f(θ̃_t; x_i)/∂θ² ) ] (θ′ - θ_t) = (1/n) Σ_{i=1}^n ℓ′′_i ⟨θ′ - θ_t, ∂f(θ̃_t; x_i)/∂θ⟩² [=: I_1] + (1/n) Σ_{i=1}^n ℓ′_i (θ′ - θ_t)^⊤ [∂²f(θ̃_t; x_i)/∂θ²] (θ′ - θ_t) [=: I_2] ,

where ℓ_i = ℓ(y_i, f(θ̃_t, x_i)), ℓ′_i = ∂ℓ(y_i, z)/∂z |_{z=f(θ̃_t, x_i)}, and ℓ′′_i = ∂²ℓ(y_i, z)/∂z² |_{z=f(θ̃_t, x_i)}.
Now, note that I 1 = 1 n n i=1 ℓ ′′ i θ ′ -θ t , ∂f ( θt ; x i ) ∂θ 2 ≥ 2 n n i=1 θ ′ -θ t , ∂f (θ t ; x i ) ∂θ + ∂f ( θt ; x i ) ∂θ - ∂f (θ t ; x i ) ∂θ 2 = 2 n n i=1 θ ′ -θ t , ∂f (θ t ; x i ) ∂θ 2 + 2 n n i=1 θ ′ -θ t , ∂f ( θt ; x i ) ∂θ - ∂f t ; x i ) ∂θ 2 + 4 n n i=1 θ ′ -θ t , ∂f (θ t ; x i ) ∂θ θ ′ -θ t , ∂f ( θt ; x i ) ∂θ - ∂f (θ t ; x i ) ∂θ (a) ≥ 2 n n i=1 θ ′ -θ t , ∂f (θ t ; x i ) ∂θ 2 - 4 n n i=1 ∂f (θ t ; x i ) ∂θ 2 ∂f ( θt ; x i ) ∂θ - ∂f (θ t ; x i ) ∂θ 2 × ∥θ ′ -θ t ∥ 2 2 (b) ≥ 2 θ ′ -θ t , 1 n n i=1 ∂f (θ t ; x i ) ∂θ 2 - 4 n n i=1 ϱ c H √ m ∥ θt -θ t ∥ 2 ∥θ ′ -θ t ∥ 2 2 (c) ≥ 2 θ ′ -θ t , 1 n n i=1 ∂f (θ t ; x i ) ∂θ 2 - 4ϱc H √ m ∥θ ′ -θ t ∥ 3 2 (d) ≥ 2κ 2 1 n n i=1 ∂f (θ t ; x i ) ∂θ 2 2 ∥θ ′ -θ t ∥ 2 2 - 4ϱc H √ m ∥θ ′ -θ t ∥ 3 2 =   2κ 2 1 n n i=1 ∂f (θ t ; x i ) ∂θ 2 2 - 4ϱc H ∥θ ′ -θ t ∥ 2 √ m   ∥θ ′ -θ t ∥ 2 2 , where (a) follows by Cauchy-Schwartz inequality; (b) follows by Jensen's inequality (first term) and the use of Theorem 4.1 and Lemma 4. 1 due to θt ∈ B Spec ρ,ρ1 (θ 0 ); (c) follows from θt -θ t 2 = ∥ξθ ′ + (1 -ξ)θ t -θ t ∥ 2 = ξ ∥θ ′ -θ t ∥ ≤ ∥θ ′ -θ t ∥ 2 ; (d) follows since θ ′ ∈ Q t κ and from the fact that p ⊤ q = cos(p, q) ∥p∥ ∥q∥ for any vectors p, q. For analyzing I 2 , first note that for square loss, ℓ ′ i,t = 2(ỹ i,t -y i ) with ỹi,t = f ( θt ; x i ), so that for the vector [ℓ ′ i,t ] i , we have 1 n ∥[ℓ ′ i,t ] i ∥ 2 2 = 4 n n i=1 (ỹ i,t -y i ) 2 = 4L(θ t ). Further, with Q t,i = (θ ′ -θ t ) ⊤ ∂ 2 f ( θt;xi) ∂θ 2 (θ ′ -θ t ), we have |Q t,i | = (θ ′ -θ t ) ⊤ ∂ 2 f ( θt ; x i ) ∂θ 2 (θ ′ -θ t ) ≤ ∥θ ′ -θ t ∥ 2 2 ∂ 2 f ( θt ; x i ) ∂θ 2 2 ≤ c H ∥θ ′ -θ t ∥ 2 2 √ m . Now, note that I 2 = 1 n n i=1 ℓ ′ i,t Q t,i ≥ - n i=1 1 √ n ℓ ′ i,t 1 √ n Q t,i ≥ - 1 n ∥[ℓ ′ i,t ]∥ 2 2 1/2 1 n n i=1 Q 2 t,i 1/2 ≥ -2 L( θt ) c H ∥θ ′ -θ t ∥ 2 2 √ m , where (a) follows by Cauchy-Schwartz inequality. 
Putting the lower bounds on I 1 and I 2 back, with ḡt = 1 n n i=1 (θt;xi) ∂θ , we have (θ ′ -θ t ) ⊤ ∂ 2 L( θt ) ∂θ 2 (θ ′ -θ t ) ≥   2κ 2 ∥ḡ t ∥ 2 2 - 4ϱc H ∥θ ′ -θ t ∥ 2 + 2c H L( θt ) √ m   ∥θ ′ -θ t ∥ 2 2 . Now, since θ ′ ∈ B Euc ρ2 (θ t ), ∥θ ′ -θ t ∥ 2 ≤ ρ 2 , so we have (θ ′ -θ t ) ⊤ ∂ 2 L( θt ) ∂θ 2 (θ ′ -θ t ) ≥   2κ 2 ∥ḡ t ∥ 2 2 - 4ϱc H ρ 2 + 2c H L( θt ) √ m   ∥θ ′ -θ t ∥ 2 2 ≥ 2κ 2 ∥ḡ t ∥ 2 2 - 4ϱc H ρ 2 + 2c H √ c ρ1,γ √ m ∥θ ′ -θ t ∥ 2 2 , where the last inequality follows from Lemma 4.2. That completes the proof. Next, we state and prove the RSC result for general losses. Theorem B.1 (RSC of Loss). Under Assumptions 1, 2 and 3, with probability at least (1 -2(L+1) m ), ∀θ ′ ∈ Q t κ ∩ B Spec ρ,ρ1 (θ 0 ) ∩ B Euc ρ2 (θ t ), L(θ ′ ) ≥ L(θ t ) + ⟨θ ′ -θ t , ∇ θ L(θ t )⟩ + α t 2 ∥θ ′ -θ t ∥ 2 2 , with α t = c 1 ∥ḡ t ∥ 2 2 - c 4 + c 4,t √ m , where ḡt = 1 n n i=1 ∇ θ f (θ t ; x i ), c 1 = aκ 2 , c 4 = 2aϱc H ρ 2 , c H is as in Theorem 4.1, and c 4,t = c H √ λ t where λ t = 1 n n i=1 (ℓ ′ i,t ) 2 with ℓ ′ i = ∂ℓ(yi,z) ∂z z=f ( θt,xi)) and θ ∈ B Spec ρ,ρ1 being some point in the segment that joins θ ′ and θ t . Consequently, L satisfies RSC w.r.t. (Q t κ ∩ B Spec ρ (θ 0 ) ∩ B Euc ρ2 (θ t ), θ t ) whenever α t > 0. Proof. For any θ ′ ∈ Q t κ/2 ∩ B Euc ρ,ρ1 (θ 0 ), by the second order Taylor expansion around θ t , we have L(θ ′ ) = L(θ t ) + ⟨θ ′ -θ t , ∇ θ L(θ t )⟩ + 1 2 (θ ′ -θ t ) ⊤ ∂ 2 L( θt ) ∂θ 2 (θ ′ -θ t ) , where θt = ξθ ′ + (1 -ξ)θ t for some ξ ∈ [0, 1]. We note that θt ∈ B Spec ρ,ρ1 (θ 0 ) since, • W (l) t -W (l) 0 2 = ξW ′ (l) -ξW (l) 0 + (1 -ξ)W (l) t -(1 -ξ)W (l) 0 2 ≤ ξ W ′ (l) -W (l) 0 2 + (1 -ξ) W (l) t -W (l) 0 2 ≤ ρ, for any l ∈ [L], where the last inequality follows from our assumption θ ′ , θ t ∈ B Spec ρ,ρ1 (θ 0 ); and • W (L+1) t -W (L+1) 0 2 = ∥ṽ -v 0 ∥ 2 ≤ ρ 1 , by following a similar derivation as in the previous point. 
Focusing on the quadratic form in the Taylor expansion and recalling the form of the Hessian, we get (θ ′ -θ t ) ⊤ ∂ 2 L( θt ) ∂θ 2 (θ ′ -θ t ) = (θ ′ -θ t ) ⊤ 1 n n i=1 ℓ ′′ i ∂f ( θt ; x i ) ∂θ ∂f ( θt ; x i ) ∂θ ⊤ + ℓ ′ i ∂ 2 f ( θt ; x i ) ∂θ 2 (θ ′ -θ t ) = 1 n n i=1 ℓ ′′ i θ ′ -θ t , ∂f ( θt ; x i ) ∂θ 2 I1 + 1 n n i=1 ℓ ′ i (θ ′ -θ t ) ⊤ ∂ 2 f ( θt ; x i ) ∂θ 2 (θ ′ -θ t ) I2 , where ℓ i = ℓ(y i , f ( θt , x i )), ℓ ′ i = ∂ℓ(yi,z) ∂z z=f ( θt,xi)) , and ℓ ′′ i = ∂ 2 ℓ(yi,z) ∂z 2 z=f ( θt,xi)) . Now, that I 1 = 1 n n i=1 ℓ ′′ i θ ′ -θ t , ∂f ( θt ; x i ) ∂θ 2 ≥ 2 n n i=1 θ ′ -θ t , ∂f (θ t ; x i ) ∂θ + ∂f ( θt ; x i ) ∂θ - ∂f (θ t ; x i ) ∂θ 2 = 2 n n i=1 θ ′ -θ t , ∂f (θ t ; x i ) ∂θ 2 + 2 n n i=1 θ ′ -θ t , ∂f ( θt ; x i ) ∂θ - ∂f (θ t ; x i ) ∂θ 2 + 4 n n i=1 θ ′ -θ t , ∂f (θ t ; x i ) ∂θ θ ′ -θ t , ∂f ( θt ; x i ) ∂θ - ∂f (θ t ; x i ) ∂θ (a) ≥ 2 n n i=1 θ ′ -θ t , ∂f (θ t ; x i ) ∂θ 2 - 4 n n i=1 ∂f (θ t ; x i ) ∂θ 2 ∂f ( θt ; x i ) ∂θ - ∂f (θ t ; x i ) ∂θ 2 ∥θ ′ -θ t ∥ 2 2 (b) ≥ a θ ′ -θ t , 1 n n i=1 ∂f (θ t ; x i ) ∂θ 2 - 4 n n i=1 ϱ c H √ m ∥ θt -θ t ∥ 2 ∥θ ′ -θ t ∥ 2 2 (c) ≥ a θ ′ -θ t , 1 n n i=1 ∂f (θ t ; x i ) ∂θ 2 - 2aϱc H √ m ∥θ ′ -θ t ∥ 3 2 (d) ≥ aκ 2 1 n n i=1 ∂f (θ t ; x i ) ∂θ 2 2 ∥θ ′ -θ t ∥ 2 2 - 2aϱc H √ m ∥θ ′ -θ t ∥ 3 2 =   aκ 2 1 n n i=1 ∂f (θ t ; x i ) ∂θ 2 2 - 2aϱc H ∥θ ′ -θ t ∥ 2 √ m   ∥θ ′ -θ t ∥ 2 2 , where (a) follows by Cauchy-Schwartz inequality; (b) follows by Jensen's inequality (first term) and the use of Theorem 4.1 and Lemma 4. 1 due to θt ∈ B Spec ρ,ρ1 (θ 0 ); (c) follows from θt -θ t 2 = ∥ξθ ′ + (1 -ξ)θ t -θ t ∥ 2 = ξ ∥θ ′ -θ t ∥ ≤ ∥θ ′ -θ t ∥ 2 ; (d) follows since θ ′ ∈ Q t κ and from the fact that p ⊤ q = cos(p, q) ∥p∥ ∥q∥ for any vectors p, q. For analyzing I 2 , let λ t := 1 n n i=1 (ℓ ′ i,t ) 2 . As before, with Q t,i = (θ ′ -θ t ) ⊤ ∂ 2 f ( θt;xi) ∂θ 2 (θ ′ -θ t ), we have |Q t,i | = (θ ′ -θ t ) ⊤ ∂ 2 f ( θt ; x i ) ∂θ 2 (θ ′ -θ t ) ≤ ∥θ ′ -θ t ∥ 2 2 ∂ 2 f ( θt ; x i ) ∂θ 2 2 ≤ c H ∥θ ′ -θ t ∥ 2 2 √ m . 
Now, note that I 2 = 1 n n i=1 ℓ ′ i,t Q t,i ≥ - n i=1 1 √ n ℓ ′ i,t 1 √ n Q t,i ≥ - 1 n ∥[ℓ ′ i,t ]∥ 2 2 1/2 1 n n i=1 Q 2 t,i 1/2 ≥ -λ t c H ∥θ ′ -θ t ∥ 2 2 √ m , where (a) follows by Cauchy-Schwartz inequality. Putting the lower bounds on I 1 and I 2 back, with ḡt = 1 n n i=1 ∂f (θt;xi) ∂θ , we have (θ ′ -θ t ) ⊤ ∂ 2 L( θt ) ∂θ 2 (θ ′ -θ t ) ≥ aκ 2 ∥ḡ t ∥ 2 2 - 2aϱc H ∥θ ′ -θ t ∥ 2 + c H √ λ t √ m ∥θ ′ -θ t ∥ 2 2 . since θ ′ ∈ B Euc ρ2 (θ t ), ∥θ ′ -θ t ∥ 2 ≤ ρ 2 , so we have (θ ′ -θ t ) ⊤ ∂ 2 L( θt ) ∂θ 2 (θ ′ -θ t ) ≥ aκ 2 ∥ḡ t ∥ 2 2 - 2aϱc H ρ 2 + c H √ λ t √ m ∥θ ′ -θ t ∥ 2 2 . That completes the proof. Theorem 5.2 (Local Smoothness for Square Loss). For square loss, under Assumptions 1 and 2, with probability at least (1 -2(L+1) m ), ∀θ, θ ′ ∈ B Spec ρ,ρ1 (θ 0 ), L(θ ′ ) ≤ L(θ) + ⟨θ ′ -θ, ∇ θ L(θ)⟩ + β 2 ∥θ ′ -θ∥ 2 2 , with β = 2ϱ 2 + 2c H √ c ρ1,γ √ m , with c H as in Theorem 4.1, ϱ as in Lemma 4.1, and c ρ1,γ as in Lemma 4.2. Consequently, L is locally β-smooth. Moreover, if γ (and so σ 1 and ρ) is chosen according to the desirable operating regimes (see Remark 4.1) and ρ 1 = O(poly(L)), then β = O(poly(L)). Proof. By the second order Taylor expansion about θ ∈ B Spec ρ,ρ1 (θ 0 ), we have L(θ ′ ) = L( θ) + ⟨θ ′ - θ, ∇ θ L( θ)⟩ + 1 2 (θ ′ -θ) ⊤ ∂ 2 L( θ) ∂θ 2 (θ ′ -θ), where θ = ξθ ′ + (1 -ξ) θ for some ξ ∈ [0, 1]. Then, (θ ′ -θ) ⊤ ∂ 2 L( θ) ∂θ 2 (θ ′ -θ) = (θ ′ -θ) ⊤ 1 n n i=1 ℓ ′′ i ∂f ( θ; x i ) ∂θ ∂f ( θ; x i ) ∂θ ⊤ + ℓ ′ i ∂ 2 f ( θ; x i ) ∂θ 2 (θ ′ -θ) = 1 n n i=1 ℓ ′′ i θ ′ -θ, ∂f ( θ; x i ) ∂θ 2 I1 + 1 n n i=1 ℓ ′ i (θ ′ -θ) ⊤ ∂ 2 f ( θ; x i ) ∂θ 2 (θ ′ -θ) I2 , where ℓ i = ℓ(y i , f ( θ, x i )), ℓ ′ i = ∂ℓ(yi,z) ∂z z=f ( θ,xi)) , and ℓ ′′ i = ∂ 2 ℓ(yi,z) ∂z 2 z=f ( θ,xi)) . Now, note that I 1 = 1 n n i=1 ℓ ′′ i θ ′ -θ, ∂f ( θ; x i ) ∂θ 2 (a) ≤ 2 n n i=1 ∂f ( θ; x i ) ∂θ 2 2 ∥θ ′ -θ∥ 2 2 (b) ≤ 2ϱ 2 ∥θ ′ -θ∥ 2 2 , where (a) follows by the Cauchy-Schwartz inequality and (b) from Lemma 4.1. 
For I 2 , first note that for square loss, ℓ ′ i = 2(ỹ i -y i ) with ỹi = f ( θ; x i ), so that for the vector [ℓ ′ i ] i , we have 1 n ∥[ℓ ′ i ] i ∥ 2 2 = 4 n n i=1 (ỹ i -y i ) 2 = 4L( θ). Further, with Q i = (θ ′ -θ) ⊤ ∂ 2 f ( θt ; x i ) ∂θ 2 (θ ′ -θ), we have |Q i | = (θ ′ -θ) ⊤ ∂ 2 f ( θ; x i ) ∂θ 2 (θ ′ -θ) ≤ ∥θ ′ -θ∥ 2 2 ∂ 2 f ( θ; x i ) ∂θ 2 2 ≤ c H ∥θ ′ -θ∥ 2 2 √ m . Then, we have I 2 = 1 n n i=1 ℓ ′ i (θ ′ -θ) ⊤ ∂ 2 f ( θ; x i ) ∂θ 2 (θ ′ -θ) ≤ n i=1 1 √ n ℓ ′ i 1 √ n Q i (a) ≤ 1 n ∥[ℓ ′ i ] i ∥ 2 2 1/2 1 n n i=1 Q 2 i 1/2 ≤ 2 L( θ) c H ∥θ ′ -θ∥ 2 2 √ m , where (a) follows by Cauchy-Schwartz. Putting the upper bounds on I 1 and I 2 back, we (θ ′ -θ) ⊤ ∂ 2 L( θ) ∂θ 2 (θ ′ -θ) ≤   2ϱ 2 + 2c H L( θ) √ m   ∥θ ′ -θ∥ 2 2 ≤ 2ϱ 2 + 2c H √ c ρ1,γ √ m ∥θ ′ -θ∥ 2 2 , where the last inequality follows from Lemma 4.2. This proves the first statement of the theorem. Now, the second statement simply follows from the fact that by choosing γ (and so ρ and σ 1 ) according to the desirable operating regimes (see Remark 4.1) and by choosing ρ 1 according to Theorem 4.1, we obtain c H = O(poly(L)), ρ 2 = O(poly(L)) and c ρ1,γ = O(poly(L)). This completes the proof. Next, we state and prove the smoothness result for general losses.  λ t = 1 n n i=1 (ℓ ′ i ) 2 with ℓ ′ i = ∂ℓ(yi,z) ∂z z=f ( θt,xi)) and θ ∈ B Spec ρ,ρ1 being some point in the segment that joins θ ′ and θ as in (11). Proof. By the second order Taylor expansion about θ, we have L(θ ′ ) = L( θ) + ⟨θ ′ -θ, ∇ θ L( θ)⟩ + 1 2 (θ ′ -θ) ⊤ ∂ 2 L( θ) ∂θ 2 (θ ′ -θ), where θ = ξθ ′ + (1 -ξ) θ for some ξ ∈ [0, 1]. Then, (θ ′ -θ) ⊤ ∂ 2 L( θ) ∂θ 2 (θ ′ -θ) = (θ ′ -θ) ⊤ 1 n n i=1 ℓ ′′ i ∂f ( θ; x i ) ∂θ ∂f ( θ; x i ) ∂θ ⊤ + ℓ ′ i ∂ 2 f ( θ; x i ) ∂θ 2 (θ ′ -θ) = 1 n n i=1 ℓ ′′ i θ ′ -θ, ∂f ( θ; x i ) ∂θ 2 I1 + 1 n n i=1 ℓ ′ i (θ ′ -θ) ⊤ ∂ 2 f ( θ; x i ) ∂θ 2 (θ ′ -θ) I2 . where ℓ i = ℓ(y i , f ( θ, x i )), ℓ ′ i = ∂ℓ(yi,z) ∂z z=f ( θ,xi)) , and ℓ ′′ i = ∂ 2 ℓ(yi,z) ∂z 2 z=f ( θ,xi)) . 
Now, note that I 1 = 1 n n i=1 ℓ ′′ i θ ′ -θ, ∂f ( θ; x i ) ∂θ 2 (a) ≤ b n n i=1 ∂f ( θ; x i ) ∂θ 2 2 ∥θ ′ -θ∥ 2 2 (b) ≤ bϱ 2 ∥θ ′ -θ∥ 2 2 , where (a) follows by the Cauchy-Schwartz inequality and (b) from Lemma 4.1. For I 2 , let λ t = 1 n ∥[ℓ ′ i ] i ∥ 2 2 . Further, with Q t,i = (θ ′ -θ) ⊤ ∂ 2 f ( θ;xi) ∂θ 2 (θ ′ -θ), we have |Q t,i | = (θ ′ -θ) ⊤ ∂ 2 f ( θt ; x i ) ∂θ 2 (θ ′ -θ) ≤ ∥θ ′ -θ∥ 2 2 ∂ 2 f ( θt ; x i ) ∂θ 2 2 ≤ c H ∥θ ′ -θ∥ 2 2 √ m Then, we have I 2 = 1 n n i=1 ℓ ′ i (θ ′ -θ) ⊤ ∂ 2 f ( θ; x i ) ∂θ 2 (θ ′ -θ) ≤ n i=1 1 √ n ℓ ′ i 1 √ n Q i (a) ≤ 1 n ∥[ℓ ′ i ] i ∥ 2 2 1/2 1 n n i=1 Q 2 i 1/2 ≤ λ t c H ∥θ ′ -θ∥ 2 2 √ m , where (a) follows by Cauchy-Schwartz. Putting the upper bounds on I 1 and I 2 back, we have (θ ′ -θ) ⊤ ∂ 2 L( θ) ∂θ 2 (θ ′ -θ) ≤ bϱ 2 + c H √ λ t √ m ∥θ ′ -θ∥ 2 2 . This completes the proof. Lemma 5.1 (RSC ⇒ RPL). Let B t := Q t κ ∩ B Spec ρ,ρ1 (θ 0 ) ∩ B Euc ρ2 (θ t ). In the setting of Theorem 5.1, if α t > 0, then the tuple (B t , θ t ) satisfies the Restricted Polyak-Łojasiewicz (RPL) condition, i.e., L(θ t ) -inf θ∈Bt L(θ) ≤ 1 2α t ∥∇ θ L(θ t )∥ 2 2 , ( ) with probability at least (1 -2(L+1) m ).

Proof. Define

Lθt (θ) := L(θ t ) + ⟨θ -θ t , ∇ θ L(θ t )⟩ + α t 2 ∥θ -θ t ∥ 2 2 . By Theorem 5.1, ∀θ ′ ∈ B t , we have L(θ ′ ) ≥ Lθt (θ ′ ) . Further, note that Lθt (θ) is minimized at θt+1 := θ t -∇ θ L(θ t )/α t and the minimum value is: inf θ Lθt (θ) = Lθt ( θt+1 ) = L(θ t ) - 1 2α t ∥∇ θ L(θ t )∥ 2 2 . Then, we have inf θ∈Bt L(θ) (a) ≥ inf θ∈Bt Lθt (θ) ≥ inf θ Lθt (θ) = L(θ t ) - 1 2α t ∥∇ θ L(θ t )∥ 2 2 , where (a) follows from (39). Rearranging terms completes the proof.
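The only computation in the proof above is minimizing the quadratic model L̂_{θt}. A tiny numerical check of that step, with arbitrary illustrative values for p, α_t, L(θ_t) and the gradient:

```python
import numpy as np

# The quadratic model Lhat(theta) = L(theta_t) + <theta - theta_t, g> + (alpha_t/2)||theta - theta_t||^2
# is minimized at theta_t - g/alpha_t, with minimum value L(theta_t) - ||g||^2/(2*alpha_t).
rng = np.random.default_rng(0)
p, alpha_t, Lt = 6, 0.7, 2.3                     # illustrative dimensions/constants
g = rng.standard_normal(p)                       # stand-in for grad L(theta_t)
theta_t = rng.standard_normal(p)
Lhat = lambda th: Lt + g @ (th - theta_t) + 0.5 * alpha_t * np.sum((th - theta_t)**2)
theta_min = theta_t - g / alpha_t
assert np.isclose(Lhat(theta_min), Lt - g @ g / (2 * alpha_t))
# any perturbation can only increase the model value
for _ in range(50):
    u = rng.standard_normal(p)
    assert Lhat(theta_min + 0.3 * u) >= Lhat(theta_min) - 1e-12
print("quadratic-model minimum matches L(theta_t) - ||g||^2/(2*alpha_t)")
```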

B.2 CONVERGENCE WITH RESTRICTED STRONG CONVEXITY

Lemma 5.2 (Local Loss Reduction in B_t). Let α_t, β be as in Theorems 5.1 and 5.2 respectively, and B_t := Q^t_κ ∩ B^Spec_{ρ,ρ1}(θ_0) ∩ B^Euc_{ρ2}(θ_t). Consider Assumptions 1, 2, and 4, and gradient descent with step size η_t = ω_t/β, ω_t ∈ (0, 2). Then, for any θ̄_{t+1} ∈ arginf_{θ∈B_t} L(θ), we have with probability at least (1 - 2(L+1)/m),

L(θ_{t+1}) - L(θ̄_{t+1}) ≤ (1 - (α_t ω_t / β)(2 - ω_t)) (L(θ_t) - L(θ̄_{t+1})) .

Proof. Since L is β-smooth by Theorem 5.2, we have

L(θ_{t+1}) ≤ L(θ_t) + ⟨θ_{t+1} - θ_t, ∇_θ L(θ_t)⟩ + (β/2) ∥θ_{t+1} - θ_t∥_2² = L(θ_t) - η_t ∥∇_θ L(θ_t)∥_2² + (βη_t²/2) ∥∇_θ L(θ_t)∥_2² = L(θ_t) - η_t (1 - βη_t/2) ∥∇_θ L(θ_t)∥_2² . (40)

Since θ̄_{t+1} ∈ arginf_{θ∈B_t} L(θ) and α_t > 0 by assumption, from Lemma 5.1 we obtain -∥∇_θ L(θ_t)∥_2² ≤ -2α_t (L(θ_t) - L(θ̄_{t+1})). Hence

L(θ_{t+1}) - L(θ̄_{t+1}) ≤ L(θ_t) - L(θ̄_{t+1}) - η_t (1 - βη_t/2) ∥∇_θ L(θ_t)∥_2² ≤_(a) (1 - 2α_t η_t (1 - βη_t/2)) (L(θ_t) - L(θ̄_{t+1})) ,

where (a) follows for any η_t ≤ 2/β because this implies 1 - βη_t/2 ≥ 0. Choosing η_t = ω_t/β with ω_t ∈ (0, 2),

L(θ_{t+1}) - L(θ̄_{t+1}) ≤ (1 - (α_t ω_t / β)(2 - ω_t)) (L(θ_t) - L(θ̄_{t+1})) .

This completes the proof.

Theorem 5.3 (Global Loss Reduction in B^Spec_{ρ,ρ1}(θ_0)). Let α_t, β be as in Theorems 5.1 and 5.2 respectively, and B_t := Q^t_κ ∩ B^Spec_{ρ,ρ1}(θ_0) ∩ B^Euc_{ρ2}(θ_t). Let θ* ∈ arginf_{θ∈B^Spec_{ρ,ρ1}(θ_0)} L(θ), θ̄_{t+1} ∈ arginf_{θ∈B_t} L(θ), and γ_t := (L(θ̄_{t+1}) - L(θ*)) / (L(θ_t) - L(θ*)). Consider Assumptions 1, 2, and 4, and gradient descent with step size η_t = ω_t/β, ω_t ∈ (0, 2). Then, with probability at least (1 - 2(L+1)/m), we have γ_t ∈ [0, 1) and

L(θ_{t+1}) - L(θ*) ≤ (1 - (α_t ω_t / β)(1 - γ_t)(2 - ω_t)) (L(θ_t) - L(θ*)) .

Proof. We start by showing that γ_t = (L(θ̄_{t+1}) - L(θ*)) / (L(θ_t) - L(θ*)) satisfies 0 ≤ γ_t < 1.
Since θ* ∈ arginf_{θ∈B^Spec_{ρ,ρ1}(θ_0)} L(θ) and θ_{t+1} ∈ B^Spec_{ρ,ρ1}(θ_0) by the definition of gradient descent and Assumption 4, we have

L(θ*) ≤ L(θ_{t+1}) ≤_(a) L(θ_t) - (ω_t/β)(1 - ω_t/2) ∥∇_θ L(θ_t)∥_2² < L(θ_t) ,

where (a) follows from (40). Since L(θ̄_{t+1}) ≥ L(θ*) and L(θ_t) > L(θ*), we have γ_t ≥ 0. Further, we have L(θ̄_{t+1}) < L(θ_t), and so γ_t < 1. To see this, consider two cases: (i) θ_{t+1} ∈ B_t and (ii) θ_{t+1} ∉ B_t. When θ_{t+1} ∈ B_t, we have L(θ̄_{t+1}) ≤ L(θ_{t+1}) < L(θ_t). When θ_{t+1} ∉ B_t, we only consider the case L(θ_{t+1}) < L(θ̄_{t+1}); otherwise, if L(θ_{t+1}) ≥ L(θ̄_{t+1}), then L(θ̄_{t+1}) < L(θ_t) follows directly from (40). So, let us consider level sets of the loss between L(θ_{t+1}) and L(θ_t). Because of the definition of Q^t_κ (which defines a cone), the RSC property for θ′ ∈ B_t, and the smoothness of the loss, we will have some θ′ ∈ B_t living in one of those level sets such that L(θ_{t+1}) ≤ L(θ′) < L(θ_t); but then L(θ̄_{t+1}) ≤ L(θ′) by definition, implying L(θ̄_{t+1}) < L(θ_t). Hence, γ_t < 1. Now, with ω_t ∈ (0, 2), we have

L(θ_{t+1}) - L(θ*) = (L(θ_{t+1}) - L(θ̄_{t+1})) + (L(θ̄_{t+1}) - L(θ*))
≤ (1 - (α_t ω_t/β)(2 - ω_t)) (L(θ_t) - L(θ̄_{t+1})) + (L(θ̄_{t+1}) - L(θ*))
= (1 - (α_t ω_t/β)(2 - ω_t)) (L(θ_t) - L(θ*)) + (α_t ω_t/β)(2 - ω_t) (L(θ̄_{t+1}) - L(θ*))
= (1 - (α_t ω_t/β)(2 - ω_t)) (L(θ_t) - L(θ*)) + (α_t ω_t/β)(2 - ω_t) γ_t (L(θ_t) - L(θ*))
= (1 - (α_t ω_t/β)(1 - γ_t)(2 - ω_t)) (L(θ_t) - L(θ*)) .

That completes the proof.
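Theorems 5.1–5.3 combine an RSC/PL-type lower bound with β-smoothness to obtain the geometric decay factor (1 − α_t ω_t (2 − ω_t)/β). Least squares, which is smooth and satisfies the PL condition globally, is a convenient stand-in (not a deep network) to check this contraction arithmetic:

```python
import numpy as np

# GD with step eta = omega/beta on least squares: the optimality gap should
# contract at least by (1 - alpha*omega*(2-omega)/beta) per step, where beta
# and alpha are the largest/smallest eigenvalues of the loss Hessian.
rng = np.random.default_rng(0)
n, p, T, omega = 50, 10, 30, 1.0
A = rng.standard_normal((n, p)); b = rng.standard_normal(n)
L = lambda th: np.mean((A @ th - b)**2)
grad = lambda th: 2 * A.T @ (A @ th - b) / n
H = 2 * A.T @ A / n
evals = np.linalg.eigvalsh(H)
alpha, beta = evals.min(), evals.max()          # PL and smoothness constants
rate = 1 - alpha * omega * (2 - omega) / beta
Lstar = L(np.linalg.lstsq(A, b, rcond=None)[0])
theta = np.zeros(p)
gap0 = L(theta) - Lstar
for _ in range(T):
    theta -= (omega / beta) * grad(theta)       # eta_t = omega/beta
assert L(theta) - Lstar <= rate**T * gap0 + 1e-9
print(f"gap after {T} steps: {L(theta) - Lstar:.2e} <= bound {rate**T * gap0:.2e}")
```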

C ANALYSIS FOR MODELS WITH k OUTPUTS

In this section, we illustrate that our results can be extended to neural models with k outputs.

C.1 OPTIMIZATION SETUP

WITH k OUTPUTS Consider a training set D = {x i , y i } n i=1 , x i ∈ X ⊆ R d , y i ∈ Y ⊆ R k , k ≥ 1. We will denote by X ∈ R n×d the matrix whose ith row is x ⊤ i . The goal is to minimize the square loss: L(θ) = 1 n n i=1 ∥y i -ŷi ∥ 2 2 = 1 n n i=1 k h=1 (y ih -f h (θ; x i )) 2 , where the prediction ŷi := f (θ; x i ) ∈ R k is from a deep model, f h (θ; x i ), h ∈ [k] denotes the h th output, and θ ∈ R p denotes the parameter vector. In our setting f is a feed-forward multi-layer (fully-connected) neural network with depth L and widths m l , l ∈ [L] := {1, . . . , L} given by α (0) (x) = x , α (l) (x) = ϕ 1 √ m l-1 W (l) α (l-1) (x) , l = 1, . . . , L , f (θ; x) = α (L+1) (x) = 1 √ m L V ⊤ α (L) (x) , where W (l) ∈ R m l ×m l-1 , l ∈ [L] are layer-wise weight matrices, V ∈ R m L ×k is the last layer matrix, ϕ(•) is the smooth (pointwise) activation function, and the total set of parameters θ := (vec(W (1) ) ⊤ , . . . , vec(W (L) ) ⊤ , V ⊤ ) ⊤ ∈ R L l=1 m l m l-1 +km L , with m 0 = d. Note that the total number of parameters is p = L l=1 m l m l-1 + km L . For simplicity, we will assume that the width of all the layers is the same, i.e., m l = m, l ∈ [L]. Define the pointwise loss ℓ ih := (y ih -ŷih ) 2 and denote its first-and second-derivative w.r.t. ŷih as ℓ ′ ih := -2(y ih -ŷih ) and ℓ ′′ ih := 2. Let f h (θ; x), h ∈ [k] denote the h th output, and let the gradient and Hessian of f h (θ; x i ) w.r.t. θ be denoted as ∇f ih := ∂f h (θ; x i ) ∂θ , ∇ 2 f ih := ∂ 2 f h (θ; x i ) ∂θ 2 . By chain rule, the gradient and Hessian of the empirical loss w.r.t. θ are given by ∂L(θ) ∂θ = 1 n n i=1 k h=1 ℓ ′ ih ∇f ih , ∂ 2 L(θ) ∂θ 2 = 1 n n i=1 k h=1 ℓ ′′ ih ∇f ih ∇f ⊤ ih + ℓ ′ ih ∇ 2 f ih . For the last layer, note that f h (θ; x i ) = 1 √ m v T h α (L) (x) where v h ∈ R m is the last layer vector corresponding to output f h and V = [v h ] ∈ R m×k is the last layer matrix. 
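A minimal sketch of the k-output architecture and square loss defined above; tanh stands in for the smooth activation ϕ, and all sizes and σ_0 are illustrative choices:

```python
import numpy as np

def forward_k(Ws, V, x):
    """k-output network: alpha^(l) = phi(W^(l) alpha^(l-1)/sqrt(m_{l-1})),
    f(theta; x) = V^T alpha^(L)/sqrt(m_L)."""
    a = x
    for W in Ws:
        a = np.tanh(W @ a / np.sqrt(len(a)))
    return V.T @ a / np.sqrt(len(a))

rng = np.random.default_rng(0)
d, m, L, k, n, sigma0 = 8, 64, 3, 4, 5, 0.5
Ws = [sigma0 * rng.standard_normal((m, d))] + \
     [sigma0 * rng.standard_normal((m, m)) for _ in range(L - 1)]
V = rng.standard_normal((m, k))
V /= np.linalg.norm(V, axis=0)                  # each column v_h is a unit vector
X = rng.choice([-1.0, 1.0], size=(n, d))        # rows with ||x_i||_2 = sqrt(d)
Y = rng.standard_normal((n, k))
preds = np.stack([forward_k(Ws, V, x) for x in X])
loss = np.mean(np.sum((Y - preds)**2, axis=1))  # L(theta) = (1/n) sum_i ||y_i - yhat_i||^2
assert preds.shape == (n, k)
print(f"L(theta) = {loss:.4f}")
```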
For the analysis, we update the definition of the spectral norm ball to work with each last layer vector v h : B Spec ρ,ρ1 ( θ) := θ ∈ R p as in (42) | ∥W (ℓ) -W (ℓ) ∥ 2 ≤ ρ, ℓ ∈ [L], ∥v h -vh ∥ 2 ≤ ρ 1 , h ∈ [k] , Similarly, we update the initialization so each of the last layer vectors are unit norm. Assumption 5 (Initialization). The initialization weights w (l) 0,ij ∼ N (0, σ 2 0 ) for l ∈ [L] where σ 0 = σ1 2 1+ √ log m √ 2m , σ 1 > 0, and v 0,h , h ∈ [k] are random unit vectors with ∥v 0,h ∥ 2 = 1. Further, we assume the input data satisfies: ∥x i ∥ 2 = √ d, i ∈ [n]. Based on the setup, following Theorem 4.1, we get the following result for the Hessian of each f h : Theorem C.1 (Hessian Spectral Norm Bound). Under Assumptions 1, and 5, for θ ∈ B Spec ρ,ρ1 (θ 0 ) as in (43), ρ 1 = O(poly(L)), with probability at least (1 -2(L+1) m ), for any x i , i ∈ [n], we have ∇ 2 θ f (θ; x i ) 2 ≤ c H √ m , with c H = O(poly(L)(1 + γ 2L )) where γ := σ 1 + ρ √ m .

C.2 RESTRICTED STRONG CONVEXITY AND SMOOTHNESS

Let us assume we have a sequence of iterates $\{\theta_t\}_{t \geq 0}$ from gradient descent. Our RSC analysis relies on the following $Q^t_\kappa$ sets at step $t$, which avoid directions almost orthogonal to the average gradient of the predictor.

Definition C.1 ($Q^t_\kappa$ sets). For iterate $\theta_t \in \mathbb{R}^p$, let $\bar{g}_t = \frac{1}{nk} \sum_{i=1}^n \sum_{h=1}^k \nabla_\theta f_h(\theta_t; x_i)$. For any $\kappa \in (0, 1]$, define
\[
Q^t_\kappa := \big\{ \theta \in \mathbb{R}^p \;\big|\; |\cos(\theta - \theta_t, \bar{g}_t)| \geq \kappa \big\} .
\]

We define the set $B_t := Q^t_\kappa \cap B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0) \cap B^{\mathrm{Euc}}_{\rho_2}(\theta_t)$. We focus on establishing RSC w.r.t. the tuple $(B_t, \theta_t)$, where $B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$ is the feasible set for the optimization and $B^{\mathrm{Euc}}_{\rho_2}(\theta_t)$ is a Euclidean ball around the current iterate.

Theorem C.2 (RSC for $k$-output square loss). For the square loss, under Assumptions 1, 3, and 5, with probability at least $1 - \frac{2(L+1)}{m}$, for all $\theta' \in Q^t_\kappa \cap B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0) \cap B^{\mathrm{Euc}}_{\rho_2}(\theta_t)$,
\[
\mathcal{L}(\theta') \geq \mathcal{L}(\theta_t) + \langle \theta' - \theta_t, \nabla_\theta \mathcal{L}(\theta_t) \rangle + \frac{\alpha_t}{2} \|\theta' - \theta_t\|_2^2 , \qquad \text{with} \quad \alpha_t = c_1 k \|\bar{g}_t\|_2^2 - \frac{k c_2}{\sqrt{m}} ,
\]
where $\bar{g}_t = \frac{1}{nk} \sum_{i=1}^n \sum_{h=1}^k \nabla_\theta f_h(\theta_t; x_i)$, $c_1 = 2\kappa^2$, and $c_2 = 2 c_H \big( 2 \varrho \rho_2 + \sqrt{k c_{\rho_1,\gamma}} \big)$, with $c_H$ as in Theorem 4.1, $\varrho$ as in Lemma 4.1, and $c_{\rho_1,\gamma}$ as in Lemma 4.2. Consequently, $\mathcal{L}$ satisfies RSC w.r.t. $(B_t, \theta_t)$ whenever $\alpha_t > 0$.

Proof. For any $\theta' \in Q^t_\kappa \cap B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0) \cap B^{\mathrm{Euc}}_{\rho_2}(\theta_t)$, by the second-order Taylor expansion around $\theta_t$, we have
\[
\mathcal{L}(\theta') = \mathcal{L}(\theta_t) + \langle \theta' - \theta_t, \nabla_\theta \mathcal{L}(\theta_t) \rangle + \frac{1}{2} (\theta' - \theta_t)^\top \frac{\partial^2 \mathcal{L}(\tilde{\theta}_t)}{\partial \theta^2} (\theta' - \theta_t) ,
\]
where $\tilde{\theta}_t = \xi \theta' + (1 - \xi) \theta_t$ for some $\xi \in [0, 1]$. We note that $\tilde{\theta}_t \in B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$, since $B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$ is convex and contains both $\theta_t$ and $\theta'$:
• $\|\tilde{W}^{(l)}_t - W^{(l)}_0\|_2 \leq \xi \|W'^{(l)} - W^{(l)}_0\|_2 + (1 - \xi) \|W^{(l)}_t - W^{(l)}_0\|_2 \leq \rho$ for $l \in [L]$;
• $\|\tilde{v}_{t,h} - v_{0,h}\|_2 \leq \rho_1$ for $h \in [k]$, by the same argument.

Focusing on the quadratic form in the Taylor expansion and recalling the form of the Hessian, we get
\[
(\theta' - \theta_t)^\top \frac{\partial^2 \mathcal{L}(\tilde{\theta}_t)}{\partial \theta^2} (\theta' - \theta_t)
= \underbrace{\frac{1}{n} \sum_{i=1}^n \sum_{h=1}^k \ell''_{ih} \left\langle \theta' - \theta_t, \frac{\partial f_h(\tilde{\theta}_t; x_i)}{\partial \theta} \right\rangle^2}_{I_1}
+ \underbrace{\frac{1}{n} \sum_{i=1}^n \sum_{h=1}^k \ell'_{ih} \, (\theta' - \theta_t)^\top \frac{\partial^2 f_h(\tilde{\theta}_t; x_i)}{\partial \theta^2} (\theta' - \theta_t)}_{I_2} ,
\]
where $\ell_{ih} = \ell(y_{ih}, f_h(\tilde{\theta}_t; x_i))$, $\ell'_{ih} = \frac{\partial \ell(y_{ih}, z)}{\partial z} \big|_{z = f_h(\tilde{\theta}_t; x_i)}$, and $\ell''_{ih} = \frac{\partial^2 \ell(y_{ih}, z)}{\partial z^2} \big|_{z = f_h(\tilde{\theta}_t; x_i)}$.
For $I_1$, write $u := \theta' - \theta_t$ and use $\ell''_{ih} = 2$ along with $(a + b)^2 \geq a^2 - 2|a||b|$:
\[
I_1 = \frac{2}{n} \sum_{i=1}^n \sum_{h=1}^k \left( \left\langle u, \frac{\partial f_h(\theta_t; x_i)}{\partial \theta} \right\rangle + \left\langle u, \frac{\partial f_h(\tilde{\theta}_t; x_i)}{\partial \theta} - \frac{\partial f_h(\theta_t; x_i)}{\partial \theta} \right\rangle \right)^2
\stackrel{(a)}{\geq} \frac{2}{n} \sum_{i=1}^n \sum_{h=1}^k \left\langle u, \frac{\partial f_h(\theta_t; x_i)}{\partial \theta} \right\rangle^2 - \frac{4}{n} \sum_{i=1}^n \sum_{h=1}^k \varrho \|u\|_2 \cdot \frac{c_H}{\sqrt{m}} \|\tilde{\theta}_t - \theta_t\|_2 \|u\|_2
\]
\[
\stackrel{(b)}{\geq} 2 k \langle u, \bar{g}_t \rangle^2 - \frac{4 k \varrho c_H \|\tilde{\theta}_t - \theta_t\|_2 \|u\|_2^2}{\sqrt{m}}
\stackrel{(c)}{\geq} 2 k \langle u, \bar{g}_t \rangle^2 - \frac{4 k \varrho c_H \|u\|_2^3}{\sqrt{m}}
\stackrel{(d)}{\geq} 2 k \kappa^2 \|\bar{g}_t\|_2^2 \|u\|_2^2 - \frac{4 k \varrho c_H \|u\|_2^3}{\sqrt{m}} ,
\]
where (a) follows by the Cauchy–Schwarz inequality together with Theorem 4.1 and Lemma 4.1, using $\tilde{\theta}_t \in B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$; (b) follows by Jensen's inequality (first term); (c) follows since $\|\tilde{\theta}_t - \theta_t\|_2 = \|\xi \theta' + (1 - \xi) \theta_t - \theta_t\|_2 = \xi \|\theta' - \theta_t\|_2 \leq \|u\|_2$; and (d) follows since $\theta' \in Q^t_\kappa$ and $p^\top q = \cos(p, q) \|p\|_2 \|q\|_2$ for any vectors $p, q$.

For analyzing $I_2$, first note that for the square loss $\ell'_{ih} = 2(\hat{y}_{ih} - y_{ih})$ with $\hat{y}_{ih} = f_h(\tilde{\theta}_t; x_i)$, so that for the vector $[\ell'_{ih}]$ we have $\frac{1}{n} \|[\ell'_{ih}]\|_2^2 = \frac{4}{n} \sum_{i=1}^n \sum_{h=1}^k (\hat{y}_{ih} - y_{ih})^2 = 4 \mathcal{L}(\tilde{\theta}_t)$. Further, with $Q_{ih} := u^\top \frac{\partial^2 f_h(\tilde{\theta}_t; x_i)}{\partial \theta^2} u$, we have
\[
|Q_{ih}| \leq \|u\|_2^2 \left\| \frac{\partial^2 f_h(\tilde{\theta}_t; x_i)}{\partial \theta^2} \right\|_2 \leq \frac{c_H \|u\|_2^2}{\sqrt{m}} .
\]
Then,
\[
I_2 = \frac{1}{n} \sum_{i=1}^n \sum_{h=1}^k \ell'_{ih} Q_{ih}
\geq - \left( \frac{1}{n} \sum_{i=1}^n \sum_{h=1}^k (\ell'_{ih})^2 \right)^{1/2} \left( \frac{1}{n} \sum_{i=1}^n \sum_{h=1}^k Q_{ih}^2 \right)^{1/2}
\geq - 2 \sqrt{\mathcal{L}(\tilde{\theta}_t)} \, \frac{\sqrt{k} \, c_H \|u\|_2^2}{\sqrt{m}} ,
\]
where the first inequality follows by the Cauchy–Schwarz inequality. Putting the lower bounds on $I_1$ and $I_2$ back, we have
\[
u^\top \frac{\partial^2 \mathcal{L}(\tilde{\theta}_t)}{\partial \theta^2} u \geq k \left( 2 \kappa^2 \|\bar{g}_t\|_2^2 - \frac{4 \varrho c_H \|u\|_2 + 2 c_H \sqrt{\mathcal{L}(\tilde{\theta}_t)}}{\sqrt{m}} \right) \|u\|_2^2 .
\]
Now, since $\theta' \in B^{\mathrm{Euc}}_{\rho_2}(\theta_t)$ we have $\|u\|_2 \leq \rho_2$, and by Lemma 4.2 $\mathcal{L}(\tilde{\theta}_t) \leq k c_{\rho_1,\gamma}$, so
\[
u^\top \frac{\partial^2 \mathcal{L}(\tilde{\theta}_t)}{\partial \theta^2} u \geq k \left( 2 \kappa^2 \|\bar{g}_t\|_2^2 - \frac{4 \varrho c_H \rho_2 + 2 c_H \sqrt{k c_{\rho_1,\gamma}}}{\sqrt{m}} \right) \|u\|_2^2 = \alpha_t \|u\|_2^2 .
\]
That completes the proof. □

Theorem C.3 (Local smoothness for $k$-output square loss). Under Assumptions 1, 3, and 5, with probability at least $1 - \frac{2(L+1)}{m}$, for all $\theta, \theta' \in B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$,
\[
\mathcal{L}(\theta') \leq \mathcal{L}(\theta) + \langle \theta' - \theta, \nabla_\theta \mathcal{L}(\theta) \rangle + \frac{\beta}{2} \|\theta' - \theta\|_2^2 , \qquad \text{with} \quad \beta = 2 k \varrho^2 + \frac{2 \sqrt{k} \, c_H \sqrt{k c_{\rho_1,\gamma}}}{\sqrt{m}} ,
\]
with $c_H$ as in Theorem 4.1, $\varrho$ as in Lemma 4.1, and $c_{\rho_1,\gamma}$ as in Lemma 4.2. Consequently, $\mathcal{L}$ is locally $\beta$-smooth.
Proof. By the second-order Taylor expansion around $\theta \in B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$, we have
\[
\mathcal{L}(\theta') = \mathcal{L}(\theta) + \langle \theta' - \theta, \nabla_\theta \mathcal{L}(\theta) \rangle + \frac{1}{2} (\theta' - \theta)^\top \frac{\partial^2 \mathcal{L}(\tilde{\theta})}{\partial \theta^2} (\theta' - \theta) ,
\]
where $\tilde{\theta} = \xi \theta' + (1 - \xi) \theta$ for some $\xi \in [0, 1]$. As before,
\[
(\theta' - \theta)^\top \frac{\partial^2 \mathcal{L}(\tilde{\theta})}{\partial \theta^2} (\theta' - \theta)
= \underbrace{\frac{1}{n} \sum_{i=1}^n \sum_{h=1}^k \ell''_{ih} \left\langle \theta' - \theta, \frac{\partial f_h(\tilde{\theta}; x_i)}{\partial \theta} \right\rangle^2}_{I_1}
+ \underbrace{\frac{1}{n} \sum_{i=1}^n \sum_{h=1}^k \ell'_{ih} \, (\theta' - \theta)^\top \frac{\partial^2 f_h(\tilde{\theta}; x_i)}{\partial \theta^2} (\theta' - \theta)}_{I_2} ,
\]
where $\ell_{ih} = \ell(y_{ih}, f_h(\tilde{\theta}; x_i))$, $\ell'_{ih} = \frac{\partial \ell(y_{ih}, z)}{\partial z} \big|_{z = f_h(\tilde{\theta}; x_i)}$, and $\ell''_{ih} = \frac{\partial^2 \ell(y_{ih}, z)}{\partial z^2} \big|_{z = f_h(\tilde{\theta}; x_i)}$.

For $I_1$, we have
\[
I_1 = \frac{2}{n} \sum_{i=1}^n \sum_{h=1}^k \left\langle \theta' - \theta, \frac{\partial f_h(\tilde{\theta}; x_i)}{\partial \theta} \right\rangle^2
\stackrel{(a)}{\leq} \frac{2}{n} \sum_{i=1}^n \sum_{h=1}^k \|\theta' - \theta\|_2^2 \left\| \frac{\partial f_h(\tilde{\theta}; x_i)}{\partial \theta} \right\|_2^2
\stackrel{(b)}{\leq} 2 k \varrho^2 \|\theta' - \theta\|_2^2 ,
\]
where (a) follows by the Cauchy–Schwarz inequality and (b) from Lemma 4.1.

For $I_2$, first note that for the square loss $\ell'_{ih} = 2(\hat{y}_{ih} - y_{ih})$ with $\hat{y}_{ih} = f_h(\tilde{\theta}; x_i)$, so that for the vector $[\ell'_{ih}]$ we have $\frac{1}{n} \|[\ell'_{ih}]\|_2^2 = \frac{4}{n} \sum_{i=1}^n \sum_{h=1}^k (\hat{y}_{ih} - y_{ih})^2 = 4 \mathcal{L}(\tilde{\theta})$. Further, with $Q_{ih} := (\theta' - \theta)^\top \frac{\partial^2 f_h(\tilde{\theta}; x_i)}{\partial \theta^2} (\theta' - \theta)$, we have
\[
|Q_{ih}| \leq \|\theta' - \theta\|_2^2 \left\| \frac{\partial^2 f_h(\tilde{\theta}; x_i)}{\partial \theta^2} \right\|_2 \leq \frac{c_H \|\theta' - \theta\|_2^2}{\sqrt{m}} .
\]
Then, by the Cauchy–Schwarz inequality,
\[
I_2 \leq \left( \frac{1}{n} \sum_{i=1}^n \sum_{h=1}^k (\ell'_{ih})^2 \right)^{1/2} \left( \frac{1}{n} \sum_{i=1}^n \sum_{h=1}^k Q_{ih}^2 \right)^{1/2}
\leq 2 \sqrt{\mathcal{L}(\tilde{\theta})} \, \frac{\sqrt{k} \, c_H \|\theta' - \theta\|_2^2}{\sqrt{m}} .
\]
Putting the upper bounds on $I_1$ and $I_2$ back, we have
\[
(\theta' - \theta)^\top \frac{\partial^2 \mathcal{L}(\tilde{\theta})}{\partial \theta^2} (\theta' - \theta)
\leq \left( 2 k \varrho^2 + \frac{2 \sqrt{k} \, c_H \sqrt{\mathcal{L}(\tilde{\theta})}}{\sqrt{m}} \right) \|\theta' - \theta\|_2^2
\leq \left( 2 k \varrho^2 + \frac{2 \sqrt{k} \, c_H \sqrt{k c_{\rho_1,\gamma}}}{\sqrt{m}} \right) \|\theta' - \theta\|_2^2 ,
\]
where the last inequality follows from Lemma 4.2. That completes the proof. □

Note that we now have RSC and smoothness results for the $k$-output case analogous to the single-output case. With these properties in place, the rest of the convergence analysis for the $k$-output case proceeds exactly as before.
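The role these two bounds play downstream can be seen in a toy simulation: when a loss admits an RSC-type lower bound with curvature $\alpha$ and a $\beta$-smoothness upper bound, GD with step size $1/\beta$ contracts the suboptimality by a factor of at most $(1 - \alpha/\beta)$ per step. The quadratic objective below is only an illustrative stand-in (a strongly convex quadratic satisfies both bounds globally with $\alpha = \lambda_{\min}$, $\beta = \lambda_{\max}$), not the deep-network loss analyzed above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Random PSD quadratic with spectrum in [alpha, beta]: satisfies the RSC lower
# bound with curvature alpha and beta-smoothness everywhere, with minimum 0.
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
eigs = np.linspace(0.5, 4.0, 10)
A = Q @ np.diag(eigs) @ Q.T
alpha, beta = eigs[0], eigs[-1]

loss = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.normal(size=10)
losses = [loss(x)]
for _ in range(50):
    x = x - grad(x) / beta            # GD with step size 1/beta
    losses.append(loss(x))

rates = [l1 / l0 for l0, l1 in zip(losses, losses[1:])]
print(max(rates), 1 - alpha / beta)   # per-step contraction factor never exceeds 1 - alpha/beta
```

For this objective the per-step ratio bound is exact to verify: each eigen-component of the loss shrinks by $(1 - \lambda_i/\beta)^2 \leq 1 - \alpha/\beta$, so the printed maximum ratio sits below $0.875$ and the loss decays geometrically.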



See the end of Appendix A for a brief note on the relationship between the network architecture in our work and the one in (Liu et al., 2020).



Lemma 4.2 (Loss bounds). Consider the square loss. Under Assumptions 1 and 2, for $\gamma = \sigma_1 + \frac{\rho}{\sqrt{m}}$, each of the following inequalities holds with probability at least $1 - \frac{2(L+1)}{m}$: $\mathcal{L}(\theta_0) \leq c_{0,\sigma_1}$ and $\mathcal{L}(\theta) \leq c_{\rho_1,\gamma}$ for $\theta \in B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$, where $c_{a,b} = 2 (1 + a)^2 |g(b)|^2$ and $g(a) = a^L + |\phi(0)| \sum_{i=1}^L a^i$ for any $a, b \in \mathbb{R}$.

Corollary 4.1 (Loss gradient bound). Consider the square loss. Under Assumptions 1 and 2, for $\theta \in B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$, with probability at least $1 - \frac{2(L+1)}{m}$, we have $\|\nabla_\theta \mathcal{L}(\theta)\|_2 \leq 2 \sqrt{\mathcal{L}(\theta)} \, \varrho \leq 2 \sqrt{c_{\rho_1,\gamma}} \, \varrho$, with $\varrho$ as in Lemma 4.1 and $c_{\rho_1,\gamma}$ as in Lemma 4.2.
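The first inequality in Corollary 4.1 is the Cauchy–Schwarz inequality applied to $\nabla_\theta \mathcal{L} = \frac{2}{n} \sum_i (\hat{y}_i - y_i) \nabla_\theta f_i$, and it can be checked numerically. The sketch below uses linear predictors $f_i(\theta) = a_i^\top \theta$ purely for illustration (so that $\varrho = \max_i \|a_i\|_2$ is exact); the inequality itself does not depend on the predictor being linear.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 6
A = rng.normal(size=(n, p))               # row a_i = gradient of the linear predictor f_i(theta) = a_i^T theta
y = rng.normal(size=n)
theta = rng.normal(size=p)

resid = A @ theta - y                     # hat{y}_i - y_i
L = np.mean(resid ** 2)                   # L(theta) = (1/n) sum_i (hat{y}_i - y_i)^2
gradL = (2.0 / n) * A.T @ resid           # nabla L = (2/n) sum_i (hat{y}_i - y_i) a_i
rho = np.max(np.linalg.norm(A, axis=1))   # rho >= ||nabla f_i||_2 for all i

lhs = np.linalg.norm(gradL)
rhs = 2.0 * np.sqrt(L) * rho
print(lhs, rhs)                           # lhs <= rhs holds deterministically
```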

In Figure 1(b), we plot $\min_{t \in [T]} \|\bar{g}_t\|_2$ as a function of the network width $m$. The plot shows that $\min_{t \in [T]} \|\bar{g}_t\|_2$ increases steadily with $m$, illustrating that the RSC condition is empirically satisfied more comfortably for wider networks. In Figure 1(c) and (d), we show similar plots for MNIST and Fashion-MNIST, illustrating the same phenomenon of $\min_{t \in [T]} \|\bar{g}_t\|_2$ increasing with $m$. For these experiments, the network architecture is a 3-layer fully connected neural network with the tanh activation function. The training algorithm is gradient descent (GD) with a constant learning rate, chosen appropriately to keep the training in the NTK regime. Since we are using GD, we use 512 randomly chosen training points for the experiments. The stopping criterion is either training loss below $10^{-3}$ or more than 3000 iterations.
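A scaled-down version of this experiment can be reproduced in a few lines: train a one-hidden-layer tanh network (a smaller stand-in for the 3-layer network used in the figures) with GD on random data and track $\|\bar{g}_t\|_2$, the norm of the average predictor gradient from Definition C.1. All sizes, the learning rate, and the synthetic data below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 64, 8, 128
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
W = rng.normal(0, 0.5, size=(m, d))
v = rng.normal(size=m); v /= np.linalg.norm(v)   # unit-norm last layer, as in Assumption 5

def predict_and_grads(W, v, X):
    """f_i = v^T tanh(W x_i / sqrt(d)) / sqrt(m), plus per-example gradients of f_i."""
    act = np.tanh(X @ W.T / np.sqrt(d))                                     # (n, m)
    f = act @ v / np.sqrt(m)                                                # (n,)
    gW = (v * (1 - act ** 2))[:, :, None] * X[:, None, :] / np.sqrt(m * d)  # (n, m, d)
    gv = act / np.sqrt(m)                                                   # (n, m)
    return f, gW, gv

lr, T = 0.1, 50
gbar_norms = []
for t in range(T):
    f, gW, gv = predict_and_grads(W, v, X)
    gbar = np.concatenate([gW.mean(0).ravel(), gv.mean(0)])   # average predictor gradient
    gbar_norms.append(np.linalg.norm(gbar))
    r = f - y
    W -= lr * (2 / n) * np.einsum('i,imd->md', r, gW)         # GD step on the square loss
    v -= lr * (2 / n) * gv.T @ r
print(min(gbar_norms))
```

Rerunning this sketch across several widths `m` and comparing the printed minimum of $\|\bar{g}_t\|_2$ mirrors, at toy scale, the trend reported in Figure 1(b)-(d).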


Figure 1: Experiments on CIFAR-10: (a) $\|\bar{g}_t\|_2$ over iterations for different network widths $m$; (b) minimum $\|\bar{g}_t\|_2$ over all iterations, i.e., $\min_{t \in [T]} \|\bar{g}_t\|_2$, as a function of network width $m$. Experiments on (c) MNIST and (d) Fashion-MNIST. The experiments validate the RSC condition empirically and illustrate that the condition is satisfied more comfortably for wider networks. Each curve is the average of 3 independent runs.




(Local smoothness of loss). Under Assumptions 1, 2, and 3, with probability at least $1 - \frac{2(L+1)}{m}$, $\mathcal{L}(\theta)$, $\theta \in B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$, is $\beta$-smooth with $\beta = b \varrho^2 + \frac{c_H \sqrt{\lambda_t}}{\sqrt{m}}$, with $\varrho$ as in Lemma 4.1.







ACKNOWLEDGMENTS

AB is grateful for support from the National Science Foundation (NSF) through awards IIS 21-31335, OAC 21-30835, and DBI 20-21898, as well as a C3.ai research award. MB and LZ are grateful for support from the National Science Foundation (NSF) and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and 814639, as well as NSF IIS-1815697 and the TILOS institute (NSF CCF-2112665).

