RESTRICTED STRONG CONVEXITY OF DEEP LEARNING MODELS WITH SMOOTH ACTIVATIONS

Abstract

We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the "near initialization" perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with $L$ layers, width $m$, and initialization variance $\sigma_0^2$. First, for suitable $\sigma_0^2$, we establish a $O\big(\frac{\mathrm{poly}(L)}{\sqrt{m}}\big)$ upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC), which holds as long as the squared norm of the average gradient of predictors is $\Omega\big(\frac{\mathrm{poly}(L)}{\sqrt{m}}\big)$ for the square loss. We also present results for more general losses. The RSC-based analysis does not need the "near initialization" perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result establishing geometric convergence of GD based on RSC for deep learning models, thus providing an alternative sufficient condition for convergence that does not depend on the widely-used Neural Tangent Kernel (NTK). We share preliminary experimental results supporting our theoretical advances.

1. INTRODUCTION

Recent years have seen advances in understanding convergence of gradient descent (GD) and variants for deep learning models (Du et al., 2019; Allen-Zhu et al., 2019; Zou & Gu, 2019; Zou et al., 2020; Liu et al., 2022; Ji & Telgarsky, 2019; Oymak & Soltanolkotabi, 2020; Nguyen, 2021). Despite the fact that such optimization problems are non-convex, a series of recent results have shown that GD has geometric convergence and finds a near-global solution "near initialization" for wide networks. Such analysis is typically done based on the Neural Tangent Kernel (NTK) (Jacot et al., 2018), in particular by showing that the NTK is positive definite "near initialization," in turn implying the optimization problem satisfies a condition closely related to the Polyak-Łojasiewicz (PL) condition, which in turn implies geometric convergence to the global minimum (Liu et al., 2022; Nguyen, 2021). Such results have been generalized to more flexible forms of "lazy learning" where similar guarantees hold (Chizat et al., 2019). However, there are concerns regarding whether such "near initialization" or "lazy learning" truly explains the optimization behavior in realistic deep learning models (Geiger et al., 2020; Yang & Hu, 2020; Fort et al., 2020; Chizat et al., 2019). Our work focuses on optimization of deep models with smooth activation functions, which have become increasingly popular in recent years (Du et al., 2019; Liu et al., 2022; Huang & Yau, 2020). Much of the theoretical convergence analysis of GD has focused on ReLU networks (Allen-Zhu et al., 2019; Nguyen, 2021). Some progress has also been made for deep models with smooth activations, but existing results are based on a variant of the NTK analysis, and the requirements on the width of such models are high (Du et al., 2019; Liu et al., 2022).
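For context, the PL-based route to geometric convergence referenced above can be summarized by the following standard argument (a textbook fact, not specific to any of the cited works): for a $\beta$-smooth loss $L$ with minimum value $L^*$, a PL inequality combined with the descent lemma yields a geometric one-step decrease.

```latex
% PL condition (with constant \mu > 0):
\|\nabla L(\theta)\|_2^2 \;\ge\; 2\mu\,\bigl(L(\theta) - L^*\bigr).
% Descent lemma for a \beta-smooth loss and GD step \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t), \eta \le 1/\beta:
L(\theta_{t+1}) \;\le\; L(\theta_t) - \tfrac{\eta}{2}\,\|\nabla L(\theta_t)\|_2^2
\;\le\; L(\theta_t) - \eta\mu\,\bigl(L(\theta_t) - L^*\bigr),
% hence geometric convergence:
L(\theta_{t+1}) - L^* \;\le\; (1 - \eta\mu)\,\bigl(L(\theta_t) - L^*\bigr).
```

NTK-based analyses establish (a local variant of) the PL inequality near initialization; the RSC route studied in this paper reaches a comparable geometric decrease through a different sufficient condition.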
Based on such background and context, the motivating question behind our work is: Are there other (meaningful) sufficient conditions beyond NTK which lead to (geometric) convergence of GD for deep learning optimization? Based on such motivation, we make two technical contributions in this paper which shed light on optimization of deep learning models with smooth activations and with $L$ layers, width $m$, and initialization variance $\sigma_0^2$. First, for suitable $\sigma_0^2$, we establish a $O\big(\frac{\mathrm{poly}(L)}{\sqrt{m}}\big)$ upper bound on the spectral norm of the Hessian of such models (Section 4). The bound holds over a large layerwise spectral norm (instead of Frobenius norm) ball $B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$ around the random initialization $\theta_0$, where the radius $\rho < \sqrt{m}$, arguably much bigger than what real-world deep models need. Our analysis builds on and sharpens recent prior work on the topic (Liu et al., 2020). While our analysis holds for Gaussian random initialization of weights with any variance $\sigma_0^2$, the $\mathrm{poly}(L)$ dependence arises when $\sigma_0^2 \le \frac{1}{4+o(1)} \cdot \frac{1}{m}$ (we handle the $\frac{1}{m}$ scaling explicitly). Second, based on our Hessian spectral norm bound, we introduce a new approach to the analysis of optimization of deep models with smooth activations based on the concept of Restricted Strong Convexity (RSC) (Section 5) (Wainwright, 2019; Negahban et al., 2012; Negahban & Wainwright, 2012; Banerjee et al., 2014; Chen & Banerjee, 2015). While RSC has been a core theme in high-dimensional statistics, especially for linear models and convex losses (Wainwright, 2019), to the best of our knowledge, RSC has not been considered in the context of non-convex optimization of overparameterized deep models.
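As an illustrative sketch (ours, not from the paper): for a two-layer network with the $1/\sqrt{m}$ output scaling common in this line of work, $f(x) = \frac{1}{\sqrt{m}} v^\top \tanh(Wx/\sqrt{d})$ with $N(0,1)$ initialization, the Hessian of the predictor can be written in closed form, and its spectral norm can be checked numerically to shrink as the width $m$ grows, consistent with an $O(1/\sqrt{m})$ bound. All names and the specific widths below are our own choices for the demonstration.

```python
import numpy as np

def hessian_spectral_norm(m, d=8, seed=0):
    """Spectral norm of the Hessian (w.r.t. parameters) of the predictor
    f(x) = v^T tanh(W x / sqrt(d)) / sqrt(m), computed analytically at a
    fixed unit-norm input x, with N(0,1) Gaussian initialization."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))      # hidden-layer weights
    v = rng.standard_normal(m)           # output-layer weights
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)               # unit-norm input

    z = W @ x / np.sqrt(d)               # pre-activations
    t = np.tanh(z)
    phi1 = 1.0 - t**2                    # tanh'
    phi2 = -2.0 * t * phi1               # tanh''

    p = m * d + m                        # parameter order: rows of W, then v
    H = np.zeros((p, p))
    for i in range(m):
        sl = slice(i * d, (i + 1) * d)
        # d^2 f / dw_i^2 = v_i * tanh''(z_i) * x x^T / (sqrt(m) * d)
        H[sl, sl] = v[i] * phi2[i] * np.outer(x, x) / (np.sqrt(m) * d)
        # d^2 f / dw_i dv_i = tanh'(z_i) * x / sqrt(m * d)
        cross = phi1[i] * x / np.sqrt(m * d)
        H[sl, m * d + i] = cross
        H[m * d + i, sl] = cross
        # d^2 f / dv^2 = 0 (f is linear in v)
    return np.max(np.abs(np.linalg.eigvalsh(H)))

narrow = hessian_spectral_norm(m=32)
wide = hessian_spectral_norm(m=256)
print(narrow, wide)  # the wider network has a smaller Hessian spectral norm
```

The deep-network bound in Section 4 is of course far more delicate (it must control all layers over the spectral-norm ball), but the width dependence is already visible in this two-layer toy case.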
For a normalized total loss function $\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \hat{y}_i)$, $\hat{y}_i = f(\theta; x_i)$, with predictor or neural network model $f$ parameterized by vector $\theta$ and data points $\{x_i, y_i\}_{i=1}^n$, when $\ell$ corresponds to the square loss we show that the total loss function satisfies RSC on a suitable restricted set $Q^t_\kappa \subset \mathbb{R}^p$ (Definition 5.2 in Section 5) at step $t$ as long as $\big\| \frac{1}{n} \sum_{i=1}^n \nabla_\theta f(\theta_t; x_i) \big\|_2^2 = \Omega\big(\frac{1}{\sqrt{m}}\big)$. We also present similar results for general losses, for which additional assumptions are needed. We show that the RSC property implies a Restricted Polyak-Łojasiewicz (RPL) condition on $Q^t_\kappa$, in turn implying a geometric one-step decrease of the loss towards the minimum in $Q^t_\kappa$, and subsequently implying geometric decrease of the loss towards the minimum in the large (layerwise spectral norm) ball $B^{\mathrm{Spec}}_{\rho,\rho_1}(\theta_0)$. The geometric convergence due to RSC is a novel approach in the context of deep learning optimization which does not depend on properties of the NTK. Thus, the RSC condition provides an alternative sufficient condition for geometric convergence for deep learning optimization to the widely-used NTK condition. The rest of the paper is organized as follows. We briefly present related work in Section 2 and discuss the problem setup in Section 3. We establish the Hessian spectral norm bound in Section 4 and introduce the RSC-based optimization analysis in Section 5. We present experimental results corresponding to the RSC condition in Section 6 and conclude in Section 7. All technical proofs are in the Appendix.
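The RSC condition above is a property of the GD trajectory, so it can be monitored empirically. The following is a hypothetical small-scale illustration (our own setup, not the paper's experiments): run GD on the square loss for a two-layer tanh network and track the RSC-related quantity $\| \frac{1}{n} \sum_i \nabla_\theta f(\theta_t; x_i) \|_2^2$ alongside the loss, checking that it stays bounded away from zero along the path.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 200
X = rng.standard_normal((n, d)) / np.sqrt(d)   # inputs with ~unit norm
y = rng.standard_normal(n)                     # random regression targets
W = rng.standard_normal((m, d))                # hidden weights, N(0,1) init
v = rng.standard_normal(m)                     # output weights

def forward(W, v, X):
    T = np.tanh(X @ W.T / np.sqrt(d))          # (n, m) hidden activations
    return T @ v / np.sqrt(m), T

losses, rsc = [], []
eta = 1.0
for step in range(300):
    f, T = forward(W, v, X)
    r = f - y                                  # residuals
    losses.append(0.5 * np.mean(r**2))
    # per-example gradients of the predictor f (not the loss)
    Gv = T / np.sqrt(m)                        # (n, m): df/dv per example
    A = (1.0 - T**2) * v / np.sqrt(m)          # (n, m); df/dW_i = outer(A[i], X[i]) / sqrt(d)
    GW_avg = A.T @ X / (n * np.sqrt(d))        # (m, d): average of df/dW over examples
    # squared norm of the average predictor gradient (the RSC-related quantity)
    rsc.append(np.sum(GW_avg**2) + np.sum(Gv.mean(axis=0)**2))
    # gradient of the square loss and the GD update
    gv = Gv.T @ r / n
    gW = (A * r[:, None]).T @ X / (n * np.sqrt(d))
    v -= eta * gv
    W -= eta * gW

print(losses[0], losses[-1])  # final loss is much smaller than the initial loss
print(min(rsc))               # the RSC quantity stays strictly positive along the path
```

In Section 6 the paper reports analogous measurements on real networks; this sketch only shows what tracking the condition looks like mechanically.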

2. RELATED WORK

The literature on gradient descent and variants for deep learning is increasingly large, and we refer the readers to the following surveys for an overview of the field (Fan et al., 2021; Bartlett et al., 2021). Among the theoretical works, we consider (Du et al., 2019; Allen-Zhu et al., 2019; Zou & Gu, 2019; Zou et al., 2020; Liu et al., 2022) as the closest to our work in terms of their study of convergence on multi-layer neural networks. For a literature review on shallow and/or linear networks, we refer to the recent survey (Fang et al., 2021). Due to the rapidly growing related work, we only refer to the most related or recent work for most parts.

Du et al. (2019); Zou & Gu (2019); Allen-Zhu et al. (2019); Liu et al. (2022) considered optimization of the square loss, which we also consider for our main results, and we also present extensions to a more general class of loss functions. Zou & Gu (2019); Zou et al. (2020); Allen-Zhu et al. (2019); Nguyen & Mondelli (2020); Nguyen (2021); Nguyen et al. (2021) analyzed deep ReLU networks. Instead, we consider smooth activation functions, similar to (Du et al., 2019; Liu et al., 2022). The convergence analysis of gradient descent in (Du et al., 2019; Allen-Zhu et al., 2019; Zou & Gu, 2019; Zou et al., 2020; Liu et al., 2022) relied on the near constancy of the NTK for wide neural networks (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019; Liu et al., 2020), which yields certain desirable properties for training using gradient descent based methods. One such property is related to the PL condition (Karimi et al., 2016; Nguyen, 2021), formulated as the PL$^*$ condition in (Liu et al., 2022). Our work uses a different optimization analysis based on RSC (Wainwright, 2019; Negahban et al., 2012).

