RESTRICTED STRONG CONVEXITY OF DEEP LEARNING MODELS WITH SMOOTH ACTIVATIONS

Abstract

We consider the problem of optimizing deep learning models with smooth activation functions. While there exist influential results on this problem from the "near initialization" perspective, we shed considerable new light on it. In particular, we make two key technical contributions for such models with L layers, width m, and initialization variance σ₀². First, for suitable σ₀², we establish an O(poly(L)/√m) upper bound on the spectral norm of the Hessian of such models, considerably sharpening prior results. Second, we introduce a new analysis of optimization based on Restricted Strong Convexity (RSC), which holds as long as the squared norm of the average gradient of the predictors is Ω(poly(L)/√m) for the square loss. We also present results for more general losses. The RSC-based analysis does not need the "near initialization" perspective and guarantees geometric convergence for gradient descent (GD). To the best of our knowledge, ours is the first result establishing geometric convergence of GD based on RSC for deep learning models, thus providing an alternative sufficient condition for convergence that does not depend on the widely used Neural Tangent Kernel (NTK). We share preliminary experimental results supporting our theoretical advances.
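To fix notation for the RSC condition invoked above, the following is the generic form of restricted strong convexity; the particular restricted set Q over which it is required to hold is an assumption here, to be specified by the analysis:

```latex
% Generic Restricted Strong Convexity (RSC): a differentiable loss
% \mathcal{L} satisfies RSC with parameter \alpha > 0 over a set Q
% (the choice of Q is an assumption in this sketch) if, for all
% \theta, \theta' \in Q,
\mathcal{L}(\theta') \;\ge\; \mathcal{L}(\theta)
  + \langle \nabla \mathcal{L}(\theta),\, \theta' - \theta \rangle
  + \frac{\alpha}{2}\, \lVert \theta' - \theta \rVert_2^2 .
```

Unlike ordinary strong convexity, the inequality is only required on the restricted set Q, which is why RSC can hold for non-convex deep learning objectives.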

1. INTRODUCTION

Recent years have seen advances in understanding convergence of gradient descent (GD) and variants for deep learning models (Du et al., 2019; Allen-Zhu et al., 2019; Zou & Gu, 2019; Zou et al., 2020; Liu et al., 2022; Ji & Telgarsky, 2019; Oymak & Soltanolkotabi, 2020; Nguyen, 2021). Despite the fact that such optimization problems are non-convex, a series of recent results have shown that GD converges geometrically and finds a near-global solution "near initialization" for wide networks. Such analysis is typically done based on the Neural Tangent Kernel (NTK) (Jacot et al., 2018), in particular by showing that the NTK is positive definite "near initialization," in turn implying that the optimization problem satisfies a condition closely related to the Polyak-Łojasiewicz (PL) condition, which in turn implies geometric convergence to a global minimum (Liu et al., 2022; Nguyen, 2021). Such results have been generalized to more flexible forms of "lazy learning" where similar guarantees hold (Chizat et al., 2019). However, there are concerns regarding whether such "near initialization" or "lazy learning" truly explains the optimization behavior in realistic deep learning models (Geiger et al., 2020; Yang & Hu, 2020; Fort et al., 2020; Chizat et al., 2019). Our work focuses on optimization of deep models with smooth activation functions, which have become increasingly popular in recent years (Du et al., 2019; Liu et al., 2022; Huang & Yau, 2020). Much of the theoretical convergence analysis of GD has focused on ReLU networks (Allen-Zhu et al., 2019; Nguyen, 2021). Some progress has also been made for deep models with smooth activations, but existing results are based on a variant of the NTK analysis, and the requirements on the width of
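The NTK-to-convergence chain sketched above can be made concrete as follows; this is the standard PL argument with generic constants μ and β (the constants, not taken from any specific cited work, are assumptions of this sketch):

```latex
% PL condition and the resulting geometric convergence of GD.
% Assume \mathcal{L} is \beta-smooth and satisfies the PL inequality
%   \lVert \nabla \mathcal{L}(\theta) \rVert_2^2
%     \;\ge\; 2\mu \left( \mathcal{L}(\theta) - \mathcal{L}^* \right)
% for some \mu > 0 (for the square loss, \mu can be taken proportional
% to the smallest eigenvalue of the NTK when it is positive definite).
% Then GD, \theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)
% with step size \eta = 1/\beta, satisfies
\mathcal{L}(\theta_{t+1}) - \mathcal{L}^*
  \;\le\; \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)
  \bigl( \mathcal{L}(\theta_t) - \mathcal{L}^* \bigr),
\qquad\text{hence}\qquad
\mathcal{L}(\theta_t) - \mathcal{L}^*
  \;\le\; \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)^{\!t}
  \bigl( \mathcal{L}(\theta_0) - \mathcal{L}^* \bigr).
```

The "near initialization" analyses certify the PL-type inequality only in a neighborhood of the initialization where the NTK stays positive definite, which is the restriction the RSC-based approach of this paper avoids.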

