ON THE MYSTERIOUS OPTIMIZATION GEOMETRY OF DEEP NEURAL NETWORKS

Abstract

Understanding why gradient-based algorithms succeed in practical deep learning optimization is a fundamental and long-standing problem. Most existing works promote the explanation that deep neural networks have smooth and amenable nonconvex optimization geometries. In this work, we argue that this may be an oversimplification of practical deep learning optimization by revealing, through extensive experiments, a mysterious and complex optimization geometry of deep networks. Specifically, we consistently observe two distinct geometric patterns in training various deep networks: a regular smooth geometry and a mysterious zigzag geometry, in which gradients computed in adjacent iterations are strongly negatively correlated. Moreover, this zigzag geometry exhibits a fractal structure in that it appears across a wide range of geometric scales, implying that deep networks can be highly non-smooth in certain local parameter regions. Our results further show that a substantial part of the training progress is achieved under such complex geometry. Therefore, the existing smoothness-based explanations do not fully match practice.
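The negative correlation between adjacent gradients described above is typically quantified by the cosine similarity of consecutive gradient vectors. The following sketch (our own illustration, not code from this paper) shows how such a measurement can be computed; values near -1 indicate the zigzag pattern.

```python
import numpy as np

def gradient_cosine_similarity(g_prev, g_curr):
    """Cosine similarity between two flattened gradient vectors.

    Values near -1 indicate a zigzag pattern: adjacent gradients
    point in nearly opposite directions.
    """
    g_prev = np.ravel(g_prev)
    g_curr = np.ravel(g_curr)
    denom = np.linalg.norm(g_prev) * np.linalg.norm(g_curr)
    if denom == 0.0:
        return 0.0  # degenerate case: at least one zero gradient
    return float(np.dot(g_prev, g_curr) / denom)

# Toy illustration: a perfectly zigzagging trajectory,
# where each gradient exactly reverses the previous one.
g1 = np.array([1.0, -2.0, 0.5])
g2 = -g1
print(gradient_cosine_similarity(g1, g2))  # -1.0
```

In practice the two gradients would come from consecutive optimizer steps (e.g., flattened parameter gradients from two successive minibatch updates).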

1. INTRODUCTION

Training simple neural networks is known to be an NP-complete problem (Blum & Rivest, 1988). However, in modern machine learning, training deep neural networks turns out to be remarkably easy, in that many simple gradient-based optimization algorithms can consistently achieve low loss (Robbins, 2007; Kingma & Ba, 2015; Duchi et al., 2011). This observation has inspired researchers to conjecture that deep neural networks possess specific simple structures that make their nonconvex optimization easy and tractable. Many works over the past decade have sought justifiable explanations, either theoretically or empirically. From the theoretical perspective, many nonconvex optimization theories have been developed to explain the success of deep learning optimization. The key idea is to prove that deep neural networks have certain benign geometries that guarantee convergence to a global minimum in nonconvex optimization. For example, many types of deep neural networks, such as over-parameterized residual networks (Zhang et al., 2019; Allen-Zhu et al., 2019b; Du et al., 2019a), recurrent networks (Allen-Zhu et al., 2019a), nonlinear networks (Zou et al., 2020; Zhou et al., 2016), and linear networks (Frei & Gu, 2021; Zhou & Liang, 2017), have been shown to satisfy the so-called gradient dominant geometry (Karimi et al., 2016). On the other hand, shallow ReLU networks (Soltanolkotabi, 2017; Zhong et al., 2017; Fu et al., 2019; Du et al., 2019b), deep residual networks (Du et al., 2019a), and some nonlinear networks (Mei et al., 2018; Du & Lee, 2018) have been shown to satisfy the local strong convexity geometry. Both types of geometry guarantee the convergence of gradient-based algorithms to a global minimum at a linear rate.
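For concreteness, the gradient dominant geometry mentioned above is the Polyak-Lojasiewicz (PL) condition of Karimi et al. (2016). A standard textbook statement (with constants $\mu$ and $L$; this notation is ours, not the paper's) is:

```latex
% Gradient dominance (Polyak-Lojasiewicz) condition with constant mu > 0:
\frac{1}{2}\,\bigl\|\nabla f(\theta)\bigr\|^{2}
  \;\ge\; \mu\,\bigl(f(\theta) - f^{\ast}\bigr)
  \qquad \text{for all } \theta,

% where f^* is the global minimum value. Combined with L-smoothness,
% gradient descent with step size 1/L converges linearly:
f(\theta_{k}) - f^{\ast}
  \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\,
  \bigl(f(\theta_{0}) - f^{\ast}\bigr).
```

Local strong convexity (the Hessian satisfying $\nabla^{2} f(\theta) \succeq \mu I$ in a neighborhood of a minimizer) implies the PL condition locally, so both geometries yield the linear rate stated above.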
From an empirical perspective, researchers have found that skip connections and batch normalization can substantially improve the smoothness of the optimization geometry of deep networks (Li et al., 2018a; Santurkar et al., 2018; Zhou et al., 2019). Furthermore, other works have found continuous low-loss paths between the minima of deep networks (Verpoort et al., 2020; Draxler et al., 2018). In particular, it has been observed that a simple linear interpolation between the initialization point and the global optimum encounters no significant barrier for many deep networks (Goodfellow

