ON THE MYSTERIOUS OPTIMIZATION GEOMETRY OF DEEP NEURAL NETWORKS

Abstract

Understanding why gradient-based algorithms succeed in practical deep learning optimization is a fundamental and long-standing problem. Most existing works promote the explanation that deep neural networks have smooth and amenable nonconvex optimization geometries. In this work, we argue that this may be an oversimplification of practical deep learning optimization: through extensive experiments, we reveal a mysterious and complex optimization geometry of deep networks. Specifically, we consistently observe two distinct geometric patterns in training various deep networks: a regular smooth geometry and a mysterious zigzag geometry, in which gradients computed in adjacent iterations are extremely negatively correlated. Such a zigzag geometry also exhibits a fractal structure in that it appears over a wide range of geometric scales, implying that deep networks can be highly non-smooth in certain local parameter regions. Moreover, our results show that a substantial part of the training progress is achieved under this complex geometry. Therefore, the existing smoothness-based explanations do not fully match practice.

1. INTRODUCTION

Training simple neural networks is known to be an NP-complete problem (Blum & Rivest, 1988). However, in modern machine learning, training deep neural networks turns out to be incredibly easy in that many simple gradient-based optimization algorithms can consistently achieve low loss (Robbins, 2007; Kingma & Ba, 2015; Duchi et al., 2011). Such an observation inspires researchers to think that there might be specific simple structures of deep neural networks that make nonconvex optimization easy and tractable. Many works have been developed in the past decade to seek justifiable explanations, either theoretically or empirically. Specifically, from a theoretical perspective, many nonconvex optimization theories have been developed to explain the success of deep learning optimization. The key idea is to prove that deep neural networks have certain nice geometries that guarantee convergence to the global minimum in nonconvex optimization. For example, many types of deep neural networks such as over-parameterized residual networks (Zhang et al., 2019; Allen-Zhu et al., 2019b; Du et al., 2019a), recurrent networks (Allen-Zhu et al., 2019a), nonlinear networks (Zou et al., 2020; Zhou et al., 2016), and linear networks (Frei & Gu, 2021; Zhou & Liang, 2017) have been shown to satisfy the so-called gradient dominant geometry (Karimi et al., 2016). On the other hand, shallow ReLU networks (Soltanolkotabi, 2017; Zhong et al., 2017; Fu et al., 2019; Du et al., 2019b), deep residual networks (Du et al., 2019a), and some nonlinear networks (Mei et al., 2018; Du & Lee, 2018) have been shown to satisfy the local strong convexity geometry. Both geometry types guarantee the convergence of gradient-based algorithms to a global minimum at a linear rate.
From an empirical perspective, researchers have found that skip connections and batch normalization in deep networks can substantially improve the smoothness of the optimization geometry (Li et al., 2018a; Santurkar et al., 2018; Zhou et al., 2019). Furthermore, other works have found that there is a continuous low-loss path between the minima of deep networks (Verpoort et al., 2020; Draxler et al., 2018). In particular, it has been observed that a simple linear interpolation between the initialization point and the global optimum encounters no significant barrier for many deep networks (Goodfellow et al., 2015). Moreover, many networks have been shown to possess wide and flat minima that tend to generalize well (Hao et al., 2019; Mulayoff & Michaeli, 2020). Despite the comprehensiveness of these existing works, they all aim to promote nice geometries to demystify the success of deep learning optimization. While this is an important step toward understanding deep learning, the resulting conclusions can be illusory: the theories and empirical evidence may not necessarily reflect the underlying challenges of optimization in practical deep learning. The success of deep learning optimization may be due to complicated, unknown mechanisms that existing works oversimplify. This constitutes the goal of this paper: to investigate the optimization geometry of deep networks and reveal complex and mysterious geometric patterns that may challenge the existing perception of deep learning optimization.

1.1. OUR CONTRIBUTIONS

We apply full-batch gradient descent to train various popular deep networks on different datasets and study the geometry along the optimization trajectory via gradient correlation-based metrics (defined in Section 2). Specifically, we observe the following distinct geometric patterns.

• We consistently observe two distinct geometric patterns in all the experiments: (i) a smooth geometry, where gradients computed in adjacent iterations are highly positively correlated (in terms of the cosine similarity defined in eq. (2)) and point in similar directions, and (ii) a mysterious zigzag geometry, where gradients computed in adjacent iterations are highly negatively correlated and point in opposite directions. Interestingly, we find that for convolutional networks, training starts with the smooth geometry and transitions to the zigzag geometry later on. On the contrary, training of residual networks starts with the zigzag geometry and transitions to the smooth geometry afterward. Moreover, a substantial part of the training loss decrease is attained under the complex zigzag geometry in all of the experiments.

• We further investigate the mysterious zigzag geometry of deep networks and find that it has a complex fractal structure. Specifically, when we zoom into the local geometry by training the deep networks with very small learning rates, we still observe the same zigzag geometry. This shows that the local geometry of deep networks can be highly non-smooth over a wide range of geometric scales. Moreover, the zigzag geometric pattern tends to be stronger when we zoom into a smaller geometric scale. These observations challenge the existing explanations of deep learning optimization based on smooth-type geometries.

• Based on the local statistics of mean gradient correlation, we propose a low-cost geometry-adapted warm-up learning rate scheduling scheme for large-batch training of residual networks. We show that it leads to convergence speed and test performance comparable to those of the original heuristic version with parameter fine-tuning.
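The geometry-adapted warm-up scheme is only summarized above, so the following is a hypothetical sketch of the general idea rather than the authors' actual rule: adapt the learning rate using the mean cosine similarity between adjacent gradients over a recent window, warming up while the local geometry looks smooth and backing off when it looks zigzag. The function name, window size, and growth/shrink factors are all invented for illustration.

```python
import numpy as np

def geometry_adapted_lr(cos_history, eta, eta_max, window=5, grow=1.1, shrink=0.5):
    """Hypothetical warm-up rule: grow eta while the mean cosine similarity
    between adjacent gradients (over a recent window) is positive, i.e. the
    local geometry looks smooth; shrink eta when it is negative (zigzag)."""
    if len(cos_history) < window:
        return eta                       # not enough local statistics yet
    mean_cos = float(np.mean(cos_history[-window:]))
    if mean_cos > 0.0:
        return min(eta * grow, eta_max)  # smooth region: keep warming up
    return max(eta * shrink, 1e-6)       # zigzag region: back off

# Usage: a smooth phase (positive cosines) increases the rate,
# a zigzag phase (negative cosines) decreases it.
print(geometry_adapted_lr([0.9, 0.8, 0.95, 0.85, 0.9], eta=0.01, eta_max=0.1))
print(geometry_adapted_lr([-0.9, -0.8, -0.95, -0.85, -0.9], eta=0.01, eta_max=0.1))
```

Such a rule only needs the already-computed gradients, which is consistent with the low-cost claim; the precise update factors would need tuning in practice.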

2. PRELIMINARIES ON GRADIENT CORRELATION

To understand the optimization geometry in deep learning, we propose investigating the gradients along the optimization trajectory generated by full-batch gradient descent. We consider full-batch gradient descent because it is noiseless and reflects the exact underlying gradient geometry of the objective function. Specifically, given a set of training samples {(x_i, y_i)}_{i=1}^n, where x_i denotes the data and y_i denotes the corresponding label, the training objective function and the full-batch gradient descent (GD) update at each step (k = 0, 1, ...) are written as follows.

(Objective function): L_n(θ) := (1/n) Σ_{i=1}^n ℓ(h_θ(x_i), y_i),
(GD): θ_{k+1} = θ_k − η ∇L_n(θ_k),

where h_θ denotes the neural network model parameterized by θ, ∇ is the gradient operator with respect to the parameter θ, η is the learning rate, and ℓ is the loss function. We consider classification tasks with the cross-entropy loss in this paper. During training, we collect the set of gradients generated along the optimization trajectory of full-batch gradient descent, i.e., {∇L_n(θ_0), ∇L_n(θ_1), ..., ∇L_n(θ_k), ...}. These gradients determine the direction of the model updates and help understand the local optimization geometry of the nonconvex objective function. To provide a quantitative understanding, we investigate the following pairwise gradient correlation of
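The GD update and gradient collection above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code: the objective is a hypothetical ill-conditioned quadratic (standing in for L_n), chosen because a learning rate close to the stability limit along its stiff direction reproduces exactly the kind of negatively correlated adjacent gradients described in this paper.

```python
import numpy as np

def full_batch_gd(grad_fn, theta0, eta=0.1, steps=50):
    """Run full-batch GD and collect the exact gradient at every step."""
    theta, grads = theta0, []
    for _ in range(steps):
        g = grad_fn(theta)       # full-batch gradient: no sampling noise
        grads.append(g)
        theta = theta - eta * g  # GD update: theta_{k+1} = theta_k - eta * grad
    return theta, grads

def adjacent_cosine(grads):
    """Cosine similarity between gradients of adjacent iterations."""
    return [
        float(np.dot(g0, g1) / (np.linalg.norm(g0) * np.linalg.norm(g1) + 1e-12))
        for g0, g1 in zip(grads, grads[1:])
    ]

# Toy objective L(theta) = 0.5 * theta^T A theta with condition number 100.
A = np.diag([1.0, 100.0])
grad_fn = lambda th: A @ th
_, grads = full_batch_gd(grad_fn, np.array([1.0, 1.0]), eta=0.019)
# With eta near 2/100, the gradient component along the stiff direction
# flips sign each step, so adjacent cosines become strongly negative (zigzag).
print(adjacent_cosine(grads)[:3])
```

On a smooth, well-conditioned region the same metric stays close to +1, which is the quantity tracked along the training trajectories in this paper.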

