WHICH LAYER IS LEARNING FASTER? A SYSTEMATIC EXPLORATION OF LAYER-WISE CONVERGENCE RATE FOR DEEP NEURAL NETWORKS

Abstract

Deeply hierarchical structures enable deep neural networks (DNNs) to fit extremely complex target functions. However, the complex interactions between layers also leave the learning process of individual layers poorly understood. This work demonstrates that the shallower layers of DNNs tend to converge faster than the deeper layers. We call this phenomenon Layer Convergence Bias. We also uncover the fundamental reason behind this phenomenon: the flatter local minima of shallower layers make their gradients more stable and predictive, allowing for faster training. Another surprising result is that the shallower layers tend to learn the low-frequency components of the target function, while the deeper layers usually learn the high-frequency components. This is consistent with the recent discovery that DNNs learn lower-frequency function components faster.

1. INTRODUCTION

Over the last decade, breakthrough progress has been made by deep neural networks (DNNs) on a wide range of complicated tasks in computer vision (Krizhevsky et al., 2017), natural language processing (Sutskever et al., 2014), speech recognition (Graves et al., 2013), game playing (Silver et al., 2016), and biomedical prediction (Jumper et al., 2021). Such progress has hinged on a number of advances in hardware technology, dataset construction, and model architecture design. Among them, the invention and application of very deep network architectures played a decisive role. Deepening a network is an effective way to increase its fitting ability. Extensive studies (Delalleau & Bengio, 2011; Eldan & Shamir, 2016; Lu et al., 2017) compared the expressive power of deeper versus wider neural networks and showed that growing the depth polynomially has an effect similar to growing the width exponentially. Therefore, modern DNNs (Simonyan & Zisserman, 2014; He et al., 2016) usually contain tens of layers to ensure sufficient modeling ability for real-world applications.

Although the practical success of deep architectures is indisputable, they make learning hard to predict, since complex interactions arise between layers as they co-adapt to the target (Yosinski et al., 2014). To date, we still have a poor understanding of how different layers learn differently. A widely accepted view relates to the vanishing gradient problem (Hochreiter, 1991; Hochreiter et al., 2001): the gradients become weaker and weaker as they propagate back through the hidden layers, making the shallower layers converge more slowly (Nielsen, 2015). Informally, it seems reasonable that larger gradients bring higher learning speed. Yet, although this view sounds plausible, there is little concrete evidence supporting it. In particular, it is questionable how higher-level features could be built on top of the unstable features extracted by unconverged shallower layers (Raghu et al., 2017). Through a systematic exploration, this paper aims to find a credible answer to the question of which layer's parameters are learning faster towards their convergence point (defined as the convergence rate in this work). Our results lead to somewhat startling discoveries.

Our Contributions. Our starting point is showing that there does not seem to be a reliable positive correlation between the gradient magnitude of a particular layer and its convergence rate. Instead, we find that shallower layers tend to converge faster than deeper ones, even with smaller gradients. We call this phenomenon layer convergence bias. We then turn our attention to uncovering the underlying mechanism behind the faster convergence of shallower layers. Specifically, we find that the depth of a layer has a fundamental effect on its training: the parameters of shallower layers are usually optimized on flatter loss landscapes than those of deeper layers. This finding suggests that the gradients of shallower layers are more predictive and thus allow larger learning rates (LRs) to be used, making convergence faster. Finally, we find that the layer convergence bias is also tied to the frequency of the function being modeled. When fitting a complex target function, the shallower layers tend to fit its low-frequency (usually simpler) components, whereas the deeper layers struggle to fit the remaining high-frequency components. This is consistent with the recent discovery that DNNs prioritize learning the low-frequency components of the target function, while learning the typically more complex high-frequency components much more slowly (Rahaman et al., 2019). This finding provides another perspective for understanding why deeper layers learn more slowly. We believe that understanding the roots of such a fundamental convergence bias gives us a better grasp of the complicated learning process of DNNs and, in turn, can motivate further algorithmic progress in the deep learning community.

This paper is organized as follows. In Section 2, we introduce our method for measuring the convergence speed of different layers and formally define the layer convergence bias. In Section 3, we examine the relationship between gradient magnitude and convergence rate, and show that shallower layers tend to converge faster even with smaller gradients. In Section 4, we analyze the mechanism behind the layer convergence bias in DNN training. The layer-frequency correspondence is demonstrated in Section 5, and the practical significance of layer convergence bias is presented in Section 6. We further discuss related work in Section 7 and conclude in Section 8.
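As a concrete note on how the gradient-versus-convergence comparison of Section 3 can be instrumented, per-layer gradient magnitudes can be logged during ordinary training. The following is a minimal PyTorch sketch, not the paper's actual experimental code; the training-loop names (model, criterion, loader, optimizer) are illustrative assumptions:

```python
import torch
import torch.nn as nn

def layerwise_grad_norms(model: nn.Module) -> dict:
    """Return the L2 norm of the gradient of each named parameter.

    Call after loss.backward() and before optimizer.step(), so that
    .grad holds the gradients of the current mini-batch.
    """
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            norms[name] = param.grad.detach().norm(2).item()
    return norms

# Illustrative usage inside a standard training step:
# for x, y in loader:
#     optimizer.zero_grad()
#     loss = criterion(model(x), y)
#     loss.backward()
#     history.append(layerwise_grad_norms(model))  # log before the update
#     optimizer.step()
```

Comparing these logged norms with the per-layer convergence rates defined in Section 2 is one way to test whether larger gradients actually imply faster convergence.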

2. LAYER CONVERGENCE BIAS

The deep architecture of DNNs is arguably one of the most important factors behind their powerful fitting abilities. But along with the benefits brought by deep structures, extra complexities arise in the training process. So far, we do not have a firm conclusion about whether some layers learn faster than others. To examine the convergence progress of a DNN, a common practice is to check its loss curve. However, this is not applicable for comparing the convergence of different layers. In this work, we define a layer-wise measurement of convergence as follows.

Definition 2.1 (Layer-wise convergence rate) At training time $t$, let the deep neural network with $L$ layers $\{T_l^{(t)}\}_{l=1}^{L}$ be $f(x) = (T_L^{(t)} \circ T_{L-1}^{(t)} \circ \cdots \circ T_1^{(t)})(x) : \mathbb{R}^i \to \mathbb{R}^o$, where $i, o$ are the dimensions of its inputs and outputs. We use $\theta_l^{(t)}$ to denote the parameters of layer $T_l$ at time $t$. Assuming $\theta_l$ can finally converge to its optimal point $\theta_l^*$ when $t \to \infty$, we define the convergence rate of $\theta_l$ during the time interval $[t_1, t_2]$ to be

$$C_l^{(t_1, t_2)} = \frac{1}{t_2 - t_1} \cdot \frac{\|\theta_l^{(t_1)} - \theta_l^*\|_2 - \|\theta_l^{(t_2)} - \theta_l^*\|_2}{\|\theta_l^{(t_0)} - \theta_l^*\|_2},$$

where $t_0$ denotes the time point when training starts.

In this definition, the numerator $\|\theta_l^{(t_1)} - \theta_l^*\|_2 - \|\theta_l^{(t_2)} - \theta_l^*\|_2$ measures how much the distance from the parameter $\theta_l$ to the optimal point is shortened during the period $[t_1, t_2]$. The denominator $\|\theta_l^{(t_0)} - \theta_l^*\|_2$ is the distance from the initial point to the convergence point; its primary function is to normalize the speed so that the convergence of different layers can be compared with one another. Thus, the convergence rate of $\theta_l$ can be understood as the ratio of normalized distance to time. Classical optimization works (Yi et al., 1999; Nesterov, 2003) define the rate of convergence for $\theta$ as $\lim_{k \to \infty} \|\theta^{(k+1)} - \theta^*\|_2 / \|\theta^{(k)} - \theta^*\|_2$. It focuses on measuring an exponential-level convergence when the
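To make Definition 2.1 concrete, the convergence rate of a single layer can be estimated from parameter snapshots saved during training. The sketch below is an illustrative assumption about how one might do this, not the paper's code; in particular, it uses a late checkpoint as a practical proxy for the true limit point $\theta_l^*$, which the definition obtains only as $t \to \infty$:

```python
import torch

def convergence_rate(theta_t0: torch.Tensor,
                     theta_t1: torch.Tensor,
                     theta_t2: torch.Tensor,
                     theta_star: torch.Tensor,
                     t1: float, t2: float) -> float:
    """Estimate C_l^{(t1,t2)} of Definition 2.1 for one layer.

    theta_t0, theta_t1, theta_t2: the layer's flattened parameters at the
    start of training and at times t1 < t2; theta_star: a proxy for the
    converged parameters (e.g. the final checkpoint).
    """
    d1 = torch.norm(theta_t1 - theta_star, p=2)  # distance at t1
    d2 = torch.norm(theta_t2 - theta_star, p=2)  # distance at t2
    d0 = torch.norm(theta_t0 - theta_star, p=2)  # initial distance (normalizer)
    return ((d1 - d2) / d0 / (t2 - t1)).item()

# A layer's parameters can be flattened into a single vector with
# torch.nn.utils.parameters_to_vector(layer.parameters()) before snapshotting.
```

Because the numerator and denominator are both distances to the same optimum, the resulting rate is scale-free, which is what allows layers of very different sizes and depths to be compared on an equal footing.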