WHICH LAYER IS LEARNING FASTER? A SYSTEMATIC EXPLORATION OF LAYER-WISE CONVERGENCE RATE FOR DEEP NEURAL NETWORKS

Abstract

Deeply hierarchical structures enable deep neural networks (DNNs) to fit extremely complex target functions. However, the complex interactions between layers also make the learning process of any particular layer poorly understood. This work demonstrates that the shallower layers of DNNs tend to converge faster than the deeper layers; we call this phenomenon Layer Convergence Bias. We also uncover the fundamental reason behind it: the flatter local minima of shallower layers make their gradients more stable and predictive, allowing for faster training. Another surprising result is that the shallower layers tend to learn the low-frequency components of the target function, while the deeper layers usually learn the high-frequency components. This is consistent with the recent discovery that DNNs learn lower-frequency objects faster.

1. INTRODUCTION

Over the last decade, breakthrough progress has been made by deep neural networks (DNNs) on a wide range of complicated tasks in computer vision (Krizhevsky et al., 2017), natural language processing (Sutskever et al., 2014), speech recognition (Graves et al., 2013), game playing (Silver et al., 2016), and biomedical prediction (Jumper et al., 2021). Such progress hinged on a number of advances in hardware technology, dataset construction, and model architecture design. Among them, the invention and application of very deep network architectures play a decisive role. Deepening the network is an effective way to increase its fitting ability. Extensive studies (Delalleau & Bengio, 2011; Eldan & Shamir, 2016; Lu et al., 2017) compared the power of deeper versus wider neural networks and showed that polynomial growth in depth has a similar effect to exponential growth in width. Therefore, modern DNNs (Simonyan & Zisserman, 2014; He et al., 2016) usually contain tens of layers to ensure sufficient modeling ability for real-world applications.

Although the practical success of deep architectures is indisputable, they make learning hard to predict, since complex interactions happen between layers as they co-adapt to the target (Yosinski et al., 2014). By now, we still have a poor understanding of how different layers learn differently. Currently, a widely accepted view relates to the vanishing gradient problem (Hochreiter, 1991; Hochreiter et al., 2001): the gradients become weaker and weaker as they move back through the hidden layers, making the shallower layers converge more slowly (Nielsen, 2015). Informally, it seems reasonable that larger gradient values bring a higher learning speed. Even though this view somewhat makes sense, there is little concrete evidence supporting it. In particular, it is dubious how higher-level features can be built on the unstable features extracted by unconverged shallower layers (Raghu et al., 2017). This paper aims to find a credible answer, through a systematic exploration, to the question of which layer's parameters learn faster towards the convergence point (defined as the convergence rate in this work). Our results lead to somewhat startling discoveries.

Our Contributions. Our starting point is illustrating that there does not seem to be a reliable positive correlation between the gradient magnitude and the convergence rate of a particular layer. Instead, we find that shallower layers tend to converge faster than deeper ones, even with smaller gradients. We call this phenomenon layer convergence bias. We then turn our attention to the mechanism underlying the faster convergence of shallower layers. Specifically, we find that the depth of a layer has a fundamental effect on its training: the parameters of shallower layers are usually optimized on flatter landscapes than those of deeper layers. This finding reveals that the gradients of shallower layers may be more predictive, and thus allow larger learning rates (LRs), making convergence faster. Finally, we find that the layer convergence bias is also tied to the frequency of the function being modeled. When fitting a complex target function, the shallower layers tend to fit the low-frequency (usually simpler) components, while the deeper layers struggle to fit the remaining high-frequency components.
This is consistent with the recent discovery that DNNs prioritize learning the low-frequency components of the target function, while learning very slowly on the high-frequency components, which tend to be more complex (Rahaman et al., 2019). This finding provides another perspective on why deeper layers learn more slowly. We believe that understanding the roots of such a fundamental convergence bias can give us a better grasp of the complicated learning process of DNNs; in turn, it can motivate more in-depth algorithmic progress for the deep learning community.

This paper is organized as follows. In Section 2, we introduce our method for measuring the convergence speed of different layers and formally define the layer convergence bias. In Section 3, we examine the relationship between gradient magnitude and convergence rate, and show that shallower layers tend to converge faster even with smaller gradients. In Section 4, we analyze the mechanism behind the layer convergence bias in DNN training. The layer-frequency correspondence is demonstrated in Section 5, and the practical significance of layer convergence bias is presented in Section 6. We further discuss related work in Section 7 and conclude in Section 8.

2. LAYER CONVERGENCE BIAS

The deep architecture of DNNs is arguably one of the most important factors behind their powerful fitting abilities. Along with the benefits of deep structures, however, come extra complexities in the training process. So far, we do not have a firm conclusion about whether some layers learn faster than others. For examining the convergence of a DNN as a whole, a common practice is to check its loss curve; however, this is not applicable for comparing the convergence of different layers. We therefore define a layer-wise convergence measurement as follows.

Definition 2.1 (Layer-wise convergence rate). At training time $t$, let the deep neural network with $L$ layers $\{T^{(t)}_l\}_{l=1}^{L}$ be $f(x) = (T^{(t)}_L \circ T^{(t)}_{L-1} \circ \cdots \circ T^{(t)}_1)(x): \mathbb{R}^i \to \mathbb{R}^o$, where $i, o$ are the dimensions of its inputs and outputs. We use $\theta^{(t)}_l$ to denote the parameters of the $l$-th layer $T^{(t)}_l$. Assuming that $\theta^{(t)}_l$ finally converges to its optimal point $\theta^*_l$ as $t \to \infty$, we define the convergence rate of $\theta_l$ during the time interval $[t_1, t_2]$ as

$$C^{(t_1,t_2)}_l = \frac{1}{t_2 - t_1} \cdot \frac{\|\theta^{(t_1)}_l - \theta^*_l\|_2 - \|\theta^{(t_2)}_l - \theta^*_l\|_2}{\|\theta^{(t_0)}_l - \theta^*_l\|_2},$$

where $t_0$ denotes the time point when training starts.

In this definition, the numerator $\|\theta^{(t_1)}_l - \theta^*_l\|_2 - \|\theta^{(t_2)}_l - \theta^*_l\|_2$ measures how much the distance from the parameters $\theta_l$ to the optimal point is shortened during the period $[t_1, t_2]$. The denominator $\|\theta^{(t_0)}_l - \theta^*_l\|_2$ is the distance from the initial point to the convergence point; its primary function is to normalize the speed so that the convergence of different layers can be compared. Thus, the convergence rate of $\theta_l$ can be understood as a ratio of normalized distance to time. Classical optimization works (Yi et al., 1999; Nesterov, 2003) define the rate of convergence of $\theta$ as $\lim_{k \to \infty} \|\theta^{(k+1)} - \theta^*\|_2 / \|\theta^{(k)} - \theta^*\|_2$, which measures exponential-level convergence as the number of optimization steps goes to infinity. Since the difference in convergence rates between layers usually appears at an early stage of training, and is not large enough to compare at an exponential level, we define our new convergence metric to present the difference more clearly.

Observation 2.1 (Layer convergence bias). For $l_1 < l_2$, there exists $\bar{t} > 0$ such that $C^{(t_1,t_2)}_{l_1} > C^{(t_1,t_2)}_{l_2}$ when $t_1 < t_2 < \bar{t}$.

Layer convergence bias indicates that in an early training phase $t < \bar{t}$, the parameters $\theta_{l_1}$ of a shallower layer $l_1$ tend to move to $\theta^*_{l_1}$ faster than those of a deeper layer $l_2$ move to $\theta^*_{l_2}$. In the following, we use both synthetic and real datasets to show that the layer convergence bias appears for both fully-connected neural networks (FCNNs) and convolutional neural networks (CNNs).
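To make the metric concrete, below is a minimal sketch of how $C^{(t_1,t_2)}_l$ could be computed from stored parameter snapshots. The function name and the convention of passing flattened parameter tensors are our own illustration, not code from the paper; in practice the final training checkpoint stands in for $\theta^*_l$.

```python
import torch

def convergence_rate(theta_t1, theta_t2, theta_t0, theta_star, t1, t2):
    """C_l^{(t1,t2)} from Definition 2.1: the normalized amount by which the
    distance to the convergence point theta_star shrinks per unit time.
    All tensor arguments are flattened parameters of a single layer l."""
    dist = lambda a, b: torch.linalg.vector_norm(a - b)  # L2 distance
    shortened = dist(theta_t1, theta_star) - dist(theta_t2, theta_star)
    return (shortened / dist(theta_t0, theta_star) / (t2 - t1)).item()
```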

3. VERIFICATION OF LAYER CONVERGENCE BIAS

In this section, we substantiate the central claim of this work. First, we use FCNNs to show that shallower layers tend to converge faster than deeper layers on a regression task, even when their gradient values are smaller. We then use CNNs with modern architectures to verify that layer convergence bias is a common phenomenon in practical applications. All experimental settings in this work can be found in Appendix A.1.

3.1. LAYER CONVERGENCE BIAS IN FULLY-CONNECTED NETWORKS

For FCNNs, we construct a simple regression task to demonstrate that layers with smaller gradients do not necessarily learn more slowly than layers with larger gradients. The fitting target is $f(x) = \sin(x) + \frac{1}{3}\sin(3x) + \frac{1}{10}\sin(10x) + \frac{1}{30}\sin(30x)$, with the mean square error loss for training. First, we use an FCNN with Sigmoid activations as a simple example. In the following analysis, the first fully-connected layer (1-32) is named Layer 1, and the subsequent two layers (32-32) are called Hidden layer 1 and Hidden layer 2, respectively. The gradient values and convergence processes of these layers are shown in Fig. 1 (a). Two observations can be obtained from the plots: 1) The gradient of Hidden layer 1 is nearly always smaller than the gradient of Hidden layer 2. 2) Although the shallower layers have smaller gradients, they converge faster. For the first 50 epochs, the shallower layers move faster towards their convergence points (e.g., $C^{(t_0,t_{50})}_{\text{Layer 1}} \approx 0.012$, $C^{(t_0,t_{50})}_{\text{Hidden layer 1}} \approx 0.009$, $C^{(t_0,t_{50})}_{\text{Hidden layer 2}} \approx 0.006$), which is inconsistent with the previous view that higher gradients lead to faster learning (Nielsen, 2015).

To further validate the above results with a deeper network, we adopt residual connections (He et al., 2016) for the FCNN (a deep network fails to train on this task without residual connections) and use the ReLU activation function. The residual FCNN with four residual blocks of width 128 shows similar results to the shallow FCNN without residual connections (see Fig. 1 (b)). In this case, the difference in layer-wise convergence rate can be observed even earlier (i.e., $C^{(t_0,t_5)}_{\text{Res-Block 1}} \approx 2\, C^{(t_0,t_5)}_{\text{Res-Block 4}}$), which shows that the layer convergence bias also happens for deeper FCNNs with residual connections. It is noteworthy that our convergence metric is crucial for observing the layer convergence bias, as elaborated in Appendix A.2. Clearly, these results cannot be reconciled with the previous view that larger gradients bring a higher learning speed to deeper layers, at least for the DNNs used in this work. Instead, from the optimization point of view, the parameters of shallower layers learn faster towards convergence.
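For reference, here is a brief sketch of this synthetic setup in PyTorch. The exact block internals (one linear layer plus ReLU per block) are our assumption, since the text only specifies four residual blocks of width 128.

```python
import torch
import torch.nn as nn

def target(x):
    """The four-component sine target of the regression task."""
    return (torch.sin(x) + torch.sin(3 * x) / 3
            + torch.sin(10 * x) / 10 + torch.sin(30 * x) / 30)

class ResBlock(nn.Module):
    """One width-128 residual block: z_l = z_{l-1} + T_l(z_{l-1})."""
    def __init__(self, width=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(width, width), nn.ReLU())
    def forward(self, z):
        return z + self.fc(z)

class ResFCNN(nn.Module):
    """Input layer, four residual blocks, and a linear output layer."""
    def __init__(self, width=128, n_blocks=4):
        super().__init__()
        self.inp = nn.Linear(1, width)
        self.blocks = nn.Sequential(*[ResBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, 1)
    def forward(self, x):
        return self.out(self.blocks(self.inp(x)))
```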

3.2. LAYER CONVERGENCE BIAS IN CONVOLUTIONAL NETWORKS

Real-world datasets are very different from the synthetic data used in our previous experiments. In order to use the layer convergence bias to understand and improve DNNs in real applications, it is important to verify whether it holds for CNNs on images. In the following experiments, we examine the layer-wise convergence process on the ImageNet (Russakovsky et al., 2015) dataset with both ResNet-50 (He et al., 2016) and VGG-19 (Simonyan & Zisserman, 2014). We train the CNNs for 120 epochs with learning rate decay at the 50th epoch (0.1 → 0.01) and the 100th epoch (0.01 → 0.001). The training processes are shown in Fig. 2.

For ResNet-50, we visualize the learning process of the first convolutional layer and its subsequent four stages. One can easily observe that at the beginning of training, the shallower layers converge much faster than the deeper layers ($C^{(t_0,t_{20})}_{\text{Stage 1}} \approx 3\, C^{(t_0,t_{20})}_{\text{Stage 4}}$). However, after the learning rate decays at the 50th epoch, the deeper layers begin to learn effectively and achieve a higher convergence rate than the shallower layers ($C^{(t_{50},t_{60})}_{\text{Stage 1}} \approx 0.5\, C^{(t_{50},t_{60})}_{\text{Stage 4}}$). We conjecture that the initial learning rate is too large for the deeper layers to learn. For VGG-19, we visualize its 1st, 5th, 9th, 13th, and 17th layers. This network shows an even more significant convergence difference between layers than ResNet-50. In the first training stage with the initial learning rate, $\|\theta^{(t_5)}_l - \theta^*_l\| > \|\theta^{(t_0)}_l - \theta^*_l\|$ for $l \in \{5, 9, 13, 17\}$, which means that all layers but the first one even slightly diverge. Usually, divergence appears when the learning rate is too large. This phenomenon confirms that the deeper layers cannot learn effectively with the large learning rate at the beginning.

The experiments on FCNNs and CNNs verify that layer convergence bias is a common phenomenon for DNNs. In Section 5 and Appendices A.3 and A.4, we discuss the factors that affect the phenomenon, and some in-depth findings they reveal.
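Measuring these curves only requires per-epoch parameter snapshots. Below is a minimal bookkeeping sketch, assuming PyTorch; `model`, the `stages` mapping, and `train_one_epoch` are placeholders rather than code from the paper, and the final epoch's weights stand in for the convergence point $\theta^*_l$.

```python
import torch

# Assumed to exist: model, stages (name -> nn.Module), train_one_epoch(model).
snapshots = []  # one dict of per-stage flattened parameter copies per epoch

for epoch in range(120):
    train_one_epoch(model)
    snapshots.append({
        name: torch.cat([p.detach().flatten().cpu() for p in m.parameters()])
        for name, m in stages.items()
    })

final = snapshots[-1]  # treat the last epoch as the convergence point theta*
curves = {
    name: [torch.linalg.vector_norm(s[name] - final[name]).item()
           for s in snapshots]
    for name in final
}
```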

4. MECHANISM BEHIND LAYER CONVERGENCE BIAS

So far, our investigation shows that the seemingly-right perspective linking layer-wise gradient magnitude and convergence rate is tenuous at best. Both FCNNs and CNNs demonstrate an evident bias that shallower layers learn faster. Can we explain why this is the case?

Gradient Predictiveness. Since gradient values cannot determine the convergence rate, we wonder whether the directions of the gradients play a more critical role: more chaotic update directions make convergence slower. Here we examine the gradient predictiveness (Santurkar et al., 2018) of different layers. If the gradient behavior is "predictive", the gradient directions change little when 1) the gradients are calculated with different batches of data, and 2) the parameters of other layers update. Predictiveness can thus be understood simply as the stability of the gradient direction.

Definition 4.1. Let $(x^{(t)}, y^{(t)})$ be a batch of input-label pairs for the DNN to train on at time $t$, and $(x'^{(t)}, y'^{(t)})$ be another batch of data. We define the gradient predictiveness of the $l$-th layer at time $t$ w.r.t. data as the cosine similarity

$$\mathrm{sim}(G_{l,t}, G'_{l,t}) = \frac{G_{l,t} \cdot G'_{l,t}}{\|G_{l,t}\|\,\|G'_{l,t}\|} \in [-1, 1].$$

Likewise, the gradient predictiveness w.r.t. parameters is defined as $\mathrm{sim}(G_{l,t}, G''_{l,t})$, where

$$G_{l,t} = \nabla_{\theta^{(t)}_l} \mathcal{L}(\theta^{(t)}_1, \ldots, \theta^{(t)}_L; x^{(t)}, y^{(t)}),$$
$$G'_{l,t} = \nabla_{\theta^{(t)}_l} \mathcal{L}(\theta^{(t)}_1, \ldots, \theta^{(t)}_L; x'^{(t)}, y'^{(t)}),$$
$$G''_{l,t} = \nabla_{\theta^{(t)}_l} \mathcal{L}(\theta^{(t+1)}_1, \ldots, \theta^{(t+1)}_{l-1}, \theta^{(t)}_l, \theta^{(t+1)}_{l+1}, \ldots, \theta^{(t+1)}_L; x^{(t)}, y^{(t)}).$$

Here, $G_{l,t}$ is the gradient of $\theta_l$; $G'_{l,t}$ is the gradient of the same layer with another batch of data, while $G''_{l,t}$ is the gradient after all the other layers have updated to new values. Therefore, $\mathrm{sim}(G_{l,t}, G'_{l,t})$ indicates the stability of gradients across data batches, and $\mathrm{sim}(G_{l,t}, G''_{l,t})$ reflects whether the currently estimated gradient remains a consistent descent direction when the loss landscape is perturbed by the updates of other layers' parameters. The gradient predictiveness during training is shown in Fig. 3, where Res-Block 1 has more predictive gradients than Res-Block 4.

Visualizing the Loss Landscapes. We are curious about why the gradients of deeper layers have poorer predictiveness. A hypothesis is that the loss landscapes of deeper layers are more rugged, making the parameters fluctuate more. A straightforward way to validate this hypothesis is to plot the loss landscape of the parameters. To do this for a particular layer $l$, one can choose a central point $\theta^*_l$ and two direction vectors $d_{l,1}, d_{l,2}$. The loss landscape can then be drawn as $f(\beta_1, \beta_2) = \mathcal{L}(\theta^*_l + \beta_1 d_{l,1} + \beta_2 d_{l,2})$ in 3D space, with $\beta_1, \beta_2$ forming a simplified parameter space. In this work, we generate random Gaussian directions for different layers and normalize them to have the same norm as the corresponding layer. Specifically, for a fully connected layer we make the replacement $d_l \leftarrow \frac{d_l}{\|d_l\|}\|\theta^*_l\|$; for a convolutional layer, we use filter-wise normalization $d^k_l \leftarrow \frac{d^k_l}{\|d^k_l\|}\|\theta^{*k}_l\|$ as in (Li et al., 2018), where $d^k_l$ represents the $k$-th filter of the $l$-th layer. We set both $\beta_1$ and $\beta_2$ in the domain $[-1, 1]$.

Landscapes for FCNN. The loss landscapes of the four residual blocks of the FCNN are shown in Fig. 4. For the shallower blocks, the surfaces are flatter near the minimizer, meaning that the gradient magnitudes may be small. However, small gradients do not necessarily lead to slow learning in this case: combined with the gradient predictiveness discussed above, a flatter loss landscape may lead to more consistent gradient directions, making learning smoother.

Landscapes for CNNs. The loss landscapes of ResNet-50 and VGG-19 on ImageNet are shown in Fig. 5. Interestingly, deep convolutional networks with and without residual connections present totally different loss landscapes. For ResNet-50, the landscapes near the convergence point $\theta^*_l$ are smooth and nearly convex, making the network easier to train. On the contrary, VGG-19 has much more shattered landscapes; the initial iterations probably lie in chaotic regions, prohibiting its training (Balduzzi et al., 2017). This may explain VGG's much less efficient convergence towards the optimal point compared to ResNet in the initial phase (Fig. 2). The shallower layers of both networks have flatter minima, making them converge faster than the deeper layers. The plots for all layers can be found in Appendix A.5.

Comparing different layers of the CNNs, the answer to layer convergence bias becomes clearer. The key difference between the loss landscapes of different layers of ResNet-50 is the sharpness of the local minima (Fig. 5 (a,b)). We conjecture this follows from the well-known fact that the shallower layers of CNNs tend to learn general features, which are applicable to various datasets and tasks, while the deeper layers usually learn task-specific features (Yosinski et al., 2014). Before our work, Zeiler & Fergus (2014) also revealed that the general features in a five-layer CNN stabilize faster than the specific features. Since the general features are more evenly distributed, they usually cause less fluctuation during training, leading to flatter optima. Theoretically, flatter minimizers are easier for SGD optimizers to find (Pan et al., 2020). For VGG-19, its shallower and deeper layers also have flatter and sharper minima, respectively (Fig. 5 (c,d)). The shattered loss landscapes of its deeper layers may also explain its inefficient learning with a large learning rate (Fig. 2 (b)).

Here we summarize the mechanism behind layer convergence bias: the parameters of shallower layers are easier to optimize due to their flatter loss landscapes. At a higher level, shallower layers learn general features, which are usually easier.
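As a reference, a minimal sketch of the data-batch variant of this measurement is given below, assuming PyTorch. The helper names, and the use of `named_children()` as the layer granularity, are our own illustration rather than the paper's released code.

```python
import torch
import torch.nn.functional as F

def layer_gradients(model, loss_fn, batch):
    """Per-layer flattened gradients for one (inputs, labels) batch."""
    model.zero_grad()
    x, y = batch
    loss_fn(model(x), y).backward()
    return {name: torch.cat([p.grad.flatten().clone()
                             for p in m.parameters() if p.grad is not None])
            for name, m in model.named_children()
            if any(p.requires_grad for p in m.parameters())}

def predictiveness_wrt_data(model, loss_fn, batch_a, batch_b):
    """sim(G_{l,t}, G'_{l,t}) from Definition 4.1: cosine similarity between
    a layer's gradients computed on two different data batches."""
    ga = layer_gradients(model, loss_fn, batch_a)
    gb = layer_gradients(model, loss_fn, batch_b)
    return {name: F.cosine_similarity(ga[name], gb[name], dim=0).item()
            for name in ga}
```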

5. DEEPER LAYERS FIT THE HIGH-FREQUENCY COMPONENTS

Recent advances in understanding the learning process of DNNs (Rahaman et al., 2019; Ronen et al., 2019; Xu & Zhou, 2021) revealed that the low-frequency components of the target function are fitted much faster than the high-frequency components. A natural question is whether there is some inherent link between layer convergence bias and this result. In this section, we investigate the answer, and surprisingly find that the low-frequency parts are usually fitted by the shallower layers, while the remaining higher frequencies are mainly learned by the deeper layers. This provides an alternative perspective for understanding the layer convergence bias.

The Correspondence for FCNN. With the residual structure, we can straightforwardly visualize what each block of an FCNN learns. Consider the FCNN with one input layer $z_0 = T_0(x): \mathbb{R}^1 \to \mathbb{R}^{128}$, four residual blocks $z_l = T'_l(z_{l-1}) = T_l(z_{l-1}) + z_{l-1}: \mathbb{R}^{128} \to \mathbb{R}^{128}$, $l \in \{1, 2, 3, 4\}$, and an output layer $y = T_5(z_4): \mathbb{R}^{128} \to \mathbb{R}^1$. The whole network can be expressed as

$$y = T_5(z_1 + T_2(z_1) + T_3(z_2) + T_4(z_3)) = T_5(z_1) + T_5(T_2(z_1)) + T_5(T_3(z_2)) + T_5(T_4(z_3))$$

if the output layer $T_5$ is a linear transformation. The fitting results for each block are shown in Fig. 6. It can be seen that the deeper layers tend to fit the more complex components of the target function $y = \sin(x) + \frac{1}{3}\sin(3x) + \frac{1}{10}\sin(10x) + \frac{1}{30}\sin(30x)$. Besides the curvature, the fitted functions are also consistent with the amplitudes of the components. Specifically, the ranges of the four fitted functions are 2.3, 0.7, 0.5, and 0.06, which are similar to those of the four components. This result further confirms the relationship between layers and frequencies.

The Correspondence for CNNs. For CNNs, we verify the layer-frequency correspondence through the response frequency (Xu et al., 2019). In a nutshell, if an input-output mapping $f$ possesses significant high frequencies, then a small change in its input induces a large change in its output. We generate standard Gaussian-distributed inputs $x$ for different residual blocks of ResNet-50 and different layers of VGG-19, and add a small Gaussian perturbation $\Delta x$ to the input. A larger change $\Delta y$ in the layer output means the layer handles higher frequencies. The response frequencies are shown in Fig. 7. In the first 5 epochs of training on ImageNet, the layers of ResNet-50 and VGG-19 do not show significantly different response frequencies. But after about ten epochs, the response frequencies of deeper layers (e.g., Stage 4 of ResNet-50, Layer 13 of VGG-19) increase while the shallower layers show lower response frequencies. Therefore, we conclude that the layer-frequency correspondence also holds for CNNs. Moreover, it is not an innate property of the layers, but a result of the training process.

How does the target frequency affect layer convergence bias? To demonstrate the effect of the layer-frequency correspondence on the layer convergence bias, we fit simpler targets with fewer high-frequency components and observe what happens to the layer-wise convergence rates of the FCNN. In Fig. 8 (a-d), we keep only the several lowest frequencies of the target; e.g., the target function $y = \sin(x)$ is named "Complexity=1", $y = \sin(x) + \frac{1}{3}\sin(3x)$ is named "Complexity=2", etc. After discarding more and more high-frequency components, the deeper layers converge faster and faster; in these cases, the layer convergence bias no longer strictly holds. In Fig. 8 (b), Res-Block 4 converges faster than Res-Block 3 after the 5th epoch. In Fig. 8 (c), Res-Block 4 converges at a similar speed to Res-Block 2, while Res-Block 3 even learns faster than Res-Block 2. It seems that removing the high-frequency component that corresponds to a deep layer can effectively accelerate its training. For CNNs, we observe similar phenomena (Fig. 8 (e-h)): on simpler targets (e.g., CIFAR-10), the deeper layers converge faster than on more complex targets (e.g., CIFAR-100). An implication of this result is that the data complexity may be too low for the model; in practice, the CIFAR datasets only need ResNet-18 to fit well (Wu et al., 2020).

In fact, Rahaman et al. (2019) had shown that different layers have some links to different frequencies, but the authors did not provide further insight into this phenomenon. This work verifies the underlying relationship between layers and fitting frequencies, and connects this relationship to the layer convergence bias.
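A minimal sketch of the response-frequency probe, under our own naming and with the perturbation scale `eps` as an assumed hyperparameter (the text does not specify it):

```python
import torch

@torch.no_grad()
def response_frequency(layer, input_shape, eps=1e-2, n=256):
    """Ratio of output change to input change under a small Gaussian
    perturbation; larger values indicate the layer handles higher frequencies."""
    x = torch.randn(n, *input_shape)       # standard Gaussian inputs
    dx = eps * torch.randn_like(x)         # small Gaussian perturbation
    dy = layer(x + dx) - layer(x)
    return (dy.flatten(1).norm(dim=1) / dx.flatten(1).norm(dim=1)).mean().item()
```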

6. PRACTICAL SIGNIFICANCE

Up to now, we have analyzed the layer convergence bias from a theoretical perspective. This section discusses its practical use in guiding DNN architecture design, and a new explanation for the acceleration effect of transfer learning based on layer convergence bias.

6.1. DNN ARCHITECTURE DESIGN

Modern CNN architectures (He et al., 2016) usually contain layers from narrow to wide (e.g., from 64 channels in the first layer to 2048 channels in the last layer). From the perspective of computational complexity, the narrower shallower layers make the corresponding large feature maps less computation-consuming. Considering the layer convergence bias, deeper layers with larger capacities also make it easier to learn the corresponding high frequencies. Although this is a common design for CNNs, Transformers (Dosovitskiy et al., 2020) usually apply the same architecture to all encoders. For a vision Transformer with 12 encoders, we use encoders of width 2/4/8 to construct three variants. The variants differ only in the arrangement of the encoders; we use W to denote the widths and N to denote the number of each kind of encoder. The configurations are summarized below:

• deeper encoders wider: W = (2, 4, 8), N = (6, 3, 3)
• vanilla architecture: W = (4, 4, 4), N = (4, 4, 4)
• deeper encoders narrower: W = (8, 4, 2), N = (3, 3, 6)

Fig. 9 shows their performance, with best accuracies of 80.75%, 78.88%, and 75.75%, respectively. We find that with the same number of parameters, placing the wider layers deeper results in higher performance. This finding may serve as an effective way to improve model capacity. The causal connection between layer complexity distribution and model performance is discussed in Appendix A.6, and layer convergence bias for ViT is analyzed in Appendix A.7.

6.2. ACCELERATION EFFECT OF TRANSFER LEARNING

Transfer learning (fine-tuning pre-trained models) is a widely used technique that can accelerate model convergence (Shao et al., 2018b;a; Liang & Zheng, 2020). We show the layer convergence curves with and without transfer learning on the Flowers dataset (Nilsback & Zisserman, 2006). When training from scratch (Fig. 10 (a)), the shallower layers converge faster, so that the deeper layers can extract semantic features based on the basic features; the local minimum of Stage 4 is sharp in this case. However, with transfer learning (Fig. 10 (b)), the deeper layers can be built directly on the pre-trained basic features. Stage 4 shows a much higher convergence rate than all other layers, and its loss landscape also becomes flatter. Two observations that deviate from the layer convergence bias are summarized as follows: 1) the pre-trained shallower layers are nearly optimal, so they do not show fast convergence during transfer learning; 2) although the pre-trained deeper layers are not as close to optimal as the shallower layers, their loss landscapes are much flatter than when training from scratch, which makes them converge much faster.

7. RELATED WORK

DNNs with gradient-based training show great potential to fit targets of arbitrary complexity (Hornik et al., 1989; Leshno et al., 1993), given sufficient width. With the advances of the last decade verifying the capability of depth in universal approximators (Delalleau & Bengio, 2011; Eldan & Shamir, 2016; Lu et al., 2017), practitioners tried to reduce the width of neural networks by adding more layers (Simonyan & Zisserman, 2014; He et al., 2016; Huang et al., 2017). We are also inspired by research on the local properties (sharpness/flatness) of loss functions at minima (Keskar et al., 2017; Li et al., 2018) and on the relationship between convergence rate and generalization (Hardt et al., 2016). Furthermore, the LARS optimizer (You et al., 2017) shares some valuable insights on layer convergence, which are discussed in Appendix A.8. In practice, the idea of layer bias has been applied intuitively to accelerate DNN training (Huang et al.; Brock et al., 2017) and to mitigate catastrophic forgetting (Ramasesh et al., 2020). The arrangement schemes of CNN/Transformer blocks were explored by Liu et al. (2022b;a).

8. CONCLUSION

In this work, we empirically studied the phenomenon that the shallower layers of DNNs tend to converge faster than the deeper layers, which we call layer convergence bias. This phenomenon is a natural preference in DNN training: the shallower layers are responsible for extracting low-level features, which are more evenly distributed and easier to learn, while the deeper layers refine these features for specific tasks. This makes the loss landscapes of shallower layers flatter than those of deeper layers, letting shallower layers converge faster. In addition, this work established a connection between layers and learned frequencies: by showing that deeper layers tend to fit the high-frequency components of the target function, we can understand the layer convergence bias from another perspective. We finally took DNN architecture design and transfer learning as two examples of how the theoretical findings in this work can shed light on practical applications of deep learning. For progress to continue, a more in-depth understanding of the properties of neural networks is needed. We hope that the layer convergence bias can inspire more practical improvements in DNN architecture design and training schemes.

A APPENDIX

A.1 EXPERIMENTAL SETTINGS

Datasets. The synthetic and real datasets are summarized in Tab. 1.

For the ImageNet classification task with CNNs, we train ResNet-50 and VGG-19 for 120 epochs with SGD optimizers. The initial learning rate is 0.1, decayed at the 50th and 100th epochs to 0.01 and 0.001, respectively. The batch size is 256, the input image size is $224^2$, and the weight decay coefficient is $10^{-4}$. For Vision Transformers on the ImageNet dataset, we train for 200 epochs with Adam optimizers. The peak learning rate is set to 0.0003, with a linear learning rate warm-up for 10,000 iterations and a subsequent cosine learning rate decay. The batch size is 256, the input image size is $224^2$, and the weight decay coefficient is $10^{-4}$. For CNN image classification on other datasets, we train models for 100 epochs with SGD optimizers, an initial learning rate of 0.01, and a cosine learning rate scheduler. The batch size is 128, the input image sizes are $32^2$ (for CIFAR) and $224^2$ (for Flowers, Aircraft, Caltech, CUB, and DomainNet), and the weight decay coefficient is $10^{-4}$.
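For reference, the ImageNet CNN recipe above corresponds to a standard PyTorch setup along the following lines. This is a sketch: the momentum value is our assumption, as the text does not state it; learning rate, milestones, and weight decay are taken from the settings above.

```python
import torch
import torchvision

model = torchvision.models.resnet50()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Decay 0.1 -> 0.01 at epoch 50 and 0.01 -> 0.001 at epoch 100.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50, 100], gamma=0.1)
```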

A.2 CONVERGENCE MEASUREMENT USING WEIGHT VARIATION

In Section 2, we introduced the convergence measurement used in this work. The measurement is simple and straightforward, and it shows how each layer of a DNN converges over the whole training process (Fig. 1 for fully-connected networks and Fig. 2 for CNNs) by examining the distance between the training parameters and the converged parameters. However, it has not been verified whether calculating the variation in parameter distance to the convergence point between two adjacent epochs is necessary; after all, the measurement depends heavily on the convergence point, which can only be obtained after the whole training process. We therefore consider a simplified convergence measurement, which uses the weight variation as a metric to examine how fast a layer is learning and whether it has reached a state of convergence. If a layer is learning actively, it is reasonable to expect its weights to vary drastically during training, whereas the weights of converged layers usually stay stable. So we use $\|\theta^{(t_k)}_l - \theta^{(t_{k+1})}_l\|_2 / \|\theta^{(t_k)}_l\|_2$, the normalized weight variation of layer $l$ between epochs $k$ and $k+1$, to illustrate how actively it is learning. The results are shown in Fig. 11. From this plot, we can see that after the learning rate decays at epochs 50 and 100, the weight variations drop evidently. However, the weight variations of the layers do not show an apparent decreasing trend while the learning rate stays constant, which indicates that DNN training does not converge the way usual convex optimization problems (e.g., linear programming) do. Therefore, it is hard to compare the convergence rates of different layers by observing these curves; we cannot find a clear clue, like the one given by the convergence measurement in Section 2, to reveal the layer convergence bias. All in all, we can safely claim that it is crucial for the convergence metric to incorporate direction information, i.e., to measure how fast different layers move towards their convergence points by calculating the parameter distance from the current point to the convergence point.
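A minimal sketch of this simplified metric (our own helper, assuming flattened parameter tensors of one layer at two adjacent epochs):

```python
import torch

def weight_variation(theta_k, theta_k1):
    """Normalized weight variation of a layer between epochs k and k+1."""
    return (torch.linalg.vector_norm(theta_k - theta_k1)
            / torch.linalg.vector_norm(theta_k)).item()
```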

A.3 FACTORS AFFECTING LAYER CONVERGENCE BIAS

In Section 5, we showed that the complexity of the dataset is an important factor affecting layer convergence bias: when the fitting target is complex enough to contain both low- and high-frequency components, the shallower layers learn the low-frequency components while the deeper layers learn the high-frequency components. Here we use the FCNNs with residual connections to examine whether other important factors affect the layer convergence bias. All of the following experiments are conducted on the same regression task as in Section 3.

Model Depth. The default architecture in previous experiments is the four-block FCNN; here we add more blocks to make the network deeper and see what changes. As shown in Fig. 12, all the networks show layer convergence bias. With more and more res-blocks, the overall convergence of the network becomes slightly faster.

Learning Rate. The results with different learning rates are shown in Fig. 13. When the learning rate gets smaller, the layer convergence bias becomes weaker. This is because the gradient predictiveness w.r.t. parameters of all layers gets close to 1 (see Fig. 3 (b, right) for the predictiveness with a learning rate of 0.01). In this case, a layer is less influenced by the parameter updates of other layers, and only the gradient predictiveness w.r.t. data matters for the convergence rate. In addition, smaller learning rates benefit the convergence of deeper layers because of their sharper minima.

Weight Decay. The results with different weight decay strengths are shown in Fig. 14. We can see that as the weight decay becomes stronger, the residual blocks converge more slowly and at increasingly similar rates. We conjecture the reason is that weight decay dominates the total loss when its coefficient is large; layer parameters with similar initialization scales then tend to converge towards zero at similar speeds. Because the residual blocks have identical architectures, they share the same initial parameter distribution and converge at the same speed when weight decay is strong.

Optimizer. In Section 4, we discussed the mechanism behind layer convergence bias: the flatter/sharper minimizers of different layers make SGD learn at different speeds, because SGD is better at finding flatter minimizers (Pan et al., 2020). In Fig. 15, we compare SGD with three adaptive optimizers: Adagrad, RMSprop, and Adam. Evidently, with adaptive optimizers the layer convergence bias no longer holds. We conjecture that this is because the adaptive optimizers heuristically assign different learning rates to different parameters, making their optimization hardly predictable.

Normalization Methods. Like the residual connection, batch normalization (Ioffe & Szegedy, 2015) is a common design in modern DNN architectures. As discussed in previous literature, normalization in neural networks helps make the layer inputs more stable and the loss landscapes smoother, thus accelerating model training (Santurkar et al., 2018). In Section 3 and Section 4, we mainly used FCNNs without normalization to verify and explore the layer convergence bias. Here we investigate how normalization methods (i.e., batch normalization, layer normalization (Ba et al., 2016), and group normalization (Wu & He, 2018)) help convergence, and whether the shallower layers still converge faster in these cases. As shown in Fig. 16, all layers converge faster when batch normalization is added.
Particularly, "Res-Block 1" accelerates the most and reach a similar convergence rate as "Layer 1". The layer convergence bias also holds for batch normalization. For layer normalization and group normalization, the models show a significantly faster convergence rate than the model using batch normalization. All layers show effective convergence at an early stage of training (i.e., the first five epochs). In these two cases, different layers have similar convergence rates, thus no evident layer convergence bias emerges. For verifying the layer convergence bias on more datasets, we show more convergence results on four harder image classification datasets (see Fig. 17 ). Most of the classes in these datasets only have < 100 samples, making them harder to learn. Note that the experiments are conducted with the learning rate of 0.01 (learning rate of 0.1 failed in some cases because these datasets have too many classes but not sufficient samples, leading to non-decreasing loss), some deeper layers have quite similar convergence rates because of the small learning rate. But roughly speaking, layer convergence bias still holds for these datasets. To examine whether the fast establishment of low-level features benefits model performance, we train four different FCNN models with the same amount of parameters, but different architectures, to fit the Sine target with four components. This experiment is based on a finding that a residual block with more layers in it tends to converge more slowly. We construct four FCNN models, each of them has four residual blocks (maybe in different sizes). The convergence processes are shown in Fig. 20 . We can see that the blocks with the largest complexity always converge the most slowly. As the block with depth=4 being placed shallower in the FCNN, the regression MSE loss goes higher. In other word, if a shallower layer converge slowly, the model gets poorer performance. This may due to the vulnerability of deeper layers. If they converge based on changing shallower layers, it is hard for them to learn good features based on their unstable inputs. The results can also be understood from another perspective. If the deeper block contains more parameters (with more fully connected layers in it), it would be helpful for this block to learn the corresponding high-frequency components of the target function. Therefore, the model can reach better performance. A similar observation is obtained in Section 6.1: when putting wider layers of the ViT deeper, the model can reach higher performance. Published as a conference paper at ICLR 2023

A.7 LAYER CONVERGENCE BIAS FOR VISION TRANSFORMERS

As discussed in Section 6, ViT can benefit from distributing more parameters to the deeper layers. This result follows from one of our main findings about layer convergence bias: the deeper layers tend to learn the high-frequency components of the target function and thus converge more slowly, so adding more parameters to the deeper layers helps them learn the high-frequency components, which are usually harder. When making this claim, we had not yet verified the layer convergence bias for the ViT; the main difficulty in verifying it is brought by ViT's typical training scheme.

A.8 CONNECTION TO LARS OPTIMIZATION SCHEME

One of the most important factors affecting the optimization procedure is the learning rate. In this work, it is shown that the shallower layers can learn effectively with large learning rates, but the deeper layers only learn fast after the learning rate decays. Is there any connection between a layer and its suitable learning rate? The LARS optimizer (You et al., 2017) made a significant contribution to training DNNs with huge batch sizes and large learning rates. The key observation in that work is that the weight-to-gradient ratio varies highly across layers. If a layer has greater gradients and relatively smaller weights, it is hard for it to converge due to the vigorous parameter updates. So LARS considers the scale of the weights and the gradient norms in each layer, and assigns each layer a local learning rate to make it converge effectively and stably. For the FCNNs in our work, the hidden layers are initialized at the same scale due to their identical architectures, but the deeper layers usually have larger gradients; as a result, the larger gradients may make these layers struggle to converge. Similarly, the CNNs used in this work (i.e., ResNet-50 and VGG-19) have wider deeper layers. These layers have smaller initial parameters, so their gradients may cause drastic weight variations if the learning rate is too large. In this way, we can understand why they cannot approach their optimal points effectively at the early stage of training. This explains the layer convergence bias from another perspective.
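For concreteness, a sketch of a LARS-style local learning rate is given below. It follows the spirit of You et al. (2017); the function name, the trust coefficient `eta`, and the default values are illustrative assumptions rather than the exact released implementation.

```python
import torch

def lars_local_lr(weight, grad, eta=1e-3, weight_decay=1e-4):
    """Layer-wise local learning rate in the spirit of LARS: scale updates by
    the weight-to-gradient norm ratio, so layers with large gradients but
    small weights do not take destabilizing steps."""
    w_norm = torch.linalg.vector_norm(weight)
    g_norm = torch.linalg.vector_norm(grad) + weight_decay * w_norm
    return eta * w_norm / (g_norm + 1e-12)  # small eps avoids division by zero
```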



Figure 1: (a) FCNN without residual connection; (b) FCNN with four residual blocks. Left (a,b): The absolute mean gradient values of different layers during training; for both networks, deeper layers have larger gradients. Right (a,b): The convergence processes of different layers; shallower layers converge faster.

Figure 2: The convergence process of ResNet-50 and VGG-19 on ImageNet. During the first 50 epochs, shallower layers converge much faster than deeper layers. After the learning rate decays at the 50th epoch, parameters of deeper layers accelerate to move to their convergence points.

Figure 3: The gradient predictiveness of shallower and deeper layers of FCNN. The learning rate decreases from 0.1 to 0.01 at Epoch 150.

Figure 4: The loss landscapes of different layers of FCNN. Deeper layers are optimized on more rugged landscapes, slowing down the learning process.

Figure 5: The loss landscapes of different layers of ResNet-50 (a,b) and VGG-19 (c,d) on ImageNet. The shallower layers of both networks have flatter minima, making them converge faster than the deeper layers. The plots for all layers can be found in Appendix A.5.

Figure 6: The visualization of what each residual block of the FCNN learns. From the first to the fourth block, the fitted function becomes more complex with smaller amplitude.

Figure 8: The convergence curves with different learning target complexities. (a-d): Decreasing target complexities for FCNNs. The deeper layers accelerate more than the shallower ones when high-frequency components are removed. (e-h): For CNNs, the deepest layers (i.e., Stage 4 / Layer 17) learn faster on CIFAR-10 than on CIFAR-100, while the other layers do not change much.

Figure 9: Performance of three variants of ViTs on ImageNet.

Figure 10: Effects of transfer learning on the training process. Left (a,b): The layer convergence processes of ResNet-50. Right (a,b): The loss landscapes of Stage 4 with and without transfer learning.

Figure 11: The convergence processes of ResNet-50 ((a), Val Acc 73.24%) and VGG-19 ((b), Val Acc 71.89%) on ImageNet, illustrated with weight variations. The learning rate decays at epoch 50 and epoch 100.

Figure 12: The convergence process of FCNNs with different numbers of res-blocks.

Figure 13: The convergence process of FCNNs with different learning rates.

Figure 14: The convergence process of FCNNs with different weight decay strengths.

Figure 16: The convergence process of FCNNs with different normalization methods. When using group normalization, we set the group number to 8.

Figure 17: The convergence process of CNNs on four image classification tasks.

Figure 20: The convergence process of FCNNs with different residual block sizes and their validation performance on the regression task. Each model has one four-layer residual block and three one-layer residual blocks (e.g., "Res-Blocks=(4,1,1,1)" means the first residual block has four layers and the remaining three blocks have one layer each).

Figure 21: The convergence curves of ViTs on ImageNet with different optimizers. With the Adam optimizer, the ViT does not strictly obey the layer convergence bias, while SGD preserves the relatively faster convergence of shallower layers.


Table 1: Descriptions and statistics of the datasets used in this work.

Table 2: Complexities and architectures of DNNs used in this work.


ACKNOWLEDGMENTS

This work was supported by the Lustgarten Foundation for Pancreatic Cancer Research and the McGovern Foundation.



