TEMPORARY FEATURE COLLAPSE PHENOMENON IN EARLY LEARNING OF MLPS

Abstract

In this paper, we focus on a typical two-phase phenomenon in the learning of multi-layer perceptrons (MLPs). We discover and explain the reason for the feature collapse phenomenon in the first phase, i.e., the diversity of features over different samples keeps decreasing in the first phase, until samples of different categories share almost the same feature, which hurts the optimization of MLPs. We find that such phenomena usually occur when MLPs are difficult to train. We explain this phenomenon in terms of the learning dynamics of MLPs. Furthermore, we theoretically analyze the reason why four typical operations can alleviate the feature collapse. The code has been attached to the submission.

1. INTRODUCTION

It has been widely observed that in initialized neural networks, especially when the network is deep, the loss decrease is likely to have two phases during the early epochs of learning, e.g., the phenomena observed in (Saxe et al., 2013; Simsekli et al., 2019; Stevens et al., 2020). As Figure 1 (a) shows, the first phase is usually relatively short, in which the training loss does not decrease or decreases very slowly. Then, in the second phase, the training loss suddenly begins to decrease fast. In particular, as Figure 1 (b) shows, the length of the first phase increases along with the network complexity. In some extreme cases when deep neural networks (DNNs) are very deep, the loss minimization gets stuck, which can be considered as a strong first phase with an infinite length, namely, a learning-sticking problem. In fact, the learning-sticking problem is quite common in practice. Jepkoech et al. (2021) and Stevens et al. (2020) empirically observed the learning-sticking problem without any theoretical analysis. The learning-sticking problem has usually been attributed to the over-parameterized settings of DNNs or the optimization ability of DNNs. However, we discover and attempt to theoretically explain a new, quite common, yet counter-intuitive phenomenon in the first phase (the learning-sticking phase). That is, as Figure 1 (b) shows, features of different categories become increasingly similar to each other. In some cases, the feature diversity keeps decreasing until all samples of different categories share almost the same feature in the first phase. We refer to this as the temporary feature collapse (TFC). The TFC happens in various DNNs, including multi-layer perceptrons (MLPs), convolutional neural networks, and recurrent neural networks (see both Figure 2 and Appendix B). DNNs trained with different loss functions and different learning rates may all exhibit TFC phenomena.
The TFC phenomenon usually happens in the early epochs of the training process, especially when the DNN is difficult to optimize. According to our analysis, when the DNN is very deep, when the task is difficult, when the variance of initial weights is small, and when the DNN is trained without momentum or batch normalization layers, the DNN is more likely to exhibit the TFC phenomenon. Based on our theoretical analysis, we discover a set of conditions that strengthen the TFC phenomenon. Then, we can easily control such conditions by applying typical operations, i.e., batch normalization, momentum, L2 regularization, and network initialization. Specifically, we investigate the learning dynamics of the MLP. Moreover, we theoretically explain that these conditions make the training of DNNs more likely to behave like a "self-enhanced system" towards the TFC phenomenon in early iterations. In comparison, Glorot & Bengio (2010) and Saxe et al. (2013) investigated the influence of initialization methods on the learning-sticking problem. In this regard, we find that the effectiveness of initialization methods is probably owing to the large variance of initial weights, which avoids the TFC phenomenon during the learning-sticking phase. Fortunately, we discover that, when we use four typical operations to alleviate the TFC phenomenon, the learning-sticking problem can also be solved. Although previous studies have provided insightful analysis of these well-known operations, e.g., batch normalization and network initialization, we are the first to establish the relationship between the TFC phenomenon and these typical operations. This provides theoretical guidance for the design of DNNs.
[Figure caption: Samples of different categories share almost the same features at the end of the first phase. We can consider this as a TFC phenomenon. We visualize the learning dynamics of an intermediate-layer feature in a 9-layer MLP. We select 10 salient dimensions to illustrate the feature similarity.]
More crucially, the TFC phenomenon with the MLP is counter-intuitive, and has been neglected for a long time. The investigation of the learning dynamics of the TFC phenomenon would be useful for explaining complex optimization behaviors of DNNs and is of considerable value. Contributions of this study can be summarized as follows. (1) We discover the common TFC phenomenon in early learning of the MLP, which has been ignored for a long time. (2) We explain this phenomenon from the perspective of learning dynamics. (3) We explain why four types of operations can alleviate the TFC phenomenon.

2. DISCOVERING THE TFC PHENOMENON

It has been widely observed that the loss decrease of DNNs is likely to have two phases (Saxe et al., 2013; Simsekli et al., 2019; Stevens et al., 2020). As Figure 1 (b) shows, the training loss does not decrease significantly in the first phase, and the training loss suddenly begins to decrease in the second phase. In this paper, we discover a new and counter-intuitive phenomenon in the first phase: both the diversity of intermediate-layer features over different samples and the diversity of feature gradients keep decreasing, until samples of different categories share almost the same feature. We consider this as a TFC phenomenon. We consider an MLP $f$ with $L$ concatenated linear layers, each being followed by a ReLU layer. Only the last linear layer is followed by a softmax operation. Let $W^{(l)}_t \in \mathbb{R}^{h \times d}$ denote the weight matrix of the $l$-th linear layer with $h$ neurons ($1 \le l \le L$), where $W^{(l)}_t$ has been learned for $t$ iterations. Given an input sample $x$, the layer-wise forward propagation in the $l$-th layer is represented as
$$F^{(l)}_t = \mathrm{ReLU}(W^{(l)}_t F^{(l-1)}_t) = D^{(l)}_t W^{(l)}_t F^{(l-1)}_t,$$
where $F^{(l)}_t \in \mathbb{R}^{h}$ denotes the output feature of the $l$-th layer after the $t$-th iteration. $D^{(l)}_t$ denotes a diagonal matrix, which represents gating states in the ReLU layer, and $D^{(l)}_{t,(i,i)} \in \{0, 1\}$. Thus, the TFC phenomenon is shown as follows. Given two input samples $x_1$ and $x_2$, the cosine similarity of features $\cos(F^{(l)}_t|_{x_1}, F^{(l)}_t|_{x_2})$ and the cosine similarity of gradients $\cos(\dot{F}^{(l)}_t|_{x_1}, \dot{F}^{(l)}_t|_{x_2})$ keep increasing, which demonstrates the phenomenon. Here, $\dot{F}^{(l)}_t$ denotes the gradient of the loss w.r.t. the feature $F^{(l)}_t$. Besides, the increasing trend of feature similarity only exists in the first phase. The TFC phenomenon is widely shared by different DNNs learned for different tasks.
In early epochs (or iterations) of the training process, we observed such TFC phenomena on MLPs, VGG-11 (Simonyan & Zisserman, 2014), and the revised long short-term memory (LSTM) on different types of data, including image data (MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky et al., 2009), and the Tiny ImageNet dataset (Le & Yang, 2015)), tabular data (two UCI datasets of census income and TV news (Asuncion & Newman, 2007)), and natural language data (CoLA (Warstadt et al., 2019), SST-2 (Socher et al., 2013), and AGNews (Del Corso et al., 2005)). We also tested MLPs with different loss functions, with Leaky ReLU layers (Maas et al., 2013), with different learning rates, and with different batch sizes. Figure 2 (a, b) shows TFC phenomena on these DNNs; please see Appendix B for results on more DNNs. Besides, the learning-sticking problem can be considered as an extremely long first phase. As Figure 1 (a) shows, the length of the first phase increases along with the network complexity (depth). In extreme cases when DNNs are very deep or the task is difficult, the first phase reaches an infinite length, and the learning gets stuck (please see Appendix C for more discussions).
[Figure caption: The logic of explaining the TFC phenomenon. The common direction of weight changes can be decomposed from two perspectives (Perspective 1 and Perspective 2). These two perspectives enhance each other. Such enhancement explains the decrease of feature diversity.]
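The feature-similarity measurement used in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the layer sizes, the Gaussian weights, and the helper names `forward_with_gates` and `cosine` are hypothetical and not taken from the attached code, and only ReLU layers are modelled (the final softmax layer is omitted).

```python
import numpy as np

def forward_with_gates(weights, x):
    """Run the ReLU layers; return features F^(l) and the diagonals of D^(l)."""
    features, gates = [], []
    f = x
    for W in weights:
        pre = W @ f
        gate = (pre > 0).astype(pre.dtype)  # diagonal entries of D^(l), each in {0, 1}
        f = gate * pre                      # equals ReLU(W f) = D W f
        features.append(f)
        gates.append(gate)
    return features, gates

def cosine(a, b, eps=1e-12):
    """Cosine similarity with a small constant to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Hypothetical toy setup: three ReLU layers with Gaussian weights and inputs.
rng = np.random.default_rng(0)
dims = [16, 32, 32, 32]
weights = [rng.normal(0.0, 1.0 / np.sqrt(dims[i]), size=(dims[i + 1], dims[i]))
           for i in range(len(dims) - 1)]
x1, x2 = rng.normal(size=dims[0]), rng.normal(size=dims[0])

feats1, gates1 = forward_with_gates(weights, x1)
feats2, _ = forward_with_gates(weights, x2)

# Layer-wise cosine similarity between the features of the two samples.
layer_similarity = [cosine(a, b) for a, b in zip(feats1, feats2)]
print(layer_similarity)
```

Tracking `layer_similarity` over training iterations (and the analogous quantity for feature gradients) would give the kind of curve described above; the sketch itself only evaluates a single, untrained network.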

3. EXPLAINING THE DYNAMICS OF THE TFC PHENOMENON

In this section, we aim to investigate dynamics of network parameters in early epochs, so as to explain the condition that may boost the likelihood of the TFC phenomenon. In Section 3.1, we find that the decreasing diversity of feature gradients over different samples is owing to the phenomenon that different neurons in a layer are optimized towards a common direction in the first phase. Therefore, we propose two perspectives to illustrate the significance of the common direction. Then, in Section 3.2, we compare these two perspectives to analyze learning dynamics, and we find that the significance of the common direction may be enhanced, just like a "self-enhanced system." Finally, the self-enhanced common direction can explain the TFC phenomenon. The overall logic of the explanation is illustrated in Figure 3 (b). In Section 3.3, we explain the reason why four types of operations can alleviate the TFC phenomenon based on our analysis.

3.1. TWO PERSPECTIVES TO ANALYZE THE COMMON DIRECTION OF LEARNING EFFECTS

In the beginning, let us first focus on the conjecture that the decreasing diversity of feature gradients over different samples can be explained by the phenomenon that different neurons in a layer are optimized towards a common direction in the first phase. For example, as Figure 3 (a) shows, at the beginning of the learning, different neurons are originally optimized towards different directions, but then gradients of different neurons gradually change to a similar direction. Let $\dot{F}^{(l)}_t$ denote the gradient of the loss w.r.t. the feature $F^{(l)}_t$ at the $l$-th layer. Then, according to Eq. (1), the back propagation of feature gradients $\dot{F}^{(l)}_t \in \mathbb{R}^{h}$ at the $l$-th layer can be written as
$$\dot{F}^{(l-1)}_t = W^{(l)\top}_t D^{(l)}_t \dot{F}^{(l)}_t.$$
The emergence of a common direction of weight changes means that gradients of the $d$ weight vectors in $W^{(l)\top}_t = [w^{(l)}_{t,1}, w^{(l)}_{t,2}, \cdots, w^{(l)}_{t,d}]^\top \in \mathbb{R}^{d \times h}$, i.e., $\partial \mathrm{Loss} / \partial w^{(l)}_{t,i}$, gradually become approximately collinear. According to Remark 1, we can explain why the enhancement of such a common direction decreases the diversity of feature gradients.
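The back-propagation rule above can be checked numerically on a single ReLU layer. The following is a minimal sketch, assuming a hypothetical surrogate loss that is linear in the layer output (so that the gradient w.r.t. the output is a fixed vector `grad_out`); the sizes and names are illustrative only.

```python
import numpy as np

def relu_layer(W, f_prev):
    """Forward pass of one layer; returns the feature and the diagonal of D^(l)."""
    pre = W @ f_prev
    gate = (pre > 0).astype(float)
    return gate * pre, gate

def backprop_feature_grad(W, gate, grad_out):
    """Gradient w.r.t. F^(l-1), computed as W^T D^(l) grad_out."""
    return W.T @ (gate * grad_out)

rng = np.random.default_rng(0)
h, d = 8, 5
W = rng.normal(size=(h, d))
f_prev = rng.normal(size=d)
grad_out = rng.normal(size=h)  # stands in for the gradient w.r.t. F^(l)

_, gate = relu_layer(W, f_prev)
analytic = backprop_feature_grad(W, gate, grad_out)

def surrogate_loss(f):
    """Hypothetical loss whose gradient w.r.t. the layer output is grad_out."""
    out, _ = relu_layer(W, f)
    return float(grad_out @ out)

# Central finite differences along each input coordinate.
step = 1e-6
numeric = np.array([
    (surrogate_loss(f_prev + step * e) - surrogate_loss(f_prev - step * e)) / (2 * step)
    for e in np.eye(d)
])
print(np.max(np.abs(analytic - numeric)))
```

The agreement holds as long as no pre-activation sits within the finite-difference step of zero, which is where the ReLU gating state would change.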

Remark 1. Let us assume that different weight vectors $[w^{(l)}_{t,1}, w^{(l)}_{t,2}, \cdots, w^{(l)}_{t,d}]^\top$ have a dominating common direction $C^{(l)} \in \mathbb{R}^{h}$. Then, we can represent $w^{(l)}_{t,i} = \beta_i C^{(l)} + \epsilon_i$, where $\beta_i \in \mathbb{R}$; $\epsilon_i \in \mathbb{R}^{h}$ denotes a small residual; $\beta = [\beta_1, \beta_2, \cdots, \beta_d]$, and $\epsilon = [\epsilon_1, \epsilon_2, \cdots, \epsilon_d]^\top$. Then, we have
$$\dot{F}^{(l-1)}_t = (C^{(l)\top} D^{(l)}_t \dot{F}^{(l)}_t)\, \beta + \epsilon D^{(l)}_t \dot{F}^{(l)}_t.$$

Remark 1 explains the rationale for the above conjecture. That is, if $\partial \mathrm{Loss} / \partial w^{(l)}_{t,i}$ on different samples are roughly collinear to each other, then such a collinearity would make feature gradients $\dot{F}^{(l-1)}_t$ of different samples similar to each other. Specifically, during the learning process, if the DNN keeps optimizing $W^{(l)\top}_t$ along the common direction $C^{(l)}$ for a long time, which keeps strengthening the value $C^{(l)\top} D^{(l)}_t \dot{F}^{(l)}_t \in \mathbb{R}$, then feature gradients $\dot{F}^{(l-1)}_t$ of different samples are gradually pushed towards the same direction $\beta$. In other words, as long as different weight vectors are optimized towards the same dominating direction, feature gradients $\dot{F}^{(l-1)}_t$ are pushed in the same direction $\beta$. Therefore, the first core task of proving the decreasing diversity of feature gradients is to explain the existence of the common optimization direction shared by different weight vectors. Thus, we propose two perspectives to illustrate how different weight vectors $w^{(l)}_{t,i}$ are changed (i.e., $\Delta W_t$) along a common direction during the learning process. By comparing these two perspectives, we can further explain the reason why the significance of the common direction will be further boosted, just like a "self-enhanced system." Such a "self-enhanced system" will be proven in Section 3.2.
Perspective 1. This perspective focuses on the influence of the common direction $C^{(l)}$ of the weight change in the $l$-th layer. For clarity, we omit the superscript $(l)$ to simplify the notation in the following paragraphs in Section 3.1, i.e., $\Delta w^{(l)}_{t,i}$, $\Delta W^{(l)}_t$, and $C^{(l)}$ are simplified as $\Delta w_{t,i}$, $\Delta W_t$, and $C$, respectively.
Let $\Delta W_t^\top = [\Delta w_{t,1}, \Delta w_{t,2}, \cdots, \Delta w_{t,d}]^\top$ denote weight changes of the $d$ weight vectors in the $l$-th layer. We decompose $\Delta W_t^\top$ into the component along a common direction $C$ and a component along other directions as follows.
$$\Delta W_t^\top = \Delta V_t C^\top + \Delta\varepsilon_t,$$
where $\Delta V_t = [\Delta v_{t,1}, \Delta v_{t,2}, \cdots, \Delta v_{t,d}] \in \mathbb{R}^{d}$ denotes the coefficient vector for weight changes of different weight vectors along the common direction $C$. Specifically, $\Delta\varepsilon_t$ is a relatively small "noise" term, which is orthogonal to $C$.
Lemma 1. Given the weight change $\Delta W_t^\top$ and the common direction $C$, the decomposition $\Delta W_t^\top = \Delta V_t C^\top + \Delta\varepsilon_t$ can be conducted as $\Delta V_t = \frac{\Delta W_t^\top C}{C^\top C}$ and $\Delta\varepsilon_t = \Delta W_t^\top - \frac{\Delta W_t^\top C C^\top}{C^\top C}$, s.t. $\Delta\varepsilon_t C = 0$. Such settings minimize $\|\Delta\varepsilon_t\|_F$.
Lemma 2. (We can also decompose the weight $W^{(l)}_t$ into the component along the common direction $C$ and the component $\varepsilon_t$ in other directions. Proof is in Appendix F.) Given the weight $W_t^\top$ and the common direction $C$, the decomposition $W_t^\top = V_t C^\top + \varepsilon_t$ can be conducted as $V_t = \frac{W_t^\top C}{C^\top C}$ and $\varepsilon_t = W_t^\top - \frac{W_t^\top C C^\top}{C^\top C}$, s.t. $\varepsilon_t C = 0$. Such settings minimize $\|\varepsilon_t\|_F$.
[Table 1: Strength of components of weight changes along the common direction and other directions. We trained 9-layer MLPs on the CIFAR-10 dataset and the Tiny ImageNet dataset, respectively. Each layer of the MLP had 512 neurons. The strength of the primary common direction was much greater than those of other directions. Appendix D provides results on the MNIST dataset and Appendix H.2 explains the phenomenon that $S^{(l)}_1$, $S^{(l)}_2$, and $S^{(l)}_3$ do not decrease monotonically.]
We conduct experiments to verify the strength of the primary common direction $C$. To this end, let us focus on the average weight change over different samples $\overline{\Delta W}_t = \mathbb{E}_{x \in X}[\Delta W_t|_x]$. Then, we decompose $\overline{\Delta W}_t$ into components along five common directions as
$$\overline{\Delta W}_t = C_1 \Delta V_{1,t}^\top + C_2 \Delta V_{2,t}^\top + \cdots + C_5 \Delta V_{5,t}^\top + \Delta\varepsilon_{5,t}^\top,$$
where $C_1 = C$ is termed the primary common direction.
$C_2$, $C_3$, $C_4$, and $C_5$ represent the second, third, fourth, and fifth common directions, respectively. $C_1$, $C_2$, $C_3$, $C_4$, and $C_5$ are orthogonal to each other. $C_i$ and $\Delta V_{i,t}$ are computed based on Lemma 1 after we remove the first $(i-1)$ components along the directions $C_1, \cdots, C_{i-1}$ from $\overline{\Delta W}_t$. Figure 4 shows that the strength of the primary common component $C_1 \Delta V_1^\top$ is approximately ten times greater than the strength of the secondary common component $C_2 \Delta V_2^\top$. Please see Appendix G for more discussions.
Perspective 2 based on feature gradients $\dot{F}^{(l+1)}_t$. We decompose the weight change by considering the influence of the common direction of the upper layer $C^{(l+1)}$. In order to distinguish variables belonging to different layers, we add the superscript $(l)$ back to $\Delta W^{(l)}_t$, $\Delta V^{(l)}_t$, and $\Delta\varepsilon^{(l)}_t$ to denote the layer in the following paragraphs.
Theorem 1. (Proof in Appendix H.1) The weight change made by a sample can be decomposed into $(h+1)$ terms after the $t$-th iteration as follows.
$$\Delta W^{(l)}_t = \Delta W^{(l)}_{\mathrm{primary},t} + \sum_{k=1}^{h} \Delta W^{(l,k)}_{\mathrm{noise},t} = \Gamma^{(l)}_t F^{(l-1)\top}_t + \kappa^{(l)\top}_t,$$
where $\Delta W^{(l)}_{\mathrm{primary},t} = D^{(l)}_t V^{(l+1)}_t C^{(l+1)\top} C^{(l+1)} \Delta V^{(l+1)\top}_t F^{(l)}_t F^{(l-1)\top}_t / \|F^{(l)}_t\|_2^2$ denotes the component along the primary common direction, and $\Delta W^{(l,k)}_{\mathrm{noise},t} = D^{(l)}_t \varepsilon^{(l+1,k)}_t \Delta\varepsilon^{(l+1)\top}_t F^{(l)}_t F^{(l-1)\top}_t / \|F^{(l)}_t\|_2^2$ denotes the component along the $k$-th common direction in the noise term. $\varepsilon^{(l+1,k)}_t = \Sigma_{kk} U_k V_k^\top$, where the SVD of $\varepsilon^{(l+1)}_t \in \mathbb{R}^{h \times h'}$ is given as $\varepsilon^{(l+1)}_t = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top$ ($h \le h'$), and $\Sigma_{kk} \in \mathbb{R}$ denotes the $k$-th singular value. $\varepsilon^{(l+1)}_t = \sum_k \varepsilon^{(l+1,k)}_t$. $U_k$ and $V_k$ denote the $k$-th column of the matrices $\mathbf{U}$ and $\mathbf{V}$, respectively. Besides, we have $\forall k \in \{1, 2, \ldots, h\}$, $U_k^\top C^{(l+1)} = 0$. Consequently, we have
$$\Gamma^{(l)}_t = D^{(l)}_t V^{(l+1)}_t C^{(l+1)\top} C^{(l+1)} \Delta V^{(l+1)\top}_t F^{(l)}_t / \|F^{(l)}_t\|_2^2 \in \mathbb{R}^{h}, \qquad \kappa^{(l)\top}_t = D^{(l)}_t \varepsilon^{(l+1)}_t \Delta\varepsilon^{(l+1)\top}_t F^{(l)}_t F^{(l-1)\top}_t / \|F^{(l)}_t\|_2^2 \in \mathbb{R}^{h \times d}.$$
Given weight changes $\Delta W^{(l)}_t$ made by a sample $x$, the primary term $\Delta W^{(l)}_{\mathrm{primary},t}$ represents the component of weight changes along the common direction $C^{(l+1)}$. The $k$-th noise term $\Delta W^{(l,k)}_{\mathrm{noise},t}$ represents the component along the $k$-th direction $U_k$, which is orthogonal to $C^{(l+1)}$. We conduct experiments to verify the significant strength of the component along the common direction $C^{(l+1)}$. We compute the average strength of the component along $C^{(l+1)}$ over all samples in $X$ as $S^{(l)}_{\mathrm{primary}} = \mathbb{E}_{t \in [T_{\mathrm{start}}, T_{\mathrm{end}}]} \mathbb{E}_{x \in X}[\|\Delta W^{(l)}_{\mathrm{primary},t}|_x\|_F]$. Similarly, the strength of the component along the $k$-th noise direction is computed as $S^{(l)}_k = \mathbb{E}_{t \in [T_{\mathrm{start}}, T_{\mathrm{end}}]} \mathbb{E}_{x \in X}[\|\Delta W^{(l,k)}_{\mathrm{noise},t}|_x\|_F]$. Table 1 illustrates that the strength of the primary component $S^{(l)}_{\mathrm{primary}}$ is more than ten times greater than the strength of components along noise directions $S^{(l)}_1$, $S^{(l)}_2$, and $S^{(l)}_3$.
Discussion about comparing with the sum of all other directions' significance. According to Table 1, it seems that the sum of strengths of components along other directions is also large. However, different directions decomposed by the above method are orthogonal to each other. Therefore, weight changes along different directions are independent, and their strengths cannot be summed up. Thus, we can directly compare the strength of the component of weight changes along each direction to verify the significant strength of the primary direction.
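The projection used in the lemmas of this section, and the idea of peeling off several orthogonal directions one after another, can be illustrated with a short NumPy sketch. Two points are assumptions rather than statements about the authors' implementation: each direction is chosen here as the leading right singular vector of the current residual, and the toy matrix `delta_W_T` (a dominant rank-one part plus small noise) is synthetic. The helper names `decompose_along` and `peel_directions` are hypothetical.

```python
import numpy as np

def decompose_along(W_T, C):
    """Split W^T (d x h) into outer(V, C) + eps with eps @ C = 0."""
    V = W_T @ C / (C @ C)
    eps = W_T - np.outer(V, C)
    return V, eps

def peel_directions(W_T, num_directions):
    """Remove one direction at a time from the residual.

    Assumption: each direction is the leading right singular vector of the
    current residual; this choice is made only for illustration.
    """
    residual = W_T.copy()
    directions, strengths = [], []
    for _ in range(num_directions):
        _, _, Vt = np.linalg.svd(residual, full_matrices=False)
        C = Vt[0]
        V, residual = decompose_along(residual, C)
        directions.append(C)
        strengths.append(np.linalg.norm(np.outer(V, C)))  # Frobenius norm
    return np.array(directions), np.array(strengths), residual

# Synthetic weight change: one dominant rank-one part plus small noise.
rng = np.random.default_rng(0)
d, h = 20, 30
delta_W_T = (5.0 * np.outer(rng.normal(size=d), rng.normal(size=h))
             + 0.1 * rng.normal(size=(d, h)))

# Projection onto an arbitrary direction C: the residual is orthogonal to C.
C = rng.normal(size=h)
V, eps = decompose_along(delta_W_T, C)

# Five successive directions and the strength of each component.
directions, strengths, _ = peel_directions(delta_W_T, 5)
print(strengths)
```

Because the residual is orthogonal to `C`, any other coefficient vector gives a strictly larger Frobenius norm of the residual, which is the minimality property stated in the lemmas. On this synthetic matrix the first strength dominates the others by construction; it says nothing about the measured values in Table 1.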

3.2. EXPLAINING THE ENHANCEMENT OF THE SIGNIFICANCE OF THE COMMON DIRECTION

The previous subsection attributes the decreasing diversity of feature gradients to the typical common optimization direction shared by different weight vectors. In the current subsection, we explain that the common optimization direction phenomenon is very likely to be further enhanced, just like a "self-enhanced system." The self-enhancement of the optimization direction will explain the decreasing diversity. Specifically, the overall logic of this subsection has three steps. In Step 1, we explain the phenomenon that the significance of the common direction can be enhanced by training samples in a certain category in very early epochs. In Step 2, we extend the analysis of the enhancement of the common direction to a more generic case, i.e., explaining the enhancement caused by training samples from different categories. In Step 3, we further explain that the self-enhancement of the common direction decreases the diversity of features and feature gradients, i.e., explaining the TFC phenomenon. Before explaining the enhancement of the significance of the common direction, let us first clarify assumptions in the proof. (1) The direct proof of the emergence of a "self-enhanced system" from the very beginning of training an initialized MLP is difficult. Instead, we explain that the self-enhancement of the common direction probably starts under the background assumption that features of different samples have been pushed a little bit towards a specific common direction. (2) The MLP usually first learns a few categories, instead of simultaneously learning all categories. Experimental results in Figure 7 and Appendix O have verified this assumption. According to Eq. (4) and Eq.
(5), weight changes made by the sample $x$ can be given as
$$\text{Perspective 1:} \quad \Delta W^{(l)}_t = C^{(l)} \Delta V^{(l)\top}_t + \Delta\varepsilon^{(l)\top}_t$$
$$\text{Perspective 2:} \quad \Delta W^{(l)}_t = \Gamma^{(l)}_t F^{(l-1)\top}_t + \kappa^{(l)\top}_t \qquad (6)$$
By comparing the above two perspectives, we observe that, potentially, the common direction $C^{(l)}$ is similar to $\pm\Gamma^{(l)}_t$, and the feature $F^{(l-1)}_t$ is similar to the vector $\pm\Delta V^{(l)}_t$. Inspired by this, we aim to prove the self-enhancement of the significance of the common direction, by explaining the intuition that the feature $F^{(l-1)}_t$ and the vector $V^{(l)}_t$ become more and more similar to each other in the first phase. As the first step of the proof, Theorem 2 shows that if we only consider training samples $x \in X_c$ in the same category $c$, then features $F^{(l-1)}_t$ of samples in this category would become increasingly similar to each other. On the other hand, such training samples have similar training effects, i.e., pushing weights of different neurons $V^{(l)}_t$ all towards the average feature $\alpha_c \mathbb{E}_{x \in X_c}[F^{(l-1)}_t|_x]$.
Step 1: Explaining that the significance of the common direction is enhanced by all training samples in a certain category. Specifically, let us first consider the aforementioned background assumption that features $F^{(l-1)}_t$ of different samples have been pushed a little bit towards a specific common direction. We can obtain that there exists at least one learning iteration in the first phase, in which $\Delta F^{(l-1)}_t$ and $F^{(l-1)}_t$ of most samples have similar directions, and $\Delta V^{(l)}_t$ and $V^{(l)}_t$ have similar directions (see Appendix J for more discussions). Note that the assumed initial common direction is quite vague, and it is far from the TFC phenomenon. However, if we take the vague common direction as the starting point, we can further prove the significant self-enhancement of the common direction, which is responsible for the TFC phenomenon.
Thus, Theorem 2 explains how the significance of the common direction is enhanced by all training samples in the category $c$, i.e., $F^{(l-1)}_t$ and $\alpha_c V^{(l)}_t$ become increasingly similar. In Theorem 2, $\cos(\alpha_c V^{(l)}_t, \Delta F^{(l-1)}_t|_x) \ge 0$ means that features of training samples in the same category $c$ are all pushed towards a common direction $\alpha_c V^{(l)}_t$, which makes $\Delta F^{(l-1)}_t|_x$ highly similar to $\alpha_c V^{(l)}_t$, i.e., making sample features $F^{(l-1)}_t|_x$ in the category $c$ become increasingly similar to each other. On the other hand, $\cos(\alpha_c \Delta V^{(l)}_t|_x, F^{(l-1)}_t|_x) \ge 0$ in Theorem 2 means that training samples in the category $c$ all push $V^{(l)}_t$ towards $\alpha_c \mathbb{E}_{x \in X_c}[F^{(l-1)}_t|_x]$, and make $\Delta V^{(l)}_t$ roughly parallel to $\alpha_c \mathbb{E}_{x \in X_c}[F^{(l-1)}_t|_x]$, i.e., pushing weights of different neurons $V^{(l)}_t$ towards the average feature. This phenomenon is verified in Figure 5, where $\cos(\alpha_c \Delta V^{(l)}_t, F^{(l-1)}_t)$ is always positive over different samples of the same category. The above analysis also explains the dynamics behind $\cos(\Delta V^{(l)}_t, F^{(l-1)}_t) \cdot \cos(V^{(l)}_t, \Delta F^{(l-1)}_t) \ge 0$ in Lemma 3.
Lemma 3. (Proof in Appendix K) Given an input sample $x \in X$ and a common direction $C^{(l)}$ after the $t$-th iteration, if the noise term $\varepsilon^{(l)}_t$ is small enough to satisfy
$$|\Delta V^{(l)\top}_t F^{(l-1)}_t V^{(l)\top}_t V^{(l)}_t C^{(l)\top} C^{(l)} \Delta V^{(l)\top}_t F^{(l-1)}_t| \gg |\Delta V^{(l)\top}_t F^{(l-1)}_t V^{(l)\top}_t \varepsilon^{(l)}_t \Delta\varepsilon^{(l)\top}_t F^{(l-1)}_t|,$$
we can obtain $\cos(\Delta V^{(l)}_t, F^{(l-1)}_t) \cdot \cos(V^{(l)}_t, \Delta F^{(l-1)}_t) \ge 0$, where $\Delta V^{(l)}_t = \frac{\Delta W^{(l)\top}_t C^{(l)}}{C^{(l)\top} C^{(l)}}$, and $V^{(l)}_t = \frac{W^{(l)\top}_t C^{(l)}}{C^{(l)\top} C^{(l)}}$. $\Delta F^{(l-1)}_t$ denotes the change of features $\Delta F^{(l-1)}_t = F^{(l-1)}_{t+1} - F^{(l-1)}_t$ made by the training sample $x$ after the $t$-th iteration.
To this end, we approximately consider the change of features $\Delta F^{(l-1)}_t$ after the $t$-th iteration to be negatively parallel to feature gradients $\dot{F}^{(l-1)}_t$, although strictly speaking, the change of features is not exactly equal to the feature gradients.
Theorem 2. (Proof in Appendix L) Under the aforementioned background assumption, for any training samples $x, x' \in X_c$ in the category $c$, if $[C^{(l)\top} D^{(l)}_t|_x \dot{F}^{(l)}_t|_x] \cdot [C^{(l)\top} D^{(l)}_t|_{x'} \dot{F}^{(l)}_t|_{x'}] > 0$ (i.e., $F^{(l)}_t|_x$ and $F^{(l)}_t|_{x'}$ have a certain degree of similarity in very early iterations), then $\cos(\alpha_c \Delta V^{(l)}_t|_x, F^{(l-1)}_t|_x) \ge 0$, and $\cos(\alpha_c V^{(l)}_t, \Delta F^{(l-1)}_t|_x) \ge 0$, where $\alpha_c \in \{-1, +1\}$ is a constant shared by all samples in category $c$.
We conduct experiments to verify the relationship between the feature $F^{(l-1)}_t$ and the vector $V^{(l)}_t$. To this end, we measure the change of the value $o^{(l)} = \cos(\Delta V^{(l)}_t, F^{(l-1)}_t) \cdot \cos(V^{(l)}_t, \Delta F^{(l-1)}_t)$. Figure 6 reports the average $o^{(l)}$ value over different samples at each iteration. For each sample $x$, $o^{(l)}$ is always positive and usually increases over iterations, which verifies Lemma 3. Besides, the assumption for a tiny $\varepsilon^{(l)}_t$ in Lemma 3 is verified by experimental results in Appendix K. In sum, Step 1 focuses on the significance of the common direction enhanced by training samples in a certain category. In order to further analyze the learning effect of training samples from different categories, we propose Assumption 1 according to extensive experimental observations.
Assumption 1. We assume that the MLP encodes features of very few (a single or two) categories in the first phase, instead of simultaneously learning all or most categories in this phase.
Assumption 1 indicates that MLPs first learn a single or two categories in the first phase. Figure 7 verifies that only a single or two categories exhibit much higher accuracies than random guessing at the end of the first phase.
This means that the learning of the MLP is dominated by training samples of a single or two categories in very early iterations. Please see Appendix O for more results on different DNNs.
Step 2: Extending the enhancement of the significance of the common direction to a more general case that considers all training samples from different categories, i.e., $F^{(l-1)}_t$ and $\alpha_{\hat{c}} V^{(l)}_t$ become increasingly similar. The overall learning dynamics in the first phase can be roughly described by combining Theorem 2 and Assumption 1 as follows. Assumption 1 indicates that MLPs encode features of very few (a single or two) categories in early epochs. In other words, the overall learning effects of all training samples are dominated by very few categories $\hat{c}$. Based on this, Theorem 2 indicates two effects. First, features $F^{(l-1)}_t$ of different samples are all pushed towards the vector $\alpha_{\hat{c}} V^{(l)}_t$, where $\alpha_{\hat{c}}$ is determined by the dominating category/categories $\hat{c}$. Second, $V^{(l)}_t$ is pushed towards $\alpha_{\hat{c}} \mathbb{E}_{x \in X_{\hat{c}}}[F^{(l-1)}_t|_x]$. Therefore, features $F^{(l-1)}_t$ of different samples and $\alpha_{\hat{c}} V^{(l)}_t$ enhance each other, just like a "self-enhanced system." The "self-enhanced system" starts from the assumed state that $\Delta F^{(l-1)}_t$ and $F^{(l-1)}_t$ of most samples have similar directions, and $\Delta V^{(l)}_t$ and $V^{(l)}_t$ have similar directions. In other words, the component along the common direction $C^{(l)} \Delta V^{(l)\top}_t$ in $\Delta W^{(l)}_t = C^{(l)} \Delta V^{(l)\top}_t + \Delta\varepsilon^{(l)\top}_t$ will be further enhanced.
Step 3: Explaining the increasing feature similarity and the increasing gradient similarity, i.e., explaining the TFC phenomenon. As aforementioned, features $F^{(l-1)}_t$ of different samples are consistently pushed towards the same vector $\alpha_{\hat{c}} V^{(l)}_t$. This increases the similarity between features of different samples $\mathbb{E}_{x, x' \in X}[\cos(F^{(l-1)}_t|_x, F^{(l-1)}_t|_{x'})]$ in the first phase. On the other hand, the increasing similarity between feature gradients can also be explained from two views.
(1) The increasing feature similarity over different samples makes different training samples generate similar gating states $D^{(l)}_t$ in each ReLU layer. The increasing similarity of ReLU layers' gating states between different samples also increases the similarity of feature gradients between different samples in the same category $\mathbb{E}_{x, x' \in X_c}[\cos(\dot{F}^{(l-1)}_t|_x, \dot{F}^{(l-1)}_t|_{x'})]$. (2) Another view is that the component along the common direction $C^{(l)} V^{(l)\top}_t$ in $W^{(l)}_t$ is enhanced in the first phase. Because $C^{(l)}$ denotes the principal weight direction of the $i$-th column $w^{(l)}_{t,i}$ of $W^{(l)}_t$, each weight vector $w^{(l)}_{t,i}$ is optimized towards the common direction $C^{(l)}$. Eq. (3) shows that the increasing cosine similarity between $w^{(l)}_{t,i}$ and $C^{(l)}$ for all weight vectors will boost the similarity between feature gradients of different samples.
Vanishing gradients on correctly classified samples destroy the "self-enhanced system." All our explanation focuses on the early epochs of training, when only a few training samples of one or two dominating categories can be confidently classified. However, when the optimization of a single or two dominating categories saturates at the end of the first phase, gradients on the correctly classified samples of the dominating categories vanish. Then, gradients from training samples of other categories weaken the dominating role of a single or two categories in the learning of the MLP. Thus, the "self-enhanced system" is destroyed, and the learning of the MLP enters the second phase.
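View (2) in Step 3 relies on the decomposition in Remark 1, which can be checked numerically. The sketch below is a toy construction, not the authors' experiment: the weight vectors are built as a shared direction plus small residuals, and the gating states and upper-layer gradients are drawn at random for each hypothetical sample.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
h, d, num_samples = 64, 48, 100
C = rng.normal(size=h)                 # common direction C^(l)
beta = rng.normal(size=d)              # coefficients beta_i
eps = 0.01 * rng.normal(size=(d, h))   # small residuals epsilon_i, stored as rows
W_T = np.outer(beta, C) + eps          # row i is w_i^T = beta_i C^T + epsilon_i^T

abs_cos_to_beta = []
max_identity_error = 0.0
for _ in range(num_samples):
    gate = (rng.random(h) > 0.5).astype(float)  # hypothetical diagonal of D^(l)
    grad_out = rng.normal(size=h)               # hypothetical gradient w.r.t. F^(l)
    gated = gate * grad_out
    grad_in = W_T @ gated                       # gradient w.r.t. F^(l-1)
    decomposed = (C @ gated) * beta + eps @ gated
    max_identity_error = max(max_identity_error,
                             float(np.max(np.abs(grad_in - decomposed))))
    abs_cos_to_beta.append(abs(cosine(grad_in, beta)))

median_alignment = float(np.median(abs_cos_to_beta))
print(median_alignment)
```

The absolute value of the cosine is reported because the sign of the scalar coefficient in front of the vector of coefficients differs between samples. With small residuals, the back-propagated gradients of nearly all samples line up with that vector even though their gating states and upper-layer gradients are unrelated.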

3.3. THEORETICALLY ALLEVIATING THE TFC PHENOMENON

In previous sections, we have discovered and explained a fundamental yet counter-intuitive TFC phenomenon with the MLP. Explaining this phenomenon, which has lacked a theoretical explanation for a long time, is the distinctive contribution of this study. Besides, we find that we can use the above findings to explain that four typical operations can usually alleviate or strengthen the TFC phenomenon, i.e., normalization, momentum, initialization, and L2 regularization. Although these operations have been widely used, previous studies failed to theoretically explain their effectiveness. To this end, our analysis can explain a high likelihood for such operations to affect the TFC phenomenon, although it is not a proof of a strict sufficient condition or a necessary condition for the TFC phenomenon.
Centering operations for normalization. Based on theoretical analysis, we explain that the centering operation in normalization operations (e.g., that in batch normalization (BN)) can alleviate the TFC phenomenon in the first phase. Specifically, according to Theorem 2, the "self-enhanced system" of decreasing feature diversity requires features $F^{(l)}_t$ of any two training samples $x$ and $x'$ in the same category to be similar to each other. However, the centering operation prevents features $F^{(l)}_t$ of different samples from being similar to each other, because it subtracts the mean feature $\bar{F}^{(l)}_t = \mathbb{E}_{x \in X}[F^{(l)}_t|_x]$ from features of all samples, i.e., $F'^{(l)}_t|_x = F^{(l)}_t|_x - \bar{F}^{(l)}_t$. Therefore, the dissimilarity between features of different samples breaks the "self-enhanced system." Please see Appendix M.1 for more discussions. We conducted experiments to verify the above analysis. We compared MLPs trained with and without BN layers. Specifically, we added a BN layer after each linear layer to construct MLPs. Figure 8 (a) shows that the feature similarity in MLPs with BN layers kept decreasing. This verified that BN layers alleviated the TFC phenomenon.
Momentum.
Our theorems explain that momentum in gradient descent can alleviate the TFC phenomenon. Based on Lemma 3, the "self-enhanced system" of decreasing feature diversity requires weights along other directions $\varepsilon^{(l)}_t$ to be small enough. However, because the momentum operation strengthens influences of the initialized noisy weights $W^{(l)}_{t=0}$, it strengthens singular values of $\varepsilon^{(l)}_t$ to some extent, thereby alleviating the TFC phenomenon. Specifically, a larger momentum coefficient usually alleviates the TFC phenomenon more strongly. To this end, we trained MLPs with different momentum coefficients, and experimental results in Appendix M.2 verified the above analysis.
Initialization. We explain that the initialization of MLPs affects the TFC phenomenon. According to Lemma 3, the "self-enhanced system" requires very small weights along noise directions $\varepsilon^{(l)}_t$. However, increasing the variance of the initialized weights $W^{(l)}_{t=0}$ can boost singular values of $\varepsilon^{(l)}_t$, which alleviates the TFC phenomenon. Please see Appendix M.3 for more discussions. To verify the above claim, we conducted experiments by comparing MLPs trained using different initializations with different variances. We used $\gamma$ to control the variance of the initialization, i.e., $W^{(l)}_{t=0} \sim \mathcal{N}(0, \gamma \sigma^2_{\mathrm{var}} I)$, where $\sigma_{\mathrm{var}}$ is a constant computed following (Glorot & Bengio, 2010). Figure 8 (b) verifies that the initialization with a large variance alleviated the TFC phenomenon.
L2 regularization (ridge loss). We also explain that the L2 regularization (the ridge loss) can strengthen the TFC phenomenon. The total loss is given as $L(W_t) = L_{\mathrm{CE}}(W_t) + \lambda \|W_t\|_2^2$, where $L_{\mathrm{CE}}(W_t)$ represents the cross-entropy loss, and $\lambda \|W_t\|_2^2$ denotes the ridge loss. As aforementioned, the TFC phenomenon requires singular values of $\varepsilon^{(l)}_t$ to be small enough. However, because the loss of $\|W_t\|_2^2$ penalizes singular values of $\varepsilon^{(l)}_t$, it strengthens the TFC phenomenon.
The experimental verification is provided in Appendix M.4.
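The effect of the variance factor γ on singular values can be illustrated numerically. The following is a minimal sketch rather than the code attached to the submission: it assumes the common Glorot normal variance σ²_var = 2 / (fan_in + fan_out), and the helper name `scaled_glorot_normal` and the layer sizes are hypothetical. Drawing two weight matrices from the same random stream with γ = 0.1 and γ = 1 shows that every singular value grows by a factor of √10.

```python
import numpy as np

def scaled_glorot_normal(fan_in, fan_out, gamma, seed):
    """Sample W ~ N(0, gamma * sigma2_var * I).

    Assumption: sigma2_var = 2 / (fan_in + fan_out), the usual Glorot
    normal variance; the paper only states that it follows Glorot & Bengio.
    """
    sigma2_var = 2.0 / (fan_in + fan_out)
    std = np.sqrt(gamma * sigma2_var)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

# Same seed, so both matrices share the same underlying standard-normal draws.
W_small = scaled_glorot_normal(128, 256, gamma=0.1, seed=0)
W_large = scaled_glorot_normal(128, 256, gamma=1.0, seed=0)

sv_small = np.linalg.svd(W_small, compute_uv=False)
sv_large = np.linalg.svd(W_large, compute_uv=False)

# Scaling the variance by 10 scales every singular value by sqrt(10).
print(sv_large[:3] / sv_small[:3])
```

A larger γ therefore enlarges all singular values of the initial weights at once, which is the mechanism the initialization paragraph relies on.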

4. CONCLUSION

In this paper, we find that in the early stage of training, the MLP exhibits a fundamental yet counter-intuitive TFC phenomenon, i.e., the feature diversity keeps decreasing in the first phase. We explain this phenomenon by analyzing the learning dynamics of the MLP. Furthermore, we explain why four typical operations (normalization, momentum, initialization, and L2 regularization) alleviate or strengthen the TFC phenomenon.

ETHICS STATEMENT

We focus on a typical two-phase phenomenon in the learning of multi-layer perceptrons (MLPs). We discover and explain the reason for the feature collapse phenomenon in the first phase, i.e., the diversity of features over different samples keeps decreasing in the first phase, until samples of different categories share almost the same feature, which hurts the optimization of MLPs. There are no ethical issues with this paper.

REPRODUCIBILITY STATEMENT

We have provided the proof for all theoretical results in Appendix E, Appendix F, Appendix H.1, Appendix K, and Appendix L. We have also provided experimental details in Appendix B. The code has been attached with the submission.

A RELATED WORK

Understanding the optimization and the representation capacity of DNNs is an important direction to explain DNNs. The information bottleneck theory (Wolchover, 2017; Shwartz-Ziv & Tishby, 2017) quantitatively explained the information encoded by features in intermediate layers of DNNs. Xu & Raginsky (2017), Achille & Soatto (2018), and Cheng et al. (2018) used the information bottleneck theory to evaluate and improve the DNN's representation capacity. Arpit et al. (2017) analyzed the representation capacity of DNNs with real training data and noise. In addition, several metrics were proposed to measure the generalization capacity or robustness of DNNs, including the stiffness (Fort et al., 2019), the sensitivity metrics (Novak et al., 2018), the Fourier analysis (Xu, 2018), and the CLEVER score (Weng et al., 2018). In comparison, we explain the MLP from the perspective of the learning dynamics, i.e., we explain the TFC phenomenon in early iterations of the MLP.

Analyzing the learning dynamics is another perspective to understand DNNs. Many studies analyzed the local minima in the optimization landscape of linear networks (Baldi & Hornik, 1989; Saxe et al., 2013; Hardt & Ma, 2016; Daniely et al., 2016) and nonlinear networks (Choromanska et al., 2015; Kawaguchi, 2016; Safran & Shamir, 2018). Some studies discussed the convergence rate of gradient descent on separable data (Soudry et al., 2018; Xu et al., 2018; Nacson et al., 2019). Hoffer et al. (2017) and Jastrzębski et al. (2017) investigated the effects of the batch size and the learning rate on SGD dynamics. In addition, some studies analyzed the dynamics of gradient descent in the overparameterization regime (Arora et al., 2018; Jacot et al., 2018; Lee et al., 2018; Du et al., 2018). Besides, Papyan et al. (2020) and Han et al. (2021) explored the neural collapse phenomenon, which was observed at the end of the training stage. Unlike previous studies, we analyze the learning dynamics of features and weights of the MLP, in order to explain the TFC phenomenon in the early training process of the MLP.

B COMMON PHENOMENON SHARED BY DIFFERENT DNNS FOR DIFFERENT TASKS

In this section, we aim to demonstrate an interesting phenomenon, i.e., the decrease of the feature diversity in early iterations of training an MLP. Specifically, the training process of the MLP can usually be divided into two phases according to the training loss. In the first phase, the training loss does not decrease significantly, and in the second phase, the training loss suddenly begins to decrease. The two-phase phenomenon of the training loss is well-known, because many previous studies (Simsekli et al., 2019; Saxe et al., 2013; Vogl, 2018; Nguyen et al., 2018; Arab et al., 2020; Jepkoech et al., 2021; Stevens et al., 2020) have reported this phenomenon during the training process. However, previous studies did not theoretically explain the emergence of such a phenomenon. Instead, they usually understood this phenomenon in an intuitive manner, i.e., initialized DNNs failed to find a clear optimization direction, and thus these DNNs usually spent a long time searching for a reliable optimization direction. In this way, the training loss did not decrease significantly in very early epochs of training.

More crucially, the feature diversity decreases in the first phase. This phenomenon is widely shared by different DNNs with different architectures for different tasks. As Figure 1, Figure 2, and Figure 3 show, the feature diversity keeps decreasing (i.e., the cosine similarity between features of different samples keeps increasing) until samples of different categories share almost the same feature in the first phase. We can consider this as the temporary feature collapse (TFC). The TFC happens in various DNNs, including multi-layer perceptrons (MLPs), convolutional neural networks, and recurrent neural networks. DNNs trained with different loss functions and different learning rates may all exhibit the TFC phenomenon.
Specifically, we calculated the cosine similarity between features of fifty samples from ten categories on the CIFAR-10 dataset, the MNIST dataset, and the Tiny ImageNet dataset. The abscissa and ordinate of each heatmap represent the sample index, and the color of each grid cell indicates the cosine similarity of that sample pair. Note that all the features are extracted after the ReLU layer. Thus, the features have no negative entries, and the cosine similarity is always non-negative.

Besides, as Figure 1 in the main paper shows, samples from different categories share diverse features in the beginning of the training, but share almost the same feature at the end of the training. Specifically, we used t-SNE for visualization (initialized by PCA).

Let us take the 9-layer MLP trained on the CIFAR-10 dataset as an example, where each layer of the MLP had 512 neurons. As Figure 4 (e)(f) shows, before the 1300-th iteration (the first phase), both the feature diversity and the gradient diversity kept decreasing, i.e., both the cosine similarity between features over different samples and the cosine similarity between gradients kept increasing. After the 1300-th iteration (the second phase), the feature diversity and the gradient diversity suddenly began to increase, i.e., their similarities began to decrease. Therefore, the MLP had the lowest feature diversity and the lowest gradient diversity at around the 1300-th iteration. Specifically, the training loss was evaluated on the whole training set.

B.1 ON THE CIFAR-10 DATASET

In this subsection, we demonstrated that the two-phase phenomenon was shared by different MLPs on the CIFAR-10 dataset (Krizhevsky et al., 2009). For different MLPs, we adopted the learning rate η = 0.1, the batch size bs = 100, the SGD optimizer, and the ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of MLPs trained on the CIFAR-10 dataset are shown in Figure 4.
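The feature-similarity heatmaps described above can be computed along the following lines. This is a minimal sketch under stated assumptions, not the attached code: the random matrix stands in for real intermediate-layer features, and the helper name `pairwise_cosine_similarity` is hypothetical.

```python
import numpy as np

def pairwise_cosine_similarity(features, eps=1e-12):
    """Return the (n, n) matrix of cosine similarities between rows."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)  # guard against all-zero rows
    return unit @ unit.T

rng = np.random.default_rng(0)
# Stand-in for the features of fifty samples in a 512-neuron layer.
pre_activations = rng.standard_normal((50, 512))
features = np.maximum(pre_activations, 0.0)   # features taken after the ReLU

sim = pairwise_cosine_similarity(features)    # one heatmap: sim[i, j]
off_diagonal = sim[~np.eye(50, dtype=bool)]
print(off_diagonal.mean())                    # average similarity over sample pairs
```

Because post-ReLU features have no negative entries, every entry of `sim` is non-negative. Tracking the mean off-diagonal entry over training iterations gives a single curve of feature similarity.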

B.2 ON THE MNIST DATASET

In this subsection, we demonstrated that the two-phase phenomenon was shared by different MLPs on the MNIST dataset (LeCun et al., 1998). For different MLPs, we adopted the learning rate η = 0.01, the batch size bs = 100, the SGD optimizer, and the ReLU activation function. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of MLPs trained on the MNIST dataset are shown in Figure 5.

B.3 ON THE TINY IMAGENET DATASET

In this subsection, we demonstrated that the two-phase phenomenon was shared by different MLPs on the Tiny ImageNet dataset (Le & Yang, 2015). Specifically, we randomly selected the following 50 categories for training: orangutan, parking meter, snorkel, American alligator, oboe, basketball, rocking chair, hopper, neck brace, candy store, broom, seashore, sewing machine, sunglasses, panda, pretzel, pig, volleyball, puma, alp, barbershop, ox, flagpole, lifeboat, teapot, walking stick, brain coral, slug, abacus, comic book, CD player, school bus, banister, bathtub, German shepherd, black stork, computer keyboard, tarantula, sock, Arabian camel, bee, cockroach, cannon, tractor, cardigan, suspension bridge, beer bottle, viaduct, guacamole, and iPod. For different MLPs, we adopted the learning rate η = 0.1, the batch size bs = 100, the SGD optimizer, and the ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. Note that the random crops had a size of 32×32. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of MLPs trained on the Tiny ImageNet dataset are shown in Figure 6.

B.4 ON THE CENSUS DATASET

In this subsection, we demonstrated that the two-phase phenomenon was shared by different MLPs on the UCI census income tabular dataset (Census) (Asuncion & Newman, 2007) . For different MLPs, we adopted the learning rate η = 0.1, the batch size bs = 1000, the SGD optimizer, and the ReLU activation function. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of MLPs trained on the census are shown in Figure 7 .

B.5 ON THE COMMERCIAL DATASET

In this subsection, we demonstrated that the two-phase phenomenon was shared by different MLPs on the UCI TV news channel commercial detection dataset (Commercial) (Asuncion & Newman, 2007). For different MLPs, we adopted the learning rate η = 0.1, the batch size bs = 1000, the ...

B.6 ON THE COLA DATASET

In this subsection, we demonstrated that the two-phase phenomenon was shared by the revised LSTMs on the CoLA dataset (Warstadt et al., 2019). We used two-layer unidirectional LSTMs concatenated with MLPs. Specifically, we trained two LSTMs with 5-layer MLPs, where each layer of the MLP had 256 and 512 neurons, respectively. We adopted the learning rate η = 0...

B.7 ON THE SST-2 DATASET

In this subsection, we demonstrated that the two-phase phenomenon was shared by the revised LSTMs on the SST-2 dataset (Socher et al., 2013). We used unidirectional LSTMs concatenated with MLPs. Specifically, we trained three LSTMs with 4-layer MLPs, 4-layer MLPs, and 5-layer MLPs, respectively, where each layer of the MLP had 32, 64, and 128 neurons, respectively. We adopted the learning rate η = 0.1, the batch size bs = 500, the SGD optimizer, and the ReLU activation function. Since the training of LSTMs on the SST-2 dataset with the SGD optimizer is unstable, we randomly selected 15000 training samples from the training set and trained LSTMs on these 15000 samples. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of LSTMs trained on the SST-2 dataset are shown in Figure 10.

B.8 ON THE AGNEWS DATASET

In this subsection, we demonstrated that the two-phase phenomenon was shared by the revised LSTMs on the AGNEWS dataset. We used two-layer unidirectional LSTMs concatenated with MLPs. Specifically, we trained three LSTMs with 4-layer MLPs, 4-layer MLPs, 5-layer MLPs, respectively, where each layer of the MLP had 32, 64, and 128 neurons, respectively. We adopted the learning rate η = 0.1, the batch size bs = 500, the SGD optimizer, and the ReLU activation function. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of LSTMs trained on the AGNEWS are shown in Figure 11 .

B.9 DIFFERENT TRAINING BATCH SIZES

In this subsection, we demonstrated that the two-phase phenomenon was shared by MLPs trained on the CIFAR-10 dataset with different training batch sizes. For different MLPs, we adopted the learning rate η = 0.1, the SGD optimizer, and the ReLU activation function. Besides, we used 

B.10 DIFFERENT LEARNING RATES

In this subsection, we demonstrated that the two-phase phenomenon was shared by MLPs trained on the CIFAR-10 dataset with different learning rates. For different MLPs, we adopted the batch size bs = 100, the SGD optimizer, and the ReLU activation function. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained two 7-layer MLPs with 256 neurons in each layer, with learning rates η = 0.1, 0.01 respectively. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of MLPs trained with different learning rates are shown in Figure 13 . 

B.11 DIFFERENT ACTIVATION FUNCTIONS

In this subsection, we demonstrated that the two-phase phenomenon was shared by MLPs with different activation functions. For different MLPs, we adopted the learning rate η = 0.1, the batch size bs = 100, and the SGD optimizer. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained three 9-layer MLPs with 512 neurons in each layer with the ReLU activation function, the Leaky ReLU (slope=0.1) activation function, and the Leaky ReLU (slope=0.01) activation function, respectively. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of MLPs trained with different activation functions are shown in Figure 14 . 

B.14 THE TFC PHENOMENON WITH THE FOCAL LOSS

In this subsection, we demonstrated that the two-phase phenomenon was shared by MLPs learned on the CIFAR-10 dataset with the focal loss. Specifically, for different MLPs, we adopted the learning rate η = 0.1, the batch size bs = 100, and the SGD optimizer. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained 9-layer MLPs and 7-layer MLPs with 512 neurons in each layer and the ReLU activation function. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of MLPs trained with different focusing parameters γ are shown in Figure 27 and Figure 28.

In this subsection, we demonstrated that the two-phase phenomenon was shared by MLPs trained on the CIFAR-10 dataset with different train/test splits. There are 50000 samples in the training set and 10000 samples in the testing set of the CIFAR-10 dataset. We combined the training set and the testing set into one dataset and split it with the train/test split ratios of 5:1, 4:2, and 3:3, respectively. Note that the ratio of 5:1 was the official ratio for the CIFAR-10 dataset. For different MLPs, we adopted the learning rate η = 0.1, the batch size bs = 100, and the SGD optimizer. Besides, we used two data augmentation methods, including random cropping and random horizontal flipping. We trained 9-layer MLPs with 512 neurons in each layer and the ReLU activation function on these three different datasets. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of MLPs trained with different train/test split ratios are shown in Figure 29. Figure 29 shows that the TFC phenomenon was still observed under different train/test split ratios.

In this section, we aim to discuss the learning-sticking problem in the learning of MLPs. In fact, this problem appears in various DNNs, including MLPs, CNNs, and RNNs, when the task is difficult enough. Explaining and solving the occasional sticking of the training of DNNs is of significant value for different tasks. We consider the learning-sticking problem as the first phase with an infinite length. Moreover, we theoretically explain mechanisms of several heuristic solutions to the learning-sticking problem. To this end, the learning-sticking problem can be solved based on our study, as shown in Figure 30, Figure 31, Figure 32, Figure 33, Figure 34, and Figure 35.

Specifically, we trained a 9-layer MLP on the CIFAR-10 dataset, where each layer of the MLP had 512 neurons and its initial weights were sampled from $\mathcal{N}(0, \Sigma = \gamma_1\sigma_{var}^2 I)$. $\sigma_{var}^2$ was computed following Glorot & Bengio (2010), and $\gamma_1 = 0.1$. We trained a VGG-11 model on the CIFAR-10 dataset, and the initial weights of its fully connected layers were sampled from $\mathcal{N}(0, \Sigma = \gamma_1\sigma_{var}^2 I)$ ($\gamma_1 = 0.1$). We trained a VGG-13 model on the CIFAR-10 dataset, and the initial weights of its fully connected layers were sampled from $\mathcal{N}(0, \Sigma = \gamma_1\sigma_{var}^2 I)$ ($\gamma_1 = 0.1$). We trained two ResNet-18 models (without BN layers) on the CIFAR-10 dataset and the Tiny ImageNet dataset, respectively, and the initial weights of their fully connected layers were sampled from $\mathcal{N}(0, \Sigma = \gamma_1\sigma_{var}^2 I)$ ($\gamma_1 = 0.1$). We observed that these DNNs all suffered from the learning-sticking problem (i.e., the loss minimization of these DNNs got stuck) when their initial weights were sampled from $\mathcal{N}(0, \Sigma = \gamma_1\sigma_{var}^2 I)$ (orange curves). According to our study, increasing the variance of initial weights can shorten the first phase, thereby solving the learning-sticking problem. To this end, we trained counterparts of these DNNs, whose only difference from the previous DNNs was that the variance of initial weights was increased to $\gamma_2\sigma_{var}^2 I$ ($\gamma_2 = 1$).

D MORE RESULTS ON OTHER DATASETS

In this section, we provide more results on the MNIST dataset and the Tiny ImageNet dataset. Figure 36 and Table 1 empirically verify the strength of the primary common direction, and are supplementary to Figure 4 and Table 1 in the main paper, respectively. Figure 37 illustrates the change of $o^{(l)} = \cos(\Delta V_t^{(l)}, F_t^{(l-1)}) \cdot \cos(V_t^{(l)}, \Delta F_t^{(l-1)})$ in the first phase, and is supplementary to Figure 6 in the main paper.

Figure 36: ... ImageNet dataset. We trained a 9-layer MLP, where each layer of the MLP had 512 neurons. We computed the strength of common directions on the two categories with the highest training accuracies. $s_i = \|C_i \overline{\Delta V}_i^\top\|_F$ measures the strength of weight changes along the $i$-th common direction, where $\overline{\Delta V}_i = \mathbb{E}_t[\Delta V_{i,t}]$. It can be observed that the strength of the primary direction was much greater than the strength of other directions.

Table 1: Strength of components of weight changes along the primary common direction and other directions. We trained a 9-layer MLP on the MNIST dataset. Each layer of the MLP had 512 neurons. It can be observed that the strength of the primary common direction was much greater than those of other directions.

Figure 37: Change of $o^{(l)} = \cos(\Delta V_t^{(l)}, F_t^{(l-1)}) \cdot \cos(V_t^{(l)}, \Delta F_t^{(l-1)})$ in the first phase. We trained a 9-layer MLP on the MNIST dataset. Each layer of the MLP had 512 neurons. The shade represents the standard deviation over different samples.

E PROOF FOR THE LEMMA 1

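In this section, we present the detailed proof for Lemma 1.

Lemma 1. For the decomposition $\Delta W_t^\top = \Delta V_t C^\top + \Delta\varepsilon_t$, given weight changes over different samples $\Delta W_t^\top$, we can compute the common direction $C$ by minimizing the fitting error $\Delta\varepsilon_t$ when we use ...

proof. We can represent the $j$-th column $\Delta\varepsilon_t^\top[j]$ of $\Delta\varepsilon_t^\top$ by the vector $C$ and a residual term $\Delta\varepsilon_t^\top[j]'$ as follows:

$$\Delta\varepsilon_t^\top[j] = \lambda C + \Delta\varepsilon_t^\top[j]',$$

where $C^\top \Delta\varepsilon_t^\top[j]' = 0$, and $\lambda$ is a scalar. Then,

$$\begin{aligned}
\|\Delta\varepsilon_t^\top[j]\|_2^2 &= \|\lambda C + \Delta\varepsilon_t^\top[j]'\|_2^2 \\
&= (\lambda C + \Delta\varepsilon_t^\top[j]')^\top(\lambda C + \Delta\varepsilon_t^\top[j]') \\
&= \lambda^2 C^\top C + (\Delta\varepsilon_t^\top[j]')^\top \Delta\varepsilon_t^\top[j]' \\
&= \lambda^2 C^\top C + \|\Delta\varepsilon_t^\top[j]'\|_2^2
\end{aligned} \tag{2}$$

Obviously, $\|\Delta\varepsilon_t^\top[j]\|_2^2$ is the smallest when $\lambda = 0$. In other words, $\Delta\varepsilon_t^\top[j]$ does not contain the component along the direction $C$, and $C^\top\Delta\varepsilon_t^\top[j] = 0$. Therefore, $\|\Delta\varepsilon_t^\top[j]\|_2^2$ reaches its minimum if and only if $\Delta\varepsilon_t C = 0$. When $\|\Delta\varepsilon_t^\top[j]\|_2^2$ reaches its minimum for every $j$, $\|\Delta\varepsilon_t\|_F^2$ becomes the smallest. Thus, we have:

$$\begin{aligned}
\Delta W_t &= C\Delta V_t^\top + \Delta\varepsilon_t^\top \\
C^\top \Delta W_t &= C^\top C\Delta V_t^\top + C^\top\Delta\varepsilon_t^\top = C^\top C\Delta V_t^\top + 0
\end{aligned}$$

Then, $\Delta V_t^\top$ can be represented as follows.

$$\Delta V_t^\top = \frac{C^\top \Delta W_t}{C^\top C} \tag{4}$$

Substituting Eq. 4 into $\Delta W_t = C\Delta V_t^\top + \Delta\varepsilon_t^\top$, we have

$$\Delta\varepsilon_t = \Delta W_t^\top - \frac{\Delta W_t^\top C C^\top}{C^\top C}$$

F PROOF FOR THE LEMMA 2

In this section, we present the detailed proof for Lemma 2.

Lemma 2. (We can also decompose the weight $W_t^{(l)}$ into the component along the common direction $C$ and the component $\varepsilon_t$ in other directions.) Given the weight $W_t^\top$ and the common direction $C$, the decomposition $W_t^\top = V_t C^\top + \varepsilon_t$ can be conducted as $V_t = \frac{W_t^\top C}{C^\top C}$ and $\varepsilon_t = W_t^\top - \frac{W_t^\top C C^\top}{C^\top C}$ s.t. $\varepsilon_t C = 0$. Such settings minimize $\|\varepsilon_t\|_F$.

proof. Let $\varepsilon_t^\top[j]$ denote the $j$-th column of the matrix $\varepsilon_t^\top \in \mathbb{R}^{h\times d}$. We can represent $\varepsilon_t^\top[j]$ by the vector $C$ and a residual term $\varepsilon_t^\top[j]'$ as follows:

$$\varepsilon_t^\top[j] = \lambda C + \varepsilon_t^\top[j]',$$

where $C^\top\varepsilon_t^\top[j]' = 0$ and $\lambda$ is a scalar. Then,

$$\begin{aligned}
\|\varepsilon_t^\top[j]\|_2^2 &= \|\lambda C + \varepsilon_t^\top[j]'\|_2^2 \\
&= (\lambda C + \varepsilon_t^\top[j]')^\top(\lambda C + \varepsilon_t^\top[j]') \\
&= \lambda^2 C^\top C + (\varepsilon_t^\top[j]')^\top \varepsilon_t^\top[j]' \\
&= \lambda^2 C^\top C + \|\varepsilon_t^\top[j]'\|_2^2
\end{aligned} \tag{7}$$

Obviously, $\|\varepsilon_t^\top[j]\|_2^2$ becomes the smallest when $\lambda = 0$. In other words, $\varepsilon_t^\top[j]$ does not contain the component along the direction $C$, and $C^\top\varepsilon_t^\top[j] = 0$. Therefore, $\|\varepsilon_t^\top[j]\|_2^2$ reaches its minimum if and only if $\varepsilon_t C = 0$. When $\|\varepsilon_t^\top[j]\|_2^2$ reaches its minimum for every $j$, $\|\varepsilon_t\|_F^2$ becomes the smallest. Thus, we have:

$$\begin{aligned}
W_t &= CV_t^\top + \varepsilon_t^\top \\
C^\top W_t &= C^\top C V_t^\top + C^\top\varepsilon_t^\top = C^\top C V_t^\top + 0
\end{aligned}$$

Then, $V_t^\top$ can be written as follows.

$$V_t^\top = \frac{C^\top W_t}{C^\top C} \tag{9}$$

Substituting Eq. 9 into $W_t = CV_t^\top + \varepsilon_t^\top$, we have

$$\varepsilon_t = W_t^\top - \frac{W_t^\top C C^\top}{C^\top C}$$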
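The closed form in Lemma 2 can be checked numerically. The following is a minimal sketch with random stand-in matrices; the function name `decompose_along_direction` and the sizes `h` and `d` are hypothetical choices for illustration.

```python
import numpy as np

def decompose_along_direction(W, C):
    """Split W^T into V C^T + eps, with eps @ C == 0 (Lemma 2)."""
    c_norm_sq = float(C @ C)
    V = W.T @ C / c_norm_sq         # V = W^T C / (C^T C), shape (d,)
    eps = W.T - np.outer(V, C)      # eps = W^T - W^T C C^T / (C^T C), shape (d, h)
    return V, eps

rng = np.random.default_rng(0)
h, d = 6, 4                         # illustrative layer sizes
W = rng.standard_normal((h, d))     # stand-in weight matrix
C = rng.standard_normal(h)          # stand-in common direction

V, eps = decompose_along_direction(W, C)

# Any other coefficient vector leaves a residual with a larger Frobenius norm.
V_other = V + 0.1 * rng.standard_normal(d)
eps_other = W.T - np.outer(V_other, C)
print(np.linalg.norm(eps), np.linalg.norm(eps_other))
```

The same function applies to Lemma 1 when `W` is replaced by a weight change, since both lemmas use the same projection onto the direction `C`.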

G DECOMPOSITION OF COMMON DIRECTIONS

Actually, the estimation of the common direction $C$ is similar to the singular value decomposition (SVD), although there are slight differences. We compute the average weight change $\overline{\Delta W}_t = \mathbb{E}_{x\in X}[\Delta W_t|_x]$, where $\Delta W_t|_x$ denotes the weight change made by the sample $x$. Then, we decompose $\overline{\Delta W}_t$ into components along five common directions as $\overline{\Delta W}_t = C_1\Delta V_{1,t}^\top + C_2\Delta V_{2,t}^\top + \cdots + C_5\Delta V_{5,t}^\top + \Delta\varepsilon_{5,t}^\top$, where $C_1 = C$ is termed the primary common direction. $C_1, C_2, C_3, C_4$, and $C_5$ are orthogonal to each other. $C_2, C_3, C_4$, and $C_5$ represent the second, third, fourth, and fifth common directions, respectively, i.e., $C_i$ represents the $i$-th common direction. $\Delta V_{i,t}$ denotes the average weight change along the $i$-th common direction decomposed from $\overline{\Delta W}_t$.

Specifically, we first decompose the average weight change $\overline{\Delta W}_t$ after the $t$-th iteration as $\overline{\Delta W}_t = C\Delta V_t^\top + \Delta\varepsilon_t^\top$. We remove all components along the common direction $C$ from $\overline{\Delta W}_t$, and obtain $\overline{\Delta W}_{\text{new},t} = \overline{\Delta W}_t - C\Delta V_t^\top = \Delta\varepsilon_t^\top$. Then, we further decompose $\overline{\Delta W}_{\text{new},t} = C_2\Delta V_{2,t}^\top + \Delta\varepsilon_{2,t}^\top$. In this way, we can consider $C_2$ as the secondary common direction, while $C_1 = C$ is termed the primary common direction. We conduct this process recursively and obtain common directions $\{C_1, C_2, \cdots, C_5\}$. Accordingly, $\overline{\Delta W}_t$ is decomposed into $\overline{\Delta W}_t = C_1\Delta V_{1,t}^\top + C_2\Delta V_{2,t}^\top + \cdots + C_5\Delta V_{5,t}^\top + \Delta\varepsilon_{5,t}^\top$.
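The recursive procedure above can be sketched as a rank-one deflation. This is only an illustration under a loudly stated assumption: each common direction is taken as the top left singular vector of the current residual, whereas the paper says its estimation is merely similar to the SVD. The function name `common_directions` and the random stand-in for the averaged weight change are hypothetical.

```python
import numpy as np

def common_directions(delta_W, num_directions=5):
    """Peel off rank-one components C_i dV_i^T from delta_W one at a time."""
    residual = delta_W.copy()
    directions, coefficients = [], []
    for _ in range(num_directions):
        # Assumption: the direction minimizing the fitting error is the
        # top left singular vector of the current residual.
        U, _, _ = np.linalg.svd(residual, full_matrices=False)
        C = U[:, 0]                            # shape (h,)
        dV = residual.T @ C / (C @ C)          # same projection as Lemma 1
        residual = residual - np.outer(C, dV)  # remove the component along C
        directions.append(C)
        coefficients.append(dV)
    return np.array(directions), np.array(coefficients), residual

rng = np.random.default_rng(0)
delta_W = rng.standard_normal((8, 10))         # stand-in averaged weight change
Cs, dVs, residual = common_directions(delta_W)

# Strength of each component, measured by the Frobenius norm of C_i dV_i^T.
strengths = [np.linalg.norm(np.outer(C, dV)) for C, dV in zip(Cs, dVs)]
print(strengths)
```

Under this assumption the extracted directions are mutually orthogonal, and the five components plus the final residual add back up to the original matrix.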

H DECOMPOSITION OF THE WEIGHT CHANGE MADE BY A SAMPLE x

H.1 PROOF FOR THEOREM 1

In this subsection, we present the detailed proof for Theorem 1.

Theorem 1. The weight change made by a sample can be decomposed into $(h+1)$ terms after the $t$-th iteration as follows.

$$\Delta W_t^{(l)} = \Delta W_{\text{primary},t}^{(l)} + \sum_{k=1}^{h}\Delta W_{\text{noise},t}^{(l,k)} \overset{\text{rewritten}}{=\!=\!=} \Gamma_t^{(l)} F_t^{(l-1)\top} + \kappa_t^{(l)\top},$$

where $\Delta W_{\text{primary},t}^{(l)} = D_t^{(l)} V_t^{(l+1)} C^{(l+1)\top} C^{(l+1)} \Delta V_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2$ denotes the component along the primary common direction, and $\Delta W_{\text{noise},t}^{(l,k)} = D_t^{(l)} \varepsilon_t^{(l+1,k)} \Delta\varepsilon_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2$ denotes the component along the $k$-th common direction in the noise term. $\varepsilon_t^{(l+1,k)} = \Sigma_{kk} U_k V_k^\top$, where the SVD of $\varepsilon_t^{(l+1)} \in \mathbb{R}^{h\times h'}$ is given as $\varepsilon_t^{(l+1)} = U\Sigma V^\top$ ($h \le h'$), and $\Sigma_{kk} \in \mathbb{R}$ denotes the $k$-th singular value. $\varepsilon_t^{(l+1)} = \sum_k \varepsilon_t^{(l+1,k)}$. $U_k$ and $V_k$ denote the $k$-th column of the matrix $U$ and $V$, respectively. Besides, we have $\forall k \in \{1, 2, \ldots, h\}$, $U_k^\top C^{(l+1)} = 0$. Consequently, we have $\Gamma_t^{(l)} = D_t^{(l)} V_t^{(l+1)} C^{(l+1)\top} C^{(l+1)} \Delta V_t^{(l+1)\top} F_t^{(l)} / \|F_t^{(l)}\|_2^2 \in \mathbb{R}^h$, and $\kappa_t^{(l)\top} = D_t^{(l)} \varepsilon_t^{(l+1)} \Delta\varepsilon_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 \in \mathbb{R}^{h\times d}$.

proof. We can represent the weight matrix as $W_t^{(l)} = C^{(l)} V_t^{(l)\top} + \varepsilon_t^{(l)\top}$. In addition, according to back propagation and the chain rule, we have $\Delta W_t^{(l)} = -\eta D_t^{(l)} \dot{F}_t^{(l)} F_t^{(l-1)\top}$, where $\dot{F}_t^{(l)} = \frac{\partial Loss}{\partial F_t^{(l)}}$, and $\eta$ denotes the learning rate. According to Lemma 1 and Lemma 2, we have $\Delta\varepsilon_t^{(l+1)} C^{(l+1)} = 0$ and $\varepsilon_t^{(l+1)} C^{(l+1)} = 0$, so the two cross terms in the expansion below vanish.

$$\begin{aligned}
\Delta W_t^{(l)} &= -\eta D_t^{(l)} \dot{F}_t^{(l)} F_t^{(l-1)\top} \\
&= -\eta D_t^{(l)} W_t^{(l+1)\top} D_t^{(l+1)} \dot{F}_t^{(l+1)} F_t^{(l-1)\top} \\
&= D_t^{(l)} W_t^{(l+1)\top} \Delta W_t^{(l+1)} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 \\
&= D_t^{(l)} \big(V_t^{(l+1)} C^{(l+1)\top} + \varepsilon_t^{(l+1)}\big)\big(C^{(l+1)} \Delta V_t^{(l+1)\top} + \Delta\varepsilon_t^{(l+1)\top}\big) F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 \\
&= D_t^{(l)} \big[V_t^{(l+1)} C^{(l+1)\top} C^{(l+1)} \Delta V_t^{(l+1)\top} + V_t^{(l+1)} C^{(l+1)\top} \Delta\varepsilon_t^{(l+1)\top} \\
&\qquad + \varepsilon_t^{(l+1)} C^{(l+1)} \Delta V_t^{(l+1)\top} + \varepsilon_t^{(l+1)} \Delta\varepsilon_t^{(l+1)\top}\big] F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 \\
&= D_t^{(l)} \big(V_t^{(l+1)} C^{(l+1)\top} C^{(l+1)} \Delta V_t^{(l+1)\top} + \varepsilon_t^{(l+1)} \Delta\varepsilon_t^{(l+1)\top}\big) F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 \\
&= D_t^{(l)} V_t^{(l+1)} C^{(l+1)\top} C^{(l+1)} \Delta V_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 + D_t^{(l)} \varepsilon_t^{(l+1)} \Delta\varepsilon_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2
\end{aligned} \tag{12}$$

$\varepsilon_t^{(l+1,k)} = \Sigma_{kk} U_k V_k^\top$, where the singular value decomposition of $\varepsilon_t^{(l+1)}$ is given as $\varepsilon_t^{(l+1)} = U\Sigma V^\top$, and $\Sigma_{kk}$ denotes the $k$-th singular value. $U_k$ and $V_k$ denote the $k$-th column of the matrix $U$ and $V$, respectively. We can derive the following equations.

$$\begin{aligned}
\Delta W_t^{(l)} &= D_t^{(l)} V_t^{(l+1)} C^{(l+1)\top} C^{(l+1)} \Delta V_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 + D_t^{(l)} \varepsilon_t^{(l+1)} \Delta\varepsilon_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 \\
&= D_t^{(l)} V_t^{(l+1)} C^{(l+1)\top} C^{(l+1)} \Delta V_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 + \sum_{k=1}^{h} D_t^{(l)} \varepsilon_t^{(l+1,k)} \Delta\varepsilon_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2 \\
&= \Delta W_{\text{primary},t}^{(l)} + \sum_{k=1}^{h} \Delta W_{\text{noise},t}^{(l,k)}
\end{aligned} \tag{13}$$

In addition, if we set $\Gamma_t^{(l)} = D_t^{(l)} V_t^{(l+1)} C^{(l+1)\top} C^{(l+1)} \Delta V_t^{(l+1)\top} F_t^{(l)} / \|F_t^{(l)}\|_2^2$ and $\kappa_t^{(l)\top} = D_t^{(l)} \varepsilon_t^{(l+1)} \Delta\varepsilon_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2$, then we can rewrite Eq. (13) as follows.

$$\Delta W_t^{(l)} \overset{\text{rewritten}}{=\!=\!=} \Gamma_t^{(l)} F_t^{(l-1)\top} + \kappa_t^{(l)\top} \tag{14}$$

In this subsection, we explain the phenomenon that $S_3$ does not decrease monotonically in Table 1 in the Appendix and Table 1 in the main paper (Page 6). In fact, we first decompose $\varepsilon_t^{(l+1)} = \sum_k \varepsilon_t^{(l+1,k)}$ according to the SVD. Then $\Delta W_{\text{noise},t}^{(l,k)}$ is computed as $\Delta W_{\text{noise},t}^{(l,k)} = D_t^{(l)} \varepsilon_t^{(l+1,k)} \Delta\varepsilon_t^{(l+1)\top} F_t^{(l)} F_t^{(l-1)\top} / \|F_t^{(l)}\|_2^2$. Accordingly, the strength of weight changes along the primary direction is computed as $S$ ...

According to Eq. (3) in the main paper, we have

$$\dot{F}_t^{(l-1)} = \big(C^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)}\big)\cdot\beta + \epsilon D_t^{(l)} \dot{F}_t^{(l)} \tag{15}$$

Thus, if $C^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)}$ is large enough (i.e., $W_t^{(l)\top}$ keeps being optimized along the common direction $C^{(l)}$ for a long time), then the feature gradients $\dot{F}_t^{(l-1)}$ of different samples will be roughly parallel to the same vector $\beta$. This is because $C^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)}$ is a scalar and the term $\epsilon D_t^{(l)} \dot{F}_t^{(l)}$ is small. In other words, the diversity between feature gradients $\dot{F}_t^{(l-1)}$ of different samples decreases. Here, $\beta = [\beta_1, \beta_2, \cdots, \beta_d]$, and $\epsilon = [\epsilon_1, \epsilon_2, \cdots, \epsilon_d]^\top$.

J DISCUSSION ON THE BACKGROUND ASSUMPTION

In the above section, we demonstrate that in the ideal state, i.e., when $W_t^{(l)\top}$ has been optimized towards the common direction $C^{(l)}$ for a long time, the feature gradients $\dot{F}_t^{(l-1)}$ of different samples will be roughly parallel to the same vector $\beta$. In this way, we can explain that the diversity between feature gradients $\dot{F}_t^{(l-1)}$ of different samples decreases. In comparison, in the current section, we mainly discuss the trustworthiness of the background assumption in Section 4.2 in the main paper. We aim to show that, under the assumption that features $F_t^{(l-1)}$ of different samples have been pushed a little bit towards a specific common direction, we can find at least one learning iteration in the first phase where $\Delta F_t^{(l-1)}$ and $F_t^{(l-1)}$ of most samples have similar directions, and $V_t^{(l)}$ and $\Delta V_t^{(l)}$ have similar directions.

The assumption that features $F_t^{(l-1)}$ of different samples have been pushed a little bit towards a specific common direction describes an intermediate state between the chaotic initial state of the MLP and the ideal state introduced in the above section. In this way, we can assume that $C^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)}$ is large. According to Eq. (2) in the main paper and Lemma 2, we have $\dot{F}_t^{(l-1)} = W_t^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)}$ and $W_t^{(l)\top} = V_t^{(l)} C^{(l)\top} + \varepsilon_t^{(l)\top}$. Thus, we have

$$\begin{aligned}
\dot{F}_t^{(l-1)} &= W_t^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)} \\
&= \big(V_t^{(l)} C^{(l)\top} + \varepsilon_t^{(l)\top}\big) D_t^{(l)} \dot{F}_t^{(l)} \\
&= V_t^{(l)} C^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)} + \varepsilon_t^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)}
\end{aligned} \tag{16}$$

If the scalar $C^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)}$ is large, we can roughly consider

$$\dot{F}_t^{(l-1)} \approx V_t^{(l)} C^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)} = V_t^{(l)} \cdot \big(C^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)}\big) \parallel V_t^{(l)}$$

It means that the feature gradient $\dot{F}_t^{(l-1)}$ is roughly parallel to the vector $V_t^{(l)}$. Furthermore, because the feature gradient $\dot{F}_t^{(l-1)}$ and the change of feature $\Delta F_t^{(l-1)}$ can be considered negatively parallel to each other, we have

$$\Delta F_t^{(l-1)} \parallel \dot{F}_t^{(l-1)} \parallel V_t^{(l)}$$

Similarly, we have $\Delta F$ ...

$$\Delta V_t^{(l)} = V_{t+1}^{(l)} - V_t^{(l)} \approx k_{t+1}\Delta F_{t+1}^{(l-1)} - k_t \Delta F_t^{(l-1)} \tag{19}$$

If features $F_t^{(l-1)}$ of different samples have been pushed a little bit towards a specific common direction, then it is easy to find at least one learning iteration in which $\Delta F$ ... . Meanwhile, we can find at least one learning iteration in the first phase where the changes of features in the $t$-th iteration, $\Delta F_t^{(l-1)}$, and in the $(t+1)$-th iteration, $\Delta F_{t+1}^{(l-1)}$, are roughly the same. In other words, $\Delta F_t^{(l-1)} \approx \Delta F_{t+1}^{(l-1)}$. Thus, we have

$$\Delta V_t^{(l)} \approx (k_{t+1} - k_t)\Delta F_t^{(l-1)} \parallel \Delta F_t^{(l-1)} \parallel V_t^{(l)} \tag{20}$$

In this way, we can obtain that $V_t^{(l)}$ and $\Delta V_t^{(l)}$ have similar directions.

K PROOF FOR LEMMA 3

In this section, we present the detailed proof for Lemma 3.

Lemma 3. Given an input sample $x \in X$ and a common direction $C^{(l)}$ after the $t$-th iteration, if the noise term $\varepsilon_t^{(l)}$ is small enough to satisfy $|\Delta V_t^{(l)\top} F_t^{(l-1)} V_t^{(l)\top} V_t^{(l)} C^{(l)\top} C^{(l)} \Delta V_t^{(l)\top} F_t^{(l-1)}| \gg |\Delta V_t^{(l)\top} F_t^{(l-1)} V_t^{(l)\top} \varepsilon_t^{(l)} \Delta\varepsilon_t^{(l)\top} F_t^{(l-1)}|$, we can obtain $\cos(\Delta V_t^{(l)}, F_t^{(l-1)}) \cdot \cos(V_t^{(l)}, \Delta F_t^{(l-1)}) \ge 0$, where $\Delta V_t^{(l)} = \frac{\Delta W_t^{(l)\top} C^{(l)}}{C^{(l)\top} C^{(l)}}$, and $V_t^{(l)} = \frac{W_t^{(l)\top} C^{(l)}}{C^{(l)\top} C^{(l)}}$. $\Delta F_t^{(l-1)}$ denotes the change of features $\Delta F_t^{(l-1)} = F_{t+1}^{(l-1)} - F_t^{(l-1)}$ made by the training sample $x$ after the $t$-th iteration. To this end, we approximately consider the change of features $\Delta F_t^{(l-1)}$ after the $t$-th iteration negatively parallel to feature gradients $\dot{F}_t^{(l-1)}$, although strictly speaking, the change of features is not exactly equal to the feature gradients.

proof. Given a sample $x$, we can prove that $\cos(\Delta V_t^{(l)}, F_t^{(l-1)}) \cdot \cos(V_t^{(l)}, \Delta F_t^{(l-1)}) \ge 0$. According to the chain rule, we have

$$\Delta W_t^{(l)} = -\eta D_t^{(l)} \dot{F}_t^{(l)} F_t^{(l-1)\top} \tag{21}$$

According to Lemma 1 and Lemma 2, we have $C^{(l)\top} \Delta\varepsilon_t^{(l)\top} = 0$ and $\varepsilon_t^{(l)} C^{(l)} = 0$. Then, we have

$$\cos(\Delta V_t^{(l)}, F_t^{(l-1)}) \cdot \cos(V_t^{(l)}, \dot{F}_t^{(l-1)}) = \frac{\Delta V_t^{(l)\top} F_t^{(l-1)}}{\|\Delta V_t^{(l)}\| \cdot \|F_t^{(l-1)}\|} \cdot \frac{V_t^{(l)\top} \dot{F}_t^{(l-1)}}{\|V_t^{(l)}\| \cdot \|\dot{F}_t^{(l-1)}\|}$$

For brevity, let $N_t = \|\Delta V_t^{(l)}\|_2 \|F_t^{(l-1)}\|_2 \|V_t^{(l)}\|_2 \|\dot{F}_t^{(l-1)}\|_2 > 0$ denote the product of norms in the denominator. Therefore, we have

$$\begin{aligned}
&\text{sign}\big(\cos(\Delta V_t^{(l)}, F_t^{(l-1)}) \cdot \cos(V_t^{(l)}, \dot{F}_t^{(l-1)})\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)}] \cdot [V_t^{(l)\top} \dot{F}_t^{(l-1)}] / N_t\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)}] \cdot [V_t^{(l)\top} W_t^{(l)\top} D_t^{(l)} \dot{F}_t^{(l)}] / N_t\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)}] \cdot [V_t^{(l)\top} (V_t^{(l)} C^{(l)\top} + \varepsilon_t^{(l)}) D_t^{(l)} \dot{F}_t^{(l)}] / N_t\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)}] \cdot [V_t^{(l)\top} (V_t^{(l)} C^{(l)\top} + \varepsilon_t^{(l)}) (\Delta W_t^{(l)} F_t^{(l-1)} / (-\eta \|F_t^{(l-1)}\|_2^2))] / N_t\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)}] \cdot [(V_t^{(l)\top} V_t^{(l)} C^{(l)\top} + V_t^{(l)\top} \varepsilon_t^{(l)}) \Delta W_t^{(l)} F_t^{(l-1)}] / (-\eta \|F_t^{(l-1)}\|_2^2 N_t)\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)}] \cdot [(V_t^{(l)\top} V_t^{(l)} C^{(l)\top} + V_t^{(l)\top} \varepsilon_t^{(l)}) (C^{(l)} \Delta V_t^{(l)\top} + \Delta\varepsilon_t^{(l)\top}) F_t^{(l-1)}] / (-\eta \|F_t^{(l-1)}\|_2^2 N_t)\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)}] \cdot [(V_t^{(l)\top} V_t^{(l)} C^{(l)\top} C^{(l)} \Delta V_t^{(l)\top} + V_t^{(l)\top} \varepsilon_t^{(l)} \Delta\varepsilon_t^{(l)\top} \\
&\qquad + V_t^{(l)\top} V_t^{(l)} C^{(l)\top} \Delta\varepsilon_t^{(l)\top} + V_t^{(l)\top} \varepsilon_t^{(l)} C^{(l)} \Delta V_t^{(l)\top}) F_t^{(l-1)}] / (-\eta \|F_t^{(l-1)}\|_2^2 N_t)\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)}] \cdot [(V_t^{(l)\top} V_t^{(l)} C^{(l)\top} C^{(l)} \Delta V_t^{(l)\top} + V_t^{(l)\top} \varepsilon_t^{(l)} \Delta\varepsilon_t^{(l)\top}) F_t^{(l-1)}] / (-\eta \|F_t^{(l-1)}\|_2^2 N_t)\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)}] \cdot [V_t^{(l)\top} V_t^{(l)} C^{(l)\top} C^{(l)} \Delta V_t^{(l)\top} F_t^{(l-1)} + V_t^{(l)\top} \varepsilon_t^{(l)} \Delta\varepsilon_t^{(l)\top} F_t^{(l-1)}] / (-\eta \|F_t^{(l-1)}\|_2^2 N_t)\big) \\
&= \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)} V_t^{(l)\top} V_t^{(l)} C^{(l)\top} C^{(l)} \Delta V_t^{(l)\top} F_t^{(l-1)} + \Delta V_t^{(l)\top} F_t^{(l-1)} V_t^{(l)\top} \varepsilon_t^{(l)} \Delta\varepsilon_t^{(l)\top} F_t^{(l-1)}] / (-\eta \|F_t^{(l-1)}\|_2^2 N_t)\big)
\end{aligned}$$

According to our assumption, the noise term $\varepsilon_t^{(l)}$ is small enough to satisfy $|\Delta V_t^{(l)\top} F_t^{(l-1)} V_t^{(l)\top} V_t^{(l)} C^{(l)\top} C^{(l)} \Delta V_t^{(l)\top} F_t^{(l-1)}| \gg |\Delta V_t^{(l)\top} F_t^{(l-1)} V_t^{(l)\top} \varepsilon_t^{(l)} \Delta\varepsilon_t^{(l)\top} F_t^{(l-1)}|$. This assumption is verified in Figure 38. Then we can ignore the last term and obtain

$$\begin{aligned}
&\text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)} V_t^{(l)\top} V_t^{(l)} C^{(l)\top} C^{(l)} \Delta V_t^{(l)\top} F_t^{(l-1)} + \Delta V_t^{(l)\top} F_t^{(l-1)} V_t^{(l)\top} \varepsilon_t^{(l)} \Delta\varepsilon_t^{(l)\top} F_t^{(l-1)}] / (-\eta \|F_t^{(l-1)}\|_2^2 N_t)\big) \\
&\approx \text{sign}\big([\Delta V_t^{(l)\top} F_t^{(l-1)} V_t^{(l)\top} V_t^{(l)} C^{(l)\top} C^{(l)} \Delta V_t^{(l)\top} F_t^{(l-1)}] / (-\eta \|F_t^{(l-1)}\|_2^2 N_t)\big) \le 0
\end{aligned}$$

The last inequality holds because the numerator equals $(\Delta V_t^{(l)\top} F_t^{(l-1)})^2 \|V_t^{(l)}\|_2^2 \|C^{(l)}\|_2^2 \ge 0$, while the denominator is negative. Since $\Delta F_t^{(l-1)}$ is approximately negatively parallel to $\dot{F}_t^{(l-1)}$, it follows that $\cos(\Delta V_t^{(l)}, F_t^{(l-1)}) \cdot \cos(V_t^{(l)}, \Delta F_t^{(l-1)}) \ge 0$.

The output feature of the $l$-th linear layer w.r.t. the input sample $x$ can be described as $[f_1, f_2, \ldots, f_h] = W_t^{(l)} F_t^{(l-1)} \in \mathbb{R}^h$, where $f_i$ denotes the $i$-th dimension of the feature. In this way, the batch normalization operation can be formulated as $\text{BN}(f_i) = \gamma_{\text{scale}}[(f_i - \mu_i)/\sigma_i] + \beta_{\text{shift}}$, where $\gamma_{\text{scale}}$ and $\beta_{\text{shift}}$ denote the scaling and the shifting parameters, respectively. In this way, the batch normalization operation subtracts the mean feature $\bar{F}_t^{(l)} = \mathbb{E}_{x\in X}[F_t^{(l)}|_x]$ from features of all samples. Therefore, features of different samples in the same category are no longer similar to each other.

We also propose a simplified normalization operation (i.e., centering operations for normalization) to alleviate the TFC phenomenon in the first phase. The centering operation for normalization is given as $\text{norm}_1(f_i) = (f_i - \mu_i)/\sigma_i$, where $\mu_i$ and $\sigma_i$ denote the mean value and the standard deviation of $f_i$ over different samples, respectively. This operation is similar to batch normalization (Ioffe & Szegedy, 2015), but we do not compute the scaling and shifting parameters in batch normalization.
To verify that the centering operations for normalization can alleviate the TFC phenomenon during the training of the MLP, we trained 7-layer MLPs and 9-layer MLPs with and without the centering operations. Specifically, for the centering normalization operation norm 1 , we added the centering operation after each linear layer, except the last linear layer. Each linear layer in the MLP had 512 neurons. Figure 39 shows that the feature similarity in MLPs with centering operations kept decreasing, while the feature similarity of the MLP without centering operations kept increasing. This indicates that centering operations for normalization alleviate the TFC phenomenon.
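The effect of the centering operation on feature similarity can be illustrated with a small NumPy sketch. It is not the authors' implementation: the synthetic "collapsed" features (one shared direction plus small per-sample noise), the helper names `centering_norm` and `mean_pairwise_cosine`, and the stabilizing constant `eps` are all assumptions made for illustration.

```python
import numpy as np

def centering_norm(features, eps=1e-5):
    """norm_1(f_i) = (f_i - mu_i) / sigma_i, with statistics taken over samples.

    `features` has shape (num_samples, h). `eps` is an assumed stabilizer
    that is not part of the formula in the text.
    """
    mu = features.mean(axis=0, keepdims=True)
    sigma = features.std(axis=0, keepdims=True)
    return (features - mu) / (sigma + eps)

def mean_pairwise_cosine(features):
    """Average cosine similarity over all pairs of distinct samples."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = sim.shape[0]
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))

# Synthetic stand-in for collapsed features: every sample is dominated by
# the same direction, so pairwise cosine similarity is close to 1.
rng = np.random.default_rng(0)
shared_direction = rng.normal(size=(1, 512))
collapsed = 5.0 * shared_direction + 0.1 * rng.normal(size=(64, 512))

sim_before = mean_pairwise_cosine(collapsed)
sim_after = mean_pairwise_cosine(centering_norm(collapsed))
print(f"before centering: {sim_before:.3f}, after centering: {sim_after:.3f}")
```

Subtracting the per-dimension mean removes the component shared by all samples, which is why the similarity drops sharply in this toy setting.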

M.2 MOMENTUM

We can explain why momentum in gradient descent alleviates this phenomenon. Based on Lemma 3, the "self-enhanced system" of the TFC phenomenon requires the singular values of weights along other directions $\varepsilon_t^{(l)}$ to be small enough. However, because the momentum operation strengthens the influence of the initialized noisy weights $W_{t=0}^{(l)}$, it strengthens the singular values of $\varepsilon_t^{(l)}$ to some extent, thereby alleviating the TFC phenomenon. Specifically, considering momentum with the coefficient m, the dynamics of the weights $W_{t+1}$ can be described as

$$W_{t+1} = W_t - \eta \frac{\partial Loss}{\partial W_t} - m \frac{\partial Loss}{\partial W_{t-1}},$$

where η denotes the learning rate. Because we only focus on weights in a single layer, we omit the superscript (l) in this subsection to simplify the notation without causing ambiguity. In this way, we can write the gradient descent as

$$W_{T+1} = W_0 + \eta \sum_{t=0}^{T} \frac{1 - m^{T+1-t}}{1 - m} \frac{\partial Loss}{\partial W_t}.$$

Since 0 < m < 1, the coefficient $\frac{1 - m^{T+1-t}}{1 - m}$ decreases when the variable t increases. Thus, a large m means that the influence of $W_0$ on $W_{T+1}$ is significant. Because $\varepsilon_{T+1}$ is decomposed from $W_{T+1}$, the singular values of $\varepsilon_{T+1}$ are mainly determined by the noisy $W_0$. Accordingly, the singular values of $\varepsilon_{T+1}$ are relatively large, which disturbs the "self-enhanced system" and alleviates the TFC phenomenon. To verify the above analysis, we trained MLPs with m = 0, 0.5, 0.9, respectively. Figure 40 (a) verifies that a larger value of m usually alleviates the TFC phenomenon more.
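The coefficient $(1 - m^{T+1-t})/(1 - m)$ can be checked numerically. The sketch below assumes the common heavy-ball form of momentum (velocity $v_t = m v_{t-1} + g_t$, step $W_{t+1} = W_t - \eta v_t$), which may differ in sign convention from the expression in the text; the gradients are random placeholders and the function names are hypothetical.

```python
import numpy as np

def run_momentum(w0, grads, lr, m):
    """Iterate heavy-ball momentum: v_t = m * v_{t-1} + g_t, W_{t+1} = W_t - lr * v_t."""
    w = w0.copy()
    v = np.zeros_like(w0)
    for g in grads:
        v = m * v + g
        w = w - lr * v
    return w

def momentum_coefficients(T, m):
    """Coefficient (1 - m^(T+1-t)) / (1 - m) applied to the gradient at step t = 0..T."""
    return np.array([(1.0 - m ** (T + 1 - t)) / (1.0 - m) for t in range(T + 1)])

def closed_form(w0, grads, lr, m):
    """Unrolled update: W_{T+1} = W_0 - lr * sum_t coeff_t * g_t (assumed sign convention)."""
    coeffs = momentum_coefficients(len(grads) - 1, m)
    return w0 - lr * sum(c * g for c, g in zip(coeffs, grads))

rng = np.random.default_rng(0)
w0 = rng.normal(size=(4, 3))
grads = [rng.normal(size=(4, 3)) for _ in range(20)]  # placeholder gradients

iterated = run_momentum(w0, grads, lr=0.01, m=0.9)
unrolled = closed_form(w0, grads, lr=0.01, m=0.9)
coeffs = momentum_coefficients(T=19, m=0.9)
print(np.allclose(iterated, unrolled), coeffs[0], coeffs[-1])
```

The coefficients are largest for the earliest gradients and shrink toward 1 for the most recent one, which matches the statement that early iterations carry more weight when m is large.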

M.3 INITIALIZATION

We explain that the initialization of MLPs also affects the TFC phenomenon. According to Lemma 3, the "self-enhanced system" requires the singular values of weights along other directions $\varepsilon_t^{(l)}$ to be small enough. However, based on Lemma 2, increasing the variance of the initialized weights $W_0^{(l)}$ increases the singular values of $\varepsilon_t^{(l)}$, thereby alleviating the TFC phenomenon. Specifically, we initialize weights with the Xavier normal distribution (Glorot & Bengio, 2010), i.e., $W_0 \sim \mathcal{N}(0, \gamma \sigma_{var}^2 I)$, where $\sigma_{var} = \sqrt{\frac{2}{fan_{out} + fan_{in}}}$. Here, $fan_{in}$ and $fan_{out}$ denote the input dimension and the output dimension of the linear layer, respectively. In this way, a large γ yields large singular values of the initial weights $W_0$. Based on Lemma 2, we also have

$$\varepsilon_0^{(l)} = W_0^{(l)\top} - W_0^{(l)\top} \frac{C^{(l)} C^{(l)\top}}{C^{(l)\top} C^{(l)}}.$$

Large singular values of the initial weights $W_0$ lead to large singular values of $\varepsilon_0^{(l)}$. Therefore, a large variance of the initialized weights disturbs the "self-enhanced system" and alleviates the TFC phenomenon.
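The link between γ and the singular values of $\varepsilon_0$ can be seen in a short NumPy sketch. It assumes that W has shape (fan_out, fan_in), that C is a vector of length fan_out, and that a random vector is an acceptable stand-in for the common direction; the function names are hypothetical.

```python
import numpy as np

def scaled_xavier_normal(fan_out, fan_in, gamma, rng):
    """Sample W_0 ~ N(0, gamma * sigma_var^2) with sigma_var = sqrt(2 / (fan_out + fan_in))."""
    sigma_var = np.sqrt(2.0 / (fan_out + fan_in))
    return np.sqrt(gamma) * sigma_var * rng.standard_normal((fan_out, fan_in))

def residual(W, C):
    """epsilon = W^T - W^T C C^T / (C^T C): the part of W^T orthogonal to C."""
    Wt = W.T
    return Wt - np.outer(Wt @ C, C) / (C @ C)

C = np.random.default_rng(1).standard_normal(256)  # placeholder common direction

# Same random seed for both draws, so only the variance scale gamma differs.
W_small = scaled_xavier_normal(256, 128, gamma=1.0, rng=np.random.default_rng(0))
W_large = scaled_xavier_normal(256, 128, gamma=4.0, rng=np.random.default_rng(0))

sv_small = np.linalg.svd(residual(W_small, C), compute_uv=False)
sv_large = np.linalg.svd(residual(W_large, C), compute_uv=False)
print(sv_small[:3], sv_large[:3])
```

Because the residual is linear in $W_0$, multiplying the variance by γ multiplies every singular value of $\varepsilon_0$ by $\sqrt{\gamma}$.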

M.4 L 2 REGULARIZATION (RIDGE LOSS)

L 2 regularization is equivalent to weight decay in the case of gradient descent. The total loss is given as $L(W_t) = L^{CE}(W_t) + \lambda \|W_t\|_2^2$, where $L^{CE}(W_t)$ represents the cross entropy loss, and $\lambda \|W_t\|_2^2$ denotes the ridge loss. In this way, we have the following iterates by using gradient descent:

$$W_{t+1} = W_t - \eta \nabla L_t(W_t) = W_t - \eta \nabla L_t^{CE}(W_t) - 2\eta\lambda W_t = (1 - 2\eta\lambda) W_t - \eta \nabla L_t^{CE}(W_t).$$

According to Lemma 3, the "self-enhanced system" requires the singular values of weights along other directions $\varepsilon_t^{(l)}$ to be small enough. Based on Lemma 2, we also have

$$\varepsilon_t^{(l)} = W_t^{(l)\top} - W_t^{(l)\top} \frac{C^{(l)} C^{(l)\top}}{C^{(l)\top} C^{(l)}}.$$

In this way, a larger λ yields smaller singular values of $\varepsilon_t^{(l)}$, which favors the "self-enhanced system" and strengthens the TFC phenomenon.
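The shrinking factor $(1 - 2\eta\lambda)$ in the update can be made concrete with a minimal sketch. It isolates the decay by setting the cross-entropy gradient to zero, which is an assumption made purely for illustration; `l2_step` is a hypothetical helper name.

```python
import numpy as np

def l2_step(W, grad_ce, lr, lam):
    """One gradient step on L_CE + lam * ||W||_2^2, written as a weight-decay update."""
    return (1.0 - 2.0 * lr * lam) * W - lr * grad_ce

rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 6))
lr, lam, steps = 0.1, 0.5, 10

# With a zero cross-entropy gradient, only the decay term acts on the weights.
W = W0.copy()
for _ in range(steps):
    W = l2_step(W, grad_ce=np.zeros_like(W), lr=lr, lam=lam)

sv_before = np.linalg.svd(W0, compute_uv=False)
sv_after = np.linalg.svd(W, compute_uv=False)
print(sv_after / sv_before)  # each ratio equals (1 - 2 * lr * lam) ** steps
```

In this isolated setting every singular value shrinks by the same factor per step, so a larger λ shrinks the weights, and hence the components outside the common direction, more quickly.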

N EXPLANATIONS FOR MORE DNNS.

The theoretical analysis of this study can explain which kinds of DNNs are more likely to exhibit the TFC phenomenon in early epochs. In fact, we discovered the two-phase phenomenon and the TFC phenomenon in various DNNs, including MLPs and modern CNNs, e.g., VGG-11 models and VGG-13 models. Specifically, we trained VGG-11 models and VGG-13 models on the CIFAR-10 dataset and the Tiny ImageNet dataset. We adopted the learning rate η = 0.01, the batch size bs = 100, and the SGD optimizer. The training loss, the testing loss, the training accuracy, and the testing accuracy are shown in Figure 41 and Figure 42.

Furthermore, we found that our theoretical analysis can be generalized to modern CNNs and transformers. We conducted experiments on ResNet-18, ResNet-34 (He et al., 2015), and Vision Transformers (ViTs) (Dosovitskiy et al., 2020). ResNets and ViTs are two classical network architectures that have been examined for years. In our experiments, ResNet-18, ResNet-34, and ViT did not exhibit the TFC phenomenon (or the TFC phenomenon only existed in very few iterations within the first epoch), owing to the use of normalization operations in these DNNs. However, in line with our theoretical analysis, when the batch normalization operations in ResNet-18/34 and the layer normalization (LN) operations in ViTs were removed, the TFC phenomenon was significantly strengthened. First, we trained ViTs, ResNet-18, and ResNet-34 models on the CIFAR-10 dataset. The classification heads in both ViTs and ResNet-18/34 were implemented by a 4-layer MLP. Specifically, we trained two different ViTs with the patch size P = 4, the heads = 18, the dropout rate = 0.1, the embedding dropout rate = 0.1, the learning rate η = 0.1, the batch size bs = 100, and the SGD optimizer. For ResNet-18 and ResNet-34 models, we adopted the learning rate η = 0.01, the batch size bs = 100, and the SGD optimizer.
Besides, we used two data augmentation methods, including random cropping and random horizontal flipping.

O MORE EXPERIMENTAL RESULTS OF ASSUMPTION 1.

Assumption 1. We assume that the MLP encodes features of very few (a single or two) categories in the first phase, instead of simultaneously learning all or most categories in this phase.

In this section, we aim to verify that Assumption 1 is a common fact in various DNNs, including MLPs, VGGs, and ResNets. To this end, we conducted new experiments to show that DNNs encoded features of very few (a single or two) categories in early epochs. Specifically, we trained a 9-layer MLP on the CIFAR-10 dataset, the MNIST dataset, and the Tiny ImageNet dataset, respectively. Each layer of the MLP had 512 neurons. Besides, we trained a VGG-11 model, a VGG-13 model, and a ResNet-18 on the CIFAR-10 dataset. We evaluated the training accuracy at the end of the first phase. For the Tiny ImageNet dataset, we randomly selected the following 50 categories for training: orangutan, parking meter, snorkel, American alligator, oboe, basketball, rocking chair, hopper, neck brace, candy store, broom, seashore, sewing machine, sunglasses, panda, pretzel, pig, volleyball, puma, alp, barbershop, ox, flagpole, lifeboat, teapot, walking stick, brain coral, slug, abacus, comic book, CD player, school bus, banister, bathtub, German shepherd, black stork, computer keyboard, tarantula, sock, Arabian camel, bee, cockroach, cannon, tractor, cardigan, suspension bridge, beer bottle, viaduct, guacamole, and iPod. Figure 47, Figure 48, Figure 49, and Figure 50 show that various DNNs encoded features of very few (a single or two) categories in early epochs.
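The per-category training accuracy used to check Assumption 1 can be computed as in the following sketch. The logits are synthetic placeholders that mimic a network which has only learned one category so far; `per_category_accuracy` is a hypothetical helper, not code from the submission.

```python
import numpy as np

def per_category_accuracy(logits, labels, num_categories):
    """Training accuracy computed separately for the samples of each category."""
    preds = logits.argmax(axis=1)
    acc = np.full(num_categories, np.nan)  # NaN marks categories with no samples
    for c in range(num_categories):
        mask = labels == c
        if mask.any():
            acc[c] = (preds[mask] == c).mean()
    return acc

# Placeholder logits: the output for category 3 dominates for every sample,
# imitating a model that has encoded a single category in the first phase.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 20)
logits = 0.01 * rng.normal(size=(labels.size, 10))
logits[:, 3] += 1.0

acc = per_category_accuracy(logits, labels, num_categories=10)
print(np.round(acc, 2))
```

A profile in which one or two entries are high and the rest are near zero is the pattern that Assumption 1 describes.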



Figure 1: (a) The first phase (learning iterations before the dotted line) gets an increasing length and finally becomes the learning-sticking problem (purple curve), when the DNN has more layers. (b) Samples of different categories share almost the same features at the end of the first phase. We can consider this as a TFC phenomenon. We visualize the learning dynamics of an intermediate-layer feature in a 9-layer MLP. We select 10 salient dimensions to illustrate the feature similarity.

Figure 2: The TFC phenomenon. (a) Cosine similarity of features between samples in different categories E x,x ′ ∈X [cos(F (l) t |x, F (l) t | x ′ )] keeps increasing in the first phase (left to the dotted line), until the second phase. The low cosine similarity indicates the high diversity. (b) Cosine similarity of feature gradients between different samples of a category E x,x ′ ∈Xc [cos( Ḟ (l) t |x, Ḟ (l) t | x ′ )] keeps increasing in the first phase until the second phase, where Xc denotes samples of the category c. (c) Cosine similarity of weight changes between weight vectors in a layer Ex∈X cos(∆w (l) t,i |x, ∆w (l) t,j |x)

Figure 3: (a) Weights of neurons are changed towards a common direction. (b) The logic of explaining the TFC phenomenon.

Figure 4: The strength of different common directions in the CIFAR-10 dataset. We trained 9-layer MLPs, where each layer of the MLP had 512 neurons. We illustrated results on the two categories with the highest training accuracies. si = ∥Ci∆V ⊤ i ∥F measures the strength of weight changes along the i-th common direction, where ∆V i = Et[∆V i,t]. The strength of the primary direction was much greater than the strength of other directions. Please see Appendix D for more results on the MNIST dataset and the Tiny ImageNet dataset.

Figure 5: The average cosine similarity between the feature F (l-1) t

Figure 6: The change of o (l) in the first phase. We trained 9-layer MLPs on the (a) CIFAR-10 and the (b) Tiny ImageNet. Each layer of the MLP had 512 neurons. The Appendix D provides results on the MNIST. The shade represents the standard deviation over different samples.

Figure 7: The training accuracy of MLPs on different categories at the end of the first phase. The MLP only learned features of a single or two categories in the first phase.

Figure 8: Effects of (a) normalization and (b) initialization. We trained L-layer MLPs, where each layer had 512 neurons. A shorter first phase indicates that the TFC phenomenon is more alleviated. Effects of momentum and L 2 regularization are shown in Appendix M.2.

Figure 1: Cosine similarity between features of different samples on the CIFAR-10 dataset. We trained a 9-layer MLP, where each layer had 512 neurons. The cosine similarity between features of different samples kept increasing until samples of different categories shared almost the same feature in the first phase. The features were used in the fourth linear layer of the MLP. The TFC phenomenon occurs at the 1000-th iteration. The abscissa and ordinate of each heatmap represent the sample index. For each grid, color indicates the cosine similarity of that sample pair.

Figure 2: Cosine similarity between features of different samples on the MNIST dataset. We trained a 9-layer MLP, where each layer had 512 neurons. The cosine similarity between features of different samples kept increasing until samples of different categories shared almost the same feature in the first phase. The features were used in the fourth linear layer of the MLP. The TFC phenomenon occurs at the 700-th iteration. The abscissa and ordinate of each heatmap represent the sample index. For each grid, color indicates the cosine similarity of that sample pair.

Figure 4: (a) The training loss of four MLPs trained on the CIFAR-10 dataset. (b) The testing loss of four MLPs. (c) Training accuracies of four MLPs. (d) Testing accuracies of four MLPs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the third linear layer of MLPs.

Figure 5: (a) The training loss of four MLPs trained on the MNIST dataset. (b) The testing loss of four MLPs. (c) Training accuracies of four MLPs. (d) Testing accuracies of four MLPs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the third linear layer of MLPs.

Figure 6: (a) The training loss of three MLPs trained on the Tiny ImageNet dataset. (b) The testing loss of three MLPs. (c) Training accuracies of three MLPs. (d) Testing accuracies of three MLPs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The features and the feature gradient were used in the second linear layer of MLPs.

Figure 9: (a) The training loss of two LSTMs trained on the CoLA dataset. (b) The testing loss of two LSTMs. (c) Training accuracies of two LSTMs. (d) Testing accuracies of two LSTMs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the third linear layer of MLPs.

B.7 ON THE SST-2 DATASET

Figure 10: (a) The training loss of three LSTMs trained on the SST-2 dataset. (b) The testing loss of three LSTMs. (c) Training accuracies of three LSTMs. (d) Testing accuracies of three LSTMs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the second linear layer of MLPs.

Figure 11: (a) The training loss of three LSTMs trained on the AGNEWS dataset. (b) The testing loss of three LSTMs. (c) Training accuracies of three LSTMs. (d) Testing accuracies of three LSTMs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the second linear layer of MLPs.

two data augmentation methods, including random cropping and random horizontal flipping. We trained three 7-layer MLPs with 256 neurons in each layer, with bs = 100, 500, 1000, respectively. The training loss, the testing loss, the training accuracy, the testing accuracy, the cosine similarity of features, and the cosine similarity of feature gradients of MLPs trained with different batch sizes are shown in Figure 12.

Figure 13: (a) The training loss of two MLPs trained with different learning rates. (b) The testing loss of two MLPs. (c) The training accuracies of two MLPs. (d) The testing accuracies of two MLPs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the second linear layer of MLPs.

Figure 14: (a) The training loss of three MLPs with different activation functions. (b) The testing loss of three MLPs. (c) Training accuracies of three MLPs. (d) Testing accuracies of three MLPs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the second linear layer of MLPs.

Figure 27: (a) The training loss of three 7-layer MLPs with different focusing parameters γ trained on the CIFAR-10 dataset. (b) The testing loss of three MLPs. (c) Training accuracies of three MLPs. (d) Testing accuracies of three MLPs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the third linear layer of MLPs.

Figure 29: (a) The training loss of three MLPs with different train/test dataset split ratios trained on the CIFAR-10 dataset. (b) The testing loss of three MLPs. (c) Training accuracies of three MLPs. (d) Testing accuracies of three MLPs. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the third linear layer of MLPs.

Figure 30: (a) The training loss of two MLPs trained on the CIFAR-10 dataset. When the loss minimization gets stuck (orange curve), we can consider it as the first phase with an infinite length. Therefore, the "learning-sticking" problem can be solved by techniques that shorten the first phase, such as increasing the variance of initial weights, which is a theoretically justified solution in our study (blue curve). (b) The training accuracy of two MLPs. (c) The testing loss of two MLPs. (d) The testing accuracy of two MLPs.

Figure 36: The strength of top-ranked common directions on the (a) MNIST dataset and the (b) Tiny ImageNet dataset. We trained a 9-layer MLP, where each layer of the MLP had 512 neurons. We computed the strength of common directions on the two categories with the highest training accuracies. si = ∥Ci∆V

Figure 37: The change of o (l) = cos(∆V(l)

(l+1) = 0. After the t-th iteration, the weight change made by a training sample x can be computed as follows.

THE EXPLANATION FOR THE PHENOMENON THAT S

primary = E t∈[Tstart,Tend] E x∈X ∥∆W (l,k) primary,t | x ∥ F . The strength of weight changes along the k-th noise direction is computed as S (l) k = E t∈[Tstart,Tend] E x∈X ∥∆W (l,k) noise,t | x ∥ F . In this way, S

F are directly decomposed from ε (l+1) t based on the SVD and decrease monotonically. I ANALYSIS BASED ON EQ. (3) IN THE MAIN PAPER AND EXPLANATION FOR THE PARALLELISM.

k t , k t+1 ∈ R are two scalars. Then, we can derive that

Figure 39: Cosine similarity of features between samples in different categories. We trained 7-layer MLPs and 9-layer MLPs on the CIFAR-10, the MNIST, and the Tiny ImageNet dataset.

M DISCUSSION FOR FOUR TYPICAL OPERATIONS

M.1 CENTERING OPERATIONS FOR NORMALIZATION

Figure 40: Effects of (a) momentum and (b) L 2 regularization. We trained L-layer MLPs, where each layer had 512 neurons. A shorter first phase indicates that the TFC phenomenon is more alleviated.

Figure 40(b) and Figure 9(d) in the main paper verify that a larger coefficient λ strengthened the TFC phenomenon more.

and Figure 42.

Figure 41: (a) The training loss of a VGG-11 model and a VGG-13 model trained on the CIFAR-10 dataset. (b) The testing loss of two models. (c) Training accuracies of two models. (d) Testing accuracies of two models. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the third linear layer of the MLP in models.

Figure 43: (a) The training loss of a ResNet-18 and a ResNet-18 (without BN) trained on the CIFAR-10 dataset. (b) The testing loss of two models. (c) Training accuracies of two models. (d) Testing accuracies of two models. (e) Cosine similarity between features of different categories. (f) Cosine similarity between gradients of different samples in a category. The feature and the feature gradient were used in the third linear layer of the MLP in models.

Figure 47: The training accuracies of MLPs on the CIFAR-10 dataset, the MNIST dataset, and the Tiny ImageNet dataset. The accuracies were evaluated at the end of the first phase. MLPs encode features of very few (a single or two) categories in the first phase, instead of simultaneously learning all or most categories in this phase. (a) The training accuracy of a 9-layer MLP trained on the CIFAR-10 dataset. (b) The training accuracy of a 9-layer MLP trained on the MNIST dataset. (c) The training accuracy of a 9-layer MLP trained on the Tiny ImageNet dataset.

C, i.e., ∆εtC = 0. Lemma 1. (Proof in Appendix E) For the decomposition ∆W ⊤ t =∆VtC ⊤ +∆εt, given weight changes over different samples ∆W ⊤ t , we can compute the common direction C by minimizing the fitting error ∆ϵt when we use ∆vt,iC ⊤ to approximate ∆w ⊤ t,i over different samples across different iterations. I.e., min C,∆V t |x E t∈[Tstart,T end ] Ex∈X ∥∆εt|x∥ 2 F , s.t. ∆εt|x = ∆W ⊤ t |x -∆Vt|xC ⊤ . Thus, we obtain

1

∆vt,iC ⊤ to approximate ∆w ⊤ t,i over different samples across different iterations. I.e., min C,∆V t |x E t∈[Tstart,T end ] Ex∈X ∥∆εt|x∥ 2 F , s.t. ∆εt|x = ∆W ⊤ t |x -∆Vt|xC ⊤ . Thus, we obtain C ⊤ C , s.t. ∆εtC = 0. Such settings minimize ∥∆εt∥F . proof. Let ∆ε ⊤ t [j] denote the j-th column of the matrix ∆ε ⊤ t ∈ R h×d . Given a sample x, we can represent ∆ε ⊤ t [j] by the vector C and a residual term ∆ε ⊤ t [j]

Figure 38: Visualization of the Frobenius norm of the two components (horizontal axis: iteration; vertical axis: Frobenius norm). We trained a 9-layer MLP on the MNIST dataset, where each layer had 512 neurons. Iterations were chosen at the end of the first phase.

Thus, sign(cos(∆V

In this paper, we approximately consider ∆F (l-1) t and Ḟ (l-1) t to be negatively parallel to each other. Thus, we have sign(cos(∆V

L PROOF FOR THEOREM 2

In this section, we aim to prove that training samples of the same category have the same effect in the first phase.

Theorem 2. Under the aforementioned background assumption, for any training samples x, l) t |x and F (l) t | x ′ have kinds of similarity in very early iterations), then cos(αc∆V (l) t |x, F (l-1) t |x) ≥ 0, and cos(αcV (l) t , ∆F where αc ∈ {-1, +1} is a constant shared by all samples in category c.

proof. Given a sample x and a sample x ′ from the same category, we can prove that cos(∆V

According to the assumption that

In this way, for the category c, there exists a constant α c which satisfies sign(cos(α c ∆V and training sample x ∈ X c belongs to the category c.

According to Lemma 3, we have cos(∆V (l)

In addition, the above proof indicates that sign(cos(αc∆V (l) t |x, F (l-1) t |x) ≥ 0. Therefore, we have sign(cos(αcV (l) t |x, ∆F

P PROPOSE AN IMPROVED TRAINING METHOD

In this section, we use our theory to develop a new normalization method. The new normalization operation was designed considering the following two findings.

• Our theoretical analysis told us that the centering operation in BN could alleviate the TFC phenomenon.

• Previous studies found some shortcomings of the BN operation, i.e., the BN operation usually caused unstable features. Thus, the BN operation was found incompatible with the dropout (Li et al., 2019), hurt the classification accuracy in adversarial training (Galloway et al., 2019), and decreased the quality of images generated by generative models (Salimans et al., 2016).

Therefore, according to our analysis, we only need to update the dynamic normalization parameters (i.e., µi and σi in the following equation) in the first phase to avoid the learning-sticking problem, instead of applying the dynamic normalization parameters in the entire training process. In this way, we can simultaneously solve the learning-sticking problem and avoid unstable features.

Specifically, we are given the output feature F = [f1, f2, . . . , f h ] ∈ R h of the l-th linear layer w.r.t. the input sample x, where fi denotes the i-th dimension of the feature. The new normalization operation is given as

where µi and σi denote the mean value and the standard deviation of fi over different samples, respectively. We only update the mean value µi and the standard deviation σi in the first phase, as follows.

where we keep updating at = 0.99at-] through all the t previous batches to represent the current cosine similarity between features of different samples. If at is greater than a threshold τ = 0.3, then we consider the learning process to be in the first phase and normalize the feature. Otherwise, if at ≤ τ , then we consider that it has already jumped to the second phase, stop updating µi and σ 2 i , and use constants µi and σ 2 i to generate stable features.

We set m = 0.1 and compute µi,t and σ 2 i,t in the t-th batch as follows.

To this end, we conducted experiments on two types of MLPs (i.e., 9-layer MLPs and 11-layer MLPs) to compare the proposed method with BN. For each type of MLP, we trained three versions of MLPs on the CIFAR-10 dataset. The vanilla MLP had 512 neurons in each layer. We added the proposed norm operation after the first, the third, the fifth, and the seventh linear layers, and constructed the network MLP-norm. For a fair comparison, we constructed a baseline MLP, namely MLP-BN, by adding the BN operation in the same positions as in MLP-norm. In addition, the scaling and shifting parameters in the BN operation were disabled. Figure 51 shows that both MLP-norm and MLP-BN alleviated the learning-sticking problem. However, MLP-norm was optimized much faster than MLP-BN, because our theoretical analysis told us that it was not necessary to continue updating µi and σ 2 i if the learning process did not have a risk of feature collapse, thereby alleviating the optimization problems found in (Li et al., 2019; Galloway et al., 2019). Note that the oscillation of the blue curve could be explained as the failure of jumping out of the first phase, due to the strong power of the "self-enhancement system."
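The gating idea behind the proposed normalization can be sketched as follows. Since the exact update equations are not recoverable from the text, several details here are assumptions rather than the authors' method: the similarity tracker is an exponential moving average with decay 0.99 of the mean pairwise cosine similarity of the incoming batch, the statistics are running averages with rate m = 0.1, the tracker starts at 1.0, and the layer always normalizes with the stored statistics. `PhaseGatedNorm` is a hypothetical name.

```python
import numpy as np

def mean_pairwise_cosine(feats):
    """Average cosine similarity over all pairs of distinct samples in a batch."""
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T
    n = len(feats)
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))

class PhaseGatedNorm:
    """Normalization whose statistics are updated only while a_t > tau."""

    def __init__(self, dim, tau=0.3, m=0.1, decay=0.99, eps=1e-5):
        self.tau, self.m, self.decay, self.eps = tau, m, decay, eps
        self.mu = np.zeros(dim)
        self.var = np.ones(dim)
        self.a = 1.0  # assumed initial value: training starts in the first phase

    def in_first_phase(self):
        return self.a > self.tau

    def __call__(self, feats):
        # Assumed tracker: moving average of the batch-level cosine similarity.
        self.a = self.decay * self.a + (1.0 - self.decay) * mean_pairwise_cosine(feats)
        if self.in_first_phase():
            # First phase: keep refreshing the statistics (assumed running average).
            self.mu = (1.0 - self.m) * self.mu + self.m * feats.mean(axis=0)
            self.var = (1.0 - self.m) * self.var + self.m * feats.var(axis=0)
        # Second phase: mu and var stay constant, giving stable features.
        return (feats - self.mu) / np.sqrt(self.var + self.eps)

rng = np.random.default_rng(0)
layer = PhaseGatedNorm(dim=32)
shared = rng.normal(size=(1, 32))
for _ in range(5):  # collapsed batches keep the tracker high
    layer(3.0 * shared + 0.1 * rng.normal(size=(64, 32)))
phase_after_collapse = layer.in_first_phase()

for _ in range(200):  # diverse batches pull the tracker below tau
    layer(rng.normal(size=(64, 32)))
frozen_mu = layer.mu.copy()
layer(rng.normal(size=(64, 32)) + 10.0)  # statistics should no longer move
print(phase_after_collapse, layer.in_first_phase(), np.array_equal(layer.mu, frozen_mu))
```

In this sketch the gate is re-evaluated on every batch, so a layer could in principle re-enter the first phase if the tracked similarity rose above τ again; whether the original method allows this is not stated in the text.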

