TEMPORARY FEATURE COLLAPSE PHENOMENON IN EARLY LEARNING OF MLPS

Abstract

In this paper, we focus on a typical two-phase phenomenon in the learning of multi-layer perceptrons (MLPs). We discover and explain a feature collapse phenomenon in the first phase: the diversity of features over different samples keeps decreasing, until samples of different categories share almost the same feature, which hurts the optimization of MLPs. We find that such phenomena usually occur when MLPs are difficult to train. We explain this phenomenon in terms of the learning dynamics of MLPs. Furthermore, we theoretically analyze why four typical operations can alleviate the feature collapse. The code has been attached with the submission.

1. INTRODUCTION

It has been widely observed that in initialized neural networks, especially deep ones, the loss decrease is likely to have two phases during early epochs of learning, e.g., phenomena observed in (Saxe et al., 2013; Simsekli et al., 2019; Stevens et al., 2020). As Figure 1(a) shows, the first phase is usually relatively short, in which the training loss does not decrease or decreases very slowly. Then, in the second phase, the training loss suddenly begins to decrease fast. In particular, as Figure 1(b) shows, the length of the first phase increases with the network complexity. In some extreme cases, when deep neural networks (DNNs) are very deep, the loss minimization gets stuck, which can be considered as a strong first phase with infinite length, namely, a learning-sticking problem. In fact, the learning-sticking problem is quite common in practice. Jepkoech et al. (2021) and Stevens et al. (2020) empirically observed the learning-sticking problem without any theoretical analysis. People usually attributed the learning-sticking problem to the over-parameterized settings of DNNs or to the optimization ability of DNNs. However, we discover and attempt to theoretically explain a new, quite common, yet counter-intuitive phenomenon in the first phase (the learning-sticking phase). That is, as Figure 1(b) shows, features of different categories become increasingly similar to each other. In some cases, the feature diversity keeps decreasing until all samples of different categories share almost the same feature in the first phase. We term this the temporary feature collapse (TFC). TFC happens in various DNNs, including multi-layer perceptrons (MLPs), convolutional neural networks, and recurrent neural networks (see both Figure 2 and Appendix B). DNNs trained with different loss functions and different learning rates may all exhibit TFC phenomena.
The TFC phenomenon usually happens in the early epochs of the training process, especially when the DNN is difficult to optimize. According to our analysis, when the DNN is very deep, when the task is difficult, when the variance of initial weights is small, or when the DNN is trained without momentum or batch normalization layers, the DNN is more likely to exhibit the TFC phenomenon. Based on our theoretical analysis, we discover a set of conditions that strengthen the TFC phenomenon. Then, we can easily control such conditions by applying typical operations, i.e., batch normalization, momentum, L2 regularization, and network initialization. Specifically, we investigate the learning dynamics of the MLP. Moreover, we theoretically explain that these conditions make the training of DNNs more likely to behave like a "self-enhanced system" towards the TFC phenomenon in early iterations. In comparison, Glorot & Bengio (2010) and Saxe et al. (2013) investigated the influence of initialization methods on the learning-sticking problem. To this end, we find that the effectiveness of initialization methods is probably owing to the large variance of initial weights, which avoids the TFC phenomenon during the learning-sticking phase. Fortunately, we discover that when we use four typical operations to alleviate the TFC phenomenon, the learning-sticking problem can also be solved. Although previous studies have provided insightful analyses of these well-known operations, e.g., batch normalization and network initialization, we are the first to establish the relationship between the TFC phenomenon and these typical operations. This provides theoretical guidance for the design of DNNs.
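To give intuition for why one of these operations helps, the following minimal numpy sketch (our own illustration, not the paper's experiment) simulates near-collapsed features as a shared direction plus a small sample-specific perturbation, and shows that a batch-normalization step at initialization (gamma = 1, beta = 0) removes the shared component and restores pairwise feature diversity. All magnitudes and dimensions here are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate near-collapsed features: every sample's feature is a shared
# direction plus a small sample-specific perturbation (hypothetical scales).
h, n = 64, 16                                   # feature dim, batch size
shared = rng.normal(size=h)
F = shared[None, :] + 0.05 * rng.normal(size=(n, h))  # rows = samples

def mean_pairwise_cos(F):
    """Average cosine similarity over all pairs of samples (rows of F)."""
    n = F.shape[0]
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    G = Fn @ Fn.T                               # Gram matrix of cosines
    return float((G.sum() - n) / (n * (n - 1))) # exclude the diagonal

def batch_norm(F, eps=1e-5):
    """Standardize each feature dimension over the batch (BN at init)."""
    mu, var = F.mean(axis=0), F.var(axis=0)
    return (F - mu) / np.sqrt(var + eps)

print(mean_pairwise_cos(F))             # near 1: features have collapsed
print(mean_pairwise_cos(batch_norm(F))) # much smaller: diversity restored
```

Subtracting the batch mean removes exactly the component shared by all samples, which is why a normalization layer counteracts this kind of collapse at initialization; the paper's analysis of why BN prevents the self-enhancing dynamics during training is more involved.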
More crucially, the TFC phenomenon in MLPs is counter-intuitive and has been neglected for a long time. Investigating the learning dynamics behind the TFC phenomenon is useful for explaining complex optimization behaviors of DNNs and is of considerable value. Contributions of this study can be summarized as follows. (1) We discover the common TFC phenomenon in early learning of the MLP, which has been ignored for a long time. (2) We explain this phenomenon from the perspective of learning dynamics. (3) We explain why four types of operations can alleviate the TFC phenomenon.

2. DISCOVERING THE TFC PHENOMENON

It has been widely observed that the loss decrease of DNNs is likely to have two phases (Saxe et al., 2013; Simsekli et al., 2019; Stevens et al., 2020). As Figure 1(b) shows, the training loss does not decrease significantly in the first phase, and then suddenly begins to decrease in the second phase. In this paper, we discover a new and counter-intuitive phenomenon in the first phase: both the diversity of intermediate-layer features over different samples and the diversity of feature gradients keep decreasing, until samples of different categories share almost the same feature. We consider this a TFC phenomenon.

We consider an MLP $f$ with $L$ concatenated linear layers, each followed by a ReLU layer. Only the last linear layer is followed by a softmax operation. Let $W^{(l)}_t \in \mathbb{R}^{h \times d}$ denote the weight matrix of the $l$-th linear layer with $h$ neurons ($1 \leq l \leq L$), where $W^{(l)}_t$ has been learned for $t$ iterations. Given an input sample $x$, the layer-wise forward propagation in the $l$-th layer is represented as $F^{(l)}_t = \mathrm{ReLU}(W^{(l)}_t F^{(l-1)}_t) = D^{(l)}_t W^{(l)}_t F^{(l-1)}_t$, where $F^{(l)}_t \in \mathbb{R}^h$ denotes the output feature of the $l$-th layer after the $t$-th iteration, and $D^{(l)}_t$ denotes a diagonal matrix that represents gating states in the ReLU layer, with $D^{(l)}_{t,(i,i)} \in \{0, 1\}$. The TFC phenomenon is then characterized as follows: given two input samples $x_1$ and $x_2$, the cosine similarity of features $\cos(F^{(l)}_t|_{x_1}, F^{(l)}_t|_{x_2})$ and the cosine similarity of gradients $\cos(\dot{F}^{(l)}_t|_{x_1}, \dot{F}^{(l)}_t|_{x_2})$ keep increasing, where $\dot{F}^{(l)}_t$ denotes the gradient of the loss w.r.t. the feature $F^{(l)}_t$. Besides, the increasing trend of feature similarity only exists in the first phase. The TFC phenomenon is widely shared by different DNNs learned for different tasks. In early epochs (or iterations) of the training process, we observed such TFC phenomena on MLPs, VGG-11
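The notation above can be made concrete with a short numpy sketch (our own illustration, not the paper's code). It implements the gated forward pass $F^{(l)}_t = D^{(l)}_t W^{(l)}_t F^{(l-1)}_t$, where $D^{(l)}_t$ is the 0/1 diagonal gating matrix of the ReLU layer, together with the cosine-similarity diagnostic used to detect TFC. The layer width, depth, and initialization scale are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(Ws, x):
    """Layer-wise propagation F^(l) = ReLU(W^(l) F^(l-1)) = D^(l) W^(l) F^(l-1)."""
    F = x
    for W in Ws:
        z = W @ F
        D = np.diag((z > 0).astype(z.dtype))  # diagonal ReLU gating matrix
        F = D @ z                             # equivalent to np.maximum(z, 0.0)
    return F

def cos(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy setting: a deep MLP with small-variance initial weights, one of the
# conditions the paper links to a stronger TFC phenomenon.
h, L = 32, 6
Ws = [rng.normal(0.0, 0.2, size=(h, h)) for _ in range(L)]
x1, x2 = rng.normal(size=h), rng.normal(size=h)

# The TFC diagnostic: cos(F^(L)|x1, F^(L)|x2); tracking this quantity over
# training iterations reveals the increasing trend in the first phase.
sim = cos(forward(Ws, x1), forward(Ws, x2))
print(sim)
```

In an actual experiment, this similarity would be averaged over many sample pairs and logged at every iteration; the gradient-side diagnostic $\cos(\dot{F}^{(l)}_t|_{x_1}, \dot{F}^{(l)}_t|_{x_2})$ is computed analogously from backpropagated gradients.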



Figure 1: (a) The first phase (learning iterations before the dotted line) becomes longer and finally turns into the learning-sticking problem (purple curve) when the DNN has more layers. (b) Samples of different categories share almost the same features at the end of the first phase. We can consider this as a TFC phenomenon. We visualize the learning dynamics of an intermediate-layer feature in a 9-layer MLP. We select 10 salient dimensions to illustrate the feature similarity.

