TEMPORARY FEATURE COLLAPSE PHENOMENON IN EARLY LEARNING OF MLPS

Abstract

In this paper, we focus on a typical two-phase phenomenon in the learning of multi-layer perceptrons (MLPs). We discover and explain the feature collapse phenomenon in the first phase, i.e., the diversity of features over different samples keeps decreasing during the first phase, until samples of different categories share almost the same feature, which hurts the optimization of MLPs. We find that such a phenomenon usually occurs when MLPs are difficult to train. We explain this phenomenon in terms of the learning dynamics of MLPs. Furthermore, we theoretically analyze why four typical operations can alleviate the feature collapse. The code is attached to the submission.

1. INTRODUCTION

It has been widely observed that in initialized neural networks, especially deep ones, the loss curve is likely to exhibit two phases during the early epochs of learning, e.g., the phenomena observed in (Saxe et al., 2013; Simsekli et al., 2019; Stevens et al., 2020). As Figure 1(a) shows, the first phase is usually relatively short, in which the training loss does not decrease or decreases very slowly. Then, in the second phase, the training loss suddenly begins to decrease quickly. In particular, as Figure 1(b) shows, the length of the first phase increases with the network complexity. In some extreme cases, when deep neural networks (DNNs) are very deep, the loss minimization gets stuck, which can be considered a strong first phase of infinite length, namely, a learning-sticking problem. In fact, the learning-sticking problem is quite common in practice. Jepkoech et al. (2021) and Stevens et al. (2020) empirically observed the learning-sticking problem without any theoretical analysis. People usually attributed the learning-sticking problem to the over-parameterized settings of DNNs or to the optimization ability of DNNs. However, we discover and attempt to theoretically explain a new, quite common, yet counterintuitive phenomenon in the first phase (the learning-sticking phase). That is, as Figure 1(b) shows, features of different categories become increasingly similar to each other. In some cases, the feature diversity keeps decreasing until all samples of different categories share almost the same feature in the first phase. We term this the temporary feature collapse (TFC). TFC happens in various DNNs, including multi-layer perceptrons (MLPs), convolutional neural networks, and recurrent neural networks (see both Figure 2 and Appendix B). DNNs trained with different loss functions and different learning rates may all exhibit the TFC phenomenon.
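The feature diversity discussed above can be quantified in several ways. The sketch below is a minimal numpy illustration using one simple choice, one minus the average pairwise cosine similarity between the feature vectors of different samples (our own illustrative metric, not necessarily the measure used in this paper); a value near zero then signals collapse, where all samples share almost the same feature direction:

```python
import numpy as np

def feature_diversity(features):
    """One minus the average pairwise cosine similarity between the
    feature vectors (rows) of different samples. Near 0 = collapsed."""
    # Normalize each feature vector to unit length.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    # Cosine similarities between all pairs of samples.
    sims = unit @ unit.T
    n = features.shape[0]
    # Average over off-diagonal entries (distinct pairs only).
    avg_sim = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return 1.0 - avg_sim

rng = np.random.default_rng(0)
# Diverse features: random directions have near-zero average similarity.
diverse = rng.normal(size=(100, 64))
# Collapsed features: every sample shares one direction plus tiny noise.
shared = rng.normal(size=64)
collapsed = shared[None, :] + 0.01 * rng.normal(size=(100, 64))

d_diverse = feature_diversity(diverse)      # close to 1
d_collapsed = feature_diversity(collapsed)  # close to 0
```

Tracking such a statistic on the penultimate-layer features over training iterations is one way to observe the diversity decrease during the first phase.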
The TFC phenomenon usually happens in the early epochs of the training process, especially when the DNN is difficult to optimize. According to our analysis, the DNN is more likely to exhibit the TFC phenomenon when it is very deep, when the task is difficult, when the variance of initial weights is small, and when the DNN is trained without momentum or batch normalization layers. Specifically, by investigating the learning dynamics of the MLP, we discover a set of conditions that strengthen the TFC phenomenon. Such conditions can then be easily controlled by applying typical operations, i.e., batch normalization, momentum, L2 regularization, and network initialization. Moreover, we theoretically explain that these conditions make the training of DNNs more likely to behave like a "self-enhanced system" that drives the network towards the TFC phenomenon in early iterations. In comparison, (Glorot & Bengio, 2010; Saxe et al., 2013) investigated the influence of initialization methods on the learning-sticking problem. In this regard, we find that the effectiveness of initialization methods is probably owing to the large variance of initial weights, which avoids the TFC phenomenon during the learning-sticking phase.
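As a toy illustration of the role of the initialization variance (our own hypothetical setup, not this paper's experiments), the numpy sketch below trains a deep tanh MLP with plain full-batch gradient descent, without momentum or batch normalization. With a very small weight standard deviation the loss barely moves in the early steps, whereas a larger standard deviation lets training proceed:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, one per class (a deliberately easy task).
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0.0] * 50 + [1.0] * 50)

def train(init_std, steps=300, lr=0.2, width=32, depth=5):
    """Train a deep tanh MLP with plain full-batch gradient descent
    (no momentum, no batch normalization); return (first, last) loss."""
    r = np.random.default_rng(1)  # same seed: runs differ only in init scale
    dims = [2] + [width] * depth + [1]
    Ws = [r.normal(0.0, init_std, size=(a, b))
          for a, b in zip(dims[:-1], dims[1:])]
    first_loss = last_loss = None
    for step in range(steps):
        # Forward pass: tanh hidden layers, linear output logit.
        hs = [X]
        for W in Ws[:-1]:
            hs.append(np.tanh(hs[-1] @ W))
        logits = (hs[-1] @ Ws[-1]).ravel()
        p = 1.0 / (1.0 + np.exp(-logits))
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        if step == 0:
            first_loss = loss
        last_loss = loss
        # Backward pass for the logistic loss, then a plain SGD update.
        grad = ((p - y) / len(y))[:, None]
        for i in range(len(Ws) - 1, -1, -1):
            gW = hs[i].T @ grad
            if i > 0:
                grad = (grad @ Ws[i].T) * (1.0 - hs[i] ** 2)
            Ws[i] -= lr * gW
    return first_loss, last_loss

small_start, small_end = train(init_std=0.01)  # tiny init variance
large_start, large_end = train(init_std=0.2)   # roughly 1/sqrt(width) scale
```

With the small initialization, activations shrink layer by layer, so the gradients that reach early layers are vanishingly small and the loss stays near log 2, resembling a strong first phase; the larger initialization preserves activation scales, and the loss decreases.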

