EMPIRICAL STUDIES ON THE CONVERGENCE OF FEATURE SPACES IN DEEP LEARNING

Anonymous

Abstract

While deep learning is effective at learning features/representations from data, the distributions of samples in the feature spaces learned by various architectures for different training tasks, e.g., the latent layers of Autoencoders (AEs) and the feature vectors in Convolutional Neural Network (CNN) classifiers, have not been well studied or compared. We hypothesize that the feature spaces of networks trained with various architectures (AEs or CNNs) and tasks (supervised, unsupervised, or self-supervised learning) share some common subspaces, no matter what type of architecture is used or whether labels have been used in feature learning. To test our hypothesis, through Singular Value Decomposition (SVD) of feature vectors, we demonstrate that one can linearly project the feature vectors of the same group of samples to a similar distribution, where the distribution is represented by the top left singular vector (i.e., the principal subspace of feature vectors), namely the P-vector. We further assess the convergence of feature space learning using the angles between the P-vectors obtained from the well-trained model and its checkpoint per epoch during the learning procedure, where a quasi-monotonic converging trend from nearly orthogonal to smaller angles (e.g., 10°) has been observed. Finally, we carry out case studies connecting P-vectors to the data distribution and generalization performance. Extensive experiments with practically used Multi-Layer Perceptron (MLP), AE, and CNN architectures for classification, image reconstruction, and self-supervised learning tasks on the MNIST, CIFAR-10, and CIFAR-100 datasets support our claims with solid evidence.

1. INTRODUCTION

Blessed by the capacity of feature learning, deep neural networks (DNNs) (LeCun et al., 2015) have been widely used to perform learning tasks ranging from classification to generation (Goodfellow et al., 2014; Radford et al., 2015), in various settings (e.g., supervised, unsupervised, and self-supervised learning). To better analyze the features learned by deep models, numerous works have studied the interpretation of the feature spaces of well-trained models (Simonyan et al., 2013; White, 2016; Zhu et al., 2016; Bau et al., 2017; 2019; Jahanian et al., 2020; Zhang & Wu, 2020).

Invariance beyond the use of architectures and labels. While existing studies primarily focus on interpreting a given model to discover mappings from the feature space to the outputs of the model (e.g., classification (Bau et al., 2017) and generation (Jahanian et al., 2020)), few works compare the feature spaces learned by deep models of varying architectures (e.g., MLP/CNN classifiers versus Autoencoders) under different learning paradigms (Chen et al., 2020; Khosla et al., 2020; Spinner et al., 2018). More specifically, we are particularly interested in whether there exists certain "statistical invariance" in the feature space, no matter what type of architecture is used or whether label information (e.g., supervised vs. unsupervised vs. self-supervised (Chen et al., 2020) learning) is used in feature learning on the same training dataset.

Hypotheses. It is not difficult to imagine that the feature spaces of well-trained DNN classifiers in a supervised learning setting might share some linear subspace (Vaswani et al., 2018). When models are well fitted to the same training set, the feature vectors of training samples should be projected onto the ground-truth labels by a fully-connected layer (i.e., a linear transform), and such a linear subspace is supposed to distribute samples in a discriminative manner.
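To make this intuition concrete, consider a minimal synthetic sketch (the prototypes, dimensions, and noise level below are illustrative assumptions, not taken from the paper's experiments): when features cluster by class, a single linear map fitted by least squares already projects them onto their labels.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "well-learned" features: 3 classes, each clustered near a prototype
prototypes = rng.normal(size=(3, 16))
labels = rng.integers(0, 3, size=300)
features = prototypes[labels] + 0.2 * rng.normal(size=(300, 16))

# The final fully-connected layer is just a linear transform;
# fit it to one-hot labels by least squares
one_hot = np.eye(3)[labels]
W, *_ = np.linalg.lstsq(features, one_hot, rcond=None)
accuracy = ((features @ W).argmax(axis=1) == labels).mean()
print(f"training accuracy of the linear head: {accuracy:.2f}")
```

When the features are discriminative, the linear head recovers nearly all labels, which is what motivates looking for shared linear structure in the feature space.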
We further conjecture that such a subspace might be shared not only by supervised learners but also by AEs, which are trained to reconstruct input data without any label information in an unsupervised manner, and even by self-supervised DNN classifiers (e.g., SimCLR (Chen et al., 2020)), which train (1) a CNN feature extractor (using a contrastive loss without labels) and (2) linear classifiers (using a discriminative loss based on labels) separately in an ad-hoc manner. More specifically, we hypothesize that (H.I) there exist certain common feature subspaces shared by well-trained deep models using the same training dataset, even though the architectures (MLPs, CNNs, and AEs) and learning paradigms (supervised, unsupervised, and self-supervised) are significantly different. Further, as the training procedure usually initializes a DNN model from random weights and learns features from the training set step by step, we hypothesize that (H.II) the training procedure gradually shapes the feature subspace over training iterations and asymptotically converges to the common subspace under certain statistical measures. Finally, we hypothesize that (H.III) the convergence to the common feature subspace connects to the data distribution and the performance of models, as such behavior indicates how well the features are learned from data.
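The measurements behind these hypotheses reduce to computing a P-vector per model and comparing angles between P-vectors. A minimal sketch, using NumPy with synthetic stand-ins for extracted feature matrices (the shapes and perturbation level are illustrative assumptions):

```python
import numpy as np

def p_vector(features):
    """Top left singular vector of a (#samples x #features) feature matrix."""
    # full_matrices=False keeps only the leading min(n, d) singular vectors
    u, _, _ = np.linalg.svd(features, full_matrices=False)
    return u[:, 0]

def p_vector_angle(feats_a, feats_b):
    """Angle (degrees) between the P-vectors of two models on the same samples."""
    # Singular vectors are sign-ambiguous, so compare absolute cosine
    cos = abs(np.dot(p_vector(feats_a), p_vector(feats_b)))
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 64)) + 2.0          # features with a dominant direction
near = base + 0.1 * rng.normal(size=(500, 64))   # a slightly perturbed "checkpoint"
far = rng.normal(size=(500, 64))                 # unrelated random features

print(p_vector_angle(base, near))  # small angle
print(p_vector_angle(base, far))   # nearly orthogonal
```

Because both feature matrices describe the same set of samples, their left singular vectors live in the same sample-indexed space, which is what makes this angle comparable across architectures.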
This hypothesis is motivated by the observation that when a DNN model tends to be linear, its feature subspace should be close to the data subspace, while well-trained DNN models should be locally linear (Zhang & Wu, 2020) or piecewise linear (Arora et al., 2018).

Contributions. To test the above three hypotheses, this work contributes new measures of DNN features, namely P-vectors, and extensive experiments for empirical studies. We train deep models using various DNN architectures, multiple learning paradigms, and datasets, with a checkpoint restored per epoch. Then, we extract the feature vectors of either the training or testing sample set from each model (please see Section 3 for details) and discover some interesting relationships and associations, as discussed below. I. P-vector and Convergence: Given the matrix of feature vectors (#samples × #features) for either training or testing samples, we perform singular value decomposition (SVD) to obtain the left and right singular vectors, which characterize the subspace in which samples distribute and the projection of features onto that subspace, respectively. We observe that deep models well-trained using the same



We follow the convention of denoting the term "the number of" by # for short.



Figure 1: The Common Feature Subspace and Converging Trends of P-vector Angles with CIFAR-10. Figure 1 (a)-(b) present the cosine (in the range of [0,1]) of angles between the P-vectors of well-trained models of various architectures under different learning paradigms, using the training and testing datasets respectively. A well-trained model here is one trained under the suggested settings for 200 epochs for supervised/self-supervised CNN classifiers and 100 epochs for unsupervised AEs. Figure 1 (c)-(h) present the angles of P-vectors between the well-trained model and its checkpoint per training epoch for three learning paradigms, where a converging trend of P-vector angles from nearly orthogonal to smaller ones is observed in all models, no matter whether the feature extractors of these models are trained with or without labels. Note that we carried out experiments with different random seeds in 5 independent trials to obtain the averaged results above. More discussion is provided in Section 4.

