DEEP CLUSTERING AND REPRESENTATION LEARNING THAT PRESERVES GEOMETRIC STRUCTURES

Abstract

In this paper, we propose a novel framework for Deep Clustering and multi-manifold Representation Learning (DCRL) that preserves the geometric structure of data. In the proposed DCRL framework, manifold clustering is done in the latent space guided by a clustering loss. To overcome the problem that clustering-oriented losses may deteriorate the geometric structure of embeddings in the latent space, an isometric loss is proposed for preserving the intra-manifold structure locally, and a ranking loss for preserving the inter-manifold structure globally. Experimental results on various datasets show that the DCRL framework achieves performance comparable to current state-of-the-art deep clustering algorithms, yet exhibits superior performance for manifold representation. Our results also demonstrate the importance and effectiveness of the proposed losses in preserving geometric structure in terms of visualization and performance metrics. The code is provided in the Supplementary Material.

1 This claim was first made by IDEC (Guo et al., 2017), but they did not provide experiments to support it. In this paper, however, we show that the geometry of the latent space is indeed disrupted, by visualization of learned embeddings (Fig. 4), visualization of the clustering process (Fig. A3), and statistical analysis (Fig. A5).

1. INTRODUCTION

Clustering, a fundamental tool for data analysis and visualization, has been an essential research topic in data science and machine learning. Conventional clustering algorithms such as K-Means (MacQueen, 1965), Gaussian Mixture Models (GMM) (Bishop, 2006), and spectral clustering (Shi & Malik, 2000) perform clustering based on distance or similarity. However, handcrafted distance or similarity measures are rarely reliable for large-scale high-dimensional data, making it increasingly challenging to achieve effective clustering. An intuitive solution is to transform the data from the high-dimensional input space to a low-dimensional latent space and then cluster the data in the latent space. This can be achieved by applying dimensionality reduction techniques such as PCA (Wold et al., 1987), t-SNE (Maaten & Hinton, 2008), and UMAP (McInnes et al., 2018). However, since these methods are not specifically designed for clustering tasks, some of their properties may be contrary to our expectations, e.g., two data points from different manifolds that are close in the input space will be drawn even closer in the latent space derived by UMAP. Therefore, the first question is: how to learn a manifold representation that favors clustering? The two main points for multi-manifold representation learning are Point (1), preserving the local geometric structure within each manifold, and Point (2), ensuring the discriminability between different manifolds. Most previous work starts with the assumption that the label of each data point is known and then designs the algorithm in a supervised manner, which greatly simplifies the problem of multi-manifold learning. However, it is challenging to decouple complex crossover relations and ensure discriminability between different manifolds, especially in unsupervised settings.
One natural strategy is to achieve Point (2) by performing clustering in the input space to obtain pseudo-labels and then performing representation learning for each manifold. However, clustering is in fact contradictory to Point (1) (analyzed in detail in Sec. 3.3), making it important to alleviate this contradiction so that clustering helps both Point (1) and Point (2). Thus, the second question is: how to cluster data in a way that favors learning manifold representations? To answer these two questions, some pioneering work has proposed to integrate deep clustering and representation learning into a unified framework by defining a clustering-oriented loss. Though promising performance has been demonstrated on various datasets, we observe that a vital factor has been ignored by these works: the defined clustering-oriented loss may deteriorate the geometric structure of the latent space 1, which in turn hurts the performance of visualization, clustering generalization, and manifold representation. In this paper, we propose to jointly perform deep clustering and multi-manifold representation learning with geometric structure preservation. Inspired by Xie et al. (2016), the clustering centers are defined as a set of learnable parameters, and we use a clustering loss to simultaneously guide the separation of data points from different manifolds and the learning of the clustering centers. To prevent the clustering loss from deteriorating the latent space, an isometric loss and a ranking loss are proposed to preserve the intra-manifold structure locally and the inter-manifold structure globally. Finally, we achieve the following three goals related to clustering, geometric structure, and manifold representation: (1) Clustering helps to ensure inter-manifold discriminability; (2) Local structure preservation can be achieved in the presence of clustering; (3) Geometric structure preservation helps clustering.
The contributions of this work are summarized as follows:
• Proposing to integrate deep clustering and multi-manifold representation learning into a unified framework with local and global structure preservation.
• Unlike conventional multi-manifold learning algorithms that deal with all point-pair relationships between different manifolds simultaneously, we set the clustering centers as a set of learnable parameters and achieve global structure preservation in a faster, more efficient, and easier-to-optimize manner by applying a ranking loss to the clustering centers.
• Analyzing the contradiction between the two optimization goals of clustering and local structure preservation, and proposing an elegant training strategy to alleviate it.
• The proposed DCRL algorithm outperforms competing algorithms in terms of clustering performance, generalizability to out-of-sample data, and performance in manifold representation.

2. RELATED WORK

Clustering analysis. As a fundamental tool in machine learning, clustering has been widely applied in various domains. One branch of classical clustering comprises K-Means (MacQueen, 1965) and Gaussian Mixture Models (GMM) (Bishop, 2006), which are fast, easy to understand, and applicable to a large number of problems. However, limited by the Euclidean measure, their performance on high-dimensional data is often unsatisfactory. Spectral clustering and its variants (such as SC-Ncut (Shi & Malik, 2000)) extend clustering to high-dimensional data by allowing more flexible distance measures; however, limited by the computational cost of the full Laplacian matrix, spectral clustering is challenging to extend to large-scale datasets. A pioneering work in deep clustering is DEC (Xie et al., 2016), which learns a mapping from the input space to a low-dimensional latent space by iteratively optimizing a clustering-oriented objective. As a modified version of DEC, IDEC (Guo et al., 2017) claims to preserve the local structure of the data, but in reality its contribution amounts to adding a reconstruction loss. JULE (Yang et al., 2016b) unifies unsupervised representation learning with clustering based on the CNN architecture to improve clustering accuracy, and can be considered a neural extension of hierarchical clustering. DSC (Yang et al., 2019) devises a dual autoencoder to embed data into the latent space, and then deep spectral clustering (Shaham et al., 2018) is applied to obtain label assignments. ASPC-DA (Guo et al., 2019) combines data augmentation with self-paced learning to encourage the learned features to be cluster-oriented. Although both fields sometimes evaluate performance in terms of accuracy, we would like to highlight that deep clustering and visual self-supervised learning (SSL) are two different research fields.
SSL typically uses more powerful CNN architectures (applicable only to image data) and sophisticated techniques such as contrastive learning (He et al., 2020), data augmentation (Chen et al., 2020), and clustering (Zhan et al., 2020; Ji et al., 2019; Van Gansbeke et al., 2020) for better performance on large-scale datasets such as ImageNet. Deep clustering, however, uses a general MLP architecture (applicable to both image and vector data), so it is difficult to scale directly to large datasets without those sophisticated techniques.

Manifold Representation Learning. Isomap, a representative algorithm of single-manifold learning, aims to capture global nonlinear features and seeks an optimal subspace that best preserves the geodesic distances between data points (Tenenbaum et al., 2000). In contrast, some algorithms, such as Locally Linear Embedding (LLE) (Roweis & Saul, 2000), are more concerned with preserving local neighborhood information. Combining DNNs with manifold learning, the recently proposed Markov-Lipschitz Deep Learning (MLDL) algorithm preserves local and global geometry by imposing Locally Isometric Smoothness (LIS) prior constraints (Li et al., 2020). Furthermore, multi-manifold learning has been proposed to obtain the intrinsic properties of different manifolds. Yang et al. (2016a) proposed a supervised discriminant Isomap where data points are partitioned into different manifolds according to label information. Similarly, Zhang et al. (2018) proposed a semi-supervised learning framework that uses labeled and unlabeled training samples to jointly learn local neighborhood-preserving features. Most previous work on multi-manifold learning considers the problem from the perspective that labels are known or partially known, which significantly simplifies the problem.
However, it is challenging to decouple multiple overlapping manifolds in unsupervised settings, and that is what this paper aims to explore.

3. PROPOSED METHOD

Consider a dataset $X$ with $N$ samples, where each sample $x_i \in \mathbb{R}^d$ is sampled from one of $C$ different manifolds $\{\mathcal{M}_c\}_{c=1}^{C}$. Assume that each category in the dataset lies on a compact low-dimensional manifold and that the number of manifolds $C$ is prior knowledge. Define two nonlinear mappings $z_i = f(x_i; \theta_f)$ and $y_i = g(z_i; \theta_g)$, where $z_i \in \mathbb{R}^m$ is the embedding of $x_i$ in the latent space and $y_i$ is the reconstruction of $x_i$. The $j$-th cluster center is denoted as $\mu_j \in \mathbb{R}^m$, where $\{\mu_j\}_{j=1}^{C}$ is defined as a set of learnable parameters. We aim to find optimal parameters $\theta_f$ and $\mu$ so that the embeddings $\{z_i\}_{i=1}^{N}$ achieve clustering with local and global structure preservation. To this end, a denoising autoencoder (Vincent et al., 2010), shown in Fig 1, is first pre-trained in an unsupervised manner to learn an initial latent space. The denoising autoencoder optimizes the self-reconstruction loss $L_{AE} = \mathrm{MSE}(x, y)$, where the encoder input $\hat{x} = x + \mathcal{N}(0, \sigma^2)$ is a copy of $x$ with Gaussian noise added. Then the autoencoder is fine-tuned by optimizing the clustering-oriented loss $L_{cluster}(z, \mu)$ and the structure-oriented losses $L_{rank}(x, \mu)$, $L_{LIS}(x, z)$, and $L_{align}(z, \mu)$. Since clustering should be performed on features of clean data, the clean data $x$ (rather than the noised $\hat{x}$ used by the denoising autoencoder) is used for fine-tuning.
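To make the pre-training objective concrete, below is a minimal NumPy sketch of the denoising reconstruction loss on a *linear* autoencoder. This is an illustration only: the paper uses a deep MLP autoencoder in PyTorch, and the linear model, data shapes, and hyperparameters here are our own assumptions.

```python
import numpy as np

def pretrain_linear_dae(X, m=2, sigma=0.1, lr=0.05, epochs=300, seed=0):
    """Minimal linear denoising autoencoder: z = W x_hat, y = W^T z.

    Trained on MSE(x, y), where x_hat = x + N(0, sigma^2) is the corrupted
    input and y reconstructs the *clean* x, mirroring L_AE above.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.normal(0.0, 0.1, size=(m, d))      # encoder weights; decoder is W.T
    for _ in range(epochs):
        X_hat = X + rng.normal(0.0, sigma, size=X.shape)  # corrupt the input
        Z = X_hat @ W.T                                   # embeddings (N, m)
        Y = Z @ W                                         # reconstruction of clean X
        G = (Y - X) / N                                   # residual, averaged over samples
        # Full gradient of 0.5 * mean_i ||Y_i - X_i||^2 w.r.t. W (both paths).
        W -= lr * (Z.T @ G + W @ G.T @ X_hat)
    return W
```

The decoder being the encoder's transpose is a simplification; the paper's encoder is d-500-500-500-2000-10 with a mirrored decoder.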


Figure 1 : The framework of the proposed DCRL method. The encoder, decoder, latent space, and cluster centers are marked as blue, red, green, and purple, respectively.

3.1. CLUSTERING-ORIENTED LOSS

First, the cluster centers $\{\mu_j\}_{j=1}^{C}$ in the latent space $Z$ are initialized (the initialization method will be introduced in Sec 4.1). Then the similarity between an embedded point $z_i$ and cluster center $\mu_j$ is measured by Student's t-distribution:

$$q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2\right)^{-1}}{\sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2\right)^{-1}} \quad (1)$$

The auxiliary target distribution, designed to help manipulate the latent space, is defined as:

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \quad \text{where } f_j = \sum_i q_{ij} \quad (2)$$

where $f_j$ is the soft cluster frequency, used to balance the sizes of different clusters. The encoder is then optimized by the following objective:

$$L_{cluster} = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \quad (3)$$

The gradient of $L_{cluster}$ with respect to each learnable cluster center $\mu_j$ can be computed as:

$$\frac{\partial L_{cluster}}{\partial \mu_j} = -\sum_i \left(1 + \|z_i - \mu_j\|^2\right)^{-1} \left(p_{ij} - q_{ij}\right)\left(z_i - \mu_j\right) \quad (4)$$

$L_{cluster}$ facilitates the aggregation of data points within the same manifold, while data points from different manifolds are kept away from each other. However, we find that this clustering-oriented loss may deteriorate the geometric structure of the latent space, which hurts clustering accuracy and leads to meaningless representations. To prevent this deterioration, we introduce the isometry loss $L_{LIS}$ and the ranking loss $L_{rank}$ to preserve the local and global structure, respectively.
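The soft assignment, target distribution, and KL objective above can be sketched in NumPy as follows. This is a didactic re-implementation, not the paper's released code; function names are ours.

```python
import numpy as np

def soft_assign(Z, mu):
    """Student's-t similarity q_ij between embeddings Z (N, m) and centers mu (C, m)."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances (N, C)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)               # normalize over centers

def target_dist(q):
    """Auxiliary target p_ij = (q_ij^2 / f_j) / sum_j' (q_ij'^2 / f_j'), f_j = sum_i q_ij."""
    w = q ** 2 / q.sum(axis=0)                            # f_j = column sums of q
    return w / w.sum(axis=1, keepdims=True)

def kl_cluster_loss(p, q):
    """L_cluster = KL(P || Q) summed over all points and clusters."""
    return float((p * np.log(p / q)).sum())
```

In training, $P$ is treated as a fixed target while gradients flow through $Q$ into both the encoder and the centers.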

3.2. STRUCTURE-ORIENTED LOSS

Intra-manifold Isometry Loss. The intra-manifold local structure is preserved by optimizing the following objective:

$$L_{LIS} = \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^Z} \left| d_X(x_i, x_j) - d_Z(z_i, z_j) \right| \cdot \pi\left(l(x_i) = l(x_j)\right) \quad (5)$$

where $\mathcal{N}_i^Z$ represents the neighborhood of data point $z_i$ in the latent space $Z$, determined by kNN; $\pi(\cdot) \in \{0, 1\}$ is an indicator function; and $l(x_i)$ is a manifold determination function that returns the index $s_i$ of the manifold where sample $x_i$ is located, that is, $s_i = l(x_i) = \arg\max_j p_{ij}$. We can then derive $C$ manifolds $\{\mathcal{M}_c\}_{c=1}^{C}$ with $\mathcal{M}_c = \{x_i : s_i = c,\ i = 1, 2, \dots, N\}$. In a nutshell, $L_{LIS}$ constrains the isometry within each manifold.

Inter-manifold Ranking Loss. The inter-manifold global structure is preserved by optimizing the following objective:

$$L_{rank} = \sum_{i=1}^{C} \sum_{j=1}^{C} \left| d_Z(\mu_i, \mu_j) - \kappa \cdot d_X\left(v_i^X, v_j^X\right) \right| \quad (6)$$

where $\{v_j^X\}_{j=1}^{C}$ are the centers of the different manifolds in the original input space $X$, with $v_j^X = \frac{1}{|\mathcal{M}_j|} \sum_{i \in \mathcal{M}_j} x_i$ ($j = 1, 2, \dots, C$). The parameter $\kappa$ determines the extent to which different manifolds move away from each other: the larger $\kappa$ is, the further apart the manifolds are pushed. The derivation of the gradient of $L_{rank}$ with respect to each learnable cluster center $\mu_j$ is given in Appendix A.1. Note that $L_{rank}$ is optimized iteratively, rather than by initializing $\{\mu_j\}_{j=1}^{C}$ once and then separating different clusters based only on the initialization results. In contrast to our approach, conventional methods for inter-manifold separation typically impose push-away constraints on all data points from different manifolds (Zhang et al., 2018; Yang et al., 2016a), defined as:

$$L_{sep} = -\sum_{i=1}^{N} \sum_{j=1}^{N} d_Z(z_i, z_j) \cdot \pi\left(l(x_i) \neq l(x_j)\right) \quad (7)$$

The main differences between $L_{rank}$ and $L_{sep}$ are as follows: (1) $L_{sep}$ imposes constraints on the embedded points $\{z_i\}_{i=1}^{N}$, which in turn indirectly affect the network parameters $\theta_f$. In contrast, $L_{rank}$ imposes rank-preservation constraints directly on the learnable parameters $\{\mu_j\}_{j=1}^{C}$, in the form of a regularization term, to control the separation of the clustering centers. (2) $L_{rank}$ is easier to optimize, faster to process, and more accurate. $L_{sep}$ is imposed on all data points from different manifolds, which involves $N \times N$ point-to-point relationships. This means that each point may be subject to push-away forces from other manifolds while simultaneously having to satisfy the isometry constraint with its neighboring points. Under these two competing constraints, optimization is difficult, easily falling into a local optimum and producing inaccurate results. In contrast, $L_{rank}$ is imposed directly on the clustering centers, involving only $C \times C$ cluster-to-cluster relationships, which avoids the above problem and makes optimization easier. (3) The parameter $\kappa$ introduced in $L_{rank}$ allows us to control the extent of separation between manifolds for specific downstream tasks.

Alignment Loss. Note that the global ranking loss $L_{rank}$ is imposed directly on the learnable parameters $\{\mu_j\}_{j=1}^{C}$, so optimizing $L_{rank}$ only updates $\{\mu_j\}_{j=1}^{C}$ rather than the encoder's parameters $\theta_f$. However, the optimization of $\{\mu_j\}_{j=1}^{C}$ relies not only on $L_{rank}$ but is also constrained by $L_{cluster}$, which ensures that data points remain roughly distributed around the cluster centers and do not deviate significantly from them during optimization. The alignment loss $L_{align}$, as an auxiliary term, helps align the learnable cluster centers $\{\mu_j\}_{j=1}^{C}$ with the real cluster centers $\{v_j^Z\}_{j=1}^{C}$ and makes this binding stronger:

$$L_{align} = \sum_{j=1}^{C} \left\| \mu_j - v_j^Z \right\| \quad (8)$$

where $v_j^Z = \frac{1}{|\mathcal{M}_j|} \sum_{i \in \mathcal{M}_j} z_i$ ($j = 1, 2, \dots, C$). The derivation of the gradient of $L_{align}$ with respect to each learnable cluster center $\mu_j$ is given in Appendix A.1.
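The intra-manifold isometry loss and the inter-manifold ranking loss above can be sketched in NumPy as follows. This is a simplified illustration under our own assumptions: pairwise distance matrices and pseudo-labels are precomputed, and gradients/optimization are omitted.

```python
import numpy as np

def knn_indices(D, k):
    """Indices of the k nearest neighbors per row of a distance matrix (self excluded)."""
    return np.argsort(D, axis=1)[:, 1:k + 1]

def lis_loss(DX, DZ, labels, k=5):
    """Intra-manifold isometry: |d_X - d_Z| over latent-space kNN pairs
    whose pseudo-labels (argmax_j p_ij) agree, as in the L_LIS objective."""
    nbrs = knn_indices(DZ, k)                 # neighborhoods taken in latent space Z
    total = 0.0
    for i in range(DZ.shape[0]):
        for j in nbrs[i]:
            if labels[i] == labels[j]:        # indicator pi(l(x_i) = l(x_j))
                total += abs(DX[i, j] - DZ[i, j])
    return total

def rank_loss(mu, vX, kappa=3.0):
    """Inter-manifold term: | d_Z(mu_i, mu_j) - kappa * d_X(v_i, v_j) | over center pairs."""
    dZ = np.linalg.norm(mu[:, None] - mu[None, :], axis=-1)
    dX = np.linalg.norm(vX[:, None] - vX[None, :], axis=-1)
    return float(np.abs(dZ - kappa * dX).sum())
```

Both losses vanish when the embedding is an exact isometry and the centers reproduce the (scaled) input-space center geometry, which matches their role as structure-preservation regularizers.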

3.3.1. CONTRADICTION

The contradiction between clustering and local structure preservation is analyzed from a force-analysis perspective. As shown in Fig 2, assume there exists a data point (red) with its three nearest neighbors (blue) around a cluster center (gray). When clustering and local structure preservation are optimized simultaneously, it is easy to fall into a local optimum in which the data point reaches a steady state: the resultant force from its three nearest neighbors is equal in magnitude and opposite in direction to the attractive force of the cluster center. The following training strategy is therefore applied to prevent such local optima.

3.3.2. ALTERNATING TRAINING AND WEIGHT CONTINUATION

Alternating Training. To solve the above problem and integrate the goals of clustering and structure preservation into a unified framework, we adopt an alternating training strategy. Within each epoch, we first jointly optimize $L_{cluster}$ and $L_{rank}$ in mini-batches, with the joint loss defined as

$$L_1 = L_{AE} + \alpha L_{cluster} + L_{rank} \quad (9)$$

where $\alpha$ is a weighting factor that balances the effects of clustering and global rank-preservation. Then, at each epoch, we optimize the isometry loss $L_{LIS}$ and $L_{align}$ on the whole dataset:

$$L_2 = \beta L_{LIS} + L_{align} \quad (10)$$

Weight continuation. At different stages of training, we have different expectations for clustering and structure preservation. At the beginning of training, to successfully decouple the overlapping manifolds, we let $L_{cluster}$ dominate and $L_{LIS}$ act as an auxiliary. Once the margin between different manifolds is sufficiently pronounced, the weight $\alpha$ for $L_{cluster}$ is gradually reduced, while the weight $\beta$ for $L_{LIS}$ is gradually increased, shifting the focus to preserving local isometry. The whole algorithm is summarized in Algorithm 1 in Appendix A.2.

Three-stage explanation. Intuitively, training proceeds in three stages: first, pre-training provides an initial embedding with intact local geometry; second, $L_{cluster}$ dominates, decoupling the overlapping manifolds at the cost of some local geometric distortion; third, as $\beta$ grows, $L_{LIS}$ dominates and restores the intra-manifold geometry while the manifolds remain separated.

4. EXPERIMENTS

4.1. EXPERIMENTAL SETTINGS

Parameter settings. Currently, we use the MLP architecture for this version and will extend it to ConvAE in the future. The encoder structure is d-500-500-500-2000-10, where d is the dimension of the input data, and the decoder is its mirror. After pre-training, to initialize the learnable clustering centers, t-SNE is applied to further reduce the latent space Z to 2 dimensions, and then the K-Means algorithm is run to obtain label assignments for each data point. The centers of each category in the latent space Z are set as the initial cluster centers $\{\mu_j\}_{j=1}^{C}$. The batch size is set to 256, the number of epochs to 300, the nearest-neighbor parameter k to 5, and the parameter $\kappa$ to 3 for all datasets. A sensitivity analysis for the parameters k and $\kappa$ is available in Appendix A.12.
Besides, the Adam optimizer (Kingma & Ba, 2014) with learning rate λ=0.001 is used. As described in Sec 3.3.2, weight continuation is applied to train the model: the weight $\alpha$ for $L_{cluster}$ decreases linearly from 0.1 to 0 over epochs 0-150, while the weight $\beta$ for $L_{LIS}$ increases linearly from 0 to 1.0 over epochs 0-150. Each set of experiments is run 5 times with 5 different random seeds, and the results are averaged into the final performance metrics. The implementation uses the PyTorch library running on an NVIDIA V100 GPU.

Evaluation Metrics. Two standard evaluation metrics, Accuracy (ACC) and Normalized Mutual Information (NMI) (Xu et al., 2003), are used to evaluate clustering performance. Besides, six evaluation metrics are adopted to evaluate the performance of multi-manifold representation learning: Relative Rank Error (RRE), Trustworthiness (Trust), Continuity (Cont), Root Mean Reconstruction Error (RMRE), Locally Geometric Distortion (LGD), and Cluster Rank Accuracy (CRA). Limited by space, their precise definitions are available in Appendix A.4.
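The linear weight continuation described above can be expressed as a small schedule function. The function name and keyword defaults are illustrative; the endpoint values follow the settings in this section (α: 0.1 → 0, β: 0 → 1.0 over epochs 0-150, constant thereafter).

```python
def weight_schedule(epoch, ramp_end=150, alpha0=0.1, beta1=1.0):
    """Linear weight continuation for the DCRL fine-tuning losses.

    alpha decays alpha0 -> 0 and beta grows 0 -> beta1 over epochs
    [0, ramp_end]; both are held constant after ramp_end.
    """
    t = min(max(epoch / ramp_end, 0.0), 1.0)  # clamp progress to [0, 1]
    alpha = alpha0 * (1.0 - t)                # weight for L_cluster
    beta = beta1 * t                          # weight for L_LIS
    return alpha, beta
```

At each epoch the returned pair would scale $L_{cluster}$ in $L_1$ and $L_{LIS}$ in $L_2$, shifting emphasis from manifold decoupling to isometry preservation.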

4.2.1. QUANTITATIVE COMPARISON

The metrics ACC/NMI of different methods on various datasets are reported in Tab 1. For those comparison methods whose results are not reported on some datasets, or whose experimental settings are unclear, we run the released code using the hyperparameters provided in their papers with the same random seeds and initialization, report their average performance, and label them with (*). While ASPC-DA achieves the best performance on three datasets (MNIST-test, MNIST-full, and USPS), its performance gains do not come directly from clustering, but from sophisticated modules such as data augmentation and self-paced learning. Once these modules are removed, its performance degrades sharply: with data augmentation removed, ASPC-DA achieves an accuracy of 0.931 (vs 0.988) on MNIST-full, 0.813 (vs 0.973) on MNIST-test, and 0.768 (vs 0.982) on USPS. Though ASPC-DA is based on the MLP architecture, its image-based Data Augmentation (DA) cannot be applied directly to vector data, which explains why ASPC-DA has no performance advantage on the vector-based REUTERS-10K and HAR datasets (even compared to DEC and IDEC). In a fairer comparison (excluding ASPC-DA), DCRL outperforms K-Means and SC-Ncut by a significant margin and surpasses the other seven competing DNN-based algorithms on all datasets except MNIST-test. Even on MNIST-test, we still rank second, outperforming the third-best method by 1.1%. In particular, we obtain the best performance on the Fashion-MNIST and HAR (vector) datasets; more notably, our clustering accuracy exceeds the current SOTA method by 5.1% and 4.9%, respectively. Tab 2 demonstrates that a learned DCRL model generalizes well to unseen data with high clustering accuracy. Taking MNIST-full as an example, DCRL was trained using 50,000 training samples and then tested on the remaining 20,000 testing samples using the learned model.
In terms of the metrics ACC and NMI, our method is optimal for both training and testing samples. More importantly, there is hardly any degradation in the performance of our method on the testing samples compared to the training samples, while all other methods show a significant drop in performance, e.g., DEC from 84.1% to 74.8%. This demonstrates the importance of geometric structure preservation for good generalizability. The testing visualization available in Appendix A.5 shows that DCRL maintains clear inter-cluster boundaries even on the test samples, which demonstrates the strong generalizability of our method. Additionally, the embedding of the latent space during the training process is visualized in Appendix A.6, which is highly consistent with the three-stage explanation mentioned in Sec 3.3.2, showing that the clustering-oriented loss does indeed deteriorate the local geometric structure of the latent space and that the designed $L_{LIS}$ helps to recover it. In addition, in the above experiments, the cluster number C is assumed to be a known prior (consistent with the assumptions of almost all deep clustering algorithms). We therefore provide an additional experiment to explore what happens when C is larger than the number of true clusters. We find that clusters split, but the different categories still maintain clear boundaries and are not mixed together, somewhat similar to hierarchical clustering. See Appendix A.7 for detailed experimental settings and analysis.

4.3. EVALUATION OF MULTI-MANIFOLD REPRESENTATION LEARNING

Although numerous previous works have claimed to bring clustering and representation learning into a unified framework, they all, unfortunately, lack an analysis of the effectiveness of the learned representations. In this paper, we compare DCRL with five other methods on six evaluation metrics across six datasets. (Limited by space, only the MNIST-full results are provided in Tab 3; the complete results are in Appendix A.8.) The results show that DCRL outperforms all other methods, especially in the CRA metric, which is not only the best on all datasets but also reaches 1.0. This means that the "rank" between different manifolds in the latent space is completely preserved and undamaged, which proves the effectiveness of our global ranking loss $L_{rank}$. Moreover, statistical analysis is performed to show the extent to which local and global structure is preserved in the latent space by each algorithm; limited by space, it is placed in Appendix A.9. Furthermore, we also evaluate whether the learned representations are meaningful through downstream tasks; this experiment is available in Appendix A.10.

5. CONCLUSION

The proposed DCRL framework imposes clustering-oriented and structure-oriented constraints to optimize the latent space, simultaneously performing clustering and multi-manifold representation learning with local and global structure preservation. Extensive experiments on image and vector datasets demonstrate that DCRL is not only comparable to state-of-the-art deep clustering algorithms but also able to learn effective and robust manifold representations, which is beyond the capability of clustering methods that only care about clustering accuracy. Future work will focus on adaptively determining the number of manifolds (clusters) and extending our work to CNN architectures for large-scale datasets.

APPENDIX

A.1 GRADIENT DERIVATION

In the paper, we have emphasized that $\{\mu_j\}_{j=1}^{C}$ is a set of learnable parameters, which means that we can optimize it while optimizing the network parameters $\theta_f$. In Eq. (4) of the paper, we presented the gradient of $L_{cluster}$ with respect to $\mu_j$. In addition to $L_{cluster}$, both $L_{rank}$ and $L_{align}$ involve $\mu_j$; hence, the detailed derivations of their gradients with respect to $\mu_j$ are also provided here.

The Euclidean metric is used for both the input space and the latent space, i.e., $d_Z(\mu_{i'}, \mu_{j'}) = \|\mu_{i'} - \mu_{j'}\|$. For clarity of derivation, we slightly abuse notation and write $K_{i'j'} = \kappa \cdot d_X(v_{i'}^X, v_{j'}^X)$. The gradient of $L_{rank}$ with respect to each learnable cluster center $\mu_j$ can be computed as:

$$\frac{\partial L_{rank}}{\partial \mu_j} = \frac{\partial}{\partial \mu_j} \sum_{i'=1}^{C} \sum_{j'=1}^{C} \left| \|\mu_{i'} - \mu_{j'}\| - K_{i'j'} \right| \quad (11)$$

Only the terms with $i' = j$ or $j' = j$ depend on $\mu_j$, and by symmetry the two groups of terms contribute equally, so Eq. (11) can be further derived as:

$$\frac{\partial L_{rank}}{\partial \mu_j} = \sum_{i'=1}^{C} \frac{\partial \left| \|\mu_{i'} - \mu_j\| - K_{i'j} \right|}{\partial \mu_j} + \sum_{j'=1}^{C} \frac{\partial \left| \|\mu_j - \mu_{j'}\| - K_{jj'} \right|}{\partial \mu_j} = 2 \sum_{i'=1}^{C} \frac{\mu_j - \mu_{i'}}{\|\mu_j - \mu_{i'}\|} \cdot \frac{\|\mu_j - \mu_{i'}\| - K_{ji'}}{\left| \|\mu_j - \mu_{i'}\| - K_{ji'} \right|} = 2 \sum_{i'=1}^{C} \frac{\mu_j - \mu_{i'}}{d_Z(\mu_j, \mu_{i'})} \cdot \frac{d_Z(\mu_j, \mu_{i'}) - \kappa \cdot d_X(v_j^X, v_{i'}^X)}{\left| d_Z(\mu_j, \mu_{i'}) - \kappa \cdot d_X(v_j^X, v_{i'}^X) \right|}$$

The gradient of $L_{align}$ with respect to each learnable cluster center $\mu_j$ can be computed as:

$$\frac{\partial L_{align}}{\partial \mu_j} = \frac{\partial}{\partial \mu_j} \sum_{j'=1}^{C} \left\| \mu_{j'} - v_{j'}^Z \right\| = \frac{\partial \left\| \mu_j - v_j^Z \right\|}{\partial \mu_j} = \frac{\mu_j - v_j^Z}{\left\| \mu_j - v_j^Z \right\|}$$

A.2
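The closed-form gradient of $L_{align}$ can be sanity-checked against central finite differences in NumPy. This check is an illustration we add here, not part of the paper's pipeline; function names are ours.

```python
import numpy as np

def align_loss(mu, vZ):
    """L_align = sum_j ||mu_j - v_j^Z||."""
    return float(np.linalg.norm(mu - vZ, axis=1).sum())

def align_grad(mu, vZ):
    """Analytic gradient: dL_align/dmu_j = (mu_j - v_j^Z) / ||mu_j - v_j^Z||."""
    diff = mu - vZ
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)

def numeric_grad(f, mu, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at mu."""
    g = np.zeros_like(mu)
    for idx in np.ndindex(*mu.shape):
        m_plus, m_minus = mu.copy(), mu.copy()
        m_plus[idx] += eps
        m_minus[idx] -= eps
        g[idx] = (f(m_plus) - f(m_minus)) / (2 * eps)
    return g
```

Agreement between the two gradients (away from the non-differentiable point $\mu_j = v_j^Z$) confirms the derivation above.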

A.3 DATASETS

To show that our method works well with various kinds of datasets, we choose the following six image and vector datasets. Some example images are shown in Fig A1, and brief descriptions of the datasets are given in Tab A1.
• MNIST-full (LeCun et al., 1998): The MNIST-full dataset consists of 70,000 handwritten digits of 28 × 28 pixels. Each gray image is reshaped to a 784-dimensional vector.
• MNIST-test (LeCun et al., 1998): MNIST-test is the testing part of the MNIST dataset, which contains a total of 10,000 samples.
• USPS: The USPS dataset is composed of 9,298 gray-scale handwritten digit images with a size of 16 × 16 pixels.
• Fashion-MNIST (Xiao et al., 2017): The Fashion-MNIST dataset has the same number of images and the same image size as MNIST-full, but it is considerably more complicated. Instead of digits, it consists of various types of fashion products.
• REUTERS-10K: REUTERS (Lewis et al., 2004) is composed of around 810,000 English news stories labeled with a category tree. Four root categories (corporate/industrial, government/social, markets, and economics) are used as labels, and all documents with multiple labels are excluded. Following DEC (Xie et al., 2016), a subset of 10,000 examples is randomly sampled, and the tf-idf features on the 2,000 most frequent words are computed. The sampled dataset is denoted REUTERS-10K.
• HAR: HAR is a time-series dataset consisting of 10,299 sensor samples from a smartphone. It was collected from 30 people performing six different activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.
A.4 EVALUATION METRICS

The following notations are used in the definitions:
• $d_X(i, j)$: the pairwise distance between $x_i$ and $x_j$ in the input space $X$;
• $d_Z(i, j)$: the pairwise distance between $z_i$ and $z_j$ in the latent space $Z$;
• $\mathcal{N}_i^{k,X}$: the set of indices of the k-nearest neighbors (kNN) of $x_i$ in the input space $X$;
• $\mathcal{N}_i^{k,Z}$: the set of indices of the k-nearest neighbors (kNN) of $z_i$ in the latent space $Z$;
• $r_X(i, j)$: the rank of the closeness (in Euclidean distance) of $x_j$ to $x_i$ in the input space $X$;
• $r_Z(i, j)$: the rank of the closeness (in Euclidean distance) of $z_j$ to $z_i$ in the latent space $Z$.

The eight evaluation metrics are defined below:

(1) ACC (Accuracy) measures the accuracy of clustering:

$$ACC = \max_m \frac{\sum_{i=1}^{N} \mathbf{1}\{l_i = m(s_i)\}}{N}$$

where $l_i$ and $s_i$ are the true and predicted labels for data point $x_i$, respectively, and $m(\cdot)$ ranges over all possible one-to-one mappings between clusters and label categories.

(2) NMI (Normalized Mutual Information) calculates a normalized measure of similarity between two labelings of the same data:

$$NMI = \frac{I(l; s)}{\max\{H(l), H(s)\}}$$

where $I(l; s)$ is the mutual information between the real label $l$ and the predicted label $s$, and $H(\cdot)$ denotes entropy.

(3) RRE (Relative Rank Error) measures the average change in neighbor ranking between the two spaces $X$ and $Z$:

$$RRE = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left( MR_{X \to Z}^{k} + MR_{Z \to X}^{k} \right)$$

where $k_1$ and $k_2$ are the lower and upper bounds of the kNN, and

$$MR_{X \to Z}^{k} = \frac{1}{H_k} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,Z}} \frac{|r_X(i, j) - r_Z(i, j)|}{r_Z(i, j)}, \qquad MR_{Z \to X}^{k} = \frac{1}{H_k} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,X}} \frac{|r_X(i, j) - r_Z(i, j)|}{r_X(i, j)}$$

with the normalizing term $H_k = N \sum_{l=1}^{k} \frac{|N - 2l|}{l}$.

(4) Trust (Trustworthiness) measures to what extent the k nearest neighbors of a point are preserved when going from the input space to the latent space:

$$Trust = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left[ 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,Z},\, j \notin \mathcal{N}_i^{k,X}} \left( r_X(i, j) - k \right) \right]$$

where $k_1$ and $k_2$ are the bounds of the number of nearest neighbors.

(5) Cont (Continuity) is defined analogously to Trust, but checks to what extent neighbors are preserved when going from the latent space to the input space:

$$Cont = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left[ 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,X},\, j \notin \mathcal{N}_i^{k,Z}} \left( r_Z(i, j) - k \right) \right]$$

where $k_1$ and $k_2$ are the bounds of the number of nearest neighbors.

(6) d-RMSE (Root Mean Square Error) measures to what extent the two distributions of pairwise distances coincide:

$$d\text{-}RMSE = \sqrt{\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( d_X(i, j) - d_Z(i, j) \right)^2}$$

(7) LGD (Locally Geometric Distortion) measures how much corresponding distances between neighboring points differ between the two metric spaces and is the primary metric for isometry:

$$LGD = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \sqrt{\frac{\sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,X}} \left( d_X(i, j) - d_Z(i, j) \right)^2}{\sum_{i=1}^{N} \left| \mathcal{N}_i^{k,X} \right|}}$$

where $k_1$ and $k_2$ are the lower and upper bounds of the kNN.

(8) CRA (Cluster Rank Accuracy) measures the preservation of the ranks of cluster centers from the input space $X$ to the latent space $Z$:

$$CRA = \frac{1}{C^2} \sum_{i=1}^{C} \sum_{j=1}^{C} \mathbf{1}\left( r_X(v_i^X, v_j^X) = r_Z(v_i^Z, v_j^Z) \right)$$

where $C$ is the number of clusters, $v_j^X$ is the center of the $j$-th cluster in the input space $X$, $v_j^Z$ is the center of the $j$-th cluster in the latent space $Z$, and $r_X(v_i^X, v_j^X)$ and $r_Z(v_i^Z, v_j^Z)$ denote the rank of the closeness (in Euclidean distance) of $v_i$ to $v_j$ in the respective space.
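The ACC metric, with its maximum over one-to-one mappings $m(\cdot)$, can be sketched as follows. We brute-force over permutations, which is adequate for small $C$ (in practice the assignment is usually solved with the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`); this helper assumes equal numbers of clusters and classes.

```python
import numpy as np
from itertools import permutations

def clustering_acc(true_labels, pred_labels):
    """ACC = max over one-to-one cluster->class mappings m of mean(l_i == m(s_i)).

    Brute force over all permutations of the class set; O(C!) and therefore
    only suitable for small C.
    """
    classes = np.unique(true_labels)
    clusters = np.unique(pred_labels)
    best = 0.0
    for perm in permutations(classes):
        m = dict(zip(clusters, perm))       # candidate cluster -> class mapping
        acc = np.mean([m[s] == l for l, s in zip(true_labels, pred_labels)])
        best = max(best, float(acc))
    return best
```

For example, predictions that are a pure relabeling of the ground truth score 1.0, since some permutation undoes the relabeling.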

A.5 VISUALIZATION OF GENERALIZABILITY

The visualization results on the testing samples are shown in Fig A2; even for testing samples, our method still shows clear inter-cluster discriminability, while all the other methods, without exception, couple different clusters together. Taking the MNIST-test dataset as an example, we present the embedding visualization with the assumed number of clusters C set to 10, 11, and 12, respectively. We find that when C is larger than the number of true clusters (10), data originally belonging to the same cluster are split, e.g., one cluster is split into two, but the different categories of data still hold clear boundaries and are not mixed together, somewhat similar to hierarchical clustering.

A.8 QUANTITATIVE EVALUATION OF REPRESENTATION LEARNING

Our method is compared with the other five methods on six evaluation metrics over six datasets. The complete results in Tab A2 demonstrate the superiority of our method, especially on the RRE, Trust, Cont, and CRA metrics. As shown in Tab A2, DCRL outperforms all other methods, especially on the CRA metric, on which it is not only the best on all datasets but also reaches 1.0. This means that the "rank" between different manifolds in the latent space is completely preserved and undamaged, which demonstrates the effectiveness of our global ranking loss L_rank.

A.9 STATISTICAL ANALYSIS

Statistical analysis is presented to show the extent to which local and global structure is preserved from the input space to the latent space. Taking MNIST-full as an example, the statistical analysis of global rank preservation is shown in Fig A5 (a)-(f). For the i-th cluster, if the rank (in Euclidean distance) between it and the j-th cluster is preserved from the input space to the latent space, the grid cell in the i-th row and j-th column is marked blue, otherwise yellow. As shown in the figure, only our method fully preserves the global rank between different clusters, while all other methods fail.
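The rank-preservation check underlying the CRA metric can be sketched in a few lines of NumPy. This is our own minimal version; in particular, taking each cluster center as the per-cluster mean is an assumption, as the center estimator is not pinned down here:

```python
import numpy as np

def cra(X, Z, labels):
    # Cluster Rank Accuracy: fraction of center-to-center closeness
    # ranks that agree between input space X and latent space Z.
    cs = np.unique(labels)
    # Assumed: cluster centers are per-cluster means in each space.
    vX = np.stack([X[labels == c].mean(axis=0) for c in cs])
    vZ = np.stack([Z[labels == c].mean(axis=0) for c in cs])
    C = len(cs)

    def rank_mat(V):
        # rank of closeness of center j to center i (self = rank 0)
        D = np.linalg.norm(V[:, None] - V[None, :], axis=-1)
        r = np.empty((C, C), dtype=int)
        order = np.argsort(D, axis=1)
        for i in range(C):
            r[i, order[i]] = np.arange(C)
        return r

    return (rank_mat(vX) == rank_mat(vZ)).mean()
```

An embedding that merely rescales the data preserves all center ranks, so `cra(X, 2 * X, labels)` evaluates to 1.0, matching the perfect CRA scores reported for DCRL.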
Finally, we perform statistical analysis of the local isometry property of each algorithm. Each sample $x_i$ in the dataset forms point pairs with its neighboring samples, $\{(x_i, x_j) \mid i = 1, 2, \dots, N;\ x_j \in N_i^X\}$. We compute the change in the distance of these point pairs from the input space to the latent space, $\{d_Z(i, j) - d_X(i, j) \mid i = 1, 2, \dots, N;\ x_j \in N_i^X\}$, and plot it as a histogram. As shown in Fig A5 (g), the curve of DCRL is distributed on both sides of zero, with the maximum peak height and the minimum peak-bottom width, which indicates that DCRL achieves the best local isometry. Although IDEC claims to preserve the local structure well, there is still a large gap between its results and ours.

A.10 QUANTITATIVE EVALUATION OF DOWNSTREAM TASKS

Numerous deep clustering algorithms have recently claimed to obtain meaningful representations; however, they do not analyze or experiment with these so-called "meaningful" representations. Therefore, we are interested in whether these methods can indeed learn representations that are useful for downstream tasks. Four different classifiers, including a linear classifier (Logistic Regression; LR), two nonlinear classifiers (MLP, SVM), and a tree-based classifier (Random Forest Classifier; RFC), are used as downstream tasks, all with the default parameters and default implementations in sklearn (Pedregosa et al., 2011) for a fair comparison. The learned representations are frozen and used as input for training. The classification accuracy evaluated on the test set serves as the metric of the effectiveness of the learned representations. In Tab A3, DCRL outperforms the other methods overall on all six datasets with MLP, RFC, and LR as downstream tasks. Additionally, we surprisingly find that with MLP and RFC as downstream tasks, all methods other than DCRL do not even match the accuracy of the AE on the MNIST-full dataset.
Notably, DEC and IDEC show a sharp deterioration in performance on downstream tasks, falling short of even the simplest AE, again showing that a clustering-oriented loss can disrupt the geometry of the data.
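The downstream protocol above can be sketched as follows, using the sklearn defaults mentioned in the text; the train/test split and the function name are our own illustrative choices, not part of the released code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def downstream_accuracy(Z, y):
    # Z: frozen embeddings learned by a clustering method; y: true labels.
    # All classifiers use sklearn defaults, as in the paper's protocol;
    # the 80/20 split is an illustrative assumption.
    Ztr, Zte, ytr, yte = train_test_split(Z, y, test_size=0.2, random_state=0)
    scores = {}
    for name, clf in [("LR", LogisticRegression()),
                      ("MLP", MLPClassifier()),
                      ("SVM", SVC()),
                      ("RFC", RandomForestClassifier())]:
        clf.fit(Ztr, ytr)                 # train on frozen embeddings
        scores[name] = clf.score(Zte, yte)  # test-set accuracy
    return scores
```

A representation that preserves the data geometry should yield high accuracy even for the linear classifier, which is the intuition behind comparing LR against the nonlinear models.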

A.12 PARAMETER SENSITIVITY

We also evaluate the sensitivity of the parameters k and κ on the MNIST-test dataset; the results are shown in Tab A5. The parameters k and κ are found to have little effect on the clustering performance (ACC/NMI), and some combinations of k and κ even produce better clustering performance than the numbers reported in the main paper. However, the effect of k and κ on representation learning is more pronounced, and different combinations may increase or decrease performance. In general, this paper focuses on the design of the algorithm itself, and we have not performed a parameter search to find the best performance.



Since the cluster centers $\{\mu_j\}_{j=1}^{C}$ are learnable and updated in an iterative manner, we believe that a proper initialization is sufficient, and the exploration of initialization methods is beyond the scope of this paper.

https://cs.nyu.edu/~roweis/data.html



Figure 2: Force analysis of the contradiction between clustering and local structure preservation.

The training process can be roughly divided into three stages, as shown in Fig 3, to illustrate the training strategy more vividly. At first, four different manifolds overlap. At Stage 1, L_cluster dominates; data points within each manifold converge towards the cluster centers to form spheres, but the local structure of the manifolds is destroyed. At Stage 2, L_rank dominates; different manifolds in the latent space move away from each other, increasing the manifold margin and enhancing discriminability. At Stage 3, with L_LIS dominating, the manifolds gradually recover their original local structure from the spherical shape. It is worth noting that these losses coexist rather than acting completely independently at different stages, but the role played by each loss varies due to the alternating training and weight continuation.

Figure 3: Schematic of the training strategy. Four different colors and shapes represent four intersecting manifolds, and the three stages involve the clustering, separation, and structure recovery of manifolds.

4 EXPERIMENTS

4.1 EXPERIMENTAL SETUPS

In this section, the effectiveness of the proposed framework is evaluated on 6 benchmark datasets: MNIST-full, MNIST-test, USPS, Fashion-MNIST, REUTERS-10K, and HAR, on which our method is compared with the 9 other methods mentioned in Sec 2 under 8 evaluation metrics, including metrics designed specifically for clustering and for manifold representation learning. Brief descriptions of the datasets are given in Appendix A.3.

Figure 4: Visualization of the embeddings learned by different algorithms on MNIST-full dataset.

Figure A1: The image samples from three datasets (MNIST, USPS, and Fashion-MNIST)

Figure A2: The visualization of the obtained embeddings on the testing samples, showing the generalization performance of different algorithms on the MNIST-full dataset.

Figure A4: Clustering visualization with different assumed cluster numbers C on the MNIST-test dataset.

Figure A5: Statistical analysis of different algorithms to compare the capability of global and local structure preservation from the input space to the latent space.

Clustering performance (ACC/NMI) of different algorithms on six datasets.

Generalizability evaluated by ACC/NMI.

Performance for multi-manifold representation learning.

Ablation study of loss items and training strategies on MNIST-full dataset.

Description of Datasets.

Representation learning performance of different algorithms on five datasets.

Performance of different algorithms in downstream tasks.

Parameter sensitivity with different parameters k and κ on the MNIST-test dataset.

A.11 MORE ABLATION EXPERIMENTS

The results of the ablation experiments on the MNIST-full dataset were presented in Tab 4 in Sec 4.3. Here, we provide four more sets of ablation experiments on the other four datasets. The conclusions are similar (note that the clustering performance of the model without clustering-oriented losses is very poor, so the "best" metric numbers are not meaningful and are shown in gray): (1) CL is very important for obtaining good clustering. (2) SL is beneficial for both clustering and representation learning. (3) Our training strategies (WC and AT) are highly effective in improving metrics such as ACC, RRE, Trust, Cont, and CRA.

