SUBSPACE CLUSTERING VIA ROBUST SELF-SUPERVISED CONVOLUTIONAL NEURAL NETWORK

Abstract

Deep subspace clustering (SC) algorithms have recently gained attention due to their ability to handle nonlinearities in data. However, the insufficient capability of existing SC methods to deal with data corruption of unknown (arbitrary) origin hinders their generalization ability and their capacity to address real-world clustering problems. This paper proposes a robust formulation of the self-supervised convolutional subspace clustering network (S2ConvSCN) that incorporates a fully connected (FC) layer and, with an additional spectral clustering module, is capable of estimating the clustering error without using the ground truth. Robustness to data corruptions is achieved by using the correntropy induced metric (CIM) of the error, which also enhances the generalization capability of the network. Experimental findings show that CIM reduces sensitivity to overfitting during the learning process and yields better clustering results. In a truly unsupervised training environment, Robust S2ConvSCN outperforms its baseline version by a significant margin for both seen and unseen data on four well-known datasets.

1. INTRODUCTION

Subspace clustering approaches have achieved encouraging performance compared with clustering algorithms that rely on proximity measures between data points. The main idea behind the subspace model is that the data are drawn from low-dimensional subspaces embedded in a high-dimensional ambient space (Lodhi & Bajwa, 2018). Grouping such data according to their respective subspaces is known as subspace clustering (Vidal, 2011); that is, each low-dimensional subspace corresponds to a class or category. To date, two main approaches for recovering low-dimensional subspaces have been developed: models based on the self-representation property, and a nonlinear generalization of subspace clustering called union of subspaces (UoS) (Lodhi & Bajwa, 2018; Lu & Do, 2008; Wu & Bajwa, 2014; 2015). UoS algorithms are out of the scope of this work. Self-representation subspace clustering proceeds in two steps: (i) learning a representation matrix C from data X and building the corresponding affinity matrix A = |C| + |C^T|; (ii) clustering the data into k clusters by grouping the eigenvectors of the graph Laplacian matrix that correspond to the leading k eigenvalues. The second step is known as spectral clustering (Ng et al., 2002; Von Luxburg, 2007). Owing to the presumed subspace structure, the data points obey the self-expressiveness or self-representation property (Elhamifar & Vidal, 2013; Peng et al., 2016b; Liu et al., 2012; Li & Vidal, 2016; Favaro et al., 2011). In other words, each data point can be represented as a linear combination of the other points in the dataset: X = XC. The self-representation approach faces serious limitations on real-world datasets. One limitation relates to the linearity assumption: in a wide range of applications, samples lie in nonlinear subspaces, e.g., face images acquired under non-uniform illumination and different poses (Ji et al., 2017).
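The two-step self-representation pipeline above can be sketched as follows. This is a minimal illustration, not the method proposed in this paper: it uses a ridge-regularized least-squares surrogate for the self-expression program (the regularization weight `lam` and the omission of the diag(C) = 0 constraint are simplifying assumptions), followed by standard spectral clustering on A = |C| + |C^T|.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def self_representation_clustering(X, k, lam=1e-2):
    """Sketch of two-step self-representation subspace clustering.

    X: (d, n) data matrix whose columns are samples; k: number of clusters.
    Step (i): solve min ||X - XC||_F^2 + lam ||C||_F^2 in closed form
    (a simplifying surrogate for sparse/low-rank self-expression models).
    Step (ii): spectral clustering on the affinity A = |C| + |C^T|.
    """
    d, n = X.shape
    # Step (i): ridge-regularized self-expression, C = (G + lam I)^{-1} G
    G = X.T @ X
    C = np.linalg.solve(G + lam * np.eye(n), G)
    A = np.abs(C) + np.abs(C.T)                      # symmetric affinity matrix
    # Step (ii): normalized graph Laplacian L_sym = I - D^{-1/2} A D^{-1/2}
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-12))
    L_sym = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    # Eigenvectors of the k smallest Laplacian eigenvalues span the
    # cluster-indicator subspace; normalize rows and run k-means on them.
    V = eigvecs[:, :k]
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    _, labels = kmeans2(V, k, minit='++', seed=0)
    return labels
```

On clean data drawn from independent subspaces, C is (near) block diagonal, so the graph has one connected component per subspace and spectral clustering recovers the groups exactly.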
Standard practice for handling data from nonlinear manifolds is to use the kernel trick, mapping samples implicitly into a high-dimensional space in which they better conform to linear subspaces (Patel et al., 2013; Patel & Vidal, 2014; Xiao et al., 2015; Brbić & Kopriva, 2018). However, identifying an appropriate kernel function for a given dataset is quite a difficult task (Zhang et al., 2019b). The second limitation of existing deep SC methods relates to their assumption that the origin of data corruption is known, in which case a proper error model can be employed. In real-world applications the origin of data corruption is unknown, which can severely harm the learning process if a non-robust loss function is used. Furthermore, validation (i.e., stopping of the learning process) in most deep SC methods requires access to the ground-truth labels. That violates the basic principle of unsupervised machine learning and yields overly optimistic results. Dataset size is also a limitation in terms of memory requirements: since self-representation subspace clustering builds an affinity matrix, memory complexity grows as the square of the dataset size. However, this last limitation is not the main focus of this work. Motivated by the exceptional ability of deep neural networks to capture complex underlying structures of data and learn discriminative features for clustering (Hinton & Salakhutdinov, 2006; Dilokthanakul et al., 2016; Ghasedi Dizaji et al., 2017; Tian et al., 2014; Xie et al., 2016), deep subspace clustering approaches have emerged recently (Ji et al., 2017; Abavisani & Patel, 2018; Peng et al., 2016a; Yang et al., 2019; Zhou et al., 2018; Ji et al., 2019b; Peng et al., 2018; 2017; Zhou et al., 2019; Zhang et al., 2019a; Kheirandishfard et al., 2020).
In particular, it has been shown that convolutional neural networks (CNNs), when applied to images of different classes, can learn features that lie in a UoS (Lezama et al., 2018). Most recently developed deep subspace clustering networks are built on a convolutional autoencoder: an end-to-end fully convolutional network trained by minimizing the reconstruction error. Together, the autoencoder and an additional self-expression (SE) module form the deep subspace clustering network (DSCNet) (Ji et al., 2017). Hence, the total loss function of DSCNet is composed of the reconstruction loss and the SE model loss; that is, clustering quality is not taken into account during the learning process. The self-supervised convolutional SC network (S2ConvSCN) (Zhang et al., 2019a) addressed this issue by adding a fully connected (FC) module and a spectral clustering module that generate soft and pseudo-labels, respectively. Dual self-supervision is achieved by forcing these two modules to converge towards a consensus. The related accumulated loss therefore helps enhance the self-representation matrix and the quality of the features extracted in the encoder layers. The S2ConvSCN architecture also enables direct classification once the learning process is completed: a trained encoder and the FC module can form a new network that directly classifies unseen data, addressing what is known as the out-of-sample problem. However, while such a network can be validated and compared with other algorithms on a separate dataset, this ablation study was not performed. Furthermore, the main disadvantage of the DSCNet architecture, and indirectly of S2ConvSCN, is that network training is stopped when the accuracy is highest (Ji et al., 2019a). First, this directly violates the unsupervised learning principle, as the ground-truth labels are exposed.
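As a rough illustration of how a DSCNet-style objective combines its terms, the sketch below evaluates a reconstruction loss plus two self-expression terms on given arrays. The term names, the weights `lam1`/`lam2`, and the exact form of the regularizer are assumptions made for illustration; they are not the published DSCNet objective, which also constrains diag(C) = 0 and is minimized end to end through the autoencoder.

```python
import numpy as np

def dscnet_style_loss(X, X_rec, Z, C, lam1=1.0, lam2=1.0):
    """Illustrative DSCNet-style objective (weights/terms are assumptions).

    X: input data, X_rec: autoencoder reconstruction,
    Z: (m, n) latent features (columns are samples),
    C: (n, n) self-representation matrix learned by the SE module.
    """
    rec = 0.5 * np.sum((X - X_rec) ** 2)        # reconstruction loss
    se_reg = np.sum(C ** 2)                     # ||C||_F^2 regularizer on C
    se_fit = 0.5 * np.sum((Z - Z @ C) ** 2)     # self-expression fit ||Z - ZC||_F^2
    return rec + lam1 * se_reg + lam2 * se_fit
```

The key design point is that the self-expression term acts on the latent features Z rather than on the raw data X, which is what lets the autoencoder absorb nonlinearities before the linear self-representation model is applied.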
Second, the reported performance (Zhang et al., 2019a; Ji et al., 2017) is overly optimistic and cannot be fairly compared to that of other algorithms. Also, as noted in (Haeffele et al., 2020), most self-expression-based deep subspace clustering models require post-processing of the self-representation matrix. Compared to the baseline model, we significantly reduce this post-processing while keeping the matrix noise-free. These research problems led to the three main contributions of the proposed Robust S2ConvSCN:
• robustness to errors of unknown (arbitrary) origin, achieved by using the correntropy induced metric (CIM) in the self-expression loss;
• training via early stopping while monitoring only the accumulated loss;
• a training process that, thanks to the correntropy-based loss function, is less sensitive to data corruptions, which enables the network to generalize better.
This study also has three side contributions:
• model performance is estimated on unseen (out-of-sample) data;
• block-diagonal regularization of the self-representation matrix is integrated into the gradient-descent learning process;
• post-processing of the self-representation matrix is reduced to a significant extent.
A complete head-to-head comparison of the baseline S2ConvSCN model and our robust approach is given in Figure 1.
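The correntropy induced metric underlying the first contribution can be sketched as follows. The unnormalized Gaussian kernel, the value of `sigma`, and the reduction over the whole error matrix are assumptions made for illustration; the paper's loss may scale or apply the kernel differently.

```python
import numpy as np

def cim_loss(E, sigma=1.0):
    """Correntropy induced metric of an error array E (illustrative sketch).

    CIM^2(E, 0) = kappa(0) - mean(kappa(E)), with the Gaussian kernel
    kappa(e) = exp(-e^2 / (2 sigma^2)), so kappa(0) = 1. The metric is
    bounded in [0, 1): large (outlier) errors saturate instead of growing
    quadratically as they do under the squared L2 norm, which is what
    makes the loss robust to corruptions of arbitrary origin.
    """
    kappa = np.exp(-(E ** 2) / (2.0 * sigma ** 2))
    return np.sqrt(1.0 - np.mean(kappa))
```

Because each entry's contribution is capped, a few grossly corrupted samples cannot dominate the gradient, which is the intuition behind the reduced overfitting sensitivity reported in the abstract.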

