A CRITIQUE OF SELF-EXPRESSIVE DEEP SUBSPACE CLUSTERING

Abstract

Subspace clustering is an unsupervised clustering technique designed to cluster data that is supported on a union of linear subspaces, with each subspace defining a cluster with dimension lower than the ambient space. Many existing formulations for this problem are based on exploiting the self-expressive property of linear subspaces, whereby any point within a subspace can be represented as a linear combination of other points within the subspace. To extend this approach to data supported on a union of non-linear manifolds, numerous studies have proposed learning an embedding of the original data using a neural network that is regularized by a self-expressive loss on the embedded data, so as to encourage a union-of-linear-subspaces prior in the embedded space. Here we show that there are a number of potential flaws with this approach which have not been adequately addressed in prior work. In particular, we show that the model formulation is often ill-posed in that it can lead to a degenerate embedding of the data, which need not correspond to a union of subspaces at all and is poorly suited for clustering. We validate our theoretical results experimentally and also repeat prior experiments reported in the literature, where we conclude that a significant portion of the previously claimed performance benefits can be attributed to an ad hoc post-processing step rather than the deep subspace clustering model.

1. INTRODUCTION AND BACKGROUND

Subspace clustering is a classical unsupervised learning problem, where one wishes to segment a given dataset into a prescribed number of clusters, and each cluster is defined as a linear (or affine) subspace with dimension lower than the ambient space. There have been a wide variety of approaches proposed in the literature to solve this problem (Vidal et al., 2016), but a large family of state-of-the-art approaches are based on exploiting the self-expressive property of linear subspaces. That is, if a point lies in a linear subspace, then it can be represented as a linear combination of other points within the subspace. Based on this fact, a wide variety of methods have been proposed which, given a dataset $Z \in \mathbb{R}^{d \times N}$ of $N$ $d$-dimensional points, find a matrix of coefficients $C \in \mathbb{R}^{N \times N}$ by solving the problem:

$$\min_{C \in \mathbb{R}^{N \times N}} F(Z, C) \equiv \tfrac{1}{2}\|ZC - Z\|_F^2 + \lambda \theta(C) = \tfrac{1}{2}\langle Z^\top Z,\, (C-I)(C-I)^\top \rangle + \lambda \theta(C). \quad (1)$$

Here, the first term $\tfrac{1}{2}\|ZC - Z\|_F^2$ captures the self-expressive property by requiring every datapoint to represent itself as an approximate linear combination of other points, i.e., $Z_i \approx Z C_i$, where $Z_i$ and $C_i$ are the $i$th columns of $Z$ and $C$, respectively. The second term, $\theta(C)$, is some regularization function designed to encourage each data point to only select other points within the correct subspace in its representation and to avoid trivial solutions (such as $C = I$). Once the $C$ matrix has been solved for, one can then define a graph affinity between data points, typically based on the magnitudes of the entries of $C$, and use an appropriate graph-based clustering method (e.g., spectral clustering (von Luxburg, 2007)) to produce the final clustering of the data points.
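As a concrete illustration of (1), for the simple ridge-type choice $\theta(C) = \tfrac{1}{2}\|C\|_F^2$ the problem has a closed-form solution that depends on the data only through the Gram matrix $Z^\top Z$. The following numpy sketch (an illustrative toy, not the exact pipeline of any cited method) computes $C$ for points drawn from two orthogonal lines in $\mathbb{R}^3$ and builds a graph affinity from the magnitudes of its entries:

```python
import numpy as np

def self_expressive_coeffs(Z, lam=0.1):
    """Solve min_C 0.5*||Z C - Z||_F^2 + 0.5*lam*||C||_F^2 in closed form.

    Setting the gradient to zero gives (Z^T Z + lam*I) C = Z^T Z, so the
    solution depends on Z only through the Gram matrix, as noted for (1).
    """
    G = Z.T @ Z
    N = G.shape[0]
    return np.linalg.solve(G + lam * np.eye(N), G)

def affinity(C):
    """Symmetric graph affinity from coefficient magnitudes."""
    A = np.abs(C) + np.abs(C).T
    np.fill_diagonal(A, 0.0)
    return A

# Toy data: two 1-D subspaces (orthogonal lines) in R^3, five points each.
rng = np.random.default_rng(0)
b1 = np.array([1.0, 0.0, 0.0])
b2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)
Z = np.hstack([np.outer(b1, rng.uniform(1, 2, 5)),
               np.outer(b2, rng.uniform(1, 2, 5))])   # d = 3, N = 10
A = affinity(self_expressive_coeffs(Z, lam=0.1))
# Points only select other points from their own subspace here, so the
# affinity matrix is block diagonal and ready for spectral clustering.
```

Because the two lines are exactly orthogonal, the Gram matrix is block diagonal and the cross-subspace affinities vanish; a spectral clustering step on `A` would then recover the two clusters.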
One of the first methods to utilize this approach was Sparse Subspace Clustering (SSC) (Elhamifar & Vidal, 2009; 2013), where $\theta$ takes the form $\theta_{\mathrm{SSC}}(C) = \|C\|_1 + \delta(\mathrm{diag}(C) = 0)$, with $\|\cdot\|_1$ denoting the $\ell_1$ norm and $\delta$ an indicator function which takes value $\infty$ if an element of the diagonal of $C$ is non-zero and $0$ otherwise. By regularizing $C$ to be sparse, a point must represent itself using the smallest number of other points within the dataset, which in turn ideally requires a point to only select other points within its own subspace in the representation. Likewise, other variants, with Low-Rank Representation (LRR) (Liu et al., 2013), Low-Rank Subspace Clustering (LRSC) (Vidal & Favaro, 2014), and Elastic-net Subspace Clustering (EnSC) (You et al., 2016) being well-known examples, take the same form as (1) with different choices of regularization. For example, $\theta_{\mathrm{LRR}}(C) = \|C\|_*$ and $\theta_{\mathrm{EnSC}}(C) = \|C\|_1 + \tau \|C\|_F^2 + \delta(\mathrm{diag}(C) = 0)$, where $\|\cdot\|_*$ denotes the nuclear norm (sum of the singular values). A significant advantage of the majority of these methods is that it can be proven (typically subject to some technical assumptions regarding the angles between the underlying subspaces and the distribution of the sampled data points within the subspaces) that the optimal $C$ matrix in (1) will be "correct" in the sense that if $C_{i,j}$ is non-zero then the $i$th and $j$th columns of $Z$ lie in the same linear subspace (Soltanolkotabi & Candès, 2012; Lu et al., 2012; Elhamifar & Vidal, 2013; Soltanolkotabi et al., 2014; Wang et al., 2015; Wang & Xu, 2016; You & Vidal, 2015a; b; Yang et al., 2016; Tsakiris & Vidal, 2018; Li et al., 2018; You et al., 2019; Robinson et al., 2019), which has led to these methods achieving state-of-the-art performance in many applications.

1.1. SELF-EXPRESSIVE DEEP SUBSPACE CLUSTERING

Although subspace clustering techniques based on self-expression display strong empirical performance and provide theoretical guarantees, a significant limitation of these techniques is the requirement that the underlying dataset be approximately supported on a union of linear subspaces. This has led to a strong motivation to extend these techniques to more general datasets, such as data supported on a union of non-linear low-dimensional manifolds. From inspection of the right side of (1), one can observe that the only dependence on the data $Z$ comes in the form of the Gram matrix $Z^\top Z$. As a result, self-expressive subspace clustering techniques are amenable to the "kernel trick", where instead of taking an inner-product kernel between data points, one can instead use a general kernel $\kappa(\cdot, \cdot)$ (Patel & Vidal, 2014). Of course, such an approach comes with the traditional challenge of how to select an appropriate kernel so that the embedding of the data in the Hilbert space associated with the choice of kernel results in a union of linear subspaces. The first approach to propose learning an appropriate embedding of an initial dataset $X \in \mathbb{R}^{d_x \times N}$ (which does not necessarily have a union-of-subspaces structure) was given by Patel et al. (2013; 2015), who proposed first projecting the data into a lower-dimensional space via a learned linear projector, $Z = P_\ell X$, where $P_\ell \in \mathbb{R}^{d \times d_x}$ ($d < d_x$) is also optimized over in addition to $C$ in (1). To ensure that sufficient information about the original data $X$ is preserved in the low-dimensional embedding $Z$, the authors further required that the linear projector satisfy the constraint $P_\ell P_\ell^\top = I$ and added an additional term to the objective of the form $\|X - P_\ell^\top P_\ell X\|_F^2$. However, since the projector is linear, the approach is not well suited for non-linear manifolds, unless it is augmented with a kernel embedding, which again requires choosing a suitable kernel.

More recently, given the success of deep neural networks, a large number of studies Peng et al. (2017); Ji et al. (2017); Zeng et al. (2019b;a); Xie et al. (2020); Sun et al. (2019); Li et al. (2019); Yang et al. (2019); Jiang et al. (2019); Tang et al. (2018); Kheirandishfard et al. (2020b); Zhou et al. (2019); Jiang et al. (2018); Abavisani & Patel (2018); Zhou et al. (2018); Zhang et al. (2018; 2019b;a); Kheirandishfard et al. (2020a) have attempted to learn an appropriate embedding of the data (which ideally would have a union of linear subspaces structure) via a neural network, $\Phi_E(X, W_e)$, where $W_e$ denotes the parameters of a network mapping defined by $\Phi_E$, which takes a dataset $X \in \mathbb{R}^{d_x \times N}$ as input. In an attempt to encourage the embedding of the data, $\Phi_E(X, W_e)$, to have this union of subspaces structure, these approaches minimize a self-expressive loss term, with the form given in (1), on the embedded data, and a large majority of these proposed techniques can be
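Regardless of how the data matrix entering (1) is produced (the raw data, a kernel embedding, or the output of a network $\Phi_E$), the inner self-expressive problem with the SSC regularizer is an $\ell_1$-regularized least-squares problem with a zero-diagonal constraint, which can be solved by proximal gradient descent. Below is a minimal numpy sketch using ISTA; the step size rule and iteration count are illustrative choices, not taken from any cited method:

```python
import numpy as np

def ssc_coeffs(Z, lam=0.05, n_iter=500):
    """Sparse self-expression: min_C 0.5*||Z C - Z||_F^2 + lam*||C||_1, diag(C)=0.

    Solved with ISTA (proximal gradient). The diagonal is projected to zero
    after every step, enforcing the indicator term in theta_SSC.
    """
    N = Z.shape[1]
    C = np.zeros((N, N))
    L = np.linalg.norm(Z, 2) ** 2               # Lipschitz constant of the gradient
    step = 1.0 / L
    for _ in range(n_iter):
        grad = Z.T @ (Z @ C - Z)                # gradient of the smooth fit term
        C = C - step * grad
        # Soft-thresholding: proximal operator of the l1 penalty.
        C = np.sign(C) * np.maximum(np.abs(C) - step * lam, 0.0)
        np.fill_diagonal(C, 0.0)                # enforce diag(C) = 0
    return C

# Toy check: two orthogonal 1-D subspaces in R^3, four points each.
rng = np.random.default_rng(0)
b1 = np.array([1.0, 0.0, 0.0])
b2 = np.array([0.0, 1.0, 0.0])
Z = np.hstack([np.outer(b1, rng.uniform(1, 2, 4)),
               np.outer(b2, rng.uniform(1, 2, 4))])
C = ssc_coeffs(Z)
# The recovered C is block diagonal: each point represents itself using only
# points from its own subspace, which is the "correct" pattern described above.
```

In this orthogonal toy case the sparsity pattern of `C` is exactly block diagonal, matching the subspace-detection property that the cited recovery guarantees establish under much weaker angular conditions.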


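The ill-posedness alluded to in the abstract can be previewed numerically: the fit term of (1) scales quadratically with the norm of the embedding, so an encoder that merely shrinks its output drives the self-expressive loss toward zero for any $C$, with no union-of-subspaces structure required. The following toy check uses arbitrary random matrices and is an informal illustration of this scaling, not the formal argument:

```python
import numpy as np

def fit_term(Z, C):
    """Self-expressive fit term from (1): 0.5 * ||Z C - Z||_F^2."""
    return 0.5 * np.linalg.norm(Z @ C - Z, 'fro') ** 2

rng = np.random.default_rng(0)
Z = rng.standard_normal((16, 32))   # an arbitrary embedding, no subspace structure
C = rng.standard_normal((32, 32))   # an arbitrary coefficient matrix
np.fill_diagonal(C, 0.0)

# Scaling the embedding by alpha scales the fit term by alpha**2, while
# theta(C) is unchanged, so shrinking the embedding trivially lowers the loss.
for alpha in (1.0, 0.1, 0.01):
    print(alpha, fit_term(alpha * Z, C))
```

Since $\|\alpha Z (C - I)\|_F^2 = \alpha^2 \|Z (C - I)\|_F^2$, the printed values decrease by a factor of $100$ at each step, independent of any clustering structure in the data.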