Fusion over the Grassmann Manifold for Incomplete-Data Clustering

Abstract

This paper presents a new paradigm to cluster incomplete vectors using subspaces as proxies to exploit the geometry of the Grassmannian. We leverage this new perspective to develop an algorithm to cluster and complete data in a union of subspaces via a fusion penalty formulation. Our approach does not require prior knowledge of the number of subspaces, is naturally suited to handle noise, and only requires an upper bound on the subspaces' dimensions. In developing our model, we present local convergence guarantees. We describe clustering, completion, model selection, and sketching techniques that can be used in practice, and complement our analysis with synthetic and real-data experiments.

1. Introduction

Suppose we observe a subset of entries in a data matrix X whose columns lie near a union of subspaces, where the unobserved entries are marked with *. Our goals are (i) to complete the unobserved entries, (ii) to cluster the columns according to the subspaces, and (iii) to learn the underlying subspaces. For example, we should (i) obtain the following (ground-truth) completion:

$$
X \;=\;
\begin{bmatrix}
1 & -4 & 6 & 9 & 16 & -1 & 8 & -7 & 3 \\
1 & 5 & 4 & 14 & 16 & 5 & -7 & -18 & 10 \\
8 & 7 & 6 & -12 & 2 & 18 & -1 & 28 & -18 \\
2 & 1 & 5 & 4 & 11 & 4 & 1 & 0 & -1 \\
5 & 1 & 9 & 6 & 19 & 9 & 5 & 2 & -3 \\
8 & 1 & -1 & 7 & 3 & 14 & 9 & -13 & 8
\end{bmatrix},
$$

we should also (ii) cluster the columns of X into two groups, $\{x_1, x_2, x_6, x_7\}$ and $\{x_3, x_4, x_5, x_8, x_9\}$, and (iii) obtain bases for two 2-dimensional subspaces (given by any subset of linearly independent columns from each group). This problem is often known as high-rank matrix completion (HRMC) [24, 21] or as subspace clustering with missing data, and it has a wide range of applications, including tracking moving objects in computer vision [13, 14, 30, 31, 33, 35, 42], predicting target interactions for drug discovery [28, 44, 45, 49], and identifying groups in recommender systems [37, 66, 77]. While there exists theory detailing conditions under which the HRMC goals above are feasible (e.g., sufficient sampling and subspace genericity) [51], existing algorithms present a variety of shortcomings (more details in Section 2 below). The fundamental difficulty that all HRMC approaches face lies in assessing distances (e.g., Euclidean, or in the form of inner products) between partially observed vectors, for the simple reason that this requires overlapping observations, which become increasingly unlikely in low-sampling regimes [24].
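To make the example concrete, the following short NumPy sketch (the cluster index sets come from the text above) verifies that each of the two stated groups of columns spans a 2-dimensional subspace, so the completed matrix has rank 4 even though no single 2-dimensional subspace contains all of its columns:

```python
import numpy as np

# The completed matrix X from the worked example (columns x1..x9).
X = np.array([
    [1, -4,  6,   9, 16, -1,  8,  -7,   3],
    [1,  5,  4,  14, 16,  5, -7, -18,  10],
    [8,  7,  6, -12,  2, 18, -1,  28, -18],
    [2,  1,  5,   4, 11,  4,  1,   0,  -1],
    [5,  1,  9,   6, 19,  9,  5,   2,  -3],
    [8,  1, -1,   7,  3, 14,  9, -13,   8],
], dtype=float)

# Zero-based column indices of the two clusters stated in the text.
c1 = [0, 1, 5, 6]        # {x1, x2, x6, x7}
c2 = [2, 3, 4, 7, 8]     # {x3, x4, x5, x8, x9}

# Each cluster spans a 2-dimensional subspace; the union has rank 4.
print(np.linalg.matrix_rank(X[:, c1]))  # 2
print(np.linalg.matrix_rank(X[:, c2]))  # 2
print(np.linalg.matrix_rank(X))         # 4
```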
To circumvent this problem, we introduce a new paradigm to cluster incomplete vectors using subspaces as proxies, thus avoiding the need to calculate distances, inner products, or other notions of similarity between incomplete vectors, as other methods require. To this end, we assign each (incomplete-data) point its own (full-data) subspace, and simultaneously minimize over the Grassmann manifold: (a) the chordal distance between each point and its assigned subspace, to guarantee that the subspace stays near the observed entries, and (b) the geodesics between the subspaces of all data points, to encourage the subspaces of points that belong together to fuse (i.e., represent the same space). At the end of this minimization, clustering the proxy subspaces with standard procedures like k-means or spectral clustering [9, 23, 56, 61, 67, 71, 73] serves as a proxy for clustering the incomplete data (goal ii). The ability to cluster the subspaces rather than the incomplete data is the key strength we gain by moving to the Grassmannian. After clustering, the missing entries can be filled (goal i) using low-rank matrix completion. Once the data is clustered and completed, the underlying subspaces can be trivially inferred (goal iii) with a singular value decomposition. Local convergence guarantees follow easily from known manifold optimization results. We complement our theoretical results with experiments on both synthetic and real data that show the potential of the foundational fusion-over-the-Grassmann formulation.

Due to its broad applicability, HRMC has attracted considerable attention in recent years. Existing approaches can be divided into three main groups: generalizations from low-rank matrix completion (LRMC), generalizations from subspace clustering (SC), and methods specifically tailored for HRMC (see [38] for a recent survey).
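Both quantities in the minimization can be computed from the principal angles between subspaces. As a minimal illustration (not the paper's algorithm), the sketch below computes chordal and geodesic distances between two subspaces, each represented by an orthonormal basis, via the SVD of their cross-Gram matrix; the dimensions are arbitrary choices for the demo:

```python
import numpy as np

def principal_angles(U, V):
    """Principal angles between the subspaces spanned by the
    orthonormal columns of U and V (via the SVD of U^T V)."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def chordal_distance(U, V):
    # d_c = sqrt(sum_i sin^2(theta_i))
    return np.sqrt(np.sum(np.sin(principal_angles(U, V)) ** 2))

def geodesic_distance(U, V):
    # d_g = sqrt(sum_i theta_i^2): arc length on the Grassmannian
    return np.linalg.norm(principal_angles(U, V))

# Example: two 2-dimensional subspaces of R^6.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 2)))
V, _ = np.linalg.qr(rng.standard_normal((6, 2)))

print(chordal_distance(U, V), geodesic_distance(U, V))
print(chordal_distance(U, U))  # identical subspaces are at distance ~0
```

Since $\theta \ge \sin\theta$ on $[0, \pi/2]$, the geodesic distance always dominates the chordal distance; fused subspaces are exactly those at distance zero.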

2. Related Work

HRMC vs. LRMC. LRMC seeks to exactly recover the missing entries of a data matrix X whose columns lie in a single low-dimensional subspace [11]. One can view HRMC as a generalization of LRMC where the columns of X are known to lie in a union of subspaces (UoS), each of low dimension, but it is not known to which subspace each column belongs (see Figure 1). Research in LRMC over the last decades has resulted in theory and algorithms that guarantee perfect recovery under reasonable assumptions (e.g., random sampling and bounded coherence of the data) [10, 11, 15, 16, 29, 53]. Hence, given an HRMC problem, if the number of underlying subspaces, say K, and the maximum of their dimensions, say r, are low, one could be tempted to cast HRMC as an LRMC problem. In that case, the single subspace containing all the columns of X would have dimension no larger than r′ := rK. This would, however, completely ignore the union structure present in the data, and would therefore require more observed entries to complete X. To see this, note that each column must have more observed entries than the dimension of the subspace containing it [51]. This means that even in the fortunate case where r′ is low enough, using LRMC would require K times more observations than HRMC. This is especially prohibitive in applications such as metagenomics or drug discovery, where data is extremely sparse and costly to acquire. In general, r′ may be too large to even allow the use of LRMC.

HRMC vs. SC. SC aims to cluster the columns of a full-data matrix X according to a UoS that is not known a priori [22]. One can thus view HRMC as the generalization of SC to the case where data is missing (see Figure 1). There exists a vast repertoire of theory and algorithms that guarantee perfect clustering under reasonable assumptions (e.g., sufficient sampling and subspace separation) [68, 43, 65, 2, 59, 19].
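The rank inflation r′ = rK is easy to see numerically. In this sketch (the sizes n, r, K, and the number of columns per subspace are illustrative choices), columns drawn from K independent r-dimensional subspaces generically produce a matrix of rank rK, which is the rank an LRMC method would have to contend with:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, K, per = 50, 3, 4, 30  # ambient dim, subspace dim, #subspaces, cols each

# Columns drawn from K independent r-dimensional subspaces of R^n.
X = np.hstack([
    rng.standard_normal((n, r)) @ rng.standard_normal((r, per))
    for _ in range(K)
])

# Each block has rank r, but the union generically has rank r' = r*K,
# so casting the problem as LRMC must cope with the larger rank.
print(np.linalg.matrix_rank(X))  # 12
```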
A natural approach to HRMC is thus to fill the missing entries naively (with zeros, means, or LRMC) prior to clustering with a full-data method, like sparse subspace clustering [22, 40, 75]. Unfortunately, this approach may work when data is missing at a rate inversely proportional to the dimension of the subspaces [64], but it fails with moderate amounts of missing data, because naively filled data no longer lies in a union of subspaces [21].
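The failure mode of naive filling can be seen directly: zeroing out entries of a vector that lies in a subspace moves it off that subspace. In the sketch below (the dimensions and sampling rate are illustrative), the residual of the projection onto the true subspace jumps from essentially zero to a large value after zero-filling:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 50, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # basis of an r-dim subspace
x = U @ rng.standard_normal(r)                    # a column lying in it

# Zero-fill: pretend half the entries are missing and set them to 0.
x_filled = x.copy()
x_filled[rng.permutation(n)[: n // 2]] = 0.0

def residual(U, v):
    """Relative distance from v to the subspace spanned by U's columns."""
    return np.linalg.norm(v - U @ (U.T @ v)) / np.linalg.norm(v)

print(residual(U, x))         # ~0: the true column lies in the subspace
print(residual(U, x_filled))  # large: zero-filled data leaves the subspace
```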



Figure 1: HRMC is a generalization of principal component analysis (PCA), LRMC, and SC.

