Fusion over the Grassmann Manifold for Incomplete-Data Clustering

Abstract

This paper presents a new paradigm to cluster incomplete vectors using subspaces as proxies to exploit the geometry of the Grassmannian. We leverage this new perspective to develop an algorithm to cluster and complete data in a union of subspaces via a fusion penalty formulation. Our approach does not require prior knowledge of the number of subspaces, is naturally suited to handle noise, and only requires an upper bound on the subspaces' dimensions. We also establish local convergence guarantees for our model. We describe clustering, completion, model selection, and sketching techniques that can be used in practice, and complement our analysis with synthetic and real-data experiments.

1. Introduction

Suppose we observe a subset of entries in a data matrix X whose columns lie near a union of subspaces, with the unobserved entries marked with *. Our goals are (i) to complete the unobserved entries, (ii) to cluster the columns according to the subspaces, and (iii) to learn the underlying subspaces. In this example, we should (i) obtain the following (ground-truth) completion:

X = \begin{bmatrix}
1 & -4 & 6 & 9 & 16 & -1 & 8 & -7 & 3 \\
1 & 5 & 4 & 14 & 16 & 5 & -7 & -18 & 10 \\
8 & 7 & 6 & -12 & 2 & 18 & -1 & 28 & -18 \\
2 & 1 & 5 & 4 & 11 & 4 & 1 & 0 & -1 \\
5 & 1 & 9 & 6 & 19 & 9 & 5 & 2 & -3 \\
8 & 1 & -1 & 7 & 3 & 14 & 9 & -13 & 8
\end{bmatrix},

(ii) cluster the columns of X into two groups, {x_1, x_2, x_6, x_7} and {x_3, x_4, x_5, x_8, x_9}, and (iii) obtain bases for the two 2-dimensional subspaces (given by any subset of linearly independent columns from each group).

This problem is often known as high-rank matrix completion (HRMC) [24, 21] or as subspace clustering with missing data, and it has a wide range of applications, including tracking moving objects in computer vision [13, 14, 30, 31, 33, 35, 42], predicting target interactions for drug discovery [28, 44, 45, 49], and identifying groups in recommender systems [37, 66, 77]. While there exists theory detailing the conditions under which the HRMC goals above are feasible (e.g., sufficient sampling and subspace genericity) [51], existing algorithms suffer from a variety of shortcomings (more details in Section 2 below). The fundamental difficulty that all HRMC approaches face lies in assessing distances (e.g., Euclidean distances or inner products) between partially observed vectors, for
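As a sanity check on the running example, the claimed union-of-subspaces structure can be verified numerically: each of the two column groups should span a 2-dimensional subspace (rank 2), while the full matrix has rank 4. The sketch below uses NumPy rank computations for illustration only; it is not the paper's algorithm, which must work without access to the completed entries.

```python
import numpy as np

# Completed 6 x 9 data matrix X from the running example.
X = np.array([
    [1, -4,  6,   9, 16, -1,  8,  -7,   3],
    [1,  5,  4,  14, 16,  5, -7, -18,  10],
    [8,  7,  6, -12,  2, 18, -1,  28, -18],
    [2,  1,  5,   4, 11,  4,  1,   0,  -1],
    [5,  1,  9,   6, 19,  9,  5,   2,  -3],
    [8,  1, -1,   7,  3, 14,  9, -13,   8],
])

# Column clusters stated in the text (0-indexed here).
group1 = [0, 1, 5, 6]      # {x_1, x_2, x_6, x_7}
group2 = [2, 3, 4, 7, 8]   # {x_3, x_4, x_5, x_8, x_9}

# Each cluster spans a 2-dimensional subspace, so its submatrix has rank 2;
# the two subspaces are independent, so the full matrix has rank 2 + 2 = 4.
print(np.linalg.matrix_rank(X[:, group1]))  # 2
print(np.linalg.matrix_rank(X[:, group2]))  # 2
print(np.linalg.matrix_rank(X))             # 4

# A basis for each subspace: any two linearly independent columns of a group,
# e.g. (x_1, x_2) and (x_3, x_4), or orthonormal bases via the SVD.
U1 = np.linalg.svd(X[:, group1])[0][:, :2]
U2 = np.linalg.svd(X[:, group2])[0][:, :2]
```

The orthonormal bases U1 and U2 are exactly the kind of objects the paper treats as points on the Grassmannian.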

