Fusion over the Grassmann Manifold for Incomplete-Data Clustering

Abstract

This paper presents a new paradigm to cluster incomplete vectors using subspaces as proxies to exploit the geometry of the Grassmannian. We leverage this new perspective to develop an algorithm to cluster and complete data in a union of subspaces via a fusion penalty formulation. Our approach does not require prior knowledge of the number of subspaces, is naturally suited to handle noise, and only requires an upper bound on the subspaces' dimensions. In developing our model, we present local convergence guarantees. We describe clustering, completion, model selection, and sketching techniques that can be used in practice, and complement our analysis with synthetic and real-data experiments.

1. Introduction

Suppose we observe a subset of entries in a data matrix $X$ whose columns lie near a union of subspaces, for example a matrix with nine columns $x_1, \ldots, x_9$ in which the unobserved entries are marked with $*$. Our goals are (i) to complete the unobserved entries, (ii) to cluster the columns according to the subspaces, and (iii) to learn the underlying subspaces. In this example, we should (i) recover the ground-truth completion, (ii) cluster the columns of $X$ into two groups, $\{x_1, x_2, x_6, x_7\}$ and $\{x_3, x_4, x_5, x_8, x_9\}$, and (iii) obtain bases for two 2-dimensional subspaces (given by any subset of linearly independent columns from each group).

This problem is often known as high-rank matrix completion (HRMC) [24, 21] or as subspace clustering with missing data, and it has a wide range of applications, including tracking moving objects in computer vision [13, 14, 30, 31, 33, 35, 42], predicting target interactions for drug discovery [28, 44, 45, 49], and identifying groups in recommender systems [37, 66, 77]. While there exists theory detailing conditions under which the HRMC goals above are feasible (e.g., sufficient sampling and subspace genericity) [51], existing algorithms present a variety of shortcomings (more details in Section 2 below). The fundamental difficulty that all HRMC approaches face lies in assessing distances (e.g., Euclidean, or in the form of inner products) between partially observed vectors, for the simple reason that this requires overlapping observations, which become increasingly unlikely in low-sampling regimes [24]. To circumvent this problem, we introduce a new paradigm to cluster incomplete vectors, using subspaces as proxies, thus avoiding the need to calculate distances, inner products, or other notions of similarity between incomplete vectors, as other methods require.
To this end, we assign each (incomplete-data) point its own (full-data) subspace, and simultaneously minimize over the Grassmann manifold: (a) the chordal distance between each point and its assigned subspace, to guarantee that the subspace stays near the observed entries, and (b) the geodesic distances between the subspaces of all data points, to encourage the subspaces from points that belong together to fuse (i.e., represent the same space). At the end of this minimization, the proxy subspaces can be clustered using standard procedures like k-means or spectral clustering [9, 23, 56, 61, 67, 71, 73], which serves as a proxy for clustering the incomplete data (goal ii). The ability to cluster the subspaces rather than the incomplete data is the key strength we gain by moving to the Grassmannian. After clustering, the missing entries can be filled (goal i) using low-rank matrix completion. Once the data is clustered and completed, the underlying subspaces can be inferred (goal iii) with a singular value decomposition. Local convergence guarantees follow from known manifold optimization results. We complement our theoretical results with experiments on both synthetic and real data that show the potential of the foundational fusion-over-the-Grassmann formulation.

[Figure 1 diagram labels: PCA, SC, Missing Data, HRMC (This paper)]

Figure 1: HRMC is a generalization of principal component analysis (PCA), LRMC, and SC.

Due to its broad applicability, HRMC has attracted considerable attention in recent years. Existing approaches can be divided into three main groups: generalizations from low-rank matrix completion (LRMC), generalizations from subspace clustering (SC), and methods specifically tailored for HRMC (see [38] for a recent survey).

HRMC vs LRMC. LRMC seeks to exactly recover the missing entries of a data matrix $X$ whose columns lie in a single low-dimensional subspace [11]. One can view HRMC as a generalization of LRMC, where the columns of $X$ are known to lie in a union of subspaces (UoS), each of low dimension, but it is not known to which subspace each column belongs (see Figure 1). Research in LRMC over the last decades has resulted in theory and algorithms that guarantee perfect recovery under reasonable assumptions (e.g., random sampling and bounded coherence of the data) [10, 11, 15, 16, 29, 53]. Hence, given an HRMC problem, if the number of underlying subspaces, say $K$, and the maximum of their dimensions, say $r$, are low, one could be tempted to cast HRMC as an LRMC problem. In that case, the single subspace containing all the columns of $X$ would have dimension no larger than $r' := r \cdot K$. This would, however, completely ignore the union structure present in the data, and therefore require more observed entries in order to complete $X$. We can see this by noting that each column must have more observed entries than the dimension of the subspace containing it [51]. This means that even in the fortunate case where $r'$ is low enough, using LRMC would require $K$ times more observations than HRMC. This is especially prohibitive in applications such as metagenomics or drug discovery, where data is extremely sparse and costly to acquire. In general, $r'$ may be too large to even allow the use of LRMC.

HRMC vs SC.
SC aims to cluster the columns of a full-data matrix $X$ according to a UoS that is not known a priori [22]. One can thus view HRMC as the generalization of SC to the case where data is missing (see Figure 1). There exists a vast repertoire of theory and algorithms that guarantee perfect clustering under reasonable assumptions (e.g., sufficient sampling and subspace separation) [68, 43, 65, 2, 59, 19]. Hence, a natural approach to HRMC is to fill missing entries naively (with zeros, means, or LRMC) prior to clustering with a full-data method, like sparse subspace clustering [22, 40, 75]. This approach may work if data is missing at a rate inversely proportional to the dimension of the subspaces [64], but it fails with moderate volumes of missing data, as data filled naively no longer lies in a union of subspaces [21].

Tailored HRMC algorithms. Algorithms specifically designed to solve the HRMC problem can be further divided into the following subgroups: (1) neighborhood methods that cluster points according to their overlapping coordinates [24], (2) alternating methods, like EM [50], k-subspaces [7], group-lasso [52, 54], S$^3$LR [39], or MCOS [41], (3) liftings, which exploit the second-order algebraic structure of unions of subspaces [68, 70, 48, 26, 27], and (4) integer programming [57]. Neighborhood methods require either abundant observations or a super-polynomial number of samples (to produce enough overlaps). Liftings require squaring the dimension of an already high-dimensional problem, which severely limits their applicability. Integer programming approaches are similarly restricted to small data. To summarize, while much research has been devoted to HRMC, current algorithms have shortcomings, and little is known regarding their theoretical guarantees.

Our work in context.
Among the methods discussed above, the approach of this paper is perhaps closest in principle to [47], which uses a similar Grassmannian optimization model to study the single-subspace problem of LRMC. This paper generalizes these ideas to the much harder multiple-subspace problem of HRMC, while maintaining the local convergence guarantees of Proposition 5.1 in [47]. The main difference between [47] and our formulation is that the former only considers a predefined subset of geodesic distances (see equations (17)-(19) in [47]), which determine the Grassmannian points that must be matched. In [47], these subsets of geodesics can be chosen somewhat arbitrarily, because in LRMC all points belong to the same subspace. A so-called gossip protocol is therefore suitable in the easier problem of LRMC. In contrast, HRMC requires that only certain subsets of the Grassmannian be matched (the points corresponding to the unknown clustering to be learnt). Without knowing the correct clusters a priori, one cannot use the gossip method and must therefore use all pairwise geodesics so as not to introduce bias.

Note: Appendices A-E contain a review of the mathematical background involved in our formulation.

3. Model and Main Results

Let $x_1, \ldots, x_n \in \mathbb{R}^m$ lie near a union of subspaces with dimension upper bounded by $r$. Let $x_i^\Omega \in \mathbb{R}^{|\Omega_i|}$ denote the observed entries of $x_i$, indexed by $\Omega_i \subset \{1, \ldots, m\}$. We propose assigning to each observed vector $x_i^\Omega$ a proxy subspace $\mathcal{U}_i := \operatorname{span}(U_i)$. Our goal is to estimate the true subspace $\mathcal{U}_i^\star$ to which $x_i$ belongs by (a) enforcing that the proxy space $\mathcal{U}_i$ contains a possible completion of $x_i^\Omega$ and (b) minimizing the distance between individual proxy spaces $\mathcal{U}_i$ and $\mathcal{U}_j$ to build consensus. This is done via the following optimization problem, where the first term achieves goal (a) and the second term achieves goal (b):

$$\min_{U_1, \ldots, U_n \in S(m,r)} \; \sum_{i=1}^{n} d_c^2(x_i^\Omega, U_i) + \frac{\lambda}{2} \sum_{i,j=1}^{n} d_g^2(U_i, U_j), \tag{1}$$

where

$$d_c(x_i^\Omega, U_i) := \sqrt{1 - \sigma_1^2(X_i^{0T} U_i)} \quad \text{and} \quad d_g(U_i, U_j) := \sqrt{\sum_{\ell=1}^{r} \arccos^2 \sigma_\ell(U_i^T U_j)}.$$

Here $S(m, r)$ denotes the Stiefel manifold of $m \times r$ orthonormal matrices, $\lambda \geq 0$ is a regularization parameter, $\sigma_\ell(\cdot)$ denotes the $\ell$-th largest singular value, and $X_i^0$ is the orthonormal matrix spanning all the possible completions of a non-zero $x_i^\Omega$. The space of all possible completions of $x_i^\Omega$ is therefore $\mathcal{X}_i^0 := \operatorname{span}(X_i^0)$, which clearly contains the true data $x_i$.

The matrix $X_i^0$ can be constructed as follows. If $x_i^\Omega = 0$, then $X_i^0 = I$, the identity matrix. Otherwise, $X_i^0$ is the $m \times (m - |\Omega_i| + 1)$ matrix formed with $x_i^\Omega$ normalized and filled with zeros in the unobserved rows, concatenated with the $(m - |\Omega_i|)$ canonical vectors indicating the unobserved rows of $x_i^\Omega$. For example, if $x_i^\Omega \neq 0$ is observed in the first $|\Omega_i|$ rows, then

$$X_i^0 = \begin{bmatrix} \dfrac{x_i^\Omega}{\|x_i^\Omega\|} & 0 \\ 0 & I_{m - |\Omega_i|} \end{bmatrix},$$

where the top block has $|\Omega_i|$ rows and the bottom block has $m - |\Omega_i|$ rows. When $x_i^\Omega$ is fully observed, $X_i^0$ simplifies to $x_i$ normalized.

Recall that Grassmannians $G(m, r)$ are quotient spaces of Stiefel manifolds $S(m, r)$ by the action of the orthogonal group of $r \times r$ orthonormal matrices. Since both terms $d_c(x_i^\Omega, U_i)$ and $d_g(U_i, U_j)$ are invariant under this quotient, the objective function in (1) does not depend on the choice of basis, and descends to a function on the Grassmannian.

Figure 2: The semi-spheres represent the Grassmannian $G(m, r)$, where each point $U_i$ represents a subspace (in the particular case of $G(3, 1)$, the line going from the origin to $U_i$). Left: Intuitively, the chordal distance $d_c(x_i^\Omega, U_i)$ is an informal measure of distance between the subspace $U_i$ and an incomplete point $x_i^\Omega$. The left image should only be taken as intuition, since $X_i^0$ may not live on the same Grassmannian and the chordal distance should not be thought of as a geodesic distance. Center: The geodesic distance $d_g(U_i, U_j)$ measures the distance over the Grassmannian between $U_i$ and $U_j$. Right: The Euclidean gradient vector $\nabla_i$ falls out of the Grassmann manifold; to account for the Grassmannian curvature, each geodesic step needs to be adjusted according to (4).
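The construction of $X_i^0$ and the chordal term can be illustrated with a minimal NumPy sketch. The function names `completion_basis` and `chordal_distance` and the toy vectors are hypothetical choices made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def completion_basis(x, observed):
    """Orthonormal basis X0 for all completions of a partially observed x.

    x        : length-m array (values at unobserved positions are ignored)
    observed : length-m boolean mask, True where x is observed
    """
    x = np.asarray(x, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    m = x.size
    x_obs = np.where(observed, x, 0.0)   # zero-fill the unobserved rows
    norm = np.linalg.norm(x_obs)
    if norm == 0.0:
        return np.eye(m)                 # convention used in the text for x^Omega = 0
    canonical = np.eye(m)[:, ~observed]  # one canonical vector per unobserved row
    return np.column_stack([x_obs / norm, canonical])

def chordal_distance(X0, U):
    """d_c = sqrt(1 - sigma_1^2(X0^T U)) for an orthonormal m x r basis U."""
    sigma_1 = np.linalg.svd(X0.T @ U, compute_uv=False)[0]
    return float(np.sqrt(max(0.0, 1.0 - sigma_1 ** 2)))

# Toy check in R^5 with the proxy subspace span{e1, e2}.
U = np.eye(5)[:, :2]
mask = np.array([True, True, True, False, False])
inside = np.array([1.0, 2.0, 0.0, np.nan, np.nan])   # has a completion in span(U)
outside = np.array([0.0, 0.0, 1.0, np.nan, np.nan])  # no completion lies in span(U)
d_in = chordal_distance(completion_basis(inside, mask), U)
d_out = chordal_distance(completion_basis(outside, mask), U)
```

In this toy case `d_in` is numerically zero because the proxy subspace contains a completion of the first vector, while `d_out` equals one because every completion of the second vector is orthogonal to the proxy subspace.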

Why should this work?

The chordal distance $d_c(x_i^\Omega, U_i)$, as defined in [72], is not a formal distance on the Grassmannian, but rather measures how far $U_i$ is from containing a possible completion of $x_i^\Omega$. More precisely, $d_c(x_i^\Omega, U_i)$ is the sine of the angle between the nearest completion of $x_i^\Omega$ and the $r$-plane $U_i$. If the top singular value $\sigma_1(X_i^{0T} U_i)$ is 1, then $X_i^0$ and $U_i$ intersect on at least a line, meaning that the proxy space $U_i$ contains a possible completion of $x_i^\Omega$. While merely forcing $U_i$ to contain a possible completion offers no way to distinguish one possible completion from another, consensus among data is built as different proxies $U_i$ and $U_j$ are forced towards one another by the geodesic term. In other words, $U_i$ and $U_j$ are allowed to be near each other, and hence form clusters, only if they both contain possible completions of both $x_i^\Omega$ and $x_j^\Omega$. The term chordal distance used in this way is adopted from [72] and should not be confused with the more common chordal distance between points on the Grassmannian [18]. See Figure 2 to build some intuition.

Solving (1). The gradients of (1) with respect to $U_i$ over the Grassmannian are given by:

$$\nabla d_c^2(x_i^\Omega, U_i) = -2\,\sigma_1(X_i^{0T} U_i)\,(I - U_i U_i^T)\, v_i w_i^T, \tag{2}$$

$$\nabla d_g^2(U_i, U_j) = -\sum_{\ell=1}^{r} \frac{2 \arccos \sigma_\ell(U_i^T U_j)}{\sqrt{1 - \sigma_\ell^2(U_i^T U_j)}} \,(I - U_i U_i^T)\, v_{ij}^{\ell} w_{ij}^{\ell T}, \tag{3}$$

where $v_i$ and $w_i$ are the leading left and right singular vectors of $X_i^0 X_i^{0T} U_i$, and $v_{ij}^\ell$, $w_{ij}^\ell$ are the $\ell$-th left and right singular vectors of $U_j U_j^T U_i$. The key behind these expressions is that tangent vectors on the Grassmannian can be computed as projections of gradient vectors in Euclidean space [1, 20]. In fact, that is exactly what the gradient expressions in (2) and (3) are: $-2\,\sigma_1(X_i^{0T} U_i)\, v_i w_i^T$ in (2) is the gradient of $d_c^2(x_i^\Omega, U_i)$ with respect to the entries of $U_i$ in Euclidean space.
The multiplication by $I - U_i U_i^T$ takes the horizontal direction of the tangent vector with respect to the quotient, thus mapping the gradient from Euclidean space to the Grassmannian. The same is true for the term $I - U_i U_i^T$ in (3). To summarize, (2) and (3) are gradient directions in the manifold of subspaces, rather than matrices: (2) is the steepest direction along which the subspace $U_i$ can be descended to a potential subspace that contains $x_i^\Omega$, and (3) is the steepest direction along which the subspace $U_i$ can be descended to the subspace $U_j$. The derivations of these gradients are in Appendix E. Putting together the chordal and geodesic gradients, the overall gradient with respect to $U_i$ is given by

$$\nabla_i := \nabla d_c^2(x_i^\Omega, U_i) + \frac{\lambda}{2} \sum_{j=1}^{n} \nabla d_g^2(U_i, U_j).$$

Observe that $U_i - \eta \nabla_i$ falls out of the Grassmannian for every step size $\eta \neq 0$ (see Figure 2). To adjust for the curvature of the manifold, the update after taking a geodesic step of size $\eta$ over the Grassmannian in the direction of $-\nabla_i$ is given by equation (2.65) in [20], which in our context reduces to:

$$U_i \leftarrow \begin{bmatrix} U_i E_i & \Gamma_i \end{bmatrix} \begin{bmatrix} \operatorname{diag} \cos(\eta \Upsilon_i) \\ \operatorname{diag} \sin(\eta \Upsilon_i) \end{bmatrix} E_i^T, \tag{4}$$

where $\Gamma_i \Upsilon_i E_i^T$ is the compact singular value decomposition of $-\nabla_i$. In our implementation, we use Armijo step sizes, given by $\eta = \beta^\nu \eta_0$, where $\eta_0 > 0$ and $\beta, \gamma \in (0, 1)$ are the Armijo tuning parameters related to the initial step size and step search granularity [1], and $\nu$ is the smallest non-negative integer such that

$$\sum_{i=1}^{n} \Big[ f_i(U_i) - f_i\big(R_{U_i}(\beta^\nu \eta_0 (-\nabla_i))\big) \Big] \geq -\gamma \sum_{i=1}^{n} \big\langle \nabla_i, \, \beta^\nu \eta_0 (-\nabla_i) \big\rangle,$$

where $f_i(U_i)$ is the component of the objective function (1) holding $i$ fixed and ranging over $j$, and $R_{U_i}(\Delta)$ performs the geodesic step described by (4) in the direction of $\Delta$.

Convergence guarantees. One advantage of our approach is that we can use standard techniques as in [47] to obtain local convergence guarantees like the following:

Proposition 1. Let $\{(U_1, U_2, \ldots, U_n)\}$ be a sequence of iterates generated by the geodesic steps given by equation (4) with Armijo step sizes $\eta$ as defined above. Then the sequence will converge to a critical point of (1).

Algorithm 1: Accelerated Line Search (ALS)
Require: Riemannian manifold $M$; continuously differentiable scalar field $f$ on $M$; retraction $R$ from $TM$ to $M$; scalars $\eta_0 > 0$, $c, \beta, \gamma \in (0, 1)$.
Input: Initial iterate $U_0 \in M$.
Output: Sequence of iterates $\{U_t\}$.
for $t = 0, 1, 2, \ldots$ do
    Pick $\Delta_t \in T_{U_t} M$ such that the sequence of tangent vectors $\{\Delta_t\}$ is gradient-related.
    Select $U_{t+1}$ such that
        $f(U_t) - f(U_{t+1}) \geq c\,\big(f(U_t) - f(R_{U_t}(\eta_t \Delta_t))\big)$,    (5)
    where $\eta_t$ is the Armijo step size for the given $\eta_0, \beta, \gamma, \Delta_t$.
end

Proof. It suffices to show the (rather technical) fact that the gradient steps in (4) are an instance of Accelerated Line Search (ALS) given in [1] and outlined in Algorithm 1, where the product manifold $G^n$ will serve as the Riemannian manifold $M$, the tangent space of which is the Cartesian product of the tangent spaces of each constituent $G$. To see this, let $TM$ denote the tangent bundle of (set of all tangent vectors to) $M$, and let $T_U M$ denote the tangent space of $M$ at $U \in M$. In our case, $U$ is the tuple $(U_1, \ldots, U_n)$, and equation (4) serves as the retraction $R_{U_i}$ on each component, so that $R_U = (R_{U_1}, \ldots, R_{U_n})$. One can verify that this is indeed a retraction by recognizing (4) as the exponential map $\operatorname{Exp} : T_U G \to G$ and noting that, on a Riemannian manifold, the exponential map is a retraction, and that the product of exponential maps is again an exponential map [1]. For our sequence of gradient-related tangent vectors, we use the negative gradient, which is clearly always gradient-related. The gradient on the product manifold is the Cartesian product of the gradients on each constituent manifold, i.e., $\nabla(f) = (\nabla(f_1), \ldots, \nabla(f_n))$.
Moreover, the inner product on the tangent space is the sum of the inner products on the constituent tangent spaces. Therefore, if $\{\Delta_{i,t}\}$, with $\Delta_{i,t} \in T_{U_{i,t}} M_i$, is gradient-related for each $M_i$, then $\{(\Delta_{1,t}, \ldots, \Delta_{n,t})\}$ is gradient-related on the product manifold. Furthermore, setting $U_{t+1} = R_{U_t}(\eta_t \Delta_t)$ satisfies the bound in (5) with $c = 1$. Thus Proposition 1 follows as a consequence of Theorem 4.3.1 and Corollary 4.3.2 in [1].
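To make the chordal gradient (2) and the geodesic update (4) concrete, the following NumPy sketch applies one step to a single proxy subspace. It is a simplified illustration under stated assumptions: it uses a fixed step size rather than the Armijo search, it ignores the geodesic term, and the names `chordal_sq`, `chordal_gradient`, and `geodesic_step` are hypothetical.

```python
import numpy as np

def chordal_sq(X0, U):
    """Squared chordal term d_c^2 = 1 - sigma_1^2(X0^T U)."""
    sigma_1 = np.linalg.svd(X0.T @ U, compute_uv=False)[0]
    return 1.0 - sigma_1 ** 2

def chordal_gradient(X0, U):
    """Grassmannian gradient of d_c^2, following equation (2)."""
    V, s, Wt = np.linalg.svd(X0 @ (X0.T @ U), full_matrices=False)
    v, w = V[:, :1], Wt[:1, :].T            # leading left/right singular vectors
    euclidean = -2.0 * s[0] * (v @ w.T)     # gradient in Euclidean space
    return euclidean - U @ (U.T @ euclidean)  # multiply by (I - U U^T)

def geodesic_step(U, grad, eta):
    """Geodesic update (4) along -grad with step size eta."""
    Gamma, Upsilon, Et = np.linalg.svd(-grad, full_matrices=False)
    cos_part = (U @ Et.T) @ np.diag(np.cos(eta * Upsilon)) @ Et
    sin_part = Gamma @ np.diag(np.sin(eta * Upsilon)) @ Et
    return cos_part + sin_part

# Toy check on G(3, 1): a fully observed unit vector at angle theta from span(U).
theta = 1.0
U = np.array([[1.0], [0.0], [0.0]])
X0 = np.array([[np.cos(theta)], [np.sin(theta)], [0.0]])
before = chordal_sq(X0, U)
U_new = geodesic_step(U, chordal_gradient(X0, U), eta=0.1)
after = chordal_sq(X0, U_new)
```

In this one-dimensional example the step rotates the line spanned by `U` towards the data point, so `after` is smaller than `before`, and `U_new` still has orthonormal columns, which is the property that the plain Euclidean update $U_i - \eta \nabla_i$ lacks.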

4. Fusion in Practice

Clustering, completion, and subspace inference. Recall that the solution to (1) provides an estimator $U_i$ of $U_i^\star$, the true subspace from which $x_i$ is drawn. After solving (1), one may form the matrix $D$ whose $(i, j)$-th entry is given by $d_g(U_i, U_j)$ and use it as input to any distance-based clustering method, such as k-means [9, 23, 56], spectral clustering [61, 67, 71, 73], or DBSCAN [25, 32, 55]. While prior knowledge of the number of subspaces $K$ may be required for some clustering methods (e.g., k-means or spectral clustering), it is not required at all to solve (1). Hence, by choosing a clustering method that does not require knowing $K$ (e.g., DBSCAN), our approach can be applied to situations where $K$ is unknown. After clustering, one can agglomerate all the data points corresponding to the $k$-th cluster in the same matrix $X_k^\Omega$, and run any low-rank matrix completion (LRMC) algorithm (e.g., [6, 8, 10, 11, 12, 36, 53, 69]) to estimate its completion $\hat{X}_k$. Finally, one can run principal component analysis (PCA) [46, 74] on $\hat{X}_k$ to recover an estimated basis $\hat{U}_k$ of the $k$-th underlying subspace $U_k^\star$.

Penalty parameter and model selection. Intuitively, the chordal term in (1) forces each subspace to be close to its assigned data point, and the geodesic term forces subspaces from different data points to be close to one another. The tradeoff between these two quantities is determined by the penalty parameter $\lambda \geq 0$. If $\lambda = 0$, then the geodesic term is ignored and there is a trivial solution where each subspace exactly contains its assigned data point (thus attaining the minimum, zero, for the chordal distance). If $\lambda > 0$, the geodesic term forces subspaces from different data points to get closer, even if they no longer contain exactly their assigned data points. As $\lambda$ grows, subspaces get closer and closer (see Figure 3).
The extreme case ($\lambda = \infty$) forces all subspaces to fuse into one (to attain zero in the second term), allowing only one subspace to explain all data, which is the equivalent of PCA in the complete-data case, and LRMC if data is missing. In other words, PCA and LRMC are the special cases of our formulation with $\lambda = \infty$. Ultimately, the effect of $\lambda$ will be reflected in the distance matrix $D$, which in turn determines the number of clusters. The smaller $\lambda$, the more clusters, up to the extreme where each data point is in its own cluster. Conversely, the larger $\lambda$, the fewer clusters, up to the extreme where all points are clustered together. More subspaces yield higher accuracy, but also more degrees of freedom (overfitting). To determine the best $\lambda$, one can compute a goodness-of-fit measure, like the minimum effective dimension [33, 68], that quantifies the tradeoff between accuracy and degrees of freedom. Similarly, we can iteratively increase $r$ in (1) to find all the data points that lie in 1-dimensional subspaces, then all the data points that lie in 2-dimensional subspaces, and so on (pruning the data at each iteration). This will result in an estimate of the number of subspaces $K$, and their dimensions.

Initialization. In our implementation we initialize (1) with a solution to the problem when $\lambda = 0$, i.e., when each subspace perfectly contains the observed entries of its assigned data point. To this end, for each $i$ we first construct an $m \times r$ matrix whose first column is equal to $x_i^\Omega$ in its observed entries, and whose remaining entries are filled with standard normal entries, known to produce incoherent and uniformly distributed subspaces [24]. This matrix is then orthonormalized to produce the initial estimate $U_i$, which, by construction, contains $x_i^\Omega$, thus producing $d_c(x_i^\Omega, U_i) = 0$.

Computational complexity. We point out that the main caveat of our approach is its quadratic complexity in the number of samples.
Fortunately, subspace clustering allows a simple approach to sketching both samples and features [62]. That is, one may solve (1) with a subset of $n' \leq n$ columns, and a subset of $m' \leq m$ rows (e.g., those with most observations), resulting in an improved complexity, quadratic in $n'$ as opposed to $n$. With the solution of (1), one can use a clustering method, an LRMC algorithm, and PCA, as described above, to produce subspace estimates $\hat{U}_1, \ldots, \hat{U}_{K'}$, with $K' \leq K$. Each of the remaining $n - n'$ incomplete data points $x_i^\Omega$ that were not used to solve (1) and that have more than $r$ observations (a fundamental requirement of subspace clustering [51]) can be assigned to the subspace estimate producing the largest projection coefficient

$$\theta_i^k = \big(\hat{U}_k^{\Omega T} \hat{U}_k^{\Omega}\big)^{-1} \hat{U}_k^{\Omega T} x_i^\Omega,$$

where $\hat{U}_k^\Omega \in \mathbb{R}^{|\Omega_i| \times r}$ denotes the restriction of $\hat{U}_k$ to the observed rows of $x_i^\Omega$ (notice that $\hat{U}_k^{\Omega T} \hat{U}_k^{\Omega}$ is invertible for almost every rank-$r$ $\hat{U}_k$ whenever $|\Omega_i| > r$ [51]). If $x_i^\Omega$ is assigned to $\hat{U}_k$, its completion can be estimated as $\hat{x}_i = \hat{U}_k \theta_i^k$. All the data points $x_i^\Omega$ that are too far from all of the subspace estimates (equivalently, the data points whose coefficients are smaller than a pre-determined parameter) can be used to solve (1) again for a refined clustering.
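The assignment of a held-out point to a subspace estimate can be sketched as follows. This is an illustrative sketch under assumptions: the coefficient is computed with `np.linalg.lstsq` instead of an explicit inverse, the "largest projection coefficient" is interpreted as the largest Euclidean norm of $\theta_i^k$, and the function name `assign_and_complete` and the toy bases are hypothetical.

```python
import numpy as np

def assign_and_complete(x, observed, bases):
    """Assign an incomplete point to one of the subspace estimates and complete it.

    x        : length-m array (values at unobserved positions are ignored)
    observed : length-m boolean mask with more than r True entries
    bases    : list of m x r orthonormal bases (the subspace estimates)
    Returns (index of the chosen basis, completed length-m vector).
    """
    observed = np.asarray(observed, dtype=bool)
    x_obs = np.asarray(x, dtype=float)[observed]
    best_k, best_theta = None, None
    for k, U_hat in enumerate(bases):
        U_obs = U_hat[observed, :]  # restriction to the observed rows
        # Least-squares coefficient of x_obs on the restricted basis.
        theta, *_ = np.linalg.lstsq(U_obs, x_obs, rcond=None)
        # Assumption: compare coefficients through their Euclidean norm.
        if best_theta is None or np.linalg.norm(theta) > np.linalg.norm(best_theta):
            best_k, best_theta = k, theta
    return best_k, bases[best_k] @ best_theta

# Toy example in R^4 with two 1-dimensional subspace estimates.
bases = [np.full((4, 1), 0.5),
         np.array([[0.5], [-0.5], [0.5], [-0.5]])]
x = np.array([3.0, 3.0, np.nan, 3.0])
mask = np.array([True, True, False, True])
k_hat, x_hat = assign_and_complete(x, mask, bases)
```

Here the point is assigned to the first basis and the missing third entry is filled in as 3, since the observed entries are consistent with the direction $(1, 1, 1, 1)/2$. A residual-based rule would be an alternative design choice; the text above only specifies the coefficient-based rule.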

5. Experiments

In this section we present a series of experiments on real and synthetic data, in particular the Hopkins155 dataset [63], and the Smartphone dataset for Human Activity Recognition in Ambient Assisted Living (AAL) [4]. Rather than establishing a new state of the art, these experiments are intended to serve as a proof of concept, showing the potential of our approach, which in this first introduction and basic formulation performs comparably to prominent methods [75]. In our experiments we initialize (1) as described in Section 4, with $r$ fixed, since it is known a priori in both the simulations and the real datasets. We do not specify $K$, and we make no special adjustments to handle noise, as none are required by our approach. The attained solution to (1) is used as input to spectral clustering [61, 67, 71, 73] (though, as described in Section 4, other clustering algorithms, such as k-means [9, 23, 56] or DBSCAN [25, 32, 55], could be used). We measure accuracy in terms of clustering error, given by

$$\min_{M} \; \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{M(\hat{y}_i) \neq y_i\},$$

where $\mathbb{1}$ denotes the indicator function, and $M$ is a function that maps the estimated cluster labels $\hat{y} \in \{1, \ldots, K\}^n$ assigned to $x_1^\Omega, \ldots, x_n^\Omega$, to the true labels $y \in \{1, \ldots, K\}^n$.

Baseline comparisons and full-data sub-optimality. A recent survey [38] shows that most state-of-the-art algorithms for HRMC (including MC+SSC [75], EM [50], GSSC [52], k-subspaces [7], and more) have similar performance, with varying winners on specific scenarios depending on the gap between subspace and ambient dimension, the fraction of missing data, and the number of subspaces. Based on this recent survey [38], and others [52], we chose ZF+SSC and SSC-EWZF as baselines, which have been shown in [38] to perform nearly identically to MC+SSC, EM, GSSC, k-subspaces, and k-GROUSE [7] in the scenarios discussed in our paper.

Synthetic data. In all our simulations we first generate $K$ matrices $U_k^\star \in \mathbb{R}^{m \times r}$ with i.i.d. $N(0, 1)$ entries, to use as bases of the true subspaces. For each $k$ we generate a matrix $\Theta_k^\star \in \mathbb{R}^{r \times n_k}$, also with i.i.d. $N(0, 1)$ entries, to use as coefficients of the columns in the $k$-th subspace. We then form $X$ as the concatenation $[U_1^\star \Theta_1^\star, U_2^\star \Theta_2^\star, \ldots, U_K^\star \Theta_K^\star]$. To induce missing data we sample each entry independently with probability $p$. Figure 4 shows the clustering results as a function of the sampling rate for a variety of settings, tuning the parameter $\lambda$ manually. Notice that, even with this first formulation, we perform comparably to existing methods, and even better in some cases, especially in low-sampling regimes.

Object tracking in Hopkins 155. This dataset contains 155 videos of $K = 2$ or $K = 3$ moving objects, such as checkerboards, vehicles, and pedestrians. In each video, a collection of $n$ mark points is tracked through all frames. The locations over time of the $i$-th point are stacked to produce $x_i \in \mathbb{R}^m$, so that the points corresponding to the same object lie near a low-dimensional subspace [60, 35] ($r$ varies from video to video, from 1 to 3). In all cases we fixed the penalty parameter $\lambda$ to 1. To induce missing data (e.g., produced by occlusions) we sample each entry independently with probability $p$. Figure 5 shows the clustering results.

Human activity recognition in Smartphone AAL Dataset. This dataset contains $n = 5744$ instances, each with $m = 561$ features related to pre-processed accelerometer and gyroscope time series and summary statistics [3], related to $K = 2$ activities: walking, and other movements, each approximated by a subspace of dimension $r = 4$. Recall that the complexity of (1) is quadratic in $n$, so solving it directly on this dataset would be computationally unmanageable. However, using the sketching techniques described in Section 4, it can be solved quite efficiently.
In particular, we only used $m' = 158$ features (related to the accelerometer's and gyroscope's minimum, maximum, standard deviation, and mean parameters over time), and $n' = 100$ samples selected at random, evenly distributed among classes. In all cases we fixed the penalty parameter $\lambda$ to $10^{-5}$. The results are summarized in Figure 5. Notice that our approach outperforms existing methods in the low-sampling regime.

Lastly, we note the disparity in performance between our model and existing algorithms when the missing data rate is low. The main motivation for our approach is incomplete data. While our formulation can certainly be used with full data, we acknowledge that it would be overkill, and consequently suboptimal in that scenario, which has been extensively studied. Hence, it is not surprising that methods tailored for full data outperform ours in that setting. However, no full-data method outperforms ours when data is missing.

6. Future Directions and Challenges

The main formulation presented in this paper is a non-convex optimization that relies on the simultaneous interactions of many terms. A proper analysis of the model is therefore challenging. One difficulty is the complex geometry of the zero-set of the chordal distance term, or more precisely, the intersections of many of these zero-sets, one for each column of data. While intuition suggests that the model is encouraged by the geodesic term to find regions of "dense" intersection, and therefore build consensus, a more precise formulation of this intuition has evaded us. This is further highlighted in Figures 4 and 5 by the fact that the performance of our model decreases as $K$ grows, indicating that it is not currently understood how the combination of the chordal and geodesic terms encourages consensus amongst so many cross-cluster terms.
It is our belief that a well-designed weighted version of (1), such as

$$\min_{U_1, \ldots, U_n \in S(m,r)} \; \sum_{i=1}^{n} d_c^2(x_i^\Omega, U_i) + \frac{\lambda}{2} \sum_{i,j=1}^{n} w_{ij}\, d_g^2(U_i, U_j), \tag{6}$$

where the weights $w_{ij} \geq 0$ quantify how much attention is given to each penalty, is key to unlocking better performance and understanding of the model. Our immediate future work will focus on investigating options for these weights, such as inverse distance functions, or k-nearest neighbors, known to dramatically improve the performance, computational complexity, and tolerance to $K$ of fusion formulations in Euclidean space [5, 17, 58, 76].
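As a rough illustration of the two weight families mentioned above, the sketch below builds k-nearest-neighbor and inverse-distance weights $w_{ij}$ from a matrix of pairwise distances. It is purely hypothetical: the text leaves the choice of weights open, and the function names, the symmetrization rule, and the `eps` safeguard are assumptions made for this example.

```python
import numpy as np

def knn_weights(D, k):
    """Symmetric 0/1 weights: w_ij = 1 if j is among the k nearest points to i, or vice versa."""
    n = D.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])
        neighbors = [j for j in order if j != i][:k]  # skip the point itself
        W[i, neighbors] = 1.0
    return np.maximum(W, W.T)                         # symmetrize

def inverse_distance_weights(D, eps=1e-8):
    """Weights w_ij = 1 / (D_ij + eps) with a zero diagonal (eps avoids division by zero)."""
    W = 1.0 / (D + eps)
    np.fill_diagonal(W, 0.0)
    return W

# Toy distance matrix: four points on a line forming two well-separated pairs.
positions = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(positions[:, None] - positions[None, :])
W_knn = knn_weights(D, k=1)
W_inv = inverse_distance_weights(D)
```

With `k=1`, only the pairs (0, 1) and (2, 3) receive nonzero weight, so the corresponding weighted penalty in (6) would drop every cross-pair term. In practice `D` could hold the geodesic distances $d_g(U_i, U_j)$ at the current iterate, which is again an assumption rather than a prescription from the text.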

7. Conclusions

This paper presents a new paradigm for clustering incomplete data points using subspaces as proxies to leverage the geometry of the Grassmannian. This new perspective enables clustering and completion of data in a union of subspaces. This work should be understood as a first introduction to the idea of fusion penalties in the Grassmann manifold for the problem of high-rank matrix completion. Rather than establishing our approach as the state of the art, our experiments are intended to serve as a proof of concept, showing that there is potential in our approach, in the hope of igniting future work in several directions, such as the study of the weighted versions described in (6), the choice of penalty parameters, and variants robust to outliers.

B Principal Angles and Singular Values

Recall the notion of principal angles between subspaces: let $U \in S(m, p)$ and $V \in S(m, q)$ be orthonormal bases for two arbitrary subspaces of $\mathbb{R}^m$. Assume, without loss of generality, that $1 \leq p \leq q \leq m$. The principal angles between $\operatorname{span}(U)$ and $\operatorname{span}(V)$ are defined via the following construction. Let $u_1 \in \operatorname{span}(U)$ and $v_1 \in \operatorname{span}(V)$ be unit vectors such that $|u_1^T v_1|$ is maximal. Inductively, let $u_k \in \operatorname{span}(U)$ and $v_k \in \operatorname{span}(V)$ be unit vectors such that $u_k^T u_j = 0$ and $v_k^T v_j = 0$ for all $1 \leq j < k$ and $|u_k^T v_k|$ is maximal. The principal angles are defined as $\alpha_k = \arccos(u_k^T v_k)$ for all $k = 1, 2, \ldots, p$.

This constructive definition is too cumbersome to use in practice. We opt for the following alternative computation via the singular value decomposition. Let $u_1, \ldots, u_p$ and $v_1, \ldots, v_q$ be the columns of $U$ and $V$ respectively. Compute the singular value decomposition $U^T V = \bar{U} \operatorname{diag}[\ldots, \sigma_i, \ldots] \bar{V}^T$. Set $U' = U \bar{U}$ and $V' = V \bar{V}$ and denote their columns by $u'_i$ and $v'_j$ respectively. Observe that $\operatorname{span}(U) = \operatorname{span}(U')$ and $\operatorname{span}(V) = \operatorname{span}(V')$, and furthermore that $U'^T V' = \operatorname{diag}[\ldots, \sigma_i, \ldots]$, that is,

$$u_i'^T v'_j = \begin{cases} \sigma_i & i = j, \\ 0 & i \neq j. \end{cases}$$

The vectors $u'_i$ and $v'_j$ correspond to those in the constructive definition. We therefore have that the $i$-th principal angle $\alpha_i$ relates to the $i$-th singular value via $\sigma_i = \cos \alpha_i$.
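The SVD-based computation above translates directly into a few lines of NumPy. The following is a minimal sketch; the function name `principal_angles` and the toy subspaces are illustrative choices.

```python
import numpy as np

def principal_angles(U, V):
    """Principal angles between span(U) and span(V), with the aligned bases U', V'.

    U : m x p matrix with orthonormal columns
    V : m x q matrix with orthonormal columns, p <= q
    """
    U_bar, sigma, V_bar_T = np.linalg.svd(U.T @ V, full_matrices=False)
    # Clip guards against singular values that exceed 1 by rounding error.
    angles = np.arccos(np.clip(sigma, -1.0, 1.0))
    return angles, U @ U_bar, V @ V_bar_T.T

# Two planes in R^3 that share the e1 axis and are tilted by 45 degrees.
s = 1.0 / np.sqrt(2.0)
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
V = np.array([[1.0, 0.0], [0.0, s], [0.0, s]])
angles, U_aligned, V_aligned = principal_angles(U, V)
```

For these two planes the principal angles are $0$ and $\pi/4$, and `U_aligned.T @ V_aligned` is the diagonal matrix of cosines, matching the relation $\sigma_i = \cos \alpha_i$.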

C Chordal Distance

The chordal distance between points on the Grassmannian $G(m, r)$, introduced and studied in [18], is defined via $\sqrt{\sum_{i=1}^{r} \sin^2 \alpha_i}$, where the $\alpha_i$ are the principal angles between the points described above. The authors of [72] introduce a notion of distance between a partially observed vector $x^\Omega \in \mathbb{R}^m$ and a subspace $U \in G(m, r)$ via a formulation closely related to the chordal distance, which they give the same name. Let $X^0$ denote the orthonormal matrix spanning all possible completions of $x^\Omega$: if $x^\Omega = 0$, then $X^0 = I$, the identity matrix. Otherwise, $X^0$ is the $m \times (m - |\Omega| + 1)$ matrix formed with $x^\Omega$ normalized and filled with zeros in the unobserved rows, concatenated with the $(m - |\Omega|)$ canonical vectors indicating the unobserved rows of $x^\Omega$. Let $\sigma_1(X^{0T} U)$ denote the largest singular value of $X^{0T} U$. Then

$$d_c(x^\Omega, U) = \sin \alpha_1 = \sqrt{1 - \sigma_1^2(X^{0T} U)}.$$

This metric is studied in [72]. Of particular importance is the following fact, stated as Theorem 2 in [72]: the preimage of 0 under $d_c^2(x^\Omega, \cdot)$ is the closure of the preimage of 0 under $f_F(x^\Omega, \cdot)$, where

$$f_F(x^\Omega, U) = \min_{w \in \mathbb{R}^r} \|x^\Omega - P_\Omega(U w)\|_F^2,$$

and $P_\Omega$ denotes projection onto the entries indexed by $\Omega$. That is, $f_F$ is the Frobenius-norm residual, which is often used to search for subspaces $U$ consistent with data. This Frobenius-norm residual may not be continuous, whereas the chordal distance is continuous and differentiable.

D Geodesic Distance on the Grassmannian

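The geodesic distance $d_g(U_i, U_j)$ is derived from the intrinsic geometry of the Grassmann manifold and depends on the metric which defines the manifold structure. Let $\gamma : [a, b] \to \mathcal{M}$ be a curve on a general Riemannian manifold $(\mathcal{M}, g)$ with metric $g$. Then the length of $\gamma$ is defined as [34]

$$L(\gamma) = \int_a^b \sqrt{g(\dot{\gamma}(t), \dot{\gamma}(t))} \, dt.$$

The canonical metric for the Grassmann manifold coincides with the Euclidean metric inherited from $O(m)$: $g_c(\dot{U}_i, \dot{U}_j) = g_e(\dot{U}_i, \dot{U}_j) = \mathrm{tr}(\dot{U}_i^T \dot{U}_j)$ [20]. To compute the geodesic distance, we therefore require knowledge of the geodesic segment connecting $U_i$ and $U_j$ with respect to the metric $g_c$. This is described in Lemma 1 of [72]: let $V_i \Sigma V_j^T$ be the singular value decomposition of the matrix $U_i^T U_j$, and denote the $\ell$-th singular value by $\sigma_\ell = \cos \alpha_\ell$. Set $\bar{U}_i = U_i V_i$ and $\bar{U}_j = U_j V_j$, and note that $\bar{U}_i^T \bar{U}_j = \Sigma$. Then the geodesic with respect to $g_c$ from $U_i$ to $U_j$ is given by $U(t)$, $0 \le t \le 1$, where

$$U(t) = [\bar{U}_i, G] \begin{bmatrix} \mathrm{diag}([\ldots, \cos \alpha_\ell t, \ldots]) \\ \mathrm{diag}([\ldots, \sin \alpha_\ell t, \ldots]) \end{bmatrix} V_i^T,$$

and the columns of $G = [\ldots, g_\ell, \ldots] \in S(m, r)$ are defined as

$$g_\ell = \begin{cases} \dfrac{\bar{U}_{j,:\ell} - \sigma_\ell \bar{U}_{i,:\ell}}{\| \bar{U}_{j,:\ell} - \sigma_\ell \bar{U}_{i,:\ell} \|} & \text{if } \sigma_\ell \ne 1, \\ 0 & \text{if } \sigma_\ell = 1. \end{cases}$$

Here, the subscript $:\ell$ denotes the $\ell$-th column of the corresponding matrix. We therefore have

$$\dot{U}(t) = [\bar{U}_i, G] \begin{bmatrix} \mathrm{diag}([\ldots, -\alpha_\ell \sin \alpha_\ell t, \ldots]) \\ \mathrm{diag}([\ldots, \alpha_\ell \cos \alpha_\ell t, \ldots]) \end{bmatrix} V_i^T.$$

Denote $S = \mathrm{diag}([\ldots, -\alpha_\ell \sin \alpha_\ell t, \ldots])$ and $C = \mathrm{diag}([\ldots, \alpha_\ell \cos \alpha_\ell t, \ldots])$. Then

$$\dot{U}^T \dot{U} = V_i [S, C] \begin{bmatrix} \bar{U}_i^T \\ G^T \end{bmatrix} [\bar{U}_i, G] \begin{bmatrix} S \\ C \end{bmatrix} V_i^T = V_i [S, C] \begin{bmatrix} \bar{U}_i^T \bar{U}_i & \bar{U}_i^T G \\ G^T \bar{U}_i & G^T G \end{bmatrix} \begin{bmatrix} S \\ C \end{bmatrix} V_i^T = V_i (S^2 + C^2) V_i^T = V_i \, \mathrm{diag}([\ldots, \alpha_\ell^2, \ldots]) \, V_i^T,$$

where we use the fact that $\bar{U}_i, G \in S(m, r)$, and that $\bar{U}_i^T G = 0$ [72]. Recall that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, hence $\mathrm{tr}(\dot{U}^T \dot{U}) = \mathrm{tr}(\mathrm{diag}([\ldots, \alpha_\ell^2, \ldots])) = \sum_\ell \alpha_\ell^2$. We therefore have

$$L(U(t)) = \int_0^1 \sqrt{\sum_\ell \alpha_\ell^2} \, dt = \sqrt{\sum_\ell \alpha_\ell^2}.$$

Recalling that $\sigma_\ell = \cos \alpha_\ell$, we finally have

$$d_g(U_i, U_j) = \sqrt{\sum_{\ell=1}^{r} \arccos^2 \sigma_\ell(U_i^T U_j)}.$$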
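The geodesic path and the resulting distance can be checked numerically with a short NumPy sketch. The function names, the tolerance `1e-12` used to detect $\sigma_\ell = 1$, and the random test subspaces are all assumptions made for this illustration.

```python
import numpy as np

def geodesic_distance(Ui, Uj):
    """d_g = sqrt(sum_l arccos^2 sigma_l(Ui^T Uj)) for orthonormal Ui, Uj."""
    sigma = np.linalg.svd(Ui.T @ Uj, compute_uv=False)
    alpha = np.arccos(np.clip(sigma, 0.0, 1.0))
    return np.sqrt(np.sum(alpha**2))

def geodesic_path(Ui, Uj, t):
    """Point U(t) on the geodesic from span(Ui) (t = 0) to span(Uj) (t = 1)."""
    Vi, sigma, VjT = np.linalg.svd(Ui.T @ Uj)
    sigma = np.clip(sigma, 0.0, 1.0)
    alpha = np.arccos(sigma)
    Ubar_i, Ubar_j = Ui @ Vi, Uj @ VjT.T
    # Column l of D is Ubar_j[:, l] - sigma_l * Ubar_i[:, l].
    D = Ubar_j - Ubar_i * sigma
    norms = np.linalg.norm(D, axis=0)
    G = np.zeros_like(D)
    nonzero = norms > 1e-12                  # g_l = 0 when sigma_l = 1
    G[:, nonzero] = D[:, nonzero] / norms[nonzero]
    return (Ubar_i * np.cos(alpha * t) + G * np.sin(alpha * t)) @ Vi.T

# Two lines in R^2 separated by 30 degrees: the distance is pi / 6.
line_a = np.array([[1.0], [0.0]])
line_b = np.array([[np.cos(np.pi / 6)], [np.sin(np.pi / 6)]])
d_lines = geodesic_distance(line_a, line_b)

# Random 2-dimensional subspaces of R^5: the path should end at span(Uj).
rng = np.random.default_rng(0)
Ui, _ = np.linalg.qr(rng.standard_normal((5, 2)))
Uj, _ = np.linalg.qr(rng.standard_normal((5, 2)))
U_end = geodesic_path(Ui, Uj, 1.0)
reaches_target = np.allclose(U_end @ U_end.T, Uj @ Uj.T)
```

Subspaces are compared through their projectors $UU^T$ because the endpoint $U(1)$ equals $U_j$ only up to an $r \times r$ rotation of the basis.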

E Gradients on the Grassmannian

In this section, we derive the expressions in (2) and (3) that govern the fusion steps of our formulation. For a function $F(U)$ defined on the Grassmannian, the gradient of $F$ at $U$ is given by equation (2.70) in [20], which we record here:

$$\nabla F = F_U - U U^T F_U,$$

where $F_U$ is the matrix whose entries are given by $[F_U]_{ij} = \partial F / \partial U_{ij}$.

Chordal gradient. To obtain the gradient of the chordal distance $d_c^2(x_i^{\Omega}, U_i)$ presented in (2), consider the partial derivative with respect to the $(a, b)$-th element of $U_i$:

$$\frac{\partial d_c^2(x_i^{\Omega}, U_i)}{\partial [U_i]_{ab}} = \frac{\partial}{\partial [U_i]_{ab}} \left( 1 - \sigma_1^2(X_i^{0T} U_i) \right) = -2 \sigma_1(X_i^{0T} U_i) \frac{\partial \sigma_1(X_i^{0T} U_i)}{\partial [U_i]_{ab}}.$$

To obtain the partial derivative of the leading singular value $\sigma_1$, observe that $X_i^0 X_i^{0T} U_i$ and $X_i^{0T} U_i$ share singular values: if $X_i^{0T} U_i = V \Sigma W^T$, then $X_i^0 X_i^{0T} U_i = (X_i^0 V) \Sigma W^T$ with $X_i^0 V \in S(m, r)$, so the result is a compact singular value decomposition. Recall that $v_i$ and $w_i$ denote the leading left and right singular vectors of $X_i^0 X_i^{0T} U_i$. Since $v_i^T X_i^0 X_i^{0T} U_i w_i = \sigma_1$, we have that

$$\frac{\partial \sigma_1}{\partial [U_i]_{ab}} = \frac{\partial v_i^T}{\partial [U_i]_{ab}} X_i^0 X_i^{0T} U_i w_i + v_i^T X_i^0 X_i^{0T} \frac{\partial U_i}{\partial [U_i]_{ab}} w_i + v_i^T X_i^0 X_i^{0T} U_i \frac{\partial w_i}{\partial [U_i]_{ab}}.$$

The first and third terms are zero. Indeed, $w_i$ is the leading right singular vector of $X_i^0 X_i^{0T} U_i$, so $(X_i^0 X_i^{0T} U_i) w_i = \sigma_1 v_i$, which implies

$$\frac{\partial v_i^T}{\partial [U_i]_{ab}} (X_i^0 X_i^{0T} U_i) w_i = \sigma_1 \frac{\partial v_i^T}{\partial [U_i]_{ab}} v_i,$$

and $\frac{\partial v_i^T}{\partial [U_i]_{ab}} v_i = 0$, as seen by differentiating both sides of $v_i^T v_i = 1$ (and similarly for the third term). To compute the second term, note that $v_i \in \mathrm{span}(X_i^0)$, since it is a column of $X_i^0 V$, whose columns all lie in $\mathrm{span}(X_i^0)$. Now, $(v_i^T X_i^0 X_i^{0T})^T = X_i^0 X_i^{0T} v_i = v_i$, since $X_i^0 X_i^{0T}$ acts on vectors as the projection onto $\mathrm{span}(X_i^0)$. Hence $v_i^T X_i^0 X_i^{0T} = v_i^T$. The second term then becomes

$$v_i^T \frac{\partial U_i}{\partial [U_i]_{ab}} w_i = [v_i]_a [w_i]_b.$$

It follows that $\frac{\partial \sigma_1}{\partial [U_i]_{ab}} = [v_i]_a [w_i]_b$. From this, we have

$$\nabla d_c^2(x_i^{\Omega}, U_i) = -2 (I - U_i U_i^T) \sigma_1 v_i w_i^T,$$

where multiplication by $I - U_i U_i^T$ projects onto the tangent space of the Grassmannian at $U_i$, as described before [1, 20].

Geodesic gradient. For the gradient of the geodesic distance $d_g^2(U_i, U_j)$ in (3), let us use $\sigma_\ell$ as shorthand for $\sigma_\ell(U_i^T U_j)$, and recall that $v_{ij}^{\ell}$ and $w_{ij}^{\ell}$ denote the $\ell$-th left and right singular vectors of $U_j U_j^T U_i$. Then the partial derivative with respect to the $(a, b)$-th element of $U_i$ is

$$\frac{\partial d_g^2(U_i, U_j)}{\partial [U_i]_{ab}} = \sum_{\ell=1}^{r} \frac{-2 \arccos \sigma_\ell}{\sqrt{1 - \sigma_\ell^2}} \frac{\partial \sigma_\ell}{\partial [U_i]_{ab}} = \sum_{\ell=1}^{r} \frac{-2 \arccos \sigma_\ell}{\sqrt{1 - \sigma_\ell^2}} \left[ v_{ij}^{\ell} w_{ij}^{\ell T} \right]_{ab},$$

where the first equality follows because $\sigma_\ell(U_i^T U_j) = \sigma_\ell(U_j^T U_i) = \sigma_\ell(U_j U_j^T U_i)$, and the second equality follows by arguments parallel to the derivation of (7) for the leading singular value. The last expression is the $(a, b)$-th entry of the Euclidean gradient. Projecting onto the tangent space at $U_i$, as described before [1, 20], we obtain the following gradient on the Grassmannian:

$$\nabla d_g^2(U_i, U_j) = \sum_{\ell=1}^{r} \frac{-2 \arccos \sigma_\ell}{\sqrt{1 - \sigma_\ell^2}} (I - U_i U_i^T) v_{ij}^{\ell} w_{ij}^{\ell T}.$$
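The two gradient expressions derived above can be sanity-checked against central finite differences along a tangent direction. The sketch below is only a numerical check under our own assumptions: the function names, the random test data, the step size `eps`, and the guard that clips singular values just below 1 are all illustrative choices, not part of the paper's algorithm.

```python
import numpy as np

def tangent_projection(U, E):
    """Project a Euclidean gradient E onto the tangent space at U: (I - U U^T) E."""
    return E - U @ (U.T @ E)

def chordal_sq(X0, U):
    sigma1 = np.linalg.svd(X0.T @ U, compute_uv=False)[0]
    return 1.0 - sigma1**2

def chordal_sq_grad(X0, U):
    # Leading singular pair (v, w) of X0 X0^T U, as in the derivation.
    V, s, Wt = np.linalg.svd(X0 @ (X0.T @ U), full_matrices=False)
    return tangent_projection(U, -2.0 * s[0] * np.outer(V[:, 0], Wt[0]))

def geodesic_sq(Ui, Uj):
    s = np.clip(np.linalg.svd(Uj.T @ Ui, compute_uv=False), 0.0, 1.0)
    return np.sum(np.arccos(s) ** 2)

def geodesic_sq_grad(Ui, Uj):
    # Singular triplets of Uj Uj^T Ui, as in the derivation.
    V, s, Wt = np.linalg.svd(Uj @ (Uj.T @ Ui), full_matrices=False)
    s = np.clip(s, 0.0, 1.0 - 1e-12)      # assumed guard against 0/0 at sigma = 1
    coeff = -2.0 * np.arccos(s) / np.sqrt(1.0 - s**2)
    return tangent_projection(Ui, (V * coeff) @ Wt)

def directional_fd(f, U, D, eps=1e-6):
    """Central finite difference of f at U in direction D."""
    return (f(U + eps * D) - f(U - eps * D)) / (2.0 * eps)

rng = np.random.default_rng(0)
m, r = 6, 2
Ui, _ = np.linalg.qr(rng.standard_normal((m, r)))
Uj, _ = np.linalg.qr(rng.standard_normal((m, r)))
D = tangent_projection(Ui, rng.standard_normal((m, r)))   # random tangent direction

# Incomplete vector with entry 2 unobserved: X0 = [x_obs / ||x_obs||, e_2].
x_obs = rng.standard_normal(m)
x_obs[2] = 0.0
X0 = np.column_stack([x_obs / np.linalg.norm(x_obs), np.eye(m)[:, 2]])

geo_gap = abs(directional_fd(lambda U: geodesic_sq(U, Uj), Ui, D)
              - np.sum(geodesic_sq_grad(Ui, Uj) * D))
chord_gap = abs(directional_fd(lambda U: chordal_sq(X0, U), Ui, D)
                - np.sum(chordal_sq_grad(X0, Ui) * D))
```

Both gaps should be small when the relevant singular values are simple and bounded away from 0 and 1, which holds for generic random inputs such as these.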



A mapping $R$ from $T\mathcal{M}$ to $\mathcal{M}$ such that its restriction to $T_U \mathcal{M}$, denoted $R_U$, satisfies a local rigidity condition which preserves gradients at $U$; see the rightmost illustration in Figure to build some intuition, or Chapters and of [1] for a more careful treatment of these definitions.
2 Given a cost function $f$ on a Riemannian manifold $\mathcal{M}$, a sequence of tangent vectors $\{\Delta_t\}$, $\Delta_t \in T_{U_t} \mathcal{M}$, is gradient-related if, for any subsequence $\{U_t\}_{t \in K}$ that converges to a non-critical point of $f$, the corresponding subsequence $\{\Delta_t\}_{t \in K}$ is bounded and satisfies $\limsup_{t \to \infty,\, t \in K} \langle \nabla f(U_t), \Delta_t \rangle < 0$.



Figure 2: The semi-spheres represent the Grassmannian $G(m, r)$, where each point $U_i$ represents a subspace (in the particular case of $G(3, 1)$, the line going from the origin to $U_i$). Left: Intuitively, the chordal distance $d_c(x_i^{\Omega}, U_i)$ is an informal measure of distance between the subspace $U_i$ and an incomplete point $x_i^{\Omega}$. The left image should only be taken as intuition, since $X_i^0$ may not live on the same Grassmannian and the chordal distance should not be thought of as a geodesic distance. Center: The geodesic distance $d_g(U_i, U_j)$ measures the distance over the Grassmannian between $U_i$ and $U_j$. Right: The Euclidean gradient vector $\nabla_i$ falls out of the Grassmann manifold; to account for the Grassmannian curvature, each geodesic step needs to be adjusted according to (4).

Figure 3: $\lambda \ge 0$ in (1) regulates how clusters fuse together. If $\lambda = 0$, each point is assigned to a subspace that exactly contains it (overfitting). The larger $\lambda$, the more we penalize subspaces being apart, which results in subspaces getting closer to form fewer clusters. The extreme case $\lambda = \infty$ is the special case of PCA and LRMC, where only one subspace is allowed to explain all data.

Figure 4: Clustering error (average over 10 trials) as a function of sampling rate for different synthetic settings.

Figure 5: Clustering error as a function of sampling rate for real datasets. Left: average over 120 videos with K = 2 objects, and 35 videos with K = 3 objects. Right: average over 20 trials.

A Stiefel and Grassmann Manifolds

The primary mathematical object involved in this work is the Grassmannian $G(m, r)$, a smooth compact manifold of dimension $r(m - r)$. A full exposition of the Stiefel and Grassmann manifolds is given in [20]. Here, we record only the basic ideas needed for a working understanding of the tools used above. To describe the Grassmannian, it is necessary to define two precursor objects: the orthogonal group $O(m)$ and the Stiefel manifold $S(m, r)$. The objects of interest are thus: the orthogonal group $O(m)$, the set of $m \times m$ orthogonal matrices; the Stiefel manifold $S(m, r)$, the set of $m \times r$ matrices with orthonormal columns; and the Grassmannian $G(m, r)$, the set of $r$-dimensional subspaces of $\mathbb{R}^m$, viewed as equivalence classes of Stiefel elements whose columns span the same subspace (a quotient manifold).

In this setting, the Stiefel manifold is defined as a quotient space of the orthogonal group. Here, two orthogonal matrices $U$ and $V$ are identified if their first $r$ columns are identical or, equivalently, if $V = U \, \mathrm{diag}(I_r, Q)$ for some $Q \in O(m - r)$, so that $S(m, r) = O(m)/O(m - r)$. Going further, the Grassmannian is defined as a quotient space of the Stiefel manifold, where two Stiefel elements are identified if their columns span the same $r$-dimensional subspace. Therefore $G(m, r) = S(m, r)/O(r)$.

Given the above, it is clear that we may describe elements of the Stiefel and Grassmann manifolds using concrete representatives that can be stored on a computer. A point on the Stiefel manifold may be stored as an $m \times r$ orthonormal matrix. A point on the Grassmann manifold, however, being a linear subspace, does not have a unique representative and can be stored as an arbitrary $m \times r$ orthonormal matrix so long as it spans the correct subspace.
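The non-uniqueness of Grassmann representatives is easy to see numerically. In the NumPy sketch below, the helper `same_subspace`, its tolerance, and the random matrices are assumptions introduced purely for illustration.

```python
import numpy as np

def same_subspace(A, B, tol=1e-10):
    """True if orthonormal matrices A and B span the same subspace.

    Compares the orthogonal projectors A A^T and B B^T, which depend only
    on the spanned subspace and not on the chosen basis.
    """
    return np.linalg.norm(A @ A.T - B @ B.T) < tol

rng = np.random.default_rng(0)
m, r = 5, 2
U, _ = np.linalg.qr(rng.standard_normal((m, r)))   # a point on S(m, r)
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # an element of O(r)
U_rot = U @ Q                                       # another orthonormal basis

is_orthonormal = np.allclose(U_rot.T @ U_rot, np.eye(r))
different_stiefel_point = not np.allclose(U, U_rot)
same_grassmann_point = same_subspace(U, U_rot)
manifold_dimension = r * (m - r)                    # dimension of G(m, r)
```

Here `U` and `U_rot` are distinct points on the Stiefel manifold but represent the same point on the Grassmannian, which is the identification $G(m, r) = S(m, r)/O(r)$ in concrete form.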

