EFFECTIVE SUBSPACE INDEXING VIA INTERPOLATION ON STIEFEL AND GRASSMANN MANIFOLDS

Abstract

We propose a novel local Subspace Indexing Model with Interpolation (SIM-I) for low-dimensional embedding of image data sets. SIM-I is constructed in two steps: in the first step we build a piece-wise linear affinity-aware subspace model under a given partition of the data set; in the second step we interpolate between several adjacent linear subspace models constructed previously, using the "center of mass" calculation on Stiefel and Grassmann manifolds. The resulting subspace indexing model built by SIM-I is a globally non-linear low-dimensional embedding of the original data set. Furthermore, the interpolation step produces a "smoothed" version of the piece-wise linear embedding mapping constructed in the first step, and can be viewed as a regularization procedure. We provide experimental results validating the effectiveness of SIM-I: it improves PCA recovery on the SIFT data set and nearest-neighbor classification success rates on the MNIST and CIFAR-10 data sets.

1. INTRODUCTION

Subspace selection algorithms have been successful in many application problems related to dimension reduction (Zhou et al. (2010), Bian & Tao (2011), Si et al. (2010), Zhang et al. (2009)), with applications including, e.g., human face recognition (Fu & Huang (2008)), speech and gait recognition (Tao et al. (2007)), etc. Classical approaches to subspace selection in dimension reduction include algorithms such as Principal Component Analysis (PCA, see Jolliffe (2002)) and Linear Discriminant Analysis (LDA, see Belhumeur et al. (1997), Tao et al. (2009)). These methods seek globally linear subspace models; therefore, they fail to capture the nonlinearity of the intrinsic data manifold and ignore the local variation of the data (Saul & Roweis (2003), Strassen (1969)). Consequently, these globally linear models are often ineffective for search problems on large-scale image data sets. To resolve this difficulty, nonlinear methods such as kernel algorithms (Ham et al. (2004)) and manifold learning algorithms (Belkin et al. (2006), Guan et al. (2011)) have been proposed. However, even though these nonlinear methods significantly improve recognition performance, they face a serious computational challenge on large-scale data sets, due to the cost of matrix decompositions whose size grows with the number of training samples. Here we propose a simple method, the Subspace Indexing Model with Interpolation (SIM-I), that produces from a given data set a piece-wise linear, locality-aware and globally nonlinear low-dimensional embedding model. SIM-I is constructed in two steps: in the first step we build a piece-wise linear affinity-aware subspace model under a given partition of the data set; in the second step we interpolate between several adjacent linear subspace models constructed previously, using the "center of mass" calculation on Stiefel and Grassmann manifolds (Edelman et al. (1999), Kaneko et al. (2013), Marrinan et al. (2014)).
The interpolation step outputs a "smoothed" version (Figure 1) of the original piece-wise linear model, and can be regarded as a regularization process. Compared to the subspace methods mentioned above, SIM-I enjoys the following advantages: (1) it captures the global nonlinearity as well as the local fluctuations of the data set; (2) it remains computationally feasible for large-scale data sets, since it avoids matrix decompositions whose size grows with the number of training samples; (3) it includes a regularization step that interpolates between several adjacent pieces of subspace models. Numerical experiments on a PCA recovery task for the SIFT data set and on nearest-neighbor classification tasks for the MNIST and CIFAR-10 data sets further validate the effectiveness of SIM-I.

2. PIECE-WISE LINEAR LOCALITY PRESERVING PROJECTION (LPP) MODEL

If an image data point x ∈ R^D is represented as a vector in a very high-dimensional space, then we want to find a low-dimensional embedding y = f(x) ∈ R^d, d ≪ D, such that the embedding function f retains some meaningful properties of the original image data set, ideally close to its intrinsic dimension. If we restrict ourselves to linear maps of the form y = W^T x ∈ R^d, where W = (w_ij)_{1≤i≤D, 1≤j≤d} is a D × d projection matrix (assumed full rank), then such a procedure is called a linear low-dimensional embedding (see Roweis & Saul (2000); Van Der Maaten et al. (2009)). The target is to search for a "good" projection matrix W, such that the projection x → y = W^T x preserves certain locality in the data set (this is called a Locality Preserving Projection, or LPP projection, see He & Niyogi (2003)). The locality is interpreted as a kind of intrinsic relative geometric relation between the data points in the original high-dimensional space, usually represented by the affinity matrix S = (s_ij)_{1≤i,j≤n}, a symmetric matrix with non-negative entries. As an example, given unlabelled data points x_1, ..., x_n ∈ R^D, we can take s_ij = exp(−‖x_i − x_j‖²/(2σ²)) when ‖x_i − x_j‖ < ε, and s_ij = 0 otherwise. Here σ > 0, ε > 0 is a small threshold parameter, and ‖x_i − x_j‖ is the Euclidean norm in R^D. Based on the affinity matrix S = (s_ij), the search for the projection matrix W can be formulated as the optimization problem

min_W φ(W) = (1/2) Σ_{i,j=1}^n s_ij ‖y_i − y_j‖² , (1)

in which y_i = W^T x_i, y_j = W^T x_j, and the norm ‖y_i − y_j‖ is taken in the projected space R^d. Usually when ‖x_i − x_j‖ is large, the affinity s_ij will be small, and vice versa. Thus (1) seeks an embedding matrix W such that close pairs of image points x_i and x_j are mapped to close pairs of embeddings y_i = W^T x_i and y_j = W^T x_j, and vice versa. This helps to preserve the local geometry of the data set, a.k.a. the locality.
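A minimal numpy sketch of this construction may be helpful. The function names (`affinity_matrix`, `lpp_objective`) and the convention that rows of `X` are data points are our own illustrative assumptions, not part of the paper; the Gaussian/threshold affinity follows the example above.

```python
import numpy as np

def affinity_matrix(X, sigma, eps):
    """Gaussian affinity: s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) when
    ||x_i - x_j|| < eps, and s_ij = 0 otherwise (rows of X are data points)."""
    diff = X[:, None, :] - X[None, :, :]
    sqdist = np.sum(diff ** 2, axis=2)
    S = np.exp(-sqdist / (2.0 * sigma ** 2))
    S[np.sqrt(sqdist) >= eps] = 0.0
    np.fill_diagonal(S, 0.0)  # no self-affinity
    return S

def lpp_objective(W, X, S):
    """phi(W) = (1/2) sum_{i,j} s_ij ||W^T x_i - W^T x_j||^2, as in (1)."""
    Y = X @ W  # embeddings y_i as rows
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    return 0.5 * np.sum(S * sq)
```

A standard identity worth keeping in mind: φ(W) equals tr(W^T X^T L X W) for the graph Laplacian L introduced next, which is what reduces (1) to an eigenvalue problem.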
To solve (1), we introduce a weighted fully-connected graph G whose vertex set consists of all data points x_1, ..., x_n, and where the weight on the edge connecting x_i and x_j is given by s_ij ≥ 0. Consider the diagonal matrix D = diag(D_11, ..., D_nn) with D_ii = Σ_{j=1}^n s_ij, and introduce the graph Laplacian L = D − S. Then the minimization problem (1), together with the normalization constraint Σ_{i=1}^n D_ii y_i² = 1, reduces to the generalized eigenvalue problem (see He & Niyogi (2003))

X L X^T w = λ X D X^T w , (2)

where X = [x_1, ..., x_n] ∈ R^{D×n}. Assume we have obtained an increasing family of eigenvalues 0 = λ_0 < λ_1 ≤ ... ≤ λ_{n−1}, with corresponding eigenvectors w_0, w_1, ..., w_{n−1}. Then the low-dimensional embedding matrix can be taken as W = [w_1, ..., w_d] (see He et al. (2005)). By choosing different affinity matrices S = (s_ij), the above LPP framework covers many common practical examples. For example, if the data x_1, ..., x_n are unlabelled, we can take s_ij = 1/n, and (2) produces classical Principal Component Analysis (PCA). For labelled data forming subsets X_1, ..., X_m, each carrying one label, we can take s_ij = 1/n_k when x_i, x_j ∈ X_k, and s_ij = 0 otherwise, where n_k is the cardinality of X_k; this produces Linear Discriminant Analysis (LDA). Detailed justifications of these connections can be found in He et al. (2005). Given an input data set X = {x_1, ..., x_n}, where each x_i ∈ R^D is either labelled or unlabelled, we can apply a k-d tree (Bentley (1975), Wang et al. (2011)) based partition scheme to divide the whole data set X into non-overlapping subsets C_1, ..., C_{2^h}, where h is the depth of the tree. Conventional subspace selection algorithms can be applied to the whole sample space before it is partitioned and indexed. For example, we can first apply PCA to X, which selects the first d bases [a_1, ..., a_d] with largest variance.
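The generalized eigenvalue problem (2) can be solved directly with a symmetric generalized eigensolver. The sketch below is ours, not the paper's code; the tiny ridge added to the right-hand side matrix is an assumption for numerical stability when X D X^T is singular, and the function name `lpp_projection` is hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def lpp_projection(X, S, d, ridge=1e-10):
    """Solve X L X^T w = lambda X D X^T w (equation (2)); columns of X
    are the data points x_1, ..., x_n, as in the text."""
    Dg = np.diag(S.sum(axis=1))            # degree matrix D
    L = Dg - S                             # graph Laplacian L = D - S
    A = X @ L @ X.T
    B = X @ Dg @ X.T + ridge * np.eye(X.shape[0])  # ridge: our stability assumption
    lam, V = eigh(A, B)                    # symmetric generalized eigenproblem, ascending
    return V[:, 1:d + 1]                   # skip the first eigenvector, following the text
```

The returned columns are B-orthonormal generalized eigenvectors; dropping the first one mirrors the convention W = [w_1, ..., w_d] above.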
Based on these bases, the covariance information obtained from the global PCA is utilized in the indexing as follows: (1) we project all sample points x_1, ..., x_n onto the maximum-variance basis a_1, find the median value m_1 of the projected samples, and split the whole collection of data along a_1 at m_1, i.e., split the current node into left and right children; (2) starting from level i = 2, for each left and right child, project its data along the i-th maximum-variance basis a_i, find the median value m_i, and split each child at m_i; (3) increase the level from i to i + 1 and repeat (2) until i = h reaches the bottom of the tree. We collect all the 2^h children at level i = h and obtain the disjoint subsets C_1, ..., C_{2^h}.

3. CALCULATING THE "CENTER OF MASS" ON STIEFEL AND GRASSMANN MANIFOLDS

Subspace Indexing (Wang et al. (2011)) provides a d-dimensional representation of the data set {x_1, ..., x_n} by the subspace span(w_1, ..., w_d) = {W^T x, x ∈ R^D} generated from the linear embedding matrix W ∈ R^{D×d}. In this case, we are only interested in the column space of W, so we can assume that w_0, w_1, ..., w_{n−1} form an orthonormal basis foot_0. Such a matrix W belongs to the Stiefel manifold, defined as follows.

Definition 1 (Stiefel manifold) The compact Stiefel manifold St(d, D) is a submanifold of the Euclidean space R^{D×d} such that

St(d, D) = {X ∈ R^{D×d} : X^T X = I_d} . (3)

As an example, if we are interested in signal recovery using a low-dimensional PCA embedding, the projections calculated from the PCA analysis lie on a Stiefel manifold. However, for classification tasks, the exact distance information is less important than the label information. In this case, two Stiefel matrices W_1 and W_2 produce the same embedding if W_1 = W_2 O_d for some O_d ∈ O(d), where O(d) is the group of orthogonal matrices in dimension d.
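The PCA-guided median splits described in steps (1)-(3) can be sketched as follows. This is our own reading of the scheme (in particular, we split each child at its *own* median of the projections onto a_i, which is one interpretation of step (2)); the function name `kd_partition` is hypothetical.

```python
import numpy as np

def kd_partition(X, h):
    """Partition the rows of X into 2^h disjoint cells: level i splits every
    current cell at the median of its projections onto the i-th
    maximum-variance PCA basis a_i (requires h <= number of columns)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt: a_1, a_2, ...
    cells = [np.arange(X.shape[0])]
    for level in range(h):
        a = Vt[level]
        nxt = []
        for idx in cells:
            proj = X[idx] @ a
            m = np.median(proj)
            nxt.append(idx[proj <= m])  # left child
            nxt.append(idx[proj > m])   # right child
        cells = nxt
    return cells
```

Each level doubles the number of cells, so depth h yields the 2^h disjoint subsets C_1, ..., C_{2^h}.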
In this case, the relevant embedding we obtain is a point on the Grassmann manifold (Definition 2 below). Given a family of elements on the Stiefel or Grassmann manifold, the center-of-mass is defined as an element of the same manifold that minimizes the weighted sum of squared distances. To be precise, we have

Definition 3 (Stiefel and Grassmann center-of-masses) Given a sequence of matrices W_1, ..., W_l ∈ St(d, D) and a sequence of weights w_1, ..., w_l > 0, the Stiefel center-of-mass with respect to the distance d(W_1, W_2) on St(d, D) is defined as a matrix W_c ∈ St(d, D) such that

W_c = W_c^St(W_1, ..., W_l; w_1, ..., w_l) ≡ arg min_{W ∈ St(d,D)} Σ_{j=1}^l w_j d²(W, W_j) .

Similarly, if the corresponding equivalence classes are [W_1], ..., [W_l] ∈ Gr(d, D), then the Grassmann center-of-mass with respect to the distance d([W_1], [W_2]) on Gr(d, D) is defined as the equivalence class [W_c], where W_c = W_c^Gr(W_1, ..., W_l; w_1, ..., w_l) ∈ St(d, D) is such that

W_c = W_c^Gr(W_1, ..., W_l; w_1, ..., w_l) ≡ arg min_{W ∈ St(d,D)} Σ_{j=1}^l w_j d²([W], [W_j]) .

The distances d(W_1, W_2) or d([W_1], [W_2]) can be chosen in different ways. For example, for W_1, W_2 ∈ St(d, D), one choice is d(W_1, W_2) = d_F(W_1, W_2) = ‖W_1 − W_2‖_F, the matrix Frobenius norm of W_1 − W_2. One can also take a more intrinsic distance, such as the geodesic distance between W_1 and W_2 on the manifold St(d, D) with the metric given by the embedded geometry (see Edelman et al. (1999)). For [W_1], [W_2] ∈ Gr(d, D), one choice is the projected Frobenius norm d([W_1], [W_2]) = d_pF([W_1], [W_2]) = 2^{−1/2} ‖W_1 W_1^T − W_2 W_2^T‖_F. There are also many other choices, such as using the principal angles between the subspaces, chordal norms, or other types of Frobenius norms (see Edelman et al. (1999, Section 4.3)).
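The two distances used in the experiments are one-liners; a small sketch makes the key invariance explicit. The function names `d_F` and `d_pF` mirror the notation above but are otherwise our own.

```python
import numpy as np

def d_F(W1, W2):
    """Frobenius distance on St(d, D)."""
    return np.linalg.norm(W1 - W2, 'fro')

def d_pF(W1, W2):
    """Projected Frobenius distance on Gr(d, D):
    2^{-1/2} ||W1 W1^T - W2 W2^T||_F; invariant under W -> W O_d."""
    return np.linalg.norm(W1 @ W1.T - W2 @ W2.T, 'fro') / np.sqrt(2.0)
```

Note that d_pF vanishes whenever W_2 = W_1 O_d for O_d ∈ O(d), while d_F generally does not, which is exactly why d_pF is the right notion for classification tasks.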
With respect to the matrix Frobenius norm and the projected Frobenius norm, the Stiefel and Grassmann center-of-masses can be calculated explicitly, as in the following theorems.

Theorem 1 (Stiefel center-of-mass with respect to Frobenius norm) Consider the singular value decomposition Σ_{j=1}^l w_j W_j = O_1 ∆ O_2, where O_1 ∈ O(D), O_2 ∈ O(d), ∆ = [diag(λ_1, ..., λ_d); 0_{(D−d)×d}] (the diagonal block stacked on a zero block), and λ_1 ≥ ... ≥ λ_d ≥ 0 are the singular values. Then the Stiefel center-of-mass with respect to the distance given by the Frobenius norm d(W_1, W_2) = ‖W_1 − W_2‖_F is W_c = O_1 Λ O_2, where Λ = [I_d; 0_{(D−d)×d}].

Theorem 2 (Grassmann center-of-mass with respect to projected Frobenius norm) Set Ω_j = (Σ_{j=1}^l w_j)^{−1} w_j. Consider the singular value decomposition of the symmetric matrix Σ_{j=1}^l Ω_j W_j W_j^T = Q ∆ Q^T, where Q ∈ O(D), ∆ = diag(σ_1², ..., σ_D²), and σ_1² ≥ ... ≥ σ_D² ≥ 0. Then the Grassmann center-of-mass with respect to the distance given by the projected Frobenius norm d_pF([W_1], [W_2]) = 2^{−1/2} ‖W_1 W_1^T − W_2 W_2^T‖_F is the equivalence class [W_c] determined by W_c = QΛ, where Λ = [I_d; 0_{(D−d)×d}].

(For classification tasks, each W_k is identified with its class [W_k] ∈ Gr(d, D).) Each subspace model W_k is built from the LPP embedding on the subset C_k ⊂ X developed from the k-d tree, where h is the tree depth. Given a test point x ∈ R^D that does not lie in X, we can map it to the low-dimensional embedding f(x) = W_{k(x)}^T x ∈ R^d, where the index k(x) corresponds to the subset C_{k(x)} lying closest to x. In practice, we first compute the mean m_k over all the data points in the subset C_k for each k = 1, 2, ..., 2^h, and sort the distances ‖x − m_k‖ in ascending order, ‖x − m_{k_1(x)}‖ ≤ ... ≤ ‖x − m_{k_{2^h}(x)}‖, with {k_1(x), ..., k_{2^h}(x)} = {1, ..., 2^h}. We then take k(x) = k_1(x), the index corresponding to the shortest distance. This is effective when the test point x lies significantly close to one of the subsets; see Figure 2(a).
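Theorem 1 amounts to taking the polar factor of the weighted sum B = Σ_j w_j W_j: replace all singular values of B by 1. A short numpy sketch (function name ours):

```python
import numpy as np

def stiefel_center_of_mass(Ws, weights):
    """Theorem 1: the minimizer of sum_j w_j ||W - W_j||_F^2 over St(d, D)
    is W_c = O_1 Lambda O_2, obtained from the SVD of B = sum_j w_j W_j
    by replacing the singular values with 1 (the polar factor of B)."""
    B = sum(w * W for w, W in zip(weights, Ws))
    U, _, Vt = np.linalg.svd(B, full_matrices=False)  # thin SVD
    return U @ Vt
```

In the thin SVD, U ∈ R^{D×d} collects the first d columns of O_1, so U Vt equals O_1 Λ O_2 from the theorem.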
Algorithm 1 SIM-I: Subspace Indexing Model with Interpolation
1: Input: data set X = {x_1, ..., x_n ∈ R^D} and its corresponding affinity matrix S = (s_ij)_{1≤i,j≤n}; test point x ∈ R^D; threshold ratio r_thr > 1; tree depth h; parameter K > 0
2: Using an initial PCA and a k-d tree based partition scheme, decompose the data set X into subsets C_1, ..., C_{2^h}, where h is the depth of the tree
3: For each subset C_k, calculate its mean (center) m_k ∈ R^D and its LPP embedding matrix W_k ∈ St(d, D) based on the affinity matrix S
4: Sort the distances ‖x − m_k‖ in ascending order, ‖x − m_{k_1(x)}‖ ≤ ... ≤ ‖x − m_{k_{2^h}(x)}‖, {k_1(x), ..., k_{2^h}(x)} = {1, ..., 2^h}
5: Determine I, the first sub-index i of k_i(x) such that ‖x − m_{k_{I+1}(x)}‖ > r_thr ‖x − m_{k_1(x)}‖
6: Set j_i(x) = k_i(x) for i = 1,

In this case, we aim to interpolate between several subspace indexing models W_{j_1(x)}, ..., W_{j_I(x)}. To do this, we first find the subspace indexes j_1(x), ..., j_I(x) of the first I subsets C_{j_1(x)}, ..., C_{j_I(x)} closest to x, i.e., j_1(x) = k_1(x), ..., j_I(x) = k_I(x), given the sorted distances ‖x − m_k‖ mentioned above. In practice, the number I = I(x) depends on x and can be chosen in the following way: I is the first sub-index i of k_i(x) such that ‖x − m_{k_{I+1}(x)}‖ > r_thr ‖x − m_{k_1(x)}‖, where r_thr > 1 is a threshold ratio that can be tuned. We then pick the weights w_i = exp(−K ‖x − m_{j_i(x)}‖²) for some K > 0 and i = 1, 2, ..., I, indicating that the closer x is to C_{j_i(x)}, the heavier the weight we assign to W_{j_i(x)} in the interpolation process. Given the embedding matrices W_{j_1(x)}, ..., W_{j_I(x)} ∈ St(d, D), or their corresponding subspaces [W_{j_1(x)}], ..., [W_{j_I(x)}] ∈ Gr(d, D), together with the weights w_1, ..., w_I > 0, we find a center-of-mass W_c = W_c^St(W_{j_1(x)}, ..., W_{j_I(x)}; w_1, ..., w_I) (Stiefel case) or [W_c] = [W_c^Gr(W_{j_1(x)}, .
..., W_{j_I(x)}; w_1, ..., w_I)] (Grassmann case) according to Definition 3 and Theorems 1 and 2. Finally, we map the test point x to the low-dimensional embedding f(x) = W_c^T x ∈ R^d. Notice that when I = 1, the interpolation procedure reduces to projecting x using W_{k_1(x)}, calculated from the LPP analysis on the closest subset only. In general, the whole interpolation procedure can be regarded as providing a regularized version of the piece-wise linear embedding discussed in Section 2 (see also Figure 1). We summarize our interpolation method in Algorithm 1 (SIM-I).
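The query-time portion of the procedure (Stiefel variant) can be sketched end to end. This is our illustrative glue code, not the authors' implementation; `sim_i_embed` and its defaults are assumptions, with the center-of-mass computed via Theorem 1.

```python
import numpy as np

def sim_i_embed(x, means, Ws, r_thr=2.0, K=1e-8):
    """One SIM-I query: select the I nearest cells by the r_thr ratio rule,
    weight them by exp(-K ||x - m_k||^2), interpolate their embedding
    matrices via the Frobenius center-of-mass, and project x."""
    dists = np.array([np.linalg.norm(x - m) for m in means])
    order = np.argsort(dists)                 # k_1(x), k_2(x), ...
    I = 1                                     # grow I while the ratio rule allows
    while I < len(order) and dists[order[I]] <= r_thr * dists[order[0]]:
        I += 1
    w = np.exp(-K * dists[order[:I]] ** 2)
    B = sum(wi * Ws[k] for wi, k in zip(w, order[:I]))
    U, _, Vt = np.linalg.svd(B, full_matrices=False)
    Wc = U @ Vt                               # Stiefel center-of-mass (Theorem 1)
    return Wc.T @ x
```

When one cell is far closer than all others the ratio rule gives I = 1 and the output coincides with the plain piece-wise linear embedding W_{k_1(x)}^T x.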

5. EXPERIMENTS

5.1 PCA RECOVERY FOR SIFT DATA SET

The SIFT (Scale Invariant Feature Transform, see Lowe (2004)) data set consists of real-valued keypoint descriptors, computed from the content of the surrounding patch in terms of local intensity gradients. Given its remarkable performance, SIFT has often been used as a starting point for the creation of other descriptors. The final SIFT descriptor consists of 128 elements, so each data point in the SIFT data set has dimension D = 128, and we pick the embedding dimension d = 16. The original SIFT data set has 10068850 data samples, which form the data set sift sample. We randomly collect n_train = 200 × 2^13 elements from these sample points as our target data set X = sift train = {x_1, ..., x_{n_train} ∈ R^D}. We consider the recovery efficiency of the PCA embedding of the SIFT data set. Let x be a point in sift sample and let W be a Stiefel matrix in St(16, 128). The projection of x onto the 16-dimensional space is then denoted by y = W^T x. By recovery we mean the point x̂ = (W^T)^− y, where (W^T)^− is the Moore-Penrose pseudo-inverse of W^T. The recovery efficiency can then be quantified by the recovery error ‖x − x̂‖, where the Euclidean norm is computed in R^128. We pick h = 13, so that 2^h = 8192, and decompose sift train into 8192 subsets C_1, ..., C_8192 using the k-d tree based partition. For each subset C_k, we calculate the mean m_k ∈ R^128 and obtain a PCA embedding matrix W_k ∈ St(16, 128). We sort the distances ‖x − m_k‖, k = 1, 2, ..., 8192, in ascending order, so that ‖x − m_{k_1}‖ ≤ ... ≤ ‖x − m_{k_8192}‖, where {k_1, ..., k_8192} = {1, ..., 8192}. Then we find among the subset means m_k the first I nearest to x, with their indexes denoted by k_i = k_i(x), i = 1, ..., I. The number I = I(x) depends on x and is chosen in the following way: I is the first sub-index i of k_i such that ‖x − m_{k_{I+1}}‖ > r_thr ‖x − m_{k_1}‖, where r_thr > 1 is a threshold ratio.
We pick r_thr = 2. For the test data set, we randomly pick from sift sample \ sift train a subset of size n_test = 500, denoted sift test. For each test point x ∈ sift test, we consider sending it to the nearest subset C_{k_1}, with corresponding Stiefel matrix W_{k_1} ∈ St(16, 128). We can then consider the recovery point x̂ = (W_{k_1}^T)^− W_{k_1}^T x and the benchmark recovery error Error_bm = ‖x − x̂‖. Consider the alternative recovery scheme using our method SIM-I. We calculate the weights w_j = exp(−K ‖x − m_{k_j}‖²) for j = 1, ..., I, choosing the constant K = 10^{−8}. We then find the Stiefel center-of-mass W_c = W_c^St(W_{k_1(x)}, ..., W_{k_I(x)}; w_1, ..., w_I) using Theorem 1, taking the distance function to be the matrix Frobenius norm d(W_1, W_2) = ‖W_1 − W_2‖_F. We consider the recovery point x̂_c = (W_c^T)^− W_c^T x and the recovery error Error_c = ‖x − x̂_c‖. Over the test set sift test, we find that for about 94.2% of the test points, Error_c < Error_bm, which implies that SIM-I improves the efficiency of recovery. The empirical average of Error_c is 402.089506, and the empirical average of Error_bm is 454.452314. Figure 3 plots Error_c and Error_bm (vertical axis) as functions of the test sample indexes: the red curve is Error_c and the blue curve is Error_bm, where we have sorted Error_bm in ascending order and reordered the test indexes correspondingly. Figure 4 gives the differences Error_c − Error_bm in descending order. It is apparent that Error_c − Error_bm < 0 for most test samples.
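Since each W here has orthonormal columns, the pseudo-inverse of W^T is W itself, so recovery is just the orthogonal projection W W^T x. A small sketch of the recovery-error computation (function name ours):

```python
import numpy as np

def recovery_error(x, W):
    """||x - (W^T)^- W^T x||: for W in St(d, D) the Moore-Penrose
    pseudo-inverse of W^T is W, so the recovered point is the
    orthogonal projection W W^T x."""
    x_hat = np.linalg.pinv(W.T) @ (W.T @ x)
    return np.linalg.norm(x - x_hat)
```

The error is therefore the norm of the component of x orthogonal to the column space of W, which is what the benchmark and SIM-I recovery schemes above compare.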

5.2. NEAREST-NEIGHBOR CLASSIFICATION FOR MNIST AND CIFAR-10

Here we consider a labeled data set data = {(x_i, y_i), i = 1, 2, ..., N}, where x_i ∈ R^D are the inputs and y_i ∈ N are the labels. We randomly select the training set data train = {(x_i, y_i), i = 1, 2, ..., n_train} and the test set data test = {(a_i, b_i), i = 1, 2, ..., n_test}, and we make them disjoint. We first project the training set onto a kd_PCA-dimensional subspace via a standard PCA. Based on this initial embedding, using a k-d tree of height h, we divide data train into 2^h clusters C_k, k = 1, 2, ..., 2^h. For each C_k, we find an LPP embedding matrix W_k ∈ St(kd_LPP, D) by setting the affinity matrix to s_ij = exp(−‖x_i − x_j‖²) if x_i, x_j ∈ C_k are in the same class, and s_ij = 0 otherwise. Since this is a classification problem, we can identify each W_k with the subspace it spans, i.e., we consider [W_k], the equivalence class of W_k in Gr(kd_LPP, D). As before, we compute in R^D the mean m_k of each cluster C_k. For a test point x ∈ data test, we sort the distances ‖x − m_k‖, k = 1, 2, ..., 2^h, in ascending order, so that ‖x − m_{k_1}‖ ≤ ... ≤ ‖x − m_{k_{2^h}}‖, where {k_1, ..., k_{2^h}} = {1, ..., 2^h}. Then we find among the cluster means the first I nearest to x, with their indexes denoted by k_i = k_i(x), i = 1, ..., I. The number I = I(x) depends on x and is chosen in the following way: I is the first sub-index i of k_i such that ‖x − m_{k_{I+1}}‖ > r_thr ‖x − m_{k_1}‖, where r_thr > 1 is a threshold ratio, treated as a hyper-parameter that we can tune here. For the baseline method, we perform nearest-neighbor classification for x on a low-dimensional embedding of the cluster C_{k_1}, picking the number of nearest neighbors knn ≥ 1: we project x and all training data points in C_{k_1} using W_{k_1}, and perform nearest-neighbor classification on the resulting projection. For our method SIM-I, we take the union C_{k_1} ∪ ... ∪ C_{k_I}.
Recall that each C_{k_i} corresponds to an LPP embedding projection matrix W_{k_i}. We set the weights w_i = exp(−K ‖x − m_{k_i}‖²), and we pick K = 10^{−8}. We compute a center-of-mass W_c of the projection matrices W_{k_i} with weights w_i, i = 1, 2, ..., I, using the Grassmann center-of-mass method, where the distance is taken as the projected Frobenius norm, i.e., d([W_1], [W_2]) = 2^{−1/2} ‖W_1 W_1^T − W_2 W_2^T‖_F, and ‖·‖_F is the matrix Frobenius norm. We obtain W_c = W_c^Gr(W_{k_1(x)}, ..., W_{k_I(x)}; w_1, ..., w_I) and project x and all training data points in the union C_{k_1} ∪ ... ∪ C_{k_I} using W_c. We then perform nearest-neighbor classification for x on this low-dimensional embedding, with the number of nearest neighbors equal to knn. Table 1 shows the results: the first 4 columns give the data set, the k-d tree height, the threshold value r_thr, and the number of nearest neighbors knn, respectively; the last 2 columns give the nearest-neighbor classification success rates for the baseline method and for SIM-I using the Grassmann center-of-mass; and the rows correspond to different experiments. We slightly tuned r_thr to reach the best performances. Clearly, SIM-I has an advantage.
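The Grassmann center-of-mass used here (Theorem 2) reduces to an eigendecomposition of the weighted sum of projectors. A short sketch with an illustrative function name of our choosing:

```python
import numpy as np

def grassmann_center_of_mass(Ws, weights):
    """Theorem 2: normalize the weights to Omega_j, eigendecompose
    sum_j Omega_j W_j W_j^T = Q Delta Q^T, and keep the top-d
    eigenvectors: W_c = Q Lambda."""
    w = np.asarray(weights, dtype=float)
    Omega = w / w.sum()
    d = Ws[0].shape[1]
    M = sum(o * W @ W.T for o, W in zip(Omega, Ws))  # symmetric PSD matrix
    _, Q = np.linalg.eigh(M)                         # eigenvalues ascending
    return Q[:, ::-1][:, :d]                         # top-d eigenvectors
```

Because only the projectors W_j W_j^T enter, the result is unchanged if any W_j is replaced by W_j O_d, which is exactly the invariance wanted for classification.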

A PROOF OF THEOREM 1




If this is not the case, we can replace the matrix [w_0 w_1 ... w_{n−1}] by the Q factor of its QR decomposition, without changing the corresponding subspace.



Figure 1: The idea of "smoothing" a piece-wise linear low-dimensional embedding model: (a) the piece-wise linear low-dimensional embedding model built from LPP; (b) the regularized low-dimensional embedding obtained by taking a Stiefel/Grassmann manifold center-of-mass among adjacent linear pieces.

Each subset C_k, k = 1, 2, ..., 2^h, consists of a family of input data points in R^D. Based on them, using the above LPP framework, for each C_k a low-dimensional embedding matrix W_k ∈ R^{D×d} can be constructed. In this way, over the whole data set X, we have constructed a piece-wise linear low-dimensional embedding model f(x): R^D → R^d (see Figure 1(a)), where x ∈ X. This model is given by the linear embedding matrices W_1, ..., W_{2^h} ∈ R^{D×d}. The above model construction can be regarded as a training process on the data set X. For a given test data point x ∈ R^D not included in X, we can find the closest subset C_{k(x)} by selecting the index k = k(x) ∈ {1, ..., 2^h} with the smallest distance ‖x − m_{k(x)}‖, where m_k is the mean of all data points in the subset C_k. With the subset C_{k(x)} chosen, we map the test point x ∈ R^D to its low-dimensional embedding f(x) = W_{k(x)}^T x ∈ R^d. This procedure extends the piece-wise linear embedding model f(x): R^D → R^d to all test data points in R^D.
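This nearest-cell test-time rule is simple enough to state in a few lines of code; the function name `piecewise_embed` and the list-based interface are our own illustrative choices.

```python
import numpy as np

def piecewise_embed(x, means, Ws):
    """Test-time rule of Section 2: pick k(x) = argmin_k ||x - m_k||
    and return f(x) = W_{k(x)}^T x."""
    k = int(np.argmin([np.linalg.norm(x - m) for m in means]))
    return Ws[k].T @ x
```

SIM-I later replaces the single matrix W_{k(x)} by a center-of-mass of the matrices of several nearby cells.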

Definition 2 (Grassmann manifold) The Grassmann manifold Gr(d, D) is defined as the quotient manifold Gr(d, D) = St(d, D)/O(d). A point on Gr(d, D) is an equivalence class [W] = {W O_d, O_d ∈ O(d)}, where W ∈ St(d, D).

The Grassmann center-of-mass is the equivalence class [W_c] determined by W_c = QΛ, where Λ = [I_d; 0_{(D−d)×d}].

4. SIM-I: INTERPOLATING THE LPP MODEL FAMILY

Recall that we have developed a piece-wise linear embedding model f(x): R^D → R^d over the data set X = {x_1, ..., x_n}. The embedding f(x) corresponds to a family of subspace indexing models W_1, ..., W_{2^h} ∈ St(d, D).

Figure 2: There are I = 3 nearby subsets C_1, C_2, C_3, with means m_1, m_2, m_3, in the training set. (a) The test point x is clearly closest to m_1, and thus the low-dimensional embedding is taken as f(x) = W_1^T x, where W_1 is the LPP subspace model based on C_1; (b) the test point x has approximately the same distance to m_1, m_2, m_3, and thus the embedding is taken as f(x) = W_c^T x, where W_c is the Stiefel/Grassmann center-of-mass of the LPP subspace models W_1, W_2, W_3 based on C_1, C_2, C_3.

2, ..., I, and obtain the embedding matrices W_{j_1(x)}, ..., W_{j_I(x)} ∈ St(d, D) or their corresponding subspaces [W_{j_1(x)}], ..., [W_{j_I(x)}] ∈ Gr(d, D), together with the weights w_i = exp(−K ‖x − m_{j_i(x)}‖²) > 0 for i = 1, ..., I
7: Find a center-of-mass W_c = W_c^St(W_{j_1(x)}, ..., W_{j_I(x)}; w_1, ..., w_I) (Stiefel case) or [W_c] = [W_c^Gr(W_{j_1(x)}, ..., W_{j_I(x)}; w_1, ..., w_I)] (Grassmann case) according to Definition 3 and Theorems 1 and 2
8: Output: the low-dimensional embedding f(x) = W_c^T x ∈ R^d

However, for a general test point x ∈ R^D, it might happen that the point lies at approximately the same distance to the centers of several different subsets adjacent to x (see Figure 2(b)).

Figure 3: Comparison of the PCA recovery errors: blue = benchmark case using closest-subset PCA recovery, with the errors sorted from low to high; red = SIM-I based on the Stiefel center-of-mass with d(W_1, W_2) = ‖W_1 − W_2‖_F.

Let D ≥ d ≥ 1. Recall that St(d, D) stands for the Stiefel manifold: each matrix in St(d, D) is a D × d matrix with orthonormal columns. For any matrix M, recall that ‖M‖_F = [tr(M^T M)]^{1/2} is the Frobenius norm of M.

Theorem 1 (Stiefel center-of-mass with respect to Frobenius norm) Consider the singular value decomposition Σ_{j=1}^l w_j W_j = O_1 ∆ O_2, where O_1 ∈ O(D), O_2 ∈ O(d), ∆ = [diag(λ_1, ..., λ_d); 0_{(D−d)×d}], and λ_1 ≥ ... ≥ λ_d ≥ 0 are the singular values. Then the Stiefel center-of-mass with respect to the distance given by the Frobenius norm d(W_1, W_2) = ‖W_1 − W_2‖_F is W_c = O_1 Λ O_2, where Λ = [I_d; 0_{(D−d)×d}].

Proof. For W ∈ St(d, D) define f(W) = Σ_{j=1}^l w_j ‖W − W_j‖_F², and we look for a minimizer W_c of f on St(d, D). Write ‖W − W_j‖_F² = tr[(W − W_j)^T (W − W_j)] = tr(W^T W) + tr(W_j^T W_j) − 2 tr(W^T W_j) = 2d − 2 tr(W^T W_j), since tr(W^T W) = tr(I_d) = d. Hence minimizing f is equivalent to maximizing tr(W^T B), where B = Σ_{j=1}^l w_j W_j. By the singular value decomposition, there are a D × D orthogonal matrix O_1 and a d × d orthogonal matrix O_2 such that B = O_1 ∆ O_2, where ∆ = [diag(λ_1, ..., λ_d); 0_{(D−d)×d}] and λ_1 ≥ ... ≥ λ_d ≥ 0 are the singular values of B. It follows that tr(W^T B) = tr(∆ O_2 W^T O_1). Observe that (O_2 W^T O_1)^T ∈ St(d, D). Write O_2 W^T O_1 = (c_ij)_{d×D}; since each row of O_2 W^T O_1 has unit norm, |c_ii| ≤ 1, so tr(∆ O_2 W^T O_1) = Σ_{i=1}^d λ_i c_ii ≤ Σ_{i=1}^d λ_i, with equality when O_2 W^T O_1 = (I_d, 0), where 0 is the d × (D − d) matrix with all entries zero. This says that W_c = O_1 Λ O_2, where Λ = [I_d; 0_{(D−d)×d}], is the maximizer of tr(W^T B). The so-obtained W_c is the minimizer of f(W) on St(d, D) and is thus the center-of-mass.

B PROOF OF THEOREM 2

Recall that Gr(d, D) = St(d, D)/O(d). Every point W on St(d, D) corresponds to an equivalence class [W] = {W O_d : O_d ∈ O(d)}, which is a point on Gr(d, D). To represent points on Gr(d, D) by matrices, notice that every point on Gr(d, D) corresponds to a unique matrix W W^T, where W ∈ St(d, D).
In this way, we can define the projected Frobenius distance between two classes [W_1] and [W_2] in Gr(d, D) as d_pF²([W_1], [W_2]) = (1/2) ‖W_1 W_1^T − W_2 W_2^T‖_F².

Table 1: Nearest-neighbor classification success rates.


Theorem 2 (Grassmann center-of-mass with respect to projected Frobenius norm) Set Ω_j = (Σ_{j=1}^l w_j)^{−1} w_j. Consider the singular value decomposition of the symmetric matrix Σ_{j=1}^l Ω_j W_j W_j^T = Q ∆ Q^T, where Q ∈ O(D), ∆ = diag(σ_1², ..., σ_D²), and σ_1² ≥ ... ≥ σ_D² ≥ 0. Then the Grassmann center-of-mass with respect to the distance given by the projected Frobenius norm is the equivalence class [W_c] determined by W_c = QΛ, where Λ = [I_d; 0_{(D−d)×d}].

Proof. Set M = W W^T and M_j = W_j W_j^T; then we are looking for a minimizer of g(M) = Σ_{j=1}^l Ω_j ‖M − M_j‖_F² over all M of the form M = W W^T, W ∈ St(d, D). It is easy to verify that ⟨A, B⟩ = tr(A^T B) is an inner product and that ‖M‖_F² = ⟨M, M⟩. Thus we write ‖M − M_j‖_F² = ⟨M, M⟩ − 2⟨M, M_j⟩ + ⟨M_j, M_j⟩. Since ⟨M, M⟩ = tr(W W^T W W^T) = tr(I_d) = d, and likewise ⟨M_j, M_j⟩ = d, while Σ_{j=1}^l Ω_j = 1, we obtain g(M) = 2d − 2⟨M, Σ_{j=1}^l Ω_j M_j⟩ = 2d − 2 tr(M Q ∆ Q^T).

So the minimum is attained when tr(M Q ∆ Q^T) is maximal. Since M = W W^T is an orthogonal projection of rank d, we have W W^T = P V P^T, where P is an orthogonal matrix of size D × D and V = diag(1, 1, ..., 1, 0, 0, ..., 0) is a D × D matrix with rank(V) = d; moreover, P can be chosen so that its first d columns are the columns of W. Let the orthogonal matrix O = Q^{−1} P. Then we further have tr(M Q ∆ Q^T) = tr(P V P^T Q ∆ Q^T) = tr(V O^T ∆ O) = Σ_{i=1}^d (O^T ∆ O)_{ii} ≤ σ_1² + ... + σ_d², with equality when O = I_D, i.e., P = Q. We then conclude that the minimum of g is attained at M = Q V Q^T = W_c W_c^T with W_c = QΛ, which is the claimed Grassmann center-of-mass.

