MIXED-FEATURES VECTORS & SUBSPACE SPLITTING

Abstract

Motivated by metagenomics, recommender systems, dictionary learning, and related problems, this paper introduces subspace splitting (SS): the task of clustering the entries of what we call a mixed-features vector, that is, a vector whose subsets of coordinates agree with a collection of subspaces. We derive precise identifiability conditions under which SS is well-posed, thus providing the first fundamental theory for this problem. We also propose the first three practical SS algorithms, each with its own advantages and disadvantages: a random sampling method, a projection-based greedy heuristic, and an alternating Lloyd-type algorithm; all three allow noise, outliers, and missing data. Our extensive experiments characterize the performance of our algorithms, and in the absence of other SS algorithms, for reference we compare against methods for tightly related problems, such as robust matched subspace detection and maximum feasible subsystem, which are simpler special cases of SS.

1. INTRODUCTION

As the reach of data science expands, and as we continuously improve our sensing, storage, and computing capabilities, data in virtually all fields of science keeps becoming increasingly high-dimensional. For example, the CERN Large Hadron Collider currently "generates so much data that scientists must discard the overwhelming majority of it, hoping that they've not thrown away anything useful" [1], and the upcoming Square Kilometer Array is expected to produce 100 times that [2]. Fortunately, high-dimensional data often has an underlying low-dimensional structure. Inferring such structure not only cuts memory and computational burdens, but also reduces noise and improves learning and prediction. However, higher dimensionality not only increases computational requirements; it also increases the structural complexity of the data. In light of this, several research lines have explored new low-dimensional models that best summarize data, going from principal component analysis (PCA) [3-11] and single subspaces [12-21] to unions of subspaces [22-40] and algebraic varieties [41]. This paper introduces mixed-features vectors (MFV's): a new model that describes the underlying structure of data arising from several modern applications, a structure that is not captured by existing low-dimensional models. The main idea is that each entry of an MFV comes from one out of several classes, and that the entries of the same class lie in an underlying subspace. In particular, MFV's are motivated by metagenomics [42-46] and recommender systems [47-59]: in metagenomics, each gene segment comes from one of the several taxa present in a microbiome; in recommender systems, each rating may come from one of several users sharing the same account.
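To make the model above concrete, here is a minimal numerical sketch of how an MFV is assembled: each block of coordinates is drawn from its own low-dimensional subspace. All sizes, the partition, and the variable names below are illustrative choices of ours, not from any dataset or from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 9, 3                                 # ambient dimension, number of classes
dims = [2, 1, 2]                            # dim(U^k) for each class (illustrative)
blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # Omega_1, ..., Omega_K (no outliers here)

# A generic basis for each low-dimensional subspace U^k of R^d.
bases = [rng.standard_normal((d, r)) for r in dims]

# Assemble the mixed-features vector: the coordinates in Omega_k come from a
# generic point of U^k, restricted to that block of coordinates.
x = np.empty(d)
for k in range(K):
    point = bases[k] @ rng.standard_normal(dims[k])  # a generic point of U^k
    x[blocks[k]] = point[blocks[k]]                  # keep only the block Omega_k
```

By construction, each restriction x_{Omega_k} lies exactly in the restricted subspace U^k_{Omega_k}, which is the defining property of an MFV.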
However, MFV's also have applications in robust estimation (e.g., robust dictionary learning [60-67]), matrix completion [48-59], subspace clustering [22-39], and more. This paper also introduces subspace splitting (SS): the task of clustering the entries of an MFV according to its underlying subspaces. SS is tightly related to other machine learning problems. In particular, SS can be thought of as a generalization of robust matched subspace detection (RMSD) [12-17] and maximum feasible subsystem (MAXFS) [68-74]. However, the added complexity of SS renders existing approaches for these problems inapplicable, which calls for specialized SS theory and methods. In these regards, (i) we derive precise identifiability conditions under which SS is well-posed, and (ii) we propose the first three SS algorithms.

Let U^1, ..., U^K be subspaces of R^d, and let Ω_0, Ω_1, ..., Ω_K denote a partition of [d] := {1, ..., d}. For any subspace, matrix, or vector that is compatible with a set of indices Ω ⊂ [d], we will use the subscript Ω to denote its restriction to the coordinates/rows in Ω. For example, U^1_{Ω_K} ⊂ R^{|Ω_K|} denotes the restriction of U^1 to the coordinates in Ω_K. Define x ∈ R^d as the mixed-features vector (MFV) such that x_{Ω_k} ∈ U^k_{Ω_k} for each k = 1, ..., K, and the entries of x_{Ω_0} are outliers. Let ε ∈ R^d denote a noise vector with variance σ^2. Given U^1, ..., U^K and an incomplete observation y_Ω = x_Ω + ε_Ω, the goal of subspace splitting (SS) is to determine the subsets Ω_1 ∩ Ω, ..., Ω_K ∩ Ω indicating the observed coordinates of y that match each subspace.

Example 1.
Consider the following setup, with 1-dimensional subspaces U^1, U^2 spanned by the basis vectors U^1, U^2:

    U^1 = [1, 1, 1, 1, 1, 1]^T,
    U^2 = [1, 2, 3, 4, 5, 6]^T,
    x   = [1/2, 1/2, 6, 8, 9, 10]^T,
    ε   = [0.01, -0.01, -0.1, 0.1, -0.1, 0.1]^T,
    y_Ω = [0.51, 0.49, 5.9, 8.1, 8.9, •]^T,

where • marks the unobserved sixth coordinate, so Ω = {1, ..., 5}. It is easy to see that Ω_1 = {1, 2}, Ω_2 = {3, 4}, and Ω_0 = {5}, because x_{Ω_1} = (1/2) U^1_{Ω_1} and x_{Ω_2} = 2 U^2_{Ω_2}.

The keen reader will immediately wonder: is there another partition {Ω'_1, ..., Ω'_K}, different from {Ω_1, ..., Ω_K}, such that x_{Ω'_k} ∈ U^k_{Ω'_k} for every k? In other words, is this problem well-posed, and if so, under what conditions? Our main theoretical result answers this question, showing that under the following assumptions, Ω_k can be recovered if and only if it has more elements than the dimension of U^k.

A1 Each U^k is drawn independently with respect to the uniform measure over the Grassmannian.

A2 Each x_{Ω_k} is drawn independently according to a distribution that is absolutely continuous with respect to the Lebesgue measure on U^k_{Ω_k}.

In words, A1 essentially requires that U^1, ..., U^K are in general position, with no particular relation to one another. Similarly, A2 requires that each piece of x is in general position over its corresponding piece of subspace. This type of genericity assumption is becoming increasingly common in compressed sensing, matrix completion, subspace clustering, tensor theory, and related problems [10, 21, 30, 31, 33-38, 41, 56-59]. All our statements hold with probability 1 with respect to the measures in A1 and A2. We point out that A1 and A2 do not imply coherence or affinity assumptions (other typical assumptions in related theory, which quantify alignment with the canonical axes or between subspaces [3, 6, 11, 23, 26, 27, 48-50, 54, 55]), nor vice versa. For example, bounded coherence and affinity assumptions indeed allow subspaces perfectly aligned on some coordinates.
However, they rule out cases that our assumptions allow, for example the non-zero-measure set of highly coherent or affine subspaces that are somewhat aligned with the canonical axes or with one another. To sum up, these assumptions are different from, neither stronger nor weaker than, the usual coherence and affinity assumptions. With this, we are ready to state our main theorem, showing that subspace splitting is possible if and only if x contains more than dim(U^k) entries from each subspace U^k.

Theorem 1. Suppose A1 and A2 hold. Given x and U^1, ..., U^K, one can identify Ω_1, ..., Ω_K if and only if |Ω_k| > dim(U^k) for every k.

Example 1 shows a case where the conditions of Theorem 1 are met (|Ω_1| = |Ω_2| = 2 > dim(U^1) = dim(U^2) = 1), and consequently subspace splitting is well-posed (there exists no partition other than the true {Ω_1, Ω_2} that splits x into U^1 and U^2). Conversely, the following example shows a case where the conditions of Theorem 1 are not satisfied, and at least some Ω_k is unidentifiable.
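The splitting in Example 1 can be reproduced numerically by exhaustive search: label each observed coordinate with a subspace (or as an outlier), keep only labelings whose blocks fit their subspaces within the noise level, discard blocks with |Ω_k| ≤ dim(U^k) as uninformative (the counting condition of Theorem 1), and prefer the labeling with fewest outliers. This brute-force sketch is ours, not one of the paper's three algorithms, and the 0.2 residual tolerance is hand-picked for the noise in the example.

```python
import numpy as np
from itertools import product

# Data of Example 1, in 0-based coordinates; coordinate 5 is unobserved.
U1 = np.ones((6, 1))                       # basis of the first subspace
U2 = np.arange(1.0, 7.0).reshape(6, 1)     # basis of the second subspace
omega = [0, 1, 2, 3, 4]                    # observed coordinates
y = np.array([0.51, 0.49, 5.9, 8.1, 8.9])  # observed entries

def fits(U, coords, vals, tol=0.2):
    """Is vals within tol of the restricted subspace spanned by U[coords, :]?"""
    rows = U[coords, :]
    c, *_ = np.linalg.lstsq(rows, vals, rcond=None)
    return np.linalg.norm(rows @ c - vals) <= tol

best = None  # partition (Omega_0, Omega_1, Omega_2) with fewest outliers
for labels in product([0, 1, 2], repeat=len(omega)):
    parts = {k: [i for i, lab in enumerate(labels) if lab == k] for k in (0, 1, 2)}
    ok = True
    for k, U in ((1, U1), (2, U2)):
        idx = parts[k]
        if len(idx) == 1:      # Theorem 1: need |Omega_k| > dim(U^k) = 1
            ok = False
            break
        if idx and not fits(U, [omega[i] for i in idx], y[idx]):
            ok = False
            break
    if ok and (best is None or len(parts[0]) < len(best[0])):
        best = (parts[0], parts[1], parts[2])

# best recovers Omega_0 = {5}, Omega_1 = {1, 2}, Omega_2 = {3, 4}
# (in 0-based indexing: ([4], [0, 1], [2, 3])).
```

The singleton exclusion matters: a single coordinate fits any generic 1-dimensional subspace exactly, so without the |Ω_k| > dim(U^k) requirement the search would admit many spurious splittings.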

