MIXED-FEATURES VECTORS & SUBSPACE SPLITTING

Abstract

Motivated by metagenomics, recommender systems, dictionary learning, and related problems, this paper introduces subspace splitting (SS): the task of clustering the entries of what we call a mixed-features vector, that is, a vector whose subsets of coordinates agree with a collection of subspaces. We derive precise identifiability conditions under which SS is well-posed, thus providing the first fundamental theory for this problem. We also propose the first three practical SS algorithms, each with advantages and disadvantages: a random sampling method, a projection-based greedy heuristic, and an alternating Lloyd-type algorithm; all three accommodate noise, outliers, and missing data. Our extensive experiments characterize the performance of our algorithms, and, in the absence of other SS algorithms, we compare for reference against methods for tightly related problems, such as robust matched subspace detection and maximum feasible subsystem, which are simpler special cases of SS.

1. INTRODUCTION

As the reach of data science expands, and as we continuously improve our sensing, storage, and computing capabilities, data in virtually all fields of science is becoming increasingly high-dimensional. For example, the CERN Large Hadron Collider currently "generates so much data that scientists must discard the overwhelming majority of it, hoping that they've not thrown away anything useful" [1], and the upcoming Square Kilometre Array is expected to produce 100 times that [2]. Fortunately, high-dimensional data often has an underlying low-dimensional structure. Inferring such structure not only cuts memory and computational burdens, but also reduces noise and improves learning and prediction. However, higher dimensionality not only increases computational requirements; it also increases the complexity of the data's structure. In light of this, several research lines have explored new low-dimensional models that best summarize data, ranging from principal component analysis (PCA) [3-11] and single subspaces [12-21] to unions of subspaces [22-40] and algebraic varieties [41].

This paper introduces mixed-features vectors (MFVs): a new model that describes the underlying structure of data arising from several modern applications, structure that is not captured by existing low-dimensional models. The main idea is that each entry of an MFV comes from one of several classes, and that the entries of the same class lie in an underlying subspace. In particular, MFVs are motivated by metagenomics [42-46] and recommender systems [47-59]: in metagenomics, each gene segment comes from one of the several taxa present in a microbiome; in recommender systems, each rating may come from one of several users sharing the same account. However, MFVs also have applications in robust estimation (e.g., robust PCA [3-11] and robust dictionary learning [60-67]), matrix completion [48-59], subspace clustering [22-39], and more.

This paper also introduces subspace splitting (SS): the task of clustering the entries of an MFV according to its underlying subspaces. SS is tightly related to other machine learning problems. In particular, SS can be viewed as a generalization of robust matched subspace detection (RMSD) [12-17] and maximum feasible subsystem (MAXFS) [68-74]. However, the added complexity of SS renders existing approaches to these problems inapplicable, which calls for specialized SS theory and methods. Accordingly, (i) we derive precise identifiability conditions under which SS is well-posed, and (ii) we propose the first three SS algorithms.
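
To fix ideas, the following minimal NumPy sketch (ours, for illustration only; it is not one of the paper's algorithms, and the construction and variable names are our own assumptions) builds a toy MFV whose coordinates split across K = 2 low-dimensional subspaces, then verifies entry-wise subspace membership with a least-squares residual.

    import numpy as np

    rng = np.random.default_rng(0)
    d, r, K = 12, 2, 2  # ambient dimension, subspace dimension, number of classes

    # Each class k has a basis U_k in R^{d x r}; the MFV x takes its entries on
    # the coordinate set omega_k from one point in span(U_k).
    U = [rng.standard_normal((d, r)) for _ in range(K)]   # subspace bases
    full = [Uk @ rng.standard_normal(r) for Uk in U]      # one point per subspace
    labels = np.repeat(np.arange(K), d // K)              # class of each coordinate
    rng.shuffle(labels)

    x = np.empty(d)
    for k in range(K):
        omega = np.flatnonzero(labels == k)               # coordinates of class k
        x[omega] = full[k][omega]                         # entries agree with U_k

    # Sanity check: restricted to omega_k, x lies in the span of the restricted
    # basis U_k[omega_k, :], so the least-squares residual is numerically zero.
    for k in range(K):
        omega = np.flatnonzero(labels == k)
        Uo = U[k][omega, :]
        theta, *_ = np.linalg.lstsq(Uo, x[omega], rcond=None)
        print(k, np.linalg.norm(Uo @ theta - x[omega]))   # ~0 for the true class

Restricted to the wrong class's coordinates, the analogous residual is generically nonzero; the identifiability conditions derived in this paper formalize when such assignments are unambiguous.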

