GENERALIZED MATRIX LOCAL LOW RANK REPRESENTATION BY RANDOM PROJECTION AND SUBMATRIX PROPAGATION

Abstract

Detecting distinct submatrices with a low rank property is a highly desirable matrix representation learning technique for the ease of data interpretation, called matrix local low rank representation (MLLRR). Based on different mathematical assumptions about the local patterns, the MLLRR problem can be categorized into two sub-problems, namely local constant variation (LCV) and local linear low rank (LLR). Existing solutions to MLLRR have focused only on the LCV problem, which misses a substantial amount of true and interesting patterns. In this work, we develop a novel matrix computational framework called RPSP (Random Probing-based Submatrix Propagation) that provides an effective solution to both the LCV and LLR problems. RPSP detects local low rank patterns that grow from small seed submatrices of low rank property, which are identified by a random projection approach. RPSP is supported by theories of random projection. Experiments on synthetic data demonstrate that RPSP outperforms all state-of-the-art methods, with the capacity to robustly and correctly identify low rank submatrices under both the LCV and LLR settings. On real-world datasets, RPSP also demonstrates its effectiveness in identifying interpretable local low rank matrices.

1. INTRODUCTION

Matrix approximation has found wide-ranging utility in recommendation systems, computer vision, and text mining. Traditional matrix low rank approximation methods, such as truncated singular value decomposition (SVD) and rank minimization, assume that the observed matrix has a globally low rank, indicating that the low rank components are dense. This becomes challenging in the phase of data interpretation. In real world data, both features and incidences may form sparse subspace structures. Among existing approaches are anchor-based methods that first pinpoint local regions using a primitive similarity measure, and then conduct a low rank fitting to each anchored region Lee et al. (2016). Notably, all these methods rely on the assumption that the low rank submatrix has a distinctly spiked mean compared to the background Yang et al. (2014). Identification of submatrices under a more general definition of the low rank property remains unsolved, including the cases where: (1) the submatrices have a low rank but a similar mean compared to the background, (2) the background noise matrix has heterogeneous and even contaminated distributions, and (3) the submatrices are of a small size or have heterogeneous error distributions.

In this study, we revisit the task of MLLRR from the perspective of random projection and develop a new framework, namely RPSP (Random Probing-based Submatrix Propagation). RPSP first evaluates the low rankness of a large set of randomly sampled small submatrices, and gradually grows the low rank ones using a propagation strategy. RPSP adopts a random projection approach to approximate the singular values of submatrices, which drastically improves the computational efficiency compared to the conventional QR decomposition-based computation. We systematically benchmarked RPSP against state-of-the-art (SOTA) methods on comprehensively simulated data and two real-world datasets. RPSP outperformed all SOTA methods across different scenarios.
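The random-probing step described above can be illustrated with a standard randomized sketch: the singular values of a sampled submatrix are approximated by projecting it onto a small Gaussian test matrix, so no full decomposition of the submatrix is needed. The function name and parameters below are hypothetical; this is a generic randomized-SVD sketch, not the paper's exact implementation.

```python
import numpy as np

def approx_top_singular_values(A, k=5, oversample=5, seed=None):
    """Approximate the top-k singular values of A via a Gaussian random
    projection (randomized range finder). Illustrative only."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    l = min(n, k + oversample)
    Omega = rng.standard_normal((n, l))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)        # orthonormal basis for range(A @ Omega)
    B = Q.T @ A                           # small (l x n) projected matrix
    return np.linalg.svd(B, compute_uv=False)[:k]

# Example: a noisy rank-1 submatrix shows one dominant singular value.
rng = np.random.default_rng(1)
A = np.outer(rng.standard_normal(30), rng.standard_normal(20))
A += 0.01 * rng.standard_normal(A.shape)
s = approx_top_singular_values(A, k=3, seed=0)
print(s)  # first value dominates the rest -> evidence of local low rankness
```

The spectral gap s[0] ≫ s[1] is the kind of signal a random probe can use to flag a submatrix as a low rank seed.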
RPSP is shown to have the unique capability to handle heteroscedastic error distributions, and to distinguish a true local low rank matrix from background noise with or without a spiked mean structure. Application of RPSP to real world datasets demonstrated its capability in detecting contextually meaningful local low rank matrices. The key contributions of this work include: (1) RPSP is the first general solution for the LCV and previously unsolved LLR problems: compared to existing methods, RPSP is the only method that can robustly solve the MLLRR problem when (i) the patterns are small, (ii) the mean of the patterns is not necessarily distinct from the background, (iii) the background error is non-Gaussian or heterogeneous, and (iv) there is a large number of low rank submatrices of different sizes and ranks. (2) A new perspective in analyzing and embedding local low rankness: we developed a new framework for computing and propagating the local low rankness of submatrices, by estimating the singular values of randomly sampled small submatrices and propagating small low rank submatrices to gradually grow into larger ones. This framework directly computes the probability of local low rankness, which also serves as an interpretable embedding of the matrix data. (3) A theoretical framework developed from mathematical theories of random projection that supports: (i) the identifiability of local low rankness, (ii) bounds on sensitivity and specificity, and (iii) the impact of errors and pattern sizes with respect to the setting of RPSP's hyperparameters. (4) An efficient computation of singular values: a random projection and parallel computing-based method on GPU was developed to drastically increase the efficiency of computing singular values for a large set of small matrices.
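To make the propagation idea concrete, the following is a minimal greedy sketch of growing a low rank seed: starting from a small submatrix assumed to be near rank-1, any row whose restriction to the seed's columns lies close to the seed's dominant row-direction is absorbed. The function `grow_submatrix` and its tolerance rule are hypothetical illustrations, not the paper's actual propagation algorithm.

```python
import numpy as np

def grow_submatrix(X, rows, cols, tol=0.1):
    """Greedy propagation sketch (hypothetical): absorb rows of X whose
    restriction to `cols` is nearly parallel to the seed's dominant
    right singular vector."""
    seed = X[np.ix_(rows, cols)]
    v = np.linalg.svd(seed)[2][0]            # dominant shared row-direction
    grown = set(rows)
    for i in range(X.shape[0]):
        x = X[i, cols]
        resid = x - (x @ v) * v              # component off the shared direction
        if np.linalg.norm(resid) <= tol * np.linalg.norm(x):
            grown.add(i)
    return sorted(grown), cols

# Plant a noisy rank-1 block in rows 0-9, cols 0-7 of a noise background.
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 40))
u = rng.uniform(1, 2, 10)
b = rng.uniform(1, 2, 8)
X[:10, :8] = np.outer(u, b) + 0.01 * rng.standard_normal((10, 8))

rows, cols = grow_submatrix(X, [0, 1, 2], list(range(8)), tol=0.1)
print(rows)  # the seed propagates to cover the planted block
```

Background rows are rejected because a random direction in 8 dimensions is very unlikely to align with the seed's dominant direction within the tolerance.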

2.1. NOTATIONS AND MATHEMATICAL BACKGROUND

We denote a matrix X of M rows and N columns as X_{M×N}, and its (i, j)-th entry as X_{ij}. We use I_k ⊂ {1, ..., M} and J_k ⊂ {1, ..., N} to denote row and column indices, and X_{I_k×J_k} denotes the submatrix indexed by I_k × J_k. ||X||_1, ||X||_2, and ||X||_* denote the element-wise L1, L2, and nuclear norms of a matrix, where the nuclear norm is the sum of all singular values of X. For Z ∈ R^{M×N}, we use Rank(Z) to denote the rank of the matrix. Rank(Z) = r if and only if Z = UV^T, where U ∈ R^{M×r} and V ∈ R^{N×r} are two orthogonal rank-r matrices. To say that Z has a low rank property, we mean r ≪ min(M, N). Intuitively, a random matrix will almost surely have a rank of min(M, N). The low rank property of a matrix with added background noise is commonly characterized by truncated SVD, as defined below.
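The notation above can be checked numerically. The short example below (illustrative, not from the paper) builds a rank-2 matrix Z = UV^T with orthonormal U and V, and evaluates Rank(Z) along with the three norms as defined.

```python
import numpy as np

# Build Z = U V^T with M=50, N=40, r=2, where U and V have orthonormal columns.
rng = np.random.default_rng(0)
M, N, r = 50, 40, 2
U = np.linalg.qr(rng.standard_normal((M, r)))[0]
V = np.linalg.qr(rng.standard_normal((N, r)))[0]
Z = U @ V.T

print(np.linalg.matrix_rank(Z))                  # Rank(Z) = r = 2 << min(M, N)
print(np.abs(Z).sum())                           # element-wise L1 norm ||Z||_1
print(np.linalg.norm(Z, "fro"))                  # element-wise L2 (Frobenius) norm ||Z||_2
print(np.linalg.svd(Z, compute_uv=False).sum())  # nuclear norm ||Z||_* (sum of singular values)
```

Since U and V are orthonormal, all r singular values of Z equal 1, so here ||Z||_* = 2 and ||Z||_2 = √2.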



As illustrated in Fig 1, a matrix can be generated as the sum of a series of local low rank matrices, each consisting of a sparse set of features and incidences. One example of such a 'locality' property is purchase history data, where a subset of items were purchased for a common reason by a subset of customers, while neither the items bought together nor the users sharing a common purchase reason is known Cheng et al. (2014). Similarly, in biological single cell RNA-sequencing data, a subgroup of genes may be regulated by an unknown signal that is activated only in a subset of cells, which forms a local low rank gene co-regulation module Xia et al. (2017); Wan et al. (2019a); Chang et al. (2020). In addition, shapes, numbers, and words in imaging data are also locally low rank Lee et al. (2016). In these situations, Matrix Local Low Rank Representation (MLLRR) is more advantageous, with its locality assumptions, to uncover more interpretable patterns hidden in the data.
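A toy version of this generative view can be written down directly: an observed matrix is a noise background plus a few local rank-1 blocks, each supported on a sparse set of rows (features) and columns (incidences). The block locations and sizes below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 60, 50
X = 0.1 * rng.standard_normal((M, N))        # background noise

def add_local_rank1_block(X, rows, cols, rng):
    """Add a local rank-1 pattern supported on rows x cols (illustrative)."""
    u = rng.uniform(1, 2, size=len(rows))
    v = rng.uniform(1, 2, size=len(cols))
    X[np.ix_(rows, cols)] += np.outer(u, v)

add_local_rank1_block(X, range(0, 12), range(0, 10), rng)    # e.g. one purchase motif
add_local_rank1_block(X, range(30, 45), range(25, 40), rng)  # e.g. one co-regulated gene module

print(np.linalg.matrix_rank(X))   # globally full rank because of the noise ...
sub = X[:12, :10]
s = np.linalg.svd(sub, compute_uv=False)
print(s[0] / s[1])                # ... but a planted submatrix is numerically near rank-1
```

This is exactly the regime where a global truncated SVD is uninformative while the local low rank structure remains well defined.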

Figure 1: One example of the Matrix Local Low Rank Representation (MLLRR) Problem.

