SPARSE QUANTIZED SPECTRAL CLUSTERING

Abstract

Given a large data matrix, sparsifying, quantizing, and/or performing other entrywise nonlinear operations can have numerous benefits, ranging from speeding up iterative algorithms for core numerical linear algebra problems to providing nonlinear filters used in state-of-the-art neural network models. Here, we exploit tools from random matrix theory to make precise statements about how the eigenspectrum of a matrix changes under such nonlinear transformations. In particular, we show that very little change occurs in the informative eigenstructure, even under drastic sparsification/quantization, and consequently that very little downstream performance loss occurs when working with very aggressively sparsified or quantized spectral clustering problems. We illustrate how these results depend on the nonlinearity, we characterize a phase transition beyond which spectral clustering becomes possible, and we show when such nonlinear transformations can introduce spurious non-informative eigenvectors.

1. INTRODUCTION

Sparsifying, quantizing, and/or performing other entry-wise nonlinear operations on large matrices can have many benefits. Historically, such operations have been used to develop iterative algorithms for core numerical linear algebra problems (Achlioptas & McSherry, 2007; Drineas & Zouzias, 2011). More recently, they have been used to design better neural network models (Srivastava et al., 2014; Dong et al., 2019; Shen et al., 2020). A concrete example, amenable to theoretical analysis and ubiquitous in practice, is provided by spectral clustering, which can be solved by retrieving the dominant eigenvectors of XᵀX, for X = [x₁, …, xₙ] ∈ R^{p×n} a large data matrix (Von Luxburg, 2007). When the amount of data n is large, the Gram "kernel" matrix XᵀX can be enormous, impractical even to form, and can lead to computationally unaffordable algorithms. For instance, the Lanczos iteration, which operates through repeated matrix-vector multiplications, suffers from an O(n²) per-iteration complexity (Golub & Van Loan, 2013) and quickly becomes burdensome. One approach to overcoming this limitation is simple subsampling: dividing X into subsamples of size εn, for some ε ∈ (0, 1), on which one performs parallel computations before recombining. This yields a computational gain, but at the cost of degraded performance, since each data point xᵢ loses the cumulative effect of being compared to the whole dataset. An alternative cost-reduction procedure consists in uniformly randomly "zeroing out" entries of the whole matrix XᵀX, resulting in a sparse matrix with only an ε fraction of nonzero entries. For spectral clustering, by focusing on the eigenspectrum of the "zeroed-out" matrix, Zarrouk et al. (2020) showed that the same computational gain can be achieved at the cost of much less degraded performance: for n/p rather large, almost no degradation is observed down to very small values of ε (e.g., ε ≈ 2% for n/p around 100).
Previous efforts showed that it is often advantageous to perform sparsification/quantization in a non-uniform manner, rather than uniformly (Achlioptas & McSherry, 2007; Drineas & Zouzias, 2011). The focus there, however, is often on (non-asymptotic bounds of) the approximation error between the original and the sparsified/quantized matrices. This does not provide direct access to the actual performance of spectral clustering or other downstream tasks of interest, e.g., since the top eigenvectors are known to exhibit a phase transition phenomenon (Baik et al., 2005; Saade et al., 2014). That is, they can behave very differently from those of the original matrix, even if the matrix after treatment is close to the original in operator or Frobenius norm. Here, we focus on a precise characterization of the eigenstructure of XᵀX after entry-wise nonlinear transformations such as sparsification or quantization, in the large n, p regime, performing non-uniform sparsification and/or quantization (down to binarization) simultaneously. We consider a simple mixture data model with x ∼ N(±µ, I_p) and let K ≡ f(XᵀX/√p)/√p, where f is an entry-wise thresholding/quantization operator (thereby zeroing out/quantizing entries of XᵀX); we prove that, at the same computational cost, this leads to significantly better spectral clustering performance than uniform sparsification, and at a much reduced storage cost thanks to quantization. The only (non-negligible) additional cost arises from the need to evaluate each entry of XᵀX.
Our main technical contribution (of independent interest, e.g., for those interested in entry-wise nonlinear transformations of feature matrices) consists in using random matrix theory (RMT) to derive the large n, p asymptotics of the eigenspectrum of K = f(XᵀX/√p)/√p for a wide range of functions f, and then comparing to previously-established results for uniform subsampling and sparsification in (Zarrouk et al., 2020). Experiments on real-world data further corroborate our findings. Our main contributions are the following.

1. We derive the limiting eigenvalue distribution of K as n, p → ∞ (Theorem 1), and we identify: (a) the existence of non-informative and isolated eigenvectors of K for some f (Corollary 1); (b) in the absence of such eigenvectors, a phase transition in the dominant eigenvalue-eigenvector (λ̂, v̂) pair (Corollary 2): if the signal-to-noise ratio (SNR) ‖µ‖² of the data exceeds a certain threshold γ, then λ̂ becomes isolated from the main bulk (Von Luxburg, 2007; Joseph & Yu, 2016; Baik et al., 2005) and v̂ contains data class-structure information exploitable for clustering; if not, then v̂ contains only noise and is asymptotically orthogonal to the class-label vector.

2. Letting f be a sparsification, quantization, or binarization operator, we propose: (a) a selective non-uniform sparsification operator under which XᵀX can be drastically sparsified with very little degradation in clustering performance (Proposition 1 and Section 4.2), significantly outperforming the random uniform sparsification scheme of (Zarrouk et al., 2020); (b) for a given matrix storage budget (i.e., a fixed number of bits to store K), an optimal design of the quantization/binarization operators (Proposition 2 and Section 4.3), whose performance is compared against the original XᵀX and its sparsified but not quantized version.
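To make the entry-wise operators f concrete, the following sketch implements a selective (magnitude-based) hard-thresholding operator and a thresholded binarization operator on a Gram matrix. The threshold value s and all dimensions are our own illustrative choices; the paper's optimal designs are derived later (Propositions 1 and 2), not reproduced here.

```python
import numpy as np

def hard_threshold(K, s):
    """Selective sparsification: zero out entries with |K_ij| <= s, since
    small Gram-matrix entries carry mostly noise (s is an illustrative choice)."""
    return np.where(np.abs(K) > s, K, 0.0)

def binarize(K, s):
    """Thresholding + 1-bit quantization: keep only the sign of entries
    whose magnitude exceeds s, drastically reducing storage per kept entry."""
    return np.sign(K) * (np.abs(K) > s)

rng = np.random.default_rng(1)
p, n = 64, 256
X = rng.standard_normal((p, n))       # pure-noise data, for illustration
K = X.T @ X / np.sqrt(p)              # matrix to which f is applied entry-wise

Ks = hard_threshold(K, s=1.0)         # sparsified, real-valued entries
Kb = binarize(K, s=1.0)               # same sparsity pattern, entries in {-1, 0, 1}
kept = np.mean(Ks != 0)               # fraction of entries surviving the threshold
```

Both operators share the same sparsity pattern; binarization simply discards the magnitudes of the surviving entries, which is what makes the storage comparison at fixed bit budget meaningful.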
For spectral clustering, the surprisingly small performance drop, accompanied by a huge reduction in computational cost, contributes to improved algorithms for large-scale problems. More generally, our analysis sheds light on the effect of entry-wise nonlinear transformations on the eigenspectra of data/feature matrices. Thus, looking forward (and perhaps more importantly, given the use of nonlinear transformations in designing modern neural network models, as well as the recent interest in applying RMT to neural network analyses (Dobriban et al., 2018; Li & Nguyen, 2018; Seddik et al., 2018; Jacot et al., 2019; Liu & Dobriban, 2019)), we expect our results to open the door to improved analyses of computationally efficient methods for large-dimensional machine learning and neural network models more generally.

2. SYSTEM MODEL AND PRELIMINARIES

Basic setup. Let x₁, …, xₙ ∈ Rᵖ be independently drawn (not necessarily uniformly) from a two-class mixture of C₁ and C₂ with

    C₁ : xᵢ = −µ + zᵢ,    C₂ : xᵢ = +µ + zᵢ,    (1)

with zᵢ ∈ Rᵖ having i.i.d. zero-mean, unit-variance, κ-kurtosis, sub-exponential entries, µ ∈ Rᵖ such that ‖µ‖² → ρ ≥ 0 as p → ∞, and v ∈ {±1}ⁿ with [v]ᵢ = −1 for xᵢ ∈ C₁ and +1 for xᵢ ∈ C₂. The data matrix X = [x₁, …, xₙ] ∈ R^{p×n} can then be compactly written as

    X = Z + µvᵀ,

for Z = [z₁, …, zₙ].
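This mixture model is straightforward to instantiate numerically. The sketch below uses Gaussian zᵢ for simplicity (any zero-mean, unit-variance, sub-exponential entries would fit the model), balanced classes, and illustrative values of p, n, and ρ of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(42)
p, n, rho = 128, 1024, 4.0                # dimensions and limiting SNR ||mu||^2 -> rho

mu = rng.standard_normal(p)
mu *= np.sqrt(rho) / np.linalg.norm(mu)   # enforce ||mu||^2 = rho exactly
v = np.repeat([-1.0, 1.0], n // 2)        # class-label vector in {-1, +1}^n
Z = rng.standard_normal((p, n))           # i.i.d. zero-mean, unit-variance noise

X = Z + np.outer(mu, v)                   # compact form: X = Z + mu v^T
```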



Throughout, the norm ‖·‖ denotes the Euclidean norm for vectors and the operator norm for matrices.

