SPARSE QUANTIZED SPECTRAL CLUSTERING

Abstract

Given a large data matrix, sparsifying, quantizing, and/or performing other entrywise nonlinear operations can have numerous benefits, ranging from speeding up iterative algorithms for core numerical linear algebra problems to providing nonlinear filters to design state-of-the-art neural network models. Here, we exploit tools from random matrix theory to make precise statements about how the eigenspectrum of a matrix changes under such nonlinear transformations. In particular, we show that very little change occurs in the informative eigenstructure, even under drastic sparsification/quantization, and consequently that very little downstream performance loss occurs when working with very aggressively sparsified or quantized spectral clustering problems. We illustrate how these results depend on the nonlinearity, we characterize a phase transition beyond which spectral clustering becomes possible, and we show when such nonlinear transformations can introduce spurious non-informative eigenvectors.

1. INTRODUCTION

Sparsifying, quantizing, and/or performing other entry-wise nonlinear operations on large matrices can have many benefits. Historically, such operations have been used to develop iterative algorithms for core numerical linear algebra problems (Achlioptas & McSherry, 2007; Drineas & Zouzias, 2011). More recently, they have been used to design better neural network models (Srivastava et al., 2014; Dong et al., 2019; Shen et al., 2020). A concrete example, amenable to theoretical analysis and ubiquitous in practice, is provided by spectral clustering, which can be solved by retrieving the dominant eigenvectors of X^T X, for X = [x_1, . . . , x_n] ∈ R^{p×n} a large data matrix (Von Luxburg, 2007). When the amount of data n is large, the Gram "kernel" matrix X^T X can be enormous, impractical even to form, and leads to computationally unaffordable algorithms. For instance, the Lanczos iteration, which operates through repeated matrix-vector multiplications, suffers from an O(n²) complexity per product (Golub & Van Loan, 2013) and quickly becomes burdensome. One approach to overcoming this limitation is simple subsampling: dividing X into subsamples of size εn, for some ε ∈ (0, 1), on which one performs parallel computation before recombining. This yields a computational gain, but at the cost of degraded performance, since each data point x_i loses the cumulative effect of being compared to the whole dataset. An alternative cost-reduction procedure consists in uniformly randomly "zeroing out" entries of the whole matrix X^T X, resulting in a sparse matrix with only an ε fraction of nonzero entries. For spectral clustering, by focusing on the eigenspectrum of the "zeroed-out" matrix, Zarrouk et al. (2020) showed that the same computational gain can be achieved at the cost of a much smaller performance degradation: for n/p rather large, almost no degradation is observed down to very small values of ε (e.g., ε ≈ 2% for n/p ≈ 100).
Previous efforts showed that it is often advantageous to perform sparsification/quantization in a non-uniform manner, rather than uniformly (Achlioptas & McSherry, 2007; Drineas & Zouzias, 2011). The focus there, however, is often on (non-asymptotic bounds of) the approximation error between the original and the sparsified/quantized matrices. This does not provide direct access to the actual performance on spectral clustering or other downstream tasks of interest, e.g., since the top eigenvectors are known to exhibit a phase transition phenomenon (Baik et al., 2005; Saade et al., 2014). That is, they can behave very differently from those of the original matrix, even if the matrix after treatment is close in operator or Frobenius norm to the original matrix. Here, we focus on a precise characterization of the eigenstructure of X^T X after entry-wise nonlinear transformations such as sparsification or quantization, in the large n, p regime, by performing simultaneously non-uniform sparsification and/or quantization (down to binarization). We consider a simple mixture data model with x ∼ N(±µ, I_p) and let K ≡ f(X^T X/√p)/√p, where f is an entry-wise thresholding/quantization operator (thereby zeroing out or quantizing the entries of X^T X); and we prove that this achieves significantly better spectral clustering performance than uniform sparsification at the same computational cost, with a much reduced storage cost induced by quantization. The only (non-negligible) additional cost arises from the extra need to evaluate each entry of X^T X.
Our main technical contribution (of independent interest, e.g., for those interested in entry-wise nonlinear transformations of feature matrices) consists in using random matrix theory (RMT) to derive the large n, p asymptotics of the eigenspectrum of K = f(X^T X/√p)/√p for a wide range of functions f, and then comparing with previously-established results for uniform subsampling and sparsification in (Zarrouk et al., 2020). Experiments on real-world data further corroborate our findings. Our main contributions are the following.

1. We derive the limiting eigenvalue distribution of K as n, p → ∞ (Theorem 1), and we identify: (a) the existence of non-informative isolated eigenvectors of K for some f (Corollary 1); and (b) in the absence of such eigenvectors, a phase transition in the dominant eigenvalue-eigenvector pair (λ̂, v̂) (Corollary 2): if the signal-to-noise ratio (SNR) ‖µ‖² of the data exceeds a certain threshold γ, then λ̂ becomes isolated from the main bulk (Von Luxburg, 2007; Joseph & Yu, 2016; Baik et al., 2005) and v̂ contains data class-structure information exploitable for clustering; if not, then v̂ contains only noise and is asymptotically orthogonal to the class-label vector.

2. Letting f be a sparsification, quantization, or binarization operator, we exhibit: (a) a selective non-uniform sparsification operator, such that X^T X can be drastically sparsified with very little degradation in clustering performance (Proposition 1 and Section 4.2), significantly outperforming the random uniform sparsification scheme of (Zarrouk et al., 2020); and (b) for a given matrix storage budget (i.e., a fixed number of bits to store K), an optimal design of the quantization/binarization operators (Proposition 2 and Section 4.3), whose performance is compared against the original X^T X and its sparsified but non-quantized version.
For spectral clustering, the surprisingly small performance drop, accompanied by a huge reduction in computational cost, contributes to improved algorithms for large-scale problems. More generally, our proposed analysis sheds light on the effect of entry-wise nonlinear transformations on the eigenspectra of data/feature matrices. Thus, looking forward (and perhaps more importantly, given the use of nonlinear transformations in designing modern neural network models, as well as the recent interest in applying RMT to neural network analyses (Dobriban et al., 2018; Li & Nguyen, 2018; Seddik et al., 2018; Jacot et al., 2019; Liu & Dobriban, 2019)), we expect our results to open the door to improved analyses of computationally efficient methods for large dimensional machine learning, and of neural network models more generally.

2. SYSTEM MODEL AND PRELIMINARIES

Basic setup. Let x_1, . . . , x_n ∈ R^p be independently drawn (not necessarily uniformly) from a two-class mixture of C_1 and C_2, with

C_1: x_i = −µ + z_i,  C_2: x_i = +µ + z_i, (1)

with z_i ∈ R^p having i.i.d. zero-mean, unit-variance, κ-kurtosis, sub-exponential entries, µ ∈ R^p such that ‖µ‖² → ρ ≥ 0 as p → ∞, and v ∈ {±1}^n with [v]_i = −1 for x_i ∈ C_1 and +1 for x_i ∈ C_2.¹ The data matrix X = [x_1, . . . , x_n] ∈ R^{p×n} can be compactly written as X = Z + µv^T for Z = [z_1, . . . , z_n], so that ‖v‖ = √n and both Z and µv^T have operator norm of order O(√p) in the n ∼ p regime. In this setting, the Gram (or linear kernel) matrix X^T X achieves optimal clustering performance on the mixture model (1); see Remark 1 below. However, it is a dense n × n matrix, which quickly becomes expensive to store and to compute with as n increases. Thus, we consider instead the following entry-wise nonlinear transformation of X^T X:

K = {δ_{i≠j} f(x_i^T x_j/√p)/√p}_{i,j=1}^n, (2)

for f : R → R satisfying some regularity conditions (see Assumption 1 below), where δ_{i≠j} equals 1 for i ≠ j and 0 otherwise. The diagonal elements f(x_i^T x_i/√p) (i) bring no additional information for clustering and (ii) do not scale properly for p large (x_i^T x_i/√p = O(√p) instead of O(1)). Thus, following (El Karoui, 2010; Cheng & Singer, 2013), they are discarded. Most of our technical results hold for rather generic functions f, e.g., those of interest beyond sparse quantized spectral clustering, but we are particularly interested in f with nontrivial numerical properties (e.g., promoting quantization and sparsity):

Sparsification: f_1(t) = t · 1_{|t|>√2 s}, (3)
Quantization: f_2(t) = 2^{2−M}(⌊t · 2^{M−2}/(√2 s)⌋ + 1/2) · 1_{|t|≤√2 s} + sign(t) · 1_{|t|>√2 s}, (4)
Binarization: f_3(t) = sign(t) · 1_{|t|>√2 s}. (5)

Here, s ≥ 0 is a truncation threshold, and M ≥ 2 is the number of information bits.²
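In code, the three operators can be sketched as direct NumPy transcriptions of (3)-(5), vectorized over array inputs (the floor-based rounding in f2 follows our reading of (4)):

```python
import numpy as np

def f1(t, s):
    """Sparsification (3): keep t where |t| > sqrt(2)*s, zero elsewhere."""
    return t * (np.abs(t) > np.sqrt(2) * s)

def f2(t, s, M):
    """Quantization (4): floor-based rounding of t onto a uniform grid
    in [-1, 1], saturating to sign(t) beyond the threshold sqrt(2)*s."""
    thr = np.sqrt(2) * s
    levels = 2.0 ** (2 - M) * (np.floor(t * 2.0 ** (M - 2) / thr) + 0.5)
    return np.where(np.abs(t) <= thr, levels, np.sign(t))

def f3(t, s):
    """Binarization (5): the sign of t where |t| > sqrt(2)*s, else 0."""
    return np.sign(t) * (np.abs(t) > np.sqrt(2) * s)
```

Applied entry-wise to X^T X/√p, f1 yields a sparse matrix, f2 a dense but few-bit one, and f3 a sparse one-bit one.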
The visual representations of these f's are given in Figure 1-(left). For f_3, taking s → 0 yields the sign function sign(t). In terms of storage, the quantization f_2 consumes 2^{M−2}+1 bits per nonzero entry, while the binarization f_3 takes values in {±1, 0} and thus consumes 1 bit per nonzero entry.

Random matrix theory. To provide a precise description of the eigenspectrum of K for the nonlinear f of interest, to be used in the context of spectral clustering, we will provide a large dimensional characterization of the resolvent of K, defined for z ∈ C \ R⁺ as

Q(z) ≡ (K − zI_n)^{−1}. (6)

This matrix, which plays a central role in RMT analysis (Bai & Silverstein, 2010), will be used in two primary ways. First, the normalized trace (1/n) tr Q(z) is the so-called Stieltjes transform of the eigenvalue distribution of K, from which the eigenvalue distribution can be recovered and further used to characterize the phase transition beyond which spectral clustering becomes theoretically possible (Corollary 2). Second, for (λ̂, v̂) an "isolated" eigenvalue-eigenvector pair of K and a ∈ R^n a deterministic vector, by Cauchy's integral formula, the "angle" between v̂ and a is given by

|v̂^T a|² = −(1/2πı) ∮_{Γ(λ̂)} a^T Q(z) a dz,

where Γ(λ̂) is a positively oriented contour surrounding λ̂ only. Letting a = v, this will be exploited to characterize the spectral clustering error rate (Proposition 1). From a technical perspective, unlike linear random matrix models, K (and thus Q(z)) involves nonlinear dependence between its entries. To break this difficulty, following the ideas of (Cheng & Singer, 2013), we exploit the fact that, by the central limit theorem, z_i^T z_j/√p → N(0, 1) in distribution as p → ∞. As such, up to the low-rank contribution of µv^T, which is treated separately with a perturbation argument, the entries of K asymptotically behave like a family of dependent standard Gaussian variables to which f is applied.
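The contour-integral identity can be verified numerically. The following self-contained sketch plants an isolated eigenvalue in a small symmetric matrix and recovers |v^T a|² by discretizing a circular contour (the trapezoidal rule is spectrally accurate on such a contour; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# symmetric noise matrix plus a planted rank-one spike -> isolated eigenvalue
A = rng.standard_normal((n, n))
K = (A + A.T) / np.sqrt(2 * n) + 5.0 * np.ones((n, n)) / n

lam, V = np.linalg.eigh(K)
v_top, a = V[:, -1], rng.standard_normal(n)

# circle around the isolated eigenvalue, small enough to exclude the bulk
r = (lam[-1] - lam[-2]) / 2
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)

# -1/(2*pi*i) * contour integral of a^T Q(z) a  ==  (v_top^T a)^2
vals = [a @ np.linalg.solve(K - z * np.eye(n), a)
        for z in lam[-1] + r * np.exp(1j * theta)]
integral = -np.real(np.mean(np.array(vals) * r * np.exp(1j * theta)))
```

The quantity `integral` matches `(v_top @ a) ** 2` to machine precision, which is exactly the mechanism used (with a = v) in the proof of Proposition 1.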
Expanding f in a series of orthogonal polynomials with respect to the Gaussian measure allows one to "unwrap" this dependence. A few words on the theory of orthogonal polynomials (Andrews et al., 1999) are thus convenient to pursue our analysis.

Orthogonal polynomial framework. For a probability measure µ, let {P_l(x), l ≥ 0} be the orthonormal polynomials with respect to ⟨f, g⟩ ≡ ∫ f g dµ, obtained by the Gram-Schmidt procedure on the monomials {1, x, x², . . .}, such that P_0(x) = 1, P_l is of degree l, and ⟨P_{l₁}, P_{l₂}⟩ = δ_{l₁−l₂}. By the Riesz-Fischer theorem (Rudin, 1964, Theorem 11.43), any function f ∈ L²(µ), the set of square-integrable functions with respect to ⟨·, ·⟩, can be formally expanded as

f(x) ∼ Σ_{l=0}^∞ a_l P_l(x),  a_l = ∫ f(x)P_l(x)µ(dx), (7)

where "f ∼ Σ_{l=0}^∞ a_l P_l" indicates that ‖f − Σ_{l=0}^L a_l P_l‖ → 0 as L → ∞, with ‖f‖² = ⟨f, f⟩.

²Here, ⌊·⌋ denotes the floor function, while erf(x) = (2/√π)∫_0^x e^{−t²} dt and erfc(x) = 1 − erf(x) denote the error function and the complementary error function, respectively.

[Figure 1: visual representations of the functions f_1, f_2, f_3 of (3)-(5) (left) and their corresponding parameters a_1 and ν (right), with a_2 = 0 for each of these f's, as defined in Assumption 1. In particular, a_1 = ν = erfc(s) + 2s e^{−s²}/√π for the sparse f_1, while a_1 = √(2/π) e^{−s²} and ν = erfc(s) for the binary f_3; the entries for the quantized f_2 are finite sums of erf and Gaussian terms over the 2^{M−2} quantization levels.]

To investigate the asymptotic behavior of K as n, p → ∞, we make the following assumption on f.

Assumption 1 (Square-integrable in Gauss space). Let ξ_p = z_i^T z_j/√p and let P_{l,p}(x) be the orthonormal polynomials with respect to the measure µ_p of ξ_p. For f ∈ L²(µ_p), f(x) ∼ Σ_{l=0}^∞ a_{l,p} P_{l,p}(x), with a_{l,p} as in (7), such that (i) Σ_{l=0}^∞ a_{l,p} P_{l,p}(x) converges in L²(µ_p) to f(x) uniformly over large p; and (ii) as p → ∞, Σ_{l=1}^∞ a²_{l,p} → ν and, for l = 0, 1, 2, a_{l,p} → a_l, with a_0 = 0.
Since ξ_p → N(0, 1) in distribution, the parameters a_0, a_1, a_2 and ν are simply moments of the standard Gaussian measure involving f. More precisely, for ξ ∼ N(0, 1),

a_0 = E[f(ξ)],  a_1 = E[ξf(ξ)],  √2 a_2 = E[ξ²f(ξ)] − a_0,  ν = E[f²(ξ)] − a_0² ≥ a_1² + a_2². (8)

Imposing the condition a_0 = 0 simply discards the non-informative rank-one matrix a_0 1_n 1_n^T/√p from K. The three parameters (a_1, a_2, ν) are of crucial significance in determining the spectral behavior of K (see Theorem 1 below). The sparse f_1, quantized f_2, and binary f_3 of our primary interest all satisfy Assumption 1 (counterexamples exist, though; for example f(x) = e^{P(x)} for a polynomial P(x) of degree greater than two), with a_2 = 0 (as we shall see in Corollary 1 below, this is important for the spectral clustering use case) and a_1, ν given in Figure 1-(right). With these preliminary comments, we are in position to present our main technical results.
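The closed-form entries of Figure 1-(right) for f_1 and f_3 can be checked directly against the Gaussian moments (8) by Monte Carlo (an illustrative sketch; the sample size and seed are arbitrary):

```python
import math
import numpy as np

rng = np.random.default_rng(2)
xi = rng.standard_normal(2_000_000)           # xi ~ N(0, 1)

def moments(fx):
    """Empirical a0 = E[f], a1 = E[xi f], nu = E[f^2] - a0^2 from samples."""
    a0 = fx.mean()
    return a0, (xi * fx).mean(), (fx ** 2).mean() - a0 ** 2

s = 0.7
thr = math.sqrt(2) * s

# sparse f1: a1 = nu = erfc(s) + 2 s e^{-s^2} / sqrt(pi)
a0, a1, nu = moments(xi * (np.abs(xi) > thr))
pred_f1 = math.erfc(s) + 2 * s * math.exp(-s ** 2) / math.sqrt(math.pi)

# binary f3: a1 = sqrt(2/pi) e^{-s^2}, nu = erfc(s)
b0, b1, bnu = moments(np.sign(xi) * (np.abs(xi) > thr))
pred_a1_f3 = math.sqrt(2 / math.pi) * math.exp(-s ** 2)
```

Both operators are odd, so a_0 = a_2 = 0 automatically, and the empirical a_1 and ν match the tabulated expressions to Monte Carlo accuracy.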

3. MAIN TECHNICAL RESULTS

Our main technical result, from which our performance-complexity trade-off analysis will follow, provides an asymptotic deterministic equivalent Q̄(z) for the random resolvent Q(z) defined in (6). (A deterministic equivalent is a deterministic matrix Q̄(z) such that, for any deterministic sequence of matrices A_n ∈ R^{n×n} and vectors a_n, b_n ∈ R^n of bounded (spectral and Euclidean) norms, (1/n) tr A_n(Q(z) − Q̄(z)) → 0 and a_n^T(Q(z) − Q̄(z))b_n → 0 almost surely as n, p → ∞. We denote this relation Q(z) ↔ Q̄(z).) This is given in the following theorem. The proof is in Appendix A.1.

Theorem 1 (Deterministic equivalent). Let n, p → ∞ with p/n → c ∈ (0, ∞), let Q(z) be defined as in (6), and let ℑ[·] denote the imaginary part of a complex number. Then, under Assumption 1, Q(z) ↔ Q̄(z), with

Q̄(z) = m(z)I_n − VΛ(z)V^T,  √n V = [v, 1_n],

Λ(z) = [ Θ(z)m²(z) , (v^T 1_n/n) Θ(z)Ω(z)m(z) ; (v^T 1_n/n) Θ(z)Ω(z)m(z) , ((v^T 1_n)²/n²) Θ(z)Ω²(z) − Ω(z) ],

where Ω(z) = a_2²(κ−1)m³(z) / (2c² − a_2²(κ−1)m²(z)), Θ(z) = a_1‖µ‖² / (c + a_1 m(z)(1 + ‖µ‖²) + a_1‖µ‖² Ω(z)(v^T 1_n)²/n²), κ is the kurtosis of the entries of Z, and m(z) is the unique solution, such that ℑ[m(z)] · ℑ[z] ≥ 0, of

−1/m(z) = z + a_1² m(z)/(c + a_1 m(z)) + (ν − a_1²) m(z)/c. (9)

The major implication of Theorem 1 is that the spectral behavior of the matrix K (e.g., its eigenvalue distribution and the isolated eigenpairs discussed after (6)) depends on the nonlinearity f only via the three parameters a_1, a_2, ν defined in (8). More concretely, the empirical spectral measure ω_n = (1/n) Σ_{i=1}^n δ_{λ_i(K)} of K has a deterministic limit ω as n, p → ∞, uniquely defined through its Stieltjes transform m(z) ≡ ∫ (t − z)^{−1} ω(dt) as the solution to (9).³ This limiting measure ω does not depend on the law of the independent entries of Z, as long as they are sub-exponential with zero mean and unit variance.

[Figure 2: eigenvalue distribution and isolated eigenvalues of K for f(t) = sin(t) − 3 cos(t) + 3/√e, p = 800, n = 6 400, µ = 1.1 · 1_p/√p, v = [−1_{n/2}; 1_{n/2}], on Student-t data with κ = 5.]
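The self-consistent equation (9) is easy to solve numerically by damped fixed-point iteration, giving access to the limiting density via the Stieltjes inversion ω(dx) ≈ ℑ[m(x + iη)]/π dx. The sketch below uses the (a_1, ν) of the binary f_3; the grid, η, damping, and iteration count are arbitrary numerical choices:

```python
import math
import numpy as np

def m_of_z(z, a1, nu, c, iters=2000):
    """Damped fixed-point iteration for the self-consistent equation
    -1/m = z + a1^2 m/(c + a1 m) + (nu - a1^2) m/c, for Im z > 0."""
    m = -1.0 / z
    for _ in range(iters):
        m_new = -1.0 / (z + a1 ** 2 * m / (c + a1 * m)
                        + (nu - a1 ** 2) * m / c)
        m = 0.5 * m + 0.5 * m_new              # damping for stability
    return m

# parameters of the binary f3 at s = 0.5 (Figure 1, right), with c = p/n = 2
s, c = 0.5, 2.0
a1 = math.sqrt(2 / math.pi) * math.exp(-s ** 2)
nu = math.erfc(s)

# limiting density on a grid, via Im m(x + i eta) / pi with small eta
xs = np.linspace(-4.0, 4.0, 321)
dens = np.array([m_of_z(x + 0.05j, a1, nu, c).imag / math.pi for x in xs])
mass = float(dens.sum() * (xs[1] - xs[0]))     # total mass, close to 1
```

The fixed-point map sends the upper half-plane to itself, so the recovered density is nonnegative and integrates (up to the smoothing η) to one; overlaying it on a histogram of the eigenvalues of K reproduces the agreement shown in Figure 2.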
In particular, taking a_1 = 0 in (9) gives the (rescaled) Wigner semicircle law (Wigner, 1955), and taking ν = a_1² (i.e., a_l = 0 for l ≥ 2) gives the Marcenko-Pastur law (Marcenko & Pastur, 1967). See Remark 2 in Appendix A.2 for more discussion.

Spurious non-informative spikes. Going beyond the limiting spectral measure of K, Theorem 1 also shows that isolated eigenvalues (often referred to as spikes) may be found in the eigenspectrum of K at very specific locations. Such spikes and their associated eigenvectors are typically thought to provide information on the data class structure (and they do when f is linear). However, when f is nonlinear, this is not always the case: not all of these spikes need be "useful" in the sense of being informative about the class structure. This is made precise in the following, where we assume µ = 0, i.e., that there is no class structure. The proof is in Appendix A.2.

Corollary 1 (Non-informative spikes). Assume µ = 0, κ ≠ 1 and a_2 ≠ 0, so that, in the notation of Theorem 1, Θ(z) = 0 and Q̄(z) = m(z)I_n + Ω(z)(1/n)1_n 1_n^T, for Ω(z) defined in Theorem 1. Then, if x_± ≡ ±(1/a_2)√(2/(κ−1)) satisfies a_1 x_± ≠ ±1 and (ν − a_1²)x_±² + a_1² x_±²/(1 + a_1 x_±)² < 1/c, two eigenvalues of K converge to

z_± = −1/(c x_±) − a_1² x_±/(1 + a_1 x_±) − (ν − a_1²) x_±,

away from the support of ω. If instead x_± = ±1/a_1 for a_1 ≠ 0, then a single eigenvalue of K isolates, with limit z = −ν/a_1 − a_1(2 − c)/(2c).

Corollary 1 (combined with Theorem 1) says that, while the limiting spectrum ω is universal with respect to the distribution of the entries of Z, the existence and position of the spurious non-informative spikes are not universal and depend on the kurtosis κ of the distribution. (See Figure 12 in Appendix A.2 for an example.) This is far from a mathematical curiosity. Given how nonlinear transformations are used in machine learning practice, being aware of the existence of spurious non-informative spikes in the eigenspectrum of K (well separated from the bulk of eigenvalues, but corresponding to random noise instead of signal), as a function of the properties of the nonlinearity f, is of fundamental importance for downstream tasks. For example, in spectral clustering, their associated eigenvectors may be mistaken for informative ones, even in the complete absence of classes. This is confirmed by Figure 2, where two isolated eigenvalues (on the right side of the bulk) are observed, with only the second largest corresponding to an eigenvector that contains class-label information. For further discussion, see Appendices A.2 and A.3.

Informative spikes. From Theorem 1, we see that the eigenspectrum of K depends on f only via a_1, a_2, and ν. In particular, a_1 and ν determine the limiting spectral measure ω. From Corollary 1, we see that a_2 contributes by (i) introducing (at most two) non-informative spikes and (ii) reducing the ratio a_1/ν (since ν = Σ_{i≥1} a_i²), thereby necessarily enlarging the support of ω (see Remark 2 in Appendix A.2 for more details). Taking a_2 = 0 thus reduces the length of the support of ω and, as such, maximizes the "chance" of the appearance of an informative spike (whose eigenvector is positively correlated with the label vector v). See Remark 1 below for a more precise statement.
In particular, by taking a_2 = 0 and a_1 ≠ 0, we obtain only informative spikes, and we can characterize a phase transition depending on the SNR ρ. The proof of the following is in Appendix A.3.

Corollary 2 (Informative spike and a phase transition). For a_1 > 0 and a_2 = 0, let

F(x) = x⁴ + 2x³ + (1 − cν/a_1²)x² − 2cx − c,  G(x) = (a_1/c)(1 + x) + a_1/x + ((ν − a_1²)/a_1) · 1/(1 + x),

and let γ be the largest real solution to F(γ) = 0. Then, under Assumption 1, we have √c ≤ γ ≤ √(cν)/a_1, and the largest eigenpair (λ̂, v̂) of K satisfies

λ̂ → λ = G(ρ) for ρ > γ, and λ̂ → λ = G(γ) for ρ ≤ γ; (10)
|v̂^T v|²/n → α = F(ρ)/(ρ(1 + ρ)³) for ρ > γ, and |v̂^T v|²/n → α = 0 for ρ ≤ γ; (11)

almost surely as n, p → ∞ with p/n → c ∈ (0, ∞), where we recall that ρ ≡ lim_{p→∞} ‖µ‖² and ‖v‖ = √n.

Without loss of generality, we discuss only the case a_1 > 0. For a_1 > 0, both F(x) and G(x) are increasing functions on x ∈ (γ, ∞). Then, as expected, both λ and α increase with the SNR ρ. Moreover, the phase transition point γ is an increasing function of c and of ν/a_1². As such, the optimal f in the sense of the smallest phase transition point is the linear function f(t) = t, with a_1 = ν = 1 and γ = √c. This recovers the classical random matrix result of (Baik et al., 2005).
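Corollary 2 is straightforward to evaluate numerically; the sketch below takes the linear f as a sanity check (a_1 = ν = 1 are then exact, while c and ρ are arbitrary test values) and recovers γ = √c together with the classical alignment:

```python
import numpy as np

def phase_transition(a1, nu, c, rho):
    """gamma (largest real root of F), limiting top eigenvalue lambda,
    and squared alignment alpha, per Corollary 2 (a2 = 0, a1 > 0)."""
    F = np.array([1.0, 2.0, 1.0 - c * nu / a1 ** 2, -2.0 * c, -c])
    gamma = max(r.real for r in np.roots(F) if abs(r.imag) < 1e-9)

    def G(x):
        return a1 * (1 + x) / c + a1 / x + (nu - a1 ** 2) / (a1 * (1 + x))

    if rho > gamma:
        return gamma, G(rho), np.polyval(F, rho) / (rho * (1 + rho) ** 3)
    return gamma, G(gamma), 0.0          # below the transition: no alignment

# linear f (a1 = nu = 1): gamma = sqrt(c), alpha = (rho^2 - c)/(rho (1 + rho))
gamma, lam, alpha = phase_transition(1.0, 1.0, 2.0, rho=4.0)
```

For ν = a_1², the quartic factors as (1 + x)²(x² − c), which is why the largest real root reduces to √c in the linear case.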

4. CLUSTERING PERFORMANCE OF SPARSE AND QUANTIZED OPERATORS

We start, in Section 4.1, by providing a sharp asymptotic characterization of the clustering error rate and demonstrating the optimality of the linear function under (1). Then, in Section 4.2, we discuss the advantageous performance of the proposed selective sparsification approach (with f_1) versus the uniform sparsification and subsampling approaches studied previously in (Zarrouk et al., 2020). Finally, in Section 4.3, we derive the optimal truncation threshold s_opt, for both the quantized f_2 and the binary f_3, so as to achieve an optimal performance-complexity trade-off for a given storage budget.

4.1. PERFORMANCE OF SPECTRAL CLUSTERING

The technical results in Section 3 provide conditions under which K admits an informative eigenvector v̂ that is non-trivially correlated with the class-label vector v (and thus exploitable for spectral clustering) in the n, p → ∞ limit. Since the exact (limiting) alignment |v̂^T v| is known, along with an additional argument on the normal fluctuations of v̂, we have the following result on the performance of spectral clustering. The proof is in Appendix A.4.

Proposition 1 (Performance of spectral clustering). Let Assumption 1 hold, let a_1 > 0, a_2 = 0, and let Ĉ_i = sign([v̂]_i) be the estimate of the underlying class C_i of x_i, with the convention v̂^T v ≥ 0, for v̂ the top eigenvector of K. Then, the (average) misclassification rate satisfies

(1/n) Σ_{i=1}^n δ_{Ĉ_i ≠ C_i} → (1/2) erfc(√(α/(2 − 2α))),

almost surely as n, p → ∞, for α ∈ [0, 1) defined in (11).

Recall from Corollary 2 that, for a_2 = 0, the nonlinear function f (e.g., f_1, f_2, f_3) acts on the (statistical behavior of the) isolated eigenvector v̂, and thus on the spectral clustering performance per Proposition 1, only through the ratio ν/a_1². It thus suffices to evaluate the ratio ν/a_1² of each f and compare, for instance, to that of the linear f(t) = t corresponding to the original X^T X matrix. Despite being asymptotic results valid in the n, p → ∞ limit, Proposition 1 and Corollary 2 closely match empirical results already for moderately large n, p in the hundreds. This is illustrated in Figure 3. Proposition 1 further confirms that the misclassification rate, being a decreasing function of α, increases with ν/a_1² (for c and ρ fixed). This leads to the following remark.

Remark 1 (Optimality of the linear function).
Since both the phase transition point γ and the misclassification rate grow with ν/a_1², the linear function f(t) = t, with the minimal ν/a_1² = 1, is optimal in the sense of (i) requiring the smallest SNR ρ, or tolerating the largest ratio c = lim p/n (i.e., the fewest samples n), for an informative (isolated) eigenvector to appear, and (ii) upon existence of such an isolated eigenvector, reaching the lowest classification error rate.

According to Remark 1, any f with ν/a_1² > 1 induces a performance degradation (compared to the optimal linear function). However, by choosing f to be one of the functions f_1, f_2, f_3 defined in (3)-(5), one may trade clustering performance for reduced storage size and computational time. Figure 4 displays the theoretical decay in clustering performance and the gain in storage size of the sparse f_1, quantized f_2, and binary f_3, compared to the optimal but "dense" linear function. As s → 0, both the performance and the storage size under f_1 naturally approach those of the linear function. This is unlike f_2 or f_3, which approach the sign function. For s ≫ 1, the performance under the sparse f_1 becomes comparable to that of the binary f_3 (which is significantly worse than the quantized but non-sparse f_2), but for a larger storage cost. In particular, using f_2 or f_3 in the setting of Figure 4, one can reduce the storage size by a factor of 32 or 64 (in the case of IEEE-standard single- or double-precision floating-point format), at the price of a performance drop of less than 1%.
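Proposition 1 is easy to confront with simulation; here is a minimal sketch for the linear f (so that ν/a_1² = 1 and α has the closed form from Corollary 2; the dimensions and SNR are arbitrary test values):

```python
import math
import numpy as np

def predicted_error(alpha):
    """Limiting misclassification rate of Proposition 1."""
    return 0.5 * math.erfc(math.sqrt(alpha / (2 - 2 * alpha)))

rng = np.random.default_rng(3)
p, n, rho = 400, 800, 4.0                     # c = p/n = 1/2, SNR above gamma
mu = math.sqrt(rho) * np.ones(p) / math.sqrt(p)
v = np.repeat([-1.0, 1.0], n // 2)
X = rng.standard_normal((p, n)) + np.outer(mu, v)

K = X.T @ X / p                               # linear f: f(X^T X/sqrt(p))/sqrt(p)
np.fill_diagonal(K, 0.0)                      # discard the diagonal, as in (2)

vhat = np.linalg.eigh(K)[1][:, -1]
vhat *= np.sign(vhat @ v)                     # convention vhat^T v >= 0
err_emp = np.mean(np.sign(vhat) != v)

c = p / n
alpha = (rho ** 2 - c) / (rho * (1 + rho))    # F(rho)/(rho (1+rho)^3) at nu = a1^2
err_thy = predicted_error(alpha)
```

Already at these moderate dimensions, the empirical rate `err_emp` sits within a few percentage points of the asymptotic prediction `err_thy`, consistent with Figure 3.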

4.2. COMPARISON TO UNIFORM SPARSIFICATION AND SUBSAMPLING

From Figure 4, we see that the classification error and the storage gain of the sparse f_1 both increase monotonically as the truncation threshold s grows. For f_1, the number of nonzero entries of K is approximately erfc(s)n² at truncation threshold s. Thus, the sparsity level ε_selec = erfc(s) ∈ [0, 1] can be defined and compared to uniform sparsification or subsampling approaches. Recall (from the introduction) that the cost of spectral clustering may be reduced by subsampling the whole dataset into 1/ε_sub chunks of nε_sub data vectors each. Alternatively, as investigated recently in (Zarrouk et al., 2020), the cost can be reduced by uniformly zeroing out X^T X with a symmetric random mask matrix B, with B_ij ∼ Bern(ε_unif) for 1 ≤ i < j ≤ n and B_ii = 0. On average, a proportion 1 − ε_unif of the entries of X^T X is set to zero, so that ε_unif ∈ [0, 1] controls the sparsity level (and thus the storage size as well as the computational time). Similarly to our Corollary 2, the associated eigenvector alignment α (and thus the clustering accuracy per Proposition 1) can be derived in both cases. Specifically, taking ε_unif = a_1²/ν in (Zarrouk et al., 2020, Theorem 3.2), we obtain the same F(x) as in our Corollary 2, and therefore the same phase transition point γ and eigenvector alignment α. As for subsampling, its performance is obtained by letting a_1² = ν and changing c into c/ε_sub in the formulas for F(x) and G(x) of our Corollary 2. Consequently, the same clustering performance is achieved by either uniform or selective sparsification (using f_1) with ε_unif = a_1²/ν = erfc(s) + 2se^{−s²}/√π > erfc(s) = ε_selec, and our proposed selective sparsification thus leads to strictly sparser matrices. Moreover, their ratio r(s) = erfc(s)/(erfc(s) + 2se^{−s²}/√π) is a decreasing function of s, with r(s) ∼ (1 + s²)^{−1}/2 for s ≫ 1,⁵ meaning that the gain in storage size and computational time becomes more significant as the matrix becomes sparser.
This is depicted in Figure 5-(left). Fixing α in Corollary 2 so as to achieve a given clustering performance level (via Proposition 1), one may then retrieve "equi-performance" curves in the (ε, ρ)-plane for uniform sparsification, selective sparsification, and subsampling. This is displayed in Figure 5-(right), showing that a dramatic performance gain is achieved by the proposed selective sparsification f_1. Besides, here for c = 2, as much as 80% sparsity can be obtained with selective sparsification at constant SNR ρ, with virtually no performance loss (the red curves are almost flat on ε ∈ [0.2, 1]). This fails to hold for uniform sparsification (Zarrouk et al. (2020) obtain such a result only for c ≲ 0.1) or subsampling.
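The comparison ε_selec = erfc(s) versus ε_unif = a_1²/ν can be tabulated directly (a small sketch; the three thresholds are arbitrary sample points):

```python
import math

def eps_selective(s):
    """Fraction of entries kept by the selective operator f1."""
    return math.erfc(s)

def eps_uniform(s):
    """Uniform sparsity level a1^2/nu achieving the same clustering
    performance as the selective f1 (matched via F in Corollary 2)."""
    return math.erfc(s) + 2 * s * math.exp(-s ** 2) / math.sqrt(math.pi)

# the selective scheme is strictly sparser at equal performance, and the
# gain r(s) = eps_selec / eps_unif improves as the matrices get sparser
ratios = {s: eps_selective(s) / eps_uniform(s) for s in (0.5, 1.0, 2.0)}
```

At s = 2, for instance, the selective scheme needs only about a tenth of the nonzero entries of the performance-matched uniform scheme, in line with the asymptotic r(s) ≈ (1 + s²)^{−1}/2.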

4.3. OPTIMALLY QUANTIZED AND BINARIZED MATRICES

From Figure 4, we see that the classification errors of the quantized f_2(M; s; t) and the binarized f_3(s; t) do not increase monotonically with the truncation threshold s. It can be shown (and is visually confirmed in Figure 4) that, for a given M ≥ 2, the ratio ν/a_1² of both f_2 and f_3 is convex in s and admits a unique minimum. This leads to the following optimal design result for f_2 and f_3, respectively, the proof of which is straightforward.

Proposition 2 (Optimal design of quantized and binarized functions). Under the assumptions and notations of Proposition 1, the classification error rate is minimized at s = s_opt, with
1. s_opt the unique solution to a_1(s_opt)ν′(s_opt) = 2a_1′(s_opt)ν(s_opt) for the quantized f_2, with a_1′(s) and ν′(s) the derivatives with respect to s of the corresponding expressions in Figure 1-(right);
2. s_opt ≈ 0.43 for the binary f_3, the unique solution to 4s · erfc(s) = (2/√π)e^{−s²}, at which ν/a_1² ≈ 1.24.

Therefore, the optimal threshold s_opt for the quantized f_2 or the binary f_3 under (1) is problem-independent: it depends neither on ρ nor on c. In particular, note that (i) the binary f_3(s_opt; ·) is consistently better than f(t) = sign(t), for which ν/a_1² = π/2 ≈ 1.57 > 1.24; and (ii) the performance of the quantized f_2 can be (very slightly) worse than that of the binary f_3 for small s, but significantly better for not-too-small s. These points are visually confirmed in the left and middle displays of Figure 4. As already observed in Figure 4-(right), a significant gain in storage size can be achieved by using f_2 or f_3, versus the performance-optimal but dense linear function, with virtually no performance loss. Figure 6 compares the performance of the optimally designed f_2 and f_3 to that of the sparse f_1 with approximately the same storage size.⁶ A significant drop in classification error is observed when using the quantization f_2 or binarization f_3 rather than the sparsification f_1. Also, the performances of f_2 and f_3 are extremely close to the theoretical optimum (met by f(t) = t).
This is further confirmed by Figure 6-(right), where, for the optimal f_2, the ratio ν/a_1² gets close to 1 for all M ≥ 5. Figure 7 next evaluates the clustering performance, the proportion of nonzero entries in K, and the computation time of the top eigenvector, for the sparse f_1 and the binary f_3, versus the linear f(t) = t, as a function of the truncation threshold s, on the popular MNIST dataset (LeCun et al., 1998). Depending on (the SNR ρ of) the task, up to 90% of the entries can be discarded almost "for free". Moreover, the curves of the binary f_3 appear strikingly close to those of the sparse f_1, showing the additional advantage of using the former to further reduce the storage size of K. More empirical results on various datasets are provided in Appendix B and confirm our observations in Figure 7.

⁵We use here the asymptotic expansion erfc(s) = (e^{−s²}/(s√π))(1 + Σ_{k=1}^∞ (−1)^k (1·3···(2k−1))/(2s²)^k).

⁶We set the truncation threshold s of f_1 such that erfc(s) = 3/64, so that the storage size of the sparse f_1 (64 bits per nonzero entry) is the same as that of the quantized f_2 with M = 3 (3 bits per nonzero entry), which is three times that of the binary f_3.
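As a check on the binary-case numbers above, the ratio ν/a_1² of f_3 can be minimized numerically from its Figure 1 closed forms (a sketch; the grid resolution and range are arbitrary choices):

```python
import math

def ratio_f3(s):
    """nu / a1^2 for the binary f3: (pi/2) * erfc(s) * e^{2 s^2}."""
    return (math.pi / 2) * math.erfc(s) * math.exp(2 * s ** 2)

# grid search for the minimizer of nu/a1^2 on (0, 2]
grid = [k / 10000 for k in range(1, 20001)]
s_opt = min(grid, key=ratio_f3)
best = ratio_f3(s_opt)          # the optimal ratio for the binary f3
```

The minimizer lands near s ≈ 0.43, where the ratio is ≈ 1.24, strictly below the π/2 ≈ 1.57 of the plain sign function quoted above.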

5. CONCLUDING REMARKS

We have evaluated performance-complexity trade-offs when sparsifying, quantizing, and binarizing a linear kernel matrix via a thresholding operator. Our main technical result characterizes the change in the eigenspectrum under these operations; and we have shown that, under an information-plus-noise model, sparsification and quantization, when carefully employed, maintain the informative eigenstructure and incur an almost negligible performance loss in spectral clustering. Empirical results on real data demonstrate that these conclusions hold far beyond the present statistical model. The proposed analysis can be extended in many ways: for instance, by considering a multi-cluster and more involved model than (1), as in (Liao & Couillet, 2019) (i.e., a "generic" K-class Gaussian mixture N(µ_a, C_a) for a ∈ {1, . . . , K}, which may help better interpret the empirical observations in Figure 7 and Appendix B); by focusing on more general kernels beyond the current inner-product type in (2); or by deriving non-asymptotic guarantees as in (Vankadara & Ghoshdastidar, 2020). Our results open the door to the theoretical investigation of a broad range of cost-efficient linear algebra methods in machine learning, including subsampling techniques (Mensch et al., 2017; Roosta-Khorasani & Mahoney, 2019), distributed optimization (Wang et al., 2018), randomized linear algebra algorithms (Mahoney, 2011; Drineas & Mahoney, 2016), and quantization for improved training and/or inference (Dong et al., 2019; Shen et al., 2020).
Also, given recent interest in viewing neural networks from the perspective of RMT (Li & Nguyen, 2018; Seddik et al., 2018; Jacot et al., 2019; Liu & Dobriban, 2019; Martin & Mahoney, 2019; 2020) , our results open the door to understanding and improving performance-complexity trade-offs far beyond kernel methods (Rahimi & Recht, 2008; Jacot et al., 2018; Liu et al., 2020) , e.g., to sparse, quantized, or even binary neural networks (Hubara et al., 2016; Lin et al., 2017) .

A PROOFS AND RELATED DISCUSSIONS

Under the mixture model (1), the data matrix X ∈ R^{p×n} can be compactly written as X = Z + µv^T, for Z ∈ R^{p×n} having i.i.d. zero-mean, unit-variance, κ-kurtosis, sub-exponential entries, and v ∈ {±1}^n so that ‖v‖ = √n. Recall also the following notation: K = {δ_{i≠j} f(x_i^T x_j/√p)/√p}_{i,j=1}^n and Q(z) ≡ (K − zI_n)^{−1}.

A.1 PROOF OF THEOREM 1

The proof of Theorem 1 comes in the following two steps:

1. show that the random quantities (1/n) tr A_n Q(z) and a_n^T Q(z) b_n of interest concentrate around their expectations, in the sense that (1/n) tr A_n (Q(z) − E[Q(z)]) → 0 and a_n^T (Q(z) − E[Q(z)]) b_n → 0, almost surely as n, p → ∞; and

2. show that the sought-for deterministic equivalent Q̄(z) given in Theorem 1 is an asymptotic approximation of the expectation of the resolvent Q(z) defined in (6), in the sense that ‖E[Q] − Q̄‖ → 0 as n, p → ∞.

The concentration of the trace forms in the first item has been established in (Cheng & Singer, 2013; Do & Vu, 2013), and the bilinear forms follow similarly. Here we focus on the second item, that is, on showing that ‖E[Q] − Q̄‖ → 0 in the large n, p limit. In the sequel, we use o(1) and o_{‖·‖}(1) for scalars or matrices of (almost surely, if random) vanishing absolute values or operator norms as n, p → ∞. To establish ‖E[Q] − Q̄‖ → 0, we need to show subsequently that:

1. under (13), the random matrix K defined in (2) admits a spiked-model approximation, that is,

K = K̄_0 + UΛU^T + o_{‖·‖}(1),   (17)

for some full-rank random (noise) matrix K̄_0 and a low-rank (information) matrix UΛU^T to be specified;

2. the matrix inverse (K̄_0 + UΛU^T − zI_n)^{−1} can be decomposed with the Woodbury identity, so that

Q = (K̄_0 + UΛU^T − zI_n)^{−1} + o_{‖·‖}(1) = Q̄_0 − Q̄_0 U(Λ^{−1} + U^T Q̄_0 U)^{−1} U^T Q̄_0 + o_{‖·‖}(1),   (18)

with Q̄_0(z) ≡ (K̄_0 − zI_n)^{−1}; and

3. the expectation of the right-hand side of (18) is close to Q̄ in the large n, p limit, allowing us to conclude the proof of Theorem 1.
To establish (17), we denote the "noise-only" null model with µ = 0 by writing K = K_0, such that

[K_0]_{ij} = δ_{i≠j} f(z_i^T z_j/√p)/√p.   (19)

With a combinatorial argument, it has been shown in (Fan & Montanari, 2019) that

‖K_0 − K̄_0 − (a_2/√2)(1/p)(ψ1_n^T + 1_n ψ^T)‖ → 0,   (20)

almost surely as n, p → ∞, for K̄_0 such that (K̄_0 − zI_n)^{−1} ≡ Q̄_0(z) ↔ m(z)I_n, and for the random vector ψ ∈ R^n with i-th entry [ψ]_i = (1/√p)(‖z_i‖^2 − E[‖z_i‖^2]) = (1/√p)(‖z_i‖^2 − p).

Consider now the information-plus-noise model K for X = Z + µv^T as in (13), with [v]_i = ±1 and ‖v‖ = √n. It follows from (Liao & Couillet, 2019) that

‖K − K_0 − (a_1/p) [v, Z^T µ] [[‖µ‖^2, 1], [1, 0]] [v, Z^T µ]^T‖ → 0,   (21)

almost surely as n, p → ∞. Combining (20) with (21), we obtain ‖K − K̄_0 − UΛU^T‖ → 0 almost surely as n, p → ∞, with U = (1/√p)[1_n, v, ψ, Z^T µ] ∈ R^{n×4},

Λ = [[0, 0, a_2/√2, 0], [0, a_1‖µ‖^2, 0, a_1], [a_2/√2, 0, 0, 0], [0, a_1, 0, 0]],

and (K̄_0 − zI_n)^{−1} ≡ Q̄_0(z) ↔ m(z)I_n. By the Woodbury identity, we write

Q = (K − zI_n)^{−1} = (K̄_0 + UΛU^T − zI_n)^{−1} + o_{‖·‖}(1) = Q̄_0 − Q̄_0 U(Λ^{−1} + U^T Q̄_0 U)^{−1} U^T Q̄_0 + o_{‖·‖}(1),   (23)

with

Λ^{−1} + U^T Q̄_0 U = [[m(z)/c, (m(z)/c)(v^T 1_n/n), 0, 0], [(m(z)/c)(v^T 1_n/n), m(z)/c, 0, 0], [0, 0, (κ−1)m(z)/c, 0], [0, 0, 0, µ^T((1/p)E[Z Q̄_0 Z^T])µ]] + o_{‖·‖}(1),

where we used E[ψ] = 0 and E[ψψ^T] = (κ−1)I_n. It thus remains to evaluate the expectation (1/p)E[Z(K̄_0 − zI_n)^{−1} Z^T]. This is given in the following lemma.

Lemma 1. Under the assumptions and notations of Theorem 1, we have

(1/p)E[Z(K̄_0 − zI_n)^{−1} Z^T] = (m(z)/(c + a_1 m(z))) I_p + o_{‖·‖}(1).

Proof of Lemma 1. For Q̄_0 = (K̄_0 − zI_n)^{−1}, we aim to approximate the expectation E[Z Q̄_0 Z^T]/p. Consider first the case where the entries of Z are i.i.d. Gaussian. We can then write the (i, i′) entry of E[Z Q̄_0 Z^T], with Stein's lemma (i.e., E[x f(x)] = E[f′(x)] for x ∼ N(0, 1)), as

E[Z Q̄_0 Z^T]_{ii′} = Σ_{j=1}^n E[Z_{ij} [Q̄_0 Z^T]_{ji′}] = Σ_{j=1}^n E[∂[Q̄_0 Z^T]_{ji′}/∂Z_{ij}] = Σ_{j=1}^n E[[Q̄_0]_{jj} δ_{ii′} + Σ_k (∂[Q̄_0]_{jk}/∂Z_{ij}) Z^T_{ki′}].
We first focus on the term ∂[Q̄_0]_{jk}/∂Z_{ij}, by writing

∂[Q̄_0]_{jk}/∂Z_{ij} = −[Q̄_0 (∂K_0/∂Z_{ij}) Q̄_0]_{jk} = −Σ_{l,m=1}^n [Q̄_0]_{jl} (∂[K_0]_{lm}/∂Z_{ij}) [Q̄_0]_{mk},

where we recall [K_0]_{lm} = δ_{l≠m} f(Z^T Z/√p)_{lm}/√p, so that for l ≠ m we have

∂[K_0]_{lm}/∂Z_{ij} = (1/p) f′(Z^T Z/√p)_{lm} ∂[Z^T Z]_{lm}/∂Z_{ij} = (1/p) f′(Z^T Z/√p)_{lm} (δ_{jl} Z_{im} + Z^T_{li} δ_{jm}),

and ∂[K_0]_{lm}/∂Z_{ij} = 0 for l = m. We get

Σ_{j,k} (∂[Q̄_0]_{jk}/∂Z_{ij}) Z^T_{ki} = −(1/p)[Z diag(f′(Z^T Z/√p) Q̄_0 1_n) Q̄_0 Z^T]_{ii} − (1/p)[Z (Q̄_0 ⊙ f′(Z^T Z/√p)) Q̄_0 Z^T]_{ii},

where ⊙ denotes the entry-wise (Hadamard) product and f′(Z^T Z/√p) in fact stands for f′(Z^T Z/√p) − diag(·) in both terms. For the first term, since f′(Z^T Z/√p) − diag(·) = a_1 1_n 1_n^T + O_{‖·‖}(√p), we have

(1/p) f′(Z^T Z/√p) Q̄_0 1_n = (a_1/p)(1_n^T Q̄_0 1_n) · 1_n + O(p^{−1/2}) = (a_1 m(z)/c) 1_n + O(p^{−1/2}),   (25)

where O(p^{−1/2}) is understood entry-wise. As a result,

(1/p) Z diag(f′(Z^T Z/√p) Q̄_0 1_n) Q̄_0 Z^T = (a_1 m(z)/c) · (1/p) Z Q̄_0 Z^T + o_{‖·‖}(1).

For the second term, since f′(Z^T Z/√p) has O(1) entries and ‖A ⊙ B‖ ≤ √n ‖A‖_∞ ‖B‖ for A, B ∈ R^{n×n}, we deduce that ‖Z (Q̄_0 ⊙ f′(Z^T Z/√p)) Q̄_0 Z^T‖ = O(√p), so that this second term, carrying its 1/p factor, vanishes. As a consequence, we conclude that

(1/p) E[Z Q̄_0 Z^T] = (1/p) tr E[Q̄_0] · I_p − (a_1 m(z)/c) · (1/p) E[Z Q̄_0 Z^T] + o_{‖·‖}(1),

that is,

(1/p) E[Z Q̄_0 Z^T] = (m(z)/(c + a_1 m(z))) I_p + o_{‖·‖}(1),

where we recall that tr Q̄_0/p = m(z)/c; this gives the conclusion of Lemma 1 for the Gaussian case. The interpolation trick (Lytova & Pastur, 2009, Corollary 3.1) can then be applied to extend the result beyond the Gaussian distribution. This concludes the proof of Lemma 1.

Denoting A = (Λ^{−1} + U^T Q̄_0 U)^{−1}, it follows from Lemma 1 that

E[UAU^T] = (1/p)(A_{11} 1_n 1_n^T + A_{12} 1_n v^T + A_{21} v 1_n^T + A_{22} v v^T + A_{33}(κ−1) I_n + A_{44} ‖µ‖^2 I_n) = (1/p)(A_{11} 1_n 1_n^T + A_{12} 1_n v^T + A_{21} v 1_n^T + A_{22} v v^T) + o_{‖·‖}(1),

since ‖µ‖ = O(1) and ‖v‖ = O(√n).
We thus deduce from (23) that

Q(z) ↔ Q̄(z) = m(z) I_n − (m^2(z)/c) V [[A_{11}, A_{12}], [A_{21}, A_{22}]] V^T,

with √n V = [v, 1_n]. Rearranging the expression, we conclude the proof of Theorem 1.
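As a quick numerical sanity check of the deterministic equivalent (our own sketch, not part of the original proof), one can compare the empirical Stieltjes transform (1/n) tr Q(z) with the solution m(z) of the self-consistent equation for m (equation (27) in Section A.2), solved by damped fixed-point iteration at a point z off the real axis, for the running example f(t) = max(t, 0) − 1/√(2π):

```python
import numpy as np

# Hermite data for f(t) = max(t, 0) - 1/sqrt(2*pi): a1 = 1/2, nu = (pi-1)/(2*pi)
a1 = 0.5
nu = (np.pi - 1) / (2 * np.pi)
p, n = 1000, 500
c = p / n

rng = np.random.default_rng(1)
Z = rng.standard_normal((p, n))
G = Z.T @ Z / np.sqrt(p)
K = (np.maximum(G, 0.0) - 1.0 / np.sqrt(2 * np.pi)) / np.sqrt(p)
np.fill_diagonal(K, 0.0)            # delta_{i != j}: zero diagonal

# fixed point m = (-z - a1^2 m/(c + a1 m) - (nu - a1^2) m/c)^(-1),
# solved by damped iteration at z = 2i (away from the real spectrum)
z = 2.0j
m = -1.0 / z
for _ in range(2000):
    m_new = 1.0 / (-z - a1**2 * m / (c + a1 * m) - (nu - a1**2) * m / c)
    m = 0.5 * m + 0.5 * m_new

eigs = np.linalg.eigvalsh(K)
m_emp = np.mean(1.0 / (eigs - z))   # (1/n) tr Q(z)
print(abs(m_emp - m))               # small for large n, p
```

The agreement improves as n, p grow, consistent with the almost-sure convergence established above.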

A.2 PROOF OF COROLLARY 1 AND RELATED DISCUSSIONS

Consider the noise-only model obtained by taking µ = 0 in Theorem 1. Then, we have K = K_0 and Θ(z) = 0, so that

Q̄(z) = m(z) I_n + Ω(z) · (1/n) 1_n 1_n^T,  with  Ω(z) = (a_2^2 (κ−1) m^3(z))/(2c^2 − a_2^2 (κ−1) m^2(z)),

where we recall that m(z) is the solution to

m(z) = (−z − (a_1^2 m(z))/(c + a_1 m(z)) − ((ν − a_1^2)/c) m(z))^{−1}.   (27)

Since the resolvent Q(z) is undefined for z ∈ R within the eigensupport of K, which consists of (i) the main bulk, characterized by the Stieltjes transform m(z) defined in (27), and (ii) the possible spikes, we need to find the poles of Q̄(z) other than those of m(z) to determine the asymptotic locations of the spikes away from the main bulk. Direct calculation shows that the Stieltjes transforms of the possible non-informative spikes satisfy

m_± = ±√(2/(κ−1)) · c/a_2,   (28)

which are in fact the poles of Ω(z), for a_2 ≠ 0 and κ ≠ 1. For κ = 1 or a_2 = 0, Ω(z) has no (additional) poles, so that there is (almost surely) no spike outside the limiting spectrum. It is, however, not guaranteed that the z ∈ R corresponding to (28) isolates from the main bulk. To this end, we introduce the following characterization of the limiting spectral measure in (27), the proof of which follows from previous work.

Corollary 3 (Limiting spectrum). Under the notations and conditions of Theorem 1, with probability one, the empirical spectral measure ω_n = (1/n) Σ_{i=1}^n δ_{λ_i(K_0)} of the noise-only model K_0 (and therefore that of K, as a low-rank additive perturbation of K_0 via (21)) converges weakly to a probability measure ω of compact support as n, p → ∞, with ω uniquely defined through its Stieltjes transform m(z), the solution to (27). Moreover,

1. letting supp(ω) be the support of ω,

supp(ω) ∪ {0} = R \ {x(m) | m ∈ R \ ({−c/a_1} ∪ {0}) and x′(m) > 0},   (29)

for x(m) the functional inverse of (27), explicitly given by

x(m) = −1/m − (a_1^2 m)/(c + a_1 m) − ((ν − a_1^2)/c) m;   (30)
2. the measure ω has a density, and its support may have up to four edges, whose associated Stieltjes transforms are given by the roots of x′(m) = 0, i.e., of

x′(m) = 1/m^2 − (a_1^2 c)/(c + a_1 m)^2 − (ν − a_1^2)/c = 0.   (31)

The limiting spectral measure ω of the null model K_0 was first derived in (Cheng & Singer, 2013) for the Gaussian distribution, and then extended to sub-exponential distributions in (Do & Vu, 2013). The fact that a finite-rank perturbation does not affect the limiting spectrum follows from (Silverstein & Bai, 1995, Lemma 2.6). The characterization in (29) above follows the same idea as (Silverstein & Choi, 1995, Theorem 1.1), and arises from the crucial fact that the Stieltjes transform m(x) = ∫ (t − x)^{−1} ω(dt) of a measure ω is an increasing function on its domain of definition, and so must be its functional inverse x(m), given explicitly in (30). In plain words, Corollary 3 tells us that (i) depending on the number of real solutions to (31), the support of ω may contain two disjoint regions with four edges, and (ii) x ∈ R is outside the support of ω if and only if its associated Stieltjes transform m satisfies x′(m) > 0, i.e., belongs to an increasing region of the functional inverse x(m) in (30). This is depicted in Figure 8, where, for the same function f(t) = max(t, 0) − 1/√(2π) with a_1 = 1/2, a_2 = 1/(2√π) and ν = (π − 1)/(2π), we observe in the top display a single region of ω for c = 2, and in the bottom display two disjoint regions (thus with four edges) for c = 1/10. The corresponding (empirical) eigenvalue histograms and limiting laws are given in Figure 9. Note, in particular, that the local extrema of the functional inverse x(m) in Figure 8 characterize the (possibly up to four) edges of the support of ω in Figure 9. According to the discussion above, it remains to check the sign of x′(m) at m = ±√(2/(κ−1)) · c/a_2 to see whether these correspond to isolated eigenvalues away from the support of ω.
This, after some algebraic manipulations, concludes the proof of Corollary 1.

Discussions. The limiting spectral measure in Corollary 3 is in fact a "mix" of the popular Marcenko-Pastur law and Wigner's semicircle law.

Remark 2 (From Marcenko-Pastur to semicircle law). As already pointed out in (Fan & Montanari, 2019), here the limiting spectral measure ω is the free additive convolution (Voiculescu, 1986) of the semicircle and Marcenko-Pastur laws, weighted respectively by a_1 and ν − a_1^2, i.e.,

ω = a_1(ω_{MP,c^{−1}} − 1) ⊞ √((ν − a_1^2)/c) · ω_{SC},   (32)

where we denote a_1(ω_{MP,c^{−1}} − 1) the law of a_1(x − 1) for x ∼ ω_{MP,c^{−1}}, and √((ν − a_1^2)/c) · ω_{SC} the law of √((ν − a_1^2)/c) · x for x ∼ ω_{SC}. This is illustrated in Figure 10 for f(t) = a_1 t + a_2(t^2 − 1)/√2 (so that ν − a_1^2 = a_2^2) with different pairs (a_1, a_2): we observe a transition from the Marcenko-Pastur law (in the left display, with a_1 ≠ 0 and a_2 = 0) to the semicircle law (in the right display, with a_1 = 0 and a_2 ≠ 0).

Figure 10: Eigenvalues of K with µ = 0 (blue) versus the limiting laws in Theorem 1 and Corollary 3 (red) for Gaussian data, p = 1024, n = 512 and f(t) = a_1 t + a_2(t^2 − 1)/√2 with a_1 = 1, a_2 = 0 (left), a_1 = 1, a_2 = 1/2 (middle), and a_1 = 0, a_2 = 1 (right).

Remark 2 tells us that, depending on the ratio ν/a_1^2, the eigenspectrum of K exhibits a transition from a Marcenko-Pastur to a semicircle-like shape. Note from Figure 1-(right) that, for the sparse f1, the ratio ν/a_1^2 is an increasing function of the truncation threshold s; therefore, as the matrix K becomes sparser, its eigenspectrum changes from Marcenko-Pastur-type (at s = 0) to more semicircle-like. This is depicted in Figure 11, and similar conclusions hold for the quantized f2 and the binary f3 in the s ≥ s_opt regime.
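The edge characterization of Corollary 3 is straightforward to evaluate numerically. The sketch below (our own illustration, under the same f(t) = max(t, 0) − 1/√(2π) example with c = 2, so that the support is a single interval with two edges) clears the denominators of x′(m) = 0 into a quartic in m, selects the admissible real roots, and compares x(m) at those roots with the extreme eigenvalues of a simulated K:

```python
import numpy as np

# Hermite data for f(t) = max(t, 0) - 1/sqrt(2*pi)
a1, c = 0.5, 2.0
nu = (np.pi - 1) / (2 * np.pi)
d = nu - a1**2

# x'(m) = 1/m^2 - a1^2 c/(c + a1 m)^2 - d/c = 0, multiplied by
# c m^2 (c + a1 m)^2, is a quartic (coefficients from highest degree):
quartic = [-d * a1**2,
           -2 * d * a1 * c,
           c * a1**2 - a1**2 * c**2 - d * c**2,
           2 * a1 * c**2,
           c**3]
roots = np.roots(quartic)
real = np.sort(roots[np.abs(roots.imag) < 1e-8].real)

def x_of_m(m):
    return -1/m - a1**2 * m / (c + a1 * m) - d * m / c

# single-bulk case (c = 2): the left edge comes from the root with m > 0,
# the right edge from the root in (-c/a1, 0)
m_left = real[real > 0][0]
m_right = real[(real > -c / a1) & (real < 0)][0]
left_edge, right_edge = x_of_m(m_left), x_of_m(m_right)

# compare with the extreme eigenvalues of a simulated K (mu = 0)
p, n = 2000, 1000
rng = np.random.default_rng(2)
Z = rng.standard_normal((p, n))
K = (np.maximum(Z.T @ Z / np.sqrt(p), 0) - 1 / np.sqrt(2 * np.pi)) / np.sqrt(p)
np.fill_diagonal(K, 0)
eigs = np.linalg.eigvalsh(K)
print(left_edge, eigs.min(), right_edge, eigs.max())
```

The local extrema of x(m) reproduce the support edges seen in the eigenvalue histograms, as in Figures 8 and 9.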
As discussed after Theorem 1 and in the proof above, while the limiting eigenvalue distribution ω is universal, i.e., independent of the law of the entries of Z so long as they are independent, sub-exponential, of zero mean and unit variance, as commonly observed in RMT (Tao et al., 2010), this is no longer the case for the isolated eigenvalues. In particular, according to Corollary 1, the possible non-informative spikes do depend on the kurtosis κ of the distribution. In Figure 12 we observe a farther (left) spike for the Student-t distribution (with κ = 5, and which is thus not sub-exponential) than for the Gaussian distribution (with κ = 3), while no spike is observed for the symmetric Bernoulli distribution (taking values ±1 with probability 1/2, so that κ = 1), all with the same limiting eigenvalue distribution, for f(t) = max(t, 0) − 1/√(2π).

Figure 12: Eigenvalues of K with µ = 0 (blue) versus the limiting laws and spikes in Theorem 1 and Corollary 1 (red) for the Student-t (with 7 degrees of freedom, left), Gaussian (middle) and Rademacher (right) distributions, p = 512, n = 2048, f(t) = max(t, 0) − 1/√(2π). Emphasis on the non-informative spikes at different locations: at −2.10 for Student-t and −1.77 for Gaussian.

Remark 3 (Non-informative spike in-between). When the support of ω consists of two disjoint regions (e.g., in the right plot of Figure 9), a non-informative spike may appear between these two regions, with associated Stieltjes transform m < −c/a_1 in the setting of Figure 8-(bottom). This is only possible when a_1 √(2/(κ−1)) > a_2. An example is provided in Figure 13.
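The Marcenko-Pastur-to-semicircle transition of Remark 2 is also easy to reproduce numerically. The following sketch (our own illustration, using the eigenvalue skewness as a crude summary of asymmetry) contrasts the purely linear case (a1 = 1, a2 = 0, Marcenko-Pastur-like and skewed) with the purely quadratic case (a1 = 0, a2 = 1, semicircle-like and symmetric):

```python
import numpy as np

def kernel_eigs(a1, a2, p=1024, n=512, seed=3):
    """Eigenvalues of K for f(t) = a1*t + a2*(t^2 - 1)/sqrt(2), mu = 0."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((p, n))
    G = Z.T @ Z / np.sqrt(p)
    K = (a1 * G + a2 * (G**2 - 1) / np.sqrt(2)) / np.sqrt(p)
    np.fill_diagonal(K, 0)          # delta_{i != j}
    return np.linalg.eigvalsh(K)

def skew(x):
    x = x - x.mean()
    return np.mean(x**3) / np.mean(x**2) ** 1.5

# a2 = 0: Marcenko-Pastur-like (asymmetric); a1 = 0: semicircle-like (symmetric)
s_mp = skew(kernel_eigs(a1=1.0, a2=0.0))
s_sc = skew(kernel_eigs(a1=0.0, a2=1.0))
print(s_mp, s_sc)   # s_mp clearly positive, s_sc near zero
```

This mirrors the left and right displays of Figure 10: the same data matrix, passed through different entry-wise nonlinearities, produces markedly different spectral shapes.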

A.3 PROOF OF COROLLARY 2 AND RELATED DISCUSSIONS

Similarly to our discussion in Section A.2, we need to find the zeros of det Λ̄(z), that is, the real solutions to H(x) = 0, where, for m(z) the unique solution to (9) and ρ = lim_p ‖µ‖^2,

H(x) = a_1 a_2^2 (κ−1) ((v^T 1_n)^2/(n^2 ρ) − ρ) m^3(x) − a_2^2 c (κ−1) m^2(x) + 2 a_1 c^2 (ρ+1) m(x) + 2c^3.

Then:

1. for a_1 a_2^2 (κ−1)((v^T 1_n)^2/(n^2 ρ) − ρ) ≠ 0, there can be up to three spikes;

2. with a_1 = 0 and a_2 ≠ 0, we get m^2(x) = 2c^2/(a_2^2(κ−1)), and there are at most two spikes: this is equivalent to the case of Corollary 1 with ρ = 0; in fact, by taking a_1 = 0 we discard the information in the signal µ, as has been pointed out in (Liao & Couillet, 2019);

3. with a_2 = 0 and a_1 ≠ 0 we obtain m(x) = −c/(a_1(ρ+1)): this is the case of Corollary 2.

For a given isolated eigenvalue-eigenvector pair (λ̂, v̂) (assumed to be of multiplicity one), the projection |v̂^T v|^2 onto the label vector v can be evaluated via Cauchy's integral formula and our Theorem 1. More precisely, considering a positively oriented contour Γ that circles around only the isolated λ̂, we write

(1/n) v^T v̂ v̂^T v = −(1/2πı) ∮_Γ (1/n) v^T (K − zI_n)^{−1} v dz = −(1/2πı) ∮_Γ (1/n) v^T (m(z)I_n − VΛ̄(z)V^T) v dz + o(1)

= (1/n) v^T V ((1/2πı) ∮_Γ Λ̄(z) dz) V^T v + o(1) = (1/n) v^T V (Res Λ̄(z)) V^T v + o(1)

= [1, v^T 1_n/n] lim_{z→λ̄} (z − λ̄) [[Θ(z) m^2(z), Θ(z)Ω(z)(v^T 1_n/n) m(z)], [Θ(z)Ω(z)(v^T 1_n/n) m(z), Θ(z)Ω^2(z)(v^T 1_n)^2/n^2 − Ω(z)]] [1, v^T 1_n/n]^T + o(1),

where we used Theorem 1 in the second equality, and recall that the asymptotic location λ̄ of λ̂ is away from the support of the limiting spectral measure ω, so that −(1/2πı) ∮_Γ m(z) dz = 0 in the third equality. Interestingly, note at this point that taking v^T 1_n = 0 or a_2 = 0 (so that Ω(z) = 0) leads to the same simplification, by differentiating (9).
This particularly means that, in the absence of the (noisy) non-informative spikes (due to a_2 = 0), or in the case of balanced classes v^T 1_n = 0 (in fact, "almost" balanced classes v^T 1_n = o(n)), we obtain the same asymptotic alignment (with respect to v) for the informative eigenvector. However, spurious non-informative isolated eigenvectors may appear in the latter case. In these settings,

(1/n)|v̂^T v|^2 = lim_{z→λ̄} (z − λ̄) Θ(z) m^2(z) + o(1) = lim_{z→λ̄} (z − λ̄) (a_1 ρ m^2(z))/(c + a_1 m(z)(1 + ρ)) + o(1)   (34)

= (a_1 ρ/(1 + ρ)) · m^2(λ̄)/m′(λ̄) + o(1) = (a_1 ρ/(1 + ρ)) (1 − (a_1^2 c m^2(λ̄))/(c + a_1 m(λ̄))^2 − ((ν − a_1^2)/c) m^2(λ̄)) + o(1).   (35)

In the setting of Corollary 2, with a_1 > 0 and a_2 = 0, the substitution m(λ̄) = m_ρ = −c/(a_1(ρ+1)) into (35), followed by the change of variables m = −(c/a_1) · 1/(1+x), yields the expression of F(x) in Corollary 2. The phase transition condition can be similarly obtained, as discussed in Section A.2, by checking the sign of the derivative of the functional inverse, x′(m), as in Corollary 3. This concludes the proof of Corollary 2.

Note also that taking a_2 = 0 (as is the case for the proposed f1, f2 and f3) asymptotically discards the second-order discriminative statistics in the covariance structure, and may thus result in suboptimal performance in the case of non-identity covariances. It would be of future interest to extend the current analysis to "generic" Gaussian mixture classification, N(µ_1, C_1) versus N(µ_2, C_2), by considering the impact of (i) asymmetric means µ_1 and µ_2 ≠ −µ_1, (ii) statistical information in the covariance structures C_1 versus C_2, and (iii) possibly a multi-class mixture model with K ≥ 3 classes.

Figure 15 compares the clustering performance of the proposed f1, f2, and f3 on other MNIST-like datasets, including the Fashion-MNIST (Xiao et al., 2017), Kuzushiji-MNIST (Clanuwat et al., 2018), and Kannada-MNIST (Prabhu, 2019) datasets.
Then, Figure 16 compares the performance on representations of the ImageNet dataset (Deng et al., 2009) from the popular GoogLeNet (Szegedy et al., 2015), of feature dimension p = 2048. On various real-world datasets and features, we make similar observations as in the case of MNIST data in Figure 7 and Figure 14: the performances of the sparse f1 and binary f3 are very similar and generally degrade as the threshold s becomes large, while the quantized f2 yields consistently good performance, extremely close to that of the linear function. This is in line with the (theoretically supported) observation in (Seddik et al., 2020) that the "deep" representations of real-world datasets behave, in the large n, p regime, very similarly to simple Gaussian mixtures, thereby conveying a strong practical motivation for the present analysis.
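Coming back to the derivation above, the spike location m(λ̄) = −c/(a_1(1+ρ)) and the alignment formula (35) can be sanity-checked numerically; the following sketch is our own check in the simplest a_2 = 0 setting, f(t) = t (so a_1 = 1, ν = 1), with Gaussian data and balanced classes:

```python
import numpy as np

# Corollary 2 check for f(t) = t: the informative spike sits at x(m_rho)
# with m_rho = -c/(a1 (1 + rho)), and the alignment follows (35).
c, rho, a1, nu = 2.0, 4.0, 1.0, 1.0
m_rho = -c / (a1 * (1 + rho))
spike = -1 / m_rho - a1**2 * m_rho / (c + a1 * m_rho) - (nu - a1**2) * m_rho / c
pred = a1 * rho / (1 + rho) * (
    1 - a1**2 * c * m_rho**2 / (c + a1 * m_rho)**2 - (nu - a1**2) * m_rho**2 / c)

p, n = 2000, 1000
rng = np.random.default_rng(4)
Z = rng.standard_normal((p, n))
mu = rng.standard_normal(p)
mu *= np.sqrt(rho) / np.linalg.norm(mu)          # ||mu||^2 = rho
v = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
X = Z + np.outer(mu, v)

K = X.T @ X / p                                  # f(t) = t: K = X^T X/p
np.fill_diagonal(K, 0)                           # zero diagonal
eigvals, eigvecs = np.linalg.eigh(K)
top_val, top_vec = eigvals[-1], eigvecs[:, -1]
align = (top_vec @ v) ** 2 / n
print(top_val, spike, align, pred)               # top_val ~ spike, align ~ pred
```

Both the isolated eigenvalue and the eigenvector alignment land close to their asymptotic predictions already at these moderate dimensions.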



Footnotes: (i) The norm ‖·‖ is the Euclidean norm for vectors and the operator norm for matrices. (ii) Otherwise, we could consider −K instead of K, so that the largest eigenvalue becomes the smallest one.



Figure 2: (Left) Histogram of the eigenvalues of K (blue) versus the limiting spectrum and spikes (red). (Right) Eigenvectors associated with the largest (top) and second largest (bottom) eigenvalues of K (blue), versus the rescaled class label αv/√n (red, from Corollary 2). f(t) = sin(t) − 3cos(t) + 3/√e, p = 800, n = 6400, µ = 1.1 · 1_p/√p, v = [−1_{n/2}; 1_{n/2}] on Student-t data with κ = 5.

Figure 3: (Left) Empirical alignment |v̂^T v|^2/n (green crosses) and misclassification rate (purple circles), in markers, versus their limiting behaviors, in lines, for f(t) = sign(t), as a function of the SNR ρ. (Right) Misclassification rate as a function of the truncation threshold s for the sparse f1 (blue crosses) and binary f3 (red circles) with ρ = 4. Here, p = 512, n = 256, µ ∝ N(0, I_p), v = [−1_{n/2}; 1_{n/2}] on Gaussian data. The results are obtained by averaging over 250 runs.

Figure 5: (Left) Proportion of nonzero entries with uniform versus selective sparsification f 1 and their ratio r(s), as a function of the truncation threshold s. (Right) Comparison of 1%, 10% classification error and phase transition (i.e., 50% error) curves between subsampling (green), uniform (blue) and selective sparsification f 1 (red), as a function of sparsity level ε and SNR ρ, for c = 2.

-(right); and 2. s_opt = exp(−s_opt^2)/(2√π · erfc(s_opt)) ≈ 0.43 for the binary f3, with sparsity level ε ≈ 0.54.

Figure 6: Performance of f 1 (blue), f 2 with M = 3 (green) and f 3 (red) of the same storage size, versus SNR for c = 4 (left) and versus c for SNR ρ = 4 (middle). (Right) Optimal threshold s opt (green) and (ν/a 2 1 ) min (purple) of f 2 versus M . Curves for linear f (t) = t are displayed in black.

Figure 8: Functional inverse x(m) for m ∈ R \ ({−c/a_1} ∪ {0}), with f(t) = max(t, 0) − 1/√(2π), for c = 2 (top, with two edges) and c = 1/10 (bottom, with four edges). The support of ω can be read on the vertical axes, and the values of x such that x′(m) = 0 are marked in green.

Figure 11: Eigenvalues of K with µ = 0 (blue) versus the limiting laws in Theorem 1 and Corollary 3 (red) for Gaussian data, p = 1024, n = 512 and f(t) = t · 1_{|t| > √2 s} with s = 0.1 (left), s = 0.75 (middle), and s = 1.5 (right).

Figure 13: Eigenvalues of K with µ = 0 (blue) versus the limiting laws and spikes in Corollaries 3 and 1 (red) for p = 400, n = 6000, with f(t) = max(t, 0) − 1/√(2π) and Gaussian data.

Figure 14: Clustering performance (left), proportion of nonzero entries, and computational time of the top eigenvector (right, in markers) of sparse f 1 and quantized f 2 with M = 2, as a function of the truncation threshold s on the MNIST dataset: digits (0, 1) (left) and (5, 6) (middle and right) with n = 2 048 and performance of the linear function in black. Results averaged over 100 runs.

Figure 15: Clustering performance of sparse f 1 , quantized f 2 (with M = 2) and binary f 3 as a function of the truncation threshold s on: (left) Kuzushiji-MNIST class 3 versus 4, (middle) Fashion-MNIST class 0 versus 9, and (right) Kannada-MNIST class 4 versus 8, for n = 2 048 and performance of the linear function in black. Results averaged over 100 runs.

Figure 16: Clustering performance of sparse f1, quantized f2 (with M = 2) and binary f3 as a function of the truncation threshold s on GoogLeNet features of the ImageNet dataset: (left) class "pizza" versus "daisy" and (right) class "hamburger" versus "coffee", for n = 1024 and performance of the linear function in black. Results averaged over 10 runs.

ACKNOWLEDGMENTS

We would like to acknowledge DARPA, IARPA (contract W911NF20C0035), NSF, and ONR via its BRC on RandNLA for providing partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred. Couillet's work is partially supported by MIAI at University Grenoble-Alpes (ANR-19-P3IA-0003) and the HUAWEI LarDist project.


Discussions. Note that, while with either a_2 = 0 or v^T 1_n = 0 we obtain the same expression for the projection |v̂^T v|^2, the possible spikes of interest λ̂ (and their asymptotic locations λ̄) in these two scenarios can be rather different. More precisely:

1. with a_2 = 0, there is a single possible spike λ̂, with m(λ̄) = m_ρ = −c/(a_1(ρ+1));

2. with v^T 1_n = 0, there can be up to three spikes, corresponding to m_ρ = −c/(a_1(ρ+1)) and to m_± = ±√(2/(κ−1)) · c/a_2.

This observation leads to the following remark.

Remark 4 (Noisy top eigenvector with a_2 ≠ 0). For v^T 1_n = 0 and a_2 ≠ 0, one may have m_− = −√(2/(κ−1)) · c/a_2 > −c/(a_1(ρ+1)) = m_ρ, for instance with large a_2 and small a_1. Since m(x) is an increasing function, the top eigenvalue-eigenvector pair of K can then be non-informative, independent of the SNR ρ, and totally useless for clustering purposes. An example is provided in Figure 2, where one observes that (i) the largest spike (on the right-hand side) corresponds to a noisy eigenvector, while the second largest spike has its eigenvector positively aligned with the label vector v; and (ii) the theoretical prediction of the eigen-alignment α in Corollary 2 still holds here, due to v^T 1_n = 0. This extends our Corollary 1 to the signal-plus-noise scenario and confirms the advantage and necessity of taking a_2 = 0.

As a side remark, in contrast to Remark 3 and Figure 13, where we observed that a non-informative spike can lie between the two disjoint regions of the limiting measure ω, in the case of a_1 > 0 the informative spike m_ρ = −c/(a_1(ρ+1)) can only appear on the right-hand side of the support of ω, since −c/a_1 < −c/(a_1(ρ+1)) < 0 for ρ = lim_p ‖µ‖^2 ≥ 0. See Figure 8-(bottom) for an illustration.

A.4 PROOF OF PROPOSITION 1

Note that, for v̂ the top isolated eigenvector of K, with Corollary 2 we can write

v̂ = α v/√n + σ w,

for some α, σ ∈ R and w ∈ R^n a zero-mean random vector, orthogonal to v and of unit norm.
To evaluate the asymptotic clustering performance in the setting of Proposition 1 (i.e., with the estimate Ĉ_i = sign([v̂]_i), for v̂^T v ≥ 0), we need to assess the probability Pr(sign([v̂]_i) ≠ [v]_i), and it thus remains to derive σ. Note that α^2 + σ^2 = ‖v̂‖^2 = 1, so that σ^2 = 1 − α^2, where we recall ‖v‖ = √n; this, together with an argument on the normal fluctuations of v̂ (Kadavankandy & Couillet, 2019), concludes the proof.
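A minimal simulation of the resulting clustering rule (our own sketch, assuming the normal-fluctuation argument, so that the misclassification rate is approximately Φ(−α/σ) with σ^2 = 1 − α^2) for the linear kernel f(t) = t reads:

```python
import math
import numpy as np

# Proposition 1's rule: estimate class i by sign([v_hat]_i), v_hat the
# top eigenvector of K. Predicted error: Phi(-alpha/sigma), sigma^2 = 1 - alpha^2.
c, rho, a1 = 2.0, 4.0, 1.0
p, n = 2000, 1000
rng = np.random.default_rng(5)
Z = rng.standard_normal((p, n))
mu = rng.standard_normal(p)
mu *= math.sqrt(rho) / np.linalg.norm(mu)
v = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
X = Z + np.outer(mu, v)

K = X.T @ X / p
np.fill_diagonal(K, 0)
v_hat = np.linalg.eigh(K)[1][:, -1]

err = np.mean(np.sign(v_hat) != np.sign(v))
err = min(err, 1 - err)                  # eigenvector sign is arbitrary

# alpha^2 from the alignment formula (f(t) = t: a1 = 1, nu = 1)
m_rho = -c / (a1 * (1 + rho))
alpha2 = a1 * rho / (1 + rho) * (1 - a1**2 * c * m_rho**2 / (c + a1 * m_rho)**2)
pred = 0.5 * math.erfc(math.sqrt(alpha2 / (1 - alpha2)) / math.sqrt(2))
print(err, pred)                         # empirical and predicted rates agree
```

The empirical misclassification rate matches the Gaussian-fluctuation prediction well at these sizes, in line with Figure 3-(left).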

B ADDITIONAL EMPIRICAL RESULTS ON REAL-WORLD DATASETS

In Figure 14, we compare the clustering performance, the level of sparsity, and the computational time of the top eigenvector for the sparse function f1 and the quantized f2 with M = 2 (i.e., 2 bits per nonzero entry), on the MNIST dataset. We see that, unlike the binary f3, with which small entries of K are set to zero, the quantized function f2, by letting the small entries of K take certain nonzero values, yields surprisingly good performance on the MNIST dataset. This performance gain comes, however, at the price of a computational burden approximately the same as for the original dense matrix X^T X, since the sparsity is lost with f2; see Figure 1-(left). This may be interpreted as a trade-off between storage size and computational time. Also, from the left and middle displays of Figure 7 and Figure 14, we see that for MNIST data, while the classification error rate on the pair of digits (0, 1) can be as low as 1%, the performance on the pair (5, 6) is far from satisfactory, whether with the linear f(t) = t or with the proposed f1, f2, or f3. This is a limitation of the proposed statistical model in (1), which only takes into account first-order discriminative statistics. Indeed, it has been shown in (Liao & Couillet, 2019) that taking a_2 = 0 (as is the case for the proposed f1, f2, and f3) asymptotically discards the second-order discriminative statistics.

