LIKELIHOOD ADJUSTED SEMIDEFINITE PROGRAMS FOR CLUSTERING HETEROGENEOUS DATA

Abstract

Clustering is a widely deployed unsupervised learning tool. Model-based clustering is a flexible framework to tackle data heterogeneity when the clusters have different shapes. Likelihood-based inference for mixture distributions often involves non-convex and high-dimensional objective functions, posing difficult computational and statistical challenges. The classic expectation-maximization (EM) algorithm is a computationally thrifty iterative method that maximizes a surrogate function minorizing the log-likelihood of the observed data in each iteration, but it suffers from bad local maxima even in the special case of the standard Gaussian mixture model with common isotropic covariance matrices. On the other hand, recent studies reveal that the unique global solution of a semidefinite programming (SDP) relaxed K-means achieves the information-theoretically sharp threshold for perfectly recovering the cluster labels under the standard Gaussian mixture model. In this paper, we extend the SDP approach to a general setting by integrating cluster labels as model parameters and propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the exact observed likelihood in the presence of data heterogeneity. By lifting the cluster assignment to group-specific membership matrices, iLA-SDP avoids centroid estimation, a key feature that allows exact recovery under well-separatedness of centroids without being trapped by their adversarial configurations. Thus iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data. Our numerical experiments demonstrate that iLA-SDP achieves lower mis-clustering errors than several widely used clustering methods, including K-means, SDP and EM.

1. INTRODUCTION

Cluster analysis has been widely studied and regularly used in machine learning and its applications in network science (Girvan & Newman, 2002), computer vision (Shi & Malik, 2000; Joulin et al., 2010), manifold learning (Chen & Yang, 2021a) and bioinformatics (Karim et al., 2020). Perhaps the most popular clustering method by far is K-means (MacQueen, 1967), partially because there are computationally convenient algorithms such as Lloyd's algorithm and K-means++ for heuristic approximation (Lloyd, 1982; Arthur & Vassilvitskii, 2007). Mathematically, K-means aims to find the optimal partition of the data that minimizes the total within-cluster squared Euclidean distance, which is equivalent to the maximum profile likelihood estimator under the standard Gaussian mixture model (GMM) with common isotropic covariance matrices (Chen & Yang, 2021b). Nevertheless, real data usually exhibit various degrees of heterogeneity; for instance, the cluster shapes may vary from component to component, which renders K-means a sub-optimal clustering method. Another popular clustering method is the classic expectation-maximization (EM) algorithm, a computationally thrifty method based on the idea of data augmentation that iteratively optimizes the non-convex observed data likelihood (Dempster et al., 1977). Theoretical investigations reveal that the EM algorithm suffers from bad local maxima even in the one-dimensional standard GMM with well-separated cluster centers (Jin et al., 2016). Thus, in practice, even in highly favorable separation-to-noise ratio settings, careful initialization, often through multiple random initializations or a warm start with another heuristic method such as hierarchical clustering (Fraley & Raftery, 2002), is key for the EM algorithm to find the correct cluster labels and model parameters.
With a reasonable initial start, the EM algorithm has been shown to achieve good statistical properties (Balakrishnan et al., 2017; Wu & Zhou, 2019). In this paper, we consider likelihood-based inference to tackle the problem of recovering cluster labels in the presence of data heterogeneity. Our motivation stems from recent progress in understanding the computational and statistical limits of convex relaxation methods for K-means clustering. Since K-means is a worst-case NP-hard problem (Aloise et al., 2009), various heuristic approximation algorithms such as Lloyd's algorithm (Lloyd, 1982; Lu & Zhou, 2016), and computationally tractable relaxations such as spectral clustering (Meila & Shi, 2001; Ng et al., 2001; Vempala & Wang, 2004; Achlioptas & McSherry, 2005; von Luxburg, 2007; von Luxburg et al., 2008) and semidefinite programs (SDP) (Peng & Wei, 2007; Mixon et al., 2016; Li et al., 2017; Fei & Chen, 2018; Chen & Yang, 2021a; Royer, 2017; Giraud & Verzelen, 2018; Bunea et al., 2016; Zhuang et al., 2022a), have been proposed in the literature. Among the existing solutions, the SDP approach is particularly attractive in that it attains the information-theoretically optimal threshold on centroid separation for exact recovery of cluster labels (Chen & Yang, 2021b). Our contributions. We extend the SDP approach to a general setting with heterogeneous features by integrating cluster labels as model parameters (together with other component-specific parameters) and propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the exact observed data likelihood. Our idea is to leverage the strength of the SDP relaxation of K-means clustering in the isotropic covariance case for likelihood-aware inference. On one hand, iLA-SDP has a similar flavor to the EM algorithm in that it maximizes the likelihood function of the observed data.
On the other hand, different from the EM framework, iLA-SDP treats the cluster labels as unknown parameters while profiling out the cluster centers (i.e., centroids), which brings several statistical and algorithmic advantages. First, in the arguably simplest one-dimensional GMM setting, EM is known to fail for certain configurations of centroids even when they are well separated (Jin et al., 2016). In other words, EM is sensitive to initialization and model configuration. The main reason is the need to estimate the cluster centers during the EM iterations. In iLA-SDP, the cluster centers are regarded as nuisance parameters and profiled out to obtain a likelihood function in the component-specific parameters, which include only the cluster covariance matrices. Thus iLA-SDP is more stable and performs empirically better than EM. Second, cluster labels in EM are latent variables estimated by their posterior probabilities, and the observed log-likelihood in the component parameters and mixing weights is optimized through minorizing functions during the iterations. In iLA-SDP, cluster labels are parameters optimized through the likelihood function jointly in the labels and covariance matrices. Thus iLA-SDP is a more direct approach than EM for taming the non-convexity in the observed log-likelihood objective, and we prove that it perfectly recovers the true clustering structure if the clusters are well separated above a lower bound, regardless of the configuration of the centroids. The rest of the paper is organized as follows. In Section 2, we review background on the partition-based formulation for model-based clustering. In Section 3, we introduce the likelihood adjusted SDP for recovering the true partition structure and discuss its connection to the EM algorithm. In Section 4, we compare the performance of several widely used clustering methods on two real datasets.

2. MODEL-BASED CLUSTERING: A PARTITION FORMULATION

We consider the model-based clustering problem. Suppose the data points $X_1, \ldots, X_n \in \mathbb{R}^p$ are independent random variables sampled from a $K$-component Gaussian mixture model (GMM). Specifically, let $G_1^*, \ldots, G_K^*$ be the true partition of the index set $[n] := \{1, \ldots, n\}$ such that if $i \in G_k^*$, then $X_i = \mu_k + \epsilon_i$, where $\mu_k \in \mathbb{R}^p$ is the center of the $k$-th cluster and $\epsilon_i$ is an independent random noise term following the distribution $N(0, \Sigma_k)$. Here we focus on the most general and realistic scenario where the within-cluster covariance matrices $\Sigma_1, \ldots, \Sigma_K$ are heterogeneous. In our formulation of the GMM, the true partition $(G_k^*)_{k=1}^K$ is treated as an unknown model parameter, along with the component-wise parameters $(\mu_k, \Sigma_k)_{k=1}^K$. With this parameterization $(G_k, \mu_k, \Sigma_k)_{k=1}^K$, the log-likelihood function for observing the data $X = \{X_1, \ldots, X_n\}$ is given by
$$\ell\big((G_k, \mu_k, \Sigma_k)_{k=1}^K \mid X\big) = -\sum_{k=1}^K \frac{|G_k|}{2} \log(2\pi|\Sigma_k|) - \frac{1}{2}\sum_{k=1}^K \sum_{i \in G_k} (X_i - \mu_k)^T \Sigma_k^{-1} (X_i - \mu_k),$$
where $|G_k|$ is the cardinality of $G_k$ and $|\Sigma_k|$ is the determinant of the matrix $\Sigma_k$. Since we are primarily interested in recovering the cluster labels (or equivalently the assignment matrix, cf. Section 3.1 below) in the presence of cluster heterogeneity, we can first profile out the nuisance parameters $\mu_k$ in closed form, and the resulting objective function, a profile log-likelihood for the remaining parameters (after dropping constants), is given by
$$\ell\big((G_k, \Sigma_k)_{k=1}^K \mid X\big) = -\sum_{k=1}^K |G_k| \log(|\Sigma_k|) - \sum_{k=1}^K \sum_{i \in G_k} \|X_i\|_{\Sigma_k^{-1}}^2 + \sum_{k=1}^K \frac{1}{|G_k|} \sum_{i,j \in G_k} \langle X_i, X_j \rangle_{\Sigma_k^{-1}}, \quad (2)$$
where $\langle v, u \rangle_\Sigma := v^T \Sigma u$ and $\|u\|_\Sigma^2 := \langle u, u \rangle_\Sigma$ for any $u, v \in \mathbb{R}^p$ and $\Sigma \succ 0$. This leads us to a combinatorial optimization problem for the profile log-likelihood function:
$$\max\Big\{\ell\big((G_k, \Sigma_k)_{k=1}^K \mid X\big) : \bigsqcup_{k=1}^K G_k = [n],\ \Sigma_k \succ 0\Big\}, \quad (3)$$
where the disjoint union $\bigsqcup_{k=1}^K G_k = [n]$ means that $\bigcup_{k=1}^K G_k = [n]$ and $G_j \cap G_k = \emptyset$ if $j \neq k$.
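The profile log-likelihood (2) can be evaluated directly in code. Below is a minimal NumPy sketch, our own illustration rather than the paper's implementation (the function name `profile_loglik` and the data layout are our choices), that computes (2) for a given partition and covariance matrices, dropping the same constants as the paper does.

```python
import numpy as np

def profile_loglik(X, partition, Sigmas):
    """Profile log-likelihood (2) with the centroids mu_k profiled out.

    X: (n, p) data; partition: list of index arrays G_k; Sigmas: list of (p, p)
    SPD matrices. Constants independent of (G_k, Sigma_k) are dropped.
    """
    total = 0.0
    for G, S in zip(partition, Sigmas):
        Sinv = np.linalg.inv(S)
        XG = X[G]                             # points assigned to cluster G_k
        gram = XG @ Sinv @ XG.T               # <X_i, X_j>_{Sigma_k^{-1}}
        total += (-len(G) * np.log(np.linalg.det(S))
                  - np.trace(gram)            # sum_i ||X_i||^2_{Sigma_k^{-1}}
                  + gram.sum() / len(G))      # (1/|G_k|) sum_{i,j} <X_i, X_j>
    return total
```

With identity covariances, the expression collapses to the familiar K-means profile likelihood, which is a quick way to sanity-check an implementation.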
Note that the constrained optimization problem in (3) in the special case $\Sigma_1 = \cdots = \Sigma_K = \sigma^2 \mathrm{Id}_p$ reduces to the K-means clustering method, which is known to be worst-case NP-hard (Dasgupta, 2007; Mahajan et al., 2009). To overcome this computational difficulty, semidefinite programming (SDP) relaxation is a tractable alternative that achieves information-theoretically optimal exact recovery under the standard GMM with identical and isotropic covariance matrices (Chen & Yang, 2021a). Nevertheless, all existing formulations of SDP relaxations of the standard GMM critically depend on the assumption that $\Sigma_1 = \cdots = \Sigma_K = \sigma^2 \mathrm{Id}_p$ with a known noise variance parameter $\sigma^2$ (Fei & Chen, 2018; Li et al., 2017; Peng & Wei, 2007; Chen & Yang, 2021a). This motivates us to seek alternative SDP formulations that adjust for the (full) information coming from the likelihood function of the observed data $X$.

3. LIKELIHOOD ADJUSTED SDP FOR CLUSTERING HETEROGENEOUS DATA

In this section, we introduce the likelihood adjusted SDP (LA-SDP) for recovering the true partition structure G * 1 , . . . , G * K by applying convex relaxation to the profile log-likelihood function (3).

3.1. ORACLE LA-SDP UNDER KNOWN COVARIANCE MATRICES

In this subsection, we consider the oracle case where the covariance matrices $\Sigma_1, \ldots, \Sigma_K$ are known. Let us start with a well-studied SDP relaxation (Peng & Wei, 2007) for approximating the combinatorial problem of maximizing the profile log-likelihood under the isotropic setting with known $\Sigma_1 = \cdots = \Sigma_K = \sigma^2 \mathrm{Id}_p$, which is known (Chen & Yang, 2021b) to attain the information-theoretically optimal threshold on centroid separation for exact recovery of cluster labels. Note that there is a one-to-one correspondence between any given partition $(G_k)_{k=1}^K$ of $[n]$ and a binary assignment matrix $H = (h_{ik}) \in \{0,1\}^{n \times K}$ (up to permutation of cluster labels) such that $h_{ik} = 1$ if $i \in G_k$ and $h_{ik} = 0$ otherwise, for $i \in [n]$ and $k \in [K]$. Because each row of $H$ contains exactly one non-zero entry, the recovery of the true clustering structure (or its associated assignment matrix) by maximizing the profile log-likelihood function (after dropping constants) can be re-expressed as a (non-convex) mixed integer program:
$$\max_H\ \langle A, HBH^\top \rangle = \sum_{k=1}^K \frac{1}{|G_k|} \sum_{i,j \in G_k} \langle X_i, X_j \rangle, \quad \text{subject to } H \in \{0,1\}^{n \times K} \text{ and } H 1_K = 1_n, \quad (4)$$
where $A = X^\top X$ is the $n \times n$ similarity matrix, $1_n$ denotes the $n$-dimensional vector of all ones, and $B$ is the $K \times K$ diagonal matrix whose $k$-th diagonal entry is $|G_k|^{-1} = (\sum_{i=1}^n h_{ik})^{-1}$. Here, we have used the key identity $\sum_{k=1}^K w_k \sum_{i,j \in G_k} a_{ij} = \langle A, HBH^\top \rangle$, which holds for any diagonal matrix $B = \mathrm{diag}(w_1, \ldots, w_K)$ and similarity matrix $A = (a_{ij})_{i,j=1}^n$. Relaxing the above mixed integer program (4) by lifting the assignment matrix $H$ into $Z = HBH^\top$, we arrive at its SDP relaxation
$$\hat{Z} = \arg\max_{Z \in \mathbb{R}^{n \times n}} \langle A, Z \rangle, \quad \text{subject to } Z \succeq 0,\ \mathrm{tr}(Z) = K,\ Z 1_n = 1_n,\ Z \geqslant 0, \quad (5)$$
where $Z \geqslant 0$ means each entry $Z_{ij} \geqslant 0$ and $Z \succeq 0$ means the matrix $Z$ is symmetric and positive semidefinite.
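As a sanity check on the lifting argument, the following NumPy snippet (a toy example of our own, not the paper's code, and not the SDP solver itself) verifies numerically that $Z = HBH^\top$ satisfies the linear constraints of the relaxation (5) and that $\langle A, Z \rangle$ reproduces the K-means profile-likelihood objective in (4).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 9, 3, 3
X = rng.normal(size=(p, n))                 # data matrix, one column per point
labels = np.repeat(np.arange(K), n // K)    # a feasible hard assignment
H = np.eye(K)[labels]                       # n x K binary assignment matrix
sizes = H.sum(axis=0)
B = np.diag(1.0 / sizes)
Z = H @ B @ H.T                             # lifted membership matrix

A = X.T @ X                                 # similarity matrix
obj_lifted = np.sum(A * Z)                  # <A, Z>
obj_direct = sum(A[np.ix_(idx, idx)].sum() / len(idx)
                 for idx in (np.where(labels == k)[0] for k in range(K)))

assert np.isclose(obj_lifted, obj_direct)        # key identity behind (4)
assert np.isclose(np.trace(Z), K)                # tr(Z) = K
assert np.allclose(Z @ np.ones(n), np.ones(n))   # Z 1_n = 1_n
assert (Z >= -1e-12).all()                       # Z >= 0 entrywise
```

Any feasible hard assignment passes these checks, which is exactly why the constraints of (5) are a valid relaxation of (4).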
This SDP formulation relaxes the integer constraint on $H$ into two linear constraints, $\mathrm{tr}(Z) = K$ and $Z \geqslant 0$, that are satisfied by any $Z = HBH^\top$ as $H$ ranges over feasible solutions of problem (4). Now let us consider the general heterogeneous setting with (possibly) different and non-isotropic covariance matrices $\Sigma_1, \ldots, \Sigma_K$, and extend the SDP relaxation to this setting. Two technical difficulties arise when examining the previous argument. First, the first two terms in the profile log-likelihood function (3) are no longer independent of the assignment matrix, and are therefore not negligible. In particular, they also provide partial information about the cluster labels when the covariance matrices are different: $\|X_i\|_{\Sigma_k^{-1}}^2$ in the second term quantifies how well $X_i$ aligns with the covariance matrix $\Sigma_k$ encoding second-order information of the $k$-th cluster, while the first term plays the role of balancing the cluster sizes and favors assigning more points to clusters with smaller volume (where the density is expected to be higher). Second, the similarity $\langle X_i, X_j \rangle_{\Sigma_k^{-1}}$ within cluster $G_k$ in the third term now depends on $k$, so the key identity $\sum_{k=1}^K w_k \sum_{i,j \in G_k} a_{ij} = \langle A, HBH^\top \rangle$ connecting the profile log-likelihood with the objective of the mixed integer program (4) is no longer applicable. To resolve these two difficulties, we propose to augment the single variable $Z$ in the SDP relaxation (5) to $K$ variables $(Z_k)_{k=1}^K$, where $Z_k$ can be interpreted as the lifting of the $k$-th column $H_k$ of the assignment matrix $H$ via
$$Z_k = \frac{1}{|G_k|} H_k H_k^\top, \quad |G_k| = \sum_{i=1}^n h_{ik} = H_k^\top 1_n,$$
which encodes the cluster membership associated with the $k$-th cluster.
More specifically, by extending the key identity of the isotropic setting to
$$\sum_{k=1}^K w_k \sum_{i,j \in G_k} a_{ij}^{(k)} = \sum_{k=1}^K \langle A^{(k)}, H_k w_k H_k^\top \rangle$$
for any weight vector $w = (w_k)_{k=1}^K$ and $K$ similarity matrices $\big(A^{(k)} = (a_{ij}^{(k)})_{i,j=1}^n\big)_{k=1}^K$, we can analogously express the profile log-likelihood maximization problem as the following (non-convex) mixed integer program:
$$\max_H \sum_{k=1}^K \langle A^{(k)}, H_k w_k H_k^\top \rangle, \quad \text{subject to } H_k \in \{0,1\}^{n \times 1} \text{ and } \sum_{k=1}^K H_k = 1_n, \quad (6)$$
where $w_k = |G_k|^{-1} = (\sum_{i=1}^n h_{ik})^{-1}$, and the $k$-th cluster-specific similarity matrix $A^{(k)}$ is
$$A^{(k)} := -\log(|\Sigma_k|)\, 1_n 1_n^T - \frac{1}{2}\Big(\mathrm{diag}(X^T \Sigma_k^{-1} X)\, 1_n^T + 1_n\, \mathrm{diag}(X^T \Sigma_k^{-1} X)^T\Big) + X^T \Sigma_k^{-1} X. \quad (7)$$
Here, $\mathrm{diag}(A)$ stands for the column vector composed of the diagonal entries of a matrix $A$. Now by lifting $H_k$ into $Z_k = H_k w_k H_k^\top$, we arrive at the following SDP relaxation for the profile log-likelihood objective (2):
$$(\hat{Z}_1, \ldots, \hat{Z}_K) = \arg\max_{Z_1, \ldots, Z_K \in \mathbb{R}^{n \times n}} \sum_{k=1}^K \langle A^{(k)}, Z_k \rangle, \quad \text{subject to } Z_k \succeq 0,\ \sum_{k=1}^K \mathrm{tr}(Z_k) = K,\ \sum_{k=1}^K Z_k 1_n = 1_n,\ Z_k \geqslant 0\ \ \forall k \in [K], \quad (8)$$
which relaxes the integer constraint on $H = (H_1, H_2, \ldots, H_K)$ into the $(K+1)$ linear constraints $\sum_{k=1}^K \mathrm{tr}(Z_k) = K$ and $Z_k \geqslant 0$ for $k \in [K]$, which are satisfied by any $Z_k = H_k w_k H_k^\top$ as $H$ ranges over feasible solutions of problem (6). Since solving (8) requires knowledge of the true covariance matrix of each component, we call the solution $(\hat{Z}_k)_{k=1}^K$ the oracle likelihood adjusted SDP (LA-SDP) estimate of the cluster membership matrices of the data points. In the special case of isotropic covariance matrices $\Sigma_1 = \cdots = \Sigma_K = \sigma^2 \mathrm{Id}_p$, Proposition 1 below shows that LA-SDP reduces to the previous SDP formulation (5).

Proposition 1 (SDP relaxation for K-means is a special case of LA-SDP). Suppose $\Sigma_k = \sigma^2 \mathrm{Id}_p$ for all $k \in [K]$. Let $\hat{Z}$ be the solution to (5) achieving maximum $M_1$, and $\hat{Z}_k$, $k = 1, \ldots, K$, be the solution to (8) achieving maximum $M_2$. Then $M_1 = M_2$, and $\hat{Z} = \sum_{k=1}^K \hat{Z}_k$ if $\hat{Z}$ is unique in (5).
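For concreteness, the cluster-specific similarity matrix $A^{(k)}$ in (7) can be assembled in a few lines. The sketch below (function name ours, not from the paper) also illustrates why $\langle A^{(k)}, Z_k \rangle$ recovers the $k$-th term of the profile log-likelihood (2) when $Z_k$ is a lifted hard assignment.

```python
import numpy as np

def similarity_matrix(X, Sigma):
    """Cluster-specific similarity A^(k) from equation (7); X is p x n."""
    Sinv = np.linalg.inv(Sigma)
    G = X.T @ Sinv @ X                    # Mahalanobis Gram matrix X^T Sigma^{-1} X
    d = np.diag(G)[:, None]               # squared norms ||X_i||^2_{Sigma^{-1}}
    n = X.shape[1]
    one = np.ones((n, 1))
    return (-np.log(np.linalg.det(Sigma)) * (one @ one.T)
            - 0.5 * (d @ one.T + one @ d.T)
            + G)
```

Summing the entries of $A^{(k)}$ over a cluster block and dividing by the cluster size gives exactly $-|G_k|\log|\Sigma_k| - \sum_{i \in G_k}\|X_i\|^2_{\Sigma_k^{-1}} + |G_k|^{-1}\sum_{i,j \in G_k}\langle X_i, X_j\rangle_{\Sigma_k^{-1}}$.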


Figure 1: LA-SDP membership matrices to cluster labels via spectral rounding.

Note that the SDP relaxed K-means in (5) was originally proposed by Peng & Wei (2007) and has been extensively studied in the literature. In particular, it achieves the information-theoretic limit for exact recovery under the standard GMM (Chen & Yang, 2021a) and is robust against outliers and adversarial attacks (Fei & Chen, 2018). In the case of exact recovery, where $\hat{Z} = Z^*$ and $Z^*$ is the true cluster membership matrix such that $Z^*_{ij} = |G_k^*|^{-1}$ if $i, j \in G_k^*$ and $Z^*_{ij} = 0$ otherwise, we can easily recover the true partition structure $G_1^*, \ldots, G_K^*$ (or its associated assignment matrix) from the block diagonal matrix $\hat{Z}$. Thus it is an interesting theoretical question when the partition structure induced by $\hat{Z} = \sum_{k=1}^K \hat{Z}_k$ from the LA-SDP (see Figure 1 for an illustration) achieves exact recovery. Theorem 2 below gives a lower bound on the separation signal-to-noise ratio for achieving exact recovery in the presence of data heterogeneity. For each distinct pair $(k, l) \in [K]$, let
$$D^{(k,l)} := \frac{\sum_{i=1}^p \big(\lambda_i - \log(1 + \lambda_i)\big)}{p \max_i |\lambda_i|}$$
characterize the closeness between $\Sigma_k$ and $\Sigma_l$, where $\lambda_1, \ldots, \lambda_p$ enumerate the eigenvalues of $(\Sigma_l^{1/2} \Sigma_k^{-1} \Sigma_l^{1/2} - \mathrm{Id}_p)$. If $\lambda_i = 0$ for all $i \in [p]$, we set $D^{(k,l)} = 0$. Let $\Delta^2 := \min_{k \neq l} \|\Sigma_k^{-1/2}(\mu_k - \mu_l)\|^2$ denote the covariance adjusted centroid separation, $n_k := |G_k^*|$ the size of the true cluster $G_k^*$, $m := \min_{k \neq l} \frac{2 n_k n_l}{n_k + n_l}$ the least pairwise harmonic mean of the cluster sizes, $\underline{n} := \min_k n_k$ the minimal cluster size, and $M := \max_{k \neq l} \|\Sigma_l^{1/2} \Sigma_k^{-1} \Sigma_l^{1/2}\|_{\mathrm{op}}$ (matrix operator norm).

Theorem 2 (Exact recovery for LA-SDP). Suppose there exist constants $\delta > 0$, $\beta \in (0,1)$ and $\eta \in (0,1)$ such that
$$\log n \geq \max\Big\{\frac{(1-\beta)^2}{\beta^2},\ \frac{(1-\beta)(1-\eta) K^2}{\beta^2} \max\{(M-1)^2, 1\}\Big\}\, \frac{C_1 n}{m}, \quad \delta \leq \frac{\beta^2 (1-\beta)^2}{C_2 M^{1/2} K}, \quad m \geq \frac{4(1+\delta)^2}{\delta^2}.$$
Then LA-SDP achieves exact recovery, i.e. $\hat{Z} = Z^*$, with probability at least $1 - C_7 K^3 n^{-\delta}$, provided
$$\Delta^2 \geq (E_1 + E_2) \log n \quad \text{and} \quad \min_{k \neq l} D^{(k,l)} \geq C_3 (1 + \log n / p + p / n),$$
where concrete expressions of $E_1$ and $E_2$ (depending on $\delta, \beta, \eta$) are provided in Appendix A.4, and $C_1, \ldots, C_7$ are universal constants.

Our definition of the centroid separation $\Delta$ extends the separation-to-noise ratio (SNR) for exact recovery under the isotropic covariance setting (Chen & Yang, 2021a) to the heterogeneous setting by taking into account the cluster shapes (i.e., second-order information). From the separation condition above, we see that the theoretical centroid separation lower bound consists of two parts, $E_1$ and $E_2$: $E_1$ reduces to the information-theoretically optimal threshold when $M = 1$, corresponding to identical covariance matrices; $E_2$ vanishes for small $M$ close to one satisfying $M = 1 + o(1/\sqrt{n \log n})$, and remains as an extra term for large $M$. From our numerical results summarized in Figure 2, we observe that the centroid separation $\Delta$ indeed captures the accuracy of cluster label recovery using LA-SDP: the mis-clustering error curves display almost identical patterns under different settings of the GMM. In comparison, the performance of the (original) SDP (5) and of the K-means clustering method designed for the isotropic case becomes significantly worse as the condition number of the cluster covariance matrices increases. More details about the implementation and model setups are provided in Appendix A.3.
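The quantities $\Delta^2$ and $D^{(k,l)}$ driving Theorem 2 are directly computable from the model parameters. Below is a small NumPy sketch of our own (function names are ours); it uses the fact that $\Sigma_l^{1/2} \Sigma_k^{-1} \Sigma_l^{1/2}$ is similar to $\Sigma_k^{-1} \Sigma_l$, so their eigenvalues coincide.

```python
import numpy as np

def adjusted_separation(mus, Sigmas):
    """Delta^2 = min_{k != l} ||Sigma_k^{-1/2} (mu_k - mu_l)||^2."""
    K = len(mus)
    vals = []
    for k in range(K):
        Sinv = np.linalg.inv(Sigmas[k])
        for l in range(K):
            if l != k:
                d = mus[k] - mus[l]
                vals.append(d @ Sinv @ d)   # equals ||Sigma_k^{-1/2} d||^2
    return min(vals)

def closeness(Sk, Sl):
    """D^{(k,l)} from Theorem 2; zero iff Sigma_k = Sigma_l."""
    # eigenvalues of Sigma_l^{1/2} Sigma_k^{-1} Sigma_l^{1/2} - Id, via similarity
    lam = np.real(np.linalg.eigvals(np.linalg.inv(Sk) @ Sl)) - 1.0
    if np.allclose(lam, 0):
        return 0.0
    p = len(lam)
    return np.sum(lam - np.log1p(lam)) / (p * np.max(np.abs(lam)))
```

Since $t - \log(1+t) \geq 0$ for $t > -1$, $D^{(k,l)}$ is non-negative and vanishes exactly when the two covariance matrices coincide.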

3.2. ITERATIVE LA-SDP UNDER UNKNOWN COVARIANCE MATRICES: AN ALTERNATING MAXIMIZATION ALGORITHM

Since the oracle LA-SDP relies on knowledge of the covariance matrices $\Sigma_1, \ldots, \Sigma_K$, we propose a simple and practical data-driven algorithm for approximating LA-SDP when these covariance matrices are unknown. The idea is to alternate between solving the SDP relaxation given a current estimate of the component covariance matrices and updating the covariance matrices according to the maximum (penalized) likelihood given the new membership estimate. The next lemma gives a closed-form formula for updating the covariance matrices given a current estimate of the assignments $Z_1, \ldots, Z_K$, based on their (unconstrained) MLEs on the observed data.

Figure 2: Results under $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$ ($M = 1$). The left (right) plot corresponds to a moderate (large) condition number of the common covariance matrix. Here, KM refers to the K-means method; SDP refers to the original SDP (5).

Lemma 3 (Updating formula for covariance matrices under alternating maximization). For any feasible matrices $Z_1, \ldots, Z_K$ satisfying the constraints of (8),
$$\hat{\Sigma}_k := \frac{1}{1_n^T Z_k 1_n} \sum_{i,j=1}^n \Big[\frac{1}{2}\big(X_i X_i^T + X_j X_j^T\big) - X_i X_j^T\Big] Z_{k,ij}, \quad k \in [K],$$
solve the following optimization problem:
$$(\hat{\Sigma}_1, \ldots, \hat{\Sigma}_K) = \arg\max_{\Sigma_1, \ldots, \Sigma_K \succeq 0} \sum_{k=1}^K \langle A^{(k)}, Z_k \rangle,$$
where recall that $A^{(k)} := A^{(k)}(\Sigma_k)$ is the $\Sigma_k$-dependent similarity matrix defined in (7).

Based on the lemma, we propose an iterative LA-SDP (iLA-SDP) that alternates between maximizing the profile log-likelihood (3) over the lifted cluster membership matrices $(Z_k)_{k=1}^K$ via LA-SDP (8) and over the component covariance matrices $(\Sigma_k)_{k=1}^K$ via Lemma 3, as summarized in Algorithm 1. In the special case where the lifted membership matrix $Z_k$ has rank one, which holds for the true lifted cluster membership matrices $(Z_k^*)_{k=1}^K$, the covariance matrices produced by iLA-SDP can be interpreted as within-cluster sample covariance matrices under soft clustering.

Proposition 4 (Covariance estimation in iLA-SDP via soft clustering). If $\mathrm{rank}(Z_k) = 1$, then there exist weights $(w_{k,1}, \ldots, w_{k,n})$ such that $\hat{\Sigma}_k$ in Lemma 3 can be written as
$$\hat{\Sigma}_k = \frac{1}{\hat{n}_k} \sum_{i=1}^n w_{k,i} (X_i - \hat{\mu}_k)(X_i - \hat{\mu}_k)^\top, \quad \text{where } \hat{\mu}_k := \frac{1}{\hat{n}_k} \sum_{i=1}^n w_{k,i} X_i \text{ and } \hat{n}_k = \sum_{i=1}^n w_{k,i}.$$

It is further noted from the proof of Proposition 4 that when $Z_k$ has rank one, the weights $w_{k,1}, \ldots, w_{k,n}$ are proportional to the leading eigenvector of $Z_k$. Thus the alternating maximization step for updating the covariance matrices in iLA-SDP can be interpreted as a soft clustering technique that resembles the EM algorithm. Specifically, the E-step of EM estimates the (hard) cluster label $Y_i \in \{0,1\}^K$ associated with $X_i$ by the posterior probabilities $\tau_{ik} := p(Y_{ik} = 1 \mid X_i, \hat{\theta}^{(t)})$, where $\hat{\theta}^{(t)} = (\hat{\pi}_k^{(t)}, \hat{\mu}_k^{(t)}, \hat{\Sigma}_k^{(t)})_{k=1}^K$ denotes the estimated GMM parameters at the $t$-th iteration of EM. Then the M-step updates the parameters via
$$\hat{\pi}_k^{(t+1)} = \frac{m_k^{(t)}}{n} \text{ with } m_k^{(t)} = \sum_{i=1}^n \tau_{ik}^{(t)}, \quad \hat{\mu}_k^{(t+1)} = \frac{1}{m_k^{(t)}} \sum_{i=1}^n \tau_{ik}^{(t)} X_i, \quad \hat{\Sigma}_k^{(t+1)} = \frac{1}{m_k^{(t)}} \sum_{i=1}^n \tau_{ik}^{(t)} \big(X_i - \hat{\mu}_k^{(t+1)}\big)\big(X_i - \hat{\mu}_k^{(t+1)}\big)^\top.$$
Note that the iLA-SDP update in Proposition 4 and the EM update above represent different weighting schemes in the soft clustering rule for estimating the cluster labels. In iLA-SDP, the weight $w_{k,i}$ for $X_i$ belonging to component $k$ is determined by the SDP in (8). Once the weights are computed, the remaining parameter updates in both iLA-SDP and EM boil down to simple averages with effective component sample sizes $\hat{n}_k$ and $m_k$, respectively. In Section 3.3, we provide a deeper comparison between iLA-SDP and EM.

Remark 5. In Appendix A.2, we further propose two variations of iLA-SDP that can handle high-dimensional and large-size data with better computational and statistical efficiency. For high-dimensional data, we apply Fisher's LDA with an initial estimate of the cluster labels to find an optimal feature subspace that increases the SNR for better clustering; for large-size data, we combine a subsampling idea with iLA-SDP to reduce the computational cost (Zhuang et al., 2022b).
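The covariance update in Lemma 3 is a double sum over pairs and can be coded directly; when $Z_k$ has rank one it collapses to the weighted sample covariance of Proposition 4. A minimal NumPy sketch of our own (the function name is ours, and the naive double loop trades speed for transparency):

```python
import numpy as np

def covariance_update(X, Zk):
    """Lemma 3 update: Sigma_k from a feasible membership matrix Z_k; X is p x n."""
    n = X.shape[1]
    total = float(np.ones(n) @ Zk @ np.ones(n))   # 1_n^T Z_k 1_n
    outer = np.einsum('pi,qi->ipq', X, X)         # X_i X_i^T for each i, shape (n, p, p)
    S = np.zeros((X.shape[0], X.shape[0]))
    for i in range(n):
        for j in range(n):
            S += Zk[i, j] * (0.5 * (outer[i] + outer[j]) - np.outer(X[:, i], X[:, j]))
    return S / total
```

For a rank-one $Z_k = ww^\top / \sum_i w_i$ with non-negative weights, expanding the double sum shows the result equals $\hat{n}_k^{-1} \sum_i w_i (X_i - \hat{\mu})(X_i - \hat{\mu})^\top$, matching Proposition 4.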
Algorithm 1: The iterative likelihood adjusted SDP (iLA-SDP) algorithm

Input: Data matrix $X \in \mathbb{R}^{p \times n}$ containing $n$ points; an initialization of the assignments $G_1^{(0)}, \ldots, G_K^{(0)}$ or of the covariance matrices $\Sigma_1^{(0)}, \ldots, \Sigma_K^{(0)}$; stopping criterion parameters $\epsilon, S$.

(Assignments to covariance matrices) If initialized with assignments, let $\Sigma_k^{(0)} := |G_k^{(0)}|^{-1} \sum_{i \in G_k^{(0)}} (X_i - \bar{X}_k)(X_i - \bar{X}_k)^T$ be the sample covariance of each cluster $k \in [K]$, where $\bar{X}_k := |G_k^{(0)}|^{-1} \sum_{i \in G_k^{(0)}} X_i$.

for $s = 1, \ldots, S$ do
  (Adjusted SDP) Solve the adjusted SDP in (8) using $X$ and $\Sigma_1^{(s-1)}, \ldots, \Sigma_K^{(s-1)}$ to get the solution $Z_1^{(s)}, \ldots, Z_K^{(s)}$. Compute the sum $\bar{Z}^{(s)} := \sum_{k=1}^K Z_k^{(s)}$ and, for $s \geq 2$, the relative change $r^{(s)} := \|\bar{Z}^{(s)} - \bar{Z}^{(s-1)}\|_F / \|\bar{Z}^{(s-1)}\|_F$. Break the loop if $r^{(s)} < \epsilon$.
  (Memberships to covariance matrices) Use the formula in Lemma 3 to compute the covariance matrices $\Sigma_1^{(s)}, \ldots, \Sigma_K^{(s)}$ from $Z_1^{(s)}, \ldots, Z_K^{(s)}$.
end for

(Spectral rounding) Perform the spectral decomposition of $\bar{Z}^{(S)}$ and take the top $K$ eigenvectors $(\hat{u}_1, \ldots, \hat{u}_K)$. Run K-means clustering on the rows of $(\hat{u}_1, \ldots, \hat{u}_K)$ and extract the cluster labels $\hat{G}_1, \ldots, \hat{G}_K$ as a partition estimate of $[n]$.

Output: A partition estimate $\hat{G}_1, \ldots, \hat{G}_K$ of $[n]$.
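The final rounding step of Algorithm 1 can be sketched as follows. For self-containedness we substitute a tiny deterministic Lloyd iteration with farthest-point seeding for the library K-means call; this seeding rule is an implementation choice of ours, not part of the paper's algorithm.

```python
import numpy as np

def spectral_rounding(Z, K, n_iter=20):
    """Round an overall membership matrix Z to cluster labels:
    embed the n points via the top-K eigenvectors of Z, then run
    a minimal Lloyd iteration with farthest-point initialization."""
    _, vecs = np.linalg.eigh(Z)          # eigenvalues in ascending order
    U = vecs[:, -K:]                     # rows embed the n points in R^K
    seeds = [0]                          # deterministic farthest-point seeding
    for _ in range(K - 1):
        d = ((U[:, None, :] - U[seeds][None, :, :]) ** 2).sum(-1).min(1)
        seeds.append(int(d.argmax()))
    centers = U[seeds]
    for _ in range(n_iter):              # Lloyd iterations on the embedding
        labels = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        centers = np.array([U[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    return labels
```

When $Z$ equals the true block-diagonal membership matrix $Z^*$, the rows of the eigenvector embedding are constant within each cluster, so the rounding recovers the partition exactly (up to label permutation).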

3.3. CONNECTIONS BETWEEN ILA-SDP AND EM ALGORITHMS

It is interesting to observe that our proposed iLA-SDP algorithm is closely connected to the classic EM algorithm, which approximates the maximum likelihood estimator (MLE) of the observed data in statistical models with latent variables (Dempster et al., 1977). The key idea of the EM algorithm in the model-based clustering context is data augmentation, where the latent variables represent the cluster labels. More specifically, with each data point $X_i \in \mathbb{R}^p$ we associate an unobserved one-hot encoded cluster label $Y_i := (Y_{i1}, \ldots, Y_{iK}) \in \{0,1\}^K$. Then the EM algorithm iteratively maximizes the expected log-likelihood of the complete data $(X_i, Y_i)_{i=1}^n$:
$$\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)}) := \mathbb{E}_{Y \sim q(\cdot \mid X, \theta^{(t)})} \big[\ell_c(\theta \mid X, Y)\big], \quad (14)$$
where $\theta = (\pi_k, \mu_k, \Sigma_k)_{k=1}^K$ contains the parameters of the GMM, $(\pi_k)_{k=1}^K$ are the mixing weights such that $\pi_k \geqslant 0$ and $\sum_{k=1}^K \pi_k = 1$, and the complete log-likelihood function is
$$\ell_c(\theta \mid X, Y) := \log p(X, Y \mid \theta) = -\frac{1}{2} \sum_{i=1}^n \sum_{k=1}^K Y_{ik} \Big[\log(2\pi|\Sigma_k|) + (X_i - \mu_k)^\top \Sigma_k^{-1} (X_i - \mu_k)\Big].$$
Alternatively, the EM algorithm (14) can be interpreted as minorize-maximization (MM) that maximizes a best lower bound for the log-likelihood of the observed data:
$$\ell(\theta \mid X) := \log p(X \mid \theta) = \log \sum_Y p(X, Y \mid \theta) \geqslant \sum_Y q(Y \mid X) \log \frac{p(X, Y \mid \theta)}{q(Y \mid X)} =: L(q, \theta)$$
for any posterior distribution $q(Y \mid X)$.

Figure 3: Comparison of mis-clustering errors against the separation and the perturbation percentage $\alpha$ of the initialization. mEM (SDP) refers to the reduced version of EM (LA-SDP) where the covariance matrices are fixed to the identity. The first plot compares the performance of mEM and SDP with random initialization when the separation is large; the second plot compares all methods as we enlarge the perturbation percentage $\alpha$ applied to the random initialization from hierarchical clustering (HC).
Under this perspective, the EM algorithm can be expressed as an alternating maximization algorithm on $L(q, \theta)$, alternating between the E-step $q^{(t+1)} = \arg\max_q L(q, \theta^{(t)})$ and the M-step $\theta^{(t+1)} = \arg\max_\theta L(q^{(t+1)}, \theta)$. Thus, given any $q(Y \mid X)$, the M-step maximizes the expected complete log-likelihood as a surrogate function that minorizes $\ell(\theta \mid X)$, because $L(q, \theta) = \mathbb{E}_{Y \sim q(\cdot \mid X)}[\ell_c(\theta \mid X, Y)] + H(q(Y \mid X))$, where $H(q)$ denotes the entropy of the distribution $q$; while given the current parameter estimate $\theta^{(t)}$, the E-step is maximized at $q^{(t+1)}(Y \mid X) = p(Y \mid X, \theta^{(t)})$ because
$$\ell(\theta^{(t)} \mid X) \geqslant L\big(p(Y \mid X, \theta^{(t)}), \theta^{(t)}\big) = \sum_Y p(Y \mid X, \theta^{(t)}) \log p(X \mid \theta^{(t)}) = \ell(\theta^{(t)} \mid X),$$
i.e., the inequality holds with equality at $q = p(Y \mid X, \theta^{(t)})$. Even though EM and iLA-SDP are both alternating maximization algorithms aiming to solve the MLE for the observed data log-likelihood, and both can be viewed as soft clustering methods (cf. Proposition 4), there are several important differences we would like to highlight. First, cluster labels are (random) latent variables estimated via posterior probabilities in the EM algorithm, while in iLA-SDP the labels are treated as unknown parameters estimated via direct maximization of the observed data likelihood. Second, the EM algorithm is a special case of the minorization-maximization (MM) algorithm (Hunter & Lange, 2000), iteratively performing coordinate ascent on the expected complete data log-likelihood as a minorizing surrogate function, while iLA-SDP is exact in the sense that it directly optimizes the observed data log-likelihood via a convex relaxation formulation. Thus iLA-SDP is a more direct approach than EM for tackling the non-convex observed log-likelihood objective, and it provably recovers the true clustering structure when the clusters are well separated above the SNR lower bound in Theorem 2.
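For comparison with iLA-SDP's update, one EM iteration under the GMM above can be sketched as follows. This is a standard textbook implementation with our own function name (the mixing weights $\pi_k$ enter through the E-step responsibilities).

```python
import numpy as np

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a GMM: E-step responsibilities, then M-step updates.
    X is (n, p); pis, mus, Sigmas are the current parameter estimates."""
    n, p = X.shape
    K = len(pis)
    # E-step: posterior responsibilities tau_ik = p(Y_ik = 1 | X_i, theta)
    log_r = np.zeros((n, K))
    for k in range(K):
        Sinv = np.linalg.inv(Sigmas[k])
        d = X - mus[k]
        quad = np.einsum('ip,pq,iq->i', d, Sinv, d)
        log_r[:, k] = (np.log(pis[k])
                       - 0.5 * np.log(np.linalg.det(Sigmas[k]))
                       - 0.5 * quad)          # shared constants cancel below
    tau = np.exp(log_r - log_r.max(1, keepdims=True))
    tau /= tau.sum(1, keepdims=True)
    # M-step: weighted averages with effective sizes m_k = sum_i tau_ik
    m = tau.sum(0)
    new_pis = m / n
    new_mus = [(tau[:, k:k + 1] * X).sum(0) / m[k] for k in range(K)]
    new_Sigmas = []
    for k in range(K):
        d = X - new_mus[k]
        new_Sigmas.append((tau[:, k] * d.T) @ d / m[k])
    return new_pis, new_mus, new_Sigmas, tau
```

Contrasting this with Algorithm 1: here both the centers and the covariances are re-estimated from the responsibilities, whereas iLA-SDP never forms center estimates at all.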
As with the EM algorithm, iLA-SDP monotonically increases the observed data log-likelihood over iterations; cf. Figure 7 in the Appendix. Third, the EM algorithm must estimate the cluster center parameters $(\mu_k)_{k=1}^K$ in each iteration, while iLA-SDP profiles out the effect of centroid estimation and leverages only pairwise Mahalanobis distances between data points to accommodate the heterogeneity of cluster shapes. Partly because the error in estimating the centroids propagates to the other parameters, EM is more sensitive to initialization with inaccurate labels and to the centroid configuration even in the standard GMM (Jin et al., 2016), and iLA-SDP behaves better than EM, an observation we verify empirically in our simulation experiments; cf. Figure 3 for a comparison between the iLA-SDP and EM algorithms. From the first plot we observe that LA-SDP with known isotropic covariance matrices, which reduces to the K-means SDP in (5), performs stably and achieves exact recovery when the separation is large. However, EM fails with random initialization in this adversarial centroid configuration. Moreover, from the second plot we see that LA-SDP is fairly stable under perturbation of the initialization when the separation is large, while EM degrades as the perturbation percentage $\alpha$ of the initialization approaches 1, i.e., when all labels are selected randomly. In other words, EM is more sensitive to initialization, and iLA-SDP is more stable when the signal is strong. More details on the settings of Figure 3 can be found in Appendix A.3.

4. REAL-DATA APPLICATIONS

In this section, we test the performance of iLA-SDP against several widely used clustering methods on two real datasets from the UCI machine learning repository.

Banknote authentication dataset. We first examine the performance of our method on a banknote authentication dataset where the separation between clusters is not large. The images were taken from genuine and forged banknote-like specimens, and the features were extracted by a wavelet transform tool. The dataset contains 1372 samples with $p = 4$ attributes and $K = 2$ clusters. In each of 200 replicates, we choose $n = 1000$ samples in total, randomly and equally from the two clusters, to make the cluster sizes balanced. HC is used as the initialization for EM, KM and iLA-SDP. If the initialization of the assignments $G_1^{(0)}, G_2^{(0)}$ is highly unbalanced, i.e., $|G_1^{(0)}|/|G_2^{(0)}| > 4$ (assuming $|G_1^{(0)}| > |G_2^{(0)}|$), then the covariance matrices of the two clusters likely differ significantly and we estimate them by solving the unconstrained optimization problem; otherwise we estimate the covariance matrices through the graphical lasso with parameter $\lambda = 2$, since the similarity of the clusters suggests that we can reduce the number of parameters to estimate. We then run Algorithm 1 to obtain the results for iLA-SDP. The comparison of the four methods can be found in the left plot of Figure 4, where we observe that the separation between the two clusters is weak in the sense that the medians of all methods are similar. Nevertheless, iLA-SDP achieves better performance than the other methods most of the time. The reason iLA-SDP has a lower mean is that it can sometimes achieve nearly exact recovery: in 36 out of 200 replicates the mis-clustering error of iLA-SDP is below 0.05, while EM only reaches around 0.25. This indicates that iLA-SDP has more chances to separate the two clusters with fairly good accuracy even when the separation is weak.

Landsat satellite dataset.
This database was generated from Landsat Multi-Spectral Scanner image data. The test set includes 2000 satellite images with 36 attributes (36 = 4 spectral bands × 9 pixels in a neighbourhood) and 6 clusters. Each attribute is an integer from 0 to 255 indicating the color of a certain pixel. We ran the 4 methods on the transformed dataset over 300 replicates. For each replicate, we draw n = 1200 samples randomly and equally from the clusters so that the cluster sizes are balanced. For each attribute, we scale its range to [0, 1] and then apply f(x) = log(1/x − 1) entry-wise to map the range to the real line. We then run Algorithm 3 on the transformed dataset to obtain the results for iLA-SDP with ϵ = 10^{-2}, p_0 = K = 6, and R = 50. From the results we see that iLA-SDP performs the best among the four methods. In particular, iLA-SDP outperforms EM because the initialization (HC) is rough, which biases both the group-mean and covariance estimates for EM, whereas iLA-SDP only uses the group covariance matrices in its initialization.
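For reference, the per-attribute transform described above can be sketched as follows. This is a NumPy sketch; clipping the endpoints with a small `eps` is our assumption, since f(x) = log(1/x − 1) diverges at 0 and 1 and the paper does not state how the endpoints are handled:

```python
import numpy as np

def transform_attributes(X, eps=1e-3):
    """Scale each attribute to [0, 1], then apply f(x) = log(1/x - 1) entry-wise.

    eps clips away the endpoints 0 and 1, where f diverges (our assumption;
    the paper does not specify the endpoint handling)."""
    lo = X.min(axis=0, keepdims=True)
    hi = X.max(axis=0, keepdims=True)
    U = (X - lo) / (hi - lo)             # range [0, 1] per attribute
    U = np.clip(U, eps, 1.0 - eps)
    return np.log(1.0 / U - 1.0)

# Toy data: one pixel-valued attribute in [0, 255] and one in [10, 30].
X = np.array([[0.0, 10.0], [128.0, 20.0], [255.0, 30.0]])
Z = transform_attributes(X)
```

Note that f is strictly decreasing on (0, 1) and maps the midpoint 1/2 to 0, so the transform recenters each attribute around its mid-range value.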

A APPENDIX A.1 FURTHER RESPONSE TO REVIEWERS

Sample complexity bound. To verify that the sample complexity bound for LA-SDP in Theorem 2 (O(log(n))) is tight, we vary n and adjust the squared distance between clusters by multiplying by log(n); more precisely, we let d = λ log(n), λ > 0. The diagonals of the covariance matrices are placed on a simplex of R^p, not identical to the corresponding centers, i.e., µ_k = λ · e_k, Σ_k = L · diag(e_{k+1}), ∀k ∈ [K], where e_{K+1} = e_1. This guarantees the symmetry of the construction. We set L = 10, p = 4, K = 4. Each time we draw n = 120/240/480 data points from the GMM. The results for the second plot in Figure 5 are obtained over 20 replicates, where we observe the same pattern across the different settings of n. This suggests that the order log(n) of the separation bound in Theorem 2 is tight. Computational complexity for the banknote authentication dataset. From the time costs for clustering the banknote authentication dataset in Table 1, we observe that the time cost of iLA-SDP is relatively high; to reduce it, we could consider sub-sampling, e.g., the subsampling idea of Zhuang et al. (2022b). We leave this as future work. In this section, we propose two variations of iLA-SDP that handle high-dimensional and large-size data with better computational and statistical efficiency. High-dimensional data. If the number of attributes is large, it is hard to approximate the true covariance matrices, since there are O(p^2) unknown parameters. Thus, we propose two dimension reduction procedures based on hierarchical clustering, Fisher's LDA and the F-test; the detailed algorithms are given in Algorithm 2 and Algorithm 3. The two procedures are as follows. 1.
If the number of clusters K is small and the differences between centers are sparse, we use HC as a benchmark method for feature selection and treat the group means implied by HC as ground truth. Specifically, for the i-th attribute, we compute the F-statistic and its p-value under the null hypothesis H_0 that all group means w.r.t. the i-th attribute are equal. An attribute is then selected if its p-value P_i is significantly small among the p-values of all attributes. 2. First we use hierarchical clustering to obtain clustering results for every candidate cluster number K̃ ∈ [p]. Assuming all clusters share an identical covariance matrix, we use the assignments from HC to estimate the within-cluster covariance W̃ (with group means μ̃_l) and compute the signal-to-noise ratio ∆(K̃) := min_{k≠l} ∥W̃^{−1/2}(μ̃_k − μ̃_l)∥. Here, HC serves as a benchmark method for initial data processing. We then choose K̃ within the target range that maximizes the signal-to-noise ratio ∆(K̃). Running Fisher's LDA on the HC assignments with cluster number K̃ then yields a new dataset of dimension q = K̃ − 1. Finally, we perform Algorithm 1 on the new dataset and extract the cluster labels. Large-size data. The time complexity for solving an SDP can be as high as O(n^{3.5}). We can use subsampling methods to bring down the time cost while maintaining the superior behavior of LA-SDP (Zhuang et al., 2022b); the proposed algorithm is shown in Algorithm 4. Algorithm 2: Likelihood adjusted SDP based iterative algorithm with unknown covariance matrices Σ_1, . . . , Σ_K for large p. Input: Data matrix X ∈ R^{p×n} containing n points. Cluster number K. Stopping criterion parameters p_0, ϵ and S. α ∈ [0, 1], C > 0. Run hierarchical clustering with data X and cluster number K and extract the cluster labels G
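Procedure 1 above (F-test screening of attributes against a benchmark HC assignment) can be sketched as follows, assuming SciPy's one-way ANOVA. The helper `select_attributes` and its exact tie-breaking are illustrative, not the paper's code:

```python
import numpy as np
from scipy.stats import f_oneway

def select_attributes(X, labels, p0, C=1e10, alpha=0.7, rng=None):
    """Keep the p0 attributes with smallest one-way-ANOVA p-values.

    labels come from a benchmark clustering (HC in the paper); if there is no
    clear cutoff among p-values (max/min < C), each remaining attribute is
    kept with probability alpha, mimicking Algorithm 2."""
    rng = np.random.default_rng(rng)
    groups = [X[labels == k] for k in np.unique(labels)]
    pvals = np.array([f_oneway(*[g[:, j] for g in groups]).pvalue
                      for j in range(X.shape[1])])
    order = np.argsort(pvals)
    keep = list(order[:p0])
    if pvals.min() > 0 and pvals.max() / pvals.min() < C:   # no clear cutoff
        for j in order[p0:]:
            if rng.random() < alpha:
                keep.append(j)
    return sorted(keep), pvals

# Toy example: only attribute 0 separates the two groups.
rng = np.random.default_rng(0)
n = 100
labels = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 3))
X[:, 0] += 5.0 * labels
keep, pvals = select_attributes(X, labels, p0=1)
```

With a 5σ mean shift on attribute 0, its p-value is many orders of magnitude below the others, so the cutoff is clear and only that attribute survives.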

A.3 EXPERIMENT RESULTS

In this section, we provide more details of the settings and report the results of the simulation experiments. For all the dimension reduction procedures used in the simulation experiments, we perform steps 1-7 of Algorithm 2 followed by Algorithm 3 with input parameters α = 0.7, C = 10^{10}, p_0 = 2K, p_1 = 15, ϵ = 10^{-2}, S = 50. The initialization we use is hierarchical clustering from the mclust package in R. We test our algorithm on Gaussian mixture models and real datasets, comparing iLA-SDP (HC as initialization) with HC, the EM algorithm (HC as initialization), K-means (HC as initialization) and the original SDP. Improvements of iLA-SDP over SDP. Recall that in Theorem 2 we define the signal-to-noise ratio as ∆^2 := min_{k≠l} ∥Σ_k^{−1/2}(µ_k − µ_l)∥^2. To verify the validity of this definition and to compare iLA-SDP and SDP, we vary the condition number of the covariance matrices Σ_1, . . . , Σ_K. Here we choose n = 200, p = 4, K = 4. Recalling M := max_{k≠l} ∥Σ_l^{1/2} Σ_k^{−1} Σ_l^{1/2}∥_op, we choose all covariance matrices to be equal so that M is fixed. The covariance matrices are set to the identity matrix except that the first diagonal entry is set to L + 1, so that the condition number of the matrices is L + 1. We consider two cases, L = 10 and L = 100. Denote by e_k ∈ R^p the vector with k-th entry equal to 1 and 0 otherwise. Failure of EM vs SDP. Recall the failure of EM under random initialization (Jin et al., 2016) in the special case where the covariance matrices equal the identity matrix and the weights are equal; both the covariance matrices and the weights are known. In this case, the EM algorithm reduces to a version in which the weights and the means are updated interactively, while iLA-SDP reduces to SDP. Random initialization means that we pick any data point uniformly at random as the initialization of the centers.
Following the same setting as the construction of the pitfall, we choose a one-dimensional GMM with three clusters such that the distance between two of the centers is much smaller than the others. More precisely, we let n = 300, K = 3, p = 1, µ_1 = γ, µ_2 = −γ, µ_3 = 10γ. The results are shown in the first plot of Figure 3 over 300 replicates, where we denote the reduced version of EM by mEM. From the figure we observe that the reduced version of iLA-SDP, which is SDP, performs stably and achieves exact recovery when the separation is large, whereas EM fails under random initialization. Perturbation of initialization assignments. To see how the performance of EM and iLA-SDP changes when the initialization is perturbed, we use HC as the initialization and perturb a proportion α ∈ [0, 1] of the initialization labels. The diagonals of the covariance matrices are placed on a simplex of R^p, not identical to the corresponding centers, i.e., µ_k = λ · e_k, Σ_k = L · diag(e_{k+1}), ∀k ∈ [K], where e_{K+1} = e_1. This guarantees the symmetry of the construction. We set L = 10, p = 4, K = 4 and the distance between centers d = 8. Each time we draw n = 200 data points from the GMM and run HC as the initialization. Then we randomly reassign a proportion α of the HC labels to clusters chosen uniformly. The results for the second plot in Figure 3 are obtained over 300 replicates, where we observe that iLA-SDP is fairly stable under perturbation of the initialization when the separation is large, while EM degrades as α approaches 1, i.e., as all labels become random. In other words, EM is more sensitive to initialization and iLA-SDP is more stable when the signal is strong. Empirical evidence for the monotone increase of the iLA-SDP objective function. Here we provide examples based on the previous experiment settings, where we set the distance between centers d = 1/3/5/10,
and examine how the log-likelihood of the given data changes as the iterations proceed. From Figure 7 in the Appendix we can see that, empirically, our algorithm increases the log-likelihood of the given data over iterations. Moreover, by our construction we can show theoretically that the log-likelihood increases after each step of iLA-SDP.
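The quantity tracked over iterations here is the observed-data log-likelihood of the GMM; a minimal sketch of its computation (illustrative helper, assuming SciPy):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(X, weights, mus, Sigmas):
    """Observed-data log-likelihood sum_i log sum_k w_k N(X_i; mu_k, Sigma_k);
    the quantity plotted over iLA-SDP iterations in Figure 7."""
    dens = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=S)
        for w, m, S in zip(weights, mus, Sigmas)
    ])
    return float(np.sum(np.log(dens.sum(axis=1))))

# Two points sitting exactly on two well-separated component means.
X = np.array([[0.0, 0.0], [5.0, 5.0]])
ll = gmm_loglik(X, [0.5, 0.5],
                [np.zeros(2), 5.0 * np.ones(2)],
                [np.eye(2), np.eye(2)])
```

Recomputing this value after each covariance/assignment update gives the monotone curves referenced above.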

A.4 PROOF OF THE THEOREMS AND PROPOSITIONS

In this section, we provide the proofs of Proposition 1 and Proposition 4, and a sketch of the proof of Theorem 2. The proof of the main theorem follows the approach used to establish exact recovery for the original SDP (Chen & Yang, 2021b), and we highlight the main differences below. First, we provide explicit expressions for the constants E_1 and E_2 appearing in Theorem 2:

E_1 = 4(1 + 2δ) M^{5/2} (1 − β)^{−2} η^{−2} (M + M^2 + (1 − β)^2 (1 + δ) p/(m log n)) + C_4 R_n, with R_n = (1 − β)^2 (1 + δ) log n (√(p log n)/n + log n/n), and

E_2 = C_5 (M − 1)^3 M^2 (1 − β)^{−1} (1 − η)^{−1} (p/log n + 1) + C_6 K^2 (1 − β) β^{−1} · min{1/(β(M − …

Next, observe that

A_k ≡ −(1/2)(diag(X^⊤X) 1_n^⊤ + 1_n diag(X^⊤X)^⊤) + X^⊤X, ∀k ∈ [K].

This implies that (8) can be written as

(Ẑ_1, . . . , Ẑ_K) = argmax_{Z_1,...,Z_K ∈ R^{n×n}} ⟨X^⊤X, Σ_{k=1}^K Z_k⟩ subject to Z_k ⪰ 0, tr(Σ_{k=1}^K Z_k) = K, (Σ_{k=1}^K Z_k) 1_n = 1_n, Z_k ⩾ 0, ∀k ∈ [K], (16)

since ⟨diag(X^⊤X) 1_n^⊤, Σ_{k=1}^K Z_k⟩ = tr(X^⊤X) is a constant in the optimization problem (16). Now suppose Ẑ is a solution to (5) that achieves maximum M_1 and Ẑ_k, k = 1, . . . , K, is a solution to (16) that achieves maximum M_2; the relation between M_1 and M_2 is established in the proof of Proposition 1 below.

Theorem 2 (Exact recovery for LA-SDP). Suppose there exist constants δ > 0 and β ∈ (0, 1) such that

log n ≥ max{(1 − β)^2/β^2, (1 − β)(1 − η) K^2 β^{−2} max{(M − 1)^2, 1}} · C_1 n/m, δ ≤ β^2 (1 − β)^2/(C_2 M^{1/2} K), m ≥ 4(1 + δ)^2/δ^2.

If ∆^2 ≥ (E_1 + E_2) log n and min_{k≠l} D^{(k,l)} ≥ C_3 (1 + log n/p + p/n), with E_1 and E_2 as given above, then LA-SDP achieves exact recovery, i.e., Ẑ = Z^*, with probability at least 1 − C_7 K^3 n^{−δ} for some universal constants C_1, . . . , C_7.

Sketch of the proof. Recall that we let G^*_1, . . . , G^*_K be the true partition of the index set [n] := {1, . . . , n} such that if i ∈ G^*_k, then X_i = µ_k + ϵ_i, where µ_k ∈ R^p is the true center of the k-th cluster G^*_k (G_k for simplicity) and the ϵ_i are i.i.d. Gaussian noise N(0, Σ_k). First, we write down the dual problem:

min_{λ ∈ R, α ∈ R^n, B_k ∈ R^{n×n}} λK + α^⊤ 1_n, subject to B_k ≥ 0, λ Id_n + (1/2)(α 1_n^⊤ + 1_n α^⊤) − A_k − B_k ⪰ 0, ∀k ∈ [K].

Denote Z^*_k := (1/|G_k|) 1_{G_k} 1_{G_k}^⊤, ∀k ∈ [K]. Then it can be shown that sufficient conditions for the solution of the SDP to be Z_k = Z^*_k, ∀k ∈ [K], are

B_k ≥ 0; (C1)
W_k := λ Id_n + (1/2)(α 1_n^⊤ + 1_n α^⊤) − A_k − B_k ⪰ 0; (C2)
tr(W_k Z^*_k) = 0; (C3)
tr(B_k Z^*_k) = 0. (C4)

It can be verified that these hold if we can find symmetric B_k such that B_{k, G_k G_k} = 0;
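The ground-truth matrices Z^*_k = 1_{G_k} 1_{G_k}^⊤/|G_k| above are feasible for the constraints of the relaxed problem (each PSD and entrywise nonnegative, with tr(Σ_k Z^*_k) = K and row sums of the total equal to one); a quick numerical check (illustrative NumPy sketch):

```python
import numpy as np

def membership_matrices(labels, K):
    """Ground-truth SDP variables Z*_k = 1_{G_k} 1_{G_k}^T / |G_k|."""
    labels = np.asarray(labels)
    Zs = []
    for k in range(K):
        ind = (labels == k).astype(float)
        Zs.append(np.outer(ind, ind) / ind.sum())
    return Zs

labels = [0, 0, 1, 1, 1]
Zs = membership_matrices(labels, K=2)
Zsum = sum(Zs)
# Each Z*_k is a scaled rank-1 outer product, hence PSD; tr(Zsum) = K and
# every row of Zsum sums to 1, matching the SDP constraints.
```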



Figure 2: Mis-clustering error (with shaded error bars) vs centroid separation ∆ under different condition numbers of the cluster covariance matrices Σ_1 = Σ_2 = · · · = Σ_K (M = 1). The left (right) plot corresponds to a moderate (large) condition number of the common covariance matrix. Here, KM refers to the K-means method and SDP refers to the original SDP (5).
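The two quantities underlying these comparisons, ∆^2 = min_{k≠l} ∥Σ_k^{−1/2}(µ_k − µ_l)∥^2 and M = max_{k≠l} ∥Σ_l^{1/2} Σ_k^{−1} Σ_l^{1/2}∥_op from Theorem 2, can be computed directly; a small NumPy/SciPy sketch (`snr_and_M` is an illustrative helper):

```python
import numpy as np
from scipy.linalg import sqrtm

def snr_and_M(mus, Sigmas):
    """Delta^2 = min_{k!=l} ||Sigma_k^{-1/2}(mu_k - mu_l)||^2 and
    M = max_{k!=l} ||Sigma_l^{1/2} Sigma_k^{-1} Sigma_l^{1/2}||_op."""
    K = len(mus)
    d2, M = np.inf, -np.inf
    for k in range(K):
        Sk_inv = np.linalg.inv(Sigmas[k])
        Sk_inv_half = np.real(sqrtm(Sk_inv))
        for l in range(K):
            if l == k:
                continue
            diff = Sk_inv_half @ (mus[k] - mus[l])
            d2 = min(d2, float(diff @ diff))
            Sl_half = np.real(sqrtm(Sigmas[l]))
            M = max(M, float(np.linalg.norm(Sl_half @ Sk_inv @ Sl_half, 2)))
    return d2, M

# Two components with the same mean gap but different scales.
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), 4.0 * np.eye(2)]
d2, M = snr_and_M(mus, Sigmas)
```

Here the wider component shrinks the effective separation (∆^2 = 2.25 rather than 9), which is exactly the heterogeneity the Mahalanobis-based definition captures.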

Figure 3: Mis-clustering error (with shaded error bars) vs γ (which captures the signal strength of the GMM) and α (perturbation percentage of the initialization). mEM (SDP) refers to the reduced version of EM (LA-SDP) in which the covariance matrices are fixed and equal to the identity. The first plot compares the performance of mEM and SDP under random initialization when the separation is large; the second plot compares all methods as the perturbation percentage α applied to the initialization from hierarchical clustering (HC) increases.

Figure 4: Box plots of the difference in mis-clustering error (with means) between each method and iLA-SDP. The left (right) plot summarizes the results for the banknote authentication dataset (landsat satellite dataset). Here SC refers to the spectral clustering method.
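The mis-clustering error reported in these figures is only identified up to a relabeling of the clusters; a minimal sketch that minimizes over label permutations (brute force, fine for small K; for large K one would instead use a Hungarian-algorithm matching):

```python
import numpy as np
from itertools import permutations

def misclustering_error(labels_true, labels_est, K):
    """Fraction of misclustered points, minimized over cluster relabelings."""
    labels_true = np.asarray(labels_true)
    labels_est = np.asarray(labels_est)
    n = labels_true.size
    best = n
    for perm in permutations(range(K)):
        mapped = np.array([perm[l] for l in labels_est])
        best = min(best, int(np.sum(mapped != labels_true)))
    return best / n

# Two clusters whose names are swapped: the error should be 0.
err = misclustering_error([0, 0, 1, 1], [1, 1, 0, 0], K=2)
```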

Figure 5: Mis-clustering error (with shaded error bars for the left plot) vs λ for iLA-SDP for different n.
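The symmetric GMM construction behind this experiment (µ_k = λ · e_k, Σ_k = L · diag(e_{k+1}), e_{K+1} = e_1) can be sketched as follows. Note we add the identity to the stated Σ_k so that the covariance is nonsingular; this is our assumption, matching the "identity except one inflated diagonal entry" description elsewhere in A.3:

```python
import numpy as np

def simplex_gmm_sample(n, p, K, lam, L, rng=None):
    """Draw n points from the symmetric simulation GMM:
    mu_k = lam * e_k and (assumed here, so Sigma_k is nonsingular)
    Sigma_k = Id_p + L * diag(e_{k+1}), with e_{K+1} = e_1."""
    rng = np.random.default_rng(rng)
    labels = rng.integers(0, K, size=n)
    X = np.empty((n, p))
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        mu = lam * np.eye(p)[k]
        Sigma = np.eye(p) + L * np.diag(np.eye(p)[(k + 1) % K])
        X[idx] = rng.multivariate_normal(mu, Sigma, size=idx.size)
    return X, labels

# Setting from the sample-complexity experiment: p = K = 4, L = 10.
X, labels = simplex_gmm_sample(n=240, p=4, K=4, lam=3.0, L=10.0, rng=0)
```

The cyclic shift (k + 1) mod K makes every component's inflated variance direction differ from its mean direction, preserving the symmetry of the construction.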

as prior assignments for [n]. Suppose the assignments have centers µ^{(0)}_k, k ∈ [K]. For i = 1, . . . , p: calculate the p-value P_i of the F-test F_i under H_0: the group means µ^{(0)}_{k,i}, k ∈ [K], are all equal. Keep the p_0 attributes with the p_0 smallest p-values P_i. If there is no clear cutoff between the P_i's, i.e., max_{i∈[p]} P_i / min_{i∈[p]} P_i < C, then further keep each of the remaining p − p_0 attributes with probability α > 0. Get the dimension-reduced data X̃. Run Algorithm 1 on X̃ with the initialization obtained from K clusters of HC and stopping criterion parameters ϵ and S, then extract the cluster labels Ĝ_1, . . . , Ĝ_K as a partition estimate for [n]. Output: A partition estimate Ĝ_1, . . . , Ĝ_K for [n].

Figure 6: Mis-clustering error (with shaded error bars for the left plot) vs center distance D for iLA-SDP before and after dimension reduction. pLA-SDP denotes the iLA-SDP after dimension reduction.

Proposition 1 (SDP relaxation for K-means is a special case of LA-SDP). Suppose Σ_k = σ^2 Id_p for all k ∈ [K]. Let Ẑ be the solution to (5) that achieves maximum M_1 and Ẑ_k, k = 1, . . . , K, be the solution to (16) that achieves maximum M_2. Then M_1 = M_2, and Ẑ = Σ_{k=1}^K Ẑ_k if Ẑ is unique in (5). Proof of Proposition 1. If Σ_k = σ^2 Id_p, ∀k ∈ [K], then from (7) we have

where Z̃_1 := Ẑ, Z̃_2 = · · · = Z̃_K = 0. In other words, M_1 = M_2, which finishes the proof. If Ẑ is unique in (5), then Ẑ = Σ_{k=1}^K Ẑ_k since both achieve the maximum in (5). ■

A.4.2 PROOF OF PROPOSITION 4

Proposition 4 (iLA-SDP is a soft clustering method). If rank(Z_k) = 1, then there exist weights (w_{k,1}, . . . , w_{k,n}) such that Σ̂_k in Lemma 3 can be written as

Σ̂_k := (1/n_k) Σ_{i=1}^n w_{k,i} (X_i − μ̂_k)(X_i − μ̂_k)^⊤ with μ̂_k := (1/n_k) Σ_{i=1}^n w_{k,i} X_i, (17)

where n_k = Σ_{i=1}^n w_{k,i}.

Proof of Proposition 4. If Z_k has rank 1, then there exists a ∈ R^n such that Z_k = a a^⊤. Let w_k := (a^⊤ 1_n) · a; then we have Z_{k,ij} = w_{k,i} w_{k,j} / Σ_{i=1}^n w_{k,i}. Finally, plugging this expression of Z_{k,ij} in terms of the w_{k,i} into the covariance update yields the target expression for Σ̂_k. ■

A.4.3 SKETCH PROOF OF THEOREM 2

Table 1: Time cost (SD) for clustering the banknote authentication dataset over 20 replicates.


Algorithm 3: Likelihood adjusted SDP based iterative algorithm with unknown covariance matrices Σ_1, . . . , Σ_K for large p. Input: Data matrix X ∈ R^{p×n} containing n points. Cluster number K. Stopping criterion parameters p_1, ϵ and S. Select a benchmark clustering method (HC) to provide prior assignments. For K̃ = K, K + 1, . . . , p_1 − 1, p_1: run hierarchical clustering with data X and cluster number K̃, extract the cluster labels as prior assignments for [n], and compute the group means µ̃_k; calculate the within-cluster covariance matrix W̃, then compute the signal-to-noise ratio ∆(K̃) := min_{l≠k} ∥W̃^{−1/2}(µ̃_k − µ̃_l)∥. Choose K^* such that ∆(K^*) is maximized. Perform Fisher's LDA with data X and the assignments from HC with K^* clusters to obtain the dimension-reduced data X̃. Run Algorithm 1 on X̃ with the initialization obtained from K clusters of HC and stopping criterion parameters ϵ and S, then extract the cluster labels Ĝ_1, . . . , Ĝ_K as a partition estimate for [n].

Algorithm 4: Sketch and lift: Likelihood adjusted SDP based iterative algorithm with unknown covariance matrices Σ_1, . . . , Σ_K for large n. Input: Data matrix X ∈ R^{p×n} containing n points. Cluster number K. Stopping criterion parameters P, ϵ and S. Sampling weights (w_1, . . . , w_n) giving the subsampling factor. (Sketch) Independently sample an index subset T ⊂ [n] via Ber(w_i) and store the subsampled data matrix V = (X_i)_{i∈T}. Run subroutine Algorithm 1 with input V to get a partition estimate R̃_1, . . . , R̃_K for T. (Lift) Compute the centroids X̄_k = |R̃_k|^{−1} Σ_{j∈R̃_k} X_j and the within-group sample covariance matrices, and assign each remaining point i to the closest cluster k; randomly assign i to one of the K clusters if no such k exists. Output: A partition estimate Ĝ_1, . . . , ĜK for [n].

Denote by e_k ∈ R^p the vector with k-th entry equal to 1 and 0 otherwise. The centers of the clusters µ_1, . . . , µ_K are placed on the vertices of a regular simplex; this ensures that ∆ = λ for any L. From Figure 2 we can observe that the signal-to-noise ratio we defined is reasonable.
On the other hand, the performance of SDP worsens as the condition number of the group covariance matrices grows, since the assumption of isotropic group covariance matrices underlying SDP is violated; the same reasoning applies to K-means.
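The sketch-and-lift scheme of Algorithm 4 above can be sketched in code. The clustering subroutine is passed in as a placeholder for Algorithm 1, and the small ridge added to the covariances is our regularization choice:

```python
import numpy as np

def sketch_and_lift(X, K, cluster_fn, w=0.2, rng=None):
    """Sketch-and-lift (Algorithm 4 sketch): cluster a Bernoulli(w) subsample
    with cluster_fn (a stand-in for Algorithm 1), then lift each point to the
    cluster minimizing its Mahalanobis distance to the subsample centroids."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    T = np.flatnonzero(rng.random(n) < w)        # sketch: Ber(w) subsample
    sub_labels = cluster_fn(X[T], K)
    centers, inv_covs = [], []
    for k in range(K):
        V = X[T][sub_labels == k]
        centers.append(V.mean(axis=0))
        S = np.cov(V.T) + 1e-6 * np.eye(X.shape[1])  # ridge for stability (our choice)
        inv_covs.append(np.linalg.inv(S))
    labels = np.empty(n, dtype=int)
    for i in range(n):                           # lift: Mahalanobis assignment
        d = [(X[i] - c) @ Si @ (X[i] - c) for c, Si in zip(centers, inv_covs)]
        labels[i] = int(np.argmin(d))
    return labels

# Toy usage: two well-separated clusters; a simple threshold stands in for Algorithm 1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(10, 1, size=(100, 2))])
labels = sketch_and_lift(X, K=2, w=0.3, rng=1,
                         cluster_fn=lambda V, K: (V[:, 0] > 5).astype(int))
```

Only the subsample of expected size wn enters the expensive SDP step, which is where the O(n^{3.5}) cost is saved; the lift step is linear in n.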

Impact of dimension reduction.

Here we examine the performance of iLA-SDP after dimension reduction. The covariance matrices of the GMM are drawn independently following Σ_k = U_k Λ_k U_k^⊤, where U_k is a random orthogonal matrix and Λ_k is a diagonal matrix with entries drawn from Z̃ = 1 + βZ · 1(Z > 0), where Z is a standard Gaussian and β > 0 controls the condition number of Σ_k. Here we choose n = 200, p = 20, K = 4, β = 5. The covariance matrices are fixed once chosen, and we perform Algorithm 1 directly on the dataset to obtain the results of iLA-SDP for each replicate. For dimension reduction, we follow the procedure introduced in Algorithm 2 and Algorithm 3 in Appendix A.2 to get the transformed dataset X̃ with lower dimension. The results of pLA-SDP are then obtained by running Algorithm 1 with HC as initialization on X̃. The results in Figure 6 show that after our dimension reduction procedure, the performance of iLA-SDP becomes significantly better when the separation is large. This is because in our setting the differences between centers are sparse, and after performing the F-test on the covariates the noisy attributes are eliminated, which results in better performance.

Returning to the proof, suppose the conditions above hold for any mutually distinct triple (k, l, l′) and i ∈ G_k, j ∈ G_l. Then (C3) and (C4) hold. In fact, the target matrices can be defined through … The following two lemmas give sufficient conditions for (C1). Lemma 6 (Separation bound on the covariance matrices). Let λ_1, . . . , λ_p correspond to the eigenvalues of (Σ … Lemma 7 (Separation bound on the centers). Let δ > 0, β ∈ (0, 1), η ∈ (0, 1). If we have … where … for some large constant C. The proof of Lemma 7 follows similar steps to the original paper (Chen & Yang, 2021b). The two lemmas imply that (C1) holds with high probability if the separation condition in the assumption holds.
The remaining part is to verify (C2). Denote by Γ = span{1_{G_k} : k ∈ [K]}^⊥ the orthogonal complement of the linear space spanned by the cluster indicator vectors, where … By a concentration bound we can get … For T_k(v), we first define … Then we can write T_k(v) as … where T^{(1)} … From concentration bounds for Gaussians we have, for all triples (k, l, l′), … with probability ≥ 1 − CK^3/n^δ for some constant C.

Note that by assumption we have

(1 + δ)p log n/m, and since the remaining terms of T_{k,ll′} can be bounded by the above inequalities up to multiplicative constants, we can directly verify that (C2) holds under our assumptions. ■ Lemma 6 (Separation bound on the covariance matrices). Let λ_1, . . . , λ_p correspond to the eigenvalues of (Σ … then … Sketch of the proof. … − Id_p; then by definition we have … where the last three terms can be bounded by concentration bounds for Gaussians. ■
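For completeness, the heterogeneous covariance model Σ_k = U_k Λ_k U_k^⊤ used in the dimension-reduction experiment of Appendix A.3 can be generated as follows (illustrative sketch; constructing the random orthogonal U_k via QR of a Gaussian matrix is our choice):

```python
import numpy as np

def random_covariance(p, beta, rng=None):
    """Sigma = U diag(Lam) U^T: U random orthogonal (QR of a Gaussian matrix),
    Lam_i = 1 + beta * Z_i * 1(Z_i > 0) with Z_i standard Gaussian, so larger
    beta inflates the condition number, as in the experiment's construction."""
    rng = np.random.default_rng(rng)
    U, _ = np.linalg.qr(rng.standard_normal((p, p)))
    Z = rng.standard_normal(p)
    lam = 1.0 + beta * np.where(Z > 0, Z, 0.0)
    return U @ np.diag(lam) @ U.T

# Experiment setting: p = 20, beta = 5.
Sigma = random_covariance(p=20, beta=5.0, rng=0)
```

Since every diagonal entry of Λ is at least 1, the smallest eigenvalue of Σ is at least 1, and the condition number equals max_i Λ_i.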

