LIKELIHOOD ADJUSTED SEMIDEFINITE PROGRAMS FOR CLUSTERING HETEROGENEOUS DATA

Abstract

Clustering is a widely deployed unsupervised learning tool. Model-based clustering is a flexible framework for tackling data heterogeneity when the clusters have different shapes. Likelihood-based inference for mixture distributions often involves nonconvex and high-dimensional objective functions, posing difficult computational and statistical challenges. The classic expectation-maximization (EM) algorithm is a computationally thrifty iterative method that, in each iteration, maximizes a surrogate function minorizing the log-likelihood of the observed data; however, it suffers from bad local maxima even in the special case of the standard Gaussian mixture model with common isotropic covariance matrices. On the other hand, recent studies reveal that the unique global solution of a semidefinite programming (SDP) relaxation of K-means achieves the information-theoretically sharp threshold for perfectly recovering the cluster labels under the standard Gaussian mixture model. In this paper, we extend the SDP approach to a general setting by integrating cluster labels as model parameters, and we propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the exact observed likelihood in the presence of data heterogeneity. By lifting the cluster assignment to group-specific membership matrices, iLA-SDP avoids centroid estimation, a key feature that allows exact recovery under sufficient separation of the centroids without being trapped by adversarial centroid configurations. Thus iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data. Our numerical experiments demonstrate that iLA-SDP achieves lower mis-clustering errors than several widely used clustering methods, including K-means, SDP and EM algorithms.

1. INTRODUCTION

Cluster analysis has been widely studied and regularly used in machine learning and its applications in network science (Girvan & Newman, 2002), computer vision (Shi & Malik, 2000; Joulin et al., 2010), manifold learning (Chen & Yang, 2021a) and bioinformatics (Karim et al., 2020). Perhaps the most popular clustering method by far is K-means (MacQueen, 1967), partly because there are computationally convenient algorithms, such as Lloyd's algorithm and K-means++, for heuristic approximation (Lloyd, 1982; Arthur & Vassilvitskii, 2007). Mathematically, K-means aims to find the partition of the data that minimizes the total within-cluster squared Euclidean distance, which is equivalent to the maximum profile likelihood estimator under the standard Gaussian mixture model (GMM) with common isotropic covariance matrices (Chen & Yang, 2021b). Nevertheless, real data usually exhibit various degrees of heterogeneity, for instance cluster shapes that vary from component to component, which renders K-means a sub-optimal clustering method. Another popular clustering method is the classic expectation-maximization (EM) algorithm, a computationally thrifty method based on the idea of data augmentation that iteratively optimizes the non-convex observed-data likelihood (Dempster et al., 1977). Theoretical investigations reveal that the EM algorithm suffers from bad local maxima even in the one-dimensional standard GMM with well-separated cluster centers (Jin et al., 2016). Thus, in practice, even in highly favorable separation-to-noise settings, careful initialization, often through multiple random restarts or a warm start from another heuristic method such as hierarchical clustering (Fraley & Raftery, 2002), is key for the EM algorithm to find the correct cluster labels and model parameters.
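For concreteness, Lloyd's algorithm mentioned above alternates between assigning each point to its nearest centroid and recomputing centroids as cluster means. The sketch below is our own minimal illustration, not the paper's implementation; the optional `init` argument is added so the heuristic's dependence on initialization can be explored.

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, init=None, seed=0):
    """Minimal sketch of Lloyd's heuristic for K-means.

    Alternates an assignment step (nearest centroid in squared
    Euclidean distance) and an update step (centroid = cluster mean)
    until the centroids stop moving. Converges only to a local
    optimum, so the result depends on the initial centroids.
    """
    rng = np.random.default_rng(seed)
    if init is None:
        centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: distance of every point to every centroid
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: move each centroid to the mean of its points
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

With well-separated clusters and a reasonable start the iteration converges in a few steps; with a poor start it can settle into exactly the kind of bad local optimum discussed above.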
With a reasonable initial start, the EM algorithm has been shown to achieve good statistical properties (Balakrishnan et al., 2017; Wu & Zhou, 2019). In this paper, we consider likelihood-based inference to tackle the problem of recovering cluster labels in the presence of data heterogeneity. Our motivation stems from recent progress in understanding the computational and statistical limits of convex relaxations of K-means clustering. Since K-means is a worst-case NP-hard problem (Aloise et al., 2009), various heuristic approximation algorithms, such as Lloyd's algorithm (Lloyd, 1982; Lu & Zhou, 2016), and computationally tractable relaxations, such as spectral clustering (Meila & Shi, 2001; Ng et al., 2001; Vempala & Wang, 2004; Achlioptas & McSherry, 2005; von Luxburg, 2007; von Luxburg et al., 2008) and semidefinite programs (SDP) (Peng & Wei, 2007; Mixon et al., 2016; Li et al., 2017; Fei & Chen, 2018; Chen & Yang, 2021a; Royer, 2017; Giraud & Verzelen, 2018; Bunea et al., 2016; Zhuang et al., 2022a), have been proposed in the literature. Among the existing solutions, the SDP approach is particularly attractive in that it attains the information-theoretically optimal threshold on centroid separation for exact recovery of cluster labels (Chen & Yang, 2021b).

Our contributions. We extend the SDP approach to a general setting with heterogeneous features by integrating cluster labels as model parameters (together with other component-specific parameters) and propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the exact observed-data likelihood. Our idea is to leverage the strength of the SDP relaxation of K-means clustering in the isotropic covariance case for likelihood-aware inference. On the one hand, iLA-SDP has a similar flavor to the EM algorithm in that it maximizes the likelihood function of the observed data.
On the other hand, different from the EM framework, iLA-SDP treats the cluster labels as unknown parameters while profiling out the cluster centers (i.e., centroids), which brings several statistical and algorithmic advantages. First, even in the arguably simplest one-dimensional GMM setting, EM is known to fail for certain configurations of centroids even when they are well-separated (Jin et al., 2016). In other words, EM is sensitive to initialization and model configuration. The main reason is that EM must estimate the cluster centers during its iterations. In iLA-SDP, the cluster centers are regarded as nuisance parameters and profiled out, yielding a likelihood function whose component-specific parameters are only the cluster covariance matrices. Thus iLA-SDP is more stable and performs empirically better than EM. Second, the cluster labels in EM are latent variables estimated by their posterior probabilities, and the observed log-likelihood for the component parameters and mixing weights is optimized through minorizing functions during the iterations. In iLA-SDP, the cluster labels are parameters optimized through the likelihood function jointly in the labels and the covariance matrices. Thus iLA-SDP is a more direct approach than EM for taming the non-convexity of the observed log-likelihood objective, and we prove that it perfectly recovers the true clustering structure provided the cluster separation exceeds a lower bound, regardless of the configuration of the centroids.

The rest of the paper is organized as follows. In Section 2, we review background on the partition-based formulation of model-based clustering. In Section 3, we introduce the likelihood adjusted SDP for recovering the true partition structure and discuss its connection to the EM algorithm. In Section 4, we compare the performance of several widely used clustering methods on two real datasets.
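The lift that lets the SDP approach avoid centroid estimation can be checked numerically in the K-means (isotropic) case: the within-cluster sum of squares equals tr(A) - <A, Z>, where A is the Gram matrix of the data and Z = sum_k |G_k|^{-1} 1_{G_k} 1_{G_k}^T is the lifted membership matrix, so the objective is linear in Z and no centroids appear. The snippet below is our own minimal numpy sketch of this identity; the variable names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))                 # 12 points in R^3
labels = np.array([0] * 4 + [1] * 4 + [2] * 4)
K = 3

# within-cluster sum of squares, computed with explicit centroids
wcss = sum(
    ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
    for k in range(K)
)

# lifted membership matrix Z: block (1/|G_k|) on each cluster's index set
n = len(X)
A = X @ X.T                                  # Gram matrix of the data
Z = np.zeros((n, n))
for k in range(K):
    idx = labels == k
    Z[np.ix_(idx, idx)] = 1.0 / idx.sum()

# centroid-free form of the K-means objective: linear in Z
lifted = np.trace(A) - np.sum(A * Z)
assert np.isclose(wcss, lifted)
```

The SDP relaxation then maximizes <A, Z> over the convex set of matrices satisfying the constraints that Z obeys by construction (Z positive semidefinite, entrywise nonnegative, rows summing to one, tr(Z) = K), rather than over the combinatorial set of partitions.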

2. MODEL-BASED CLUSTERING: A PARTITION FORMULATION

We consider the model-based clustering problem. Suppose the data points X_1, …, X_n ∈ ℝ^p are independent random variables sampled from a K-component Gaussian mixture model (GMM). Specifically, let G*_1, …, G*_K be the true partition of the index set [n] := {1, …, n} such that if i ∈ G*_k, then X_i = μ_k + ε_i, where μ_k ∈ ℝ^p is the center of the k-th cluster and the ε_i are independent noise terms with ε_i ∼ N(0, Σ_k) for i ∈ G*_k. Here we focus on the most general and realistic scenario where the within-cluster covariance matrices Σ_1, …, Σ_K are heterogeneous. In our formulation of the GMM, the true partition (G*_k)_{k=1}^K is treated as an unknown parameter of the model, along with the component-wise parameters (μ_k, Σ_k)_{k=1}^K. With this parameterization (G_k, μ_k, Σ_k)_{k=1}^K, the log-likelihood function for observing the data X = {X_1, …, X_n} is given by

ℓ((G_k, μ_k, Σ_k)_{k=1}^K | X) = -∑_{k=1}^K (|G_k|/2) log((2π)^p |Σ_k|) - (1/2) ∑_{k=1}^K ∑_{i∈G_k} (X_i - μ_k)^⊤ Σ_k^{-1} (X_i - μ_k).
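The partition log-likelihood above is just a sum of Gaussian log-densities over the points assigned to each cluster, and for a fixed partition it is maximized over each μ_k by the cluster mean, which is what "profiling out" the centroids exploits. Below is a minimal numpy sketch of this formula; `partition_loglik` and its argument names are our own illustration, not the paper's code.

```python
import numpy as np

def partition_loglik(X, groups, mus, Sigmas):
    """Partition-based GMM log-likelihood l((G_k, mu_k, Sigma_k) | X).

    groups : list of index arrays G_1, ..., G_K partitioning range(n)
    mus    : list of cluster centers mu_k in R^p
    Sigmas : list of cluster covariance matrices Sigma_k (p x p)
    """
    p = X.shape[1]
    ll = 0.0
    for idx, mu, Sigma in zip(groups, mus, Sigmas):
        D = X[idx] - mu                      # centered points of cluster k
        _, logdet = np.linalg.slogdet(Sigma)
        # quadratic form (X_i - mu_k)^T Sigma_k^{-1} (X_i - mu_k), rowwise
        quad = np.sum(D * (D @ np.linalg.inv(Sigma)), axis=1)
        # -(|G_k|/2) log((2 pi)^p |Sigma_k|) - (1/2) sum_i quad_i
        ll += -0.5 * len(idx) * (p * np.log(2 * np.pi) + logdet) - 0.5 * quad.sum()
    return ll
```

With identity covariances the formula reduces to -np/2 log(2π) minus half the total squared distance to the assigned centers, and plugging in the cluster means gives a value no smaller than any other choice of centers for the same partition.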

