LIKELIHOOD ADJUSTED SEMIDEFINITE PROGRAMS FOR CLUSTERING HETEROGENEOUS DATA

Abstract

Clustering is a widely deployed unsupervised learning tool. Model-based clustering is a flexible framework for tackling data heterogeneity when the clusters have different shapes. Likelihood-based inference for mixture distributions often involves nonconvex and high-dimensional objective functions, posing difficult computational and statistical challenges. The classic expectation-maximization (EM) algorithm is a computationally thrifty iterative method that, in each iteration, maximizes a surrogate function minorizing the log-likelihood of the observed data; however, it suffers from bad local maxima even in the special case of the standard Gaussian mixture model with common isotropic covariance matrices. On the other hand, recent studies reveal that the unique global solution of a semidefinite programming (SDP) relaxation of K-means achieves the information-theoretically sharp threshold for perfectly recovering the cluster labels under the standard Gaussian mixture model. In this paper, we extend the SDP approach to a general setting by integrating cluster labels as model parameters, and we propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the exact observed likelihood in the presence of data heterogeneity. By lifting the cluster assignment to group-specific membership matrices, iLA-SDP avoids centroid estimation, a key feature that allows exact recovery under well-separated centroids without being trapped by their adversarial configurations. Thus iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data. Our numerical experiments demonstrate that iLA-SDP achieves lower mis-clustering errors than several widely used clustering methods, including K-means, SDP and EM algorithms.

1. INTRODUCTION

Cluster analysis has been widely studied and is regularly used in machine learning and its applications in network science (Girvan & Newman, 2002), computer vision (Shi & Malik, 2000; Joulin et al., 2010), manifold learning (Chen & Yang, 2021a) and bioinformatics (Karim et al., 2020). Perhaps the most popular clustering method by far is K-means (MacQueen, 1967), partly because computationally convenient algorithms such as Lloyd's algorithm and K-means++ are available for heuristic approximation (Lloyd, 1982; Arthur & Vassilvitskii, 2007). Mathematically, K-means seeks the partition of the data that minimizes the total within-cluster squared Euclidean distance, which is equivalent to the maximum profile likelihood estimator under the standard Gaussian mixture model (GMM) with common isotropic covariance matrices (Chen & Yang, 2021b). Nevertheless, real data usually exhibit various degrees of heterogeneity, for example cluster shapes that vary from component to component, which renders K-means a sub-optimal clustering method. Another popular clustering method is the classic expectation-maximization (EM) algorithm, a computationally thrifty method based on the idea of data augmentation that iteratively optimizes the nonconvex observed-data likelihood (Dempster et al., 1977). Theoretical investigations reveal that the EM algorithm suffers from bad local maxima even in the one-dimensional standard GMM with well-separated cluster centers (Jin et al., 2016). Thus, in practice, even in settings with a highly favorable separation-to-noise ratio, careful initialization, often through multiple random restarts or a warm start from another heuristic method such as hierarchical clustering (Fraley & Raftery, 2002), is key for the EM algorithm to find the correct cluster labels and model parameters.
With a reasonable initialization, the EM algorithm has been shown to achieve good statistical properties (Balakrishnan et al., 2017; Wu & Zhou, 2019).
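To make the K-means objective concrete, the following is a minimal NumPy sketch of Lloyd's algorithm mentioned above; it is an illustrative implementation of the standard heuristic, not the method proposed in this paper, and the function name and defaults are our own choices.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm for the K-means objective.

    Alternates between (i) assigning each point to its nearest centroid and
    (ii) recomputing each centroid as the mean of its assigned points, which
    monotonically decreases the total within-cluster squared Euclidean distance.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distances of all points to all centroids.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: centroid = mean of its cluster (kept fixed if empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged to a fixed point (a local optimum)
        centroids = new_centroids
    return labels, centroids
```

As the surrounding discussion notes, this procedure only finds a local optimum of a nonconvex objective, which motivates both careful initialization schemes such as K-means++ and the convex SDP relaxations studied in this paper.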

