k-MEDIAN CLUSTERING VIA METRIC EMBEDDING: TOWARDS BETTER INITIALIZATION WITH PRIVACY

Anonymous

Abstract

In clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. We propose a new initialization scheme for the k-median problem in general metric spaces (e.g., discrete spaces induced by graphs), based on constructing a metric embedding tree structure of the data. From the tree, we propose a novel and efficient search algorithm for good initial centers, which can subsequently be used by the local search algorithm. The resulting HST initialization method can produce initial centers achieving lower errors than those from another popular initialization method, k-median++, with comparable efficiency. Our HST initialization can also be easily extended to the setting of differential privacy (DP) to generate private initial centers. We show that the error of applying DP local search after our private HST initialization improves on previous results for the approximation error and approaches the lower bound within a small factor. Experiments demonstrate the effectiveness of our proposed methods.

1. INTRODUCTION

Clustering is an important problem in unsupervised learning that has been widely studied in statistics, data mining, network analysis, etc. (Punj and Stewart, 1983; Dhillon and Modha, 2001; Banerjee et al., 2005; Berkhin, 2006; Abbasi and Younis, 2007). The goal of clustering is to partition a set of data points into clusters such that items in the same cluster are similar, while items in different clusters are dissimilar. This is concretely measured by the sum of distances (or squared distances) between each point and its nearest cluster center. One conventional way to evaluate a clustering algorithm is the guarantee that, with high probability, cost(C, D) ≤ γ OPT_k(D) + ξ, where C is the set of centers output by the algorithm, cost(C, D) is a cost function defined for C on dataset D, and OPT_k(D) is the cost of the optimal (oracle) clustering solution on D. When everything is clear from context, we write OPT for short. Here, γ is called the multiplicative error and ξ is called the additive error. Alternatively, we may also use the notion of expected cost.

Two widely studied clustering problems are 1) the k-median problem, and 2) the k-means problem. The origin of k-median dates back to the 1970s (e.g., Kaufman et al. (1977)), where one tries to find the locations of facilities that minimize the cost measured by the distance between clients and facilities. Formally, given a set of points D and a distance measure, the goal is to find k center points minimizing the sum of absolute distances from each sample point to its nearest center. In k-means, the objective is to minimize the sum of squared distances instead. In particular, k-median is usually the one used for clustering on graph/network data.

In general, there are two popular frameworks for clustering. One heuristic is Lloyd's algorithm (Lloyd, 1982), which is built upon an iterative distortion minimization approach.
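To make the objectives and the guarantee cost(C, D) ≤ γ OPT_k(D) + ξ concrete, the following is a minimal Python sketch and not the method proposed in this paper. The function names (`clustering_cost`, `brute_force_opt`, `within_guarantee`) and the toy data are illustrative assumptions. The sketch further assumes that centers are chosen among the data points themselves and computes OPT by exhaustive search, which is feasible only for tiny instances.

```python
from itertools import combinations


def clustering_cost(points, centers, dist, power=1):
    """Sum over points of (distance to the nearest center) ** power.

    power=1 gives the k-median cost, power=2 the k-means cost.
    `dist` can be any metric, e.g. a shortest-path distance on a graph.
    """
    return sum(min(dist(p, c) for c in centers) ** power for p in points)


def brute_force_opt(points, k, dist, power=1):
    """Exhaustive OPT_k(D); assumes centers are picked among the data points."""
    best = min(
        combinations(points, k),
        key=lambda cs: clustering_cost(points, cs, dist, power),
    )
    return best, clustering_cost(points, best, dist, power)


def within_guarantee(cost, opt, gamma, xi):
    """Check cost <= gamma * OPT + xi for given multiplicative/additive errors."""
    return cost <= gamma * opt + xi


# Toy data (hypothetical): four points on a line with the absolute-difference metric.
points = [0, 1, 2, 10]


def line_dist(a, b):
    return abs(a - b)


kmedian_one_center = clustering_cost(points, [1], line_dist)           # k-median cost
kmeans_one_center = clustering_cost(points, [1], line_dist, power=2)   # k-means cost
opt_centers, opt_cost = brute_force_opt(points, 2, line_dist)
```

On this toy instance, the single center 1 has k-median cost 11 and k-means cost 83, which shows how squaring amplifies the contribution of the far-away point 10.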
In most cases, Lloyd's algorithm can only be applied to numerical data, typically in the (continuous) Euclidean space. Clustering in general metric spaces (discrete spaces) is also important and useful when dealing with, for example, graph data, where Lloyd's method is no longer applicable. A more broadly applicable approach, the local search method (Kanungo et al., 2002; Arya et al., 2004), has also been widely studied. In each iteration, it finds the best swap between a center in the current center set and a non-center data point, and it keeps swapping as long as the cost decreases. Local search can achieve a constant approximation ratio (γ = O(1)) to the optimal solution for k-median (Arya et al., 2004).

Initialization of cluster centers. It is well known that the performance of clustering can be highly sensitive to initialization. If clustering starts with good initial centers (i.e., with small approximation error), the algorithm may use fewer iterations to find a better solution. The k-median++

