k-MEDIAN CLUSTERING VIA METRIC EMBEDDING: TOWARDS BETTER INITIALIZATION WITH PRIVACY

Anonymous

Abstract

In clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. We propose a new initialization scheme for the k-median problem in general metric spaces (e.g., discrete spaces induced by graphs), based on the construction of a metric embedding tree structure of the data. From the tree, we derive a novel and efficient search algorithm for good initial centers, which can subsequently be used by the local search algorithm. The so-called HST initialization method can produce initial centers achieving lower error than those from another popular initialization method, k-median++, with comparable efficiency. Our HST initialization can also be easily extended to the setting of differential privacy (DP) to generate private initial centers. We show that the error of DP local search initialized with our private HST method improves previous results on the approximation error, and approaches the lower bound within a small factor. Experiments demonstrate the effectiveness of our proposed methods.

1. INTRODUCTION

Clustering is an important problem in unsupervised learning that has been widely studied in statistics, data mining, network analysis, etc. (Punj and Stewart, 1983; Dhillon and Modha, 2001; Banerjee et al., 2005; Berkhin, 2006; Abbasi and Younis, 2007). The goal of clustering is to partition a set of data points into clusters such that items in the same cluster are expected to be similar, while items in different clusters should be different. This is concretely measured by the sum of distances (or squared distances) between each point and its nearest cluster center. One conventional notion used to evaluate a clustering algorithm is: with high probability,

cost(C, D) ≤ γ·OPT_k(D) + ξ,

where C is the set of centers output by the algorithm and cost(C, D) is a cost function defined for C on dataset D. OPT_k(D) is the cost of the optimal (oracle) clustering solution on D; when everything is clear from context, we write OPT for short. Here, γ is called the multiplicative error and ξ the additive error. Alternatively, we may also use the notion of expected cost.

Two popularly studied clustering problems are 1) the k-median problem and 2) the k-means problem. The origin of k-median dates back to the 1970s (e.g., Kaufman et al. (1977)), where one tries to find the best location of facilities that minimizes the cost measured by the distance between clients and facilities. Formally, given a set of points D and a distance measure, the goal is to find k center points minimizing the sum of absolute distances of each sample point to its nearest center. In k-means, the objective is to minimize the sum of squared distances instead. In particular, k-median is usually the formulation used for clustering on graph/network data.

In general, there are two popular frameworks for clustering. One heuristic is Lloyd's algorithm (Lloyd, 1982), which is built upon an iterative distortion minimization approach.
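To make the cost notion above concrete, here is a minimal sketch of the k-median objective under a generic distance function (the function names are ours, for illustration only):

```python
def kmedian_cost(centers, data, dist):
    """k-median cost: sum over points of the distance to the nearest center.
    Replacing dist with its square would give the k-means objective instead."""
    return sum(min(dist(x, c) for c in centers) for x in data)

# Toy example on the real line with absolute distance:
data = [0.0, 1.0, 9.0, 10.0]
dist = lambda x, c: abs(x - c)
print(kmedian_cost([0.0, 10.0], data, dist))  # 0 + 1 + 1 + 0 = 2.0
```

With C = {0, 10}, each point is assigned to its nearest center, so the two middle points each contribute 1 to the cost.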
In most cases, this method can only be applied to numerical data, typically in (continuous) Euclidean space. Clustering in general (discrete) metric spaces is also important and useful when dealing with, for example, graph data, where Lloyd's method is no longer applicable. A more broadly applicable approach, the local search method (Kanungo et al., 2002; Arya et al., 2004), has also been widely studied. It iteratively finds the optimal swap between the center set and the non-center data points to keep lowering the cost. Local search can achieve a constant approximation ratio (γ = O(1)) to the optimal solution for k-median (Arya et al., 2004).

Initialization of cluster centers. It is well known that the performance of clustering can be highly sensitive to initialization. If clustering starts with good initial centers (i.e., with small approximation error), the algorithm may use fewer iterations to find a better solution. The k-median++ algorithm (Arthur and Vassilvitskii, 2007) iteratively selects k data points as initial centers, favoring distant points in a probabilistic way. Intuitively, the initial centers tend to be well spread over the data points (i.e., over different clusters). The produced initial centers are proved to have O(log k) multiplicative error. Follow-up works on k-means++ further improved its efficiency and scalability, e.g., Bahmani et al. (2012); Bachem et al. (2016); Lattanzi and Sohler (2019). In this work, we propose a new initialization framework, called HST initialization, based on metric embedding techniques. Our method is built upon a novel search algorithm on metric embedding trees, with approximation error and running time comparable to k-median++. Moreover, and importantly, our initialization scheme can be conveniently combined with the notion of differential privacy (DP).

Clustering with Differential Privacy.
The concept of differential privacy (Dwork, 2006; McSherry and Talwar, 2007) has become popular for rigorously defining and resolving the problem of keeping useful information for model learning while protecting the privacy of each individual. The private k-means problem has been widely studied, e.g., Feldman et al. (2009); Nock et al. (2016); Feldman et al. (2017), mostly in the continuous Euclidean space. Balcan et al. (2017) considered identifying (in a private manner) a good candidate set of centers before applying private local search, which yields O(log^3 n) multiplicative error and O((k^2 + d) log^5 n) additive error. Later on, the Euclidean k-means errors were further improved to γ = O(1) and ξ = O(k^1.01 · d^0.51 + k^1.5) by Stemmer and Kaplan (2018), with more advanced candidate set selection. Huang and Liu (2018) gave an optimal algorithm in terms of minimizing the Wasserstein distance under some data separability condition.

For private k-median clustering, Feldman et al. (2009) considered the problem in high-dimensional Euclidean space. However, it is rather difficult to extend their analysis to more general metrics in discrete spaces (e.g., on graphs). The strategy of Balcan et al. (2017) to form a candidate center set could as well be adopted for k-median, which leads to O(log^{3/2} n) multiplicative error and O((k^2 + d) log^3 n) additive error in high-dimensional Euclidean space. In discrete spaces, Gupta et al. (2010) proposed a private method for the classical local search heuristic, which applies to both k-median and k-means. To cast privacy on each swapping step, the authors applied the exponential mechanism of McSherry and Talwar (2007). Their method produces an ϵ-differentially private solution with cost 6·OPT + O(△k^2 log^2 n/ϵ), where △ is the diameter of the point set. In this work, we show that our HST initialization can improve DP local search for k-median (Gupta et al., 2010) in terms of both approximation error and efficiency.

The main contributions of this work include:

• We introduce the Hierarchically Well-Separated Tree (HST) to the k-median clustering problem for initialization. We design an efficient sampling strategy to select the initial center set from the tree, with an approximation factor O(log min{k, △}) in the non-private setting, which is O(log min{k, d}) when △ = O(d) (e.g., bounded data). This improves the O(log k) error of k-means++/k-median++ in, e.g., low-dimensional Euclidean space.

• We propose a differentially private version of HST initialization under the setting of Gupta et al. (2010) in discrete metric spaces. The so-called DP-HST algorithm finds initial centers with O(log n) multiplicative error and O(ϵ^{-1}△k^2 log^2 n) additive error. Moreover, running DP local search starting from this initialization gives O(1) multiplicative error and O(ϵ^{-1}△k^2 (log log n) log n) additive error, which improves previous results towards the well-known lower bound O(ϵ^{-1}△k log(n/k)) on the additive error of DP k-median (Gupta et al., 2010), within a small O(k log log n) factor. This is the first clustering initialization method with a differential privacy guarantee and an improved error rate in general metric spaces.

• We conduct experiments on simulated and real-world datasets to demonstrate the effectiveness of our methods. In both non-private and private settings, our proposed HST-based approach achieves smaller cost at initialization than k-median++, which may also lead to improvements in the final clustering quality.

Definition 2.1 (Differential Privacy (DP) (Dwork, 2006)). If for any two adjacent datasets D and D′ with symmetric difference of size one, and for any O ⊂ Range(A), an algorithm A satisfies

Pr[A(D) ∈ O] ≤ e^ϵ · Pr[A(D′) ∈ O],

then algorithm A is said to be ϵ-differentially private.
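Gupta et al. (2010) cast privacy on each swap of local search via the exponential mechanism, as noted above. A minimal sketch of that mechanism follows (function and parameter names are ours, not from the paper; for k-median one would score a candidate center set by its negated cost, with sensitivity proportional to the diameter △):

```python
import math
import random

def exponential_mechanism(candidates, score, eps, sensitivity, rng=None):
    """Privately select a candidate: Pr[c] ∝ exp(eps * score(c) / (2 * sensitivity)).
    Higher-scoring candidates (e.g., lower-cost swaps, via a negated cost) are
    exponentially favored while satisfying eps-differential privacy."""
    rng = rng or random.Random(0)
    scores = [score(c) for c in candidates]
    m = max(scores)  # shift by the max score for numerical stability
    weights = [math.exp(eps * (s - m) / (2.0 * sensitivity)) for s in scores]
    r = rng.uniform(0.0, sum(weights))
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if acc >= r:
            return c
    return candidates[-1]

# With a large eps, the highest-scoring candidate is selected almost surely:
print(exponential_mechanism([0, 1, 2], lambda c: c, eps=50.0, sensitivity=1.0))
```

Smaller eps flattens the distribution (more privacy, noisier selection); larger sensitivity has the same flattening effect, which is why the additive error above grows with △.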


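As a concrete check of the probability-ratio condition in Definition 2.1, consider the classical randomized response mechanism (our illustrative example, not part of the paper's method): reporting a private bit truthfully with probability 3/4 and flipped with probability 1/4 satisfies ϵ-DP with ϵ = ln 3, since no output is more than 3 times likelier under one input than the other.

```python
import math
import random

def randomized_response(bit, p=0.75, rng=None):
    """Report the true bit with probability p, else flip it.
    For p = 3/4, the output probabilities under inputs 0 vs. 1 differ by
    at most a factor (3/4)/(1/4) = 3 = e^eps, i.e., eps = ln(3)-DP."""
    rng = rng or random.Random()
    return bit if rng.random() < p else 1 - bit

# Verify the e^eps bound of Definition 2.1 directly on the output distribution:
p, eps = 0.75, math.log(3)
for o in (0, 1):
    pr_given_0 = p if o == 0 else 1 - p
    pr_given_1 = p if o == 1 else 1 - p
    assert pr_given_0 <= math.exp(eps) * pr_given_1 + 1e-12
    assert pr_given_1 <= math.exp(eps) * pr_given_0 + 1e-12
```

The same ratio bound is what the exponential mechanism enforces for each private swap in DP local search, with the score's sensitivity playing the role of the bit flip probability.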