A COMMUNICATION EFFICIENT FEDERATED KERNEL k-MEANS

Anonymous authors
Paper under double-blind review

Abstract

A federated kernel k-means algorithm is developed in this paper. This algorithm resolves two challenging issues: 1) how to solve the optimization problem of kernel k-means distributedly under federated settings; 2) how to maintain communication efficiency in the algorithm. To tackle the first challenge, a distributed stochastic proximal gradient descent (DSPGD) algorithm is developed to compute an approximate solution to the optimization problem of kernel k-means. To tackle the second challenge, a communication efficient mechanism (CEM) is designed to reduce the communication cost. In addition, the federated kernel k-means provides two levels of privacy preservation: 1) users' local data are not exposed to the cloud server; 2) the cloud server cannot recover users' local data from the local computational results via matrix operations. Theoretical analysis shows that: 1) DSPGD with CEM converges at an O(1/T) rate, where T is the number of iterations; 2) the communication cost of DSPGD with CEM is unrelated to the number of data samples; 3) the clustering quality of the federated kernel k-means approaches that of the standard kernel k-means, with a (1 + ε) approximation ratio. The experimental results show that the federated kernel k-means achieves the highest clustering quality, with the communication cost reduced by more than 60% in most cases.

1. INTRODUCTION

Conventionally, kernel k-means (Dhillon et al., 2004) is conducted in a centralized manner where the training data are stored in one place, such as a cloud server. However, as a rapidly growing number of devices are connected to the Internet, the volume of generated data increases exponentially (Chiang & Zhang, 2016). Uploading all these data to the cloud server can incur a large communication bandwidth cost. For example, a smartphone manufacturer usually needs to analyze the usage patterns of its smartphones in order to optimize their energy consumption performance. The usage patterns can be obtained by clustering users' energy consumption data via kernel k-means. However, if the number of users reaches the order of millions, it may not be cost-effective to upload all the users' energy consumption data to the cloud server. Besides, uploading users' raw data to the cloud server can lead to data privacy issues. To resolve these issues, a promising approach is to develop a distributed kernel k-means algorithm that can be executed under federated settings (McMahan et al., 2017; Yang et al., 2019), where raw data are maintained by users and the cloud has no access to them. In this algorithm, a local training process is conducted at each user's device based on the local data only. The local computational results, rather than the local data, are then uploaded to the cloud server to accomplish the kernel k-means clustering. During this procedure, users' local data are never exposed to the cloud server, which provides a basic level of privacy. Besides, it is usually more communication efficient to upload the local computational results than to upload the local data to the cloud server.
However, it is nontrivial to design a federated learning algorithm for kernel k-means due to three challenging issues: 1) how to solve the optimization problem of kernel k-means in a distributed manner without sending users' data to a central place; 2) how to maintain communication efficiency in the algorithm; 3) how to protect users' data privacy in the algorithm. Considering the first issue under federated settings, the key problem is to obtain the top eigenpairs of the kernel matrix K (as required by kernel k-means) in a distributed manner. To solve this problem, a distributed stochastic proximal gradient descent (DSPGD) algorithm is developed as follows. Since K is not available under federated settings, an estimate of K, denoted as ξ, is first constructed distributedly at users' devices based on random features (Rahimi & Recht, 2008) of local data samples. Since this estimate is distributed among different devices, it is processed by the distributed Lanczos algorithm (DLA) (Penna & Stańczak, 2014) to form a global estimate of K (denoted as Z) at the cloud server. Afterwards, an approximate version of the top eigenpairs of K can be obtained from Z through singular value decomposition (SVD). To improve the accuracy of approximation, the former steps are conducted iteratively. More specifically, in the t-th iteration, an estimate ξ_t is constructed at users' devices, and the estimate Z_t at the cloud server is then updated to Z_{t+1} via stochastic proximal gradient descent (SPGD) (Zhang et al., 2016). It is proved that, after sufficiently many iterations, Z_t converges to a low rank matrix whose top eigenpairs are the same as those of K.¹ As a result, the top eigenpairs of K are finally obtained at the cloud server. To resolve the second issue, the DLA operations in DSPGD need to be enhanced to reduce the communication cost.
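The two building blocks of DSPGD can be sketched on a single machine: a random-feature estimate ξ_t of the kernel matrix, and an SPGD step with a nuclear-norm proximal operator that drives Z_t toward a low rank matrix. This is only an illustrative sketch; the feature dimension D, the RBF parameter gamma, the step size eta, and the nuclear-norm weight lam are assumed values, and the actual algorithm splits these computations across users' devices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for users' local samples (assumed shapes/values).
n, d, D = 200, 5, 400            # n samples, input dim d, D random features
X = rng.normal(size=(n, d))
gamma = 0.05                     # RBF kernel parameter (assumed)

def random_feature_map(X, rng):
    """Random Fourier features: phi(x)^T phi(y) estimates the RBF kernel."""
    Wf = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ Wf + b)

# Exact kernel matrix (unavailable under federated settings; used here
# only to check the quality of the stochastic estimates).
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-gamma * sq)

def prox_nuclear(Z, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# SPGD iterations: each round draws a fresh stochastic estimate xi_t of K
# and takes a proximal gradient step on Z_t.
Z = np.zeros((n, n))
eta, lam = 0.5, 1.0              # step size and nuclear-norm weight (assumed)
for t in range(20):
    Phi_t = random_feature_map(X, rng)
    xi_t = Phi_t @ Phi_t.T       # stochastic estimate of K
    Z = prox_nuclear(Z - eta * (Z - xi_t), eta * lam)

rel_err = np.linalg.norm(K - xi_t) / np.linalg.norm(K)
```

The soft-thresholding step shrinks the eigenvalues by roughly the regularization weight while leaving the eigenvectors unchanged, which is consistent with the footnote: the limit shares K's top eigenvectors, but its nonzero eigenvalues are smaller by a constant.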
When DLA is executed in DSPGD, the process of obtaining an updated Z_t at the cloud server incurs high communication cost between users' devices and the cloud server, because the operation is conducted on matrices (e.g., ξ_t) whose number of rows/columns equals the number of data samples. To prevent the communication cost from growing with the number of data samples, a communication efficient mechanism (CEM) is designed so that DLA operates on a different type of matrix whose dimensions are reduced and independent of the number of data samples. More specifically, a new matrix W_t is designed such that: 1) W_t W_t^⊤ has the same eigenvectors as Z_{t+1}, but its eigenvalues are smaller by a constant; 2) W_t and Z_t can be constructed distributedly at users' devices based on local values of ξ_t. Furthermore, DLA is applied to W_t^⊤ W_t (instead of W_t W_t^⊤), so its operations are performed on matrices with a greatly reduced dimension. Via DLA operations between users' devices and the cloud server, W_t and Z_t are updated iteratively, and the top eigenpairs of W_t^⊤ W_t are then obtained at users' devices. Once Z_t converges, users' devices transform the top eigenpairs of W_t^⊤ W_t into those of W_t W_t^⊤ and further obtain the eigenpairs of Z_t. Instead of sending these eigenpairs to the cloud server, a distributed linear k-means algorithm (Balcan et al., 2013) is incorporated into CEM so that the cloud server can perform clustering directly on the eigenpairs of the converged Z_t. As shown in the process of CEM, the communication efficiency of DSPGD is significantly improved. For the third issue, the federated kernel (FK) k-means based on DSPGD and CEM provides two levels of privacy preservation: 1) users' local data are not exposed to the cloud server; 2) the cloud server cannot recover users' local data from the local computational results via matrix operations.
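The dimension reduction at the heart of CEM rests on a standard linear-algebra fact: the nonzero eigenpairs of the n×n matrix W_t W_t^⊤ can be recovered from those of the much smaller W_t^⊤ W_t. A minimal sketch with assumed sizes (n data samples, reduced dimension m):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 20                   # n samples, m << n reduced dimension (assumed)
W = rng.normal(size=(n, m))

# Work with the small m x m Gram matrix: its size is independent of n.
G = W.T @ W
lam, V = np.linalg.eigh(G)       # eigenvalues in ascending order

# Transfer: if G v = lam v with lam > 0, then (W W^T)(W v) = W (G v) = lam (W v),
# so u = W v / sqrt(lam) is a unit eigenvector of W W^T with the same eigenvalue.
keep = lam > 1e-10
U = (W @ V[:, keep]) / np.sqrt(lam[keep])
```

Only m-dimensional quantities need to be manipulated during the eigen-computation; each device can then reconstruct the n-dimensional eigenvectors from its local rows of W, which is why the communication cost no longer scales with the number of data samples.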
To provide stronger privacy, a differential privacy mechanism (Dwork et al., 2006) needs to be integrated with FK k-means, which is left for future study. The theoretical analysis shows that DSPGD with CEM converges to Z* at an O(1/T) rate, where T is the number of iterations. The communication cost of DSPGD with CEM is linear in the dimension of the right singular vectors times the number of users, which can be much smaller than the number of data samples. The clustering quality of the federated kernel k-means approaches that of kernel k-means, with a (1 + ε) approximation ratio. The experimental results show that, compared with the state-of-the-art schemes, FK k-means achieves the highest clustering quality with the communication cost reduced by more than 60% in most cases.
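The final clustering step described above, linear k-means on the top eigenvectors of the (approximated) kernel matrix, is the standard spectral relaxation of kernel k-means. A toy single-machine sketch, using Lloyd's algorithm with a simple farthest-point initialization (the data, kernel parameter, and sizes are all assumed):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated blobs standing in for the data (assumed toy setup).
A = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
B = rng.normal(loc=5.0, scale=0.3, size=(30, 2))
X = np.vstack([A, B])

# RBF kernel matrix and its top-k eigenvectors (k = number of clusters).
gamma, k = 0.5, 2
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-gamma * sq)
_, vecs = np.linalg.eigh(K)
Y = vecs[:, -k:]                 # embed each sample by its eigenvector entries

def lloyd(Y, k, iters=20):
    """Plain linear k-means with farthest-point initialization."""
    C = [Y[0]]
    for _ in range(1, k):
        d2 = np.min(((Y[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2), axis=1)
        C.append(Y[np.argmax(d2)])
    C = np.array(C)
    for _ in range(iters):
        labels = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        C = np.array([Y[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = lloyd(Y, k)
```

Running linear k-means in the eigenvector embedding recovers the two blobs; in FK k-means this step is carried out by the distributed linear k-means of Balcan et al. (2013) on the eigenpairs of the converged Z_t, without the eigenvectors ever being sent to the cloud server in full.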

2.1. DISTRIBUTED KERNEL k-MEANS

Many algorithms have been developed to conduct kernel k-means clustering in a distributed way. Kernel approximation is a popular approach employed in these algorithms, including the Nyström method (Chitta et al., 2011; 2014; Wang et al., 2019) and the random feature method (Chitta et al., 2012). A trimmed kernel k-means algorithm (Tsapanos et al., 2015) decreases the computational cost and the space complexity by significantly reducing the number of non-zero entries in K via a kernel matrix trimming algorithm. In (Elgohary et al., 2014), an approximate nearest centroid (APNC) embedding is developed to embed the data samples so that the clustering assignment step



¹ More specifically, the top eigenvectors of the low rank matrix are the same as those of K, and its nonzero eigenvalues are smaller than those of K by a constant.

