A COMMUNICATION EFFICIENT FEDERATED KERNEL k-MEANS

Anonymous authors
Paper under double-blind review

Abstract

A federated kernel k-means algorithm is developed in this paper. This algorithm resolves two challenging issues: 1) how to solve the optimization problem of kernel k-means distributedly under federated settings; 2) how to maintain communication efficiency in the algorithm. To tackle the first challenge, a distributed stochastic proximal gradient descent (DSPGD) algorithm is developed to determine an approximate solution to the optimization problem of kernel k-means. To tackle the second challenge, a communication efficient mechanism (CEM) is designed to reduce the communication cost. Besides, the federated kernel k-means provides two levels of privacy preservation: 1) users' local data are not exposed to the cloud server; 2) the cloud server cannot recover users' local data from the local computational results via matrix operations. Theoretical analysis shows: 1) DSPGD with CEM converges with an O(1/T) rate, where T is the number of iterations; 2) the communication cost of DSPGD with CEM is independent of the number of data samples; 3) the clustering quality of the federated kernel k-means approaches that of the standard kernel k-means, with a (1 + ε) approximation ratio. The experimental results show that the federated kernel k-means achieves the highest clustering quality while reducing the communication cost by more than 60% in most cases.

1. INTRODUCTION

Conventionally, kernel k-means (Dhillon et al., 2004) is conducted in a centralized manner where training data are stored in one place, such as a cloud server. However, as a rapidly growing number of devices are connected to the Internet, the volume of generated data increases exponentially (Chiang & Zhang, 2016). Uploading all these data to the cloud server can incur a large communication-bandwidth cost. For example, a smartphone manufacturer usually needs to analyze the usage patterns of its smartphones, aiming to optimize their energy consumption. The usage patterns can be obtained by clustering users' energy consumption data via kernel k-means. However, if the number of users reaches the order of millions, uploading all the users' energy consumption data to the cloud server may not be cost-effective. Besides, uploading users' raw data to the cloud server can lead to data privacy issues. To resolve these issues, a promising approach is to develop a distributed kernel k-means algorithm that can be executed under federated settings (McMahan et al., 2017; Yang et al., 2019) where raw data are maintained by users and the cloud server has no access to them. In this algorithm, a local training process is conducted at each user's device, based on the local data only. The local computational results, rather than the local data, are then uploaded to the cloud server to accomplish the kernel k-means clustering. During this procedure, users' local data are no longer exposed to the cloud server, which provides a basic level of privacy. Besides, it is usually more communication efficient to upload the local computational results than to upload the local data to the cloud server.
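For reference, the centralized procedure that the federated algorithm sets out to replace can be sketched as follows. Kernel k-means admits a spectral relaxation (Dhillon et al., 2004): the points are embedded using the top-k eigenvectors of the kernel matrix K and then clustered with ordinary Lloyd's k-means. The sketch below is illustrative only, not the paper's algorithm; the RBF kernel, the farthest-first initialization, and all function names are assumed choices for the example.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans_spectral(K, k, n_iter=50):
    """Spectral relaxation of kernel k-means: embed each point with the
    top-k eigenvectors of K, then run Lloyd's k-means on the embedding."""
    _, vecs = np.linalg.eigh(K)   # eigh returns eigenvalues in ascending order
    U = vecs[:, -k:]              # top-k eigenvectors form the embedding
    # Farthest-first initialization keeps this sketch deterministic.
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min(((U[:, None, :] - np.asarray(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(U[np.argmax(d)])
    centers = np.asarray(centers)
    for _ in range(n_iter):
        # Assign each embedded point to its nearest center, then recompute centers.
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

Note that this baseline requires assembling the full n-by-n kernel matrix in one place, which is exactly what is infeasible under federated settings; the DSPGD algorithm of this paper instead approximates the required top eigenpairs without centralizing the raw data.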
However, it is nontrivial to design a federated learning algorithm for kernel k-means due to three challenging issues: 1) how to solve the optimization problem of kernel k-means in a distributed manner without sending users' data to a central place; 2) how to maintain communication efficiency in the algorithm; 3) how to protect users' data privacy in the algorithm. Considering the first issue under federated settings, the key problem is to obtain the top eigenpairs of the kernel matrix K (as required by kernel k-means) in a distributed manner. To solve this problem, a distributed stochastic proximal gradient descent (DSPGD) algorithm is developed as follows. Since K is not available

