A COMMUNICATION EFFICIENT FEDERATED KERNEL k-MEANS

Anonymous authors
Paper under double-blind review

Abstract

A federated kernel k-means (FK k-means) algorithm is developed in this paper. This algorithm resolves two challenging issues: 1) how to solve the optimization problem of kernel k-means in a distributed manner under federated settings; 2) how to maintain communication efficiency in the algorithm. To tackle the first challenge, a distributed stochastic proximal gradient descent (DSPGD) algorithm is developed to determine an approximate solution to the optimization problem of kernel k-means. To tackle the second challenge, a communication efficient mechanism (CEM) is designed to reduce the communication cost. Besides, FK k-means provides two levels of privacy preservation: 1) users' local data are not exposed to the cloud server; 2) the cloud server cannot recover users' local data from the local computational results via matrix operations. Theoretical analysis shows that: 1) DSPGD with CEM converges at an O(1/T) rate, where T is the number of iterations; 2) the communication cost of DSPGD with CEM is unrelated to the number of data samples; 3) the clustering quality of FK k-means approaches that of the standard kernel k-means, with a (1 + ε) approximate ratio. The experimental results show that FK k-means achieves the highest clustering quality, with the communication cost reduced by more than 60% in most cases.

1. INTRODUCTION

Conventionally, kernel k-means (Dhillon et al., 2004) is conducted in a centralized manner where training data are stored in one place, such as a cloud server. However, as a rapidly growing number of devices are connected to the Internet, the volume of generated data increases exponentially (Chiang & Zhang, 2016). Uploading all these data to the cloud server can incur a large communication bandwidth cost. For example, a smartphone manufacturer usually needs to analyze the usage patterns of its smartphones, aiming to optimize their energy consumption. The usage patterns can be obtained by clustering users' energy consumption data via kernel k-means. However, if the number of users reaches the order of millions, it may not be cost-effective to upload all the users' energy consumption data to the cloud server. Besides, uploading users' raw data to the cloud server can lead to data privacy issues. To resolve these issues, a promising approach is to develop a distributed kernel k-means algorithm that can be executed under federated settings (McMahan et al., 2017; Yang et al., 2019), where raw data are maintained by users and the cloud has no access to them. In this algorithm, a local training process is conducted at each user's device based on the local data only. The local computational results, rather than the local data, are then uploaded to the cloud server to accomplish the kernel k-means clustering. During this procedure, users' local data are never exposed to the cloud server, which provides a basic level of privacy. Besides, it is usually more communication efficient to upload the local computational results than to upload the local data to the cloud server.
However, it is nontrivial to design a federated learning algorithm for kernel k-means due to three challenging issues: 1) how to solve the optimization problem of kernel k-means in a distributed manner without sending users' data to a central place; 2) how to maintain communication efficiency in the algorithm; 3) how to protect users' data privacy in the algorithm. Considering the first issue under federated settings, the key problem is to obtain the top eigenpairs of the kernel matrix K (as required by kernel k-means) in a distributed manner. To solve this problem, a distributed stochastic proximal gradient descent (DSPGD) algorithm is developed as follows. Since K is not available under federated settings, an unbiased estimate of K, denoted as ξ, is first constructed distributively at users' devices based on random features (Rahimi & Recht, 2008) of the local data samples. Since this estimate is distributed among different devices, it is processed by the distributed Lanczos algorithm (DLA) (Penna & Stańczak, 2014) to obtain a server-side estimate of K (denoted as Z) at the cloud server. Afterwards, an approximate version of the top eigenpairs of K can be obtained from Z through singular value decomposition (SVD). To improve the accuracy of the approximation, the former steps are conducted in an iterative way. More specifically, in the t-th iteration, an estimate ξ_t is constructed at users' devices, and the estimate Z_t at the cloud server is then updated to Z_{t+1} via stochastic proximal gradient descent (SPGD) (Zhang et al., 2016). It is proved that, after sufficient iterations, Z_t converges to a low-rank matrix whose top eigenpairs are the same as those of K. As a result, the top eigenpairs of K are finally obtained at the cloud server. To resolve the second issue, the DLA operations in DSPGD need to be enhanced to reduce the communication cost.
When DLA is executed in DSPGD, the process of obtaining an updated Z_t at the cloud server results in high communication cost between users' devices and the cloud server, because the operation is conducted upon matrices (e.g., ξ_t) with the number of rows/columns equal to the number of data samples. To prevent the communication cost from growing with the number of data samples, a communication efficient mechanism (CEM) is designed so that DLA operates upon a different type of matrices whose dimensions are reduced and independent of the number of data samples. More specifically, a new matrix W_t is designed such that: 1) W_t W_t^⊤ has the same eigenvectors as Z_{t+1}, and its eigenvalues exceed those of Z_{t+1} by a constant; 2) W_t and Z_t can be constructed distributively at users' devices based on local values of ξ_t. Furthermore, DLA is applied to W_t^⊤ W_t (instead of W_t W_t^⊤), so its operations are performed upon matrices with a highly reduced dimension. Via DLA operations between users' devices and the cloud server, W_t and Z_t are updated iteratively, and the top eigenpairs of W_t^⊤ W_t are obtained at users' devices. Once Z_t converges, users' devices transform the top eigenpairs of W_t^⊤ W_t into those of W_t W_t^⊤ and further obtain the eigenpairs of Z_t. Instead of sending these eigenpairs to the cloud server, a distributed linear k-means algorithm (Balcan et al., 2013) is incorporated into CEM so that the cloud server can perform clustering directly on the eigenpairs of the converged Z_t. As shown in the process of CEM, the communication efficiency of DSPGD is significantly improved. For the third issue, FK k-means, based on DSPGD and CEM, provides two levels of privacy preservation: 1) users' local data are not exposed to the cloud server; 2) the cloud server cannot recover users' local data from the local computational results via matrix operations.
To provide stronger privacy, a differential privacy mechanism (Dwork et al., 2006) needs to be integrated with FK k-means, which is subject to future study. The theoretical analysis shows that DSPGD with CEM converges to Z* at an O(1/T) rate, where T is the number of iterations. The communication cost of DSPGD with CEM is linear in the dimension of the right singular vectors times the number of users, which can be much smaller than the number of data samples. The clustering quality of the federated kernel k-means approaches that of kernel k-means, with a (1 + ε) approximate ratio. The experimental results show that, compared with the state-of-the-art schemes, FK k-means achieves the highest clustering quality with the communication cost reduced by more than 60% in most cases.

2. RELATED WORK

2.1. DISTRIBUTED KERNEL k-MEANS

Many algorithms have been developed to conduct kernel k-means clustering in a distributed way. Kernel approximation is a popular approach employed in these algorithms, such as the Nyström method (Chitta et al., 2011; 2014; Wang et al., 2019) and the random feature method (Chitta et al., 2012). A trimmed kernel k-means algorithm (Tsapanos et al., 2015) decreases the computational cost and the space complexity by significantly reducing the number of non-zero entries in K via a kernel matrix trimming algorithm. In (Elgohary et al., 2014), an approximate nearest centroid (APNC) embedding is developed to embed the data samples so that the cluster assignment step of kernel k-means can be executed in parallel. A communication efficient kernel principal component analysis (PCA) algorithm (Balcan et al., 2016), combined with distributed linear k-means, can approximately solve the optimization problem of kernel k-means while maintaining communication efficiency. However, these algorithms are designed under the assumption that they are executed at the cloud server where users' raw data are collected. Besides, many of these algorithms (Chitta et al., 2011; 2012; 2014; Wang et al., 2019) are one-shot algorithms, i.e., they only determine an approximate kernel matrix K once. Thus, their clustering quality is limited by the accuracy of this approximation. In contrast to these algorithms, FK k-means is the first distributed kernel k-means scheme designed under federated settings. In addition, FK k-means is an iterative algorithm that can approach the top eigenpairs of K more accurately by employing more iterations.

2.2. FEDERATED LEARNING

Federated learning (McMahan et al., 2017) is a new machine learning framework aiming to protect users' data privacy and save communication cost during the learning process. In this framework, a local model is updated at each user's device, and these local models, instead of users' local data, are then aggregated at the cloud server to generate a global model. The distributed optimization method in the framework is applicable to models whose optimization problem can be decomposed into several independent subproblems, such as neural networks, and many such algorithms (Konečnỳ et al., 2016; Yang et al., 2018; Yurochkin et al., 2019) have been developed. However, it is non-trivial to decompose the optimization problem of kernel k-means under the federated learning framework. Some algorithms (Liu et al., 2017; Caldas et al., 2018) improve federated multi-task learning (Smith et al., 2017) with kernels. However, these algorithms either employ explicit feature mapping (Liu et al., 2017), which can lead to impractical computational cost, or require sending the support vectors of users' local data (i.e., some local data samples) to the cloud server (Caldas et al., 2018), which can leak users' private information. Due to these limitations, these algorithms are not applicable to kernel k-means under the federated learning framework. Recently, the concept of clustered federated learning (Ghosh et al., 2020; Sattler et al., 2020; Mansour et al., 2020) has been proposed, where the clients are clustered according to their gradient updates or their local models. However, these clustering problems are different from the optimization problem of kernel k-means, and the corresponding methods are thus not feasible for kernel k-means in the federated setting.

2.3. STOCHASTIC KERNEL PCA

In (Zhang et al., 2016), stochastic kernel PCA is accomplished via a stochastic proximal gradient descent (SPGD) algorithm. As a result, SPGD is a centralized counterpart of the distributed stochastic proximal gradient descent (DSPGD) algorithm in FK k-means. However, DSPGD is distinct from SPGD in three aspects. First, DSPGD is conducted under federated settings, while SPGD is conducted in a centralized manner where users' raw data are collected at the cloud server. Second, the communication cost is considered in the design of DSPGD, which results in CEM, while it is not considered in SPGD. Third, although both DSPGD and SPGD aim at approaching the top eigenpairs of K, in the t-th iteration, DSPGD only needs to obtain an approximate solution Z_{t+1} to the problem of updating Z_t, instead of the exact solution Z*_{t+1} to the same problem as SPGD does, which leads to less communication cost under federated settings.

3. PRELIMINARY

Let {x_i}_{i=1}^N ⊆ X be a set of N data samples. Given a feature mapping φ(·): X → H and the number of clusters k, the problem of kernel k-means, whose objective is to find an optimal indicator matrix Y*, can be written as

    min_{Y ∈ {0,1}^{N×k}}  Tr(K) − Tr(L^{1/2} Y^⊤ K Y L^{1/2})   s.t.  Y 1_k = 1_N,   (1)

where K is the kernel matrix with entries K_{ij} = φ(x_i)^⊤ φ(x_j), L^{1/2} = Diag([1/√N_1, ..., 1/√N_k]) is a diagonal matrix, N_i is the number of samples in the i-th cluster, and 1_k is a column vector with all k entries equal to 1. However, the problem in equation (1) is NP-hard (Garey et al., 1982; Wang et al., 2019). To this end, an approximate solution Ỹ is required. An efficient approach to obtaining the approximate solution is as follows: K is decomposed as K = U Λ U^⊤ via eigenvalue decomposition (EVD), and linear k-means is then applied to the matrix H = U Λ^{1/2} to obtain Ỹ (Ding et al., 2005; Chitta et al., 2012; Wang et al., 2019). To reduce the computational complexity, only the first s columns of H are selected as the input of linear k-means (Cohen et al., 2015).
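The EVD-based approximation above can be sketched in a few lines of NumPy. The helper below is only an illustration of the centralized pipeline, not the federated algorithm: it assumes an RBF kernel, a hypothetical `gamma` parameter, and a plain Lloyd's loop with deterministic farthest-point initialization standing in for a full linear k-means implementation.

```python
import numpy as np

def approx_kernel_kmeans(X, k, s, gamma=1.0, n_iter=50):
    """Sketch: decompose K = U Lambda U^T via EVD, then run Lloyd's
    k-means on the first s columns of H = U Lambda^{1/2}."""
    sq = np.sum(X ** 2, axis=1)
    # RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2) (an assumption)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    lam, U = np.linalg.eigh(K)                 # eigh returns ascending eigenvalues
    top = np.argsort(lam)[::-1][:s]            # indices of the top-s eigenpairs
    H = U[:, top] * np.sqrt(np.maximum(lam[top], 0.0))
    # deterministic farthest-point initialization of the k centers
    idx = [0]
    for _ in range(k - 1):
        d = np.min(((H[:, None, :] - H[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(d)))
    centers = H[idx].copy()
    for _ in range(n_iter):                    # Lloyd's iterations on rows of H
        labels = np.argmin(((H[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = H[labels == j].mean(axis=0)
    return labels
```

On well-separated data, clustering the rows of the rank-s matrix H recovers the same partition as kernel k-means on the full K.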

4. FEDERATED KERNEL k-MEANS

In FK k-means, the approximate solution to the problem in equation (1) is also determined based on the top-s eigenpairs of the kernel matrix K. To obtain these eigenpairs under federated settings, a distributed stochastic proximal gradient descent (DSPGD) algorithm is developed in Section 4.1. A communication efficient mechanism is then designed to reduce the communication cost of DSPGD in Section 4.2.

4.1. DISTRIBUTED STOCHASTIC PROXIMAL GRADIENT DESCENT

The key problem in designing FK k-means is to obtain the top-s eigenpairs of K in a distributed manner. To solve this problem, a distributed stochastic proximal gradient descent algorithm is developed as follows. Under federated settings, the main challenge in determining the top-s eigenpairs of K is that K is not available, since users' local data cannot be exposed to the cloud server or to other users. To this end, an estimate of K, denoted as ξ, is constructed distributively at users' devices based on the random features (Rahimi & Recht, 2008; Kar & Karnick, 2012) of their local data samples. More specifically, ξ = (1/D) A A^⊤ and E[ξ] = K, where D is the number of random features of each data sample, and A = [A[1]^⊤, ..., A[M]^⊤]^⊤ is the random feature matrix distributed over M users' devices (the details of the random feature method are included in Appendix A). Since ξ is distributed over users' devices, it is then processed by the distributed Lanczos algorithm (DLA) (Penna & Stańczak, 2014) to obtain an estimate of K, i.e., Z, at the cloud server. Afterwards, an approximate version of the top eigenpairs of K can be obtained from Z through SVD. To improve the accuracy of the approximation, one method is to increase the value of D. However, this method is only feasible when each user's device has enough memory space. To adapt DSPGD to devices with different memory capacities, DSPGD improves the accuracy of the approximation via an iterative method. More specifically, in the t-th iteration, an estimate ξ_t is constructed at users' devices, and the estimate Z_t at the cloud server is then updated to Z_{t+1} via stochastic proximal gradient descent (Zhang et al., 2016):

    Z_{t+1} = argmin_{Z ∈ R^{N×N}}  (1/2) ||Z − Z_t||_F^2 + η_t ⟨Z − Z_t, Z_t − ξ_t⟩ + η_t λ ||Z||_*,

where η_t is a learning rate. Z_{t+1} has the explicit expression

    Z_{t+1} = Σ_{i: λ̃_{i,t} > η_t λ} (λ̃_{i,t} − η_t λ) ũ_{i,t} ũ_{i,t}^⊤,

where (ũ_{i,t}, λ̃_{i,t}) is the i-th eigenpair of the matrix R_t = (1 − η_t) Z_t + η_t ξ_t.
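The explicit expression above amounts to soft-thresholding the spectrum of R_t. A minimal dense-matrix sketch of one such update, assuming NumPy and small matrices, is:

```python
import numpy as np

def spgd_update(Z_t, xi_t, eta_t, lam):
    # One proximal step (sketch, dense matrices assumed): the nuclear-norm
    # proximal problem has the closed-form solution obtained by
    # soft-thresholding the spectrum of R_t = (1 - eta_t) Z_t + eta_t xi_t.
    R_t = (1.0 - eta_t) * Z_t + eta_t * xi_t
    l, U = np.linalg.eigh(R_t)                  # eigenpairs of R_t
    shrunk = np.maximum(l - eta_t * lam, 0.0)   # keep only l_i > eta_t * lam
    return (U * shrunk) @ U.T                   # sum_i (l_i - eta_t*lam) u_i u_i^T
```

In FK k-means this eigendecomposition is never formed on an N×N matrix at one place; the federated computation is described next.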
Since ξ_t is distributed over users' devices, Z_{t+1} is determined via DLA at the cloud server. It is proved that, after sufficient iterations, Z_t converges to a low-rank matrix K̄ = Σ_{i: λ_i > λ} (λ_i − λ) u_i u_i^⊤, where (u_i, λ_i) is the i-th eigenpair of K. As a result, the top eigenpairs of K are finally obtained at the cloud server. The t-th iteration of DSPGD is executed as follows. The main task is to approach the top eigenpairs of R_t via DLA. The cloud server first initializes a random vector c_1 ∈ R^N. In the q-th iteration of DLA, the cloud server determines a vector g_q = R_t c_q = (1 − η_t) Z_t c_q + η_t A_t A_t^⊤ c_q / D, where Z_t c_q is computed at the cloud server and A_t A_t^⊤ c_q is computed in a distributed manner. The computation of A_t A_t^⊤ c_q is accomplished in five steps: 1) the cloud server partitions the vector c_q = [c_q[1]^⊤, ..., c_q[M]^⊤]^⊤ into M parts and sends the m-th part c_q[m] to user m; 2) user m computes a local vector A_t[m]^⊤ c_q[m] and uploads it to the cloud server; 3) the cloud server sums up these vectors to obtain A_t^⊤ c_q and broadcasts it to all users; 4) the m-th user computes a new local vector A_t[m] A_t^⊤ c_q and uploads it to the cloud server; 5) the cloud server finally concatenates these vectors to form A_t A_t^⊤ c_q. Once g_q is determined, the cloud server applies the Lanczos algorithm to the collected vectors {g_1, ..., g_q} to approximate the top eigenpairs of R_t (the details of the Lanczos algorithm and the complete procedure of DLA are provided in Appendix B). After sufficient iterations of DLA, the top eigenpairs of R_t are obtained at the cloud server, and Z_{t+1} is then determined accordingly.
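The five steps can be simulated in a single process as follows. This is only a sketch: the list `A_parts` and the explicit split of c stand in for the actual message exchange between the cloud server and the M users.

```python
import numpy as np

def distributed_AAt_c(A_parts, c, D):
    """Sketch of the five-step federated computation of A_t A_t^T c / D.
    A_parts[m] plays the role of the m-th user's local block A_t[m]."""
    sizes = [a.shape[0] for a in A_parts]
    # step 1: the server partitions c and sends c[m] to user m
    c_parts = np.split(c, np.cumsum(sizes)[:-1])
    # step 2: each user computes A_t[m]^T c[m] locally
    local_vecs = [a.T @ cm for a, cm in zip(A_parts, c_parts)]
    # step 3: the server sums the uploads to get A_t^T c and broadcasts it
    Atc = np.sum(local_vecs, axis=0)
    # step 4: each user computes A_t[m] (A_t^T c) locally
    g_parts = [a @ Atc for a in A_parts]
    # step 5: the server concatenates the uploads into A_t A_t^T c (scaled by 1/D)
    return np.concatenate(g_parts) / D
```

Note that steps 1 and 4 move vectors of length N between server and users, which is what CEM later removes.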

4.2. COMMUNICATION EFFICIENT MECHANISM

When DLA is executed in DSPGD, the process of obtaining an updated Z_t at the cloud server results in high communication cost between the cloud server and users' devices, because its operations are upon matrices (e.g., ξ_t) with the number of rows/columns equal to the number of data samples. To prevent the communication cost from growing with the number of data samples, a communication efficient mechanism (CEM) is designed so that DLA operates upon a different type of matrices whose dimensions are independent of the number of data samples. A new matrix W_t satisfying W_t W_t^⊤ = R_t is constructed as W_t = [√(η_t/D) A_t, √(1 − η_t) B_t]. Assume that B_t is divided in the same way as A_t, i.e., B_t = [B_t[1]^⊤, ..., B_t[M]^⊤]^⊤, where B_t[m] is maintained at the m-th user's device. As a result, a submatrix of W_t can be constructed at the m-th device by W_t[m] = [√(η_t/D) A_t[m], √(1 − η_t) B_t[m]]. Furthermore, DLA is applied to W_t^⊤ W_t instead of W_t W_t^⊤. The number of rows/columns of W_t^⊤ W_t equals r_t + D, where r_t is the rank of Z_t and D is the number of random features. Since r_t + D is usually much smaller than the number of data samples N, DLA operates upon matrices with a highly reduced dimension. In the q-th iteration of DLA, the computation of g_q = W_t^⊤ W_t c_q is accomplished in three steps: 1) the cloud server first broadcasts a vector c_q; 2) each user m computes a local vector W_t[m]^⊤ W_t[m] c_q and uploads it to the cloud server; 3) the cloud server sums up these vectors to obtain g_q. The cloud server then transforms g_q into c_{q+1} following the Lanczos iteration. After sufficient iterations of DLA, the top-s_t eigenpairs {(λ̃_{i,t}, ṽ_{i,t}), i = 1, ..., s_t} of W_t^⊤ W_t converge at the cloud server. The cloud server then broadcasts these eigenpairs to all the users' devices.
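The three-step computation can be sketched in the same simulated style. `build_W_part` is a hypothetical helper mirroring the definition of W_t[m]; the vectors exchanged now have length r_t + D, independent of N.

```python
import numpy as np

def build_W_part(A_m, B_m, eta_t, D):
    # W_t[m] = [sqrt(eta_t/D) A_t[m], sqrt(1 - eta_t) B_t[m]]
    return np.hstack([np.sqrt(eta_t / D) * A_m, np.sqrt(1.0 - eta_t) * B_m])

def cem_WtW_c(W_parts, c):
    """Sketch of CEM's three-step computation of g = W_t^T W_t c."""
    # step 1: the server broadcasts c (length r_t + D)
    # step 2: user m computes W_t[m]^T (W_t[m] c) locally
    local_vecs = [w.T @ (w @ c) for w in W_parts]
    # step 3: the server sums the uploaded local vectors
    return np.sum(local_vecs, axis=0)
```

Each upload is a single vector of length r_t + D, compared with the length-N vectors exchanged by DLA on ξ_t.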
Since user m keeps W_t[m], it then determines s_t local vectors ũ_{i,t}[m] = (1/√λ̃_{i,t}) W_t[m] ṽ_{i,t}, i = 1, ..., s_t, which form its block of the top eigenvectors of W_t W_t^⊤. The user then constructs its submatrix B_{t+1}[m] of B_{t+1} via B_{t+1}[m] = [√(λ̃_{1,t} − η_t λ) ũ_{1,t}[m], ..., √(λ̃_{s_t,t} − η_t λ) ũ_{s_t,t}[m]], which enables the (t + 1)-th iteration of DSPGD with CEM. After DSPGD with CEM converges in T iterations, a matrix H[m] is constructed at the m-th user's device by H[m] = [√(λ̃_{1,T} + (1 − η_T)λ) ũ_{1,T}[m], ..., √(λ̃_{s,T} + (1 − η_T)λ) ũ_{s,T}[m]]. A distributed linear k-means algorithm (Balcan et al., 2013) is then applied to the rows of H = [H[1]^⊤, ..., H[M]^⊤]^⊤ to obtain the clustering assignment for each data sample.
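The per-user transformation from the eigenpairs of W_t^⊤ W_t to the local blocks of B_{t+1} (or of H after the last iteration) can be sketched with a single hypothetical helper covering both column scalings:

```python
import numpy as np

def cem_local_update(W_m, evals, evecs, eta_t, lam, final=False, eta_T=None):
    """Sketch: given the top eigenpairs (evals, evecs) of W_t^T W_t, user m
    maps them to its block of the eigenvectors of W_t W_t^T via
    u_i[m] = W_t[m] v_i / sqrt(l_i), then scales the columns."""
    U_m = (W_m @ evecs) / np.sqrt(evals)             # local block of each u_i
    if final:
        scale = np.sqrt(evals + (1.0 - eta_T) * lam)  # columns of H[m]
    else:
        scale = np.sqrt(evals - eta_t * lam)          # columns of B_{t+1}[m]
    return U_m * scale
```

Stacking the returned blocks over all users reproduces B_{t+1} (and hence Z_{t+1} = B_{t+1} B_{t+1}^⊤) without any user revealing its raw features.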

5. THEORETICAL ANALYSIS

The convergence of DSPGD with CEM is analyzed in Section 5.1. The communication cost of CEM is analyzed in Section 5.2, which shows CEM is important for FK k-means to maintain the communication efficiency. It is then proved that the clustering quality of FK k-means can approach that of the standard kernel k-means in Section 5.3. Besides, the privacy preservation provided by FK k-means is analyzed in Section 5.4.

5.1. CONVERGENCE ANALYSIS FOR DSPGD

The convergence rate of DSPGD with CEM is derived in Theorem 1.

Algorithm 1 DSPGD with CEM (continued)
8:   Construct W_t[m] = [√(η_t/D) A_t[m], √(1 − η_t) B_t[m]]
9:  end for
10: Call DLA to determine the eigenpairs {(λ̃_{i,t}, ṽ_{i,t}), i = 1, ..., s_t} of W_t^⊤ W_t
11: for each client m, m = 1, 2, ..., M in parallel do
12:   Compute ũ_{i,t}[m] = (1/√λ̃_{i,t}) W_t[m] ṽ_{i,t} for i = 1, ..., s_t
13:   Compute B_{t+1}[m] = [√(λ̃_{1,t} − η_t λ) ũ_{1,t}[m], ..., √(λ̃_{s_t,t} − η_t λ) ũ_{s_t,t}[m]]
14: end for
15: end for
16: for each client m, m = 1, 2, ..., M in parallel do
17:   Construct H[m] = [√(λ̃_{1,T} + (1 − η_T)λ) ũ_{1,T}[m], ..., √(λ̃_{s,T} + (1 − η_T)λ) ũ_{s,T}[m]]
18: end for
19: Apply a distributed linear k-means algorithm over the rows of H = [H[1]^⊤, ..., H[M]^⊤]^⊤
20: Return the clustering assignment for each data sample

Theorem 1. Define γ = max_{t∈[T]} ||Z_t||_* and C² = max_{t∈[T]} ||Z_t − ξ_t||_F². Assume ||ξ_t − K||_F ≤ G and ||Z_t − Z*||_F ≤ H, ∀t > 2. By setting η_t = 2/t, the following inequality holds with probability at least 1 − δ:

    ||Z_{T+1} − Z*||_F² ≤ (4/T)(C² + λγ + 2G²τ + (2/3)GHτ + GH) = O(1/T),   (2)

where τ = log(⌈2 log₂ T⌉/δ). The result in Theorem 1 indicates that DSPGD with CEM converges to Z* at an O(1/T) rate. The proof of Theorem 1 is provided in Appendix C.

5.2. COMMUNICATION COST ANALYSIS FOR CEM

Define the communication cost as the number of floating-point numbers uploaded from users' devices to the cloud server. In Theorem 2, the communication cost of DSPGD with CEM and that of DSPGD without CEM are both analyzed.

Theorem 2. For DSPGD with CEM, in the t-th iteration, the communication cost is linear in r_t + D, where r_t is the rank of Z_t and D is the number of random features. Define the communication ratio as the ratio of the communication cost of DSPGD without CEM to that of DSPGD with CEM. In the t-th iteration, the communication ratio equals (N + MD) Q_0 / (M (r_t + D) Q_1), where N is the number of data samples, M is the number of users, and Q_0 and Q_1 are the numbers of Lanczos iterations for DSPGD without CEM and DSPGD with CEM, respectively.

By Theorem 2, the communication cost of DSPGD with CEM is unrelated to the number of data samples N, and the communication cost reduced by CEM is revealed by the ratio. The values of Q_0 and Q_1 are affected by the selection of the initial vector c_1. However, empirically the values of Q_0 and Q_1 are of the same order regardless of the chosen initial vectors, so the dominant factor of the ratio is (N + MD)/(M (r_t + D)). Since Z_t is used to approach the top-s eigenpairs of K, its rank r_t empirically has an upper bound. In our experiments, the value of r_t is of the same order as s, i.e., the number of eigenvectors of K to be determined by DSPGD. Usually, the number of data samples at a user's device is much larger than s, so the condition N > M r_t is easily satisfied, and CEM can definitely reduce the communication cost for DSPGD in these cases. The proof of Theorem 2 and the empirical results for r_t are given in Appendix D.
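As a rough arithmetic illustration of the ratio in Theorem 2 (all values below are hypothetical, chosen only to show the order of magnitude):

```python
# Hypothetical setting: N samples, M users, D random features,
# empirical rank r_t of Z_t, and comparable Lanczos iteration counts.
N, M, D = 60_000, 5, 200
r_t = 20
Q0, Q1 = 30, 30

# communication ratio from Theorem 2: (N + M*D) Q0 / (M (r_t + D) Q1)
ratio = (N + M * D) * Q0 / (M * (r_t + D) * Q1)
print(round(ratio, 1))  # roughly a 55x reduction in this setting
```

Even with Q_0 = Q_1, the (N + MD)/(M(r_t + D)) factor alone yields a large reduction whenever N dominates.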

5.3. APPROXIMATE RATIO ANALYSIS FOR FEDERATED KERNEL k-MEANS

Before the analysis, a γ-approximate algorithm is first defined as follows.

Definition 1. A linear k-means algorithm is applied to a matrix H with N rows, where an indicator matrix Ỹ is obtained. This algorithm is called a γ-approximate algorithm if, for any matrix H, f(Ỹ; H) ≤ γ min_Y f(Y; H), where f is the objective function of linear k-means.

It has been proved that the standard kernel k-means algorithm (Dhillon et al., 2004) is a γ-approximate algorithm (Wang et al., 2019). The approximate ratio of FK k-means is then derived in Theorem 3.

Theorem 3. The objective function of kernel k-means in (1) is denoted as f_K. H_T is the output of DSPGD with CEM after T iterations, and a γ-approximate algorithm is applied to the first s columns of H_T to obtain Ỹ_T. Assume the assumptions in Theorem 1 hold. For Ỹ_T, the following inequality holds with probability at least 1 − δ(T):

    f_K(Ỹ_T) ≤ γ (1 + ε + k/s) min_Y f_K(Y),

where ε = O(√(s/T)). Note that δ(T) decreases as T increases. If s = O(k/ε) and T = O(k/ε³), the approximate ratio becomes γ(1 + O(ε)).

5.4. PRIVACY PRESERVATION ANALYSIS

By Theorem 4, the cloud server cannot recover the random feature matrices from the local computational results. Without such random feature matrices, it is infeasible for the cloud server to recover users' local data via matrix operations. Further explanation and the proof of Theorem 4 are provided in Appendix F. Moreover, FK k-means can incorporate the differential privacy mechanism (Dwork et al., 2006; Su et al., 2016) or random perturbation (Lin, 2016) to provide a higher level of privacy preservation, which is subject to future work.

6. EXPERIMENTS

6.1. EXPERIMENTAL SETTINGS

Figure 1: The convergence curves of the two versions of DSPGD and the standard deviation curves of the normalized recovery error on the Mushrooms dataset and the MNIST-small dataset.

The kernel parameter is set to (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} ||x_i − x_j||₂². The threshold parameter λ of DSPGD is set to the (k + 2)-th eigenvalue obtained in the first iteration of DSPGD. The configuration of the number of random features D and the parameter s in the top-s eigenvectors for each dataset is provided in Table 1 (the hyperparameter configuration of the existing methods and a discussion on the configuration of D are provided in Appendix G). In the experiments, M = 5 worker processes and one coordinator process are generated to simulate users' devices and the cloud server, respectively. The worker processes communicate with the coordinator process via the message passing interface (MPI) in a synchronized manner. All the experiments are executed on a server with one i7-6850k CPU and 32 GB RAM.

6.2. EXPERIMENTAL RESULTS

The experimental results are presented from three aspects. First, the convergence results of DSPGD are shown in Figure 1 to verify its convergence rate. Second, the average communication cost per iteration of the two versions of DSPGD is provided in Figure 2 to show that CEM highly reduces the communication cost of DSPGD. Third, in Figure 3, FK k-means is compared with the cloud-based kernel k-means schemes in terms of clustering quality to show that FK k-means can achieve clustering results comparable to those of the cloud-based schemes; FK k-means is also compared with the existing distributed kernel k-means schemes under federated settings to show the higher communication efficiency of FK k-means. The convergence of DSPGD is validated on two datasets, Mushrooms and MNIST-small, whose low-rank matrices K̄ = Σ_{i=1}^{s} λ_i u_i u_i^⊤ can be computed by performing SVD on their kernel matrices K. A normalized recovery error ||K_t − K̄||_F² / N² (Zhang et al., 2016) is recorded for each iteration t of DSPGD, where K_t = Σ_{i=1}^{s} (λ̃_{i,t} + (1 − η_t)λ) ũ_{i,t} ũ_{i,t}^⊤ is the estimate of K̄ at iteration t. In the left subfigures of Figures 1(a) and 1(b), the convergence curves of DSPGD lie below the curves 0.4/t and 0.03/t, respectively, which verifies that DSPGD converges at an O(1/t) rate. In Figures 1(a) and 1(b), the curves of the two versions of DSPGD nearly overlap, indicating that CEM has little impact on the convergence of DSPGD. The average communication cost per iteration of DSPGD with CEM and that of DSPGD without CEM are compared in Figure 2 to evaluate the effectiveness of CEM. The y-axis of each subfigure is in log scale, and its unit is the number of floating-point numbers. As shown in the four subfigures, CEM reduces the communication cost of DSPGD by more than 98%, which indicates that CEM is important for FK k-means to maintain communication efficiency.
In order to evaluate the clustering quality and the communication cost of FK k-means, curves of average normalized mutual information (NMI) (Strehl & Ghosh, 2002) versus average communication cost are plotted in Figure 3.

7. CONCLUSION

In this paper, FK k-means was developed. In the algorithm, a distributed stochastic proximal gradient descent approach was first designed to determine the eigenpairs of the kernel matrix in a distributed manner. A communication efficient mechanism was then designed to reduce the communication cost. In the theoretical analysis, DSPGD with CEM was proved to converge at an O(1/T) rate, the communication cost of DSPGD with CEM was shown to be unrelated to the number of data samples, and the clustering loss of FK k-means was shown to approach that of the centralized kernel k-means. It was also analyzed that FK k-means provides two levels of privacy preservation. The effectiveness of FK k-means was validated by experiments on several real-world datasets. FK k-means can still be improved in terms of asynchronous execution, robustness to dropout users, and stronger privacy, which are interesting topics for our future work.

Algorithm 2 Lanczos Algorithm
1:  Input: a symmetric matrix R, an initial vector c_1
2:  Output: an approximation C_Q P_Q to the eigenvectors of R, and an approximation σ = [σ_1, ..., σ_Q] to the eigenvalues of R
3:  Initialize β_0 = 0 and c_0 = 0
4:  for q = 1, 2, ..., Q do
5:    g = R c_q
6:    α_q = c_q^⊤ g
7:    g = g − α_q c_q − β_{q−1} c_{q−1}
8:    β_q = ||g||_2
9:    if β_q = 0 then
10:     break
11:   end if
12:   c_{q+1} = g / β_q
13:   Construct a symmetric tridiagonal matrix T_Q
14:   Perform EVD on T_Q to obtain its eigenvectors P_Q and its eigenvalues σ = [σ_1, ..., σ_Q]
15:   Compute C_Q P_Q
16: end for

A DETAILS OF RANDOM FEATURE METHOD

For a kernel matrix K, a random feature method (Rahimi & Recht, 2008; Kar & Karnick, 2012) can generate an unbiased estimate of K, denoted as ξ, with the expression ξ = (1/D) A A^⊤, where the i-th row of A is the random feature vector a(x_i) for the data sample x_i. The matrix ξ satisfies E[ξ] = K. We then use the example of shift-invariant kernels to show how a random feature vector is constructed.
For popular shift-invariant kernels κ(x_i, x_j) with the Fourier representation κ(x_i, x_j) = ∫ p(w) exp(j w^⊤ (x_i − x_j)) dw, where p(w) is a probability density function, the kernel can be estimated by random Fourier features (Rahimi & Recht, 2008) as follows. By randomly drawing D independent samples {w_1, ..., w_D} from p(w), a random feature vector a(x_i) for a data sample x_i can be written as

    a(x_i) = [√2 cos(w_1^⊤ x_i + b_1), ..., √2 cos(w_D^⊤ x_i + b_D)]^⊤,

where {b_1, ..., b_D} are independent random variables drawn uniformly from [0, 2π). As a result, an unbiased estimate of K can be written as ξ = (1/D) A A^⊤, where A = [a(x_1), ..., a(x_N)]^⊤.
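A minimal sketch of this construction for the Gaussian kernel, assuming bandwidth σ (so that p(w) = N(0, σ^{-2} I)) and NumPy, is:

```python
import numpy as np

def rff_matrix(X, D, sigma=1.0, seed=0):
    # Draw w_d ~ N(0, I/sigma^2) and b_d ~ Uniform[0, 2*pi), then
    # a(x) = [sqrt(2) cos(w_d^T x + b_d)]_{d=1..D}; rows of the result are a(x_i).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Wf = rng.normal(0.0, 1.0 / sigma, size=(D, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0) * np.cos(X @ Wf.T + b)

# xi = A A^T / D is an unbiased estimate of the Gaussian kernel matrix K
X = np.random.default_rng(1).normal(size=(5, 3))
A = rff_matrix(X, D=30_000)
xi = A @ A.T / A.shape[1]
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)  # exact kernel, sigma = 1
```

With a large enough D, the entries of ξ concentrate around those of K, which is the sense in which ξ estimates K in DSPGD.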

B DETAILS ABOUT DISTRIBUTED LANCZOS ALGORITHM

To find the eigenpairs of a symmetric matrix $R$, the Lanczos algorithm (LA) (Lanczos, 1950) first builds a Krylov subspace $\mathcal{K}_q(R, c_1) = \mathrm{span}[c_1, Rc_1, \ldots, R^{q-1}c_1]$, where $c_1$ is an initial vector, and then employs the Rayleigh-Ritz procedure to construct the best approximate eigenpairs of $R$ in the Krylov subspace. In the first step, LA constructs an orthogonal basis of the Krylov subspace following lines 5 to 12 of Algorithm 2. Meanwhile, a symmetric tridiagonal matrix $T_Q = C_Q^\top R C_Q$ can be constructed explicitly from $\alpha_Q$ and $\beta_Q$ via
$$T_Q = \begin{bmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{Q-1} \\ & & \beta_{Q-1} & \alpha_Q \end{bmatrix}.$$
Based on $T_Q$, the Rayleigh-Ritz procedure can be utilized to approximate the eigenpairs of $R$. Let $T_Q = P_Q \Sigma_Q P_Q^\top$ be the eigendecomposition of $T_Q$. It has been proved that the columns of $C_Q P_Q$ and the diagonal entries of $\Sigma_Q$ are the optimal approximations to the eigenvectors and eigenvalues of $R$, respectively (Demmel, 1997). Thus, in the Rayleigh-Ritz procedure, $P_Q$ and $\Sigma_Q$ are determined by performing EVD on $T_Q$, and $C_Q P_Q$ is then computed as the approximation to the eigenvectors of $R$. As the number of iterations $Q$ increases, the columns of $C_Q P_Q$ and the diagonal entries of $\Sigma_Q$ converge to the eigenvectors and eigenvalues of $R$, respectively (Demmel, 1997).

In the distributed Lanczos algorithm (DLA), only the step in line 5 of Algorithm 2 is conducted in a distributed manner; the other steps are conducted at the cloud server. In our problem, if $R = (1-\eta_t)Z_t + \eta_t \xi_t = (1-\eta_t)Z_t + \frac{\eta_t}{D} A_t A_t^\top$, where $Z_t$ and $c_q$ are known at the cloud server and $A_t = [A_t[1]^\top, \ldots, A_t[M]^\top]^\top$ is distributed over $M$ users' devices, then $(1-\eta_t)Z_t c_q$ is computed at the cloud server and $\eta_t A_t A_t^\top c_q / D$ is computed in a distributed manner as follows. The vector $c_q = [c_q[1]^\top, \ldots, c_q[M]^\top]^\top$ is first partitioned into $M$ parts at the cloud server, and the $m$-th part $c_q[m]$ is sent to the $m$-th user's device. A local vector $A_t[m]^\top c_q[m]$ is then computed at the $m$-th user's device. These local vectors from the $M$ users' devices are summed up at the cloud server to obtain the vector $A_t^\top c_q$, which is then broadcast to the $M$ users' devices, and a vector $A_t[m] A_t^\top c_q$ is computed at the $m$-th user's device. These $M$ vectors are sent back to the cloud server, where they are concatenated to form $A_t A_t^\top c_q$. If $R = W_t^\top W_t = \sum_{m=1}^{M} W_t[m]^\top W_t[m]$, where $W_t[m]^\top W_t[m]$ can be computed at the $m$-th user's device, then each user's device first determines $W_t[m]^\top W_t[m] c_q$ locally, and these $M$ vectors are then uploaded to the cloud server, where they are summed up to form $W_t^\top W_t c_q$.

In the $t$-th iteration of DSPGD, DLA is used to compute the eigenvalues of $R_t$ larger than $\eta_t \lambda$. Thus, in practice, the convergence criterion of DLA is that all the approximated eigenvalues larger than $\eta_t \lambda$ converge, rather than that the number of iterations reaches its maximal value $Q$. One practical issue of LA and DLA is that they can only be conducted in floating point arithmetic, which can destroy the orthogonality of the columns of $C_q$ and further affect the convergence of DSPGD. To this end, a full reorthogonalization method (Demmel, 1997) is utilized to guarantee that $C_q$ is an orthogonal matrix with high probability. The key idea of this method is to generate a new vector $c_q$ from a subspace that is orthogonal to all the previous vectors $\{c_1, \ldots, c_{q-1}\}$, which can be accomplished by replacing line 7 in Algorithm 2 with
$$g = g - \sum_{i=1}^{q-1} (g^\top c_i)\, c_i. \quad (3)$$
The operation in (3) can be called multiple times in one iteration of LA to increase the probability that $C_q$ is an orthogonal matrix. In the implementation of federated kernel k-means, this operation is called twice in each iteration of DLA.
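The two distributed products described above can be checked numerically. In the following NumPy sketch (illustrative only; the number of users $M$, the feature dimension $D$, and the per-user sample counts are hypothetical), per-user blocks stand in for users' devices and the sums stand in for server-side aggregation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, r = 3, 8, 4
# per-user blocks of the random feature matrix A_t (rows = local samples)
A = [rng.standard_normal((5, D)) for _ in range(M)]
A_full = np.vstack(A)
N = A_full.shape[0]

# --- product without CEM: A_t A_t^T c_q via two communication rounds ---
c = rng.standard_normal(N)
parts = np.split(c, M)                        # server partitions c_q
s = sum(A[m].T @ parts[m] for m in range(M))  # users upload A_t[m]^T c_q[m]; server sums
out = np.concatenate([A[m] @ s for m in range(M)])  # broadcast s; users return A_t[m] (A_t^T c_q)
assert np.allclose(out, A_full @ (A_full.T @ c))

# --- product with CEM: W_t^T W_t c_q via one (D+r)-dimensional upload per user ---
W = [np.hstack([A[m], rng.standard_normal((5, r))]) for m in range(M)]
W_full = np.vstack(W)
c2 = rng.standard_normal(D + r)
out2 = sum(W[m].T @ (W[m] @ c2) for m in range(M))  # each user uploads W_t[m]^T W_t[m] c_q
assert np.allclose(out2, W_full.T @ (W_full @ c2))
```

The second pattern is what makes the CEM cost independent of the number of samples: each upload has dimension $D + r$ rather than the local sample count.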
Note that the full reorthogonalization only requires additional flops at the cloud server and thus does not affect the computational complexity at users' devices.
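Under the stated assumptions, a minimal single-machine sketch of LA with full reorthogonalization and the Rayleigh-Ritz step might look as follows (illustrative, not the paper's Algorithm 2; in DLA, only the matrix-vector product `matvec` would be evaluated distributedly):

```python
import numpy as np

def lanczos_ritz(matvec, n, Q, rng):
    """Lanczos with full reorthogonalization (applied twice per iteration, as
    in the paper's DLA), followed by the Rayleigh-Ritz step: eigendecompose
    the tridiagonal T_Q and map its eigenvectors through the basis C_Q."""
    c = rng.standard_normal(n)
    C = [c / np.linalg.norm(c)]
    alphas, betas = [], []
    for q in range(Q):
        g = matvec(C[q])
        alphas.append(C[q] @ g)          # alpha_q = c_q^T R c_q
        for _ in range(2):               # full reorthogonalization, called twice
            for c_i in C:
                g -= (g @ c_i) * c_i
        beta = np.linalg.norm(g)
        if q == Q - 1 or beta < 1e-12:
            break
        betas.append(beta)
        C.append(g / beta)
    T = np.diag(alphas)
    if betas:
        T += np.diag(betas, 1) + np.diag(betas, -1)
    ritz_vals, P = np.linalg.eigh(T)     # eigendecomposition of T_Q
    return ritz_vals, np.column_stack(C) @ P  # approximate eigenpairs of R

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
R = B @ B.T                              # symmetric test matrix
vals, vecs = lanczos_ritz(lambda v: R @ v, n, Q=n, rng=rng)
# with Q = n, the Ritz values match the exact eigenvalues
assert np.allclose(np.sort(vals), np.linalg.eigvalsh(R), atol=1e-8)
```

In practice one would stop as soon as the Ritz values above the threshold $\eta_t\lambda$ have converged, as described above, rather than running all $Q = n$ iterations.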

C PROOF OF THEOREM 1

This proof partially follows the proof of Theorem 1 in Zhang et al. (2016). The difference is that the $t$-th iterate obtained by DSPGD, i.e., $Z_t$, may not equal $Z_t^* = \mathcal{D}_{\eta_t\lambda}[(1 - \eta_t)Z_t + \eta_t\xi_t]$. The gap between $Z_t$ and $Z_t^*$ is caused by the fact that the distributed Lanczos algorithm (DLA) only approximates the eigenpairs of a target matrix. Thus, in this proof, it is assumed that $\|Z_t - Z_t^*\|_F \le \epsilon$, $\forall t$, when DLA reaches its convergence criterion, where $\epsilon \ll 1$. Before the proof, we first define
$$F(Z) = \frac{1}{2}\mathbb{E}[\|Z - \xi\|_F^2], \qquad f_t(Z) = \frac{1}{2}\|Z - \xi_t\|_F^2.$$
For a $\mu$-strongly convex function $l(Z)$ with minimizer $Z_2$, it holds that
$$l(Z_1) - l(Z_2) \ge \frac{\mu}{2}\|Z_1 - Z_2\|_F^2. \quad (5)$$
In the $(t+1)$-th iteration of DSPGD, the goal is to determine the optimal solution $Z_{t+1}^*$ to the following optimization problem:
$$\min_{Z\in\mathbb{R}^{n\times n}} \frac{1}{2}\|Z - Z_t\|_F^2 + \eta_t\langle Z - Z_t, \nabla f_t(Z_t)\rangle + \eta_t\lambda\|Z\|_*. \quad (6)$$
By DLA, an approximate solution $Z_{t+1}$ that satisfies (4) can be obtained. The following lemma is a key step in this proof.

Lemma 1. Before the convergence of DSPGD, the following inequality holds:
$$\frac{1}{2}\|Z_{t+1} - Z_t\|_F^2 + \eta_t\langle Z_{t+1} - Z_t, \nabla f_t(Z_t)\rangle + \eta_t\lambda\|Z_{t+1}\|_* \le \frac{1}{2}\|Z^* - Z_t\|_F^2 + \eta_t\langle Z^* - Z_t, \nabla f_t(Z_t)\rangle + \eta_t\lambda\|Z^*\|_*.$$

Proof. The objective function in (6) can be rewritten as
$$\begin{aligned}
&\frac{1}{2}\|Z - Z_t\|_F^2 + \eta_t\langle Z - Z_t, \nabla f_t(Z_t)\rangle + \eta_t\lambda\|Z\|_* \\
&= \frac{1}{2}\|Z - Z_t\|_F^2 + \eta_t\langle Z - Z_t, \nabla f_t(Z_t)\rangle + \frac{\eta_t^2}{2}\|\nabla f_t(Z_t)\|_F^2 - \frac{\eta_t^2}{2}\|\nabla f_t(Z_t)\|_F^2 + \eta_t\lambda\|Z\|_* \\
&= \frac{1}{2}\|Z - [(1 - \eta_t)Z_t + \eta_t\xi_t]\|_F^2 + \eta_t\lambda\|Z\|_* - \frac{\eta_t^2}{2}\|\nabla f_t(Z_t)\|_F^2.
\end{aligned}$$
Since $\frac{\eta_t^2}{2}\|\nabla f_t(Z_t)\|_F^2$ is a constant, we only need to consider $l(Z) = \frac{1}{2}\|Z - [(1 - \eta_t)Z_t + \eta_t\xi_t]\|_F^2 + \eta_t\lambda\|Z\|_*$ in the rest of the proof. Suppose first that $l(Z^*) \le l(Z_{t+1})$. Since $Z_{t+1}^*$ is the minimizer of $l(Z)$, we have
$$l(Z_{t+1}) - l(Z_{t+1}^*) \ge l(Z^*) - l(Z_{t+1}^*) \ge \frac{\mu}{2}\|Z_{t+1}^* - Z^*\|_F^2. \quad (8)$$
Moreover, $l(Z_{t+1}) - l(Z_{t+1}^*)$ can be expanded as
$$\begin{aligned}
l(Z_{t+1}) - l(Z_{t+1}^*) &= \frac{1}{2}\|Z_{t+1} - R_t\|_F^2 + \eta_t\lambda\|Z_{t+1}\|_* - \frac{1}{2}\|Z_{t+1}^* - R_t\|_F^2 - \eta_t\lambda\|Z_{t+1}^*\|_* \\
&= \frac{1}{2}(\|Z_{t+1} - R_t\|_F - \|Z_{t+1}^* - R_t\|_F)(\|Z_{t+1} - R_t\|_F + \|Z_{t+1}^* - R_t\|_F) + \eta_t\lambda(\|Z_{t+1}\|_* - \|Z_{t+1}^*\|_*) \\
&\le \frac{1}{2}\|Z_{t+1} - Z_{t+1}^*\|_F(\|Z_{t+1} - Z_{t+1}^*\|_F + 2\|Z_{t+1}^* - R_t\|_F) + \eta_t\lambda\|Z_{t+1} - Z_{t+1}^*\|_*. \quad (9)
\end{aligned}$$
It is well known that, given a matrix $M$, the following inequality holds for its nuclear norm and its Frobenius norm: $\|M\|_*^2 \le \mathrm{rank}(M)\|M\|_F^2$. By this inequality, we have
$$\|Z_{t+1} - Z_{t+1}^*\|_* \le \sqrt{r}\|Z_{t+1} - Z_{t+1}^*\|_F \le \sqrt{r}\,\epsilon, \quad (10)$$
where $r$ is the rank of $Z_{t+1} - Z_{t+1}^*$. Substituting (4) and (10) into (9), we have
$$l(Z_{t+1}) - l(Z_{t+1}^*) \le \frac{1}{2}\epsilon^2 + \epsilon\|Z_{t+1}^* - R_t\|_F + \eta_t\lambda\sqrt{r}\,\epsilon.$$
Since $\|Z_{t+1}^* - R_t\|_F$ is a constant, this upper bound of $l(Z_{t+1}) - l(Z_{t+1}^*)$ becomes arbitrarily small if $\epsilon$ is arbitrarily small. Hence, according to (8), $\|Z_{t+1}^* - Z^*\|_F^2$ can also be arbitrarily small. However, this contradicts the fact that $\|Z_{t+1}^* - Z^*\|_F^2$ cannot become arbitrarily small before the convergence of DSPGD. Therefore, before the convergence of DSPGD, $l(Z^*) \ge l(Z_{t+1})$ is satisfied. The rest then follows the proof of Theorem 1 in Zhang et al. (2016). Based on Lemma 1 and the property of strongly convex functions in (5), the update rule of DSPGD implies
$$\frac{1}{2}\|Z_{t+1} - Z_t\|_F^2 + \eta_t\langle Z_{t+1} - Z_t, \nabla f_t(Z_t)\rangle + \eta_t\lambda\|Z_{t+1}\|_* \le \frac{1}{2}\|Z^* - Z_t\|_F^2 + \eta_t\langle Z^* - Z_t, \nabla f_t(Z_t)\rangle + \eta_t\lambda\|Z^*\|_* - \frac{1}{2}\|Z^* - Z_{t+1}\|_F^2.$$
Since $F(Z)$ is 1-strongly convex, it can be shown that
$$\begin{aligned}
\frac{1}{2}\|Z_t - Z^*\|_F^2 &\le F(Z_t) + \lambda\|Z_t\|_* - F(Z^*) - \lambda\|Z^*\|_* \\
&\le \langle Z_t - Z^*, \nabla F(Z_t)\rangle - \frac{1}{2}\|Z_t - Z^*\|_F^2 + \lambda\|Z_t\|_* - \lambda\|Z^*\|_* \\
&= \langle Z_t - Z^*, \nabla f_t(Z_t)\rangle - \lambda\|Z^*\|_* - \frac{1}{2\eta_t}\|Z_t - Z^*\|_F^2 + \lambda\|Z_t\|_* - \frac{1}{2}\|Z_t - Z^*\|_F^2 \\
&\quad + \frac{1}{2\eta_t}\|Z_t - Z^*\|_F^2 + \langle \nabla F(Z_t) - \nabla f_t(Z_t), Z_t - Z^*\rangle \quad (11) \\
&\le \langle Z_t - Z_{t+1}, \nabla f_t(Z_t)\rangle - \lambda\|Z_{t+1}\|_* - \frac{1}{2\eta_t}\|Z_{t+1} - Z_t\|_F^2 - \frac{1}{2\eta_t}\|Z^* - Z_{t+1}\|_F^2 + \lambda\|Z_t\|_* \\
&\quad + \frac{1}{2}\left(\frac{1}{\eta_t} - 1\right)\|Z_t - Z^*\|_F^2 + \langle \nabla F(Z_t) - \nabla f_t(Z_t), Z_t - Z^*\rangle \\
&\le \max_{W}\left[\langle W, \nabla f_t(Z_t)\rangle - \frac{1}{2\eta_t}\|W\|_F^2\right] - \frac{1}{2\eta_t}\|Z_{t+1} - Z^*\|_F^2 + \lambda\|Z_t\|_* - \lambda\|Z_{t+1}\|_* \\
&\quad + \frac{1}{2}\left(\frac{1}{\eta_t} - 1\right)\|Z_t - Z^*\|_F^2 + \langle \nabla F(Z_t) - \nabla f_t(Z_t), Z_t - Z^*\rangle \\
&= \frac{\eta_t}{2}\|\nabla f_t(Z_t)\|_F^2 - \frac{1}{2\eta_t}\|Z_{t+1} - Z^*\|_F^2 + \lambda\|Z_t\|_* - \lambda\|Z_{t+1}\|_* \\
&\quad + \frac{1}{2}\left(\frac{1}{\eta_t} - 1\right)\|Z_t - Z^*\|_F^2 + \langle \nabla F(Z_t) - \nabla f_t(Z_t), Z_t - Z^*\rangle, \quad (12)
\end{aligned}$$
where the third inequality holds based on the inequality in (11).

By substituting $\delta_t = \langle \xi_t - K, Z_t - Z^* \rangle$ and $C^2 = \max_{t\in[T]} \|Z_t - \xi_t\|_F^2$ into (12), we obtain
$$\|Z_{t+1} - Z^*\|_F^2 \le \eta_t^2 C^2 + 2\eta_t\delta_t + 2\lambda\eta_t(\|Z_t\|_* - \|Z_{t+1}\|_*) + (1 - 2\eta_t)\|Z_t - Z^*\|_F^2. \quad (13)$$
The inequality in (13) is the same as the result of Lemma 1 in Zhang et al. (2016). Thus, the following lemmas¹ from Zhang et al. (2016) can be directly utilized to derive a probability bound for $\|Z_{t+1} - Z^*\|_F^2$.

Lemma 2 (Lemma 2 in Zhang et al. (2016)). Define $\gamma = \max_{t\in[T]} \|Z_t\|_*$. By setting $\eta_t = \frac{2}{t}$, an upper bound of $\|Z_{T+1} - Z^*\|_F^2$ can be written as
$$\|Z_{T+1} - Z^*\|_F^2 \le \frac{4(C^2 + \lambda\gamma)}{T} + \frac{2}{T(T-1)}\left[2\sum_{t=2}^{T}(t-1)\delta_t - \sum_{t=2}^{T}(t-1)\|Z_t - Z^*\|_F^2\right].$$

The upper bound of $\sum_{t=2}^{T}(t-1)\delta_t$ in Lemma 2 is then provided in Lemma 3.

Lemma 3 (Lemma 3 in Zhang et al. (2016)). Assume $\|\xi_t - K\|_F \le G$ and $\|Z_t - Z^*\|_F \le H$, $\forall t > 2$. With probability at least $1 - \delta$, $\sum_{t=2}^{T}(t-1)\delta_t$ is upper bounded by
$$\sum_{t=2}^{T}(t-1)\delta_t \le \frac{1}{2}\sum_{t=2}^{T}(t-1)\|Z_t - Z^*\|_F^2 + 2G^2\tau(T-1) + \frac{2}{3}GH(T-1)\tau + GH(T-1).$$

Combining Lemma 2 and Lemma 3, the following upper bound holds with probability at least $1 - \delta$:
$$\|Z_{T+1} - Z^*\|_F^2 \le \frac{4}{T}\left(C^2 + \lambda\gamma + 2G^2\tau + \frac{2}{3}GH\tau + GH\right) = O(1/T).$$

D PROOF OF THEOREM 2 AND EMPIRICAL RESULTS

For DSPGD with CEM, in the $t$-th iteration, the $m$-th user's device only needs to upload one vector in each iteration of DLA, i.e., $W_t[m]^\top W_t[m] c_q$. Since $W_t[m] = [\sqrt{\frac{\eta_t}{D}}\, A_t[m], \sqrt{1-\eta_t}\, B_t[m]]$, the dimension of $W_t[m]^\top W_t[m] c_q$ equals $D + r_t$, where $r_t$ is the rank of $Z_t$ and $D$ is the number of random features. Moreover, DLA requires several Lanczos iterations to approach the eigenpairs of $W_t^\top W_t$. Thus, the communication cost of DSPGD with CEM is linear in $D + r_t$. To compute the ratio, we first derive the communication cost of DSPGD without CEM. In the $t$-th iteration, the $m$-th user's device needs to upload two vectors in each iteration of DLA: $A_t[m]^\top c_q[m]$ and $A_t[m] A_t^\top c_q$, where the dimension of $A_t^\top c_q$ equals $D$.
For the concatenation of all $M$ vectors $\{A_t[m] A_t^\top c_q,\ m = 1, \ldots, M\}$, its dimension equals the number of data samples $N$. Thus, the communication cost of DSPGD without CEM is linear in $N + MD$. Given the number of Lanczos iterations $Q_0$ for DSPGD without CEM and the number of Lanczos iterations $Q_1$ for DSPGD with CEM, the ratio can be determined by $\frac{(N + MD)Q_0}{M(r_t + D)Q_1}$. According to Figure 4, the average value of $Q_0$ is close to that of $Q_1$ in each iteration $t$, which indicates $\frac{Q_0}{Q_1} \approx 1$. As a result, the dominant factor of the ratio is still $\frac{N + MD}{M(r_t + D)}$. Empirically, the average value of $r_t$ versus the number of iterations $t$ for the four real-world datasets is shown in Figure 5. The results show that the rank $r_t$ tends to converge as the value of $t$ increases. Besides, the upper bound of $r_t$ is a constant factor larger than the number of eigenvectors $s$ in Table 1, and such an upper bound is much smaller than the number of users' local data samples, which explains the dramatic reduction in communication cost shown in Figure 2.

Figure 5: The rank of Z_t versus the number of iterations t for the four real-world datasets.
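As a concrete illustration of this ratio (with hypothetical values for $N$, $M$, $D$, $r_t$, $Q_0$, and $Q_1$, not measurements from the paper):

```python
def comm_cost_ratio(N, M, D, r_t, Q0, Q1):
    """Ratio of the per-iteration communication cost of DSPGD without CEM,
    (N + M*D)*Q0, to that of DSPGD with CEM, M*(r_t + D)*Q1."""
    return ((N + M * D) * Q0) / (M * (r_t + D) * Q1)

# hypothetical setting: 60000 samples, 100 users, D = 50 features, rank r_t = 20
ratio = comm_cost_ratio(N=60000, M=100, D=50, r_t=20, Q0=10, Q1=10)
print(round(ratio, 2))  # -> 9.29, i.e., roughly a 9x cost reduction with CEM
```

Because $N$ appears only in the numerator, the saving grows with the number of data samples, consistent with the claim that the CEM cost is unrelated to $N$.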

E PROOF OF THEOREM 3

Define $K = U\Lambda U^\top$ and $P = U\Lambda^{\frac{1}{2}}$. The low-rank approximation of $K$ with rank $s$ is denoted as $K_s = U\Lambda_s U^\top$, and $P_s = U\Lambda_s^{\frac{1}{2}}$, where the diagonal of $\Lambda_s$ contains the $s$ largest eigenvalues of $K$ while its remaining diagonal entries are all zero. The output of DSPGD at iteration $t$ is an estimate of $K_s$, denoted as $\hat{K}_t$, with $\hat{K}_t = \hat{P}_t \hat{P}_t^\top$. The following two lemmas will be used in the proof of Theorem 3.

Lemma 4. Given $\hat{K}_t$, the following inequality holds with probability at least $1 - \delta$ for any rank-$k$ projection matrix $\Pi \in \mathbb{R}^{n\times n}$:
$$\mathrm{Tr}(\Pi(K_s - \hat{K}_t)) \le O\left(\sqrt{\frac{s}{t}}\right).$$

Proof. Since $\Pi$ is a rank-$k$ projection matrix, it is obvious that $\mathrm{Tr}(\Pi(K_s - \hat{K}_t)) \le \|K_s - \hat{K}_t\|_*$. For a rank-$s$ matrix $A$, the following inequality holds for its nuclear norm and its Frobenius norm: $\|A\|_*^2 \le s\|A\|_F^2$. Hence, $\|K_s - \hat{K}_t\|_* \le \sqrt{s}\|K_s - \hat{K}_t\|_F$. By Theorem 1, $Z_t$ converges to $Z^*$ at an $O(1/t)$ rate. Note that $Z^*$ has the same eigenvectors as $K_s$. Thus, $\hat{K}_t$, constructed based on $Z_t$, also converges to $K_s$ at an $O(1/t)$ rate with probability at least $1 - \delta$, i.e., $\|K_s - \hat{K}_t\|_F^2 \le O(1/t)$. Hence, the following inequality holds with probability at least $1 - \delta$:
$$\mathrm{Tr}(\Pi(K_s - \hat{K}_t)) \le \sqrt{s}\|K_s - \hat{K}_t\|_F \le O\left(\sqrt{\frac{s}{t}}\right).$$

Lemma 5. Fix an error parameter $\varepsilon \in (0, 1)$. For any rank-$k$ projection matrix $\Pi \in \mathbb{R}^{n\times n}$,
$$\mathrm{Tr}(\Pi(K - \hat{K}_t)\Pi) \le \left(\varepsilon + \frac{k}{s}\right)\|P - \Pi P\|_F^2.$$

Proof. It holds that
$$\mathrm{Tr}((I_n - \Pi)(K - \hat{K}_t)(I_n - \Pi)) = \mathrm{Tr}(K - \hat{K}_t) - \mathrm{Tr}(\Pi(K - \hat{K}_t)\Pi) = \|K - K_s\|_* + \mathrm{Tr}(K_s - \hat{K}_t) - \mathrm{Tr}(\Pi(K - \hat{K}_t)\Pi).$$
Thus, $\mathrm{Tr}(\Pi(K - \hat{K}_t)\Pi)$ can be rewritten as
$$\mathrm{Tr}(\Pi(K - \hat{K}_t)\Pi) = \|K - K_s\|_* + \mathrm{Tr}(K_s - \hat{K}_t) - \mathrm{Tr}((I_n - \Pi)(K - \hat{K}_t)(I_n - \Pi)).$$
It follows that
$$\begin{aligned}
\mathrm{Tr}((I_n - \Pi)(K - \hat{K}_t)(I_n - \Pi)) &= \mathrm{Tr}((I_n - \Pi)(K - K_s)(I_n - \Pi)) + \mathrm{Tr}((I_n - \Pi)(K_s - \hat{K}_t)(I_n - \Pi)) \\
&= \mathrm{Tr}((I_n - \Pi)(K - K_s)(I_n - \Pi)) + \mathrm{Tr}(K_s - \hat{K}_t) - \mathrm{Tr}(\Pi(K_s - \hat{K}_t)) \\
&\ge \|P - P_{s+k}\|_F^2 + \mathrm{Tr}(K_s - \hat{K}_t) - O\left(\sqrt{\frac{s}{t}}\right),
\end{aligned}$$
where the last inequality comes from Lemma 4.
Thus,
$$\begin{aligned}
\mathrm{Tr}(\Pi(K - \hat{K}_t)\Pi) &\le \|K - K_s\|_* - \|P - P_{s+k}\|_F^2 + O\left(\sqrt{\frac{s}{t}}\right) \\
&= \|P - P_s\|_F^2 - \|P - P_{s+k}\|_F^2 + O\left(\sqrt{\frac{s}{t}}\right) \\
&= \sum_{i=s+1}^{n}\sigma_i^2(P) - \sum_{i=s+k+1}^{n}\sigma_i^2(P) + O\left(\sqrt{\frac{s}{t}}\right) \\
&= \sum_{i=s+1}^{s+k}\sigma_i^2(P) + O\left(\sqrt{\frac{s}{t}}\right) \le \frac{k}{s}\sum_{i=k+1}^{s+k}\sigma_i^2(P) + O\left(\sqrt{\frac{s}{t}}\right).
\end{aligned}$$
Since $O(\sqrt{s/t})$ can be arbitrarily small, it can be rewritten as $O(\sqrt{s/t}) = \varepsilon\|P - P_k\|_F^2$. Besides, $\sum_{i=k+1}^{s+k}\sigma_i^2(P) \le \sum_{i=k+1}^{n}\sigma_i^2(P) = \|P - P_k\|_F^2$. Hence, it can be obtained that $\mathrm{Tr}(\Pi(K - \hat{K}_t)\Pi) \le (\varepsilon + \frac{k}{s})\|P - P_k\|_F^2$. Since $\Pi P$ has rank at most $k$ and $P_k$ is the best rank-$k$ approximation of $P$, $\|P - P_k\|_F^2 \le \|P - \Pi P\|_F^2$, which completes the proof.

It can be obtained that
$$\|(I_n - \Pi)P\|_F^2 - \|(I_n - \Pi)\hat{P}_t\|_F^2 = \mathrm{Tr}((I_n - \Pi)PP^\top) - \mathrm{Tr}((I_n - \Pi)\hat{P}_t\hat{P}_t^\top) = \mathrm{Tr}(PP^\top - \hat{P}_t\hat{P}_t^\top) - \mathrm{Tr}(\Pi(PP^\top - \hat{P}_t\hat{P}_t^\top)\Pi).$$
Let $\alpha = \mathrm{Tr}(PP^\top - \hat{P}_t\hat{P}_t^\top)$; then the above equation can be rewritten as
$$\|(I_n - \Pi)P\|_F^2 + \mathrm{Tr}(\Pi(PP^\top - \hat{P}_t\hat{P}_t^\top)\Pi) = \alpha + \|(I_n - \Pi)\hat{P}_t\|_F^2.$$
After sufficiently many iterations, both $\alpha$ and $\mathrm{Tr}(\Pi(PP^\top - \hat{P}_t\hat{P}_t^\top)\Pi)$ are non-negative with high probability. Thus, by Lemma 5, it holds that
$$\|(I_n - \Pi)P\|_F^2 \le \alpha + \|(I_n - \Pi)\hat{P}_t\|_F^2 = \|(I_n - \Pi)P\|_F^2 + \mathrm{Tr}(\Pi(PP^\top - \hat{P}_t\hat{P}_t^\top)\Pi) \le \left(1 + \varepsilon + \frac{k}{s}\right)\|(I_n - \Pi)P\|_F^2. \quad (15)$$

Based on (15), Theorem 3 can be proved as follows. Let $\Pi = \hat{Y}_t L_t \hat{Y}_t^\top$, where $\hat{Y}_t$ is the indicator matrix obtained by applying a $\gamma$-approximate algorithm to $\hat{P}_t$. Then
$$\|(I_n - \hat{Y}_t L_t \hat{Y}_t^\top)P\|_F^2 \le \alpha + \|(I_n - \hat{Y}_t L_t \hat{Y}_t^\top)\hat{P}_t\|_F^2 \le \alpha + \gamma\|(I_n - Y_t^* L_t^* Y_t^{*\top})\hat{P}_t\|_F^2,$$
where $Y_t^*$ is the optimal indicator matrix for the linear k-means problem on $\hat{P}_t$. Since $\gamma > 1$, it follows that
$$\alpha + \gamma\|(I_n - Y_t^* L_t^* Y_t^{*\top})\hat{P}_t\|_F^2 \le \alpha + \gamma\|(I_n - Y^* L^* Y^{*\top})\hat{P}_t\|_F^2 \le \gamma\left(1 + \varepsilon + \frac{k}{s}\right)\|(I_n - Y^* L^* Y^{*\top})P\|_F^2.$$
Thus,
$$\|(I_n - \hat{Y}_t L_t \hat{Y}_t^\top)P\|_F^2 \le \gamma\left(1 + \varepsilon + \frac{k}{s}\right)\|(I_n - Y^* L^* Y^{*\top})P\|_F^2,$$
which is equivalent to $f_K(\hat{Y}_t) \le \gamma(1 + \varepsilon + \frac{k}{s})\min_Y f_K(Y)$.

F PRIVACY PRESERVATION PROPERTY OF FEDERATED KERNEL k-MEANS

F.1 RECOVER USERS' DATA FROM RANDOM FEATURE MATRICES

A random feature of a data sample $x_i$ has the form $\cos(\omega^\top x_i + b)$, where $\omega$ and $b$ are determined by the cloud server. Since the value of $\omega^\top x_i + b$ cannot be arbitrarily large, the number of its possible values is limited. If enough such random features are collected, the cloud server can determine the value of $\omega^\top x_i + b$ for each random feature, and then recover $x_i$ by solving a system of linear equations.
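Assuming the server has already resolved each value $\theta_j = \omega_j^\top x_i + b_j$ (the hard step described above), the final recovery is indeed just a linear solve. A small NumPy sketch (illustrative only; the dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.standard_normal(d)               # a user's private data sample
W = rng.standard_normal((3 * d, d))      # the omega's, chosen by the server
b = rng.uniform(0, 2 * np.pi, size=3 * d)

# suppose the server has determined theta_j = omega_j^T x + b_j for each feature
theta = W @ x + b
# x is then recovered by solving the overdetermined linear system W x = theta - b
x_rec, *_ = np.linalg.lstsq(W, theta - b, rcond=None)
assert np.allclose(x_rec, x)
```

This is why exposing the raw random feature matrices would undermine privacy, and why CEM avoids uploading them.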

F.2 PROOF OF THEOREM 4

We then prove that the cloud server can at most recover the matrices $\{W_t[m]^\top W_t[m],\ m = 1, \ldots, M\}$ (only the product $W_t[m]^\top W_t[m]$, not the matrix $W_t[m]$ itself) from the local computational results (e.g., $W_t[m]^\top W_t[m] c_q$). The eigenpairs of $W_t[m]^\top W_t[m]$ are determined via the distributed Lanczos algorithm. Since
$$W_t[m]^\top W_t[m] = \begin{bmatrix} \frac{\eta_t}{D} A_t[m]^\top A_t[m] & \sqrt{\frac{\eta_t(1-\eta_t)}{D}}\, A_t[m]^\top B_t[m] \\ \sqrt{\frac{\eta_t(1-\eta_t)}{D}}\, B_t[m]^\top A_t[m] & (1-\eta_t) B_t[m]^\top B_t[m] \end{bmatrix},$$
$A_t[m]^\top A_t[m]$ can be recovered from $W_t[m]^\top W_t[m]$. For a matrix $A_t[m] \in \mathbb{R}^{n_m \times D}$ ($n_m < D$), a matrix $A' \in \mathbb{R}^{n_m \times D}$ can be constructed via $A' = U_o A_t[m]$, where $U_o \in \mathbb{R}^{n_m \times n_m}$ is an arbitrary orthogonal matrix with $U_o^\top U_o = I_{n_m}$. By this construction, it can be derived that
$$A'^\top A' = A_t[m]^\top U_o^\top U_o A_t[m] = A_t[m]^\top I_{n_m} A_t[m] = A_t[m]^\top A_t[m].$$
Since there exist infinitely many matrices $U_o$ satisfying $U_o^\top U_o = I_{n_m}$, the problem $A_t[m]^\top A_t[m] = A'^\top A'$ has infinitely many solutions. Hence, recovering the random feature matrix $A_t[m]$ from $A_t[m]^\top A_t[m]$ is an ill-posed problem with infinitely many solutions. Since, by employing CEM, the cloud server cannot recover the random feature matrices via matrix operations, according to Section F.1 it is infeasible for the cloud server to recover users' data by solving a system of linear equations.
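This ill-posedness can be checked numerically. The following NumPy sketch (illustrative, not part of the paper's implementation) constructs one alternative solution via a random orthogonal $U_o$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_m, D = 4, 10                     # n_m < D, as in the proof
A = rng.standard_normal((n_m, D))  # a user's random feature matrix A_t[m]

# build an arbitrary orthogonal U_o via a QR decomposition
U_o, _ = np.linalg.qr(rng.standard_normal((n_m, n_m)))
assert np.allclose(U_o.T @ U_o, np.eye(n_m))

A_prime = U_o @ A                  # a different matrix with the same Gram product
assert np.allclose(A_prime.T @ A_prime, A.T @ A)   # identical A^T A
assert not np.allclose(A_prime, A)                 # yet A' != A
```

Any orthogonal $U_o$ yields another valid preimage, so the server cannot single out $A_t[m]$ from the product alone.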

G ADDITIONAL EXPERIMENTAL SETTINGS

The three public datasets (Mushrooms, MNIST, Covtype)² are selected from the LIBSVM dataset repository. The Smartphone dataset is provided by a company; it contains the power consumption data of one app on users' smartphones. Its twelve features represent the power consumption of twelve hardware components. The clustering task for this dataset is to find the distinct usage patterns of the app based on the power consumption data. Out of privacy concerns, the Smartphone dataset will not be disclosed. The descriptions of the four existing methods used in the experiments are listed as follows.

3. Distributed kernel k-means with random features (Chitta et al., 2012) (denoted as RFK k-means): first transform the raw data samples into the corresponding random feature vectors via the random Fourier feature method (Rahimi & Recht, 2008), and then utilize a distributed linear k-means to find the clusters in the space of these random features;

4. Communication efficient distributed kernel PCA (Balcan et al., 2016) (denoted as CE PCA): first conduct dimension reduction on the raw data samples through the communication efficient kernel PCA, which integrates subspace embedding and adaptive sampling techniques to perform approximate kernel PCA in a distributed manner, and then apply a distributed linear k-means algorithm to the data samples after the dimension reduction.

For the three distributed algorithms (FK k-means, RFK k-means, and CE PCA), the distributed linear k-means algorithm developed in Balcan et al. (2013) is utilized to obtain the clustering results. Thus, the number of data samples C in the coreset needs to be assigned. For FK k-means, the maximal iteration number T is selected from [10, 20, 30, 40, 50]. C is set to 1000 in the experiments on the Mushrooms and MNIST datasets, and to 4000 in the experiments on the Covtype and Smartphone datasets.
RFK k-means has two hyperparameters: the kernel parameter γ and the number of random features D. The hyperparameter configuration for RFK k-means is set as follows. The value of γ is the same as that of FK k-means. In the experiments on the Mushrooms dataset, D = 200 and C is selected from [100, 300, 500, 700]. On the MNIST dataset, D = 800 and C is selected from [100, 300, 500, 700, 900]. On the Covtype dataset, D = 100 and C is selected from [1000, 2000, 3000, 4000, 5000]. On the Smartphone dataset, D = 50 and C is selected from [500, 1000, 2000, 3000]. CE PCA has six hyperparameters: the kernel parameter γ, the number of principal components d after PCA, the number of random features D, the subspace embedding dimension for the feature expansion d_s, the subspace embedding dimension for the data points d_p, and the number of representative points p. The hyperparameter configuration for CE PCA is set as follows. The value of γ is the same as that of FK k-means. Across the experiments on different datasets, some hyperparameters are unchanged: for CE PCA, d_s = 50, d_p = 250, and p = 500. Besides, the number of principal components d in CE PCA is the same as the number of eigenvectors s in FK k-means. To obtain different communication costs and normalized mutual information scores, D is set to different values for both methods. In the experiment on the Mushrooms dataset, D is selected from [20, 50, 100, 200] and C is set to 1000. On the MNIST dataset, D is selected from [100, 200, 400, 800] and C is set to 1000. On the Covtype dataset, D is selected from [20, 50, 100, 200] and C is set to 4000. On the Smartphone dataset, D is selected from [20, 50, 100] and C is set to 4000. The configuration of D is discussed as follows. For RFK k-means and CE PCA, D is usually set to large values (more than 100).
For federated kernel k-means, by contrast, D can be set to relatively small values (less than 50). The reason is as follows. RFK k-means (and CE PCA) employs random features only once to estimate the kernel matrix. Thus, it requires a large number of random features to obtain an estimate of the kernel matrix with a low approximation error, and hence a high NMI score. In contrast, federated kernel k-means is an iterative algorithm in which random features are employed in each iteration to reduce the gap between the estimate and the kernel matrix. Hence, the number of random features in each iteration does not need to be large. The communication costs of the three algorithms are determined as follows. For FK k-means, the communication cost is that of DSPGD with CEM plus that of the distributed linear k-means. For RFK k-means, the communication cost equals that of the distributed linear k-means. For CE PCA, the communication cost is that of performing distributed PCA plus that of the distributed linear k-means. The communication cost of the distributed linear k-means equals the number of data samples in the coreset times the dimension of a data sample. In both FK k-means and CE PCA, the dimension of a data sample equals the number of eigenvectors s. In RFK k-means, the dimension of a data sample equals the number of random features D, since the raw data cannot be exposed to the cloud server and only the random feature vectors can be uploaded.



More specifically, the top eigenvectors of the low-rank matrix are the same as those of K, and its nonzero eigenvalues are smaller than those of K by a constant.

¹ These lemmas can be found in the supplementary material of Zhang et al. (2016), which can be downloaded from https://cs.nju.edu.cn/zlj/pdf/AAAI-2016-Zhang-S.pdf.
² These datasets can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.



obtain the clustering result. DSPGD with CEM and the distributed linear k-means algorithm constitute FK k-means. The pseudocode of FK k-means is shown in Algorithm 1.

Algorithm 1 Federated Kernel k-Means
1: Input: the threshold parameter λ, the number of eigenvectors s to be approached, the number of random features D, the maximal number of iterations T, local datasets L_m, m = 1, ..., M, the initial local matrices B_1[m] = 0, m = 1, ..., M
2: Output: the clustering assignment for each data sample
3: Server executes:
4: for t = 1, 2, ..., T do
5:   Initialize η_t = 1/t, q = 0
6:   for each client m, m = 1, 2, ..., M in parallel do
7:     Compute A_t[m] by applying a random feature method to L_m
8:

Figure 2: The average communication cost per iteration of the two versions of DSPGD.

Figure 3: The NMI score versus the average communication cost of FK k-means and the existing methods on the four datasets.

Figure 4: The values of Q_0 and Q_1 versus the number of iterations t for the four real-world datasets.

1. Centralized kernel k-means (Zha et al., 2001) (denoted as CK k-means): directly perform truncated SVD on the kernel matrix $K = U\Lambda U^\top$ to obtain a matrix that consists of the first $s$ column vectors of $U\Lambda^{\frac{1}{2}}$, and then apply linear k-means to this matrix;

2. Scalable kernel k-means (Wang et al., 2019) (denoted as SK k-means): utilize the Nyström method to approximate the kernel matrix $K$, and conduct kernel k-means over the approximated kernel matrix;

$R_t$ is designed as follows. Let $Z_t = \tilde{U}_t\tilde{\Lambda}_t\tilde{U}_t^\top$ be the eigendecomposition of $Z_t$, and let $B_t = \tilde{U}_t\tilde{\Lambda}_t^{\frac{1}{2}}$. Based on $B_t$ and the random feature matrix $A_t$, $W_t$ is constructed as

6.1 EXPERIMENTAL SETTING

Four types of existing schemes are considered in the experiments: centralized kernel k-means (Zha et al., 2001) (denoted as CK k-means), scalable kernel k-means (Wang et al., 2019) (denoted as SK k-means), distributed kernel k-means with random features (Chitta et al., 2012) (denoted as RFK k-means), and communication efficient distributed kernel PCA (Balcan et al., 2016) (denoted as CE PCA). CK k-means and SK k-means are executed at the cloud server (denoted as cloud-based algorithms), and the remaining methods are executed in a distributed manner where users' raw data cannot be uploaded to the cloud server (denoted as client-based algorithms). Besides, the Gaussian kernel is used in each algorithm. Four datasets are selected for performance evaluation: three public datasets (Mushrooms, MNIST, and Covtype) from the LIBSVM dataset repository and one dataset

Table 1: Dataset statistics and hyperparameter settings for FK k-means

