MACHINE UNLEARNING OF FEDERATED CLUSTERS

Abstract

Federated clustering (FC) is an unsupervised learning problem that arises in a number of practical applications, including personalized recommender and healthcare systems. With the adoption of recent laws ensuring the "right to be forgotten", the problem of machine unlearning for FC methods has gained significant importance. We introduce, for the first time, the problem of machine unlearning for FC, and propose an efficient unlearning mechanism for a customized secure FC framework. Our FC framework utilizes special initialization procedures that we show are well-suited for unlearning. To protect client data privacy, we develop the secure compressed multiset aggregation (SCMA) framework, which addresses sparse secure federated learning (FL) problems encountered during clustering as well as more general problems. To simultaneously facilitate low communication complexity and secret sharing protocols, we integrate Reed-Solomon encoding with special evaluation points into our SCMA pipeline, and prove that the client communication cost is logarithmic in the vector dimension. Additionally, to demonstrate the benefits of our unlearning mechanism over complete retraining, we provide a theoretical analysis of the unlearning performance of our approach. Simulation results show that the new FC framework exhibits superior clustering performance compared to previously reported FC baselines when the cluster sizes are highly imbalanced. Compared to completely retraining K-means++ locally and globally for each removal request, our unlearning procedure offers an average speed-up of roughly 84x across seven datasets. Our implementation of the proposed method is available at https://github.com/thupchnsky/mufc.

1. INTRODUCTION

The availability of large volumes of user training data has contributed to the success of modern machine learning models. For example, most state-of-the-art computer vision models are trained on large-scale image datasets including Flickr (Thomee et al., 2016) and ImageNet (Deng et al., 2009). Organizations and repositories that collect and store user data must comply with privacy regulations, such as the recent European Union General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Canadian Consumer Privacy Protection Act (CPPA), all of which guarantee the right of users to remove their data from the datasets (Right to be Forgotten). Data removal requests frequently arise in practice, especially for sensitive datasets pertaining to medical records. For example, numerous machine learning models in computational biology are trained using the UK Biobank (Sudlow et al., 2015), which hosts a collection of genetic and medical records of roughly half a million patients (Ginart et al., 2019). Removing user data from a dataset is not enough to guarantee privacy, since training data can often be reconstructed from trained models (Fredrikson et al., 2015; Veale et al., 2018). This motivates the study of machine unlearning (Cao & Yang, 2015), which aims to efficiently eliminate the influence of certain data points on a model. Naively, one can retrain the model from scratch to ensure complete removal, yet retraining comes at a high computational cost and is thus not practical when accommodating frequent removal requests. To avoid complete retraining, specialized approaches have to be developed for each unlearning application (Ginart et al., 2019; Guo et al., 2020; Bourtoule et al., 2021; Sekhari et al., 2021).
Since data privacy is the main goal in FL, it should be natural for an FL framework to allow for frequent data removal of a subset of client data in a cross-silo setting (e.g., when several patients request their data to be removed from the hospital database), or the entire local dataset for clients in a cross-device setting (e.g., when users request apps not to track their data on their phones). This leads to the largely unstudied problem termed federated unlearning (Liu et al., 2021; Wu et al., 2022; Wang et al., 2022). However, existing federated unlearning methods do not come with theoretical performance guarantees after model updates, and they are often vulnerable to adversarial attacks.
Our contributions are summarized as follows.
1) We introduce the problem of machine unlearning in FC, and design a new end-to-end system (Fig. 1) that performs highly efficient FC with privacy and low communication-cost guarantees, and that also enables simple and effective unlearning when needed.
2) As part of the FC scheme with unlearning features, we describe a novel one-shot FC algorithm that offers order-optimal approximation for the federated K-means clustering objective, and also outperforms the handful of existing related methods (Dennis et al., 2021; Ginart et al., 2019), especially when the cluster sizes are highly imbalanced.
3) For FC, we also describe a novel secure compressed multiset aggregation (SCMA) scheme which ensures that the server only has access to the aggregated counts of points in individual clusters but has no information about the point distributions at individual clients. SCMA securely recovers the exact sum of the input sparse vectors with a communication complexity that is logarithmic in the vector dimension, outperforming existing sparse secure aggregation works (Beguier et al., 2020; Ergun et al., 2021), which have a linear complexity.
4) We theoretically establish the unlearning complexity of our FC method and show that it is significantly lower than that of complete retraining.
5) We compile a collection of datasets for benchmarking unlearning of federated clusters, including two new datasets containing methylation patterns in cancer genomes and gut microbiome information, which may be of significant importance to computational biologists and medical researchers who are frequently faced with unlearning requests. Experimental results reveal that our one-shot algorithm offers an average speed-up of roughly 84x compared to complete retraining across seven datasets.
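The aggregation guarantee described in contribution 3, where the server learns only the sum of the per-client count vectors, can be illustrated with a toy example. The sketch below is not SCMA: it uses generic pairwise additive masking, in which every pair of clients shares a random mask that one adds and the other subtracts, so that all masks cancel in the sum. The modulus `P` and the function names are assumptions made for illustration, and the masks are generated in one place only to keep the example short. Each upload here is a dense vector, so the communication cost is linear in the dimension and does not reproduce the logarithmic cost claimed for SCMA.

```python
import random

P = 2_147_483_647  # prime modulus (2^31 - 1); an arbitrary choice for this sketch


def pairwise_masks(num_clients, dim, seed=0):
    """Return one mask vector per client such that all masks sum to zero mod P."""
    rng = random.Random(seed)
    masks = [[0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            # Clients i and j share this vector: i adds it, j subtracts it.
            shared = [rng.randrange(P) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] = (masks[i][k] + shared[k]) % P
                masks[j][k] = (masks[j][k] - shared[k]) % P
    return masks


def mask_counts(counts, mask):
    """Client side: hide the local count vector behind the client's mask."""
    return [(c + m) % P for c, m in zip(counts, mask)]


def aggregate(masked_vectors):
    """Server side: sum the uploads; the pairwise masks cancel out."""
    dim = len(masked_vectors[0])
    return [sum(v[k] for v in masked_vectors) % P for k in range(dim)]


# Sparse per-client counts of points falling into four quantization bins.
client_counts = [
    [3, 0, 0, 1],
    [0, 0, 2, 0],
    [1, 0, 0, 4],
]
masks = pairwise_masks(len(client_counts), dim=4)
uploads = [mask_counts(c, m) for c, m in zip(client_counts, masks)]
total = aggregate(uploads)  # equals the element-wise sum of client_counts
```

Each individual upload looks like a uniformly random vector to the server, while the aggregate matches the true per-bin totals.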

2. RELATED WORKS

Due to space limitations, the complete discussion of related works is included in Appendix A.

Federated clustering. The goal of this learning task is to perform clustering using data that resides at different edge devices. Most existing FC methods are centered around the idea of sending exact (Dennis et al., 2021) or quantized (Ginart et al., 2019) client (local) centroids directly to the server, which may not ensure the desired level of privacy, as they leak the data statistics or cluster information of each individual client. To avoid sending exact centroids, Li et al. (2022) propose sending distances between data points and centroids to the server without revealing the membership of data points to any of the parties involved, but their approach comes with large computational



Figure 1: Overview of our proposed FC framework. K-means++ initialization and quantization are performed at each client in parallel. The SCMA procedure ensures that only the server knows the aggregated statistics of clients, without revealing who contributed the points in each individual cluster. The server generates points from the quantization bins with prescribed weights and performs full K-means++ clustering to infer the global model.

At the same time, federated learning (FL) has emerged as a promising approach to enable distributed training over a large number of users while protecting their privacy (McMahan et al., 2017; Chen et al., 2020; Kairouz et al., 2021; Wang et al., 2021; Bonawitz et al., 2021). The key idea of FL is to keep user data on their devices and train global models by aggregating local models in a communication-efficient and secure manner. Due to model inversion attacks (Zhu et al., 2019; Geiping et al., 2020), secure local model aggregation at the server is a critical consideration in FL, as it guarantees that the server cannot get specific information about client data based on their local models (Bonawitz et al., 2017; Bell et al., 2020; So et al., 2022; Chen et al., 2022).
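The pipeline summarized in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the uniform grid with side `bin_width`, and the use of bin centers as the server-generated points are all hypothetical choices, and the per-bin counts are summed in the clear where the paper would use SCMA.

```python
import random
from collections import Counter


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng, weights=None):
    """K-means++ seeding: sample centers with probability ~ weight * D^2.

    Assumes at least k distinct points with positive weight.
    """
    weights = weights or [1.0] * len(points)
    centers = [rng.choices(points, weights=weights)[0]]
    while len(centers) < k:
        d2 = [w * min(sq_dist(p, c) for c in centers)
              for p, w in zip(points, weights)]
        centers.append(rng.choices(points, weights=d2)[0])
    return centers


def nearest(p, centers):
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))


def client_summary(points, k, bin_width, rng):
    """Client: seed centers, assign points, report cluster sizes per grid bin."""
    centers = kmeanspp_seeds(points, k, rng)
    sizes = Counter(nearest(p, centers) for p in points)
    summary = Counter()
    for i, size in sizes.items():
        bin_index = tuple(int(x // bin_width) for x in centers[i])
        summary[bin_index] += size
    return summary


def server_cluster(total_counts, k, bin_width, rng, iters=10):
    """Server: place weighted points at bin centers, then seed and run Lloyd."""
    points = [tuple((b + 0.5) * bin_width for b in idx) for idx in total_counts]
    weights = [float(total_counts[idx]) for idx in total_counts]
    centers = kmeanspp_seeds(points, k, rng, weights)
    for _ in range(iters):
        groups = {i: [] for i in range(k)}
        for p, w in zip(points, weights):
            groups[nearest(p, centers)].append((p, w))
        for i, members in groups.items():
            if members:  # keep the old center if a cluster is empty
                mass = sum(w for _, w in members)
                centers[i] = tuple(
                    sum(w * p[d] for p, w in members) / mass
                    for d in range(len(centers[i]))
                )
    return centers


rng = random.Random(0)
clients = [
    # Client 0 holds an imbalanced mix of two clusters.
    [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(40)]
    + [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(5)],
    # Client 1 holds points from a single cluster.
    [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(30)],
]
total = Counter()
for pts in clients:
    total += client_summary(pts, k=2, bin_width=0.5, rng=rng)
global_centers = server_cluster(total, k=2, bin_width=0.5, rng=rng)
```

The server never sees raw points in this sketch, only bin indices with aggregated counts, which is the interface that a secure aggregation step would need to protect.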

