GUARDHFL: PRIVACY GUARDIAN FOR HETEROGENEOUS FEDERATED LEARNING

Abstract

Heterogeneous federated learning (HFL) enables clients with different computation and communication capabilities to collaboratively train their own customized models via a query-response paradigm on auxiliary datasets. However, this paradigm raises serious privacy issues due to the leakage of highly sensitive query samples and response predictions. Although existing secure querying solutions may be extended to enhance the privacy of HFL with non-trivial adaptation, they suffer from two key limitations: (1) they lack customized protocol designs and (2) they rely on heavy cryptographic primitives, which can lead to poor performance. In this work, we put forth GuardHFL, the first-of-its-kind efficient and privacy-preserving HFL framework. GuardHFL is equipped with a novel HFL-friendly secure querying scheme that is built on lightweight secret sharing and symmetric-key techniques. Its core is a set of customized multiplication and comparison protocols, which substantially boost execution efficiency. Extensive evaluations demonstrate that GuardHFL outperforms state-of-the-art works in both runtime and communication overhead.

1. INTRODUCTION

As a promising variant of federated learning (FL), heterogeneous federated learning (HFL) (Li & Wang, 2019) enables clients equipped with different computation and communication capabilities to collaboratively train their own customized models, which may differ in size, numerical precision or structure (Lin et al., 2020). In particular, model knowledge is shared via a query-response paradigm on auxiliary datasets, such as unlabeled datasets from the same task domain (Choquette-Choo et al., 2021) or related datasets from different task domains (Li & Wang, 2019; Lin et al., 2020). In this paradigm, each client queries others with samples from the auxiliary querying dataset and obtains aggregated response predictions via a centralized cloud server. The client then retrains its local model on the query data and the corresponding predictions. This flexible approach facilitates customized FL-driven services in areas like healthcare and finance (Kairouz et al., 2019), while resolving the intellectual property concerns of FL models (Tekgul et al., 2021).

However, HFL suffers from several privacy issues. First, directly sharing query samples violates their privacy. For example, in healthcare applications, the auxiliary datasets may contain patients' medical conditions, and disclosing such information is illegal under current regulations like the General Data Protection Regulation. Second, sharing response predictions may still compromise the privacy of local data (Papernot et al., 2016). Several works have shown that, given black-box access to a model, adversaries can infer the membership (Salem et al., 2019) and attribute information (Ganju et al., 2018) of the target sample, or even reconstruct the original training data (Yang et al., 2019).
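To make the second risk concrete, a toy black-box membership-inference heuristic in the spirit of confidence-threshold attacks can be sketched as follows. All names, the stub `predict` interface and the threshold value are illustrative assumptions, not details of the cited attacks, which are substantially more sophisticated (e.g., using shadow models):

```python
def membership_guess(predict, samples, threshold=0.9):
    """Toy black-box membership heuristic: flag a sample as a likely
    training member when the model's top softmax confidence is high.
    `predict(x)` is assumed to return a probability vector."""
    return [max(predict(x)) >= threshold for x in samples]

# A model that is overconfident on data it was trained on leaks membership:
overconfident = lambda x: [0.99, 0.01] if x == "seen" else [0.6, 0.4]
guesses = membership_guess(overconfident, ["seen", "unseen"])
```

Even this crude heuristic shows why returning raw predictions to querying parties is risky, motivating protection of the response predictions as well as the query samples.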
Although in traditional FL systems the privacy issue can be mitigated through well-studied secure gradient aggregation protocols (Bell et al., 2020), realizing this guarantee in HFL is more challenging due to the heterogeneity of the clients' models (refer to Appendix A.2.3). To bridge this gap, a possible solution is to structurally integrate existing secure querying (a.k.a. private inference) schemes (Rathee et al., 2020; Huang et al., 2022; Wagh et al., 2019; Tan et al., 2021) into HFL. These schemes utilize various cryptographic primitives, including homomorphic encryption (HE) (Gentry, 2009), garbled circuits (GC) (Yao, 1986) and oblivious transfer (OT) (Asharov et al., 2013), to provide rigorous privacy guarantees for query data and prediction results. While it is possible to non-trivially extend these secure querying schemes (refer to Section 2.3), they have two major limitations: (1) the underlying protocols are not customized for HFL; and (2) they incur huge overhead due to the adoption of heavy cryptographic primitives. These bottlenecks lead to poor performance and hinder the efficient instantiation of HFL. It is therefore necessary, but challenging, to design customized protocols and implement privacy-preserving HFL with desirable performance.

In this work, we introduce GuardHFL, the first efficient and privacy-preserving HFL framework to address the above challenges. GuardHFL is built upon the standard HFL training paradigm (Li & Wang, 2019), which contains three stages: local training, querying and local re-training (refer to Section 2.1). We formalize the workflow of HFL and present a novel HFL-friendly secure querying scheme as an important building block. The core of our scheme is a set of customized multiplication and comparison protocols, which substantially boost execution efficiency compared to existing works.
More precisely, we optimize the parallel prefix adder (PPA) logic (Harris, 2003) to build a more efficient comparison protocol. Over an ℓ-bit ring, our optimizations save the evaluation of log ℓ AND gates and also reduce the number of communication rounds, the two crucial factors that dominate performance. Moreover, our PRF-based multiplication protocol communicates only 3 elements in an ℓ-bit ring, a significant improvement over the widely used Beaver multiplication solution (Beaver, 1991; Demmler et al., 2015). We provide formal privacy guarantees for the designed protocols, and evaluate GuardHFL on different datasets (SVHN, CIFAR10, Tiny ImageNet), system configurations (IID and Non-IID training sets) and heterogeneous models. Results show that GuardHFL outperforms state-of-the-art works in efficiency while preserving model utility.
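For context, the Beaver-triple baseline that the PRF-based multiplication improves upon can be sketched over a 64-bit ring as follows. This is a plaintext simulation: a locally generated triple stands in for the offline dealer, and both parties' shares live in one process; it is not GuardHFL's actual protocol. The `msb_less_than` helper illustrates why secure comparison reduces to extracting a single bit, namely the MSB that the PPA circuit computes:

```python
import secrets

ELL = 64                    # bit width of the ring Z_{2^ell}
MASK = (1 << ELL) - 1

def share(v):
    """Additively share v between two parties over Z_{2^ell}."""
    r = secrets.randbelow(1 << ELL)
    return (r, (v - r) & MASK)

def reconstruct(s0, s1):
    return (s0 + s1) & MASK

def beaver_mul(x_sh, y_sh):
    """Multiply two shared values with a Beaver triple (a, b, c = a*b);
    online, each party would send its shares of e and f to the other."""
    a = secrets.randbelow(1 << ELL)
    b = secrets.randbelow(1 << ELL)
    a_sh, b_sh, c_sh = share(a), share(b), share((a * b) & MASK)

    # Open e = x - a and f = y - b; the masks a, b hide x and y.
    e = reconstruct((x_sh[0] - a_sh[0]) & MASK, (x_sh[1] - a_sh[1]) & MASK)
    f = reconstruct((y_sh[0] - b_sh[0]) & MASK, (y_sh[1] - b_sh[1]) & MASK)

    # [x*y] = e*f + e*[b] + f*[a] + [c]; the public e*f is added once.
    z0 = (e * f + e * b_sh[0] + f * a_sh[0] + c_sh[0]) & MASK
    z1 = (e * b_sh[1] + f * a_sh[1] + c_sh[1]) & MASK
    return z0, z1

def msb_less_than(x, y):
    """Over Z_{2^ell}, x < y iff the MSB of (x - y) mod 2^ell is 1,
    provided |x - y| < 2^(ell-1); secure comparison extracts this MSB
    on shared inputs, which is what the PPA circuit computes."""
    return ((x - y) & MASK) >> (ELL - 1) == 1
```

The correctness of `beaver_mul` follows from xy = (e + a)(f + b) = ef + eb + fa + ab, with each term computable locally from public `e`, `f` and the triple shares.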

2.1. HETEROGENEOUS FEDERATED LEARNING

We briefly review the workflow of HFL (Li & Wang, 2019), where clients independently design their own unique models. Due to this model heterogeneity, clients cannot directly share model parameters with each other as in traditional FL. Instead, they learn the knowledge of other models via a query-response mechanism, similar to the knowledge distillation technique (Hinton et al., 2015). More precisely, each client P_Q (called the querying party) performs three-phase operations collaboratively with a server: (1) Local training: P_Q first trains the local model on his private dataset. (2) Querying: The server selects a C fraction of clients as the responding parties P_A. Given the auxiliary querying dataset, the server receives the prediction results from these P_A, computes the aggregated result and returns it to P_Q. (3) Local re-training: P_Q then retrains the local model on the private dataset, together with the query samples and corresponding predictions. Each client in HFL can play the roles of the querying party and the responding party at the same time, and the above process is iterated until each local model meets the pre-defined accuracy requirement. Note that, as illustrated in existing works (Bonawitz et al., 2017; Bell et al., 2020), the server is responsible for routing messages between clients, since the clients (e.g., mobile devices) generally cannot establish direct communication channels with each other.

GuardHFL is in line with the above paradigm, with the additional benefit of privacy protection. The only difference lies in the acquisition of auxiliary query samples in the querying stage. In general HFL (Li & Wang, 2019), there is a large public auxiliary dataset (used as query samples) that every party can access. However, given the privacy limitations, such a public dataset is hard to collect in real-world scenarios such as healthcare.
To tackle this problem, in GuardHFL each party locally constructs a synthesized querying set from its private training samples by utilizing existing data augmentation strategies (refer to Section 3.4).
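The three-phase workflow above, together with locally synthesized queries, can be sketched in plaintext as follows. All class and function names are illustrative assumptions: the "model" is a majority-label stub, the aggregation rule is a majority vote rather than prediction averaging, and no secure protocols are involved; the sketch only fixes the control flow, not GuardHFL's cryptographic machinery:

```python
import random

def horizontal_flip(img):
    """Illustrative augmentation; img is a 2-D list (H x W) of pixels."""
    return [row[::-1] for row in img]

def synthesize_query_set(private_samples, n_queries):
    """Each party builds its query set locally from private samples."""
    return [horizontal_flip(random.choice(private_samples))
            for _ in range(n_queries)]

class Client:
    """Illustrative HFL client with a trivial stand-in for a model."""
    def __init__(self, data):
        self.data = data           # list of (sample, label) pairs
        self.soft_labels = {}      # query id -> aggregated prediction

    def local_train(self):
        pass                       # placeholder: train the heterogeneous model

    def predict(self, sample):
        labels = [y for _, y in self.data]
        return max(set(labels), key=labels.count)

def hfl_round(clients, queries, frac=1.0):
    """One query-response round: local training happened before; each
    P_Q collects P_A predictions, aggregates them, then re-trains."""
    for pq in clients:                                   # P_Q: querying party
        cohort = random.sample(clients, max(1, int(frac * len(clients))))
        for qid, x in enumerate(queries):
            preds = [pa.predict(x) for pa in cohort if pa is not pq]
            if preds:                                    # aggregate P_A answers
                pq.soft_labels[qid] = max(set(preds), key=preds.count)
        pq.local_train()           # re-train on private data + query predictions
```

In GuardHFL, the `predict` and aggregation steps are the ones replaced by the secure querying scheme, so that neither the query samples nor the individual predictions are revealed.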

2.2. THREAT MODEL

As described in Section 1, in the querying phase of HFL, the query samples, prediction results and model parameters may contain sensitive information that is of interest to adversaries. In line



Footnote 1: As demonstrated in Bonawitz et al. (2017); Bell et al. (2020), clients (e.g., mobile devices) in real-world applications are generally widely distributed and coordinated only by the server.

Footnote 2: Choquette-Choo et al. (2021) presented a general collaborative learning scheme, called CaPC, which enables each party to improve his local model using others' models via the secure querying scheme of Boemer et al. (2019b). However, it cannot be directly applied to the HFL scenario, as it requires cross-client communication. Meanwhile, it incurs prohibitively high overhead (refer to Section 4.1).

