GUARDHFL: PRIVACY GUARDIAN FOR HETEROGENEOUS FEDERATED LEARNING

Abstract

Heterogeneous federated learning (HFL) enables clients with different computation and communication capabilities to collaboratively train their own customized models via a query-response paradigm on auxiliary datasets. However, such a paradigm raises serious privacy issues due to the leakage of highly sensitive query samples and response predictions. Although existing secure querying solutions may be extended to enhance the privacy of HFL with non-trivial adaptation, they suffer from two key limitations: (1) lacking customized protocol designs and (2) relying on heavy cryptographic primitives, which could lead to poor performance. In this work, we put forth GuardHFL, the first-of-its-kind efficient and privacy-preserving HFL framework. GuardHFL is equipped with a novel HFL-friendly secure querying scheme that is built on lightweight secret sharing and symmetric-key techniques. Its core is a set of customized multiplication and comparison protocols, which substantially boost the execution efficiency. Extensive evaluations demonstrate that GuardHFL outperforms the state-of-the-art works in both runtime and communication overhead.

1. INTRODUCTION

As a promising variant of federated learning (FL), heterogeneous federated learning (HFL) (Li & Wang, 2019) enables clients equipped with different computation and communication capabilities to collaboratively train their own customized models that may differ in size, numerical precision or structure (Lin et al., 2020). In particular, the knowledge of models is shared via a query-response paradigm on auxiliary datasets, such as unlabeled datasets from the same task domain (Choquette-Choo et al., 2021) or related datasets from different task domains (Li & Wang, 2019; Lin et al., 2020). In such a paradigm, each client queries others with samples in the auxiliary querying dataset, and obtains aggregated response predictions via a centralized cloud server. It then retrains its local model on the query data and corresponding predictions. This flexible approach facilitates customized FL-driven services in areas like healthcare and finance (Kairouz et al., 2019), while resolving the intellectual property concerns of FL models (Tekgul et al., 2021). However, HFL suffers from several privacy issues. First, directly sharing query samples violates their privacy. For example, in healthcare applications, the auxiliary datasets may contain patients' medical conditions. Disclosure of such information is illegal under current regulations like the General Data Protection Regulation (GDPR). Second, sharing response predictions may still compromise the privacy of local data (Papernot et al., 2016). Several works have shown that given black-box access to a model, adversaries can infer the membership (Salem et al., 2019) and attribute information (Ganju et al., 2018) of the target sample or even reconstruct the original training data (Yang et al., 2019).
Although in traditional FL systems the privacy issue can be mitigated through well-studied secure gradient aggregation protocols (Bell et al., 2020), it becomes more challenging to realize this guarantee in HFL due to the heterogeneity of the clients' models (refer to Appendix A.2.3). To bridge this gap, a possible solution is to integrate existing secure querying (a.k.a. private inference) schemes (Rathee et al., 2020; Huang et al., 2022; Wagh et al., 2019; Tan et al., 2021) into HFL. These schemes utilize various cryptographic primitives, including homomorphic encryption (HE) (Gentry, 2009), garbled circuits (GC) (Yao, 1986) and oblivious transfer (OT) (Asharov et al., 2013), to provide rigorous privacy guarantees for query data and prediction results. While it is possible to non-trivially extend these secure querying schemes (refer to Section 2.3), they have two major limitations: (1) the underlying protocols are not customized for HFL; (2) they incur substantial overhead due to the adoption of heavy cryptographic primitives. These bottlenecks lead to poor performance and hinder the efficient instantiation of HFL. Therefore, it is necessary but challenging to provide customized protocols and implement privacy-preserving HFL with desirable performance. In this work, we introduce GuardHFL, the first efficient and privacy-preserving HFL framework to address the above challenges. GuardHFL is built upon the standard HFL training paradigm (Li & Wang, 2019), which contains three stages: local training, querying and local re-training (refer to Section 2.1). We formalize the workflow of HFL, and present a novel HFL-friendly secure querying scheme as an important building block. The core of our scheme is a set of customized multiplication and comparison protocols, which substantially boost the execution efficiency compared to existing works.
More precisely, we optimize the parallel prefix adder (PPA) logic (Harris, 2003) to build a more efficient comparison protocol. Over an ℓ-bit ring, our optimizations eliminate the evaluation of log ℓ AND gates and also reduce the number of communication rounds, which are two crucial factors that dominate the performance. Moreover, our PRF-based multiplication protocol only communicates 3 elements of an ℓ-bit ring, achieving significant improvement over the widely used Beaver's multiplication solution (Beaver, 1991; Demmler et al., 2015). We provide formal privacy guarantees for the designed protocols, and evaluate GuardHFL on different datasets (SVHN, CIFAR10, Tiny ImageNet), system configurations (IID and Non-IID training sets) and heterogeneous models. Results show that GuardHFL outperforms the state-of-the-art works in efficiency while preserving model utility.

2.1. HETEROGENEOUS FEDERATED LEARNING

We briefly review the workflow of HFL (Li & Wang, 2019), where clients independently design their own unique models. Due to such model heterogeneity, clients cannot directly share model parameters with each other as in traditional FL. Instead, they learn the knowledge of other models via a query-response mechanism, which is similar to the knowledge distillation technique (Hinton et al., 2015). To be more precise, each client P_Q (called the querying party) performs three-phase operations collaboratively with a server: (1) Local training: P_Q first trains the local model on its private dataset. (2) Querying: The server selects a C fraction of clients as the responding parties P_A. Given the auxiliary querying dataset, the server receives the prediction results from these P_A, computes the aggregated result and returns it to P_Q. (3) Local re-training: P_Q then retrains the local model based on the private dataset, as well as the query samples and corresponding predictions. Each client in HFL can play the roles of the querying party and the responding party at the same time, and the above process is iterated until each local model meets the pre-defined accuracy requirement. Note that as illustrated in existing works (Bonawitz et al., 2017; Bell et al., 2020), the server is responsible for routing the messages between clients, since the clients (e.g., mobile devices) generally cannot establish direct communication channels with each other. GuardHFL is in line with the above paradigm with the additional benefit of privacy protection. The only difference lies in the acquisition of auxiliary query samples in the querying stage. In general HFL (Li & Wang, 2019), there is a large public auxiliary dataset (used as query samples) that every party can access. However, considering the privacy limitation, such a public dataset is hard to collect in real-world scenarios such as healthcare.
To tackle this problem, in GuardHFL, each party can locally construct a synthesized querying set based on its private training samples, by utilizing existing data augmentation strategies (refer to Section 3.4).

2.2. THREAT MODEL

As described in Section 1, in the querying phase of HFL, the query samples, prediction results and model parameters may contain sensitive information that is of interest to adversaries. In line with existing works (Sun & Lyu, 2021; Choquette-Choo et al., 2021), we consider an honest-but-curious adversary setting (Goldreich, 2009), where each entity (including the clients and the server) strictly follows the specification of the designed protocol but attempts to infer more knowledge about the private information of other clients. Moreover, to maintain its reputation and provide more services, the server does not collude with any clients, i.e., an attacker either corrupts the server or a subset of clients but not both.

2.3. EXTENDING EXISTING SECURE QUERYING SOLUTIONS TO HFL

To provide privacy guarantees against the adversaries in Section 2.2, the clients and the server need to privately execute the querying process. Although this process involves three entities (i.e., P_Q, the server and P_A), it is non-trivial to directly extend existing secure 3-party computation (3PC) protocols (Wagh et al., 2019; 2021; Knott et al., 2021; Tan et al., 2021) to instantiate it. The main reason is the lack of direct communication channels between P_Q and P_A in realistic HFL scenarios (Bonawitz et al., 2017; Bell et al., 2020), which hinders the usage of these 3PC solutions in HFL unless we redesign the underlying protocols and make substantial modifications to their corresponding implementations. On the other hand, we can extend state-of-the-art 2PC solutions (Rathee et al., 2020; Huang et al., 2022) to this process by using the server as the communication medium with adaptive protocol modifications (refer to Appendix A.2.4 for more details). Unfortunately, as mentioned in Section 1, such an extension comes at the cost of heavy computational and communication complexity. Motivated by these challenges, we design a set of lightweight and customized protocols for improving the efficiency of the secure querying phase (Section 3), which show significant performance gains over extending the advanced 2PC schemes to HFL (Section 4.1).

2.4. CRYPTOGRAPHIC PRIMITIVES

Secret sharing. GuardHFL adopts the 2-out-of-2 arithmetic secret sharing scheme over the ring Z_{2^ℓ} (Shamir, 1979; Demmler et al., 2015). Share(x) denotes the sharing algorithm that takes x as input and outputs randomly sampled shares [x]_0, [x]_1 with the constraint x = [x]_0 + [x]_1 in Z_{2^ℓ}. The reconstruction algorithm Recon([x]_0, [x]_1) takes the two shares as input and outputs x = [x]_0 + [x]_1 in Z_{2^ℓ}. Besides, our comparison protocol adopts Boolean secret sharing (Shamir, 1979; Demmler et al., 2015), where x ∈ Z_2 is shared as [x]_0^B and [x]_1^B satisfying [x]_0^B ⊕ [x]_1^B = x in Z_2. The security ensures that given [x]_0 or [x]_1 (similarly, [x]_0^B or [x]_1^B), the value of x is perfectly hidden. Arithmetic operations on secret-shared values can be implemented with existing techniques (Appendix A.2.2).

Pseudo-random function. A pseudo-random function y ← PRF(Sk, x) is a deterministic function that takes a uniformly random seed Sk and a payload x as input and outputs a fixed-length pseudo-random string y. The security of PRFs ensures that the output is indistinguishable from the uniform distribution. In GuardHFL, PRFs enable two parties to generate the same pseudo-random values without communication. Details can be found in Appendix A.2.2.
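As a concrete illustration, the two sharing schemes can be sketched in a few lines of Python (a minimal mock-up, not GuardHFL's implementation; the 64-bit ring matches the instantiation in Section 4):

```python
import secrets

ELL = 64          # ring bit-width ell, matching Z_{2^64} used in Section 4
MOD = 1 << ELL

def share(x):
    """2-out-of-2 arithmetic sharing over Z_{2^ell}: x = [x]_0 + [x]_1 mod 2^ell."""
    x0 = secrets.randbelow(MOD)
    x1 = (x - x0) % MOD
    return x0, x1

def recon(x0, x1):
    """Reconstruction: add the two shares back together in the ring."""
    return (x0 + x1) % MOD

def share_bool(b):
    """Boolean sharing over Z_2: b = [b]_0 XOR [b]_1."""
    b0 = secrets.randbelow(2)
    return b0, b0 ^ b

x = 123456789
x0, x1 = share(x)
assert recon(x0, x1) == x        # arithmetic round-trip

b0, b1 = share_bool(1)
assert b0 ^ b1 == 1              # Boolean round-trip
```

Each individual share is uniformly random, which is why revealing one share leaks nothing about x.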

3. GUARDHFL

GuardHFL is built upon standard HFL systems as described in Section 2.1 and enhances their privacy protection with cryptographic techniques. Figure 1 shows the overview of GuardHFL and the detailed description is given in Algorithm 1. Similar to vanilla HFL, it includes three phases: local training, secure querying and local re-training. Since local training and local re-training are standard HFL training processes without privacy issues, below we focus on formalizing our core construction, i.e., secure querying. As detailed in Section 2.3, extending existing secure querying solutions to HFL introduces expensive overhead due to the usage of heavy cryptographic primitives and the lack of customized protocols. To tackle this challenge, we propose a tailored secure querying scheme utilizing lightweight secret sharing and PRF techniques, which is decomposed into three steps: secure query-data sharing, secure model prediction and secure result aggregation.

Algorithm 1 (querying phase, excerpt):
5:  for each querying party P_Q^j, j ∈ [n] do
6:      P_Q^j randomly samples query data {x_b}_{b∈[B]} from the auxiliary querying dataset that is generated via the data augmentation strategies described in Section 3.4.
7:      for each responding party P_A^i, i ∈ C do
8:          P_Q^j secret-shares {[x_b]}_{b∈[B]} with P_A^i and the server, based on protocol Π_Share in Figure 2.
9:          P_A^i, P_Q^j and the server jointly perform the secure model prediction protocol in Section 3.2.
10:         P_A^i secret-shares the predictions {[y_b^i]}_{b∈[B]} to P_Q^j and the server.
11:     end for
12:     P_Q^j computes {y_b}_{b∈[B]}, where y_b = Σ_{i∈C} y_b^i, via protocol Π_Agg in Figure 5 with the server.
13:     P_Q^j retrains M_j based on the query dataset {x_b, y_b}_{b∈[B]} and D_j.
14: end for
15: end for

In general, P_Q first constructs query samples locally using the data augmentation strategies of Section 3.4. Since query samples imply the semantic information of the private training data, they cannot be directly exposed to the server and P_A for prediction. Therefore, GuardHFL secret-shares the query samples to the server and P_A using the designed secure query-data sharing protocol (Section 3.1). Then, given the secret-shared samples, P_A, P_Q and the server jointly execute the proposed secure model prediction protocol (Section 3.2) to obtain the secret-shared inference logits. After that, the secure result aggregation protocol (Section 3.3) takes as input the secret-shared logits and returns the aggregated results to P_Q. During the entire querying phase, GuardHFL maintains the following invariant (Rathee et al., 2020; Huang et al., 2022): the server and P_A start each protocol with arithmetic shares of the inputs, and end with arithmetic shares of the outputs over the same ring. This allows us to sequentially stitch the proposed protocols into a fully private querying scheme. The formal security analysis is given in Appendix A.3.

3.1. SECURE QUERY-DATA SHARING

To perform secure prediction based on secret sharing, P_Q first secret-shares the query data x with the server and P_A. Considering the communication constraint between P_Q and P_A, we utilize PRFs to share x. Specifically, we first construct pairwise PRF seeds for P_Q, P_A and the server, denoted as Sk_QA, Sk_SA and Sk_SQ, which allow any two parties to generate the same random values without communication (refer to Figure 12 in Appendix A.2.2). After that, P_Q can share x using protocol Π_Share as shown in Figure 2. In particular, P_Q non-interactively shares [x]_0 = r with P_A, where r ← PRF(Sk_QA) is computed by both parties. Then P_Q computes [x]_1 = x − r and sends it to the server.

Figure 2: Secure query-data sharing protocol Π_Share.
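A minimal sketch of Π_Share, with the PRF instantiated by HMAC-SHA256 (an illustrative choice; the paper only assumes some secure PRF keyed by the pairwise seed Sk_QA):

```python
import hmac
import hashlib
import secrets

ELL = 64
MOD = 1 << ELL

def prf(seed: bytes, counter: int) -> int:
    """PRF sketch: HMAC-SHA256 truncated into the ring (illustrative instantiation)."""
    digest = hmac.new(seed, counter.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % MOD

# Pairwise seed shared by P_Q and P_A; the server never learns it.
sk_qa = secrets.token_bytes(16)

def pi_share(x: int, counter: int):
    """P_Q shares x: P_A derives [x]_0 = r without any message; the server receives [x]_1 = x - r."""
    r = prf(sk_qa, counter)       # computed independently by both P_Q and P_A
    to_server = (x - r) % MOD     # the only message actually transmitted
    return r, to_server           # ([x]_0 held by P_A, [x]_1 held by the server)

x = 42
x0, x1 = pi_share(x, counter=0)
assert (x0 + x1) % MOD == x
```

The counter plays the role of the PRF payload so that each query sample gets a fresh mask.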

3.2. SECURE MODEL PREDICTION

In this step, the server and P_A execute secure prediction on the secret-shared query data with the assistance of P_Q. Figure 11 in Appendix A.2.1 gives an end-to-end graphic depiction. Below we elaborate on the customized protocols for three components: linear layers, ReLU and MaxPooling.

Linear layers. Linear layers consist of fully-connected, convolution, average pooling and batch normalization layers, and the main operation of these layers is matrix multiplication (Wagh et al., 2019; Huang et al., 2022). We utilize PRFs and secret sharing to design a customized matrix multiplication protocol Π_Matmul, which is not only compatible with communication-limited HFL settings, but also improves communication efficiency. Specifically, as shown in Figure 3, P_A and the server aim to compute ωx, where the model parameter ω is held by P_A, and the shares [x]_0 and [x]_1 of x are held by P_A and the server, respectively. Given that ωx = ω[x]_0 + ω[x]_1, P_A can compute ω[x]_0 locally. To evaluate ω[x]_1, P_Q first generates three random matrices a, b and [c]_0 using PRFs, and meanwhile computes and sends to the server [c]_1 satisfying [c]_1 + [c]_0 = ab in Z_{2^ℓ}. At the same time, using PRFs, the server generates the same b, and P_A generates the same a and [c]_0. Then P_A and the server learn [y]_0 and [y]_1 (i.e., the secret shares of ωx), respectively, through one round of interaction. Overall, the communication cost is 3ℓ bits (three ring elements) within 1 communication round.

Remark. In our fixed-point representation, to prevent values from overflowing due to multiplication, we use the truncation technique of Mohassel & Zhang (2017), which is consistent with existing methods (Mishra et al., 2020; Wagh et al., 2019). This technique simply truncates the extra LSBs of fixed-point values, at the cost of a 1-bit error whose effect on model accuracy is negligible (Mohassel & Zhang, 2017).
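The one-round multiplication protocol Π_Matmul can be mocked up as follows (a scalar stand-in for the matrix case; here the correlated values a, b, [c]_0 are drawn directly at random instead of via PRF seeds, and in practice fresh correlated randomness would be used per invocation):

```python
import secrets

MOD = 1 << 64                 # the ring Z_{2^64} from Section 4
rnd = lambda: secrets.randbelow(MOD)

# Correlated randomness prepared by P_Q:
a, c0 = rnd(), rnd()          # known to P_Q and P_A (via seed Sk_QA)
b = rnd()                     # known to P_Q and the server (via seed Sk_SQ)
c1 = (a * b - c0) % MOD       # computed by P_Q and sent to the server

def pi_matmul(w, x0, x1):
    """One round: P_A masks its weight, the server masks its share, both finish locally."""
    msg_to_server = (w + a) % MOD    # P_A -> server
    msg_to_pa = (x1 - b) % MOD       # server -> P_A
    y0 = (w * x0 + w * msg_to_pa - c0) % MOD   # P_A's output share
    y1 = (msg_to_server * b - c1) % MOD        # server's output share
    return y0, y1

w, x = 17, 25
x0 = rnd(); x1 = (x - x0) % MOD
y0, y1 = pi_matmul(w, x0, x1)
assert (y0 + y1) % MOD == (w * x) % MOD
```

Correctness follows because y0 + y1 = ωx + ab − ([c]_0 + [c]_1) = ωx in the ring; the exchanged messages are uniformly masked by a and b, respectively.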
Figure 3: Secure matrix multiplication protocol Π_Matmul.

ReLU. ReLU is computed as ReLU(x) = x · (1 ⊕ MSB(x)), where MSB(x) equals 0 if x ≥ 0 and 1 otherwise. Thus, the evaluation of ReLU consists of MSB and multiplication operations. In the following, we propose an optimized MSB protocol building on the parallel prefix adder (PPA) logic (Harris, 2003; Mohassel & Rindal, 2018; Patra et al., 2021). Before giving the specific optimizations, we first review the PPA-based MSB method.

PPA-based MSB method. Given that the bit decompositions of [x]_0 and [x]_1 are e_ℓ, ..., e_1 and f_ℓ, ..., f_1, respectively, the PPA-based method evaluates MSB(x) = e_ℓ ⊕ f_ℓ ⊕ c_ℓ, where c_ℓ is the ℓ-th carry bit. Note that c_ℓ = (e_{ℓ-1} ∧ f_{ℓ-1}) ⊕ (c_{ℓ-1} ∧ (e_{ℓ-1} ⊕ f_{ℓ-1})). Given the carry signal tuples (g_i^0, p_i^0), where g_i^0 = e_i ∧ f_i and p_i^0 = e_i ⊕ f_i for i ∈ [ℓ], c_ℓ can be reformulated as

c_ℓ = g_{ℓ-1}^0 ⊕ (p_{ℓ-1}^0 ∧ g_{ℓ-2}^0) ⊕ · · · ⊕ (p_{ℓ-1}^0 ∧ · · · ∧ p_2^0 ∧ g_1^0).   (1)

The PPA evaluates Eq. 1 by constructing a (log ℓ)-layer tree and traversing from the leaves, with inputs (g_i^0, p_i^0) for i ∈ [ℓ], up to the root. Each node k at the n-th layer (n ∈ [log ℓ]) performs the following operation: g_k^n = g_{j+1}^{n-1} ⊕ (g_j^{n-1} ∧ p_{j+1}^{n-1}) and p_k^n = p_{j+1}^{n-1} ∧ p_j^{n-1}, which takes as input two adjacent signal tuples (g_{j+1}^{n-1}, p_{j+1}^{n-1}) and (g_j^{n-1}, p_j^{n-1}), and outputs the signal tuple (g_k^n, p_k^n). Finally, c_ℓ is obtained as g_1^{log ℓ}, and MSB(x) = e_ℓ ⊕ f_ℓ ⊕ g_1^{log ℓ}. Overall, this method requires 3ℓ − 4 AND gates within log ℓ + 1 communication rounds, where each AND gate is evaluated by the standard Beaver triple-based multiplication protocol (Appendix A.2.2).
Therefore, it requires 30ℓ − 40 bits of communication in total.

Optimizations. Straightforwardly adopting the above PPA-based MSB method does not achieve the best efficiency. The performance bottleneck comes from: (1) unnecessary AND-gate evaluations that introduce extra communication overhead; (2) additional communication rounds due to the separation of input computation and tree evaluation. To address these issues, we first simplify the PPA circuit by removing unnecessary AND gates: in each level n ∈ {1, 2, ..., log ℓ}, we eliminate the generation of the "least significant" p_1^n, which consumes an AND evaluation but is unnecessary for computing the MSB; for the same reason, we also remove the evaluation of g_ℓ^0. Secondly, we further reduce the communication complexity by evaluating the remaining AND gates with the above multiplication protocol Π_Matmul instead of the Beaver triple-based method. Thirdly, we merge the communication round used to compute the inputs g_i^0 for i ∈ {1, 2, ..., ℓ − 1} into the evaluation of the PPA circuit via a non-trivial modification of the evaluation logic. Overall, our MSB protocol communicates 9ℓ − 3 log ℓ − 12 bits in total within log ℓ rounds, a 3.4× communication improvement over the PPA-based MSB method above. Algorithm 2 gives the detailed construction of Π_MSB.
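The carry-lookahead structure underlying the MSB computation can be checked in plaintext (no secret sharing here; this mirrors only the log-depth prefix-tree evaluation of Eq. 1, not the optimized protocol Π_MSB itself):

```python
import secrets

def msb_from_shares(x0, x1, ell=64):
    """Plaintext mock of the PPA logic: recover MSB(x0 + x1 mod 2^ell) from the
    bits of the two shares via a log-depth carry-lookahead tree."""
    e = [(x0 >> i) & 1 for i in range(ell)]
    f = [(x1 >> i) & 1 for i in range(ell)]
    # Signal tuples for the low ell-1 positions: g_i = e_i AND f_i, p_i = e_i XOR f_i.
    sig = [(e[i] & f[i], e[i] ^ f[i]) for i in range(ell - 1)]
    # Prefix combine: block (G, P) from a lower and a higher half.
    while len(sig) > 1:
        if len(sig) % 2:                 # pad odd lengths with the identity (G=0, P=1)
            sig.append((0, 1))
        sig = [(hi_g | (hi_p & lo_g), hi_p & lo_p)
               for (lo_g, lo_p), (hi_g, hi_p) in zip(sig[0::2], sig[1::2])]
    carry = sig[0][0]                    # carry into the most significant bit
    return e[ell - 1] ^ f[ell - 1] ^ carry

for _ in range(100):
    x0, x1 = secrets.randbelow(1 << 64), secrets.randbelow(1 << 64)
    expected = (((x0 + x1) % (1 << 64)) >> 63) & 1
    assert msb_from_shares(x0, x1) == expected
```

In the actual protocol, each AND inside the combine step is evaluated on Boolean shares; the sketch only demonstrates that the tree computes the correct carry.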
After obtaining [MSB(x)]^B, we need to compute [x] · (1 ⊕ [MSB(x)]^B), i.e., the secret shares of ReLU(x). Letting z_0 = [MSB(x)]_0^B and z_1 = 1 ⊕ [MSB(x)]_1^B for simplicity, we have ReLU(x) = ([x]_0 + [x]_1)(z_0 + z_1 − 2z_0z_1) = z_0[x]_0 + z_1[x]_1 + z_1(1 − 2z_0)[x]_0 + z_0(1 − 2z_1)[x]_1. The first two terms can be evaluated locally by P_A and the server, respectively, while the other two terms are evaluated using protocol Π_Matmul. For example, to compute z_1(1 − 2z_0)[x]_0, protocol Π_Matmul takes as input t_0 = (1 − 2z_0)[x]_0 from P_A and z_1 from the server, and outputs [t_0z_1]_0 and [t_0z_1]_1 to P_A and the server, respectively. Finally, P_A and the server learn [y]_0 and [y]_1, respectively, such that y = ReLU(x). The complete secure ReLU protocol Π_ReLU is shown in Figure 4.

Figure 4: Secure ReLU protocol Π_ReLU.
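The share-splitting identity above can be verified numerically (a plaintext sanity check, not the protocol; the Boolean MSB shares are sampled directly rather than produced by Π_MSB):

```python
import secrets

ELL = 64
MOD = 1 << ELL

def relu_plain(x):
    """Reference ReLU on the ring: the top half of Z_{2^64} encodes negative values."""
    return x if x < MOD // 2 else 0

for _ in range(100):
    x = secrets.randbelow(MOD)
    x0 = secrets.randbelow(MOD)
    x1 = (x - x0) % MOD
    msb = x >> (ELL - 1)
    m0 = secrets.randbelow(2)
    m1 = m0 ^ msb                      # Boolean shares of MSB(x)
    z0, z1 = m0, 1 ^ m1                # as defined in the text
    # ReLU(x) = z0*[x]_0 + z1*[x]_1 + z1*(1-2z0)*[x]_0 + z0*(1-2z1)*[x]_1
    y = (z0 * x0 + z1 * x1
         + z1 * (1 - 2 * z0) * x0
         + z0 * (1 - 2 * z1) * x1) % MOD
    assert y == relu_plain(x)
```

The check works because z_0 + z_1 − 2z_0z_1 = z_0 ⊕ z_1 = 1 ⊕ MSB(x), so the sum of the four terms is exactly x · (1 ⊕ MSB(x)) in the ring.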

3.3. SECURE RESULT AGGREGATION

After the secure prediction, the predicted logit [x_i] is secret-shared between the server and each responding party P_A^i, where i ∈ C and C is the set of responding parties. To prevent privacy leakage from a single prediction (Salem et al., 2019; Ganju et al., 2018; Yang et al., 2019), we return the aggregated logit to P_Q via the secure aggregation protocol Π_Agg in Figure 5. Specifically, each P_A^i and P_Q first generate a random value r_i based on PRFs. Then each P_A^i computes and sends [x_i]_0 − r_i to the server. The server sums all received values together with its own shares and sends the masked aggregation to P_Q, who removes the masks and reconstructs the aggregated logits of the query data. Notice that our secure aggregation protocol can be extended to output the aggregated label rather than the logit, using the above Π_ReLU protocol.

Figure 5: Secure result aggregation protocol Π_Agg.

3.4. QUERY DATA CONSTRUCTION

One option is to directly use the client's private training samples as queries, in the spirit of knowledge distillation (Hinton et al., 2015). Moreover, we also construct a synthesized dataset via the mixup method (Zhang et al., 2018) (Appendix A.1.2). The synthesized dataset construction is a universal and modular method, and it can be readily extended with advanced data augmentation strategies, such as cutout (DeVries & Taylor, 2017) and cutmix (Yun et al., 2019). Note that this process does not reveal any private information, since the samples are constructed locally by the querying party, without involving any other parties or their private datasets. We present some exploration and experiments in Appendix A.1.2 and Figure 10(c).

GPU-friendly evaluation. Our scheme is friendly to GPUs and can be processed by highly-optimized CUDA kernels (Tan et al., 2021).
As discussed above, the cryptographic protocols of GuardHFL only involve simple vectorized arithmetic operations, rather than the homomorphic encryption and garbled circuits used in prior works (Rathee et al., 2020; Huang et al., 2022; Choquette-Choo et al., 2021). As a result, GuardHFL is suitable for batch querying (i.e., executing multiple queries at the same time) with a lower amortized cost. We evaluate the designed protocols on GPUs in Section 4.1 and show the advantage of GPU acceleration over CPUs in Appendix A.1.2.
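To make the aggregation step concrete, Π_Agg from Section 3.3 can be simulated in plaintext (the masks r_i are sampled directly here rather than derived from the pairwise PRF seeds):

```python
import secrets

MOD = 1 << 64
C = 3                                     # number of responding parties (illustrative)

# Each logit x_i is secret-shared between P_A^i and the server.
xs = [secrets.randbelow(1000) for _ in range(C)]
shares = []
for x in xs:
    x0 = secrets.randbelow(MOD)
    shares.append((x0, (x - x0) % MOD))   # ([x_i]_0 at P_A^i, [x_i]_1 at the server)

# Pairwise masks r_i, known to P_A^i and P_Q but not to the server.
rs = [secrets.randbelow(MOD) for _ in range(C)]

# Each P_A^i sends its masked share [x_i]_0 - r_i; the server aggregates blindly.
S = sum((x0 - r + x1) for (x0, x1), r in zip(shares, rs)) % MOD

# P_Q removes the masks and recovers the aggregated logit.
agg = (S + sum(rs)) % MOD
assert agg == sum(xs) % MOD
```

The server only ever sees uniformly masked values, so it learns nothing about any individual logit.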

4. EVALUATION

Datasets and models. We evaluate GuardHFL on three image datasets (SVHN, CIFAR10 and Tiny ImageNet). By default, we assume independent and identically distributed (IID) training data among clients. We also simulate disjoint Non-IID training data via the Dirichlet distribution Dir(α), as in Lin et al. (2020). The value of α controls the degree of Non-IID-ness, where a smaller α indicates a higher degree of Non-IID-ness. Moreover, we simulate the heterogeneity property of HFL. In particular, for SVHN and CIFAR10, we set the number of clients n = 50 and use VGG-7, ResNet-8 and ResNet-10 architectures as the clients' local models. For Tiny ImageNet, we set n = 10 and use ResNet-14, ResNet-16 and ResNet-18 architectures. Each model architecture is used by n/3 clients. Besides, the query data are constructed via the two methods described in Section 3.4: using the private training data (Q-priv) or synthesizing samples (Q-syn) via mixup (Zhang et al., 2018).

Cryptographic protocol. Following existing works (Rathee et al., 2020; Tan et al., 2021), we instantiate the secret-sharing protocols over the 64-bit ring Z_{2^64}, and encode inputs using a fixed-point representation with 20-bit precision. The security parameter κ is 128 in the instantiation of PRFs. Unless otherwise stated, we only report the performance on the GPU accelerator. Further details of the experimental setup are given in Appendix A.1.1.
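The fixed-point encoding can be sketched as follows (an illustrative mock-up of the 64-bit ring with 20-bit fractional precision; the shares-wise truncation error noted in Section 3.2 does not appear here because we truncate the plaintext product):

```python
ELL, FRAC = 64, 20        # ring width and fractional precision from the setup above
MOD = 1 << ELL

def encode(v: float) -> int:
    """Fixed-point encode: scale by 2^20; negatives wrap into the top half of Z_{2^64}."""
    return round(v * (1 << FRAC)) % MOD

def decode(x: int) -> float:
    """Interpret ring elements >= 2^63 as negative (two's-complement style)."""
    if x >= MOD // 2:
        x -= MOD
    return x / (1 << FRAC)

assert abs(decode(encode(-1.5)) + 1.5) < 2 ** -FRAC

# After one multiplication the scale doubles to 2^40; truncating FRAC bits restores it.
prod = encode(1.25) * encode(2.0)
assert decode((prod >> FRAC) % MOD) == 2.5
```

All protocol arithmetic then operates on these ring elements, with a truncation after every multiplication layer.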

4.1. EFFICIENCY

We report the efficiency of GuardHFL, and compare it with CaPC (Choquette-Choo et al., 2021) and instantiations of HFL based on state-of-the-art secure querying protocols (Rathee et al., 2020; Huang et al., 2022; Tan et al., 2021).

End-to-end performance. We show the extra overhead introduced by GuardHFL compared with the vanilla HFL system in the plaintext environment. This overhead is caused by the secure querying phase, which consists of the three steps described in Section 3. Table 1 reports the runtime of each step for different models and datasets. We observe that the cost is dominated by the secure model prediction step. Specifically, it takes 16.9 minutes to evaluate 5000 query samples securely on VGG-7 and CIFAR10, while only 11.32 seconds and 0.3 seconds are spent on the secure query-data sharing and secure result aggregation steps, respectively. More time is required for Tiny ImageNet because of its larger input sizes and model architectures.

Comparison with CaPC. We compare GuardHFL with CaPC (which is instantiated with the HE framework of Boemer et al. (2019b)) on MNIST. We observe that GuardHFL is two orders of magnitude faster than CaPC on the three models. In terms of communication overhead, we provide a theoretical comparison. (1) For linear layers, CaPC needs to communicate 2 homomorphic ciphertexts within 2 rounds, whereas GuardHFL communicates 3 ring elements (each 64-bit); note that the size of a ciphertext is much larger than that of a ring element. (2) For non-linear layers, e.g., ReLU, CaPC adopts the garbled circuit technique that requires 2 rounds with 8ℓλ − 4λ communication bits (λ = 128 and ℓ = 64 in our setting) (Rathee et al., 2020). GuardHFL only requires communicating 15ℓ − 3 log ℓ − 12 bits, a 70× improvement over CaPC.

Comparison with SOTA works. To further demonstrate the efficiency of GuardHFL, we instantiate HFL based on SOTA 2PC schemes, including Cheetah (Huang et al., 2022) and CrypTFlow2 (Rathee et al., 2020), using the methods described in Appendix A.2.4.
Table 3 reports the runtime and communication comparison of the secure prediction phase over CIFAR10. We observe that GuardHFL achieves a significant efficiency improvement on the three heterogeneous models. For example, GuardHFL requires 57.4∼75.6× less runtime and 8.6∼12.7× less communication compared to CrypTFlow2. This is because the latter relies on heavy HE-based multiplication and OT-based comparison over multiple communication rounds. Moreover, as shown in Section 2.3, extending 3PC protocols such as CryptGPU (Tan et al., 2021) to HFL is non-trivial. However, since CryptGPU is one of the most advanced GPU-based protocols, built on CrypTen (Knott et al., 2021), we also compare against it assuming no communication limitation. Even under this comparison that favors CryptGPU, GuardHFL still has performance advantages, i.e., roughly 2.1× and 2.0× lower computation and communication overheads, respectively.
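The per-ReLU communication figures quoted above can be re-derived with a quick calculation, using the formulas and parameters stated in the text:

```python
import math

ell, lam = 64, 128                        # ring width and security parameter from Section 4

# Garbled-circuit ReLU cost cited for CaPC: 8*ell*lambda - 4*lambda bits.
gc_relu_bits = 8 * ell * lam - 4 * lam

# GuardHFL ReLU cost: 15*ell - 3*log(ell) - 12 bits
# (the MSB protocol's 9*ell - 3*log(ell) - 12 bits plus two Pi_Matmul-style products).
our_relu_bits = 15 * ell - 3 * int(math.log2(ell)) - 12

print(gc_relu_bits, our_relu_bits, round(gc_relu_bits / our_relu_bits))
```

With ℓ = 64 and λ = 128 this gives 65024 vs. 930 bits, i.e., the roughly 70× improvement claimed above.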

4.2. ACCURACY

We report the accuracy of each heterogeneous model in GuardHFL, and explore the impact of various factors on model accuracy, such as the Non-IID setting and the number of query samples.

End-to-end model accuracy. Table 4 reports the model accuracy on the three datasets in GuardHFL. We observe that for SVHN and CIFAR10, using Q-priv to query can increase the accuracy by about 4%, while the accuracy gain is about 10% when using 10K query samples with Q-syn. The main reason is that synthetic samples can provide good coverage of the manifold of natural data. We also observe from Table 4 that more synthetic query data achieve better performance. Furthermore, with an increased number of participating clients, the accuracy improves slightly. Figure 6 shows the accuracy curves versus the number of iterations. We use SVHN and CIFAR10 as examples, as they converge much faster and yield more readable curves than Tiny ImageNet. We observe that each heterogeneous model on both datasets converges well with either type of query data, and Q-syn shows better performance.

Figure 6: Accuracy curves of VGG-7, ResNet-8 and ResNet-10 under Q-syn and Q-priv.

5. CONCLUSION

We propose GuardHFL, an efficient and privacy-preserving HFL framework that formally provides privacy guarantees for query samples, model parameters and prediction results. The core constructions of GuardHFL are customized multiplication and comparison protocols based on lightweight secret sharing and PRF techniques. Extensive experiments demonstrate that GuardHFL outperforms prior art in both communication and runtime performance. We consider the following future directions. (1) Reducing the communication cost of GuardHFL, which is also a limitation of the standard HFL paradigm. One possible mitigation is to extend the k-regular graph insight from FL (Bell et al., 2020) to HFL and carefully design protocols from scratch; the main idea is that in FL it suffices for each party to communicate with k < n − 1 other parties via the server, where n is the number of parties. (2) Strengthening the security of GuardHFL to defend against more powerful malicious adversaries. Unfortunately, even with the best-known techniques (Koti et al., 2021), the overhead would increase by several orders of magnitude. We leave these improvements as future work.

A APPENDIX

A.1 MORE DETAILS ON EXPERIMENT EVALUATION

A.1.1 EXPERIMENTAL SETUP

Datasets. We evaluate GuardHFL on the following standard datasets for image classification: (1) SVHN is a real-world image dataset obtained from house numbers in Google Street View images, which contains 600,000 32×32 RGB images of printed digits from 0 to 9. (2) CIFAR10 consists of 60,000 32×32 RGB images in 10 classes. There are 50,000 training images and 10,000 test images. (3) Tiny ImageNet contains 100,000 images of 200 classes, downsized to 64×64 colored images. Each class has 500 training images, 50 validation images and 50 test images.

Experimental configuration. Each of the entities, i.e., P_Q, P_A and the server, runs on Ubuntu 18.04 with an Intel(R) Xeon(R) CPU E5-2620 v4 (2.10 GHz), 16 GB of RAM and an NVIDIA 1080Ti GPU. We leverage PyTorch's torch.distributed package for communication, similar to Knott et al. (2021); Tan et al. (2021). We run our benchmarks in the LAN setting, where, following Huang et al. (2022), the bandwidth is about 384 MBps and the latency is 0.3 ms.

Training procedure. In the local training phase, each client trains its local model from scratch using stochastic gradient descent. For SVHN, CIFAR10 and Tiny ImageNet, the loss function is cross-entropy with learning rates of 0.5, 0.1 and 0.01, and batch sizes of 256, 64 and 64, respectively. When the clients retrain the local model in the local re-training step, they use the Adam optimizer for 50 epochs with a learning rate of 2e-3 decayed by a factor of 0.1 at epoch 25; the batch size is 256 on SVHN and 64 on both CIFAR10 and Tiny ImageNet.

A.1.2 EXPERIMENTAL RESULTS

Impact of GPU acceleration. To further explore the impact of GPU acceleration, we evaluate GuardHFL on both CPU and GPU with different batch sizes of query data.

Impact of the number of query data. Figure 9 shows the accuracy of each heterogeneous model with different numbers of query samples. We observe that GuardHFL consistently improves model accuracy on the above datasets and heterogeneous models. Specifically, as the number of query samples increases (from 2.5K to 10K), the accuracy of all three models increases by about 5%.

Impact of the number of private training data. We observe that as the number of training samples increases, model performance rises. The main reason is that models learn better on more training data and can construct more synthetic samples to query, which promotes the transfer of model knowledge.

Impact of query data construction strategies. We consider three strategies: cutmix (Yun et al., 2019), cutout (DeVries & Taylor, 2017), and mixup (Zhang et al., 2018). Cutmix can be formulated as x̃_{i,j} = M ⊙ x_i + (1 - M) ⊙ x_j, where M ∈ {0, 1}^{W×H} is a binary mask matrix of size W × H indicating the locations of dropping out and filling in from the two images x_i and x_j. Cutout augments the dataset with partially occluded versions of original samples. Mixup constructs synthetic samples by a convex combination of two images x_i and x_j with coefficient λ, i.e., x̃_{i,j} = λ · x_i + (1 - λ) · x_j. We observe that these strategies are good choices for query data construction in GuardHFL.
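On flattened pixel vectors, the mixup and cutmix formulas above can be sketched as follows (helper names are illustrative; the paper's actual augmentation code is not assumed):

```python
def mixup(x_i, x_j, lam):
    # x~_{i,j} = lam * x_i + (1 - lam) * x_j, element-wise
    return [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]

def cutmix(x_i, x_j, mask):
    # x~_{i,j} = M * x_i + (1 - M) * x_j with a binary mask M,
    # i.e., pixels with mask 1 come from x_i, pixels with mask 0 from x_j
    return [m * a + (1 - m) * b for m, a, b in zip(mask, x_i, x_j)]
```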

A.2.1 GRAPHIC DEPICTION OF END-TO-END SECURE PREDICTION SCHEME

Figure 11 gives a graphic depiction of the end-to-end secure prediction scheme, where the input is a secret-shared sample [x]. The whole process maintains the following invariant: the server and P_A begin with secret shares of the input and, after each layer, end with secret shares (over the same ring) of the output. Honest-but-curious security of GuardHFL then follows from the sequential composability of the individual layers. Specifically, [x] first passes through a convolutional layer, formalized as the secure matrix multiplication ω_1[x] (ω_1 is the parameter) using protocol Π_Matmul in Figure 3. The secret-shared outputs of this layer, i.e., [y_1]_0 and [y_1]_1, are obtained by P_A and the server, respectively. For the subsequent ReLU layer, protocol Π_ReLU in Figure 4 is executed to return [y_2]_0 and [y_2]_1 to P_A and the server, respectively. Then Maxpooling on [y_2] is evaluated via protocol Π_ReLU as described in Section 3.2, outputting the secret-shared values [y_3]_0 and [y_3]_1. When the secure prediction reaches the final fully-connected layer with inputs [y_{n-1}]_0 and [y_{n-1}]_1, protocol Π_Matmul is executed. In the end, P_A and the server obtain the secret-shared predicted logit, i.e., [logit]_0 and [logit]_1, respectively.

GuardHFL is in line with the standard HFL paradigm (Li & Wang, 2019), with the additional benefit of privacy protection. As stated in the Introduction, GuardHFL is the first-of-its-kind privacy-preserving HFL framework, which differs from existing privacy-preserving training efforts. The latter can be divided into two categories: (1) privacy-preserving federated learning (Bonawitz et al., 2017; Bell et al., 2020), and (2) secure multi-party training (Tan et al., 2021; Keller & Sun, 2022). In the following, we give a detailed analysis.

Comparison to privacy-preserving federated learning.
In privacy-preserving federated learning (FL), each user locally computes the gradient on his private database, and a secure aggregation protocol is then executed at the server side to aggregate the local gradients and update the global model. However, as shown in the Introduction, secure gradient aggregation cannot be realized in heterogeneous FL (HFL), due to the heterogeneity of the clients' models. Instead, GuardHFL follows the general HFL training paradigm (Li & Wang, 2019), which contains three steps: local training, querying, and local re-training. GuardHFL focuses on solving the privacy issues raised by the querying stage, and mainly proposes a query dataset generation method (refer to Section 3.4) and a secure querying protocol (refer to Section 3.1 - Section 3.3).

A.2.4 EXTEND EXISTING 2PC PROTOCOLS TO HFL

With non-trivial adaptation, existing secure 2-party querying schemes (Mishra et al., 2020; Rathee et al., 2020; Huang et al., 2022) can be extended to the communication-limited HFL setting. However, as shown in Section 4.1, such extensions introduce expensive communication and computation overheads compared with GuardHFL. In the following, we divide these schemes into three categories, i.e., pure OT-based protocols, pure HE-based protocols, and hybrid protocols, and give the corresponding extension designs. To extend pure OT-based secure querying protocols such as CrypTFlow2 (Rathee et al., 2020) to HFL, P_Q first secret-shares query samples to the server and P_A using our protocol Π_Share in Section 3.1. Then the server and P_A execute secure prediction based on the method proposed in Rathee et al. (2020). After that, adopting our secure aggregation protocol Π_Agg in Section 3.3, the aggregated predictions are returned to P_Q. Although OT-based schemes can thus be extended to HFL by combining them with the protocols designed in GuardHFL, they require many communication rounds due to the usage of OT primitives. To extend pure HE-based secure querying protocols (Gilad-Bachrach et al., 2016; Lee et al., 2021) to HFL, P_Q first encrypts the query samples and asks the server to pass them to P_A. After that, P_A evaluates secure prediction non-interactively on ciphertexts and sends the encrypted predictions to the server. The server aggregates these encrypted predictions utilizing the additive homomorphism of HE and sends the aggregated results to P_Q. Although it is trivial to extend HE-based schemes to the communication-limited setting, they have two key problems: 1) activation functions need to be approximated by low-degree polynomials, which leads to serious accuracy loss; 2) due to the inherent high computation complexity, HE-based secure prediction is difficult to scale to large models.
For hybrid secure querying protocols that evaluate linear layers using HE and non-linear layers using OT or GC, such as Cheetah (Huang et al., 2022), we discuss the extended algorithms for each layer separately. For the linear layers: 1) P_Q encrypts query samples with HE and sends the ciphertext to P_A through the server. 2) P_A evaluates the linear layers locally and returns the encrypted masked outputs to P_Q through the server. 3) P_Q decrypts them to obtain the masked outputs, which are then sent to the server. As a result, the masked outputs of linear layers are secret-shared between the server and P_A. For the non-linear layers, given that the server and P_A hold shares of the linear layers' outputs, the two parties invoke OT protocols to evaluate the non-linear functions. Therefore, such extension comes at the cost of heavy computational and communication complexity. In summary, although existing 2PC protocols can be extended to the HFL setting with the cross-communication restriction, they sacrifice efficiency due to the lack of customized protocols and the adoption of heavy cryptographic primitives. Therefore, GuardHFL shows better adaptability and efficiency in natural HFL scenarios.

A.3 SECURITY ANALYSIS

Result aggregation functionality F_Agg:
• Input: Server: [x_i]_1 ∈ Z_{2^ℓ}, i ∈ [C]. P_A^i: [x_i]_0 ∈ Z_{2^ℓ}.
• Output: P_Q: y = softmax(Σ_{i=1}^{|C|} x_i).

Intuitively, GuardHFL only reveals the aggregated prediction to P_Q without the responding parties' model parameters, and the server and P_A learn zero information about the querying party's data, because all intermediate sensitive values are secret-shared. Next, we give a formal analysis. Our security proof follows the standard ideal-world/real-world paradigm (Canetti, 2001): in the real world, the three parties (i.e., the server, P_Q, and P_A) interact according to the protocol specification, while in the ideal world they have access to an ideal functionality shown in Table 5. When a protocol invokes another sub-protocol, we use the F-hybrid model for the security proof by replacing the sub-protocol with the corresponding functionality. Note that our proof works in the F_PRF-hybrid model, where F_PRF represents the ideal functionality corresponding to the PRF protocol. The executions in both worlds are coordinated by the environment Env, who chooses the inputs to the parties and plays the role of a distinguisher between the real and ideal executions.



As demonstrated in Bonawitz et al. (2017); Bell et al. (2020), clients (e.g., mobile devices) in real-world applications are generally widely distributed and coordinated only by the server. Choquette-Choo et al. (2021) presented a general collaborative learning scheme, called CaPC, which enables each party to improve his local model from others' models using the secure querying scheme of Boemer et al. (2019b). However, it cannot be directly applied to the HFL scenario as it requires cross-client communication; meanwhile, it causes prohibitively high overhead (refer to Section 4.1). The tuple (a, b, [c]_0, [c]_1) with the constraint c = ab in Z_{2^ℓ} can be seen as a variant of Beaver's multiplication triple; details are given in Appendix A.2.2. To clearly illustrate the efficiency of GuardHFL, unless otherwise specified, we only show the overhead of one user in each iteration as described in Section 3. To be more precise, this step is for the input layer; in the hidden layers, one of the input shares of the linear layer should be encrypted by the server and sent to P_A.



Figure 1: High-level view of GuardHFL.

In line with prior works (Phong et al., 2018; Sun & Lyu, 2021; Choquette-Choo et al., 2021), we consider an honest-but-curious adversary setting (Goldreich, 2009), where each entity (including the clients and the server) strictly follows the specification of the designed protocol but attempts to infer more knowledge about the private information of other clients. Moreover, to maintain its reputation and provide more services, the server does not collude with any clients; namely, an attacker either corrupts the server or a subset of clients, but not both.

Figure 2: Secure query-data sharing protocol Π Share

Secure MSB protocol Π_msb.
Input: the arithmetic shares [x]. Output: the boolean shares [msb(x)]^B.
1: Initiate g* and p* with size ℓ. Let g*_i and p*_i denote the i-th positions of g* and p*, respectively.
2: P_0 and P_1 set the bit decompositions of [x]_0 and [x]_1 to e_ℓ, ..., e_1 and f_ℓ, ..., f_1, respectively.
3: P_0 and P_1 invoke Π_AND with inputs e_i and f_i to obtain [g_i]^B.
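For intuition, the generate/propagate bits in Π_msb mirror a binary adder: msb(x) equals the XOR of the shares' top bits with the carry into the top position. The plaintext sketch below checks this arithmetic, using a simple ripple carry in place of the protocol's log-depth prefix tree (in the actual protocol the bits are secret-shared and the ANDs run under Π_AND):

```python
def msb_from_shares(x0, x1, ell=8):
    """Plaintext check: msb of x = (x0 + x1) mod 2^ell from the
    bit decompositions of the two additive shares."""
    e = [(x0 >> i) & 1 for i in range(ell)]  # bits of [x]_0, LSB first
    f = [(x1 >> i) & 1 for i in range(ell)]  # bits of [x]_1, LSB first
    g = [ei & fi for ei, fi in zip(e, f)]    # generate bits g_i = e_i AND f_i
    p = [ei ^ fi for ei, fi in zip(e, f)]    # propagate bits p_i = e_i XOR f_i
    c = 0                                    # carry into bit 0
    for i in range(ell - 1):                 # ripple carry up to the top bit
        c = g[i] | (p[i] & c)
    return e[ell - 1] ^ f[ell - 1] ^ c
```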

Figure 4: Secure ReLU protocol Π_ReLU.

Maxpooling. The evaluation of Maxpooling can be performed with the protocol Π_ReLU together with a tree-based round optimization that recursively partitions the values into two halves and compares the elements of each half. Precisely, the entities arrange the input of m elements into a 2-ary tree of depth log m, and evaluate the tree in a top-down fashion. In each comparison of two secret-shared elements [x] and [y], we use the observation that max([x], [y]) = ReLU([x] - [y]) + [y]. Hence the protocol complexity of Maxpooling mainly comes from the evaluation of m - 1 ReLUs. Besides, as illustrated in Wagh et al. (2019); Mishra et al. (2020), AvgPooling can be evaluated locally without communication.
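A plaintext sketch of this tree-based reduction (function names are illustrative; in the protocol each comparison is evaluated on secret shares between the server and P_A via Π_ReLU):

```python
def secure_max(x, y):
    # the identity used per comparison: max(x, y) = ReLU(x - y) + y
    return max(x - y, 0) + y

def maxpool(values):
    """Tree reduction over m elements: about log2(m) rounds and
    m - 1 pairwise max (i.e., ReLU) evaluations in total."""
    while len(values) > 1:
        nxt = [secure_max(values[i], values[i + 1])
               for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd leftover passes to the next round
            nxt.append(values[-1])
        values = nxt
    return values[0]
```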

Figure 5: Secure result aggregation protocol Π_Agg

3.4 DISCUSSION

Query data construction. Unlike existing HFL works relying on auxiliary datasets as the query data (Choquette-Choo et al., 2021; Lin et al., 2020), we demonstrate the feasibility of model knowledge transfer in GuardHFL by constructing a synthesized query set based on private training data, to alleviate potential limitations (e.g., privacy, acquisition and storage) of auxiliary datasets. A simple solution is to directly use the private training data to query, as in well-studied knowledge distillation (Hinton et al., 2015). Moreover, we also construct a synthesized dataset via the mixup method (Zhang et al., 2018) (Appendix A.1.2). The synthesized dataset construction is a universal and modular method, and it can be readily extended with advanced data augmentation strategies, such as cutout (DeVries & Taylor, 2017) and cutmix (Yun et al., 2019). Note that this process does not reveal any private information, since the samples are constructed locally by the querying party, without involving any other parties or their private datasets. We present some exploration and experiments in Appendix A.1.2 and Figure 10(c).

Figure 6: Accuracy curves of each heterogeneous model in GuardHFL. Left (SVHN); Right (CIFAR10).

Impact of Non-IID datasets. We illustrate the impact of Non-IID data on model accuracy in Figure 7, using CIFAR10 as an example. Figures 7(a), 7(b) and 7(c) visualize the distributions of Non-IID samples among clients with different Dir(α). When α = 100, the distribution is close to uniform sampling. When α = 0.5, the sample distribution of each class among clients is extremely uneven. From Figure 7(d), we observe that the higher the degree of Non-IID-ness, the lower the accuracy of models. Notably, GuardHFL can still significantly improve the performance of models in the Non-IID environment.
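The Dir(α) partitions visualized in Figure 7 can be generated with a sketch like the following (an illustrative plain-Python version; the paper's exact partitioning script is not assumed):

```python
import random

def dirichlet_proportions(alpha, num_clients, rng):
    # p ~ Dir(alpha): normalized Gamma(alpha, 1) draws
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
    total = sum(gammas)
    return [g / total for g in gammas]

def partition_non_iid(samples_per_class, num_clients, alpha, rng):
    """Split each class's sample indices across clients with Dir(alpha)
    proportions: small alpha -> highly skewed per-client distributions."""
    clients = [[] for _ in range(num_clients)]
    for cls, indices in enumerate(samples_per_class):
        props = dirichlet_proportions(alpha, num_clients, rng)
        start = 0
        for cid, p in enumerate(props):
            if cid == num_clients - 1:
                take = len(indices) - start        # remainder to last client
            else:
                take = min(round(p * len(indices)), len(indices) - start)
            clients[cid].extend((cls, i) for i in indices[start:start + take])
            start += take
    return clients
```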

Figure 8: The runtime of GuardHFL under CPU/GPU with varied batch sizes of query data on CIFAR10.

Figures 10(a) and 10(b) illustrate the model accuracy of GuardHFL under different numbers of private training samples on SVHN and CIFAR10.

Figure 9: The accuracy of each heterogeneous model as the number of query data increases. Dashed lines represent the model accuracy before GuardHFL.

A.2.2 MORE DETAILS ON CRYPTOGRAPHIC PROTOCOLS

Secret sharing and Beaver's multiplication protocol. As shown in Section 2.4, GuardHFL utilizes the additive secret sharing primitive to protect the privacy of sensitive information. Given two secret-shared values [x] and [y] held by two parties P_i, i ∈ {0, 1}, addition and subtraction ([z] = [x] ± [y] in Z_{2^ℓ}) can be realized locally without any communication, i.e., each P_i computes [z]_i = [x]_i ± [y]_i.
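A plaintext sketch of this arithmetic, together with a triple-based evaluation of the cross term [x]_0 · [y]_1 using a tuple (a, b, [c]_0, [c]_1) with c = ab (the share-computation formulas below are one consistent instantiation of that variant, not necessarily the paper's exact protocol):

```python
import random

MOD = 2**64  # ring Z_{2^ell} with ell = 64
rng = random.Random(0)

def share(x):
    # split x into two additive shares: x = [x]_0 + [x]_1 mod 2^ell
    x0 = rng.randrange(MOD)
    return x0, (x - x0) % MOD

def add_shares(sx, sy):
    # [z]_i = [x]_i + [y]_i, computed locally by each party
    return tuple((a + b) % MOD for a, b in zip(sx, sy))

def cross_term(x0, y1):
    """Shares of [x]_0 * [y]_1: P_0 holds (a, [c]_0), P_1 holds (b, [c]_1),
    with [c]_0 + [c]_1 = a * b mod 2^ell."""
    a, b = rng.randrange(MOD), rng.randrange(MOD)
    c0 = rng.randrange(MOD)
    c1 = (a * b - c0) % MOD
    m0 = (x0 + a) % MOD            # message sent P_0 -> P_1
    m1 = (y1 - b) % MOD            # message sent P_1 -> P_0
    z0 = (x0 * m1 - c0) % MOD      # P_0's output share
    z1 = (b * m0 - c1) % MOD       # P_1's output share
    return z0, z1                  # z0 + z1 = x0 * y1 mod 2^ell
```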

Comparison to secure multi-party training. Secure multi-party training typically targets an outsourced setting, where resource-constrained clients outsource the entire training task to multiple non-colluding servers in a privacy-preserving manner; it requires a secure training protocol to finally yield a well-trained model. Different from secure multi-party training, GuardHFL enables clients to collaboratively and securely train their own customized models that may differ in size and structure. Moreover, as discussed above, the general HFL paradigm contains three steps: local training, querying and local re-training, where the local training and re-training stages are evaluated locally without revealing privacy. Therefore, privacy-preserving HFL requires an HFL-friendly secure querying protocol (i.e., a customized inference protocol).

• Output: P_A: [x]_0 ∈ Z_{2^ℓ}. Server: [x]_1 = x - [x]_0 mod 2^ℓ.

Matrix multiplication functionality F_Matmul:
• Input: Server: [x]_1 ∈ Z_{2^ℓ}. P_A: [x]_0 ∈ Z_{2^ℓ}, model parameter ω.
• Output: Server: [y]_1 ∈ Z_{2^ℓ}. P_A: [y]_0 = ωx - [y]_1 mod 2^ℓ.

MSB functionality F_msb:
• Input: Server: [x]_1 ∈ Z_{2^ℓ}. P_A: [x]_0 ∈ Z_{2^ℓ}.
• Output: Server: [msb(x)]^B_1 ∈ Z_2. P_A: [msb(x)]^B_0 = msb(x) ⊕ [msb(x)]^B_1 mod 2.

ReLU functionality F_ReLU:
• Input: Server: [x]_1 ∈ Z_{2^ℓ}. P_A: [x]_0 ∈ Z_{2^ℓ}.
• Output: Server: [y]_1 ∈ Z_{2^ℓ}. P_A: [y]_0 = ReLU(x) - [y]_1 mod 2^ℓ.

Maxpooling functionality F_Maxpool:
• Input: Server: [x]_1 ∈ Z_{2^ℓ}. P_A: [x]_0 ∈ Z_{2^ℓ}.
• Output: Server: [y]_1 ∈ Z_{2^ℓ}. P_A: [y]_0 = Maxpool(x) - [y]_1 mod 2^ℓ.

Algorithm 1 The GuardHFL framework Input: Each client Pj, j ∈ [n], holds a private dataset Dj and a customized local model Mj. iter is the number of iterations. B is the number of query samples and C is the set of selected responding parties in the current query-response phase.
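For intuition, the query-response phase of Algorithm 1 can be simulated in plaintext as follows (all names are illustrative placeholders; in GuardHFL this phase runs under the secure protocols of Section 3, with aggregation as in F_Agg, i.e., softmax of the sum of the responders' logits):

```python
import math

def softmax(logits):
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(responses):
    # F_Agg in the clear: softmax of the element-wise sum of logits
    summed = [sum(col) for col in zip(*responses)]
    return softmax(summed)

def run_iteration(clients, query_batch, responders):
    """One query-response phase: every client queries the selected
    responders and collects aggregated soft labels for retraining."""
    retrain_sets = {}
    for name in clients:
        labels = []
        for x in query_batch:
            responses = [clients[r](x) for r in responders if r != name]
            labels.append(aggregate(responses))
        retrain_sets[name] = list(zip(query_batch, labels))
    return retrain_sets
```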

Else, P_0 and P_1 invoke Π_AND with inputs [g*_{2i-1}]^B and [p*_{2i}]^B to obtain [t_i]^B, and set [g*_i]^B = [g*_{2i}]^B ⊕ [t_i]^B. P_0 and P_1 invoke Π_AND with inputs [g*_1]^B and [p*_2]^B to obtain [t_1]^B, and set [g*_1]^B = [g*_2]^B ⊕ [t_1]^B.

P0 and P1 invoke Π AND with inputs [p * 2i-1 ] B and [p * 2i-2 ] B to obtain [p * i ] B . Else P0 and P1 invoke Π AND with inputs [p * 2i ] B and [p * 2i-1 ] B to obtain [p * i ] B .

Extra runtime (sec) of GuardHFL over vanilla HFL systems in the plaintext environment. CIFAR10 and SVHN have the same runtime due to the same input size and model architecture.

Comparison with CaPC on runtime (sec) over MNIST and three heterogeneous models as the batch size (BS) of query data increases.

Comparison with advanced secure prediction protocols on runtime (sec) and communication (MB) cost over three heterogeneous models.

The model accuracy of three datasets in GuardHFL on different ratios of participating clients (0.6, 0.8 and 1), and querying strategies (Q-priv and Q-syn).

The ideal functionality.

Input sharing functionality F_Share:
• Input: P_Q: query data x.

Figure 11: End-to-end secure prediction: Conv (Π_Matmul): [y_1] = ω_1[x] → ReLU (Π_ReLU): [y_2] = ReLU([y_1]) → MaxPooling (Π_ReLU): [y_3] = maxpool([y_2]) → … → FC (Π_Matmul): [logit] = ω_k[y_{n-1}].
The multiplication of two secret-shared values, [z] = [x] · [y], is evaluated using Beaver's multiplication triples (Demmler et al., 2015), where each triple refers to (a, b, c) with the constraint c = ab, generated by cryptographic techniques (Demmler et al., 2015) or a trusted dealer (Riazi et al., 2018). Specifically, the multiplication over secret-shared values can be evaluated as
[z] = [x][y] = ([x]_0 + [x]_1)([y]_0 + [y]_1) = [x]_0[y]_0 + [x]_0[y]_1 + [x]_1[y]_0 + [x]_1[y]_1,
where for i ∈ {0, 1}, [x]_i[y]_i can be computed locally, and [x]_i[y]_{1-i} is evaluated as follows. Taking [x]_0[y]_1 as an example, assume P_0 and P_1 already hold (a, [c]_0) and (b, [c]_1), respectively. P_0 first sends [x]_0 + a to P_1, while P_1 sends [y]_1 - b to P_0. Then P_0 computes one share as [x]_0([y]_1 - b) - [c]_0, and P_1 computes the other share as b([x]_0 + a) - [c]_1; the two shares sum to [x]_0[y]_1 since [c]_0 + [c]_1 = ab. In this way, the outputs are still in the form of secret sharing.

Diffie-Hellman key agreement protocol. In GuardHFL, we utilize PRFs to overcome the cross-client communication limitation, where the consistent PRF seed between clients is generated using the Diffie-Hellman key agreement (DH) protocol (Diffie & Hellman, 1976). Note that the consistent seed between the server and a client can be directly sampled by the server and then sent to the client without the DH protocol. Figure 12 gives the secure seed generation protocol Π_seed. Formally, the DH protocol consists of the following three steps:
• DH.param(k) → (G, g, q, H) generates a group G of prime order q, along with a generator g and a hash function H.
• DH.gen(G, g, q, H) → (x_i, g^{x_i}) randomly samples x_i ∈ Z_q as the secret key and lets g^{x_i} be the public key.
• DH.agree(x_i, g^{x_j}, H) → s_{i,j} outputs the seed s_{i,j} = H((g^{x_j})^{x_i}).
Correctness requires that for any key pairs (x_i, g^{x_i}) and (x_j, g^{x_j}) generated by two parties P_i and P_j using DH.gen under the same parameters (G, g, q, H), DH.agree(x_i, g^{x_j}, H) = DH.agree(x_j, g^{x_i}, H).
Besides, security in GuardHFL requires that for any adversary who steals g^{x_i} and g^{x_j} (but neither of the corresponding x_i and x_j), the agreed secret s_{i,j} derived from those keys is indistinguishable from a uniformly random value (Abdalla et al., 2001). We will show that the real-world distribution is computationally indistinguishable from the ideal-world distribution.

Theorem A.1. Π_Share securely realizes the functionality F_Share in the F_PRF-hybrid model.

Proof. Note that P_Q and P_A receive no messages in Π_Share, and hence the protocol is trivially secure against corruption of P_Q and P_A. The only message that the server receives is the value [x]_1. However, [x]_1 = x - r, where, given the security of the PRF, r is a random value unknown to the server. Thus, the distribution of [x]_1 is uniformly random from the server's view, and the information learned by the server can be simulated.

Theorem A.2. Π_Matmul securely realizes the functionality F_Matmul in the F_PRF-hybrid model.

Proof. Note that P_Q receives no messages in Π_Matmul, and hence the protocol is trivially secure against corruption of P_Q. The only message that P_A receives is the value [x]_1 - b. However, given the security of the PRF, b is a random value unknown to P_A. Thus, the distribution of [x]_1 - b is computationally indistinguishable from a uniformly random distribution in P_A's view, and the information learned by P_A can be simulated. Next, during the protocol, the server learns [c]_1 and ω + a. However, the distribution of [c]_1 and ω + a is computationally indistinguishable from a uniformly random distribution in the server's view, since, given the security of the PRF, a and [c]_1 are random values unknown to the server. Thus, the information learned by the server can be simulated.

Theorem A.3. Π_ReLU securely realizes the functionality F_ReLU in the (F_Matmul, F_msb)-hybrid model.

Proof.
Note that, as shown in Section 3.2, Π_ReLU consists of Π_msb and Π_Matmul. Therefore, the ReLU protocol is trivially secure in the (F_Matmul, F_msb)-hybrid model.

Theorem A.4. Π_Maxpool securely realizes the functionality F_Maxpool in the F_ReLU-hybrid model.

Proof. As shown in Section 3.2, Π_Maxpool consists of several invocations of Π_ReLU. Therefore, the protocol Π_Maxpool is trivially secure in the F_ReLU-hybrid model.

Theorem A.5. Π_Agg securely realizes the functionality F_Agg in the F_PRF-hybrid model.

Proof. Note that P_A receives no messages in Π_Agg, and hence the aggregation protocol is trivially secure against corruption of P_A. The only message that the server receives is the value [x_i]_0 - r_i. However, given the security of the PRF, r_i is a random value unknown to the server. Thus, the distribution of [x_i]_0 - r_i is computationally indistinguishable from a uniformly random distribution in the server's view, and the information learned by the server can be simulated. After the aggregation, P_Q only learns the aggregated result Σ_{i∈[C]} x_i, but learns nothing about each individual x_i. Therefore, the aggregation protocol is secure, assuming the aggregated result itself does not reveal private information.
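The PRF-based masking argument used in Theorems A.1, A.2 and A.5 can be illustrated with a small sketch (HMAC-SHA256 is used here as a stand-in PRF; the paper's concrete PRF instantiation is not assumed):

```python
import hmac, hashlib

MOD = 2**64  # ring Z_{2^ell} with ell = 64

def prf(seed: bytes, ctr: int) -> int:
    # stand-in PRF: HMAC-SHA256 of a counter, truncated into the ring
    digest = hmac.new(seed, ctr.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def share_query(x: int, seed: bytes, ctr: int):
    """Pi_Share sketch: P_Q and P_A hold a common PRF seed, so P_A
    derives its share r = PRF(seed, ctr) without any communication;
    P_Q sends only the masked value [x]_1 = x - r to the server."""
    r = prf(seed, ctr)            # P_A's share [x]_0, computed locally
    x1 = (x - r) % MOD            # the server's share [x]_1
    return r, x1
```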

A.4 RELATED WORK

A.4.1 HETEROGENEOUS FEDERATED LEARNING

Federated learning (FL) achieves collaboration among clients via sharing model gradients. While successful, it still faces many challenges, among which, of particular importance is the heterogeneity that appears in all aspects of the learning process. This includes model heterogeneity (Li & Wang, 2019) and statistical heterogeneity (Zhu et al., 2021). Statistical heterogeneity means that parties' data comes from distinct distributions (i.e., is non-IID). Our work mainly focuses on model heterogeneity, which has been explored in recent works (Li & Wang, 2019; Lin et al., 2020; Choquette-Choo et al., 2021). In particular, Li & Wang (2019) proposed FedMD, the first FL framework supporting heterogeneous models, by combining transfer learning and knowledge distillation techniques. Each party first pre-trains its model on a public dataset and transfers it to the task of the private dataset. After that, to exchange knowledge, each party uses the public data and the aggregated predictions from others as the carrier for knowledge distillation. To further improve model accuracy, Lin et al. (2020) proposed FedDF, which, similar to FedMD, also uses model distillation for knowledge sharing. The difference is that FedDF first performs FedAvg on parties' local models and then applies knowledge distillation to the aggregated model; the dependence on model averaging limits the degree of model heterogeneity. Besides, Diao et al. (2021) focused on heterogeneous parties equipped with different computation and communication capabilities. In their framework, each party only updates a subset of the global model parameters by varying the width of hidden channels, which reduces the computation and communication complexity of local models. However, this approach only learns a single global model, rather than unique models designed by the parties.
Moreover, as described in the Introduction, HFL suffers from several privacy issues that are not considered in the above works. The first is the direct leakage of query samples. Task-related query samples may contain private information such as disease diagnoses in healthcare. If the querying party directly delivers query samples to responding parties for prediction via the server, such sensitive information is leaked. The second is implicit disclosure via inference attacks. Given black-box access to a model, adversaries can infer the membership (Salem et al., 2019) and attribute information (Ganju et al., 2018) of the target sample or even reconstruct the original training data (Yang et al., 2019). For example, based on the prediction y of the query sample x from the responding party P_A, the querying party can launch a membership inference attack (Salem et al., 2019) to infer whether x is a training sample of P_A. To address these privacy concerns, GuardHFL provides end-to-end privacy-preserving HFL services. Privacy-preserving techniques (i.e., secure aggregation) have been studied in federated learning (Bonawitz et al., 2017; Bell et al., 2020; Sav et al., 2021; Jayaraman & Wang, 2018). However, these techniques cannot be directly extended to privacy-preserving HFL. More recently, Sun & Lyu (2021) proposed a noise-free differential privacy solution for HFL to guarantee each party's privacy. However, as shown in Jayaraman & Evans (2019), there is a huge gap between the upper bounds on privacy loss analyzed by advanced mechanisms and the effective privacy loss; thus, differentially private mechanisms offer undesirable utility-privacy trade-offs. To further formally guarantee privacy, Choquette-Choo et al. (2021) leveraged hybrid cryptographic primitives to realize confidential and private collaborative learning.
Specifically, parties learn from each other collaboratively utilizing a secure inference strategy based on 2PC and HE protocols together with a private aggregation method. As noted in the Introduction, CaPC's use of heavy cryptography leads to significant efficiency and communication overheads. In summary, we give a comparison between prior FL works and GuardHFL in Table 6.

A.4.2 SECURE NEURAL NETWORK PREDICTION

Since secure prediction is a critical component of GuardHFL, we briefly review its recent progress. Neural networks present a challenge to cryptographic protocols due to their unique structure and their combination of linear computations and non-linear activation functions. In real scenarios, model prediction can be viewed as a two-party computation, where one party with private query data wants to obtain prediction results from the other party who owns the model. During the whole process, cryptographic protocols, typically HE and secure multi-party computation (MPC), are applied to ensure the confidentiality of the private data and model parameters. Many existing works (Boemer et al., 2019b; Gilad-Bachrach et al., 2016; Brutzkus et al., 2019) build on pure HE protocols for secure prediction. Typically, nGraph-HE (Boemer et al., 2019b;a) performs linear computations using the CKKS HE scheme. However, since a solution built upon HE is restricted to computing low-degree polynomials, non-polynomial activation functions such as Maxpooling and ReLU are forced to be evaluated in the clear by the party who owns the private query data. This leaks the feature maps, from which adversaries may deduce the model weights. To avoid this, Gilad-Bachrach et al. (2016) and Chen et al. (2019) use low-degree polynomial approximations of non-linear functions. Unfortunately, this affects inference accuracy while incurring huge computation overhead. On the other hand, several libraries (Mohassel & Zhang, 2017; Knott et al., 2021; Wagh et al., 2019) primarily employ MPC techniques for secure prediction, providing linear and non-linear protocols through oblivious transfer (OT), garbled circuits (GC) and secret sharing. In particular, CrypTen (Knott et al., 2021) performs linear operations based on n-out-of-n additive secret sharing over the ring Z_{2^ℓ}.
However, it requires a trusted third party to assist the secure prediction process, which is unrealistic in real-world settings. CryptGPU (Tan et al., 2021) further accelerates secret-sharing-based secure computation on GPUs. The protocols of Keller & Sun (2022) are designed for a specific network architecture, i.e., quantized neural networks, and cannot be applied to the general models in GuardHFL; therefore, Keller & Sun (2022) and GuardHFL are fully orthogonal works. Many other works focus on hybrid protocols, which combine the advantages of HE and MPC to improve prediction efficiency (Juvekar et al., 2018; Mishra et al., 2020; Rathee et al., 2020; Huang et al., 2022). CrypTFlow2 (Rathee et al., 2020) implements two classes of protocols, HE-based and OT-based, for linear operations; for non-linear layers, it designs efficient OT-based protocols. Cheetah (Huang et al., 2022) improves CrypTFlow2 with customized HE-based linear protocols and improved OT-based non-linear protocols. HE-transformer (Boemer et al., 2019a) employs nGraph-HE for linear operations and the ABY framework's GC for non-linear functions (Demmler et al., 2015). Since non-linear operations cannot be parallelized across query data, GC is inefficient, especially for large networks with thousands of parameters. In contrast, GuardHFL avoids heavy cryptographic tools like HE and OT, and only employs secret sharing and PRFs to achieve high efficiency, confidentiality and practicability.

