THE BEST OF BOTH WORLDS: ACCURATE GLOBAL AND PERSONALIZED MODELS THROUGH FEDERATED LEARNING WITH DATA-FREE HYPER-KNOWLEDGE DISTILLATION

Abstract

Heterogeneity of data distributed across clients limits the performance of global models trained through federated learning, especially in settings with highly imbalanced class distributions of local datasets. In recent years, personalized federated learning (pFL) has emerged as a potential solution to the challenges presented by heterogeneous data. However, existing pFL methods typically enhance the performance of local models at the expense of the global model's accuracy. We propose FedHKD (Federated Hyper-Knowledge Distillation), a novel FL algorithm in which clients rely on knowledge distillation (KD) to train local models. In particular, each client extracts and sends to the server the means of local data representations and the corresponding soft predictions, information that we refer to as "hyper-knowledge". The server aggregates this information and broadcasts it to the clients in support of local training. Notably, unlike other KD-based pFL methods, FedHKD neither relies on a public dataset nor deploys a generative model at the server. We analyze the convergence of FedHKD and conduct extensive experiments on visual datasets in a variety of scenarios, demonstrating that FedHKD provides significant improvement in both personalized and global model performance compared to state-of-the-art FL methods designed for heterogeneous data settings.

1. INTRODUCTION

Federated learning (FL), a communication-efficient and privacy-preserving alternative to training on centrally aggregated data, relies on collaboration between clients who own local data to train a global machine learning model. A central server coordinates the training without violating clients' privacy: the server has no access to the clients' local data. The first such scheme, Federated Averaging (FedAvg) (McMahan et al., 2017), alternates between two steps: (1) randomly selected client devices initialize their local models with the global model received from the server and proceed to train on local data; (2) the server collects local model updates and aggregates them via weighted averaging to form a new global model. As analytically shown in (McMahan et al., 2017), FedAvg is guaranteed to converge when the client data is independent and identically distributed (iid). A major problem in FL systems emerges when the clients' data is heterogeneous (Kairouz et al., 2021). This is a common setting in practice since the data owned by clients participating in federated learning is likely to have originated from different distributions. In such settings, the FL procedure may converge slowly and the resulting global model may perform poorly on the local data of an individual client. To address this challenge, a number of FL methods aiming to enable learning on non-iid data have recently been proposed (Karimireddy et al., 2020; Li et al., 2020; 2021a; Acar et al., 2021; Liu et al., 2021; Yoon et al., 2021; Chen & Vikalo, 2022). Unfortunately, these methods struggle to train a global model that performs well when the clients' data distributions differ significantly.
Difficulties of learning on non-iid data, as well as the heterogeneity of the clients' resources (e.g., compute, communication, memory, power), motivated a variety of personalized FL (pFL) techniques (Arivazhagan et al., 2019; T Dinh et al., 2020; Zhang et al., 2020; Huang et al., 2021; Collins et al., 2021; Tan et al., 2022). In a pFL system, each client leverages information received from the server and utilizes a customized objective to locally train its personalized model. Instead of focusing on global performance, a pFL client is concerned with improving the model's local performance, empirically evaluated by running the local model on data whose distribution is similar to the distribution of local training data. Since most personalized FL schemes remain reliant upon gradient or model aggregation, they are highly susceptible to 'stragglers' that slow down the convergence of training. FedProto (Tan et al., 2021) was proposed to address the high communication cost and the limitation to homogeneous models in federated learning. Instead of model parameters, in FedProto each client sends to the server only the class prototypes, i.e., the means of the representations of the samples in each class. Aggregating the prototypes rather than model updates significantly reduces communication costs and lifts the requirement of FedAvg that clients must deploy the same model architecture. However, even though FedProto improves local validation accuracy by utilizing aggregated class prototypes, it leads to barely any improvement in global performance. Motivated by the success of Knowledge Distillation (KD) (Hinton et al., 2015), which treats soft predictions of samples as the 'knowledge' extracted from a neural network, a number of FL methods that aim to improve the global model's generalization ability have been proposed (Jeong et al., 2018b; Li & Wang, 2019; Lin et al., 2020; Zhang et al., 2021).
However, most of the existing KD-based FL methods require that a public dataset be provided to all clients, limiting their feasibility in practical settings. In this paper we propose FedHKD (Federated Hyper-Knowledge Distillation), a novel FL framework that relies on prototype learning and knowledge distillation to facilitate training on heterogeneous data. Specifically, the clients in FedHKD compute mean representations and the corresponding mean soft predictions for the data classes in their local training sets; this information, which we refer to as "hyper-knowledge," is protected by differential privacy via the Gaussian mechanism and sent for aggregation to the server. The resulting globally aggregated hyper-knowledge is used by clients in the subsequent training epoch and leads to better personalized and global performance. A number of experiments on classification tasks involving the SVHN (Netzer et al., 2011), CIFAR10 and CIFAR100 datasets demonstrate that FedHKD consistently outperforms state-of-the-art approaches in terms of both local and global accuracy.

2. RELATED WORK

2.1 HETEROGENEOUS FEDERATED LEARNING

The majority of existing work on federated learning across data-heterogeneous clients can be organized into three categories. The first set of methods aims to reduce the variance of local training by introducing regularization terms in the local objectives (Karimireddy et al., 2020; Li et al., 2020; 2021a; Acar et al., 2021). (Mendieta et al., 2022) analyze regularization-based FL algorithms and, motivated by the regularization technique GradAug in centralized learning (Yang et al., 2020), propose FedAlign. Another set of techniques for FL on heterogeneous client data aims to replace the naive model update averaging strategy of FedAvg by more efficient aggregation schemes. To this end, PFNM (Yurochkin et al., 2019) applies a Bayesian non-parametric method to select and merge multi-layer perceptron (MLP) layers from local models into a more expressive global model in a layer-wise manner. FedMA (Wang et al., 2020a) proceeds further in this direction and extends the same principle to CNNs and LSTMs. (Wang et al., 2020b) analyze the convergence of heterogeneous federated learning and propose a novel normalized averaging method. Finally, the third set of methods utilizes either the mixup mechanism (Zhang et al., 2017) or generative models to enrich the diversity of local datasets (Yoon et al., 2021; Liu et al., 2021; Chen & Vikalo, 2022). However, these methods introduce additional memory/computation costs and increase the required communication resources.

2.2. PERSONALIZED FEDERATED LEARNING

Motivated by the observation that a global model collaboratively trained on highly heterogeneous data may not generalize well on clients' local data, a number of personalized federated learning (pFL) techniques aiming to train customized local models have been proposed (Tan et al., 2022). They can be categorized into two groups depending on whether or not they also train a global model. The pFL techniques focused on global model personalization follow a procedure similar to plain vanilla FL: clients still need to upload all or a subset of model parameters to the server to enable global model aggregation. The global model is personalized by each client via local adaptation steps such as fine-tuning (Wang et al., 2019; Hanzely et al., 2020; Schneider & Vlachos, 2021), creating a mixture of global and local layers (Arivazhagan et al., 2019; Mansour et al., 2020; Deng et al., 2020; Zec et al., 2020; Hanzely & Richtárik, 2020; Collins et al., 2021; Chen & Chao, 2021), regularization (T Dinh et al., 2020; Li et al., 2021b) and meta learning (Jiang et al., 2019; Fallah et al., 2020). However, when the resources available to different clients vary, it is impractical to require that all clients train models of the same size and type. To address this, some works forgo the global model by adopting multi-task learning (Smith et al., 2017) or hyper-network frameworks (Shamsian et al., 2021). Inspired by prototype learning (Snell et al., 2017; Hoang et al., 2020; Michieli & Ozay, 2021), FedProto (Tan et al., 2021) utilizes aggregated class prototypes received from the server to align clients' local objectives via a regularization term; since no model parameters are transmitted between clients and the server, this scheme requires relatively low communication resources. Although FedProto improves the local test accuracy of the personalized models, it does not benefit global performance.

2.3. FEDERATED LEARNING WITH KNOWLEDGE DISTILLATION

Knowledge Distillation (KD) (Hinton et al., 2015), a technique capable of extracting knowledge from a neural network by exchanging soft predictions instead of the entire model, has been introduced to federated learning to aid with the issues that arise due to variations in resources (computation, communication and memory) available to the clients (Jeong et al., 2018a; Chang et al., 2019; Itahara et al., 2020). FedMD (Li & Wang, 2019), FedDF (Lin et al., 2020) and FedKT-pFL (Zhang et al., 2021) transmit only soft predictions as the knowledge exchanged between the server and clients, allowing for personalized/heterogeneous client models. However, these KD-based federated learning methods require that a public dataset be made available to all clients, presenting potential practical challenges. Recent studies (Zhu et al., 2021; Zhang et al., 2022) explored using GANs (Goodfellow et al., 2014) to enable data-free federated knowledge distillation in the context of image classification tasks; however, training GANs incurs considerable additional computation and memory requirements. In summary, most of the existing KD-based schemes require a shared dataset to help align local models; others incur considerable computational cost either to synthesize artificial data or to deploy a student model at the server and update it with local gradients obtained by minimizing the divergence between the soft predictions of the clients' teacher models and those of the student on local data (Lin et al., 2020). In our framework, we extend the concept of knowledge to 'hyper-knowledge', combining class prototypes and soft predictions on local data to improve both the local test accuracy and the global generalization ability of federated learning.

3. METHODOLOGY

3.1 PROBLEM FORMULATION

Consider a federated learning system in which $m$ clients own local private datasets $D_1, \dots, D_m$; the distributions of the datasets may vary across clients, including the scenario in which a local dataset contains samples from only a fraction of classes. In such an FL system, the clients communicate locally trained models to the server which, in turn, sends the aggregated global model back to the clients. The plain vanilla federated learning (McMahan et al., 2017) implements aggregation as
$$w^t = \sum_{i=1}^{m} \frac{|D_i|}{M} w_i^{t-1},$$
where $w^t$ denotes parameters of the global model at round $t$; $w_i^{t-1}$ denotes parameters of the local model of client $i$ at round $t-1$; $m$ is the number of participating clients; and $M = \sum_{i=1}^{m} |D_i|$. The clients are typically assumed to share the same model architecture. Our aim is to learn a personalized model $w_i$ for each client $i$ which not only performs well on data generated from the distribution of the $i$-th client's local training data, but can further be aggregated into a global model $w$ that performs well across all data classes (i.e., enables accurate global model performance). This is especially difficult when the data is heterogeneous, since straightforward aggregation in such scenarios is likely to yield a global model with inadequate performance.
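As a minimal sketch of the FedAvg-style aggregation above (assuming, for illustration, that each client's parameters are stored as NumPy arrays keyed by layer name; the function name is ours):

```python
import numpy as np

def fedavg_aggregate(local_params, dataset_sizes):
    """Weighted average w^t = sum_i (|D_i| / M) * w_i^{t-1}, with M = sum_i |D_i|."""
    M = float(sum(dataset_sizes))
    return {
        name: sum((n / M) * params[name] for n, params in zip(dataset_sizes, local_params))
        for name in local_params[0]
    }
```

With two clients holding 1 and 3 samples, the second client's parameters receive weight 0.75 in the average.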

3.2. UTILIZING HYPER-KNOWLEDGE

Knowledge distillation (KD) based federated learning methods that rely on a public dataset require clients to deploy local models to run inference / make predictions for the samples in the public dataset; the models' outputs are then used to form soft predictions according to
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},$$
where $z_i$ denotes the $i$-th element in the model's output $z$ for a given data sample; $q_i$ is the $i$-th element in the soft prediction $q$; and $T$ is the so-called "temperature" parameter. The server collects soft predictions from clients (local knowledge), aggregates them into global soft predictions (global knowledge), and sends them to clients to be used in the next training round. Performing inference on the public dataset introduces additional computation in each round of federated learning, while sharing and locally storing public datasets consumes communication and memory resources. It would therefore be beneficial to develop KD-based methods that do not require public datasets; synthesizing artificial data is an option, but one that is computationally costly and thus may be impractical. To this end, we extend the notion of distilled knowledge to include both the averaged representations and the corresponding averaged soft predictions, and refer to it as "hyper-knowledge"; the hyper-knowledge is protected via the Gaussian differential privacy mechanism and shared between clients and the server. Feature Extractor and Classifier. We consider image classification as an illustrative use case. Typically, a deep network for classification tasks consists of two parts (Kang et al., 2019): (1) a feature extractor translating the input raw data (i.e., an image) into a latent space representation; (2) a classifier mapping representations into categorical vectors.
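The temperature-scaled softmax above can be sketched as follows (the function name is ours; subtracting the maximum is a standard numerical-stability trick):

```python
import numpy as np

def soft_targets(z, T):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T); larger T yields softer predictions."""
    scaled = z / T
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    e = np.exp(scaled)
    return e / e.sum()
```

A low temperature sharpens the distribution toward the argmax, while a high temperature flattens it toward uniform.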
Formally, $h_i = R_{\phi_i}(x_i)$ and $z_i = G_{\omega_i}(h_i)$, where $x_i$ denotes raw data of client $i$; $R_{\phi_i}(\cdot)$ and $G_{\omega_i}(\cdot)$ are the embedding functions of the feature extractor and the classifier with model parameters $\phi_i$ and $\omega_i$, respectively; $h_i$ is the representation vector of $x_i$; and $z_i$ is the categorical vector. Evaluating and Using Hyper-Knowledge. The mean latent representation and the mean soft prediction of class $j$ in the local dataset of client $i$ are computed as
$$\bar{h}_i^j = \frac{1}{N_i^j} \sum_{k=1}^{N_i^j} h_i^{j,k}, \qquad \bar{q}_i^j = \frac{1}{N_i^j} \sum_{k=1}^{N_i^j} Q(z_i^{j,k}, T),$$
where $N_i^j$ is the number of samples with label $j$ in client $i$'s dataset; $Q(\cdot, T)$ is the soft target function; and $h_i^{j,k}$ and $z_i^{j,k}$ are the data representation and prediction of the $i$-th client's $k$-th sample with label $j$. The mean latent data representation $\bar{h}_i^j$ and soft prediction $\bar{q}_i^j$ form the hyper-knowledge of class $j$ in client $i$; for convenience, we denote $K_i^j = (\bar{h}_i^j, \bar{q}_i^j)$. If there are $n$ classes, the full hyper-knowledge of client $i$ is $K_i = \{K_i^1, \dots, K_i^n\}$. In comparison, FedProto (Tan et al., 2021) utilizes only the means of data representations and makes no use of soft predictions. Note that to avoid situations where $K_i^j = \emptyset$, which may happen when data is highly heterogeneous, FedHKD sets a threshold (tunable hyper-parameter) $\nu$ used to decide whether or not a client should share its hyper-knowledge; in particular, if the fraction of samples with label $j$ in the local dataset of client $i$ is below $\nu$, client $i$ is not allowed to share the hyper-knowledge $K_i^j$. If no participating client shares hyper-knowledge for class $j$, the server sets $K^j = \emptyset$. A flow diagram illustrating the computation of hyper-knowledge is given in Appendix A.3. Differential Privacy Mechanism. It has previously been argued that communicating averaged data representations promotes privacy (Tan et al., 2021); however, hyper-knowledge exchanged between the server and clients may still be exposed to differential attacks (Dwork, 2008; Geyer et al., 2017).
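The per-class hyper-knowledge computation, including the sharing threshold $\nu$, can be sketched as below (a hedged illustration: function names are ours, and the tiny softmax helper stands in for the soft target function $Q(\cdot, T)$):

```python
import numpy as np

def softmax_T(z, T):
    """Soft target function Q(z, T): temperature-scaled softmax."""
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def local_hyper_knowledge(reps, logits, labels, n_classes, nu=0.25, T=0.5):
    """Per-class hyper-knowledge K_i^j = (mean representation, mean soft prediction).
    Classes holding at most nu * |D_i| local samples are withheld (set to None)."""
    N = len(labels)
    K = {}
    for j in range(n_classes):
        idx = np.flatnonzero(labels == j)
        if len(idx) <= nu * N:  # sharing threshold nu
            K[j] = None
            continue
        h_bar = reps[idx].mean(axis=0)                                    # \bar{h}_i^j
        q_bar = np.mean([softmax_T(logits[k], T) for k in idx], axis=0)   # \bar{q}_i^j
        K[j] = (h_bar, q_bar)
    return K
```

A class held by too few local samples is withheld, matching the threshold rule described above.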
A number of studies (Geyer et al., 2017; Sun et al., 2021; Gong et al., 2021; Ribero et al., 2022; Chen & Vikalo, 2022) that utilize differential privacy to address security concerns in federated learning have been proposed. The scheme presented in this paper promotes privacy by protecting the shared means of data representations through a differential privacy (DP) mechanism (Dwork et al., 2006a;b) defined below.

Definition 1 ((ε, δ)-Differential Privacy). A randomized function $\mathcal{F}: \mathcal{D} \rightarrow \mathcal{R}$ provides $(\varepsilon, \delta)$-differential privacy if for all adjacent datasets $d, d' \in \mathcal{D}$ differing in at most one element, and all $S \subseteq \mathrm{range}(\mathcal{F})$, it holds that
$$P[\mathcal{F}(d) \in S] \le e^{\varepsilon} P[\mathcal{F}(d') \in S] + \delta,$$
where $\varepsilon$ denotes the maximum distance between the outputs of $\mathcal{F}(d)$ and $\mathcal{F}(d')$ and may be thought of as the allotted privacy budget, while $\delta$ is the probability that the maximum distance is not bounded by $\varepsilon$. Any deterministic function $f: \mathcal{D} \rightarrow \mathcal{R}$ can be endowed with arbitrary $(\varepsilon, \delta)$-differential privacy via the Gaussian mechanism, defined next.

Theorem 1 (Gaussian mechanism). A randomized function $\mathcal{F}$ derived from a deterministic function $f: \mathcal{D} \rightarrow \mathcal{R}$ by perturbation with Gaussian noise $\mathcal{N}(0, S_f^2 \cdot \sigma^2)$,
$$\mathcal{F}(d) = f(d) + \mathcal{N}(0, S_f^2 \cdot \sigma^2),$$
achieves $(\varepsilon, \delta)$-differential privacy for any $\sigma > \sqrt{2 \ln(1.25/\delta)}/\varepsilon$. Here $S_f$ denotes the sensitivity of function $f$, defined as the maximum absolute distance $|f(d) - f(d')|$.

We proceed by defining a deterministic function
$$f_l(d_i^j) \triangleq \bar{h}_i^j(l) = \frac{1}{N_i^j} \sum_{k=1}^{N_i^j} h_i^{j,k}(l)$$
which evaluates the $l$-th element of $\bar{h}_i^j$, where $d_i^j$ is the subset of client $i$'s local dataset containing only the samples with label $j$; $h_i^{j,k}$ denotes the representation of the $k$-th sample in $d_i^j$, while $h_i^{j,k}(l)$ is the $l$-th element of $h_i^{j,k}$. In our proposed framework, client $i$ transmits a noisy version of its hyper-knowledge to the server,
$$\tilde{h}_i^j(l) = \bar{h}_i^j(l) + \chi_i^j(l), \qquad (7)$$
where $\chi_i^j(l) \sim \mathcal{N}(0, (S_f^i)^2 \cdot \sigma^2)$ and $\sigma^2$ denotes a hyper-parameter shared by all clients.
$(S_f^i)^2$ is the sensitivity of the function $f_l(\cdot)$ on client $i$'s local dataset.

Lemma 1. If $|h_i^{j,k}(l)|$ is bounded by $\zeta > 0$ for any $k$, then
$$|f_l(d_i^j) - f_l(d_i^{j\prime})| \le \frac{2\zeta}{N_i^j}. \qquad (8)$$
Therefore, $S_f^i = 2\zeta / N_i^j$. Note that $(S_f^i)^2$ depends on $N_i^j$, the number of samples in class $j$, and thus differs across clients in the heterogeneous setting. A discussion of the probability that differential privacy is broken can be found in Section 4.3. The proof of Lemma 1 is provided in Appendix A.5.
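The Gaussian perturbation of Eq. (7), with sensitivity $S_f^i = 2\zeta/N_i^j$, can be sketched as follows (a hedged illustration; the function name and the default values of $\zeta$ and $\sigma$ are taken from the experimental section):

```python
import numpy as np

def dp_protect(h_bar, n_samples, zeta=3.0, sigma=7.0, rng=None):
    """Perturb a class-mean representation with Gaussian noise of standard
    deviation S_f^i * sigma, where the sensitivity is S_f^i = 2 * zeta / N_i^j."""
    rng = np.random.default_rng(0) if rng is None else rng
    sensitivity = 2.0 * zeta / n_samples
    return h_bar + rng.normal(0.0, sensitivity * sigma, size=h_bar.shape)
```

Because the sensitivity shrinks as $1/N_i^j$, clients with more samples of a class add proportionally less noise.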

3.3. GLOBAL HYPER-KNOWLEDGE AGGREGATION

After the server collects hyper-knowledge from the participating clients, the global hyper-knowledge for class $j$ at global round $t+1$, $K^{j,t+1} = (H^{j,t+1}, Q^{j,t+1})$, is formed as
$$H^{j,t+1} = \sum_{i=1}^{m} p_i \tilde{h}_i^{j,t}, \qquad Q^{j,t+1} = \sum_{i=1}^{m} p_i \bar{q}_i^{j,t},$$
where $p_i = N_i^j / N^j$, $N_i^j$ denotes the number of samples in class $j$ owned by client $i$, and $N^j = \sum_{i=1}^{m} N_i^j$. For clarity, we emphasize that $\tilde{h}_i^{j,t}$ denotes the local hyper-knowledge about class $j$ of client $i$ at global round $t$. Since the noise is drawn from $\mathcal{N}(0, (S_f^i)^2 \cdot \sigma^2)$, its effect on the quality of hyper-knowledge is alleviated during aggregation assuming a sufficiently large number of participating clients, i.e.,
$$\mathbb{E}\big[H^{j,t+1}(l)\big] = \sum_{i=1}^{m} p_i \bar{h}_i^{j,t}(l) + \mathbb{E}\Big[\sum_{i=1}^{m} p_i \chi_i^{j,t}(l)\Big] = \sum_{i=1}^{m} p_i \bar{h}_i^{j,t}(l) + 0,$$
with variance $\sigma^2 \sum_{i=1}^{m} p_i^2 (S_f^i)^2$. In other words, the additive noise is "averaged out" and effectively near-eliminated after aggregating the local hyper-knowledge. For simplicity, we assume in the above expressions that $N_i^j \neq 0$.
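A server-side sketch of this aggregation (hedged: function name is ours; clients' hyper-knowledge is represented as a dict mapping class index to a `(mean_representation, mean_soft_prediction)` pair or `None` when withheld):

```python
import numpy as np

def aggregate_hyper_knowledge(client_hks, client_class_counts, n_classes):
    """Global hyper-knowledge: H^j = sum_i p_i h_i^j and Q^j = sum_i p_i q_i^j,
    with p_i = N_i^j / N^j; classes shared by no client are set to None."""
    global_hk = {}
    for j in range(n_classes):
        shared = [(counts[j], hk[j])
                  for hk, counts in zip(client_hks, client_class_counts)
                  if hk.get(j) is not None]
        if not shared:
            global_hk[j] = None  # no participating client shared class j
            continue
        Nj = float(sum(n for n, _ in shared))
        H = sum((n / Nj) * h for n, (h, _) in shared)
        Q = sum((n / Nj) * q for n, (_, q) in shared)
        global_hk[j] = (H, Q)
    return global_hk
```

Note that the weights $p_i$ are normalized only over the clients that actually shared class $j$, so withheld classes do not skew the average.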

3.4. LOCAL TRAINING OBJECTIVE

Following the aggregation at the server, the global hyper-knowledge is sent to the clients participating in the next FL round to assist in local training. In particular, given data samples $(x, y) \sim D_i$, the loss function of client $i$ is formed as
$$L(D_i, \phi_i, \omega_i) = \frac{1}{B_i} \sum_{k=1}^{B_i} \mathrm{CELoss}(G_{\omega_i}(R_{\phi_i}(x_k)), y_k) + \lambda \frac{1}{n} \sum_{j=1}^{n} \|Q(G_{\omega_i}(H^j), T) - Q^j\|^2 + \gamma \frac{1}{B_i} \sum_{k=1}^{B_i} \|R_{\phi_i}(x_k) - H^{y_k}\|^2, \qquad (11)$$
where $B_i$ denotes the number of samples in the dataset owned by client $i$; $n$ is the number of classes; $\mathrm{CELoss}(\cdot, \cdot)$ denotes the cross-entropy loss function; $\|\cdot\|^2$ denotes the squared Euclidean norm; $Q(\cdot, T)$ is the soft target function with temperature $T$; and $\lambda$ and $\gamma$ are hyper-parameters. Note that the loss function in (11) consists of three terms: the empirical risk formed using predictions and ground-truth labels, and two regularization terms utilizing hyper-knowledge. Essentially, the second and third terms in the loss function are proximity/distance functions. The second term forces the local classifier to output soft predictions similar to the global ones when given global data representations, while the third term forces the feature extractor to output data representations similar to the global ones when given local data samples. For both, we use the Euclidean distance because it is non-negative and convex.
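The three-term objective in Eq. (11) can be sketched as below (a hedged illustration: `classifier` stands in for $Q(G_{\omega_i}(\cdot), T)$'s logit map, `H`/`Q` are the global hyper-knowledge dictionaries, and the function names are ours):

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def fedhkd_loss(logits, labels, reps, H, Q, classifier, lam=0.05, gamma=0.05, T=0.5):
    """Three-term objective of Eq. (11): cross-entropy on the local batch,
    a classifier regularizer on global prototypes, and a feature regularizer."""
    B, n = logits.shape
    # (1) empirical risk: cross-entropy between predictions and ground-truth labels
    probs = np.stack([softmax_T(z, T=1.0) for z in logits])
    ce = -np.mean(np.log(probs[np.arange(B), labels] + 1e-12))
    # (2) soft predictions on global prototypes H^j should match global Q^j
    kd = np.mean([np.sum((softmax_T(classifier(H[j]), T) - Q[j]) ** 2)
                  for j in range(n) if H.get(j) is not None])
    # (3) local representations should stay close to the global prototype H^{y_k}
    proto = np.mean([np.sum((reps[k] - H[labels[k]]) ** 2)
                     for k in range(B) if H.get(labels[k]) is not None])
    return ce + lam * kd + gamma * proto
```

All three terms are non-negative, so the combined loss is bounded below by zero for any choice of $\lambda, \gamma \ge 0$.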

3.5. FEDHKD: SUMMARY OF THE FRAMEWORK

The training starts at the server, which initializes the global model $\theta^1 = (\phi^1, \omega^1)$ and then repeats the following steps for global rounds $t = 1, \dots, T_r$:

1. The server sends the global model $(\phi^t, \omega^t)$ and the global hyper-knowledge $K$ to the clients in the participating set $S_t$.
2. Each client $i \in S_t$ runs $\mathrm{LocalUpdate}(\phi^t, \omega^t, K, D_i, \sigma^2, \nu, i)$: it initializes the local model as $\phi_i^t \leftarrow \phi^t$, $\omega_i^t \leftarrow \omega^t$, and for each local epoch updates $(\phi_i^t, \omega_i^t)$ by running an optimizer on the objective $L(x, y, K, \lambda, \gamma)$ over batches $(x, y) \sim D_i$; it then updates its local hyper-knowledge $K_i$ and returns $(\phi_i^t, \omega_i^t, K_i)$ to the server.
3. The server aggregates the global hyper-knowledge $K$ via Eq. 9 and aggregates the global model $\theta^{t+1} = (\phi^{t+1}, \omega^{t+1})$.

After $T_r$ rounds, the server returns the final global model $\theta^{T_r+1} = (\phi^{T_r+1}, \omega^{T_r+1})$.
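The communication round structure can be sketched as below (hedged: the helper callables `local_update`, `agg_models` and `agg_hk` are placeholders for the client-side training and the server-side aggregations described above):

```python
def fedhkd_round(clients, global_model, global_hk, local_update, agg_models, agg_hk):
    """One FedHKD communication round: each client trains with the global model
    and global hyper-knowledge; the server aggregates both returned quantities."""
    models, hks = [], []
    for client in clients:
        model_i, hk_i = local_update(client, global_model, global_hk)
        models.append(model_i)
        hks.append(hk_i)
    return agg_models(models), agg_hk(hks)
```

The server would call this once per global round, feeding each round's aggregated model and hyper-knowledge into the next.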

3.6. CONVERGENCE ANALYSIS

To facilitate the convergence analysis of FedHKD, we make assumptions commonly encountered in the literature (Li et al., 2019; 2020; Tan et al., 2021). Details of the assumptions and proofs are provided in Appendix A.6.

Theorem 2. Suppose Assumptions 1-3 in Appendix A.6.1 hold. For an arbitrary client, after each communication round the loss function is bounded as
$$\mathbb{E}\big[L_i^{\frac{1}{2},t+1}\big] \le L_i^{\frac{1}{2},t} - \sum_{e=\frac{1}{2}}^{E-1} \Big(\eta_e - \frac{\eta_e^2 L_1}{2}\Big) \|\nabla L^{e,t}\|_2^2 + \frac{\eta_0^2 L_1 E}{2} \big(E V^2 + \sigma^2\big) + 2\lambda \eta_0 L_3 (L_2 + 1) E V + 2\gamma \eta_0 L_2 E V.$$

Theorem 3 (FedHKD convergence rate). Suppose Assumptions 1-3 in Appendix A.6.1 hold and define the regret $\Delta = L^{\frac{1}{2},1} - L^*$. If the learning rate is set to $\eta$, then for an arbitrary client, after
$$T = \frac{2\Delta}{\epsilon E (2\eta - \eta^2 L_1) - \eta^2 L_1 E (E V^2 + \sigma^2) - 4\lambda \eta L_3 (L_2 + 1) E V - 4\gamma \eta L_2 E V} \qquad (13)$$
global rounds ($\epsilon > 0$), it holds that
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{e=\frac{1}{2}}^{E-1} \mathbb{E}\|\nabla L^{e,t}\|_2^2 \le \epsilon.$$

4. EXPERIMENTS

4.1 EXPERIMENTAL SETTINGS

In this section, we present extensive benchmarking results comparing the performance of FedHKD to that of competing FL methods designed to address the challenge of learning from non-iid data. All methods were implemented and simulated in Pytorch (Paszke et al., 2019), with models trained using the Adam optimizer (Kingma & Ba, 2014). Details of the implementation and the selection of hyper-parameters are provided in the Appendix. Below we describe the datasets, models and baselines used in the experiments. Datasets. Three benchmark datasets are used in the experiments: SVHN (Netzer et al., 2011), CIFAR10 and CIFAR100 (Krizhevsky et al., 2009). To generate heterogeneous partitions of local training data, we follow the strategy in (Yoon et al., 2021; Yurochkin et al., 2019; Li et al., 2021a) and utilize the Dirichlet distribution with varied concentration parameter β, which controls the level of heterogeneity. Since our focus is on understanding and addressing the impact of class heterogeneity in clients' data on the performance of trained models, we set the sizes of the clients' datasets to be equal. Models. Rather than evaluate the performance of competing schemes on a simple CNN network as in (McMahan et al., 2017; Li et al., 2020; 2021a), we apply two widely used benchmarking models better suited to practical settings. Specifically, we deploy ShuffleNetV2 (Ma et al., 2018) on SVHN and ResNet18 (He et al., 2016) on CIFAR10/100. As our results show, FedHKD generally outperforms competing methods on both (very different) architectures, demonstrating remarkable consistency and robustness. Baselines. We compare the test accuracy of FedHKD with seven state-of-the-art federated learning methods: FedAvg (McMahan et al., 2017), FedMD (Li & Wang, 2019), FedProx (Li et al., 2020), Moon (Li et al., 2021a), FedProto (Tan et al., 2021), FedGen (Zhu et al., 2021) and FedAlign (Mendieta et al., 2022).
We emphasize that the novelty of FedHKD lies in data-free knowledge distillation that requires neither a public dataset nor a generative model; this stands in contrast to FedMD which relies on a public dataset and FedGen which deploys a generative model. Like FedHKD, FedProto shares means of data representations but uses different regularization terms in the loss functions and does not make use of soft predictions. When discussing the results, we will particularly analyze and compare the performance of FedMD, FedGen and FedProto with the performance of FedHKD. 

4.3. PRIVACY ANALYSIS

In our experimental setting, clients share the same network architecture (either ShuffleNetV2 or ResNet18). In both architectures, the outermost layer of the feature extractor is a batch normalization (BN) layer (Ioffe & Szegedy, 2015). For a batch of vectors $B = \{v_1, \dots, v_b\}$ at the input of the BN layer, the operation of the BN layer is specified by
$$\mu_B = \frac{1}{b} \sum_{i=1}^{b} v_i, \qquad \sigma_B^2 = \frac{1}{b} \sum_{i=1}^{b} (v_i - \mu_B)^2, \qquad \tilde{v}_i \leftarrow \frac{v_i - \mu_B}{\sigma_B}.$$
Assuming $b$ is sufficiently large, the law of large numbers implies $\tilde{v}_i \sim \mathcal{N}(0, 1)$; therefore, $-3 \le \tilde{v}_i \le 3$ with probability 99.73%. Consider the experimental scenario in which client $i$ has $N_i = 1024$ samples in its local dataset, the sharing threshold is $\nu = 0.25$ (so $N_i^j > \nu N_i = 256$), $\delta = 0.01$, and $\varepsilon = 0.5$. According to Theorem 1, to obtain $0.5$-differential privacy with confidence $1 - \delta = 99\%$ we set $\sigma > \sqrt{2 \ln(1.25/\delta)}/\varepsilon \approx 6.215$. According to Lemma 1, $(S_f^i)^2 = (2\zeta / N_i^j)^2 < (6/256)^2$. Setting $\sigma = 7$ (a large privacy budget), the variance of the noise added to the hyper-knowledge $K_i^j$ of client $i$ is $(S_f^i)^2 \sigma^2 < 0.0269$.
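The numeric bounds above can be reproduced directly (a small sketch; the function name is ours):

```python
import math

def min_sigma(eps, delta):
    """Gaussian-mechanism bound: sigma > sqrt(2 * ln(1.25 / delta)) / eps."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) / eps

sigma_min = min_sigma(0.5, 0.01)          # ≈ 6.215
sensitivity_sq = (2.0 * 3.0 / 256) ** 2   # (2 * zeta / N_i^j)^2 with zeta = 3, N_i^j = 256
noise_var = sensitivity_sq * 7.0 ** 2     # sigma = 7  ->  ≈ 0.0269
```

This confirms that even with the conservative choice $\sigma = 7$, the noise added per element of the shared mean representations has variance below about 0.027.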

5. CONCLUSION

We presented FedHKD, a novel FL algorithm that relies on knowledge distillation to enable efficient learning of personalized and global models in data-heterogeneous settings; FedHKD requires neither a public dataset nor a generative model and therefore addresses the data heterogeneity challenge without a need for significantly higher resources. By introducing and utilizing the concept of "hyper-knowledge", information that consists of the means of data representations and the corresponding means of soft predictions, FedHKD enables clients to train personalized models that perform well locally while allowing the server to aggregate a global model that performs well across all data classes. To address privacy concerns, FedHKD deploys a differential privacy mechanism. We conducted extensive experiments in a variety of settings on several benchmark datasets, and provided a theoretical analysis of the convergence of FedHKD. The experimental results demonstrate that FedHKD outperforms state-of-the-art federated learning schemes in terms of both local and global accuracy while only slightly increasing the training time.

A APPENDIX

A.1 EXPERIMENTAL DETAILS General setting. We implemented all the models and ran the experiments in Pytorch (Paszke et al., 2019) (Ubuntu 18.04 operating system, 8 AMD Vega20 GPUs). The Adam (Kingma & Ba, 2014) optimizer was used for model training in all the experiments; the learning rate was initialized to 0.001 and decreased every 10 iterations with a decay factor of 0.5, while the hyper-parameter γ in Adam was set to 0.5. The number of global communication rounds was set to 50 while the number of local epochs was set to 5. The size of a data batch was set to 64 and the participation rate of clients was for simplicity set to 1. For the SVHN (Netzer et al., 2011) dataset, the latent dimension of the data representation was set to 32; for CIFAR10/100 (Krizhevsky et al., 2009), the latent dimension was set to 64. Hyper-parameters. In all experiments, the FedProx (Li et al., 2020) hyper-parameter µ_prox was set to 0.5; the Moon (Li et al., 2021a) hyper-parameter µ_moon in the proximal term was set to 1. In FedAlign (Mendieta et al., 2022), the fractional width of the sub-network was set to 0.25, and the balancing parameter µ_align was set to 0.45. The generative model required by FedGen (Zhu et al., 2021) is the MLP-based architecture proposed in (Zhu et al., 2021). The hidden dimension of the generator was set to 512; the latent dimension, noise dimension, and input/output channels were adapted to the datasets. The number of epochs for training the generative model in each global round was set to 5, and the ratio of the generating batch size to the training batch size was set to 0.5 (i.e., the generating batch size was set to 32). The parameters α_generative and β_generative were initialized to 10 with a decay factor of 0.98 in each global round. In FedMD (Li & Wang, 2019), we set the regularization hyper-parameter λ_md to 0.05; the size of the public dataset was set equal to the size of the clients' local training datasets.
In FedProto (Tan et al., 2021), the regularization hyper-parameter λ_proto was set to 0.05. The hyper-parameters λ and γ in our proposed method FedHKD* were set to 0.05 and 0, respectively; for FedHKD, the two hyper-parameters λ and γ were both set to 0.05. The variance σ of the Gaussian noise added to the generated hyper-knowledge was set to 7; the threshold ν that needs to be met to initiate computation of hyper-knowledge was set to 0.25. The temperature for the FedHKD and Moon algorithms was set to 0.5.

A.5 PROOF OF LEMMA 1

To compute the $i$-th client's mean representation of class $j$, $\bar{h}_i^j$, we consider the deterministic function (averaging in an element-wise manner)
$$f_l(d_i^j) \triangleq \bar{h}_i^j(l) = \frac{1}{N_i^j} \sum_{k=1}^{N_i^j} h_i^{j,k}(l),$$
where $d_i^j$ is the subset of the $i$-th client's local dataset collecting the samples with label $j$; $h_i^{j,k}$ denotes the data representation of the $k$-th sample in $d_i^j$, while $h_i^{j,k}(l)$ is the $l$-th element of $h_i^{j,k}$.

Lemma 1. If $|h_i^{j,k}(l)|$ is bounded by $\zeta > 0$ for any $k$, then
$$|f_l(d_i^j) - f_l(d_i^{j\prime})| \le \frac{2\zeta}{N_i^j}.$$

Proof: Without loss of generality, let $e = \{h_i^1(l), \dots, h_i^{N_i^j - 1}(l), h_i^{N_i^j}(l)\}$ with $|e| = N_i^j$, and $e' = \{h_i^1(l), \dots, h_i^{N_i^j - 1}(l)\}$ with $|e'| = N_i^j - 1$, where $e$ and $e'$ denote adjacent sets differing in at most one element. Define $\mathbf{1} = (1, \dots, 1)^\top$ with $|\mathbf{1}| = N_i^j - 1$. Then
$$|f_l(d_i^j) - f_l(d_i^{j\prime})| = \left| \frac{\mathbf{1}^\top e' + h_i^{N_i^j}(l)}{N_i^j} - \frac{\mathbf{1}^\top e'}{N_i^j - 1} \right| = \left| \frac{(N_i^j - 1)\, h_i^{N_i^j}(l) - \mathbf{1}^\top e'}{N_i^j (N_i^j - 1)} \right| \le \frac{(N_i^j - 1) \big|h_i^{N_i^j}(l)\big|}{N_i^j (N_i^j - 1)} + \frac{|\mathbf{1}^\top e'|}{N_i^j (N_i^j - 1)} \le \frac{\zeta}{N_i^j} + \frac{(N_i^j - 1)\zeta}{N_i^j (N_i^j - 1)} = \frac{2\zeta}{N_i^j}. \qquad \blacksquare$$

A.6 CONVERGENCE ANALYSIS

A.6.1 SETUP AND ASSUMPTIONS

Recall the local objective of client $i$,
$$L(D_i, \phi_i, \omega_i) = \frac{1}{B_i} \sum_{k=1}^{B_i} \mathrm{CELoss}(G_{\omega_i}(R_{\phi_i}(x_k)), y_k) + \lambda \frac{1}{n} \sum_{j=1}^{n} \|Q(G_{\omega_i}(H^j), T) - Q^j\|^2 + \gamma \frac{1}{B_i} \sum_{k=1}^{B_i} \|R_{\phi_i}(x_k) - H^{y_k}\|^2.$$
Although client $i$ has not yet begun the next training epoch, the local model changes after server aggregation and so does the value of the loss function. At the server, the global model is updated as
$$\theta^{\frac{1}{2},t+1} = \sum_{i=1}^{m} p_i \theta_i^{E,t},$$
where $\theta_i^{E,t}$ is the local model of client $i$ after $E$ local training epochs at round $t$, and $p_i$ is the averaging weight of client $i$ with $\sum_{i=1}^{m} p_i = 1$. The local hyper-knowledge $\bar{h}_i^{j,t}$ and $\bar{q}_i^{j,t}$ are aggregated as
$$H^{j,t+1} = \sum_{i=1}^{m} p_i \bar{h}_i^{j,t}, \qquad Q^{j,t+1} = \sum_{i=1}^{m} p_i \bar{q}_i^{j,t}.$$

Assumption 1 (Lipschitz continuity). The gradient $\nabla L$ is $L_1$-Lipschitz continuous, the feature extractor $R_\phi(\cdot)$ is $L_2$-Lipschitz continuous, and the soft prediction of the classifier $Q(G_\omega(\cdot), T)$ is $L_3$-Lipschitz continuous:
$$\|\nabla L(\theta^{t_1}) - \nabla L(\theta^{t_2})\|_2 \le L_1 \|\theta^{t_1} - \theta^{t_2}\|_2, \quad \forall t_1, t_2 > 0,$$
$$\|R_{\phi^{t_1}}(\cdot) - R_{\phi^{t_2}}(\cdot)\| \le L_2 \|\phi^{t_1} - \phi^{t_2}\|_2, \quad \forall t_1, t_2 > 0,$$
$$\|Q(G_{\omega^{t_1}}(\cdot)) - Q(G_{\omega^{t_2}}(\cdot))\| \le L_3 \|\omega^{t_1} - \omega^{t_2}\|_2, \quad \forall t_1, t_2 > 0.$$
Inequality (24) also implies
$$L(\theta_{t_1}) - L(\theta_{t_2}) \le \big\langle \nabla L(\theta_{t_2}),\, \theta_{t_1} - \theta_{t_2} \big\rangle + \frac{L_1}{2}\,\|\theta_{t_1} - \theta_{t_2}\|_2^2, \quad \forall t_1, t_2 > 0. \quad (26)$$

Assumption 2 (Unbiased Gradient and Bounded Variance). The stochastic gradient on a batch $\xi_i$ of client $i$'s data, denoted by $g_i^t = \nabla L(\theta_i^t, \xi_i^t)$, is an unbiased estimator of the local gradient of each client $i$,
$$\mathbb{E}_{\xi_i \sim D_i}\big[g_i^t\big] = \nabla L(\theta_i^t), \quad \forall i \in \{1, 2, \dots, m\},$$
with variance bounded by $\sigma^2$:
$$\mathbb{E}\Big[\big\|g_i^t - \nabla L(\theta_i^t)\big\|_2^2\Big] \le \sigma^2, \quad \forall i \in \{1, 2, \dots, m\}, \ \sigma > 0.$$

Assumption 3 (Bounded Expectation of Gradients). The expectation of the stochastic gradient is bounded by $V$:
$$\mathbb{E}\Big[\big\|g_i^t\big\|_2^2\Big] \le V^2, \quad \forall i \in \{1, 2, \dots, m\}, \ V > 0.$$

The following lemma bounds the expected decrease of the loss during local training:
$$\mathbb{E}\big[L^{E,t+1}\big] \le L^{\frac{1}{2},t+1} - \sum_{e=\frac{1}{2}}^{E-1}\Big(\eta_e - \frac{\eta_e^2 L_1}{2}\Big)\big\|\nabla L^{e,t+1}\big\|_2^2 + \frac{\eta_0^2 L_1 E}{2}\,\sigma^2,$$
where $\eta_e$ is the step size (learning rate) at local epoch $e$.

Proof: For $e \in \{\frac{1}{2}, 1, \dots, E-1\}$,
$$L^{e+1,t+1} \overset{(1)}{\le} L^{e,t+1} + \big\langle \nabla L^{e,t+1},\, \theta^{e+1,t+1} - \theta^{e,t+1}\big\rangle + \frac{L_1}{2}\,\big\|\theta^{e+1,t+1} - \theta^{e,t+1}\big\|_2^2 = L^{e,t+1} - \eta_e\,\big\langle \nabla L^{e,t+1},\, g^{e,t+1}\big\rangle + \frac{L_1}{2}\,\eta_e^2\,\big\|g^{e,t+1}\big\|_2^2,$$
where inequality (1) follows from Assumption 1. Taking the expectation of both sides over the sampled batch $\xi^{t+1}$, we obtain
$$\begin{aligned}
\mathbb{E}\big[L^{e+1,t+1}\big] &\overset{(2)}{\le} L^{e,t+1} - \eta_e\,\big\|\nabla L^{e,t+1}\big\|_2^2 + \frac{L_1}{2}\,\eta_e^2\,\mathbb{E}\big\|g^{e,t+1}\big\|_2^2 \\
&\overset{(3)}{=} L^{e,t+1} - \eta_e\,\big\|\nabla L^{e,t+1}\big\|_2^2 + \frac{L_1}{2}\,\eta_e^2\Big(\big\|\nabla L^{e,t+1}\big\|_2^2 + \mathbb{V}\big[g^{e,t+1}\big]\Big) \\
&\overset{(4)}{\le} L^{e,t+1} - \Big(\eta_e - \frac{\eta_e^2 L_1}{2}\Big)\big\|\nabla L^{e,t+1}\big\|_2^2 + \frac{L_1}{2}\,\eta_e^2\,\sigma^2.
\end{aligned}$$
Inequality (2) follows from Assumption 2; (3) follows from $\mathbb{V}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$, where $x$ is a random variable; (4) holds due to Assumptions 2-3. Setting the learning rate at the start of local training to $\eta_{\frac{1}{2}} = \eta_0$ and telescoping,
$$\mathbb{E}\big[L^{E,t+1}\big] \le L^{\frac{1}{2},t+1} - \sum_{e=\frac{1}{2}}^{E-1}\Big(\eta_e - \frac{\eta_e^2 L_1}{2}\Big)\big\|\nabla L^{e,t+1}\big\|_2^2 + \frac{\eta_0^2 \sigma^2 L_1 E}{2}.$$
The last inequality holds because the learning rate $\eta_e$ is non-increasing.

Lemma 2. Following the model and hyper-knowledge aggregation at the server, the loss function of any client $i$ at global round $t+1$ can be bounded as
$$\mathbb{E}\Big[L_i^{\frac{1}{2},(t+1)}\Big] \le L_i^{E,t} + \frac{\eta_0^2 L_1}{2}\,E^2 V^2 + 2\lambda\eta_0 L_3 (L_2+1) E V + 2\gamma\eta_0 L_2 E V.$$
Proof:
$$\begin{aligned}
L_i^{\frac{1}{2},(t+1)} - L_i^{E,t} &= L\big(\theta_i^{\frac{1}{2},t+1}, K^{t+1}\big) - L\big(\theta_i^{E,t}, K^t\big) \\
&= L\big(\theta_i^{\frac{1}{2},t+1}, K^{t+1}\big) - L\big(\theta_i^{E,t}, K^{t+1}\big) + L\big(\theta_i^{E,t}, K^{t+1}\big) - L\big(\theta_i^{E,t}, K^t\big) \\
&\overset{(1)}{\le} \big\langle \nabla L_i^{E,t},\, \theta_i^{\frac{1}{2},t+1} - \theta_i^{E,t}\big\rangle + \frac{L_1}{2}\big\|\theta_i^{\frac{1}{2},t+1} - \theta_i^{E,t}\big\|_2^2 + L\big(\theta_i^{E,t}, K^{t+1}\big) - L\big(\theta_i^{E,t}, K^t\big) \\
&\overset{(2)}{=} \Big\langle \nabla L_i^{E,t},\, \sum_{j=1}^m p_j\theta_j^{E,t} - \theta_i^{E,t}\Big\rangle + \frac{L_1}{2}\Big\|\sum_{j=1}^m p_j\theta_j^{E,t} - \theta_i^{E,t}\Big\|_2^2 + L\big(\theta_i^{E,t}, K^{t+1}\big) - L\big(\theta_i^{E,t}, K^t\big),
\end{aligned}$$
where inequality (1) follows from Assumption 1 and equality (2) is derived from Eq. 21. Taking the expectation of both sides,
$$\begin{aligned}
\mathbb{E}\Big[L_i^{\frac{1}{2},(t+1)}\Big] - L_i^{E,t} &\overset{(1)}{\le} \frac{L_1}{2}\,\mathbb{E}\Big\|\sum_{j=1}^m p_j\theta_j^{E,t} - \theta_i^{E,t}\Big\|_2^2 + \mathbb{E}\,L\big(\theta^{E,t}, K^{t+1}\big) - \mathbb{E}\,L\big(\theta^{E,t}, K^t\big) \\
&= \frac{L_1}{2}\,\mathbb{E}\Big\|\Big(\sum_{j=1}^m p_j\theta_j^{E,t} - \theta^{\frac{1}{2},t}\Big) - \big(\theta_i^{E,t} - \theta^{\frac{1}{2},t}\big)\Big\|_2^2 + \mathbb{E}\,L\big(\theta^{E,t}, K^{t+1}\big) - \mathbb{E}\,L\big(\theta^{E,t}, K^t\big) \\
&\overset{(2)}{\le} \frac{L_1}{2}\,\mathbb{E}\big\|\theta_i^{E,t} - \theta_i^{\frac{1}{2},t}\big\|_2^2 + \mathbb{E}\,L\big(\theta^{E,t}, K^{t+1}\big) - \mathbb{E}\,L\big(\theta^{E,t}, K^t\big) \\
&= \frac{L_1}{2}\,\mathbb{E}\Big\|\sum_{e=\frac{1}{2}}^{E-1}\eta_e\,g_i^{e,t}\Big\|_2^2 + \mathbb{E}\,L\big(\theta^{E,t}, K^{t+1}\big) - \mathbb{E}\,L\big(\theta^{E,t}, K^t\big) \\
&\overset{(3)}{\le} \frac{L_1}{2}\,E\sum_{e=\frac{1}{2}}^{E-1}\mathbb{E}\,\eta_e^2\,\big\|g_i^{e,t}\big\|_2^2 + \mathbb{E}\,L\big(\theta^{E,t}, K^{t+1}\big) - \mathbb{E}\,L\big(\theta^{E,t}, K^t\big) \\
&\overset{(4)}{\le} \frac{\eta_{\frac{1}{2}}^2 L_1}{2}\,E\sum_{e=\frac{1}{2}}^{E-1}\mathbb{E}\,\big\|g_i^{e,t}\big\|_2^2 + \mathbb{E}\,L\big(\theta^{E,t}, K^{t+1}\big) - \mathbb{E}\,L\big(\theta^{E,t}, K^t\big) \\
&\overset{(5)}{\le} \frac{\eta_0^2 L_1}{2}\,E^2 V^2 + \mathbb{E}\,L\big(\theta^{E,t}, K^{t+1}\big) - \mathbb{E}\,L\big(\theta^{E,t}, K^t\big).
\end{aligned}$$
Due to Lemma 3 and its proof in (Li et al., 2019), inequality (1) holds because $\mathbb{E}\big[\theta_j^{E,t}\big] = \sum_{j=1}^m p_j\theta_j^{E,t}$, so the inner-product term vanishes in expectation; inequality (2) holds because $\mathbb{E}\|\mathbb{E}X - X\|^2 \le \mathbb{E}\|X\|^2$ with $X = \theta_i^{E,t} - \theta_i^{\frac{1}{2},t}$; inequality (3) is due to Jensen's inequality; inequality (4) follows from the fact that the learning rate $\eta_e$ is non-increasing; inequality (5) holds due to Assumption 3. Now consider the term $L(\theta^{E,t}, K^{t+1}) - L(\theta^{E,t}, K^t)$; since the model parameters $\theta^{E,t}$ are unchanged, the first (cross-entropy) term of the loss function in Eq. 20 can be neglected.
The difference between the two losses is due to the different global hyper-knowledge $K^t$ and $K^{t+1}$:
$$\begin{aligned}
&L\big(\theta^{E,t}, K^{t+1}\big) - L\big(\theta^{E,t}, K^t\big) \\
&= \lambda\,\frac{1}{n}\sum_{j=1}^{n}\Big(\big\|Q\big(G_{\omega_i^{E,t}}(H^{j,t+1})\big) - Q^{j,t+1}\big\|_2 - \big\|Q\big(G_{\omega_i^{E,t}}(H^{j,t})\big) - Q^{j,t}\big\|_2\Big) + \gamma\,\frac{1}{B_i}\sum_{k=1}^{B_i}\Big(\big\|R_{\phi_i^{E,t}}(x_k) - H^{y_k,t+1}\big\|_2 - \big\|R_{\phi_i^{E,t}}(x_k) - H^{y_k,t}\big\|_2\Big) \\
&= \lambda\,\frac{1}{n}\sum_{j=1}^{n}\Big(\big\|Q\big(G_{\omega_i^{E,t}}(H^{j,t+1})\big) - Q^{j,t} + Q^{j,t} - Q^{j,t+1}\big\|_2 - \big\|Q\big(G_{\omega_i^{E,t}}(H^{j,t})\big) - Q^{j,t}\big\|_2\Big) + \gamma\,\frac{1}{B_i}\sum_{k=1}^{B_i}\Big(\big\|R_{\phi_i^{E,t}}(x_k) - H^{y_k,t+1}\big\|_2 - \big\|R_{\phi_i^{E,t}}(x_k) - H^{y_k,t}\big\|_2\Big) \\
&\overset{(1)}{\le} \lambda\,\frac{1}{n}\sum_{j=1}^{n}\Big(\big\|Q\big(G_{\omega_i^{E,t}}(H^{j,t+1})\big) - Q\big(G_{\omega_i^{E,t}}(H^{j,t})\big)\big\|_2 + \big\|Q^{j,t+1} - Q^{j,t}\big\|_2\Big) + \gamma\,\frac{1}{B_i}\sum_{k=1}^{B_i}\big\|H^{y_k,t+1} - H^{y_k,t}\big\|_2 \\
&\overset{(2)}{\le} \lambda\,\frac{1}{n}\sum_{j=1}^{n}\Big(L_3\,\big\|H^{j,t+1} - H^{j,t}\big\|_2 + \big\|Q^{j,t+1} - Q^{j,t}\big\|_2\Big) + \gamma\,\frac{1}{B_i}\sum_{k=1}^{B_i}\big\|H^{y_k,t+1} - H^{y_k,t}\big\|_2,
\end{aligned}$$
where (1) is due to the triangle inequality, $\|a+b+c\|_2 \le \|a\|_2 + \|b\|_2 + \|c\|_2$, with $a = Q\big(G_{\omega_i^{E,t}}(H^{j,t})\big) - Q^{j,t}$, $b = Q\big(G_{\omega_i^{E,t}}(H^{j,t+1})\big) - Q\big(G_{\omega_i^{E,t}}(H^{j,t})\big)$ and $c = Q^{j,t} - Q^{j,t+1}$; inequality (2) holds due to Assumption 1. Next, consider the difference
$$\begin{aligned}
\big\|H^{j,t+1} - H^{j,t}\big\|_2 &= \Big\|\sum_{i=1}^m p_i\,\frac{1}{N_i^j}\sum_{k=1}^{N_i^j}\Big(R_{\phi_i^{E,t}}(x_k) - R_{\phi_i^{E,t-1}}(x_k)\Big)\Big\|_2 \\
&\overset{(1)}{\le} \sum_{i=1}^m p_i\,\frac{1}{N_i^j}\sum_{k=1}^{N_i^j}\big\|R_{\phi_i^{E,t}}(x_k) - R_{\phi_i^{E,t-1}}(x_k)\big\|_2 \\
&\overset{(2)}{\le} \sum_{i=1}^m p_i\,\frac{1}{N_i^j}\sum_{k=1}^{N_i^j} L_2\,\big\|\phi_i^{E,t} - \phi_i^{E,t-1}\big\|_2 = L_2\sum_{i=1}^m p_i\,\big\|\phi_i^{E,t} - \phi_i^{E,t-1}\big\|_2.
\end{aligned}$$
Inequality (1) holds due to Jensen's inequality, while inequality (2) follows from Assumption 1. For clarity, in what follows we drop the superscript $j$ denoting the class. Taking the expectation of both sides,
$$\begin{aligned}
\mathbb{E}\big\|H^{t+1} - H^{t}\big\|_2 &\le L_2\sum_{i=1}^m p_i\,\mathbb{E}\big\|\phi_i^{E,t} - \phi_i^{E,t-1}\big\|_2 \\
&\overset{(1)}{\le} L_2\sum_{i=1}^m p_i\Big(\mathbb{E}\big\|\phi_i^{E,t} - \phi_i^{\frac{1}{2},t}\big\|_2 + \mathbb{E}\big\|\phi_i^{\frac{1}{2},t} - \phi_i^{E,t-1}\big\|_2\Big) \\
&\overset{(2)}{\le} L_2\sum_{i=1}^m p_i\Big(\eta_0 E V + \mathbb{E}\Big\|\sum_{j=1}^m p_j\phi_j^{E,t-1} - \phi_i^{E,t-1}\Big\|_2\Big) \\
&= L_2\sum_{i=1}^m p_i\Big(\eta_0 E V + \mathbb{E}\Big\|\Big(\sum_{j=1}^m p_j\phi_j^{E,t-1} - \phi^{\frac{1}{2},t-1}\Big) - \big(\phi_i^{E,t-1} - \phi_i^{\frac{1}{2},t-1}\big)\Big\|_2\Big) \\
&\overset{(3)}{\le} L_2\sum_{i=1}^m p_i\Bigg(\eta_0 E V + \sqrt{\mathbb{E}\Big\|\Big(\sum_{j=1}^m p_j\phi_j^{E,t-1} - \phi^{\frac{1}{2},t-1}\Big) - \big(\phi_i^{E,t-1} - \phi_i^{\frac{1}{2},t-1}\big)\Big\|_2^2}\Bigg) \\
&\overset{(4)}{\le} L_2\,(\eta_0 E V + \eta_0 E V) = 2\eta_0 L_2 E V.
\end{aligned}$$
Similarly,
$$\mathbb{E}\big\|Q^{t+1} - Q^{t}\big\|_2 \le L_3\sum_{i=1}^m p_i\,\mathbb{E}\big\|\omega_i^{E,t} - \omega_i^{E,t-1}\big\|_2 \le 2\eta_0 L_3 E V.$$
Combining the above inequalities, we have
$$\mathbb{E}\Big[L_i^{\frac{1}{2},(t+1)}\Big] \le L_i^{E,t} + \frac{\eta_0^2 L_1}{2}\,E^2 V^2 + 2\lambda\eta_0 L_3(L_2+1)EV + 2\gamma\eta_0 L_2 EV. \qquad \square$$

A.6.3 THEOREMS

Theorem 2. Instate Assumptions 1-3. For an arbitrary client, after each communication round the loss function is bounded as
$$\mathbb{E}\Big[L_i^{\frac{1}{2},t+1}\Big] \le L_i^{\frac{1}{2},t} - \sum_{e=\frac{1}{2}}^{E-1}\Big(\eta_e - \frac{\eta_e^2 L_1}{2}\Big)\big\|\nabla L^{e,t}\big\|_2^2 + \frac{\eta_0^2 L_1 E}{2}\big(EV^2 + \sigma^2\big) + 2\lambda\eta_0 L_3(L_2+1)EV + 2\gamma\eta_0 L_2 EV.$$
Fine-tuning the learning rates $\eta_0$, $\lambda$ and $\gamma$ ensures that
$$\frac{\eta_0^2 L_1 E}{2}\big(EV^2+\sigma^2\big) + 2\lambda\eta_0 L_3(L_2+1)EV + 2\gamma\eta_0 L_2 EV - \sum_{e=\frac{1}{2}}^{E-1}\Big(\eta_e - \frac{\eta_e^2 L_1}{2}\Big)\big\|\nabla L^{e,t}\big\|_2^2 < 0. \quad (44)$$

Corollary 1 (FedHKD convergence). Let $\eta_0 > \eta_e > \alpha\eta_0$ for $e \in \{1, \dots, E-1\}$, $0 < \alpha < 1$. The loss function of an arbitrary client monotonically decreases in each communication round if
$$\alpha\eta_0 < \eta_e < \frac{2\alpha^2\big\|\nabla L^{e,t}\big\|_2^2 - 4\alpha\lambda L_3(L_2+1)V - 4\alpha\gamma L_2 V}{L_1\big(\alpha^2\big\|\nabla L^{e,t}\big\|_2^2 + EV^2 + \sigma^2\big)}.$$

Proof of Theorem 3: According to Theorem 1,
$$\begin{aligned}
\frac{1}{TE}\sum_{t=1}^{T}\sum_{e=\frac{1}{2}}^{E-1}\Big(\eta - \frac{\eta^2 L_1}{2}\Big)\big\|\nabla L^{e,t}\big\|_2^2 &\le \frac{1}{TE}\sum_{t=1}^{T} L_i^{\frac{1}{2},t} - \frac{1}{TE}\sum_{t=1}^{T}\mathbb{E}\Big[L_i^{\frac{1}{2},t+1}\Big] + \frac{\eta^2 L_1}{2}\big(EV^2+\sigma^2\big) + 2\lambda\eta L_3(L_2+1)V + 2\gamma\eta L_2 V \\
&\le \frac{\Delta}{TE} + \frac{\eta^2 L_1}{2}\big(EV^2+\sigma^2\big) + 2\lambda\eta L_3(L_2+1)V + 2\gamma\eta L_2 V < \epsilon\Big(\eta - \frac{\eta^2 L_1}{2}\Big). \quad (52)
\end{aligned}$$
Therefore,
$$\frac{\Delta}{T} \le \epsilon E\Big(\eta - \frac{\eta^2 L_1}{2}\Big) - \frac{\eta^2 L_1 E}{2}\big(EV^2+\sigma^2\big) - 2\lambda\eta L_3(L_2+1)EV - 2\gamma\eta L_2 EV,$$
which is equivalent to
$$T \ge \frac{2\Delta}{\epsilon E\big(2\eta - \eta^2 L_1\big) - \eta^2 L_1 E\big(EV^2+\sigma^2\big) - 4\lambda\eta L_3(L_2+1)EV - 4\gamma\eta L_2 EV}. \qquad \square$$
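To make the round-complexity bound of Eq. 50 concrete, one can evaluate it for illustrative constants (all values below are assumptions chosen only so that the denominator is positive; they are not the paper's constants):

```python
# Illustrative evaluation of the lower bound on the number of global
# rounds T from Theorem 3 (Eq. 50). All constants are arbitrary.
Delta, eps = 10.0, 0.1          # regret and target accuracy
E, eta = 5, 0.01                # local epochs, learning rate
L1, L2, L3 = 1.0, 1.0, 1.0      # Lipschitz constants (Assumption 1)
V, sigma = 1.0, 1.0             # gradient bounds (Assumptions 2-3)
lam, gamma = 0.001, 0.001       # KD regularization weights

denom = (eps * E * (2 * eta - eta**2 * L1)
         - eta**2 * L1 * E * (E * V**2 + sigma**2)
         - 4 * lam * eta * L3 * (L2 + 1) * E * V
         - 4 * gamma * eta * L2 * E * V)
T = 2 * Delta / denom
# The bound is meaningful only when the denominator is positive,
# i.e. when the hyper-parameters satisfy the condition in Eq. 44.
assert denom > 0
```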



Furthermore, to evaluate both personalized and global model performance, each client is allocated a local test dataset (with the same class distribution as the corresponding local training dataset) and a global test dataset with uniformly distributed classes (shared by all participating clients); this allows computing both the average local test accuracy of the trained local models and the global test accuracy of the global model aggregated from the clients' local models.
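The evaluation protocol above might be set up as in the following sketch (the function name, set sizes, and sampling details are our assumptions):

```python
import numpy as np

def make_test_sets(test_labels, client_class_props, local_size, rng=None):
    """Build per-client local test sets matching each client's training
    class distribution, plus one shared global test set with uniform
    class coverage.

    test_labels:        (N,) labels of the held-out test pool
    client_class_props: (m, n) class proportions of each client's training data
    """
    rng = rng or np.random.default_rng(0)
    n = client_class_props.shape[1]
    by_class = [np.where(test_labels == c)[0] for c in range(n)]
    local_sets = []
    for props in client_class_props:
        counts = np.round(props * local_size).astype(int)
        idx = np.concatenate([rng.choice(by_class[c], counts[c], replace=True)
                              for c in range(n) if counts[c] > 0])
        local_sets.append(idx)
    # Global test set: every class equally represented.
    per_class = min(len(b) for b in by_class)
    global_set = np.concatenate([b[:per_class] for b in by_class])
    return local_sets, global_set
```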

Figure 1: 10% of the training set points in CIFAR10 are sampled into 10 partitions according to a Dirichlet distribution (10 clients). As the concentration parameter varies (β = 0.2, 0.5, 5), the partitions change from heterogeneous to homogeneous.

Figure 2: 50% of the training set points in CIFAR10 are sampled into 50 partitions according to a Dirichlet distribution (50 clients). With concentration parameter β = 0.2, the partition is extremely heterogeneous.

Figure 3: 50% of the training set points in CIFAR100 are sampled into 50 partitions according to a Dirichlet distribution (50 clients). With concentration parameter β = 5, the partition is relatively homogeneous.
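The Dirichlet partitions visualized in Figures 1-3 can be generated along the following lines (a minimal sketch; function and variable names are ours):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, beta, rng=None):
    """Split sample indices among clients, drawing each class's client
    proportions from Dirichlet(beta); smaller beta yields more
    heterogeneous local class distributions."""
    rng = rng or np.random.default_rng(0)
    partitions = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(np.full(n_clients, beta))
        # Cumulative proportions -> split points within this class.
        splits = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, splits)):
            partitions[client].extend(part.tolist())
    return partitions
```

With β = 0.2 most clients end up holding only a few classes, while β = 5 yields nearly uniform local distributions, mirroring the figures above.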

Figure 4: A flow diagram showing computation, encryption and aggregation of hyper-knowledge.

Figure 5: A flow diagram showing FedHKD steps. The blue dashed line indicates sending local hyper-knowledge and model updates from clients to the server while the green dashed line indicates broadcasting global hyper-knowledge and model from the server to clients.

A.6 CONVERGENCE ANALYSIS OF FEDHKD

It will be helpful to recall the notation before restating the theorems and providing their proofs. Let $R_{\phi_i}(\cdot): \mathbb{R}^{d_x} \to \mathbb{R}^{d_r}$ denote the feature extractor of client $i$, mapping raw data of dimension $d_x$ into a representation space of dimension $d_r$. Let $G_{\omega_i}(\cdot): \mathbb{R}^{d_r} \to \mathbb{R}^{n}$ denote the classifier of client $i$, projecting data representations into the categorical space of dimension $n$. Let $F_{\theta_i=(\phi_i,\omega_i)}(\cdot) = G_{\omega_i} \circ R_{\phi_i}(\cdot)$ denote the mapping of the entire model. The local objective function of client $i$ is formed as
$$L(D_i, \phi_i, \omega_i) = \frac{1}{B_i}\sum_{k=1}^{B_i} \mathrm{CELoss}\big(G_{\omega_i}(R_{\phi_i}(x_k)), y_k\big) + \lambda\,\frac{1}{n}\sum_{j=1}^{n}\big\|Q(G_{\omega_i}(H^j), T) - Q^j\big\|_2 + \gamma\,\frac{1}{B_i}\sum_{k=1}^{B_i}\big\|R_{\phi_i}(x_k) - H^{y_k}\big\|_2,$$
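The decomposition of the model into a feature extractor $R_\phi$ and a classifier head $G_\omega$ can be illustrated as follows (a toy NumPy sketch with made-up layer shapes, not the architectures used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_r, n = 32, 8, 10           # illustrative dimensions

# Feature extractor R_phi: R^{d_x} -> R^{d_r} (one linear+ReLU layer here)
phi = rng.normal(size=(d_x, d_r))
def R(x):
    return np.maximum(x @ phi, 0.0)

# Classifier G_omega: R^{d_r} -> R^{n}
omega = rng.normal(size=(d_r, n))
def G(h):
    return h @ omega

# Full model F_theta = G_omega composed with R_phi
def F(x):
    return G(R(x))

x = rng.normal(size=(4, d_x))     # a batch of 4 raw samples
assert R(x).shape == (4, d_r) and F(x).shape == (4, n)
```

The split matters for FedHKD because hyper-knowledge is formed from the outputs of $R_\phi$ (representations) and $G_\omega$ (soft predictions) separately.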

Assumption 1 (Lipschitz Continuity). The gradient of the local loss function $L(\cdot)$ is $L_1$-Lipschitz continuous, the embedding function of the local feature extractor $R_\phi(\cdot)$ is $L_2$-Lipschitz continuous, and the composition of the local classifier $G_\omega(\cdot)$ with the soft prediction function $Q(\cdot, T)$ is $L_3$-Lipschitz continuous.

$$\le L_2\,(\eta_0 E V + \eta_0 E V) = 2\eta_0 L_2 E V, \quad (40)$$
where (1) follows from the triangle inequality; inequality (2) holds due to Assumption 3 and the update rule of SGD; since $f(x) = \sqrt{x}$ is concave, (3) follows from Jensen's inequality; inequality (4) holds because $\mathbb{E}\|\mathbb{E}X - X\|^2 \le \mathbb{E}\|X\|^2$, where $X = \phi_i^{E,t-1} - \phi_i^{\frac{1}{2},t-1}$.

$$\eta_e < \frac{2\alpha^2\big\|\nabla L^{e,t}\big\|_2^2 - 4\alpha\lambda L_3(L_2+1)V - 4\alpha\gamma L_2 V}{L_1\big(\alpha^2\big\|\nabla L^{e,t}\big\|_2^2 + EV^2 + \sigma^2\big)}, \quad \forall e \in \{1, \dots, E-1\}. \quad (49)$$

Theorem 3 (FedHKD convergence rate). Instate Assumptions 1-3 and define the regret $\Delta = L^{\frac{1}{2},1} - L^{*}$. If the learning rate is set to $\eta$, then for an arbitrary client, after
$$T = \frac{2\Delta}{\epsilon E\big(2\eta - \eta^2 L_1\big) - \eta^2 L_1 E\big(EV^2+\sigma^2\big) - 4\lambda\eta L_3(L_2+1)EV - 4\gamma\eta L_2 EV} \quad (50)$$
global rounds ($\epsilon > 0$), it holds that
$$\frac{1}{TE}\sum_{t=1}^{T}\sum_{e=\frac{1}{2}}^{E-1}\big\|\nabla L^{e,t}\big\|_2^2 < \epsilon.$$

where $\phi_1$ and $\omega_1$ denote parameters of the global feature extractor and global classifier, respectively. At the beginning of each global epoch, the server sends the global model and global hyper-knowledge to the clients selected for training. In turn, each client initializes its local model with the received global model and performs updates by minimizing the objective in Eq. 11; the objective consists of three terms: (1) the prediction loss, in the form of the cross-entropy between predictions and the ground truth; (2) the classifier loss, reflecting the Euclidean distance between the output of the classifier and the corresponding global soft predictions; and (3) the feature loss, given by the Euclidean distance between the representations extracted from raw data by the local feature extractor and the global data representations. Having completed local training, each client sends its updated local model and local hyper-knowledge to the server for aggregation.
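The three-term objective might be sketched as follows (NumPy; the variable names, the `classifier` callable, and the use of plain Euclidean norms are our assumptions based on the description above):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fedhkd_local_loss(reps, logits, labels, H_glob, Q_glob,
                      classifier, lam=0.05, gamma=0.05, T=0.5):
    """Three-term FedHKD objective (sketch).

    reps:   (B, d_r) local representations R_phi(x)
    logits: (B, n)   local predictions G_omega(R_phi(x))
    H_glob: (n, d_r) global mean representations per class
    Q_glob: (n, n)   global soft predictions per class
    classifier: callable mapping (n, d_r) -> (n, n) logits
    """
    B, n = logits.shape
    # (1) cross-entropy between predictions and the ground truth
    probs = softmax(logits)
    ce = -np.log(probs[np.arange(B), labels] + 1e-12).mean()
    # (2) classifier loss: soft predictions on global representations
    #     vs. the global soft predictions
    kd = np.linalg.norm(softmax(classifier(H_glob), T) - Q_glob, axis=1).mean()
    # (3) feature loss: local representations vs. the global mean
    #     representation of each sample's class
    feat = np.linalg.norm(reps - H_glob[labels], axis=1).mean()
    return ce + lam * kd + gamma * feat
```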

Table 1 shows that FedHKD generally outperforms other methods across various settings and datasets. For each dataset, we ran experiments with 10, 20 and 50 clients, with local data generated from a Dirichlet distribution with concentration parameter β = 0.5.

Table 1: Results on data partitions generated from a Dirichlet distribution with concentration parameter β = 0.5. The number of clients is 10, 20 and 50; the clients utilize 10%, 20% and 50% of the datasets, respectively. The number of parameters (in millions) indicates the size of the model stored in memory during training. A single client's average wall-clock time per round is measured across 8 AMD Vega20 GPUs running in parallel.

Table 2: Results on data partitions generated with different concentration parameters (10 clients).

FedGen's generator is unable to synthesize data of sufficient quality to assist KD-based FL on SVHN and CIFAR10/100; on the former dataset, FedGen actually leads to performance deterioration compared to FedAvg.

Training time comparison. We compare the training efficiency of different methods in terms of the average training time (in seconds) per round per client. For fairness, all experiments were conducted on the same machine with 8 AMD Vega20 GPUs. As shown in Table 1, the training times of FedHKD, FedHKD*, FedProto and FedGen are slightly higher than that of FedAvg. The additional computational burden of FedHKD is due to evaluating two extra regularization terms and computing local hyper-knowledge. The extra computation in FedGen is primarily due to training a generative model; the MLP-based generator adds only minor computation but clearly limits the performance of FedGen. FedMD relies on a public dataset of the same size as the clients' local datasets, approximately doubling the time FedAvg needs to complete the forward and backward passes during training. Finally, the training efficiency of Moon and FedAlign is inferior to that of the other methods: Moon requires more than double the training time of FedAvg, while FedAlign must run multiple forward passes through the network and perform large matrix multiplications to estimate second-order information (the Hessian).

Effect of class heterogeneity. We compare the performance of the proposed method, FedHKD, and other techniques as data heterogeneity is varied by tuning the parameter β. When β = 0.2, heterogeneity is severe and the local datasets typically contain only one or two classes; when β = 5, the local datasets are nearly homogeneous. The data distributions are visualized in Appendix A.2.
As shown in Table 2, FedHKD improves both local and global accuracy in all settings, surpassing the other methods except FedMD on the SVHN dataset for β = 5. FedProto exhibits remarkable improvement

where $D_i$ denotes the local dataset of client $i$; input $x_k$ and label $y_k$ are drawn from $D_i$; $B_i$ is the number of samples in a batch of $D_i$; $Q(\cdot, T)$ is the soft target function with temperature $T$; $H^j$ denotes the global mean data representation of class $j$; $Q^{y_k}$ is the global soft prediction corresponding to class $y_k$; and $\lambda$ and $\gamma$ are hyper-parameters. Note that only $\phi_i$ and $\omega_i$ are variables in the loss function while the other terms are constant. Let $t$ denote the current global training round; during any global round there are $E$ local training epochs. Assume the loss function is minimized via stochastic gradient descent (SGD). To compare the loss before and after model/hyper-knowledge aggregation at the server, denote the local epoch by $e \in \{\frac{1}{2}, 1, \dots, E\}$; $e = \frac{1}{2}$ indicates the epoch between the end of the server's aggregation in the previous communication round and the first epoch of local training in the next round. After $E$ epochs of local training in communication round $t$, the local model of client $i$ is denoted $(\phi_i^{E,t}, \omega_i^{E,t})$. At global communication round $t+1$, client $i$ initializes its local model with the aggregated global model, $(\phi^{\frac{1}{2},t+1}, \omega^{\frac{1}{2},t+1})$.
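The soft target function $Q(\cdot, T)$ is assumed here to be the standard temperature-scaled softmax used in knowledge distillation; a minimal sketch:

```python
import numpy as np

def soft_target(logits, T=0.5):
    """Q(z, T): temperature-scaled softmax used to soften predictions.
    Lower T sharpens the distribution; higher T flattens it."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

q = soft_target([2.0, 1.0, 0.0], T=0.5)
# rows sum to 1; larger logits receive larger probability mass
```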

Factoring out $\eta_e$ on the left-hand side yields
$$L_1\Big(\alpha^2\big\|\nabla L^{e,t}\big\|_2^2 + EV^2 + \sigma^2\Big)\,\eta_e < 2\alpha^2\big\|\nabla L^{e,t}\big\|_2^2 - 4\alpha\lambda L_3(L_2+1)V - 4\alpha\gamma L_2 V,$$
which rearranges to the bound on $\eta_e$ in Eq. 49.

