DATA-FREE ONE-SHOT FEDERATED LEARNING UNDER VERY HIGH STATISTICAL HETEROGENEITY

Abstract

Federated learning (FL) is an emerging distributed learning framework that collaboratively trains a shared model without transferring the local clients' data to a centralized server. Motivated by concerns stemming from extended communication and potential attacks, one-shot FL limits communication to a single round while attempting to retain performance. However, one-shot FL methods often degrade under high statistical heterogeneity, fail to promote pipeline security, or require an auxiliary public dataset. To address these limitations, we propose two novel data-free one-shot FL methods: FEDCVAE-ENS and its extension FEDCVAE-KD. Both approaches reframe the local learning task using a conditional variational autoencoder (CVAE) to address high statistical heterogeneity. Furthermore, FEDCVAE-KD leverages knowledge distillation to compress the ensemble of client decoders into a single decoder. We propose a method that shifts the center of the CVAE prior distribution and experimentally demonstrate that this promotes security, and show how either method can incorporate heterogeneous local models. We confirm the efficacy of the proposed methods over baselines under high statistical heterogeneity using multiple benchmark datasets. In particular, at the highest levels of statistical heterogeneity, both FEDCVAE-ENS and FEDCVAE-KD typically more than double the accuracy of the baselines.

1. INTRODUCTION

Traditional federated learning (FL) achieves privacy protection by sharing learned model parameters with a central server, circumventing the need for a centralized dataset and thus allowing potentially sensitive data to remain local to client devices (McMahan et al., 2017). FL has shown promise in several practical application domains with privacy concerns, such as health care, mobile phones, and industrial engineering (Li et al., 2020a). However, most existing FL methods depend on substantial iterative communication (Guha et al., 2019; Li et al., 2020b), introducing a vulnerability to eavesdropping attacks, among other privacy and security concerns (Mothukuri et al., 2021).

One-shot FL has emerged to address issues associated with communication and security in standard FL (Guha et al., 2019). One-shot FL limits communication to a single round, which is more practical in scenarios like model markets, where models trained to convergence are sold with no possibility for iterative communication during local client training (Li et al., 2021b). In high-impact settings, like health care, data could be highly heterogeneous and computation capabilities could be varied; for example, health care institutions could have different prevalence rates of particular diseases or no data on a disease, and substantially different computing abilities depending on funding (Li et al., 2020a). Furthermore, fewer communication rounds mean fewer opportunities for eavesdropping attacks. While results in one-shot FL are promising, existing methods either struggle under high statistical heterogeneity, i.e., non-independent and identically distributed (non-IID) data (e.g., Zhou et al. (2020), Zhang et al. (2021)), or do not fully consider statistical heterogeneity (e.g., Guha et al. (2019), Shin et al. (2020), Li et al. (2021b)). Additionally, most do not consider pipeline security (e.g., Shin et al. (2020), Li et al. (2021b), Zhang et al. (2021)).
Furthermore, an auxiliary public dataset is often required to achieve satisfactory performance in one-shot FL (e.g., Guha et al. (2019), Li et al. (2021b)), which may be difficult to obtain in practice (Zhu et al., 2021).

Figure 1: Motivating our proposed methods, FEDCVAE-ENS and FEDCVAE-KD, using the MNIST dataset as an example. In cases of very high statistical heterogeneity, each client will only observe one or two of the ten available classes, as seen on the left where the size of each dot is proportional to the number of samples. For example, client 2 only observes 4's and 7's, resulting in a client decoder that can expertly generate these digits. Note that the columns are shown in order of conditioning class (digits 0-9). Similarly, client 4 is an expert in 3's and 6's. In FEDCVAE-KD, our lightweight knowledge distillation training procedure compacts local learning into a single server decoder, as evidenced by the high-quality samples from all available classes (digits 0-9). This server decoder can then be used for any downstream task, e.g., classification.

To address these issues, we jointly propose FEDCVAE-ENS and FEDCVAE-KD, two novel data-free one-shot FL models that reframe the local learning task using conditional variational autoencoders (CVAEs). Because CVAEs can easily learn a simplified data distribution, both methods train CVAEs locally to capture the narrow conditional data distributions that arise in the high statistical heterogeneity setting. Figure 1 shows how client decoders become experts in the few classes that they observed. These decoders are ensembled (FEDCVAE-ENS) or compactly aggregated (FEDCVAE-KD). More specifically, FEDCVAE-KD aggregates the models using a lightweight knowledge distillation procedure; client decoders are teachers, and the server decoder is the student. Figure 1 shows images generated by the server decoder.
Thorough experiments on multiple benchmark datasets (MNIST, FashionMNIST, SVHN) demonstrate the superiority of FEDCVAE-ENS and FEDCVAE-KD over other relevant one-shot FL methods in the high statistical heterogeneity setting. In particular, FEDCVAE-ENS and FEDCVAE-KD obtain more than 1.75× the accuracy of the best baseline method for MNIST, more than 2× the accuracy for FashionMNIST, and more than 2.75× the accuracy for SVHN under extreme statistical heterogeneity (i.e., clients only observe one or two classes). Furthermore, to protect the decoders uploaded to the server, we propose a method to shift the center of the CVAE prior distribution. We show that without knowing the center of the prior, an eavesdropping attacker cannot train a performant classifier, thus promoting pipeline security. In sum, our contributions are two one-shot FL methods targeted to the high statistical heterogeneity setting that: (1) perform substantially better than other baseline methods in this setting, (2) demonstrate invariance to the number of clients, (3) are data-free and can be applied to any downstream task requiring a labeled dataset, (4) allow for heterogeneous local model architectures, and (5) extend to promote pipeline security. To the best of our knowledge, we are the first to thoroughly address very high statistical heterogeneity in one-shot FL.

2. PRELIMINARIES

Conditional Variational Autoencoders. A variational autoencoder (VAE) is a probabilistic generative model that attempts to learn the distribution of data samples (Kingma & Welling, 2014). A VAE is a latent variable method that models the joint distribution $p_\theta(x, z)$ of a data sample $x \in \mathcal{X} \subseteq \mathbb{R}^D$ and latent variable $z \in \mathcal{Z} \subseteq \mathbb{R}^d$, where usually $d \ll D$. This joint can be factorized as $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$, where the prior $p(z)$ is usually chosen to be a multivariate standard normal distribution, i.e., $p(z) = \mathcal{N}(0, I)$. The posterior $p_\theta(z \mid x)$ is approximated via an inference model $q_\phi(z \mid x)$ called the encoder, and the model for the conditional likelihood $p_\theta(x \mid z)$ is called the decoder. In our case, both the encoder and decoder are deep neural networks parameterized by $\phi$ and $\theta$, respectively. Conditional VAEs (CVAEs) extend the basic VAE by conditioning the encoder and decoder on the one-hot encoding of the class label $y \in \mathcal{Y} = \{1, 2, \ldots, K\}$ corresponding to data sample $x$, resulting in the conditional encoder $q_\phi(z \mid x, y)$ and conditional decoder $p_\theta(x \mid z, y)$. A CVAE is trained by maximizing the variational lower bound:

$$\mathcal{L}(\phi, \theta; x, y) = -D_{KL}\big(q_\phi(z \mid x, y) \,\|\, p(z)\big) + \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(x \mid z, y)\big], \tag{1}$$

which lower-bounds the conditional log marginal likelihood of the data, $\log p(x \mid y)$. Here, $D_{KL}(\cdot)$ represents KL-divergence.

Knowledge Distillation. Knowledge distillation (KD) aims to extract information from a trained teacher model to train a separate lightweight student model (Buciluǎ et al., 2006; Hinton et al., 2015). Typically, the student model is trained by minimizing the discrepancy between student and teacher logits generated using a suitable auxiliary dataset (Hinton et al., 2015); KL-divergence is often chosen as the measure of discrepancy. Some works ensemble teacher models, using the average of teacher logits in an attempt to compact ensemble knowledge into a single student model (Anil et al., 2018; Dvornik et al., 2019; Furlanello et al., 2018). Several works have integrated KD into FL to mitigate privacy risks, reduce upload costs, or regularize local learning using an ensemble-of-teachers approach (Lin et al., 2020; Zhu et al., 2021; Guha et al., 2019). However, KD approaches in FL usually require an auxiliary public dataset with similar properties to the distributed dataset, which may be difficult to obtain in practice (Zhu et al., 2021).

One-shot Federated Learning. In the federated setting, we have a set of clients $C$, with $m = |C|$ clients in total. Each client $k$ has a local private dataset $D_k = \{(x_i, y_i)\}_{i=1}^{n_k}$, with $n_k = |D_k|$ representing the number of data samples belonging to client $k$. Traditional FL methods assume each client has a local differentiable model $f_{w_k}(\cdot)$, usually a deep neural network parameterized by $w_k$. It is typically assumed that the server and clients can communicate over multiple rounds. However, in the one-shot FL setting communication is restricted to a single round, which severely limits communication costs but also increases the difficulty of the distributed learning task (Guha et al., 2019). Notably, existing one-shot FL methods either ignore the issue of statistical heterogeneity (e.g., Guha et al. (2019)), fail to comprehensively explore the effect of statistical heterogeneity on performance (e.g., Shin et al. (2020) and Li et al. (2021b)), or degrade substantially at even moderate levels of statistical heterogeneity (e.g., Zhou et al. (2020) and Zhang et al. (2021)).
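As a concrete illustration of the variational lower bound defined in the CVAE paragraph above, the following NumPy sketch evaluates a single-sample estimate of the bound. It assumes a diagonal Gaussian encoder (outputs `mu` and `log_var`), a standard normal prior, and a Bernoulli decoder likelihood, for which the KL term has the usual closed form. All function names are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def bernoulli_log_likelihood(x, x_recon, eps=1e-7):
    """log p(x | z, y) for a Bernoulli decoder; equals minus the binary cross entropy."""
    x_recon = np.clip(x_recon, eps, 1.0 - eps)
    return np.sum(x * np.log(x_recon) + (1.0 - x) * np.log(1.0 - x_recon), axis=-1)

def cvae_lower_bound(x, x_recon, mu, log_var):
    """Single-sample estimate of L = -KL(q || p) + E_q[log p(x | z, y)], per example.

    Assumption: x_recon was produced by decoding one z drawn from the encoder
    distribution q(z | x, y), conditioned on the same label y.
    """
    return -gaussian_kl_to_standard_normal(mu, log_var) + bernoulli_log_likelihood(x, x_recon)
```

Training would maximize this quantity (or minimize its negative) with respect to both encoder and decoder parameters; the sketch only evaluates the bound and leaves the networks and optimizer out.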

3. TACKLING VERY HIGH STATISTICAL HETEROGENEITY IN ONE-SHOT FL

We jointly propose FEDCVAE-ENS and FEDCVAE-KD, one-shot FL methods that do not require an auxiliary public dataset for server-side training (they are data-free). They address issues caused by high statistical heterogeneity by reframing the learning task using CVAEs and account for model heterogeneity by allowing different CVAE architectures across clients. FEDCVAE-ENS is described in Algorithm 2 (Appendix A) and visualized in Figure 6 (Appendix A), with FEDCVAE-KD described in Algorithm 1 and visualized in Figure 2. We discuss privacy- and security-promoting extensions in Appendix B and Section 3.3, respectively.

3.1. OVERVIEW

Figure 2 illustrates the overall framework of the proposed one-shot method FEDCVAE-KD, and Figure 6 (Appendix A) shows FEDCVAE-ENS. Specifically, clients first train CVAEs locally on their private data. Next, each client's trained decoder parameters and local label distribution are uploaded to the server once; this is the only communication round. Then, the server generates samples from each client decoder according to that client's local label distribution. Generating samples based on a client's label distribution ensures that each client presents samples from the classes that it knows best. In FEDCVAE-ENS, these generated samples are directly used to perform a downstream task, e.g., training a classifier. In FEDCVAE-KD, these samples are used to train a single server decoder via KD. Thus, the single conditional decoder can be used as a compact labeled dataset to perform a task like training a classifier, as depicted in Figure 2. Because FEDCVAE-KD extends FEDCVAE-ENS, we leave the description of FEDCVAE-ENS to Appendix A.

Figure 2: The full pipeline for one of our proposed methods, FEDCVAE-KD. Here, E, D, and C represent "encoder," "decoder," and "classifier" models, respectively. First, clients train CVAEs on their private local datasets. Then, the server uses uploaded client decoder parameters and local label distributions to train a server decoder using knowledge distillation (KD). Finally, synthetic labeled samples from the server decoder are used to train a classifier.

Each client $k$ first trains a local CVAE with parameters $w_k = [\phi_k, \theta_k]$, with encoder $E_{\phi_k}(\cdot)$ and decoder $D_{\theta_k}(\cdot)$. Then, clients communicate their decoder weights $\theta_k$ and label distributions $p_k(y)$ to the server; in practice, this simply requires upload of client label counts as in Zhu et al. (2021). This completes the single communication round. Now we move to the server.
Intuitively, if a CVAE observes samples primarily from only a few of the K total classes (as is likely when data is highly heterogeneous), the CVAE will become an "expert" in the simplified data distribution over those few classes. To ensure each client only presents its highest-quality samples, we generate conditioning classes $y$ by sampling from the client's local label distribution, i.e., $y \sim p_k(y)$. Then, to sample from client decoders, we sample a latent vector from the prior (i.e., $z \sim \mathcal{N}(0, I)$)¹ and obtain synthetic data sample $i$ from client $k$ using $x_i^k = D_{\theta_k}(z_i^k; y_i^k)$.

The trained client decoders act as the teacher models, conveying their aggregate knowledge of how to map from latent space to data space to a single student server decoder, parameterized by $\theta_S$. To achieve this, we generate $n_D$ total KD training samples, with $D_{Ens}$ defined as the combination of client subsets $D^k_{Ens} = \{(x_i^k, y_i^k, z_i^k)\}_{i=1}^{\lfloor n_D/m \rfloor}$. Then, we train the server decoder to match the teachers' mapping of $(z_i^k, y_i^k)$ to $x_i^k$ by minimizing a reconstruction loss:

$$\ell_{KD}(\theta_S; z^k, y^k, x^k) = g\big(D_{\theta_S}(z^k; y^k),\, x^k\big). \tag{2}$$

Algorithm 1: FEDCVAE-KD
 2: for each client $k \in C$ in parallel do
 3:     $\theta_k, p_k(y) \leftarrow$ ClientLocalUpdate$(k, T_L)$   ▷ See Algorithm 3 in Appendix A
 4: Generate samples from client $D^k_{Ens} := \{(x_i^k, y_i^k, z_i^k)\}_{i=1}^{\lfloor n_D/m \rfloor}$ using client decoder $D_{\theta_k}(\cdot)$ and label distribution $p_k(y)$
 5: Combine client subsets into a KD dataset $D_{Ens} := D^1_{Ens} \cup D^2_{Ens} \cup \ldots \cup D^m_{Ens}$
 6: for server decoder epoch $i = 1$ to $T_{KD}$ do
 7:     for mini-batch $b \subset D_{Ens}$ do
 8:         $\theta_S \leftarrow \theta_S - \eta_{KD} \cdot \nabla_{\theta_S} \ell_{KD}(\theta_S; b)$
 9: Generate an IID labeled dataset $D_C := \{(x_i^S, y_i^S)\}_{i=1}^{n_C}$ by sampling from trained server decoder $D_{\theta_S}(\cdot)$
10: for classifier epoch $i = 1$ to $T_C$ do
11:     for mini-batch $b \subset D_C$ do
12:         $w_C^S \leftarrow w_C^S - \eta_C \cdot \nabla_{w_C^S} \ell_C(w_C^S; b)$

Because FEDCVAE-KD ensembles client decoders to create a labeled dataset $D_{Ens}$ to train the server decoder, each client can have a unique CVAE model architecture, accommodating each client's computational limitations. Furthermore, the decision of the classifier architecture can be deferred until after FL is finished and will not affect the learning procedure. FEDCVAE-KD can be applied to any task that requires a labeled dataset, which is more general than classification; there is no commitment to a particular terminal task before learning occurs. While we do not explore the extended communication setting, we note that FEDCVAE-KD extends naturally by communicating the server decoder parameters obtained through KD to all clients and repeating the outlined procedure for non-terminal communication rounds.
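The server-side generation step described above (sample a label from each client's label distribution, sample a latent vector from the prior, decode, and pool the per-client subsets) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the "decoders" are random linear maps followed by a sigmoid that merely stand in for trained client CVAE decoders, and `build_kd_dataset` and `make_toy_decoder` are hypothetical names, not functions from the authors' repository.

```python
import numpy as np

def make_toy_decoder(latent_dim, num_classes, data_dim, seed):
    """Stand-in for a trained client decoder D(z; y): a fixed random linear map + sigmoid."""
    w = np.random.default_rng(seed).standard_normal((latent_dim + num_classes, data_dim))

    def decode(z, y_onehot):
        logits = np.concatenate([z, y_onehot], axis=1) @ w
        return 1.0 / (1.0 + np.exp(-logits))

    return decode

def build_kd_dataset(decoders, label_dists, n_total, latent_dim, rng):
    """Pool floor(n_total / m) synthetic (x, y, z) triples from each of the m client decoders."""
    per_client = n_total // len(decoders)
    xs, ys, zs = [], [], []
    for decode, p_k in zip(decoders, label_dists):
        num_classes = len(p_k)
        y = rng.choice(num_classes, size=per_client, p=p_k)    # y ~ p_k(y)
        z = rng.standard_normal((per_client, latent_dim))      # z ~ N(0, I)
        x = decode(z, np.eye(num_classes)[y])                  # x = D(z; y)
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(zs)
```

In the KD variant, the pooled `(z, y)` pairs would be the inputs and `x` the regression targets for the server decoder; in the ensemble variant, the `(x, y)` pairs would train the classifier directly. The footnoted truncated-normal sampling is not implemented here.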

3.3. SECURITY-PROMOTING EXTENSION

We define a secure pipeline as one where an outside attacker who obtains transferred data cannot train a performant classifier (Zhou et al., 2020) . In the case of FEDCVAE-ENS and FEDCVAE-KD, an attacker who intercepts all client decoders and local label distributions should not be able to generate the high quality samples necessary to train a high quality classifier. CVAEs use a prior distribution over latent space to train the encoder and decoder models. While a multivariate standard normal distribution is typically used for convenience, any normal distribution is acceptable. To promote security, we propose to shift the center of the prior distribution µ to a random position in real space (i.e., µ ∈ R d ), which can be communicated offline or via encryption methods between server and clients (Zhou et al., 2020) . As shown in Figure 7 (Appendix B), sampling latent vectors too far from the center of the normal prior produces qualitatively poor data samples, deterring eavesdropping attackers who have no knowledge of µ. We conduct experiments to verify the effectiveness of this extension.
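The geometric intuition behind the shifted prior can be checked numerically. The sketch below, a toy experiment rather than the paper's evaluation, compares latent vectors drawn by a party that knows µ with latent vectors drawn by an attacker who guesses a uniform box around the origin, and measures how many fall near the true center. The latent dimension, the range from which µ is drawn, and the radius threshold are all illustrative assumptions; the function names are hypothetical.

```python
import numpy as np

def sample_shifted_prior(mu, n, rng):
    """Latent vectors from N(mu, I): what a party that knows the secret center would draw."""
    return mu + rng.standard_normal((n, mu.shape[0]))

def sample_attacker_guess(bound, n, dim, rng):
    """Latent vectors from U(-bound, bound)^dim: an attacker's guess without knowledge of mu."""
    return rng.uniform(-bound, bound, size=(n, dim))

def fraction_near_center(z, mu, radius):
    """Share of latent vectors within Euclidean distance `radius` of the prior center."""
    return float(np.mean(np.linalg.norm(z - mu, axis=1) <= radius))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8                                  # assumed latent dimension
    mu = rng.uniform(-50.0, 50.0, size=dim)  # assumed range for the secret center
    radius = 2.0 * np.sqrt(dim)              # covers almost all mass of N(mu, I)
    server = sample_shifted_prior(mu, 2000, rng)
    attacker = sample_attacker_guess(100.0, 2000, dim, rng)
    print("server  :", fraction_near_center(server, mu, radius))
    print("attacker:", fraction_near_center(attacker, mu, radius))
```

Because the high-density region of the prior occupies a vanishing fraction of a wide box in even a modest number of dimensions, almost none of the attacker's latent vectors land where the decoder was trained, which is consistent with the degraded sample quality the section describes.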

4.1. SETUP

Benchmark Datasets. To validate FEDCVAE-ENS and FEDCVAE-KD, we conduct experiments on three image datasets that are standard in the FL literature: MNIST (Lecun et al., 1998), FashionMNIST (Xiao et al., 2017), and SVHN (Netzer et al., 2011).³ Datasets are described in Appendix C. To partition the data across clients, we sample $p^k \sim \mathrm{Dir}(\alpha)$ and allocate a $p^k_i$ proportion of class $i$ to client $k$. The parameter α controls the level of non-IID-ness, with a lower α inducing more skewed label distributions across clients. To illustrate the effect of α on dataset partitions across clients, we visualize the distribution of labels across m = 10 clients for α ∈ {0.001, 0.01, 0.05} in Figure 3.

Baseline Methods. We compare the performance of FEDCVAE-ENS and FEDCVAE-KD in the one-shot data-free FL setting against two existing methods: FEDAVG (McMahan et al., 2017) and a method proposed in Guha et al. (2019), which we call FEDONESHOT. FEDONESHOT ensembles the predictions of select uploaded client classifiers using a sampling procedure; because we consider substantially fewer clients than Guha et al. (2019), we disregard sampling and use all clients in the ensemble. There are recent FL methods that are not appropriate in our proposed setting. The one-shot methods proposed in Li et al. (2021b) and Shin et al. (2020) are not applicable because of their reliance on public auxiliary data for server-side training or fine-tuning. Similarly, many standard FL methods are not appropriate because they depend on an auxiliary dataset (e.g., FEDDF (Lin et al., 2020)) or focus on regularization (e.g., FEDGEN (Zhu et al., 2021), FEDPROX (Li et al., 2020c), SCAFFOLD (Karimireddy et al., 2020), FEDNOVA (Wang et al., 2020)), which is incompatible with the one-shot setting.

Configurations. Unless otherwise stated, we use m = 10 clients, α = 0.01 (very heterogeneous), and report average test accuracy across 5 seeded parameter initializations ± one standard deviation. The data partition is fixed unless otherwise stated. Following Zhu et al. (2021), we distribute 50% of the available training data to clients for MNIST and FashionMNIST, and 100% for SVHN. All available test data is used to evaluate the final server classifier (or ensemble for FEDONESHOT). We adopt the same convolutional classifier architecture as McMahan et al. (2017) for all methods. We base our CVAE architecture on Higgins et al. (2017). The server decoder for FEDCVAE-KD has the same architecture as the client CVAE decoders by default, although this is not strictly necessary because FEDCVAE-KD supports heterogeneous CVAE architectures. For each method, hyperparameters were obtained through tuning, with the bounds of the search grid extended until the best-performing value appeared in the middle of the grid. Full hyperparameter settings can be found in Table 3 and Table 4 (Appendix C). All classifiers use a cross-entropy objective. CVAE training uses binary cross entropy and mean squared error for the reconstruction term of the objective for grayscale and RGB images, respectively; we use the same reconstruction objective for the KD loss (g(•) in Equation 2).
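To make the Dirichlet-based partition from the setup above concrete, here is a minimal NumPy sketch. It follows one common reading of the procedure, stated here as an assumption: for every class, a proportion vector over the m clients is drawn from a symmetric Dirichlet with concentration α, and that class's sample indices are split accordingly. The function name `dirichlet_partition` is hypothetical, and the authors' exact partitioning code may differ.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, rng):
    """Split sample indices across clients, class by class, using Dirichlet proportions.

    Lower alpha concentrates each class on fewer clients (more non-IID).
    """
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert cumulative proportions into split points over this class's indices.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return client_indices

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(10), 100)   # toy stand-in for dataset labels
    parts = dirichlet_partition(labels, num_clients=10, alpha=0.01, rng=rng)
    for k, idx in enumerate(parts):
        print(k, np.bincount(labels[idx], minlength=10))
```

With very small α, the printed per-client class counts are typically dominated by one or two classes, and some clients may receive no data at all; a real pipeline would need to decide how to handle such empty clients.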

4.2. GENERAL RESULTS

Statistical Heterogeneity. To demonstrate the efficacy of FEDCVAE-ENS and FEDCVAE-KD in the difficult setting of high statistical heterogeneity, we test on varying levels of α, from high (α = 0.05), to very high (α = 0.01), to extreme (α = 0.001) statistical heterogeneity. FEDCVAE-ENS consistently outperforms all other methods, and FEDCVAE-KD outperforms the baselines in nearly all datasets and levels of α (the only exception is MNIST at the lowest level of statistical heterogeneity) as shown in Table 1 . At α = 0.001, FEDCVAE-KD obtains more than 1.75× the accuracy of the best baseline method for MNIST, more than 2× the accuracy for FashionMNIST, and nearly 2.75× the accuracy for SVHN (Table 1 ). While the baselines FEDAVG and FEDONESHOT are very sensitive to the level of statistical heterogeneity, both FEDCVAE-ENS and FEDCVAE-KD demonstrate consistent performance across levels of α (Table 1 ).

Number of Clients.

Because FL applications often include many participating clients (Li et al., 2020a), we evaluate several values for the number of clients (m ∈ {5, 10, 20, 50}). While both FEDAVG and FEDONESHOT struggle when many clients are present, Table 2 shows that FEDCVAE-ENS and FEDCVAE-KD perform consistently across the number of clients, with the only exception being FEDCVAE-KD for SVHN. Even though the accuracy of FL methods typically degrades with increasing numbers of clients (Zhang et al., 2021), FEDAVG and FEDONESHOT are unstable with no clear decrease in accuracy, which we ascribe to the highly variable partitions generated at high levels of statistical heterogeneity. Experiments varying the dataset partitions reveal high variation in accuracy for FEDAVG and FEDONESHOT, whereas both FEDCVAE-ENS and FEDCVAE-KD exhibit consistent accuracy (Table 5 in Appendix C).

Decoder Aggregation. The proposed KD aggregation method used in FEDCVAE-KD substantially improves on aggregation via parameter averaging, generating qualitatively more realistic samples (Figure 8 in Appendix C) and achieving higher server classifier accuracy (Table 6 in Appendix C).

Figure 5: Accuracy of a classifier trained on samples from the intercepted client decoders, using client label distributions, when the center of the normal prior is unknown. The sampling bounds represent the parameters of the uniform distribution used for latent vector sampling, i.e., ±100 represents the distribution U(-100, 100). The prior is a multivariate standard normal N(0, I).

4.3. EXTENSIONS

Heterogeneous Local Models. To simulate clients with diverse computational resources, we train both FEDCVAE-ENS and FEDCVAE-KD using two local CVAE architectures: the first as described in Section 4.1 and the second with one convolutional/deconvolutional block removed from the encoder/decoder, respectively. The server decoder matches the one described in Section 4.1 but can be chosen arbitrarily. Because the two architectures demonstrate similar generative capabilities, final server classifier accuracy is very similar when comparing homogeneous against heterogeneous local architectures (Figure 4).⁴ Notably, our KD procedure for FEDCVAE-KD still performs well using heterogeneous models, indicating that diverse architectures can organize latent space similarly enough to successfully transfer this knowledge to a single decoder.

Promoting Security. We verify the effectiveness of our proposed distribution shift extension for securing the uploaded information. Suppose an eavesdropping attacker is able to intercept the label distributions, decoder weights, and decoder architectures from all clients during upload. Without knowledge of the shared center of the multivariate normal prior µ, we show that training a performant classifier is infeasible because it is difficult to extract high-quality samples from the client decoders. In particular, even when the attacker samples latent vectors z from a broad region that overlaps with the high-density region of the prior (i.e., a uniform distribution centered on µ), the accuracy of the classifier trained on the resulting samples degrades sharply as the sampling region grows (Figure 5). Even a good guess of U(-10, 10) for a normal prior of N(0, I) results in a 3 to 30 percentage point decrease in accuracy depending on the dataset. A more realistic guess of U(-1000, 1000) results in a 33 to 67 percentage point decrease. Therefore, this simple extension to FEDCVAE-ENS and FEDCVAE-KD reduces eavesdropping attackers' capacity to extract high-quality samples from uploaded decoders or train a performant downstream model, reducing communication risks.

5. RELATED WORK

FL Under High Statistical Heterogeneity. FL was originally proposed by McMahan et al. (2017) as a paradigm for decentralized distributed learning. Statistical heterogeneity quickly emerged as a core issue within FL, with many studies focusing on maintaining high performance under very non-IID data. Many approaches have experimented with augmenting FEDAVG by adding proximal terms to the local objective as an attempt to restrain local updates (Li et al., 2020c; Karimireddy et al., 2020; Wang et al., 2020; Acar et al., 2021; Li et al., 2021a) . Other works have used KD to circumvent issues associated with parameter averaging under non-IID data, focusing on leveraging auxiliary data (Lin et al., 2020; Sattler et al., 2021) or a generative model (Zhu et al., 2021; Zhang et al., 2022) to compactly capture client ensemble learning. Recent approaches focus on improving client selection strategy (Tang et al., 2022) , reducing catastrophic forgetting to balance global knowledge against local learning (Huang et al., 2022) , or improving the generality of client models (Mendieta et al., 2022) . However, all of these methods are designed for standard FL and rely heavily on local regularization through substantial iterative communication, which is not feasible in the one-shot FL setting where communication is limited to a single round. One-Shot FL. Methods in one-shot FL have demonstrated strong performance under substantial communication constraint. Guha et al. (2019) originally proposed one-shot FL and introduced two methods, one based on heuristic selection methods for client inclusion in the final ensemble and another that used KD with an auxiliary dataset for ensemble aggregation. Li et al. (2021b) extended the use of KD with a hierarchical KD procedure with wide applicability to a variety of local model types. Although these methods obtain high accuracy, they rely on an auxiliary public dataset to support KD, which is inapplicable in data-free FL. 
Secure dataset transfer has been applied in several studies: Zhou et al. (2020) achieved this through dataset distillation, which requires a shared model architecture, and Shin et al. (2020) provided limited experiments using XOR-based data augmentation techniques. Zhang et al. (2021) proposed a data-free KD procedure based on a generator network trained using the ensemble of client classifiers, showing promising performance even under heterogeneous local models. However, existing approaches in one-shot FL either do not experiment with high statistical heterogeneity or degrade even under moderate levels of heterogeneity, unlike our proposed methods.

VAEs in FL.

A few studies have experimented with using VAEs in FL. Kasturi et al. (2022) proposed a distributed learning framework based on VAEs, but required an auxiliary pre-trained classifier to generate sample labels and did not specify how to obtain this classifier. Wen et al. (2020) and Gu & Yang (2021) used CVAEs to protect against malicious clients, but only included limited experimental results with respect to statistical heterogeneity, relied on multiple communication rounds, and, in the case of Wen et al. (2020) , did not use KD for server-side aggregation. We are the first to apply CVAEs to one-shot FL with a focus on high statistical heterogeneity.

6. CONCLUSION

In this paper, we proposed FEDCVAE-ENS and FEDCVAE-KD, data-free one-shot FL methods that reframe the local learning task using CVAEs. Both methods perform well under high statistical heterogeneity, demonstrate consistent performance with increasing numbers of clients, allow for model heterogeneity across clients, and can be extended to promote security. Extensive experimental results showed that FEDCVAE-ENS and FEDCVAE-KD substantially advance the state of the art in one-shot FL under very high statistical heterogeneity, making way for more nuanced development in this difficult environment for distributed learning.

A ADDITIONAL DESCRIPTION OF METHODS

FEDCVAE-ENS Description. FEDCVAE-ENS follows the same procedure as FEDCVAE-KD, but instead defines $D_{Ens}$ as the combination of client subsets $D^k_{Ens} := \{(x_i^k, y_i^k)\}_{i=1}^{\lfloor n_C/m \rfloor}$ and uses $D_{Ens}$ to directly train the server classifier. Figure 6 visualizes the full pipeline for FEDCVAE-ENS and Algorithm 2 details the full procedure. Algorithm 3 shows the local client training procedure, which is the same for FEDCVAE-ENS and FEDCVAE-KD. Note that while we represent parameter optimization as using stochastic gradient descent in all algorithms, any optimizer can be used; for our experiments, we uniformly use Adam (Kingma & Ba, 2014).

Algorithm 2: FEDCVAE-ENS
 2: for each client $k \in C$ in parallel do
 3:     $\theta_k, p_k(y) \leftarrow$ ClientLocalUpdate$(k, T_L)$
 4: Generate samples from each client $D^k_{Ens} := \{(x_i^k, y_i^k)\}_{i=1}^{\lfloor n_C/m \rfloor}$
 5: Combine client subsets into a dataset $D_{Ens}$
 6: for classifier epoch $i = 1$ to $T_C$ do
 7:     for mini-batch $b \subset D_{Ens}$ do
 8:         $w_C^S \leftarrow w_C^S - \eta_C \cdot \nabla_{w_C^S} \ell_C(w_C^S; b)$

Algorithm 3: ClientLocalUpdate
 4:     for mini-batch $b \subset D_k$ do
 5:         $w_k \leftarrow w_k - \eta \cdot \nabla_{w_k} \ell(w_k; b)$   ▷ Optimize based on Equation 1
    return decoder parameters $\theta_k$ and local label distribution $p_k(y)$ to the server

B DETAILS ON THE PRIVACY- AND SECURITY-PROMOTING EXTENSIONS

... the FEDCVAE-ENS and FEDCVAE-KD pipelines, further enhancing the privacy of our proposed methods. We leave further exploration of privacy-preserving extensions to future work.
Quality of Decoder Samples. To generate high-quality samples from a trained CVAE, it is typical to sample latent vectors z i either directly from the prior (usually a multivariate standard normal, i.e., z i ∼ N (0, I)) or from some other distribution with tight bounds around the prior distribution's mean (e.g., a truncated standard normal or a uniform distribution). During training, the CVAE will largely observe latent vectors in the highest density region of the prior distribution; for a standard normal distribution, this is near the center µ = 0. Latent vectors distant from the center will not generate high-quality samples when used with the trained decoder. As a demonstration, we train a centralized CVAE and sample both close to the center of the prior (i.e., z i ∼ U(-1, 1)) and distant from the center of the prior (i.e., z i ∼ U(5, 20)). The resulting image samples are shown in Figure 7 . 

C ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS

Benchmark Datasets. MNIST and FashionMNIST contain 28 × 28 grayscale images of handwritten digits and clothing/accessories, respectively, each with 60,000 train samples and 10,000 test samples. SVHN contains 32 × 32 RGB image crops of street-view house numbers, with 73,257 train samples and 26,032 test samples.

Hyperparameter Settings. Tables 3 and 4 contain fixed and variable hyperparameters, respectively. While the number of local epochs ($T_L$) may seem low for FEDAVG, we observed substantially reduced accuracy at higher numbers of local epochs, which is consistent with Lin et al. (2020) and references therein.

Stability With Respect to Dataset Partition. To complement the results in Table 2, we test the stability of each model across varying dataset partitions, controlled by a random seed (Table 5). When α is very low, as in our study, the client dataset splits generated by sampling from the Dirichlet distribution are diverse. This is exacerbated when more clients are used (higher m), potentially explaining some of the unstable results in Table 2. While FEDAVG and FEDONESHOT are very sensitive to the dataset partition, both FEDCVAE-ENS and FEDCVAE-KD exhibit consistent accuracy (Table 5).

Comparing KD With Parameter Averaging. FEDAVG (McMahan et al., 2017) introduced the notion of parameter averaging to FL. While parameter averaging may seem like a reasonable approach for client decoder aggregation in FEDCVAE-KD, it generates qualitatively poor samples (Figure 8) and fails to train a high-accuracy classifier (Table 6). The KD approach we propose to aggregate client decoders generates substantially better samples (Figure 8) while also obtaining more than 2× the classifier accuracy on MNIST, more than 3.25× on FashionMNIST, and nearly 5× on SVHN at α = 0.001 (Table 6).

... (Table 7). While accuracy for both FEDCVAE-ENS and FEDCVAE-KD degrades with less data per client, they both consistently perform better than FEDAVG and FEDONESHOT across all tested percent subsets of the training data.

Adding Noise to Uploaded Label Distributions.
Uploading client label distributions pk (y) may generate additional privacy concerns. One potential solution is to mask the precise label counts for each client by adding noise before upload as in Zhang et al. (2022) ; to achieve this, we draw noise from a normal distribution ϵ c ∼ N (0, γ • n k ) such that the "strength" (variance) of the noise applied to class c is in proportion γ to the total number of training samples for client k. Noise is applied to the training sample count for each class for a given client before uploading this information to the server. We visualize the effect of noise on client label distributions for several levels of γ in Figure 9 . As γ → ∞, information from the uploaded label distribution disappears; when γ = 0, the exact local label distributions are communicated. When adding a modest amount of noise to the client label distributions (i.e., γ ≤ 0.1), accuracy for both FEDCVAE-ENS and FEDCVAE-KD is barely affected compared to upload with no noise; see Table 8 for γ > 0 and refer to the α = 0.01 1 for γ = 0. The notion of uploading and harnessing client label distributions in FL is new (Zhu et al., 2021; Zhang et al., 2022) and quantifying the privacy risks that label distributions might induce is an open problem which could benefit from focused development. 
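The noise mechanism described above can be illustrated with a minimal sketch. It assumes that negative noisy counts are clipped to zero and that the result is renormalized before upload; the text does not specify either step, and the function name and example counts are hypothetical.

```python
import math
import random


def noisy_label_distribution(counts, gamma, rng):
    """Perturb per-class sample counts and return a normalized label distribution.

    counts: per-class training sample counts for one client.
    gamma:  noise proportion; the noise variance is gamma * n_k.
    """
    n_k = sum(counts)
    std = math.sqrt(gamma * n_k)  # variance gamma * n_k, as stated in the text
    noisy = [c + rng.gauss(0.0, std) for c in counts]
    # Assumption: negative noisy counts are clipped to zero before normalizing.
    clipped = [max(0.0, c) for c in noisy]
    total = sum(clipped)
    if total == 0.0:
        # Assumption: fall back to a uniform distribution if every count was clipped.
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in clipped]


rng = random.Random(0)
counts = [500, 0, 0, 20, 0, 0, 0, 0, 0, 0]  # hypothetical, highly skewed client
exact = noisy_label_distribution(counts, gamma=0.0, rng=rng)
noisy = noisy_label_distribution(counts, gamma=0.1, rng=rng)
```

With γ = 0 the function returns the exact local label distribution; larger γ spreads probability mass onto classes the client does not hold, which is the masking effect visualized in Figure 9.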



In practice, we find it useful to focus on the highest density region of the prior and instead sample from a truncated standard normal distribution with tight symmetric bounds.

If classification is the downstream task, we note that rather than train an auxiliary classifier in the server, the server decoder's conditional likelihood model $p(x \mid z, y)$ could be used directly in the generative classifier $p(y \mid x) = \frac{\int p(x \mid z, y)\, p(z)\, dz \cdot p(y)}{p(x)}$. We leave further exploration of this modeling direction to future work.

The code used to implement our proposed methods and carry out all experiments is included in the following public repository: https://github.com/ceh-2000/fed_cvae.

We choose not to baseline against FEDONESHOT, which supports heterogeneous local models, because our notion of "heterogeneous model" is different; FEDONESHOT supports heterogeneous local classifiers, while our two methods defer the choice of classifier to after FL is complete and instead support heterogeneous local CVAEs.



FEDCVAE-KD: DECODER AGGREGATION USING KNOWLEDGE DISTILLATION

FEDCVAE-KD trains a CVAE $f_{w_k}(\cdot)$ for every client k ∈ C to convergence on their private local dataset by solving Equation 1; this CVAE is parameterized by w

which penalizes the dissimilarity $g(\cdot)$ in data space between the synthetic data sample generated by the server decoder $D_{\theta_S}(z_k; y_k)$ and the client decoder sample $x_k$. To facilitate comparison with existing works in one-shot FL, we use the trained server decoder to generate an IID labeled dataset $D_C$ of $n_C$ samples to train the server classifier $f_{w_S^C}(\cdot)$, parameterized by $w_S^C$.

Algorithm 1 - FEDCVAE-KD in the one-shot FL setting. $T_L$ represents the number of local training epochs. The server decoder parameters are $\theta_S$, with KD training epochs $T_{KD}$, number of KD training samples $n_D$, KD loss $\ell_{KD}(\cdot)$, and KD learning rate $\eta_{KD}$. The server classifier parameters are $w_S^C$, with training epochs $T_C$, number of training samples $n_C$, classification loss $\ell_C(\cdot)$, and learning rate $\eta_C$. C is the set of clients.
1: procedure SERVER
2:

Figure 3: Example distributions of class labels for MNIST with m = 10 clients over multiple levels of statistical heterogeneity α. The size of each dot is proportional to the number of samples.
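Label distributions like those in Figure 3 arise from Dirichlet-based partitioning. The following is a minimal sketch, assuming the common scheme in which each class is split across the m clients according to proportions drawn from a symmetric Dirichlet(α); the exact partitioning procedure used in the experiments may differ, and the function names are hypothetical.

```python
import random


def dirichlet_sample(alpha, m, rng):
    """Sample from a symmetric Dirichlet(alpha) over m clients via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(m)]
    total = sum(draws)
    if total == 0.0:
        # For very small alpha every draw can underflow to 0; put all mass on one client.
        draws = [0.0] * m
        draws[rng.randrange(m)] = 1.0
        total = 1.0
    return [d / total for d in draws]


def dirichlet_partition(labels, m, alpha, rng):
    """Assign sample indices to m clients, splitting each class by Dirichlet proportions."""
    clients = [[] for _ in range(m)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        props = dirichlet_sample(alpha, m, rng)
        start = 0
        cumulative = 0.0
        for k in range(m):
            cumulative += props[k]
            if k == m - 1:
                end = len(idx)  # the last client takes whatever remains
            else:
                end = max(start, min(len(idx), round(cumulative * len(idx))))
            clients[k].extend(idx[start:end])
            start = end
    return clients


rng = random.Random(0)
labels = [i % 10 for i in range(1000)]  # hypothetical balanced 10-class label list
clients = dirichlet_partition(labels, m=10, alpha=0.01, rng=rng)
```

At low α, most of each class lands on a single client, so clients typically hold only a few classes; this is the regime the paper describes as very high statistical heterogeneity.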

Figure 4: Results with heterogeneous local models. "Homogeneous" uses the same CVAE architecture for all clients, whereas "heterogeneous" uses two architectures with similar generative capabilities.

Figure 6: The full pipeline for one of our proposed methods, FEDCVAE-ENS. Here, E, D, and C represent "encoder," "decoder," and "classifier" models, respectively. First, clients train CVAEs on their private local datasets. Then, the server uses the ensemble of uploaded client decoders and corresponding local label distributions to generate a labeled dataset of synthetic samples to train a classifier.

m⌋ i=1 using client decoder $D_{\theta_k}(\cdot)$ and label distribution $p_k(y)$
5: Combine client subsets into an IID labeled dataset $D_{Ens} := D^1_{Ens} \cup D^2_{Ens} \cup \ldots \cup D^m_{Ens}$
6:

Figure 7: CVAE samples generated using latent vectors distant from the center of the multivariate normal prior (top row) and close to the center (bottom row).

Figure 8: Samples from the aggregated server decoder obtained through parameter averaging (top row) versus knowledge distillation (bottom row) at high statistical heterogeneity (α = 0.01).

Figure 9: The effect of the noise proportion γ on an example data partition at α = 0.01, m = 10 clients, and on MNIST. The original partition is in yellow (γ = 0). The size of each dot is proportional to the number of samples.

Table 1: Performance of four data-free one-shot FL methods over three datasets and across three levels of statistical heterogeneity (lower α is more heterogeneous). Best results for each dataset and each level of α are in purple, with second best results in yellow.

Table 2: Performance of four one-shot FL methods over three datasets and across four numbers of clients m. Best results for each dataset and each level of m are in purple, with second best results in yellow.

Privacy. A private pipeline is one that does not leak private client data to other participating clients or the server. van den Burg & Williams (2021) define a probabilistic generative model's propensity to reproduce samples observed in the training data as memorization, and prove that ensuring a particular level of differential privacy (DP) can bound memorization in probabilistic generative models (including CVAEs). Thus, beyond the normal privacy guarantees attributable to DP, employing FL DP-estimation techniques (e.g., Geyer et al. (2017)) also ensures low memorization across

Published as a conference paper at ICLR 2023

Algorithm 3 - Local training procedure. $D_k$ represents the client's local dataset. The local learning rate is η and the local loss function is $\ell(\cdot)$.
Initialize local CVAE parameters $w_k := [\phi_k, \theta_k]$

Table 3: The default hyperparameter settings, which are used in experiments unless otherwise mentioned. These values are held consistent across datasets.

Table 4: Dataset-specific hyperparameter settings, where applicable. "Truncated normal width" denotes the truncation bounds for the truncated standard normal distribution used for sampling from client decoders, expressed as a number of standard deviations.
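Sampling latent vectors from a truncated standard normal, as used for the client decoders, can be sketched with simple rejection sampling. This is an illustrative sketch only: the latent dimension and width below are hypothetical values, and the original implementation may use a different sampling routine.

```python
import random


def sample_truncated_standard_normal(dim, width, rng):
    """Draw a latent vector with each coordinate from N(0, 1) truncated to [-width, width]."""
    z = []
    while len(z) < dim:
        value = rng.gauss(0.0, 1.0)
        if abs(value) <= width:  # rejection step: discard draws outside the bounds
            z.append(value)
    return z


rng = random.Random(0)
# Hypothetical latent size and truncation width (in standard deviations).
z = sample_truncated_standard_normal(dim=10, width=1.0, rng=rng)
```

Rejection sampling is easy to verify but becomes slow for very tight bounds, since most draws are discarded; an inverse-CDF sampler would be a natural alternative in that case.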

Table 5: Performance of four one-shot FL methods over three datasets and across four numbers of clients m. Results show the average test accuracy across 5 random dataset partitions ± one standard deviation. Parameter initialization remains constant. Best results for each dataset and each level of m are in purple, with second best results in yellow.

Performance With Less Data Per Client. To gauge our proposed methods' performance relative to the size of each client's local dataset, we vary the percent of the benchmark training data distributed to clients (Table 7).

Table 7: Performance of four data-free one-shot FL methods over three datasets and across multiple percent subsets of each dataset. Best results for each dataset and percent subset are in purple, with second best results in yellow.

Table 8: Performance of our proposed methods with noise added to the uploaded client label distributions. Results show the average test accuracy across 5 seeds for the random noise ± one standard deviation.

