TOWARDS ADDRESSING LABEL SKEWS IN ONE-SHOT FEDERATED LEARNING

Abstract

Federated learning (FL) has been a popular research area, in which multiple clients collaboratively train a model without sharing their local raw data. Among existing FL solutions, one-shot FL is a promising and challenging direction, where the clients conduct FL training with a single communication round. However, under label skew, a common real-world scenario where some clients may have few or no data of certain classes, existing one-shot FL approaches that conduct voting on the local models are unable to produce effective global models. Due to the limited number of classes in each party, the local models misclassify data from unseen classes into seen classes, which leads to very ineffective global models from voting. To address the label skew issue in one-shot FL, we propose a novel approach named FedOV, which generates diverse outliers and introduces them as an additional unknown class in local training to improve the voting performance. Specifically, based on open-set recognition, we propose novel outlier generation approaches that corrupt the original features, and further develop adversarial learning to enhance the outliers. Our extensive experiments show that FedOV can significantly improve the test accuracy compared to state-of-the-art approaches in various label skew settings.

1. INTRODUCTION

Federated learning (FL) (McMahan et al., 2016; Kairouz et al., 2019; Yang et al., 2019; Li et al., 2019) allows multiple clients to collectively train a machine learning model while preserving individual data privacy. Most FL algorithms like FedAvg (McMahan et al., 2016) require many communication rounds to train an effective global model, which causes massive communication overhead, increases privacy concerns, and imposes fault-tolerance requirements across rounds. One-shot FL (Guha et al., 2019; Li et al., 2021c), i.e., FL with only a single communication round, is a promising and challenging direction to address the above issues. On the other hand, label skews are common in real-world applications, where different clients have different label distributions (e.g., hospitals in different regions can face different diseases). As parties may have few or no data of some classes, this poses even greater challenges in one-shot FL. In this paper, we study whether and how we can improve the effectiveness of one-shot FL algorithms for applications with label skews.

A simple and common one-shot FL strategy (Guha et al., 2019; Li et al., 2021c) is to conduct local training and collect the local models as an ensemble. The ensemble is either directly used as a final model for predictions (Guha et al., 2019) or distilled into a single model (Li et al., 2021c) with voting. However, these voting-based approaches fail to produce high-quality federated learning models. Under the label skew setting, since each client has only a portion of the classes, each local model predicts every input as one of its seen classes, and the final voting results are poor. For example, in an extreme case where each client only has one label (e.g., face recognition), every client predicts the input as its own label and the voting result is meaningless.
To address this issue, we propose open-set voting for one-shot FL, which introduces an "unknown" class in the voting, inspired by studies on open-set recognition (OSR) (Neal et al., 2018; Zhou et al., 2021). In local training, the clients train local open-set classifiers that are expected to predict their known classes correctly, while predicting "unknown" when they are unsure about the input data. Then, during inference, the server conducts voting on the received open-set classifiers with the "unknown" option. In this way, open-set voting can filter out the local models that have no knowledge of an input to improve the voting accuracy. Although there are existing OSR studies in the centralized setting, it is challenging to apply them in the label-skewed federated setting to obtain good open-set classifiers during local training due to the limited number of local classes. For example, the state-of-the-art OSR algorithm PROSER (Zhou et al., 2021) considers linear interpolations between different seen classes as outliers. The outliers and the original data are used to train the model, where the outliers are treated as the extra class "unknown". When the number of classes in a client is very small, PROSER generates very limited types of outliers, which are not sufficient for training. Moreover, the classifier has a loose boundary, as the distance between the training data and the generated outliers may be large. To improve the quality of open-set classifiers, we propose a new open-set approach named FedOV with two novel techniques, data destruction and adversarial outlier enhancement, to generate diverse and tight outliers. In data destruction, as opposed to data augmentation, we generate rich outliers by corrupting the key features of the original image using boosted data operations such as random erasing and random resized crop.
In adversarial outlier enhancement, we further optimize the outliers to be close to the training data in an adversarial way such that the local model cannot distinguish them. Experiments show that our proposed FedOV (Federated learning by Open-set Voting) significantly improves the accuracy compared with existing one-shot FL approaches under various label skew cases. To reduce the model size of the FedOV ensemble, we also combine knowledge distillation with FedOV like previous approaches (Lin et al., 2020; Li et al., 2021c). Distilled FedOV can also outperform state-of-the-art one-shot FL algorithms with model distillation. Our main contributions are summarized as follows:

• To the best of our knowledge, we are the first to propose open-set voting in FL by introducing the "unknown" class, which significantly improves the accuracy compared to traditional close-set voting in FL.

• We propose two novel techniques, data destruction and adversarial outlier enhancement, to generate diverse "unknown" outliers without any requirement on the number of classes in the training data.

• We conduct extensive experiments to show the effectiveness of our open-set voting algorithm. Our algorithm consistently outperforms baselines with a significant improvement in accuracy across comprehensive label skew settings, including #C = 1 (each client has only one class), where many FL algorithms fail.

2. RELATED WORK

2.1. NON-IID DATA IN FL

Non-IID data is prevalent among real-world applications. For example, different areas have different types of diseases; another example is that different places host different species. For classification tasks, suppose client i has dataset {x_i, y_i}, where x_i are features and y_i are labels. In the label skew setting, p(y_i) differs across clients. Label skew is difficult because the local optima of different clients can be far away from each other (Li et al., 2021b). There have been many studies (Li et al., 2020; 2021d; Wang et al., 2020a; Luo et al., 2021; Mendieta et al., 2022) trying to alleviate this problem based on the model-averaging framework. For example, FedProx (Li et al., 2020) and MOON (Li et al., 2021d) adjust the local training procedure to pull local models back toward the global model. FedNova (Wang et al., 2020a) normalizes the local steps of each client during aggregation. A recent work (Huang et al., 2022) proposes few-shot model-agnostic FL, which is able to train arbitrary models in a setting where each client has a very small sample size. It applies domain adaptation in the latent space with the help of a large public dataset. However, these algorithms need many rounds to converge, which may not be practical in real-world scenarios. For example, different companies may not be willing to communicate with each other frequently due to privacy and security concerns.
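As a concrete illustration of the label skew setting (not part of the paper's own code), the sketch below partitions a labeled dataset across clients via a per-class Dirichlet(β) draw, i.e., the distribution-based skew p_k ∼ Dir(β) used later in the experiments; the helper name and implementation details are our own, using only the standard library:

```python
import random
from collections import defaultdict

def dirichlet_label_skew(labels, n_clients, beta=0.5, seed=0):
    """Partition sample indices so that each class's samples are split
    across clients according to a Dirichlet(beta) draw (smaller beta
    gives a more skewed partition)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    clients = [[] for _ in range(n_clients)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Dirichlet(beta) proportions via normalized Gamma draws (stdlib only).
        gammas = [rng.gammavariate(beta, 1.0) for _ in range(n_clients)]
        total = sum(gammas)
        props = [g / total for g in gammas]
        # Convert proportions into cumulative split points over this class.
        cum, start = 0.0, 0
        for i, p in enumerate(props):
            cum += p
            end = len(idxs) if i == n_clients - 1 else int(round(cum * len(idxs)))
            clients[i].extend(idxs[start:end])
            start = end
    return clients
```

The quantity-based skew #C = k can be simulated analogously by assigning each client samples from only k classes.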

2.2. ONE-SHOT FL ALGORITHMS

One-shot FL (i.e., FL with only one communication round) is a promising research direction. It has a very low communication cost. Moreover, it enables applications such as model markets (Vartak, 2016), where clients only need to sell their models to the market and users conduct learning with the purchased models. The original one-shot FL study (Guha et al., 2019) collects local models as an ensemble for the final prediction. It further proposes to apply knowledge distillation on such an ensemble with public data. FedKT (Li et al., 2021c) proposes consistent voting to improve the ensemble. A recent work (Zhang et al., 2021) proposes a data-free knowledge distillation scheme to perform one-shot FL. It uses the same ensemble distillation method as FedDF (Lin et al., 2020); its main contribution is fake data generation on the server side to replace the public dataset for distillation. Such a technique is orthogonal to FedOV and can be combined with it. None of the above studies investigates comprehensive label skew cases. For more related studies and a detailed comparison between FedOV and these studies, please refer to Appendix A.1.

2.3. OPEN-SET RECOGNITION (OSR)

OSR is an emerging field with many important applications (Salehi et al., 2021). In OSR, an additional class "unknown" is introduced besides the original classes. A good open-set classifier is expected to predict its known classes correctly, while predicting "unknown" if it is unsure about the input data. A popular approach in OSR is to generate outliers and label them as "unknown" in training. One direction is to use GANs to generate outliers. For example, Neal et al. (2018) apply GANs to generate outliers from the latent space that (1) are close to real samples and (2) have a high probability of being outliers (a low probability of any known class). However, the training of GANs is very expensive. Moreover, it is challenging for GANs to generate clear high-dimensional images. PROSER (Zhou et al., 2021) is a state-of-the-art open-set recognition approach. It generates outliers by linear interpolation in the embedding space between different classes. Moreover, it introduces an additional loss to increase the probability of predicting a sample as "unknown" when its true class is discarded. Suppose the training set is $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a training sample and $y_i \in \{0, 1, ..., c-1\}$ is its label. The neural network from input space to embedding space is denoted $\phi_{pre}(\cdot)$, and $\phi_{post}(\cdot)$ is the network from embedding space to output space. The whole model is $f(\cdot) = \phi_{post}(\phi_{pre}(\cdot))$. The PROSER loss is shown in Equation 1, where $l$ is the cross-entropy loss, $c$ denotes the class "unknown", $\tilde{x}_{pre} = \lambda \phi_{pre}(x_i) + (1-\lambda)\phi_{pre}(x_j)$ is an outlier generated from the training data, $\lambda \in [0,1]$ is sampled from a Beta distribution, and $\beta$ and $\gamma$ are two hyper-parameters:

$$L_{PROSER} = \sum_{(x,y)\in D} \big[\, l(f(x), y) + \beta\, l(f(x)\backslash y,\, c) \,\big] + \gamma \sum_{(x_i, x_j)\in D} l(\phi_{post}(\tilde{x}_{pre}),\, c) \qquad (1)$$

Since PROSER achieves state-of-the-art results and is easy to implement, we consider it as our base method for training an open-set classifier.
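As a rough, self-contained sketch of the PROSER loss in Equation 1 for a single sample pair (not the official PROSER implementation; the Beta(1, 1) choice for λ and the toy `model`/`embed` interfaces are our own assumptions):

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ce(logits, target):
    # cross-entropy l(logits, target)
    return -math.log(softmax(logits)[target])

def ce_without_true(logits, target, unknown):
    # l(f(x) \ y, c): drop the true-class channel, then ask for "unknown"
    kept = [v for i, v in enumerate(logits) if i != target]
    # the unknown index shifts left if it came after the dropped channel
    return ce(kept, unknown - 1 if unknown > target else unknown)

def proser_loss(model, embed, x1, y1, x2, beta=0.01, gamma=1.0, c=10):
    """One-sample-pair version of Equation 1: the classification term,
    the discard-true-class term, and the interpolated-outlier term."""
    lam = random.betavariate(1.0, 1.0)      # assumed Beta(1, 1)
    z = [lam * a + (1 - lam) * b for a, b in zip(embed(x1), embed(x2))]
    logits = model(embed(x1))               # phi_post(phi_pre(x1)), c + 1 outputs
    return (ce(logits, y1)
            + beta * ce_without_true(logits, y1, c)
            + gamma * ce(model(z), c))
```

Here `model` plays the role of $\phi_{post}$ and `embed` of $\phi_{pre}$; the defaults β = 0.01 and γ = 1 follow the trade-off parameters reported in Appendix B.1.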
3. FEDOV: ONE-SHOT FEDERATED OPEN-SET VOTING FRAMEWORK

3.1. PROBLEM STATEMENT

Suppose there are N clients $P_1, ..., P_N$ with local datasets $D_1, ..., D_N$. Our goal is to train a good machine learning model over $D \triangleq \cup_{i\in[N]} D_i$ with the help of a server, while the raw data are not exchanged. Moreover, each client is allowed to communicate with the server only once. In this paper, we focus on image classification tasks due to their popularity.

3.2. MOTIVATION

Observation 1 Voting is a popular method in existing one-shot FL approaches (Guha et al., 2019; Li et al., 2021c). However, these approaches suffer under extreme label skews. For example, when we divide the MNIST dataset into 10 clients where each client has only one class, both close-set voting (Guha et al., 2019) and FedKT (Li et al., 2021c) achieve lower than 20% test accuracy. When each client has two classes, the test accuracy of both methods is lower than 50%. The problem is that the predictions of close-set classification models are biased towards their seen classes, as shown in Figure 1a. When a test sample of class one comes, in traditional close-set voting, the first and third clients make wrong predictions. Therefore, the voting result cannot predict correctly.

Implication 1 In the label skew setting of FL, close-set classifiers are weak for one-shot FL and predict every input as one of their known classes. For voting, it would be better if models could be modest and admit "unknown" for their unseen classes, as shown in Figure 1b. This motivates us to apply OSR in FL to introduce an unknown class to improve the voting.

Implication 2 To better suit OSR algorithms for label skews in FL, we need new techniques to generate outliers, which should 1) be diverse and 2) be close to the seen classes. We introduce them in Section 3.4 and Section 3.5 and explain Figure 2b and Figure 2c in Section 3.6.

3.3. THE OVERALL ALGORITHM

Based on the above observations and implications, we develop a new approach named FedOV to address label skews in federated learning. FedOV addresses the challenges of directly applying OSR to FL in the following two aspects. First, in order to generate diverse outliers, we propose data destruction (DD) to directly generate outliers from true samples. Second, in order to generate outliers that are even closer to true samples, we propose adversarial outlier enhancement (AOE) to learn a tighter boundary surrounding the inliers.

The overall framework of open-set voting is described as follows. In the training stage, each client trains an open-set classifier locally and submits it to the server. In the prediction stage, the server sums up the prediction probabilities of all submitted models on the input sample while discarding their "unknown" channel. The class with the maximum prediction probability is output as the prediction label. An example of open-set voting is shown in Figure 1b. With the help of the unknown class, local models can admit their uncertainty when encountering unseen classes. The first and third models, with little knowledge of class 1, assign very high probability to "unknown", and the second model outputs class 1 with 100% certainty due to its expertise in class 1. In this way, the input image can be correctly classified.

The whole procedure is shown in Algorithm 1. Suppose there are c classes and classes 0 to (c - 1) are the classes from the original training data. We use class c to denote the unknown class. Each client first initializes its local model f_i (Line 2). Then, in each round, for each batch of training data, it generates outliers by data destruction and adversarial outlier enhancement (Lines 5-6, see Sections 3.4 and 3.5 for more details). Next, considering the outliers as the unknown class, the cross-entropy loss is computed on the outliers.
By summing the PROSER loss (computed according to Equation 1) and the cross-entropy loss as the total loss, the local model is updated using the Adam optimizer (Lines 7-8). The local models are sent to the server after reaching the number of training rounds (Line 9). The server aggregates all the local models into an ensemble as the final model (Line 11). When a new sample comes for prediction, it sums the prediction probabilities of all models (Lines 13-15). Then, the known class with the highest probability score is considered as the output label (Line 16). Since FedOV only requires a single communication round, its communication cost is O(NM), where M is the size of a local model. This communication cost is low compared with iterative federated learning algorithms, which need many rounds to communicate the models.

Algorithm 1: The FedOV algorithm. L_ce is the cross-entropy loss and σ is the softmax function. Input: number of clients N, number of classes c, training rounds T

Each client executes:

Initialize local model f_i
for t = 1, ..., T do
    for each batch of local data (x, y) do
        x' ← DataDestruction(x)
        x'' ← FGSM(f_i, x', c)
        L ← L_PROSER + L_ce(f_i({x', x''}), c)
        Update f_i with loss L
Upload f_i to the server

The server executes:

Collect the local models f_1, ..., f_N as an ensemble
For a test sample x, compute scores(x) = Σ_{i=1}^N σ(f_i(x))
Output the argmax over the first c entries of scores(x)
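The prediction stage of Algorithm 1 can be sketched as follows (a minimal illustration, assuming each local model maps an input to c + 1 logits with the last channel standing for "unknown"; the function names are ours):

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]

def open_set_vote(local_models, x, n_classes):
    """Sum the softmax probabilities of every local model on x,
    discard the "unknown" channel, and return the argmax class."""
    scores = [0.0] * (n_classes + 1)
    for f in local_models:
        probs = softmax(f(x))
        scores = [s + p for s, p in zip(scores, probs)]
    known = scores[:n_classes]          # drop the "unknown" channel
    return max(range(n_classes), key=lambda k: known[k])
```

Mirroring Figure 1b: two models unsure about the input assign almost all mass to "unknown", while the expert on class 1 dominates the vote, so the ensemble predicts class 1.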

3.4. DATA DESTRUCTION

Boosting Outliers with a Data Destruction Set

To boost the diversity of outliers, we introduce randomness into the outlier generation. Instead of applying a fixed operation to each image, we treat the above candidate operations as a set and randomly sample one operation each time we generate an outlier. Then, in each batch of data during training, there exist diverse types of outliers generated by different operations. We show that such a boosting method can significantly improve the accuracy in Appendix B.3. In summary, the intuition behind DD is to corrupt part of the key features such that 1) the generated data are not similar to the training data and 2) the generated data are diverse, so that the model does not simply consider certain patterns as outliers.
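A minimal sketch of the boosted data destruction set (toy list-of-lists "images" stand in for tensors, and the operations here are simplified illustrations of the transformations listed in Appendix B.1, not the exact ones we use):

```python
import random

def random_swap(img):
    # swap the left and right halves of each row
    w = len(img[0])
    return [row[w // 2:] + row[:w // 2] for row in img]

def random_erase(img, rng):
    # zero out a large contiguous block of rows to spoil key features
    h = len(img)
    top = rng.randrange(h // 2)
    size = h // 2
    return [([0] * len(row) if top <= i < top + size else list(row))
            for i, row in enumerate(img)]

def random_paste(img, rng):
    # paste the top half of the image over the bottom half
    h = len(img)
    return [list(img[i % (h // 2)]) for i in range(h)]

def data_destruction(img, rng=random):
    # Boosting: sample one operation from the set per call, so a batch
    # contains outliers produced by diverse corruption types.
    ops = [random_swap,
           lambda im: random_erase(im, rng),
           lambda im: random_paste(im, rng)]
    return rng.choice(ops)(img)
```

Each call corrupts the key features differently, which is exactly what prevents the local model from memorizing a single outlier pattern.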

3.5. ADVERSARIAL OUTLIER ENHANCEMENT

Adversarial training (Goodfellow et al., 2015; Kurakin et al., 2016) has been a popular approach to protect machine learning models from malicious attacks. For example, Goodfellow et al. (2015) utilize the fast gradient sign method (FGSM) to generate adversarial samples such that the model outputs a wrong answer with high confidence. The adversarial samples are then used as a part of the training data to regularize the training. Inspired by adversarial training, instead of using FGSM to generate adversarial samples for robust training, we apply it to optimize the generated outliers. Specifically, suppose the client is training the model f with the outliers x′ generated by our data destruction method. We utilize FGSM to generate x′′ such that the model wrongly outputs x′′ as a seen sample with high confidence. Then, the enhanced outliers x′′ are used together with the generated outliers x′ as the unknown class to update the model. We call this method Adversarial Outlier Enhancement (AOE). Examples of the enhanced outliers are shown in Figure 3b. Comparing Figure 3b with Figure 3a, after AOE the outliers look more natural and resemble classes different from the training data (e.g., in the third row of Figure 3b, the third and eighth outliers look like digit "3", although they are generated from digit "2").
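The FGSM step of AOE can be sketched on a toy logistic model with an analytic gradient (our own illustration, not the paper's implementation; here p = σ(w·x + b) plays the role of the model's probability that x is a seen sample, the 5 steps echo Appendix B.1, and ε is an illustrative step size):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_enhance(x, w, b, eps=0.1, steps=5):
    """Push an outlier x toward being (wrongly) scored as "seen":
    descend the loss -log p with FGSM-style sign steps, where
    p = sigmoid(w.x + b) is the toy model's "seen" probability."""
    x = list(x)
    for _ in range(steps):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        # d(-log p)/dx_i = -(1 - p) * w_i for the logistic model
        grad = [-(1.0 - p) * wi for wi in w]
        # move against the sign of the gradient (FGSM)
        x = [xi - eps * (1 if g > 0 else -1 if g < 0 else 0)
             for xi, g in zip(x, grad)]
    return x

# enhance a destructed outlier so the model is more confident it is "seen"
w, b = [1.0, -2.0], 0.0
x0 = [0.0, 0.0]
x1 = fgsm_enhance(x0, w, b)
```

In FedOV the same idea is applied to the local network: the gradient pushes x′ toward the seen classes, yielding tighter outliers x′′ near the decision boundary.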

3.6. DISCUSSION

T-SNE Visualization As shown in Figure 2b and Figure 2c, DD can generate diverse outliers to help distinguish data from seen and unseen classes, and AOE can further decrease the margin between the outliers and the training data (i.e., classes 0 and 6) to learn a better classifier.

Extension of FedOV with Knowledge Distillation One shortcoming of FedOV is that the final model is an ensemble of local models, so its prediction and storage costs may be large, especially when the number of clients is large (e.g., the cross-device setting). Some existing approaches (Lin et al., 2020; Li et al., 2021c) apply knowledge distillation to compress an ensemble into a single model, and FedOV can be combined with such distillation in the same way.

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

Baselines We include one-shot FL algorithms as baselines, including close-set voting (Guha et al., 2019) and FedKT (Li et al., 2021c). We also compare FedOV with iterative FL algorithms including FedAvg (McMahan et al., 2016), FedProx (Li et al., 2020), FedNova (Wang et al., 2020a), and FedDF (Lin et al., 2020); we run them for a single round for a fair comparison. Note that FedKT and FedDF require a public dataset (or a synthetic dataset) for distillation. In each task, we use half of the test dataset as the public dataset for distillation for FedKT and FedDF and the remaining half for testing. Since the source code of FedSyn (Zhang et al., 2021) is not publicly available and we have included FedDF (which uses the same distillation approach as FedSyn) in our experiments, we do not compare it with FedOV.

Default setups By default, we follow FedAvg (McMahan et al., 2016) and other existing studies (Li et al., 2021c; b; Wang et al., 2020b) to use a simple CNN with 5 layers in our experiments. There are 10 clients. For local training, we run 200 local epochs for each client. We set the batch size to 64 and the learning rate to 0.001. For results with error bars, we run three experiments with different random seeds. Due to the page limit, we only present representative results in the main paper. For more experimental details and results, please refer to Appendix B.

4.2. AN OVERALL COMPARISON

We compare the accuracy of FedOV and the other baselines in Table 1. Our algorithm significantly outperforms the baseline algorithms with only one communication round. In many settings, FedOV achieves more than 10% higher accuracy than close-set voting. In extreme cases such as #C = 1, FedOV outperforms close-set voting by at least 30%. The iterative FL algorithms cannot achieve satisfactory accuracy when running for a single round.

4.4. COMBINING WITH KNOWLEDGE DISTILLATION

Assuming that there exist unlabeled public data on the server, FedOV can also be combined with knowledge distillation like FedDF (Lin et al., 2020) and FedKT (Li et al., 2021c). We call this variant Distilled FedOV and compare it with the other baselines. Following the FedKT experimental settings, we train 100 epochs for each client. For distillation, we run 100 epochs for the final student model. Following our default settings, we use the simple CNN model and SVHN.

Extension to Multiple Rounds After knowledge distillation, we can further train the distilled model with FL averaging algorithms (e.g., FedAvg, FedProx, etc.). We conduct experiments on MNIST with 10 clients and data partitioning p_k ∼ Dir(0.5). The results are shown in Figure 4. For FedOV_FedProx and FedKT_FedProx, we run Distilled FedOV and FedKT for the first round and use the resulting global model as the initialization for the later rounds with FedProx. As the figure shows, with the help of FedOV, the accuracy after the first round is already high. FedOV_FedProx then converges much faster than the other algorithms.

4.5. SCALABILITY

We test the scalability of FedOV by varying the number of clients. Due to the page limit, we only show results on CIFAR-10 in Table 4. For results on the other datasets, please refer to Appendix B.8. From the table, we observe that FedOV still achieves the best accuracy as the number of clients increases. Moreover, with the help of knowledge distillation, Distilled FedOV can outperform distilled close-set voting and the other iterative FL algorithms with the same storage and inference cost.

From the perspective of data sharing, Zhou et al. (2020) propose to perform dataset distillation and upload the distilled data to the server for centralized training. Kasturi et al. (2020) propose to fit the features in each client with some distribution; the server then generates fake data from these distributions. Both approaches raise additional privacy concerns due to the uploaded fake data or feature distributions. Besides, we cannot find source code for either method, so we do not compare FedOV with them. XorMixFL (Shin et al., 2020) proposes to use the exclusive OR operation (XOR) to encode and decode samples for data sharing. However, it assumes that all clients and the server have labelled samples of a global class, which is impractical in real-world applications. In our experiments, we adopt the setting of many existing studies (Li et al., 2020; Karimireddy et al., 2020; Wang et al., 2020a; Hsu et al., 2019; Li et al., 2021b), which cannot guarantee the above assumption. Therefore, we do not compare FedOV with XorMixFL in our experiments. We compare these one-shot FL algorithms in Table 5. None of the previous one-shot FL algorithms conducts experiments on both distribution-based and quantity-based label skews.

B.1 FURTHER DETAILS

We summarize the datasets in our experiments in Table 6. For PROSER, we choose β = 0.01 and γ = 1, according to the default trade-off parameter setting in the official code. For ease of implementation, we omit the extension to multiple dummy classifiers. For data destruction, the details of our current transformations are as follows. The core idea is to destroy the key features of the original image while keeping some scrappy features.

• Random resized crop: with scale range (0.1, 0.33). We choose a small portion so as not to contain the key features, and enlarge that portion to the original size.

• Gaussian blur: with a random kernel ranging from 1*3 to 5*9 and random σ ∈ (10, 100). We choose a large σ to blur out key features.

• Random erasing: with scale range (0.33, 0.5). We choose a large portion to erase in order to spoil key features.

• Random paste: we randomly paste half of the image to another place.

• Random swap: we swap the left and right sides, or the upper and lower sides.

• Random rotation: we randomly rotate two square portions of the image.

For adversarial learning, we set 5 local steps with step size 0.002. For Distilled FedOV, the local training step is the same as FedOV. After the server collects all local models, it performs knowledge distillation based on the open-set voting results. Formally, denote the local models $f_1, ..., f_N$, where each model outputs $c+1$ scores and the last score is for the class "unknown". The student model $f_s$ outputs $c$ scores for the $c$ known classes. For an input $x$, we sum the output probabilities of the local models: $scores(x) = \sum_{i=1}^{N} \sigma(f_i(x))$, where $\sigma$ is the softmax function. The voting result is the first $c$ scores for the $c$ known classes, i.e., $v(x) = scores(x)_{0,1,...,c-1}$. The normalized probability $v_n(x) = v(x)/\|v(x)\|_1$ is used as the target to distill the student model. Therefore, the distillation loss is $L_{dis} = KL(\sigma(f_s(x)), v_n(x))$, where $KL$ is the Kullback-Leibler divergence.
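The distillation target above can be sketched as follows (a pure-Python stand-in operating on precomputed logits; in practice the probabilities σ(f_i(x)) come from the trained local networks, and the KL(p, q) convention here follows the formula as written):

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]

def voting_target(local_logits, c):
    """v_n(x): sum the per-model softmax over c + 1 channels, drop the
    "unknown" channel c, then L1-normalize the remaining c scores."""
    scores = [0.0] * (c + 1)
    for logits in local_logits:
        for i, p in enumerate(softmax(logits)):
            scores[i] += p
    v = scores[:c]
    s = sum(v)
    return [x / s for x in v]

def kl_div(p_student, v_n, eps=1e-12):
    # L_dis = KL(sigma(f_s(x)), v_n(x))
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_student, v_n))
```

The student is then trained to minimize `kl_div` between its softmax output and `voting_target` on the public unlabelled dataset.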
By default, we use the first half of the test set as the public unlabelled dataset for knowledge distillation on the server and then test the distilled model on the second half of the test set. We use the Adam optimizer with learning rate 0.001 and train 100 epochs on the public unlabelled dataset for the distillation process. For FedProx, we tune the hyper-parameter µ ∈ {0.001, 0.01, 0.1, 1}.

Our simple CNN contains two 5*5 convolution layers, each followed by a 2*2 max pooling layer. The first convolution layer has 6 channels and the second has 16 channels. They are followed by two fully-connected layers with 120 and 84 neurons respectively. We use ReLU as the activation function between layers. All experiments are conducted on a single 3090 GPU. We compare the computing time of different algorithms in Table 7.

In this section, we verify the effectiveness of random data destruction from a pool of transformations by comparing it with using a single transformation. We experiment on the CIFAR-10 dataset, and all algorithms are open-set voting containing the same PROSER loss; the only differences are their data destruction strategies. Results are shown in Table 10. No single transformation can reach the accuracy of random transformations from the data destruction set.

B.4 TOP-K CONFIDENCE VOTING

A possible variant of our framework is to select the top k most confident models for voting. Specifically, after getting the predictions of all models, the server sorts the predictions by the probability of the "unknown" channel. We then sum up the k predictions with the lowest "unknown" probability (i.e., the top k most confident) while discarding all other predictions. The hyper-parameter k can be tuned for different tasks. We visualize the voting accuracy versus k for experiments with VGG-9 on CIFAR-10 in Figure 5. As we can see, the best k differs across label skews. For the extreme #C = 2 case, k = 1 seems best; for the slight skew p_k ∼ Dir(0.5), k = 10 is best. Under #C = 2, each model is trained on only two classes, so it is better to follow the most confident expert, since the others are more likely to make wrong predictions. Under p_k ∼ Dir(0.5), models are trained on more classes and are generally more capable, so considering all models for prediction becomes better. By tuning k for different settings, the results of the best top-k confidence voting are shown in Table 11. For close-set voting, we report the accuracy of counting votes from all 10 clients, since all clients are equal. For open-set voting, we report the best accuracy among counting votes from the top 1, 2, 3, ..., 10 confident clients. We can draw similar conclusions as in Section 4.3.

Besides VGG-9, we also experiment with ResNet-50 to verify the effectiveness of FedOV, and we use CIFAR-100 to test on a more complicated dataset. Results are shown in Table 14. We omit the baselines of one-round FedAvg, FedProx, SCAFFOLD and FedNova since their one-shot accuracy is much lower (see Tables 1, 4 and 17). FedKT trains multiple models in each client and has huge computation and storage costs when using ResNet-50, so we omit it as well; in fact, in the FedKT paper (Li et al., 2021c), the authors do not experiment with heavy models like ResNet-50 either. Note that ResNet-50 has batch normalization layers.
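The top-k confidence voting variant can be sketched as follows (a minimal illustration over precomputed logits, with the last channel standing for "unknown"; the function names are ours):

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]

def top_k_vote(all_logits, n_classes, k):
    """Keep the k predictions with the lowest "unknown" probability
    (the k most confident models), sum them, drop the unknown channel,
    and return the argmax known class."""
    probs = [softmax(l) for l in all_logits]
    confident = sorted(probs, key=lambda p: p[n_classes])[:k]
    scores = [sum(p[j] for p in confident) for j in range(n_classes)]
    return max(range(n_classes), key=lambda j: scores[j])
```

With k equal to the number of clients this reduces to plain open-set voting; with k = 1 only the single most confident expert decides, which suits extreme skews such as #C = 2.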
Therefore, in each training batch, we have to mix training data and generated outliers. Previously, we computed a batch of training data and another batch of generated outliers separately; for ResNet-50, this can cause problems due to batch normalization. To speed up computation, we only use the data destruction strategy for FedOV in these experiments. The experimental results show that FedOV outperforms close-set voting and FedDF on CIFAR-100 with the ResNet-50 model.

An existing study proposed CutPaste (Li et al., 2021a) for the outlier detection task, which generates outliers by applying operations on the training data. However, CutPaste contains limited operations, and these operations cannot effectively corrupt the original features. In this section, we compare our data destruction with CutPaste (Li et al., 2021a) and explain why we do not include CutPaste in our main experiments. First, we show the outliers generated by CutPaste in Figure 6; they are less diverse than those of our data destruction in Figure 3a. Next, we show some experimental results. We experiment with the simple CNN model under the CIFAR-10 setting. "Mixture" means that for each image, we augment using our method and CutPaste each with 50% chance. Results are shown in Table 15. Although in some cases CutPaste or the mixture slightly outperforms our method, in other cases our method greatly outperforms them; therefore, adding CutPaste can hardly bring significant improvement. In conclusion, our approach generates more diverse outliers and achieves better accuracy compared with CutPaste.

For the setting where each client only has one class, FedAwS (Yu et al., 2020) proposes a spreadout regularization to push the embeddings of the classes apart from each other to avoid all inputs collapsing to a single point. FedUV (Hosseini et al., 2021) argues that FedAwS leaks the sensitive class embedding information to the server.
The authors propose to use error-correcting codes to protect the embeddings while achieving similar accuracy to FedAwS. In the FedUV paper (Hosseini et al., 2021), both FedUV and FedAwS are compared with FedAvg with the regular softmax cross-entropy loss on user verification tasks. The authors conclude that FedAvg with the regular softmax cross-entropy loss achieves the best accuracy in most cases; however, regular FedAvg leaks class embeddings to the server and all other clients. FedUV focuses more on privacy protection and has lower accuracy than FedAvg, while we focus on accuracy and FedOV significantly outperforms FedAvg; therefore, we do not compare with FedUV. There are also works specific to the real-world context of #C=1, such as federated face recognition (Liu et al., 2022). However, it focuses on personalized FL, and each client can access a public dataset to assist training, which differs from our settings; therefore, we do not compare with it. Next, we compare FedOV with FedAwS in the #C=1 setting. We use the default k = 10 from the FedAwS paper; since we only have 10 clients, this is equivalent to calculating the distance to all other class embeddings. We tune the best λ ∈ {10, 100} according to the authors' suggestions, and use the same squared hinge loss with hyper-parameter 0.9 as the original paper.

Our two techniques DD and AOE can also be extended to centralized OSR training to augment the outliers. In this section, we explore how they perform in centralized OSR experimental settings, which clarifies why DD and AOE are more suitable and more important for FL label skew settings. We follow the common setting for OSR algorithm evaluation: for the MNIST, Fashion-MNIST, CIFAR-10 and SVHN datasets, we regard the first 2, 4, 6 and 8 classes as known during training, while all 10 classes appear in the test set. We compare PROSER+DD+AOE with PROSER, running 200 epochs each. The model is the simple CNN. We show the results in Table 20.
As we can see, under the OSR settings, adding our DD and AOE techniques also brings improvement over the state-of-the-art PROSER algorithm in most cases. When there are fewer known classes, the average improvement brought by our techniques is larger. In the experimental settings of many OSR papers (Zhou et al., 2021; Neal et al., 2018; Perera et al., 2020; Yoshihashi et al., 2019), the OSR model is trained on at least 4 known classes; the scenario of a limited number of classes (such as 2 classes) is not studied. For PROSER, we can see a clear trend that the accuracy decreases as the number of known classes decreases. Since PROSER generates outliers between classes, when the number of known classes is limited, the generated outliers are not diverse enough to train a good OSR model. Therefore, PROSER is inapplicable to scenarios with limited known classes. Under FL label skews, some parties may have limited data for some labels due to privacy regulations and data distribution heterogeneity. Therefore, some clients may have a limited number of known classes, e.g., only one or two. Since there are multiple clients, it is difficult to ensure that all clients have at least a certain number of classes. Such scenarios are different from typical centralized OSR experimental settings and pose great challenges for PROSER, as we also show in Figure 2. Our two techniques, DD and AOE, can tackle these challenges in the scenario of a limited number of classes. There is no clear relation between the accuracy after adding our two techniques and the number of known classes, because the task difficulty is not linear with respect to the number of known classes. More known classes can be both a benefit and a challenge: the benefit is that more examples of diverse classes help generate more diverse in-between outliers for better outlier detection; the challenge is that the task is more complicated, as the model has to classify more seen classes.
In conclusion, although DD and AOE can be extended to centralized OSR settings, they bring more significant improvements than PROSER and are more essential in FL with label skews, where the number of known classes per client can be very limited.
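As discussed above, PROSER synthesizes outliers in between known classes, which requires at least two known classes. The following minimal NumPy sketch illustrates this idea and its failure mode; the mixing range and sample counts are illustrative choices, not PROSER's actual hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def between_class_outliers(feats, labels, n=4, low=0.3, high=0.7):
    """Illustrative sketch of PROSER-style outlier synthesis: mix
    samples (or features) from two *different* known classes. With a
    single known class there is no valid pair, so no outlier can be
    generated -- the failure mode under extreme label skew."""
    classes = np.unique(labels)
    if len(classes) < 2:
        return np.empty((0, feats.shape[1]))
    mixed = []
    for _ in range(n):
        a, b = rng.choice(classes, size=2, replace=False)
        xa = feats[rng.choice(np.flatnonzero(labels == a))]
        xb = feats[rng.choice(np.flatnonzero(labels == b))]
        lam = rng.uniform(low, high)
        mixed.append(lam * xa + (1.0 - lam) * xb)
    return np.stack(mixed)

feats = rng.random((10, 8))                                 # toy features
two_class = between_class_outliers(feats, np.array([0]*5 + [6]*5))
one_class = between_class_outliers(feats, np.zeros(10, dtype=int))
```

With two known classes the sketch produces mixed outliers; with one known class it produces nothing, matching the limited-class scenario where PROSER breaks down.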



https://github.com/zhoudw-zdw/CVPR21-Proser



Open-set voting framework ('u' denotes unknown class)

Figure 1: Comparison between traditional close-set voting and our open-set voting framework. We allocate the data of classes 0, 1, and 2 to three clients by the Dirichlet distribution Dir_3(0.02).

Figure 2: t-SNE visualization of the features extracted by local models trained with different methods. During training, the client only has data from classes 0 and 6. In each sub-figure, we plot the representations of the data from the seen classes 0 and 6, the unseen classes, and the outliers generated during training. The black lines are possible classification boundaries.

(a) Outliers generated by DD. (b) Outliers after AOE.

Figure 3: Generated outlier examples on the MNIST, Fashion-MNIST and CIFAR-10 datasets. In Figure 3a, the first column contains the real samples and the other columns are generated outliers of that sample. In Figure 3b, all images are outliers generated by adversarial outlier enhancement.
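The data destruction (DD) corruptions behind outliers like those in Figure 3a can be illustrated with a short sketch. This is not the paper's actual corruption pool: the two operations below (block shuffling and heavy Gaussian noise) are illustrative stand-ins, assuming images stored as float arrays in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_blocks(img, grid=4):
    """Split the image into a grid of blocks and shuffle the blocks,
    destroying the global structure while keeping local textures."""
    h, w = img.shape[:2]
    bh, bw = h // grid, w // grid
    coords = [(i, j) for i in range(grid) for j in range(grid)]
    blocks = [img[i*bh:(i+1)*bh, j*bw:(j+1)*bw].copy() for i, j in coords]
    order = rng.permutation(len(blocks))
    out = img.copy()
    for (i, j), k in zip(coords, order):
        out[i*bh:(i+1)*bh, j*bw:(j+1)*bw] = blocks[k]
    return out

def heavy_noise(img, scale=0.5):
    """Drown the image in strong Gaussian noise."""
    return np.clip(img + rng.normal(0.0, scale, img.shape), 0.0, 1.0)

def destroy(img):
    """Apply one randomly chosen corruption to produce an outlier."""
    op = shuffle_blocks if rng.random() < 0.5 else heavy_noise
    return op(img)

# Each training batch mixes real samples (with their true labels) and
# corrupted copies labeled as the extra "unknown" class.
batch = rng.random((8, 32, 32, 3))
outliers = np.stack([destroy(x) for x in batch])
```

Drawing a different corruption per sample is what makes the generated outliers diverse, in contrast to a fixed, limited operation set.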

utilize knowledge distillation to distill the knowledge from multiple local models into a global model with the help of a public or synthetic dataset. Our approach is compatible with the above methods. With knowledge distillation, we can transform the ensemble of local models into a single global model, which significantly reduces the storage and prediction costs of the final model. Moreover, by using the final model as the initialization for iterative federated learning algorithms like FedAvg (McMahan et al., 2016), we can conduct multi-round federated learning to further improve the model. As shown in Section 4.4, FedOV can be effectively combined with the existing approaches to increase their accuracy and communication efficiency. We conduct experiments on the MNIST, Fashion-MNIST, CIFAR-10 and SVHN datasets. We use the data partitioning methods in Li et al. (2021b) to simulate different label skews. Specifically, we try two kinds of partition: 1) #C = k: each client only has data from k classes. We first assign k random class IDs to each client, and then randomly and equally divide the samples of each class among their assigned clients; 2) p_k ~ Dir(β): for each class k, we sample p_k ~ Dir_N(β) from a Dirichlet distribution and distribute a p_{k,j} proportion of the samples of class k to client j.
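The p_k ~ Dir(β) partition described above can be sketched in a few lines. This is a minimal version of the common implementation in the benchmark of Li et al. (2021b); exact rounding and tie-breaking details may differ from theirs.

```python
import numpy as np

rng = np.random.default_rng(42)

def dirichlet_partition(labels, n_clients, beta):
    """Sketch of the p_k ~ Dir(beta) label-skew partition: for every
    class, draw client proportions from Dir(beta) and split that
    class's sample indices accordingly. Smaller beta -> stronger skew,
    i.e., each class concentrates on fewer clients."""
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet([beta] * n_clients)
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for j, part in enumerate(np.split(idx, cuts)):
            client_idx[j].extend(part.tolist())
    return client_idx

labels = rng.integers(0, 10, size=1000)   # toy label vector, 10 classes
parts = dirichlet_partition(labels, n_clients=10, beta=0.1)
```

Every sample index is assigned to exactly one client, and with β = 0.1 most clients end up with data from only a few classes.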

Figure 4: Extension to multiple rounds on MNIST.

Figure 5: Voting accuracy on different k of VGG-9 on CIFAR-10.

Figure 6: Outliers generated by CutPaste. The first column contains real samples and the other columns are CutPaste outliers.

Figure 7: Comparing FedOV (one-shot FL accuracy) with baseline algorithms running multiple rounds. We experiment on six different partitions of the CIFAR-10 dataset.


Comparison with close-set voting and various FL algorithms in one round.

Experimental results of different FL voting strategies with the simple CNN model. We show the effect of each component of FedOV, including open-set voting (PROSER), data destruction (DD), and adversarial outlier enhancement (AOE). Specifically, we add one component at a time, and the results are shown in Table 2. From the table, we can observe that FedOV with all three components achieves the highest accuracy in most settings. Simply applying PROSER in FL may not increase the accuracy compared with close-set voting (e.g., CIFAR-10 with #C = 2).
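The core idea of the open-set voting component can be shown with a toy sketch. Note this is a simplified reading of the mechanism, not FedOV's exact aggregation rule: each local model outputs a distribution over the global classes plus an extra "unknown" column, so a client that flags an input as unknown contributes little to the vote.

```python
import numpy as np

def open_set_vote(client_probs):
    """Simplified open-set voting: sum each client's probability mass
    over the known (global) classes, ignoring the final 'unknown'
    column, and predict the class with the largest total score."""
    probs = np.asarray(client_probs, dtype=float)
    known = probs[:, :-1]                  # drop the unknown column
    return int(np.argmax(known.sum(axis=0)))

# Client 1 only trained on classes 0/1 and flags the input as unknown;
# client 2 has seen class 2 and recognizes the input confidently.
c1 = [0.05, 0.05, 0.00, 0.90]              # classes 0, 1, 2 + unknown
c2 = [0.05, 0.05, 0.85, 0.05]
pred = open_set_vote([c1, c2])             # class 2 wins the vote
```

Under close-set voting, client 1 would be forced to vote for one of its seen classes 0 or 1, polluting the tally; the unknown column lets it abstain.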

Comparing distilled FedOV with the other baselines. The partition is p_k ~ Dir(0.5). Note that the settings are different from the FedKT paper (Li et al., 2021c); therefore, our reported accuracy differs from that paper. Further details are elaborated in Appendix B.1. We run distilled FedOV three times, and the results are shown in Table 3. We can observe that distilled FedOV achieves higher accuracy than FedKT and the other iterative FL baselines with the same size of final model, which further verifies the effectiveness of our open-set voting framework.

Experimental results of different numbers of clients on CIFAR-10 with the simple CNN model.

Comparison among existing one-shot FL algorithms. For example, since the generated counterfactual images are all blurry, the classifier may classify clear images as inliers and blurry images as outliers. When it encounters clear images of other digits during testing, it may classify these samples into its known class. Finally, our current data destruction framework only works for vision tasks. How to augment diverse outliers for tabular or language datasets deserves further research.

A.4 A BETTER BENCHMARK FOR OSR ALGORITHMS

From the perspective of open-set recognition, we find that FL voting can serve as a more realistic, comprehensive and challenging benchmark for open-set recognition algorithms. Previous open-set recognition experiments mainly divide the classes into totally known and totally unknown, i.e., the training set has either the full data of a class or no data of a class. However, in reality, a grey area exists where a model sees only a few training samples of some class. Moreover, previous open-set recognition or outlier detection experiments use ROC-AUC as the metric, which avoids transforming the outlier score into a probability: ROC-AUC only considers the relative scores between the known and unknown classes. However, it is essential to output a reasonable unknown probability in real-world applications. Our FL open-set voting benchmark includes these challenges. Its data partitioning strategies are based on Li et al. (2021b) and can be easily adjusted to various settings.

Basic information of the datasets we use.

Running time of different algorithms with the simple CNN on the CIFAR-10 dataset. There are 10 clients, and each client runs 200 local epochs with only one communication round. Our device is a single 3090 GPU. Regarding the additional time cost of data destruction (DD) and adversarial outlier enhancement (AOE), they roughly double the computation during local training in our experiments, depending on the dataset. The experiments shown in Table 8 use the partition p_k ~ Dir(0.5) over 10 clients.

Time for the first client to finish local training of 200 epochs. Our device is a single 3090 GPU.

Experimental results of different FL voting strategies with the simple CNN model. We repeat all experiments three times.

Experimental results of using random data destruction from a pool of transformations, compared with using a single transformation.

p_k ~ Dir(0.5): 67.6%±0.3%, 66.0%±1.1%, 66.4%±1.2%, 66.5%±0.8%, 66.6%±0.7%, 67.8%±0.8%, 68.0%±0.9%
p_k ~ Dir(0.1): 61.3%±1.0%, 56.9%±3.6%, 56.1%±2.7%, 54.9%±2.3%, 56.4%±2.6%, 58.2%±2.9%, 57.9%±2.0%

Experimental results of different FL voting strategies with the simple CNN model. We report the best top-k confidence voting accuracy. We repeat all experiments three times. Results are shown in Tables 12 and 13. For the VGG-9 experiments, our method still works.

Experimental results of different FL voting strategies with the VGG-9 model. We repeat all experiments three times.

Experimental results of different FL voting strategies with the VGG-9 model. We report the best top-k confidence voting accuracy. We repeat all experiments three times.

Experimental results on ResNet-50. We run 100 local epochs for each client and train the student model for 100 epochs during distillation.

Experimental results of CutPaste and our augmentation. We report the accuracy of counting votes from all 10 clients. We also conduct similar experiments based on the outlier detection metric. Our datasets include MNIST, Fashion-MNIST and CIFAR-10. We use one class during training and all classes of the same dataset during testing. The metric is the ROC-AUC of the outlier probability. We run 200 local epochs with batch size 64 and learning rate 0.001. For ROC-AUC, we report the average of the last 10 epochs. Results are shown in Table 16. Our augmentation method still outperforms CutPaste and the mixture.
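The ROC-AUC of the outlier probability used above can be computed directly from ranks, without any thresholding. A minimal self-contained sketch (equivalent to the standard definition, e.g. as implemented by scikit-learn's `roc_auc_score`):

```python
import numpy as np

def roc_auc(scores, is_outlier):
    """Rank-based ROC-AUC of outlier scores: the probability that a
    randomly chosen outlier receives a higher score than a randomly
    chosen inlier (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(is_outlier, dtype=bool)
    pos, neg = scores[mask], scores[~mask]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # outliers ranked first
chance = roc_auc([0.1, 0.9, 0.2, 0.8], [1, 1, 0, 0])   # half the pairs wrong
```

Because only the relative ordering of scores matters, ROC-AUC sidesteps calibrating the outlier probability, which is exactly the limitation discussed in Appendix A.4.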

Experimental results of CutPaste and our augmentation under the common outlier detection setting.

In this section, we show the results of scalability experiments on SVHN, Fashion-MNIST and MNIST. Except for the datasets, the experimental settings are the same as in the main paper. Results are shown in Table 17. FedOV still scales well on SVHN, Fashion-MNIST and MNIST. Distilled FedOV can also outperform distilled close-set voting, FedAvg, FedProx and FedNova with a single communication round. Both FedDF and distilled close-set voting can outperform FedAvg. Note that the ensemble of FedDF is based on averaged logits, while distilled close-set voting uses averaged probabilities, i.e., the softmax of the logits. After softmax, each probability is bounded in [0, 1], whereas in FedDF the logits can be arbitrary real numbers. If a poor local model outputs extreme logits, it can greatly influence the FedDF ensemble. This can probably explain why FedDF performs differently from distilled close-set voting, even though the two approaches are similar. The effect is especially visible with 80 clients, where it is more likely that some local models perform very poorly, and FedDF performs worse than distilled close-set voting.
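The difference between averaging logits (FedDF-style) and averaging probabilities (distilled close-set voting) can be demonstrated with a toy example; the logit values below are made up purely for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

# Two reasonable local models favor class 0; one poorly trained model
# emits extreme logits for class 2.
logits = np.array([[  2.0,   1.0,  0.0],
                   [  2.0,   1.0,  0.0],
                   [-50.0, -50.0, 60.0]])

# FedDF-style ensemble: average the raw logits, then softmax.
logit_avg = softmax(logits.mean(axis=0))
# Distilled-close-set-voting-style ensemble: average the probabilities.
prob_avg = np.mean([softmax(z) for z in logits], axis=0)

# The unbounded logits let the bad model dominate the logit average,
# while its probabilities are capped at a 1/3 share of the vote.
feddf_pred = int(np.argmax(logit_avg))     # class 2 (bad model wins)
voting_pred = int(np.argmax(prob_avg))     # class 0 (majority wins)
```

Bounding each model's contribution to [0, 1] is what makes probability averaging more robust when a few clients train very poor models, matching the 80-client observation above.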

Experimental results of different numbers of clients with the simple CNN model.

Comparison between distilled FedOV and FedDF under different levels of label skews.

The FedAwS paper does not state the number of local epochs. In our experiments, one local epoch is enough for convergence under #C=1; in subsequent epochs, the loss is almost zero and the accuracy does not improve. By default, for both FedAwS and FedAvg, we train one local epoch with learning rate 0.001. Results are shown in Table 19. As we can see, FedOV significantly outperforms FedAwS and FedAvg in one round. After one round, FedAwS and FedAvg achieve similar accuracy.

Comparison between FedOV, FedAwS, and FedAvg under the #C=1 partition. The model is a simple CNN. We run each setting three times.

A.11 PERFORMANCE OF DD AND AOE ON CENTRALIZED OSR EXPERIMENTAL SETTINGS

Test accuracy comparison between PROSER+DD+AOE and PROSER under centralized OSR settings. "Unknown" is regarded as a new class besides the seen classes. We run each setting three times.

ACKNOWLEDGEMENTS

This research is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-RP-2020-018). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the National Research Foundation, Singapore.

