TACKLING IMBALANCED CLASS IN FEDERATED LEARNING VIA CLASS DISTRIBUTION ESTIMATION

Anonymous

Abstract

Federated Learning (FL) has become a popular machine learning paradigm due to its applicability to large-scale distributed systems and its privacy-preserving property. However, in real-world applications, the presence of class imbalance, especially the mismatch between local and global class distributions, greatly degrades the performance of FL. Moreover, due to the privacy constraint, the class distribution information of clients cannot be accessed directly. To tackle the class imbalance issue under the FL setting, a novel algorithm, FedRE, is proposed in this paper. We propose a new class distribution estimation method for the FedRE algorithm, which requires no extra client data information and thus raises no privacy concern. Both experimental results and theoretical analysis are provided to support the validity of our distribution estimation method. The proposed algorithm is verified with several experiments, including different datasets in the presence of class imbalance and local-global distribution mismatch. The experimental results show that FedRE is effective and outperforms other related methods in terms of both overall and minority-class classification accuracy.

1. INTRODUCTION

Federated Learning (FL) was first proposed by McMahan et al. (2017), who were developing a next-word prediction application for mobile keyboards. It enables multiple clients to collaboratively learn a machine learning model without sharing their locally stored raw data (Li et al., 2020a). This property greatly reduces the communication cost and preserves the privacy of clients, which has made FL an active research direction, not only in the machine learning community but also in a variety of engineering applications, including communication (Wang et al., 2022; Niknam et al., 2020; Mills et al., 2019), edge computing (Zhang et al., 2021a; Wang et al., 2019a;b), and energy engineering (Saputra et al., 2019; Hamdi et al., 2021; Cheng et al., 2022).

Standard FL consists of four major steps: client selection, broadcast, client computation, and aggregation (Kairouz et al., 2021). In each global iteration, the central server first selects a subset of clients and broadcasts the global model to them. After receiving the global model, the selected clients perform model updates based on their local datasets, using the global model as the initial condition, and then upload their updates to the server. As the final step, the server aggregates the collected information to update the global model, and then starts a new iteration.

One of the most difficult challenges in the FL framework is the class imbalance issue. Class imbalance means that the data distribution among classes is not uniform: the majority of data samples may belong to certain classes, while other minority classes may only have a small amount of data. Class imbalance results in low classification accuracy on minority classes and also slows down training. In the literature, several methods have been proposed to resolve the class imbalance issue in the centralized machine learning setting.
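The four steps above can be sketched in a few lines. The following is a minimal, self-contained simulation of one communication round; the least-squares local objective and all function names are illustrative assumptions, not part of any particular FL system:

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=5):
    """Client computation: gradient steps on a local least-squares
    objective, starting from the broadcast global model."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients, frac=0.5, rng=None):
    """One round: select a subset of clients, broadcast the global
    model, collect local updates, and aggregate them by
    dataset-size-weighted averaging."""
    rng = rng if rng is not None else np.random.default_rng(0)
    m = max(1, int(frac * len(clients)))
    selected = rng.choice(len(clients), size=m, replace=False)
    updates, sizes = [], []
    for i in selected:
        X, y = clients[i]
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=sizes)
```

Running `fedavg_round` repeatedly drives the global model toward a solution consistent with all local datasets, without any client sharing its raw data with the server.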
In general, these methods can be categorized as data-level methods, algorithm-level methods, and hybrid methods (Johnson & Khoshgoftaar, 2019). At the data level, Jo & Japkowicz (2004) proposed a cluster-based sampling scheme to tackle the class imbalance issue. At the algorithm level, Ling & Sheng (2008) proposed cost-sensitive learning to improve classification performance on minority classes. As a hybrid method, Sun et al. (2007) integrated both sampling techniques and cost-sensitive learning and showed a significant performance boost in most cases. However, in FL, since all training data are distributed, stored locally, and not exchangeable, it is infeasible to apply data-level methods. Besides, due to the inconsistency between local and global class imbalance, algorithm-level methods such as cost-sensitive learning are not effective and may even impose a negative effect on the performance of the global model (Wang et al., 2021). Thus, methods that have shown great success on the class imbalance issue in centralized machine learning cannot be applied directly in FL, and new algorithms have to be designed under the constraints of the FL scheme.

In this work, the class imbalance issue in FL is addressed, especially global class imbalance and the mismatch between local and global class distributions. To tackle these challenges while preserving privacy, a new FL algorithm is proposed. Due to the privacy constraint, the local dataset and local class distribution of each client are not accessible. Thus, an estimation method for the class distribution is developed, requiring no extra client dataset information. Based on the estimated class distribution, the loss re-weighting method can be applied to handle the global class imbalance. The proposed method requires no additional client information, so privacy safety can be guaranteed.
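To make loss re-weighting concrete, a common instantiation of cost-sensitive learning assigns each class a weight inversely proportional to its frequency. The sketch below uses this standard inverse-frequency heuristic for illustration only; it is not the specific weighting that FedRE derives from its distribution estimate:

```python
import numpy as np

def class_weights(counts):
    """Inverse-frequency weights: class c with n_c samples out of n
    total (K classes) gets weight n / (K * n_c), so minority classes
    contribute more to the loss."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(probs, labels, weights):
    """Cross-entropy where each sample's loss is scaled by the
    weight of its true class."""
    per_sample = -np.log(probs[np.arange(len(labels)), labels])
    return np.mean(weights[labels] * per_sample)
```

With counts `[90, 10]`, the minority class receives weight 5.0 and the majority class roughly 0.56, so misclassifying minority samples is penalized about nine times more heavily.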
Besides, the experimental results show that the proposed method achieves a significant improvement in the classification accuracy of minority classes under the class imbalance scenario.

Contribution. In summary, the contributions of this work are as follows.

1. The proposed estimation method can estimate the class distribution without additional client information, so privacy safety can still be guaranteed. Moreover, the estimation method does not require much extra computation, so the overall efficiency of the FL algorithm is not degraded. A theoretical analysis is also provided to support the validity of the proposed distribution estimation method.

2. The proposed FL algorithm based on class distribution estimation achieves a significant improvement in handling the class imbalance issue in FL, and it is verified by experiments with different heterogeneity levels and different datasets.
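Experiments with controllable heterogeneity levels are commonly set up by partitioning a dataset's labels across clients with a Dirichlet distribution, where a smaller concentration parameter yields more skewed local class distributions. The sketch below shows this widely used protocol as an assumed setup; the paper's exact partitioning scheme is not specified here:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with label-distribution
    skew: for each class, draw per-client proportions from
    Dirichlet(alpha); smaller alpha -> more heterogeneous clients."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        props = rng.dirichlet(alpha * np.ones(n_clients))
        # Convert proportions into split points over this class's samples.
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cid, chunk in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(chunk.tolist())
    return client_idx
```

Every sample is assigned to exactly one client, so the union of the returned index lists is a disjoint cover of the dataset; sweeping `alpha` (e.g. 0.1 vs. 10) produces the different heterogeneity levels used in such evaluations.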

2. RELATED WORK

2.1. FEDERATED LEARNING

Due to the development of edge devices and the increasing popularity of mobile devices, distributed machine learning has become one of the most active directions of machine learning research. However, common mobile devices such as smartphones and wearable electronics have limited computation and communication power. Moreover, these devices contain personal information that is private and not exchangeable. Thus, a new distributed machine learning paradigm, known as Federated Learning (FL), has to be considered to overcome the challenges of communication and computation cost, data heterogeneity, and privacy.

To overcome these challenges, FedAvg (McMahan et al., 2017) was first proposed, addressing the communication efficiency problem. Since then, much research has been conducted to resolve challenges in FL. Li et al. (2020b) proposed the algorithm FedProx to tackle the data heterogeneity issue by introducing a proximal term in the local objective function, and provided a theoretical analysis of its convergence. To improve the convergence speed and reduce the communication cost, Reddi et al. (2020) applied different adaptive learning techniques in the aggregation step. Wang et al. (2020b) investigated the fundamental cause of heterogeneity and proposed a normalized averaging algorithm, FedNova, to improve FL performance. Li et al. (2021) applied local batch normalization to tackle the feature-shift non-iid issue, which is another type of data heterogeneity. Some recent works in the literature also improve FL algorithms by proposing different client selection schemes (Nishio & Yonetani, 2019; Ribero & Vikalo, 2020; Balakrishnan et al., 2021).

2.2. CLASS IMBALANCE IN FEDERATED LEARNING

One of the most difficult yet important challenges in FL is class imbalance. The class imbalance issue can be categorized as local imbalance and global imbalance (Wang et al., 2021). The works mentioned previously address data heterogeneity, which belongs to local imbalance. However, as stated in Section 1, due to the mismatch between local and global distributions, handling local imbalance alone is not enough. To deal with the class imbalance issue, the algorithm Astraea

