TOWARDS SCALABLE AND NON-IID ROBUST HIERARCHICAL FEDERATED LEARNING VIA LABEL-DRIVEN KNOWLEDGE AGGREGATOR

Abstract

In real-world applications, Federated Learning (FL) faces two challenges: (1) scalability, especially when applied to massive IoT networks, and (2) robustness against environments with heterogeneous data. To address the first challenge, we design a novel FL framework named Full-stack FL (F2L). Specifically, F2L adopts a hierarchical network architecture, so the FL network can be extended without reconstructing the whole network system. Moreover, leveraging the advantages of this hierarchical design, we propose a new label-driven knowledge distillation (LKD) technique at the global server to address the second challenge. In contrast to current knowledge distillation techniques, LKD trains a student model that absorbs the most reliable knowledge from all teacher models. Therefore, our proposed algorithm can effectively extract the knowledge of the regional data distributions (i.e., the regional aggregated models) to reduce the divergence between clients' models when the FL system operates on non-independent and identically distributed (non-IID) data. Extensive experimental results reveal that: (i) our F2L method significantly improves overall FL efficiency across all global distillations, and (ii) F2L rapidly achieves convergence once global distillation stages occur, rather than improving only gradually over each communication round.

1. INTRODUCTION

Recently, Federated Learning (FL) has emerged as a distributed learning methodology that enhances communication efficiency and ensures privacy compared with traditional centralized learning McMahan et al. (2017). However, the biggest challenge for this method is non-independent and identically distributed (non-IID) data across clients, which causes client models to diverge in unpredictable directions. Motivated by this, various works on handling non-IID data have been proposed in Li et al. (2020); Acar et al. (2021); Dinh et al. (2021a); Karimireddy et al. (2020); Wang et al. (2020); Zhu et al. (2021); Nguyen et al. (2022b). However, these works mainly rely on arbitrary configurations without a thorough understanding of the models' behaviors, yielding low-efficiency results. Aiming to fill this gap, we propose a new hierarchical FL framework grounded in information theory and a deeper observation of the models' behaviors; this framework can be realized for various FL systems with heterogeneous data. In addition, the hierarchical architecture makes the FL system more scalable, controllable, and accessible. Traditionally, whenever a new segment (i.e., a new group of clients) is integrated into the FL network, the entire network must be retrained from scratch. With the assistance of LKD, however, knowledge is progressively transferred during training without the information loss caused by empirical gradients biased towards the newly participating clients' datasets. The main contributions of this paper are summarized as follows.

(1) We show that the performance of conventional FL is unstable in heterogeneous environments with non-IID and unbalanced data by carefully analyzing the basics of Stochastic Gradient Descent (SGD).

(2) We propose a new multi-teacher distillation model, Label-Driven Knowledge Distillation (LKD), in which each teacher shares only the knowledge it is most certain about. In this way, the student model absorbs the most meaningful information from each teacher.

(3) To improve scalability and robustness against non-IID data in FL, we propose a new hierarchical FL framework, dubbed Full-stack Federated Learning (F2L). Moreover, to keep the computation cost at the global server under control, the F2L architecture integrates two aggregators at the global server: LKD and FedAvg. With this design, our framework performs robust training via LKD while the FL process is still divergent (i.e., at the start of training); once training reaches stable convergence, FedAvg is used instead to reduce the server's computational cost while retaining FL performance (a minimal sketch of this two-stage aggregation is given after this list).

(4) We theoretically analyze our LKD technique, compare its performance with conventional multi-teacher knowledge distillation (MTKD), and show in theory that our new technique always achieves better performance than MTKD.

(5) We validate the practicability of the proposed LKD and F2L via various experiments on different datasets and network settings. To show the efficiency of F2L in dealing with non-IID and unbalanced data, we provide a performance comparison; the results show that the proposed F2L architecture outperforms existing FL methods. In particular, our approach achieves accuracy comparable to FedAvg (McMahan et al. (2017)) and 7-20% higher accuracy in non-IID settings.
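To make the two-stage aggregation in contribution (3) concrete, the following is a minimal, illustrative sketch (in PyTorch-style Python) of one global round under the hierarchical design described above. The helper names (fedavg, label_driven_distill, f2l_global_round) and the convergence flag are hypothetical placeholders rather than the paper's actual implementation; in particular, the LKD step is only stubbed here.

import copy
from typing import Dict, List

import torch


def fedavg(state_dicts: List[Dict[str, torch.Tensor]],
           weights: List[float]) -> Dict[str, torch.Tensor]:
    """Standard FedAvg: weighted average of parameter tensors (weights sum to 1)."""
    avg = copy.deepcopy(state_dicts[0])
    for name in avg:
        avg[name] = sum(w * sd[name] for sd, w in zip(state_dicts, weights))
    return avg


def label_driven_distill(teachers: List[Dict[str, torch.Tensor]],
                         student: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Placeholder for the paper's LKD step: distill each regional (teacher)
    model's most certain, label-wise knowledge into the global (student) model."""
    raise NotImplementedError("the concrete LKD objective is part of the F2L design")


def f2l_global_round(regional_clients: List[List[Dict[str, torch.Tensor]]],
                     global_model: Dict[str, torch.Tensor],
                     stable: bool) -> Dict[str, torch.Tensor]:
    """One (hypothetical) global round of F2L.

    regional_clients: one list of client state_dicts per region.
    stable: True once the FL process has reached stable convergence.
    """
    # 1) Each regional server aggregates its own clients with FedAvg, so a new
    #    group of clients can be attached without rebuilding the whole network.
    regional_models = []
    for clients in regional_clients:
        uniform = [1.0 / len(clients)] * len(clients)
        regional_models.append(fedavg(clients, uniform))

    # 2) The global server fuses the regional models.
    if not stable:
        # Divergent phase: distill the regions' knowledge into the student model.
        return label_driven_distill(teachers=regional_models, student=global_model)
    # Stable phase: plain FedAvg keeps the server's computation cost low.
    uniform = [1.0 / len(regional_models)] * len(regional_models)
    return fedavg(regional_models, uniform)


# Toy usage (stable phase): two regions with two clients each, a one-layer model.
make = lambda: {"w": torch.randn(4, 4), "b": torch.randn(4)}
regions = [[make(), make()], [make(), make()]]
new_global = f2l_global_round(regions, global_model=make(), stable=True)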

2. RELATED WORK


2.1 FEDERATED LEARNING ON NON-IID DATA

To narrow the effects of weight divergence, several recent studies have focused on gradient regularization Li et al. (2020); Acar et al. (2021); Dinh et al. (2021a); Karimireddy et al. (2020); Wang et al. (2020); Zhu et al. (2021); Nguyen et al. (2022b). Using the same regularization concept, the authors in Li et al. (2020), Acar et al. (2021), and Dinh et al. (2021a) introduced FedProx, FedDyn, and FedU, respectively, where FedProx and FedDyn pull clients' models back towards the latest aggregated model, while FedU pulls the distributed clients' models towards one another. To steer each client's update direction close to the ideal server direction, the authors in Karimireddy et al. (2020) proposed SCAFFOLD, which adds a control variate to the model updates. Meanwhile, to prevent the aggregated model from following highly biased models, the authors in Wang et al. (2020) rolled out FedNova, which adds gradient-scaling terms to the model update function. Similar to Dinh et al. (2021a), the authors in Nguyen et al. (2022b) launched WAFL, which applies the Wasserstein metric to reduce the distance between local and global data distributions. However, all of these methods have notable technical limitations. For example, Wang et al. (2020) demonstrated that FedProx and FedDyn are ineffective in many cases where the pullback to the globally aggregated model is used. Meanwhile, FedU and WAFL share the limitation of imposing a huge communication burden, and FedU additionally requires solving a very complex, non-convex optimization problem. Regarding knowledge distillation for FL, only the work in Zhu et al. (2021) proposed a generative model of local users as an alternative data augmentation technique for FL. However, the major drawback of this model is that the training process at the server demands a huge data collection from all users, leading to ineffective communication. Motivated by this, we propose a new FL architecture that is expected to be more elegant, easier to implement, and much more straightforward. Unlike Dinh et al. (2021a); Acar et al. (2021); Karimireddy et al. (2020), we utilize the knowledge of the clients' models to extract good knowledge for the aggregation model, instead of using model parameters to reduce the physical distance between distributed models. As a result, our framework can flexibly handle both weight distance and probability distance, i.e., ∥p_k(y = c) − p(y = c)∥, in an efficient way (please refer to Appendix B).

2.2 MULTI-TEACHER KNOWLEDGE DISTILLATION

MTKD is an improved version of KD (presented in Appendix A.2), in which multiple teachers work cooperatively to build a student model. As shown in Zhu et al. (2018), every MTKD technique solves the following problem formulation:

$$\mathrm{P1}: \quad \min_{\omega_g}\; \mathcal{L}_{\mathrm{KL}} \;=\; \sum_{r \in \mathcal{R}} \sum_{l} p(l \mid X, \omega_r, T)\, \log \frac{p(l \mid X, \omega_r, T)}{p(l \mid X, \omega_g, T)},$$

where r ∈ R are the teachers' indices, ω_r and ω_g denote the parameters of teacher r and of the student, respectively, and T is the softmax temperature. By minimizing P1, the student p_g can attain knowledge from all teachers. However, MTKD suffers from several problems when extracting distilled knowledge from multiple teachers. In particular, the distillation process in MTKD is typically empirical, without understanding the teachers' knowledge (i.e., aggregating all KL divergences
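For concreteness, the following is a minimal sketch (in PyTorch-style Python, with illustrative names) of the plain MTKD objective in P1: the student's temperature-softened predictions are pulled toward every teacher's softened predictions by summing the per-teacher KL terms uniformly. This uniform sum over teachers is exactly what the proposed LKD is meant to replace with a label-driven selection of each teacher's most certain knowledge.

import torch
import torch.nn.functional as F


def mtkd_loss(student_logits: torch.Tensor,
              teacher_logits_list: list,
              T: float = 3.0) -> torch.Tensor:
    """Plain MTKD objective (P1): sum over teachers r of
    KL( p(l | X, w_r, T) || p(l | X, w_g, T) ), softened by temperature T."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)  # log p(l | X, w_g, T)
    loss = torch.zeros(())
    for teacher_logits in teacher_logits_list:
        p_teacher = F.softmax(teacher_logits / T, dim=-1)      # p(l | X, w_r, T)
        # KL(teacher || student), averaged over the batch; the T^2 factor is the
        # standard rescaling of gradient magnitudes in distillation.
        loss = loss + F.kl_div(log_p_student, p_teacher,
                               reduction="batchmean") * (T ** 2)
    return loss


# Toy usage: a batch of 8 samples, 10 classes, 3 regional teachers.
student = torch.randn(8, 10)
teachers = [torch.randn(8, 10) for _ in range(3)]
print(mtkd_loss(student, teachers).item())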

