ON THE EFFICACY OF SERVER-AIDED FEDERATED LEARNING AGAINST PARTIAL CLIENT PARTICIPATION

Anonymous

Abstract

Although federated learning (FL) has become a prevailing distributed learning framework in recent years thanks to its benefits in scalability and privacy, many significant challenges remain in FL system design. Notably, most existing works in the current FL literature assume either full client participation or uniformly distributed client participation. Unfortunately, this idealistic assumption rarely holds in practice: it has been frequently observed that some clients may never participate in FL training (aka partial/incomplete participation) due to a variety of system heterogeneity factors. To mitigate the impact of partial client participation, an increasingly popular approach in practical FL systems is the server-aided federated learning (SA-FL) framework, where the server is equipped with an auxiliary dataset. However, although SA-FL has been empirically shown to be effective in addressing the partial client participation problem, there remains a lack of theoretical understanding of SA-FL. Worse yet, even the ramifications of partial client participation in conventional FL are not yet clearly understood. These theoretical gaps motivate us to rigorously investigate SA-FL. To this end, we first reveal that conventional FL is not PAC-learnable under partial client participation in the worst case, which advances our understanding of conventional FL. Then, we show that the PAC-learnability of FL with partial client participation can indeed be revived by SA-FL, which theoretically justifies the use of SA-FL for the first time. Lastly, to make SA-FL communication-efficient, we propose the SAFARI (server-aided federated averaging) algorithm, which enjoys convergence guarantees and the same level of communication efficiency and privacy as state-of-the-art FL.

1. INTRODUCTION

Since the seminal work by McMahan et al. (2017), federated learning (FL) has emerged as a powerful distributed learning paradigm that enables a large number of clients (e.g., edge devices) to collaboratively train a model under a central server's coordination. However, as FL has gained popularity, it has also become apparent that FL faces a key challenge unseen in traditional distributed learning in datacenter settings: system heterogeneity. Generally speaking, system heterogeneity in FL is caused by the vastly different computation and communication capabilities of each client (computational power, communication capacity, drop-out rate, etc.). Studies have shown that system heterogeneity can affect client participation in a highly non-trivial fashion and severely degrade the learning performance of FL algorithms (Bonawitz et al., 2019; Yang et al., 2021a). For example, it is shown in (Yang et al., 2021a) that more than 30% of clients never participate in FL, while only 30% of the clients contribute 81% of the total computation, even when the server samples clients uniformly. Exacerbating the problem, clients' statuses can be unstable and time-varying due to the aforementioned computation and communication constraints. To mitigate the impact of partial client participation, an approach called server-aided federated learning (SA-FL) has been increasingly adopted in practical FL systems in recent years (see, e.g., (Zhao et al., 2018; Wang et al., 2021b)). The basic idea of SA-FL is to equip the server in FL with a small auxiliary dataset that approximately mimics the population distribution, so that the distribution deviation induced by partial client participation can be corrected. To date, even though SA-FL has been empirically shown to be quite effective in addressing the partial client participation problem in practice, there remains a lack of theoretical understanding of SA-FL.
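The distribution deviation above can be made concrete with a toy numerical check (illustrative only, not from the paper): averaging over a uniformly sampled subset of clients is an unbiased estimate of the full-participation average, whereas systematically excluding some clients biases it whenever their data differ. All names and numbers below are made-up assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Ten clients whose local gradients differ systematically
# (heterogeneous data): client i's gradient is centered at i.
client_grads = rng.normal(loc=np.arange(10, dtype=float), scale=0.1, size=10)
full_avg = client_grads.mean()  # the "full participation" target

def partial_avg(num_eligible, rounds=20_000, per_round=3):
    """Average gradient seen by the server when only the first
    `num_eligible` clients ever participate, sampling `per_round`
    of them uniformly at random in each round."""
    ests = [client_grads[rng.choice(num_eligible, size=per_round,
                                    replace=False)].mean()
            for _ in range(rounds)]
    return float(np.mean(ests))

uniform_est = partial_avg(num_eligible=10)  # all clients eligible
biased_est = partial_avg(num_eligible=7)    # clients 7-9 never participate

print(abs(uniform_est - full_avg))  # close to zero: unbiased
print(abs(biased_est - full_avg))   # bounded away from zero: biased
```

SA-FL aims to correct exactly this kind of persistent bias by letting the server's auxiliary dataset stand in for the never-participating clients.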
This motivates us to rigorously investigate the efficacy of SA-FL against partial client participation in this paper. Somewhat counterintuitively, to fully understand SA-FL, one must first see what happens when SA-FL is not used and partial client participation is left untreated in conventional FL. In other words, we need to first answer the following fundamental question: "1) What are the impacts of partial client participation on FL learning performance?" Upon answering this question, the next important follow-up question regarding SA-FL is: "2) What benefits could SA-FL bring, and how could we theoretically characterize them?" Also, since SA-FL still largely follows the server-client architecture that demands intensive communication between server and clients, the third fundamental question regarding SA-FL is: "3) Could we make SA-FL as communication-efficient as conventional FL?" Answering these three questions constitutes the rest of this paper: we address the first two through the lens of PAC (probably approximately correct) learnability, and resolve the third by proposing a communication-efficient SA-FL algorithm. Our major contributions in this work are summarized as follows:

• By establishing a worst-case generalization error lower bound, we show that conventional FL is not PAC-learnable under partial client participation. In other words, no learning algorithm can approach zero generalization error under partial participation in conventional FL, even in the limit of infinitely many data samples and training iterations. This insight, though negative, underscores the necessity of developing new algorithmic techniques and system architectures (e.g., SA-FL) that modify the conventional FL framework to mitigate partial client participation.
• Inspired by techniques from domain adaptation, we prove a new generalization error bound showing that SA-FL can indeed revive the PAC-learnability of FL under partial client participation. Notably, this bound vanishes asymptotically as the number of data samples increases, which is much stronger than previous results in domain adaptation that retain a non-vanishing error (see Section 2 for details).

• To make SA-FL communication-efficient, we propose a new training algorithm for SA-FL called SAFARI (server-aided federated averaging). By carefully designing the update coordination between the server and the clients, we show that SAFARI achieves an O(1/√KR) convergence rate, matching that of state-of-the-art conventional FL algorithms. We also conduct extensive experiments to demonstrate the efficacy and efficiency of SAFARI.

The rest of this paper is organized as follows. In Section 2, we review the literature to put our work in comparative perspective. Section 3 presents the PAC-learnability analysis of standard FL under partial participation and of our proposed SA-FL framework. We then propose the SAFARI algorithm with convergence guarantees in Section 4.
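As a rough sketch of the server-aided idea (the actual SAFARI update rule is given in Section 4; the function names, the least-squares objective, and the mixing weight `alpha` below are illustrative assumptions, not the paper's algorithm): participating clients run local steps as in FedAvg, and the server additionally computes an update on its small auxiliary dataset to compensate for the missing clients.

```python
import numpy as np

def client_update(w, data, lr=0.1, local_steps=5):
    """Run a few local gradient steps on a least-squares objective
    (stand-in for the client's local training)."""
    X, y = data
    for _ in range(local_steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def server_aided_round(w, client_datasets, server_data, alpha=0.5):
    """One server-aided round: average the participating clients'
    models, then mix in a server-side update computed on the
    auxiliary dataset. `alpha` (hypothetical) trades off the
    client aggregate against the server correction."""
    client_models = [client_update(w.copy(), d) for d in client_datasets]
    w_clients = np.mean(client_models, axis=0)
    w_server = client_update(w.copy(), server_data)
    return alpha * w_clients + (1 - alpha) * w_server
```

When the auxiliary dataset approximately mimics the population distribution, the server term pulls the aggregate back toward the population optimum even if the participating clients form a biased subset.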

2. RELATED WORK

1) Client Participation in Federated Learning: The seminal FedAvg algorithm was first proposed in (McMahan et al., 2017) as a heuristic to improve communication efficiency and data privacy in FL. Since then, there have been many follow-ups addressing the data heterogeneity challenge in FL (e.g., (Li et al., 2020a; Wang et al., 2020; Zhang et al., 2020; Acar et al., 2021; Karimireddy et al., 2020; Luo et al., 2021; Mitra et al., 2021; Karimireddy et al., 2021; Khanduri et al., 2021; Murata & Suzuki, 2021; Avdiukhin & Kasiviswanathan, 2021)). However, most of these works (e.g., (Li et al., 2020a; Wang et al., 2020; Zhang et al., 2020; Acar et al., 2021; Karimireddy et al., 2020; Yang et al., 2021b)) rely on the full or uniform (i.e., sampling clients uniformly at random) client participation assumption. This assumption is essential because it ensures that the stochastic gradient estimator is unbiased in each round of updates. Thus, even if "model drift" or "objective inconsistency" emerges due to local updates (Karimireddy et al., 2020; Wang et al., 2020), full/uniform client participation in each communication round averages these effects out in the long run, thereby guaranteeing convergence. A related line of work in FL, different from full/uniform client participation, focuses on proactively creating flexible client participation (see, e.g., (Xie et al., 2019; Ruan et al., 2021; Gu et al., 2021; Avdiukhin & Kasiviswanathan, 2021; Yang et al., 2022; Wang & Ji, 2022)). The main idea is to allow asynchronous communication or fixed participation patterns (e.g., given participation probabilities) so that clients can flexibly participate in training. Existing works in this area often require extra assumptions, such as bounded delays (Ruan et al., 2021; Gu et al., 2021; Yang et al., 2022; Avdiukhin & Kasiviswanathan, 2021) and identical computation rates (Avdiukhin & Kasiviswanathan, 2021).
Under these assumptions, although the stochastic gradients are no longer unbiased estimators of the

