ANCHOR SAMPLING FOR FEDERATED LEARNING WITH PARTIAL CLIENT PARTICIPATION

Abstract

In federated learning, partial client participation offers a flexible training strategy, but it degrades training efficiency. In this paper, we propose FedAMD, a framework that improves convergence while preserving this flexibility. Its core idea is anchor sampling, which partitions the participating clients into an anchor group and a miner group. Each client in the anchor group computes a gradient with a large batch to locate its local bullseye. Guided by these bullseyes, clients in the miner group perform multiple near-optimal local updates using small batches and contribute to the global model update. Through the joint efforts of both groups, FedAMD accelerates the training process and improves model performance. Measured by ϵ-approximation and compared with state-of-the-art first-order methods, FedAMD converges in up to O(1/ϵ) fewer communication rounds under non-convex objectives. In particular, it achieves a linear convergence rate under the PL condition. Empirical studies on real-world datasets validate the effectiveness of FedAMD and demonstrate the superiority of the proposed algorithm: it not only considerably saves computation and communication costs but also significantly improves test accuracy.

1. INTRODUCTION

Federated learning (FL) (Konečnỳ et al., 2015; 2016; McMahan et al., 2017) has attracted increasing interest over the past few years. As a distributed training paradigm, it enables a group of clients to collaboratively train a global model from decentralized data under the orchestration of a central server. In this way, data privacy is largely preserved because raw data are never shared across clients. Due to unreliable network connections and the rapid proliferation of FL clients, it is infeasible to require all clients to be simultaneously involved in training. To address this issue, recent works (Li et al., 2019b; Philippenko & Dieuleveut, 2020; Gorbunov et al., 2021a; Karimireddy et al., 2020b; Yang et al., 2020; Li et al., 2020; Eichner et al., 2019; Yan et al., 2020; Ruan et al., 2021; Gu et al., 2021; Lai et al., 2021) introduce a practical setting in which only a portion of the clients participates in training. This partial-client scenario effectively avoids network congestion at the FL server and significantly shortens idle time compared to traditional large-scale machine learning (Zinkevich et al., 2010; Bottou, 2010; Dean et al., 2012; Bottou et al., 2018). However, a model trained with partial client participation is much worse than one trained with full client participation (Yang et al., 2020). This phenomenon is attributable to two factors: data heterogeneity (a.k.a. non-i.i.d. data) and the absence of inactive clients' updates. Under data heterogeneity, the optimal model depends on the local data distribution, and therefore the local updates of the clients' models greatly deviate from the update towards the optimal global parameters (Karimireddy et al., 2020b; Malinovskiy et al., 2020; Pathak & Wainwright, 2020; Wang et al., 2020; 2021; Mitra et al., 2021; Rothchild et al., 2020; Zhao et al., 2018; Wu et al., 2021).
FedAvg (McMahan et al., 2017; Li et al., 2019b; Yu et al., 2019a; b; Stich, 2018), for example, is less likely to follow a correct update towards the global minimizer because model aggregation over the active clients critically deviates from aggregation over all clients, which is the expected direction towards the global minimizer (Yang et al., 2020). As a family of practical solutions to data heterogeneity, variance-reduction techniques (Karimireddy et al., 2020b; Gorbunov et al., 2021a; Wu et al., 2021; Gorbunov et al., 2021b; Liang et al., 2019; Shamir et al., 2014; Li et al., 2019a; 2021b; Karimireddy et al., 2020a; Murata & Suzuki, 2021) achieve an improved convergence rate compared to FedAvg. With multiple local updates, each client corrects its SGD steps with reference to an estimated global target, which is synchronized at each communication round.
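To make the anchor-sampling idea concrete, the following toy sketch simulates one possible instantiation on synthetic quadratic losses: anchor clients refresh a near-exact "bullseye" gradient at the current global model, while miner clients run several small-batch local steps whose drift is corrected, in a variance-reduction style, toward the averaged bullseye direction. This is a simplified illustration under assumed mechanics, not the paper's exact update rule; all names (`fedamd_round`, `bullseyes`, the noise model for batch size) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed): client i holds f_i(x) = 0.5 * ||x - t_i||^2,
# so grad f_i(x) = x - t_i, and the global optimum is mean(t_i).
NUM_CLIENTS, DIM = 20, 5
targets = 3.0 + rng.normal(size=(NUM_CLIENTS, DIM))  # heterogeneous local optima

def local_grad(i, x, noise):
    """Stochastic gradient of client i; `noise` mimics batch size
    (large batch -> noise 0, small batch -> larger noise)."""
    return (x - targets[i]) + noise * rng.normal(size=DIM)

def fedamd_round(x_global, anchors, miners, bullseyes, local_steps=5, lr=0.1):
    """One communication round of an anchor-sampling scheme (a sketch)."""
    # Anchor group: compute a large-batch (here, exact) gradient at the
    # current global model and store it as the client's bullseye.
    for i in anchors:
        bullseyes[i] = local_grad(i, x_global, noise=0.0)

    # Miner group: multiple small-batch local steps, each corrected
    # toward the averaged bullseye direction (variance-reduction style).
    global_estimate = np.mean(list(bullseyes.values()), axis=0)
    updates = []
    for i in miners:
        x = x_global.copy()
        for _ in range(local_steps):
            g = local_grad(i, x, noise=0.5)
            # Drift correction: subtract the stale local reference,
            # add the shared global estimate (SCAFFOLD-like form).
            corrected = g - bullseyes.get(i, g) + global_estimate
            x -= lr * corrected
        updates.append(x)
    return np.mean(updates, axis=0) if updates else x_global

# Run a few rounds with partial participation (8 of 20 clients per round).
x, bullseyes = np.zeros(DIM), {}
for _ in range(50):
    sampled = rng.choice(NUM_CLIENTS, size=8, replace=False)
    x = fedamd_round(x, sampled[:3], sampled[3:], bullseyes)

global_opt = targets.mean(axis=0)
print(np.linalg.norm(x - global_opt))  # small residual despite heterogeneity
```

Note how the miners, not the anchors, produce the model update: the anchors spend their budget on an accurate gradient that keeps the correction term fresh, which is what lets the small-batch local steps stay aligned with the global descent direction.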

