ANCHOR SAMPLING FOR FEDERATED LEARNING WITH PARTIAL CLIENT PARTICIPATION

Abstract

In federated learning, partial client participation offers a flexible training strategy, but it degrades training efficiency. In this paper, we propose FedAMD, a framework that improves convergence while preserving this flexibility. Its core idea is anchor sampling, which partitions the participating clients into an anchor group and a miner group. Each client in the anchor group computes its gradient on a large batch to locate its local bullseye. Guided by the bullseyes, clients in the miner group perform multiple near-optimal local updates using small batches and update the global model. With the joint efforts of both groups, FedAMD accelerates the training process and improves model performance. Measured by ϵ-approximation and compared with state-of-the-art first-order methods, FedAMD achieves convergence in up to O(1/ϵ) fewer communication rounds under non-convex objectives. In particular, it achieves a linear convergence rate under the PL condition. Empirical studies on real-world datasets validate the effectiveness of FedAMD and demonstrate the superiority of the proposed algorithm: it not only considerably saves computation and communication costs but also significantly improves test accuracy.

1. INTRODUCTION

Federated learning (FL) (Konečnỳ et al., 2015; 2016; McMahan et al., 2017) has attracted increasing interest over the past few years. As a distributed training paradigm, it enables a group of clients to collaboratively train a global model from decentralized data under the orchestration of a central server. In this way, data privacy is largely preserved because the raw data are never shared across clients. Due to unreliable network connections and the rapid proliferation of FL clients, it is infeasible to require all clients to be simultaneously involved in training. To address this issue, recent works (Li et al., 2019b; Philippenko & Dieuleveut, 2020; Gorbunov et al., 2021a; Karimireddy et al., 2020b; Yang et al., 2020; Li et al., 2020; Eichner et al., 2019; Yan et al., 2020; Ruan et al., 2021; Gu et al., 2021; Lai et al., 2021) introduce a practical setting where merely a portion of clients participates in training. This partial-client scenario effectively avoids network congestion at the FL server and significantly shortens idle time compared to traditional large-scale machine learning (Zinkevich et al., 2010; Bottou, 2010; Dean et al., 2012; Bottou et al., 2018). However, a model trained with partial client participation is much worse than one trained with full client participation (Yang et al., 2020). This phenomenon is attributable to two factors, namely, data heterogeneity (a.k.a. non-i.i.d. data) and the absence of updates from inactive clients. Under data heterogeneity, each client's optimal model depends on its local data distribution, and therefore the local updates on the clients' models greatly deviate from the update towards the optimal global parameters (Karimireddy et al., 2020b; Malinovskiy et al., 2020; Pathak & Wainwright, 2020; Wang et al., 2020; 2021; Mitra et al., 2021; Rothchild et al., 2020; Zhao et al., 2018; Wu et al., 2021).
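To make this aggregation deviation concrete, the following minimal simulation (our illustrative sketch, not code from the paper; the quadratic local objectives, client counts, and step sizes are assumptions) compares a FedAvg-style aggregate over a sampled subset of clients against the full-client aggregate under heterogeneous local optima:

```python
import numpy as np

rng = np.random.default_rng(0)
M, A, d = 100, 10, 5          # total clients, active clients, model dimension

# Heterogeneous (non-i.i.d.) data: client i has local optimum c_i, i.e.,
# local objective F_i(x) = 0.5 * ||x - c_i||^2 with gradient (x - c_i).
c = rng.normal(size=(M, d))
x = np.zeros(d)               # current global model

def local_update(x, c_i, lr=0.1, steps=5):
    """Run a few local gradient steps on client i's quadratic objective."""
    for _ in range(steps):
        x = x - lr * (x - c_i)
    return x

# Full participation: aggregate the local models of all M clients.
full_avg = np.mean([local_update(x, c[i]) for i in range(M)], axis=0)

# Partial participation: aggregate over a random subset of A clients only.
active = rng.choice(M, size=A, replace=False)
part_avg = np.mean([local_update(x, c[i]) for i in active], axis=0)

# Under heterogeneity the partial aggregate deviates from the full one,
# biasing the global update away from the expected direction.
deviation = np.linalg.norm(part_avg - full_avg)
print(f"aggregation deviation: {deviation:.4f}")
```

With i.i.d. local optima this deviation shrinks as A grows, but for any fixed A < M it remains nonzero under heterogeneity, which is the drift the variance-reduction methods below try to correct.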
FedAvg (McMahan et al., 2017; Li et al., 2019b; Yu et al., 2019a; b; Stich, 2018), for example, is less likely to follow a correct update towards the global minimizer because the model aggregation over the active clients critically deviates from the aggregation over all clients, which is the expected direction towards the global minimizer (Yang et al., 2020). As a family of practical solutions to data heterogeneity, variance-reduction techniques (Karimireddy et al., 2020b; Gorbunov et al., 2021a; Wu et al., 2021; Gorbunov et al., 2021b; Liang et al., 2019; Shamir et al., 2014; Li et al., 2019a; 2021b; Karimireddy et al., 2020a; Murata & Suzuki, 2021) achieve improved convergence rates compared to FedAvg. With multiple local updates, each client corrects its SGD steps with reference to an estimated global target, which is synchronized at the beginning of every round. Although, in each transmission round, variance-reduced algorithms incur roughly twice the communication overhead of FedAvg, their improved convergence typically offsets the extra cost.

| Convexity   | Method                                   | Partial Clients | Communication Rounds                                            |
| ----------- | ---------------------------------------- | --------------- | --------------------------------------------------------------- |
| Non-convex  | Minibatch SGD (Wang & Srebro, 2019)      | ✗               | 1/(MKϵ²) + 1/ϵ                                                  |
|             | FedAvg (Yang et al., 2020)               | ✓               | K/(Aϵ²) + 1/ϵ                                                   |
|             | SCAFFOLD (Karimireddy et al., 2020b)     | ✓               | σ²/(AKϵ²) + (M/A)^(2/3) · 1/ϵ                                   |
|             | BVR-L-SGD (Murata & Suzuki, 2021)        | ✗               | 1/(MKϵ^(3/2)) + 1/ϵ                                             |
|             | VR-MARINA (Gorbunov et al., 2021a)       | ✗               | σ/(Mϵ^(3/2)) + σ²/(Mϵ) + 1/ϵ                                    |
|             | FedAMD (Sequential) (Corollary 1)        | ✓               | M/(Aϵ)                                                          |
|             | FedAMD (Constant) (Corollary 2)          | ✓               | M/(Aϵ)                                                          |
| PL (or      | Minibatch SGD* (Woodworth et al., 2020b) | ✗               | σ²/(µMKϵ) + (1/µ) log(1/(µϵ))                                   |
| *strongly-  | FedAvg (Karimireddy et al., 2020a)       | ✓               | (1+σ²/K)/(µAϵ) + √(1+σ²/K)/(µ√ϵ) + (1/µ) log(1/ϵ)               |
| convex)     | SCAFFOLD* (Karimireddy et al., 2020b)    | ✓               | σ²/(µAKϵ) + (M/A + 1/µ) log(M/(µAϵ))                            |
|             | VR-MARINA (Gorbunov et al., 2021a)       | ✗               | σ²/(µMϵ) + σ/(µ^(3/2)M√ϵ) + (1/µ) log(1/ϵ)                      |
|             | FedAMD (Constant) (Corollary 3)          | ✓               | (1/µ + M/(µ²A) + M/A) log(1/ϵ)                                  |

Table 1: Number of communication rounds required to achieve E∥∇F(x_out)∥² ≤ ϵ for non-convex objectives (or E F(x_out) − F* ≤ ϵ under the PL condition, or strong convexity with parameter µ for methods marked *). We consider an online scenario and set the small batch size to 1. The symbol ✓ or ✗ for "Partial Clients" is determined by footnote 1.

Recent studies (Gorbunov et al., 2021a; Murata & Suzuki, 2021; Tyurin & Richtárik, 2022; Zhao et al., 2021) have demonstrated the great potential of using large batches under full client participation (see footnote 1). Measured by ϵ-approximation, MARINA (Gorbunov et al., 2021a), for instance, converges faster by a factor of O(1/(Mϵ^(1/2))) when using large batches, where M denotes the number of clients. However, none of these prior studies address the drawbacks of using large batches. A large-batch update typically requires far more gradient computations than a small-batch update. This increases the burden on FL clients, especially IoT devices such as smartphones, whose hardware can hardly accommodate all samples of a large batch simultaneously. Instead, they must partition the large batch into several small batches to obtain the final gradient. Furthermore, given the critical convergence differences between participation modes, it remains unclear whether large batches are beneficial under partial client participation. BVR-L-SGD (Murata & Suzuki, 2021) and FedPAGE (Zhao et al., 2021) claim to work under partial client participation, but they require all clients to participate whenever the algorithms synchronize using a large batch. Motivated by these observations, we propose FedAMD, a federated learning framework with anchor sampling that partitions the participating clients into two groups, i.e., an anchor group and a miner group. In the anchor group, clients (a.k.a. anchors) compute gradients using a large batch, which are cached at the server to estimate the global orientation. In the miner group, clients (a.k.a.
miners) perform multiple local updates, each corrected according to the previous and current local parameters and the last local update. The purpose of the miner group is twofold. First, multiple local updates without serious deviation effectively accelerate the training process. Second, the global model is updated using the local models from the miner group only. Since anchor sampling partitions the clients with a time-varying probability, we separately consider constant and sequential probability settings.

Contributions. We summarize our contributions as follows:
• Algorithmically, we propose a unified federated learning framework, FedAMD, that identifies each participant as either an anchor or a miner. Clients in the anchor group aim to obtain the bullseyes of their local data using a large batch, while the miners accelerate training with multiple local updates using small batches.
• Theoretically, we establish the convergence rate of FedAMD under non-convex objectives in both the constant and sequential probability settings. To the best of our knowledge, this is the first work to analyze the effectiveness of large batches under partial client participation. Our theoretical



¹ In this paper, partial client participation refers to the case where only a portion of the clients takes part in each round throughout training.
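As discussed above, a resource-limited client cannot process a large batch in one pass and must partition it into several small batches whose gradients are accumulated. A minimal sketch of this accumulation (our illustration, not code from the paper; `grad_fn` is a hypothetical per-batch gradient oracle) is:

```python
import numpy as np

def large_batch_grad(grad_fn, data, micro_batch_size):
    """Compute a large-batch gradient by accumulating small-batch gradients.

    grad_fn(batch) must return the average gradient over `batch`.
    The result equals the average gradient over all of `data`.
    """
    n = len(data)
    acc = None
    for start in range(0, n, micro_batch_size):
        batch = data[start:start + micro_batch_size]
        g = grad_fn(batch) * (len(batch) / n)   # weight by batch fraction
        acc = g if acc is None else acc + g
    return acc

# Example: for F(x) = 0.5 * mean((x - d_i)^2), the gradient at x = 0
# is -mean(d), so accumulation must reproduce the full-batch value.
data = np.arange(10, dtype=float)
grad_fn = lambda batch: -np.mean(batch)        # gradient at x = 0
g_small = large_batch_grad(grad_fn, data, micro_batch_size=3)
g_full = grad_fn(data)
print(g_small, g_full)  # identical up to floating-point error
```

Weighting each small-batch gradient by its fraction of the large batch makes the accumulated result exactly match the large-batch gradient, but only after several sequential passes, which is the extra client-side cost discussed in the introduction.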

