ACHIEVING LINEAR SPEEDUP WITH PARTIAL WORKER PARTICIPATION IN NON-IID FEDERATED LEARNING

Abstract

Federated learning (FL) is a distributed machine learning architecture that leverages a large number of workers to jointly learn a model with decentralized data. FL has received increasing attention in recent years thanks to its data privacy protection, communication efficiency, and linear speedup for convergence in training (i.e., convergence performance improves linearly with respect to the number of workers). However, existing studies on linear speedup for convergence are limited to the assumptions of i.i.d. datasets across workers and/or full worker participation, both of which rarely hold in practice. So far, it remains an open question whether the linear speedup for convergence is achievable under non-i.i.d. datasets with partial worker participation in FL. In this paper, we show that the answer is affirmative. Specifically, we show that the federated averaging (FedAvg) algorithm (with two-sided learning rates) on non-i.i.d. datasets in non-convex settings achieves a convergence rate of O(1/√(mKT) + 1/T) for full worker participation and O(√K/√(nT) + 1/T) for partial worker participation, where K is the number of local steps, T is the total number of communication rounds, m is the total number of workers, and n is the number of workers participating in one communication round under partial worker participation. Our results also reveal that the local steps in FL could help the convergence, and we show that the maximum number of local steps can be improved to T/m under full worker participation. We conduct extensive experiments on MNIST and CIFAR-10 to verify our theoretical results.

1. INTRODUCTION

Federated Learning (FL) is a distributed machine learning paradigm that leverages a large number of workers to collaboratively learn a model with decentralized data under the coordination of a centralized server. Formally, the goal of FL is to solve an optimization problem that can be decomposed as

min_{x ∈ R^d} f(x) := (1/m) Σ_{i=1}^{m} F_i(x),

where F_i(x) ≜ E_{ξ_i ∼ D_i}[F_i(x, ξ_i)] is the local (non-convex) loss function associated with a local data distribution D_i, and m is the number of workers. FL allows a large number of workers (such as edge devices) to participate flexibly without sharing data, which helps protect data privacy. However, it also introduces two unique challenges unseen in traditional distributed learning algorithms, which are typically used in large data centers:

• Non-independent-identically-distributed (non-i.i.d.) datasets across workers (data heterogeneity): In conventional distributed learning in data centers, the distribution of each worker's local dataset can usually be assumed to be i.i.d., i.e., D_i = D, ∀i ∈ {1, ..., m}. Unfortunately, this assumption rarely holds for FL, since data are generated locally at the workers based on their own circumstances, i.e., D_i ≠ D_j for i ≠ j. It will be seen later that the non-i.i.d. setting imposes significant challenges on algorithm design for FL and its performance analysis.

• Time-varying partial worker participation (systems non-stationarity): With the flexibility for workers' participation in many scenarios (particularly in mobile edge computing), workers may randomly join or leave the FL system at will, rendering the active worker set stochastic and time-varying across communication rounds. Hence, it is often infeasible to wait for all workers' responses as in traditional distributed learning, since inactive workers or stragglers would significantly slow down the whole training process.
As a result, only a subset of the workers may be chosen by the server in each communication round, i.e., partial worker participation.

In recent years, the Federated Averaging method (FedAvg) and its variants (McMahan et al., 2016; Li et al., 2018; Hsu et al., 2019; Karimireddy et al., 2019; Wang et al., 2019a) have emerged as a prevailing approach for FL. Similar to traditional distributed learning, FedAvg leverages local computation at each worker and employs a centralized parameter server to aggregate and update the model parameters. The unique feature of FedAvg is that each worker runs multiple local stochastic gradient descent (SGD) steps, rather than just one step as in traditional distributed learning, between two consecutive communication rounds. For i.i.d. datasets and the full worker participation setting, Stich (2018) and Yu et al. (2019b) proposed two variants of FedAvg that achieve a convergence rate of O(mK/T + 1/√(mKT)) under a bounded gradient assumption, for both strongly convex and non-convex problems, where m is the number of workers, K is the number of local update steps, and T is the total number of communication rounds. Wang & Joshi (2018) and Stich & Karimireddy (2019) further proposed improved FedAvg algorithms that achieve an O(m/T + 1/√(mKT)) convergence rate without the bounded gradient assumption. Notably, for a sufficiently large T, the above rates become O(1/√(mKT)),¹ which implies a linear speedup with respect to the number of workers.² This linear speedup is highly desirable for an FL algorithm because it means the algorithm can effectively leverage the massive parallelism of a large FL system. However, with non-i.i.d. datasets and partial worker participation in FL, a fundamental open question arises: Can we still achieve the same linear speedup for convergence, i.e., O(1/√(mKT)), with non-i.i.d. datasets and under either full or partial worker participation?

In this paper, we show that the answer to the above question is affirmative. Specifically, we show that a generalized FedAvg with two-sided learning rates achieves a linear convergence speedup with non-i.i.d. datasets and under full/partial worker participation. We highlight our contributions as follows:

• For non-convex problems, we show that the convergence rates of the FedAvg algorithm on non-i.i.d. datasets are O(1/√(mKT) + 1/T) and O(√K/√(nT) + 1/T) for full and partial worker participation, respectively, where n is the size of the partially participating worker set. This indicates that our proposed algorithm achieves a linear speedup in the convergence rate for a sufficiently large T. When reduced to the i.i.d. case, our convergence rate becomes O(1/(TK) + 1/√(mKT)), which is also better than previous works. We summarize the convergence rate comparisons for both i.i.d. and non-i.i.d. cases in Table 1. It is worth noting that our proof does not require the bounded gradient assumption. We note that the SCAFFOLD algorithm (Karimireddy et al., 2019) also achieves a linear speedup, but it requires extra variance reduction operations, which lead to higher communication costs and implementation complexity. By contrast, we have no such extra requirements in this paper.

• In order to achieve a linear speedup, i.e., a convergence rate of O(1/√(mKT)), we show that the number of local updates K can be as large as T/m, which improves the T^(1/3)/m result previously shown in Yu et al. (2019a) and Karimireddy et al. (2019). As shown later in the communication complexity comparison in Table 1, a larger number of local steps implies relatively fewer communication rounds and thus less communication overhead. Interestingly, our results also indicate that, with a proper choice of learning rates, the number of local updates K does not hurt but rather helps the convergence under full worker participation. This overcomes the limitation suggested in Li et al. (2019b) that local SGD steps might slow down the convergence (O(K/T) for the strongly convex case). This result also reveals new insights into the relationship between the number of local steps and the learning rates.

¹ This rate also matches the convergence rate order of parallel SGD in conventional distributed learning.
² To attain an ε-accuracy, an algorithm with a convergence rate of O(1/√T) needs O(1/ε²) steps, while one with a convergence rate of O(1/√(mT)) needs only O(1/(mε²)) steps (the hidden constant in the Big-O is the same). In this sense, one achieves a linear speedup with respect to the number of workers.
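As a concrete illustration, the FedAvg-style scheme with two-sided learning rates and partial worker participation discussed above can be sketched as follows. This is a minimal toy sketch, not the paper's algorithm specification: the quadratic local losses, the uniform sampling of n workers, and all names (eta_l, eta, centers, local_sgd) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, K, T = 8, 3, 5, 50   # workers, model dimension, local steps, rounds
n = 4                      # workers sampled per round (partial participation)
eta_l, eta = 0.05, 0.8     # two-sided rates: local (worker) and global (server)

# Hypothetical non-i.i.d. setup: worker i holds the loss F_i(x) = ||x - c_i||^2
# with its own center c_i, so local minimizers disagree across workers.
centers = rng.normal(size=(m, d))

def local_sgd(x0, c, steps, lr):
    """Run `steps` stochastic gradient steps on F_i(x) = ||x - c||^2."""
    x = x0.copy()
    for _ in range(steps):
        noise = rng.normal(scale=0.01, size=d)  # stochastic-gradient noise
        x -= lr * (2 * (x - c) + noise)
    return x

x = np.zeros(d)  # global model at the server
for t in range(T):
    S = rng.choice(m, size=n, replace=False)  # sample n of m workers this round
    # Each sampled worker runs K local steps and reports its model delta.
    deltas = [local_sgd(x, centers[i], K, eta_l) - x for i in S]
    # Server step: apply the averaged delta scaled by the global learning rate.
    x = x + eta * np.mean(deltas, axis=0)

# x should approach the minimizer of f, the mean of the local centers.
print(np.linalg.norm(x - centers.mean(axis=0)))
```

Full worker participation corresponds to n = m, and vanilla FedAvg to a server rate eta = 1; the separate server rate acting on the averaged local deltas is what makes the scheme "two-sided."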

