FEDERATED LEARNING'S BLESSING: FEDAVG HAS LINEAR SPEEDUP

Abstract

Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-i.i.d. data across the network, low device participation, high communication costs, and the mandate that data remain private bring challenges to understanding the convergence of FL algorithms, particularly with regard to how convergence scales with the number of participating devices. In this paper, we focus on Federated Averaging (FedAvg)-arguably the most popular and effective FL algorithm class in use today-and provide a unified and comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, it remains open how FedAvg's convergence scales with the number of participating devices in the fully heterogeneous FL setting-a crucial question whose answer would shed light on the performance of FedAvg in large FL systems. We fill this gap by providing a unified analysis that establishes convergence guarantees for FedAvg for three classes of problems: strongly convex smooth, convex smooth, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates and communication efficiencies. While there have been linear speedup results in distributed optimization that assume full participation, ours are the first to establish linear speedup for FedAvg under both statistical and system heterogeneity. For strongly convex and convex problems, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm, which are the first linear speedup guarantees for momentum variants of FedAvg in the convex setting. To provably accelerate FedAvg, we design a new momentum-based FL algorithm that further improves the convergence rate in overparameterized linear regression problems. Empirical studies of the algorithms in various settings support our theoretical results.

1. INTRODUCTION

Federated learning (FL) is a machine learning paradigm where many clients (e.g., mobile devices or organizations) collaboratively train a model under the orchestration of a central server (e.g., a service provider), while keeping the training data decentralized (Smith et al. (2017); Kairouz et al. (2019)). In recent years, FL has swiftly emerged as an important learning paradigm (McMahan et al. (2017); Li et al. (2020a))-one that enjoys widespread success in applications such as personalized recommendation (Chen et al. (2018)), virtual assistants (Lam et al. (2019)), and keyboard prediction (Hard et al. (2018)), to name a few-for at least three reasons. First, the rapid proliferation of smart devices equipped with both computing power and data-capturing capabilities has provided the infrastructure core for FL. Second, the rising awareness of privacy and the explosive growth of computational power in mobile devices have made it increasingly attractive to push computation to the edge. Third, the empirical success of communication-efficient FL algorithms has enabled increasingly large-scale parallel computing and learning with less communication overhead. Despite its promise and broad applicability, the potential value FL delivers is coupled with the unique challenges it brings forth. In particular, when FL learns a single statistical model using data from across all the devices while keeping each individual device's data isolated (Kairouz et al. (2019)), it faces two challenges that are absent in centralized optimization and distributed (stochastic) optimization (Zhou & Cong (2018); Stich (2019); Khaled et al. (2019); Liang et al. (2019); Wang & Joshi (2018); Woodworth et al. (2018); Wang et al. (2019); Jiang & Agrawal (2018); Yu et al. (2019b;a); Khaled et al. (2020); Koloskova et al. (2020); Woodworth et al. (2020b;a)):

1) Data (statistical) heterogeneity: data distributions across devices differ (and data cannot be shared);

2) System heterogeneity: only a subset of devices may access the central server at each time, both because communication bandwidth profiles vary across devices and because no central server has control over when a device is active (the presence of "stragglers").

Table 1: Our convergence results for FedAvg and accelerated FedAvg in this paper. Throughout the paper, N is the total number of local devices, and K ≤ N is the maximal number of devices that are accessible to the central server. T is the total number of stochastic updates performed by each local device, and E is the number of local steps between two consecutive server communications (hence T/E is the number of communications). † In the linear regression setting, we have κ = κ1 for FedAvg and κ = √(κ1κ̃) for momentum-accelerated FedAvg (FedMaSS), where κ1 and κ̃ are condition numbers defined in Section G. Since κ1 ≥ κ, this implies a speedup factor of κ1/κ.

Participation | Strongly convex    | Convex              | Linear regression, FedAvg | Linear regression, accelerated†
Full          | O(1/NT + E²/T²)    | O(1/√NT + NE²/T)    | O(exp(-NT/(Eκ1)))         | O(exp(-NT/(Eκ)))
Partial       | O(E²/KT + E²/T²)   | O(E²/√KT + KE²/T)   | O(exp(-KT/(Eκ1)))         | O(exp(-KT/(Eκ)))

To address these challenges, Federated Averaging (FedAvg) (McMahan et al. (2017)) was proposed as a particularly effective heuristic, and it has enjoyed great empirical success. This success has since motivated a growing line of research into understanding its theoretical convergence guarantees in various settings. For instance, Haddadpour & Mahdavi (2019) analyzed FedAvg (for non-convex smooth problems satisfying PL conditions) under the assumption that each local device's minimizer coincides with the minimizer of the joint problem (if all devices' data were aggregated together), an overly restrictive assumption that limits the extent of data heterogeneity. Very recently, Li et al. (2020b) furthered this progress and established an O(1/T) convergence rate for FedAvg on strongly convex smooth federated learning problems with both data and system heterogeneity. In a similar setting, Karimireddy et al. (2019) also established an O(1/T) result that allows for a linear speedup when the number of participating devices is large. At the same time, Huo et al. (2020) studied Nesterov accelerated FedAvg for non-convex smooth problems and established an O(1/√T) convergence rate to stationary points.

However, despite these very recent fruitful pioneering efforts toward understanding the theoretical convergence properties of FedAvg, it remains open how the number of devices-particularly the number of devices that participate in the computation-affects the convergence speed. In particular, is linear speedup of FedAvg a universal phenomenon across different settings and for any number of devices? What about when FedAvg is accelerated with momentum updates? Does the presence of both data and system heterogeneity in FL imply different communication complexities and require technical novelties over results in distributed and decentralized optimization? These aspects are currently unexplored or underexplored in FL. We fill in the gaps here by providing affirmative answers.

Our Contributions. We provide a comprehensive and unified convergence analysis of FedAvg and its accelerated variants in the presence of both data and system heterogeneity. Our contributions are threefold. First, we establish an O(1/KT) convergence rate under FedAvg for strongly convex smooth problems and an O(1/√KT) convergence rate for convex smooth problems (where K is the number of participating devices), thereby establishing that FedAvg enjoys the desirable linear speedup property in the FL setup. Prior to our work, the best and most closely related convergence analyses were given by Li et al. (2020b) and Karimireddy et al. (2019), which established an O(1/T) convergence rate for strongly convex smooth problems under FedAvg. Our rate matches the same (and optimal) dependence on T, but also completes the picture by establishing the linear dependence on K for any K ≤ N, where N is the total number of devices, whereas Li et al. (2020b) does not have a linear speedup analysis, and Karimireddy et al. (2019) only allows linear speedup close to full participation (K = O(N)). As for convex smooth problems, no prior work established the O(1/√T) rate under both system and data heterogeneity. Our unified analysis highlights the common elements and distinctions between the strongly convex and convex settings. Second, we establish the same convergence rates-O(1/KT) for strongly convex and smooth problems and O(1/
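In the paper's notation (N total devices, K ≤ N devices participating per round, E local SGD steps between two consecutive server communications), a minimal FedAvg simulation can be sketched as follows. The heterogeneous least-squares data model, device counts, and step size below are illustrative assumptions for the sketch, not the paper's experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative heterogeneous least-squares federation: device i holds (A[i], b[i])
# with its own noise level, so local minimizers differ across devices.
N, K, E, d = 20, 5, 10, 5      # total devices, sampled per round, local steps, dimension
rounds, lr = 200, 0.05
A = [rng.normal(size=(30, d)) for _ in range(N)]
x_true = 2.0 * rng.normal(size=d)
b = [A[i] @ x_true + rng.normal(scale=0.1 + 0.5 * i / N, size=30) for i in range(N)]

def global_loss(w):
    """Average least-squares loss over all N devices."""
    return np.mean([np.mean((Ai @ w - bi) ** 2) / 2 for Ai, bi in zip(A, b)])

def local_sgd(w, Ai, bi, steps, lr):
    """E steps of single-sample SGD on device loss (1/2m)||A_i w - b_i||^2."""
    for _ in range(steps):
        j = rng.integers(len(bi))
        w = w - lr * (Ai[j] @ w - bi[j]) * Ai[j]
    return w

w = np.zeros(d)
loss0 = global_loss(w)
for _ in range(rounds):
    picked = rng.choice(N, size=K, replace=False)   # partial participation
    w = np.mean([local_sgd(w, A[i], b[i], E, lr) for i in picked], axis=0)

print(loss0, global_loss(w))   # global loss drops toward the noise floor
```

Each communication round averages only the K sampled local models, matching the partial-participation setting; setting K = N recovers full participation, and E = 1 recovers synchronized parallel SGD.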


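The momentum mechanism behind the accelerated variants can be seen on a single device: between two communications, plain SGD steps are replaced by Nesterov-style gradient-plus-extrapolation steps. The deterministic quadratic, step size, and momentum parameter below are illustrative assumptions (the exact parameterizations of Nesterov accelerated FedAvg and FedMaSS are given in the paper's later sections); the sketch only shows why momentum improves the dependence on the condition number.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative strongly convex quadratic f(w) = 0.5 * w^T H w with minimizer w* = 0.
d = 10
H = np.diag(np.linspace(1.0, 50.0, d))   # eigenvalues 1..50, condition number 50
lr, beta, steps = 1.0 / 50.0, 0.8, 300   # assumed step size and momentum
grad = lambda w: H @ w

def gd_local(w0, grad, lr, steps):
    """Plain gradient descent: the baseline local update."""
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def nesterov_local(w0, grad, lr, beta, steps):
    """Nesterov-style local updates: gradient step at w, then extrapolation."""
    w, y_prev = w0.copy(), w0.copy()
    for _ in range(steps):
        y = w - lr * grad(w)            # gradient step
        w = y + beta * (y - y_prev)     # momentum extrapolation
        y_prev = y
    return y_prev

w0 = rng.normal(size=d)
w_gd = gd_local(w0, grad, lr, steps)
w_nag = nesterov_local(w0, grad, lr, beta, steps)
print(np.linalg.norm(w_gd), np.linalg.norm(w_nag))   # momentum lands far closer to w*
```

On this quadratic, plain gradient descent contracts the slowest coordinate by (1 - 1/κ) per step, while the momentum iteration contracts at roughly (1 - 1/√κ), which is the √κ-type gain reflected in the exp(-KT/(Eκ)) versus exp(-KT/(Eκ1)) rates in the overparameterized setting.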