MIME: MIMICKING CENTRALIZED STOCHASTIC ALGORITHMS IN FEDERATED LEARNING

Abstract

Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients. This heterogeneity has been shown to cause client drift, which can significantly degrade the performance of algorithms designed for the FL setting. In contrast, centralized learning with centrally collected data is not affected by such drift and has seen great empirical and theoretical progress with innovations such as momentum and adaptivity. In this work, we propose a general algorithmic framework, MIME, which mitigates client drift and adapts arbitrary centralized optimization algorithms, such as SGD and Adam, to the federated learning setting. MIME uses a combination of control variates and server-level statistics (e.g. momentum) at every client-update step to ensure that each local update mimics that of the centralized method run on i.i.d. data. Our thorough theoretical and empirical analyses strongly establish MIME's superiority over other baselines.

1. INTRODUCTION

Federated learning has become an important paradigm in large-scale machine learning where the training data remains distributed over a large number of clients, which may be mobile phones or network sensors (Konečnỳ et al., 2016a;b; McMahan et al., 2017; Mohri et al., 2019; Kairouz et al., 2019). A centralized model, here referred to as a server model, is then trained without ever transmitting client data over the network, thereby providing some basic levels of data privacy and security. Two important settings are distinguished in federated learning (Kairouz et al., 2019, Table 1): the cross-device and the cross-silo settings. The cross-silo setting corresponds to a relatively small number of reliable clients, typically organizations such as medical or financial institutions. In contrast, in the cross-device federated learning setting, the number of clients may be extremely large and include, for example, all 3.5 billion active Android phones (Holst, 2019). Thus, in that setting, we may never make even a single pass over the entire clients' data during training. The cross-device setting is further characterized by resource-poor clients communicating over a highly unreliable network. Together, the essential features of this setting give rise to unique challenges not present in the cross-silo setting. Here, we are interested in the cross-device setting, for which we will formalize and study stochastic optimization algorithms.

The de facto standard algorithm for this setting is FEDAVG (McMahan et al., 2017), which performs multiple SGD updates on the available clients before communicating to the server. While this approach can reduce the total amount of communication required, performing multiple steps on the same client can lead to 'over-fitting' to its atypical local data, a phenomenon known as client drift (Karimireddy et al., 2020).
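To make the FEDAVG procedure concrete, the following is a minimal sketch of a single round (not the paper's exact pseudocode; the function names, step counts, and learning rates are illustrative assumptions). Each sampled client runs several local SGD steps from the current server model, and the server averages the returned models:

```python
import numpy as np

def fedavg_round(x_server, client_grads, steps=5, lr=0.1):
    """One illustrative FEDAVG round: each client runs `steps` local SGD
    updates starting from the server model, then the server averages.

    `client_grads` is a list of callables, each returning a (stochastic)
    gradient of that client's local loss at a given point."""
    client_models = []
    for grad_fn in client_grads:
        x = x_server.copy()
        for _ in range(steps):
            x -= lr * grad_fn(x)  # multiple local SGD updates on one client
        client_models.append(x)
    # server update: average the returned client models
    return np.mean(client_models, axis=0)
```

With heterogeneous clients, each client's local steps pull its model toward its own local optimum; the averaged model is then biased by these drifted iterates, which is the client-drift phenomenon described above.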
Furthermore, algorithmic innovations such as momentum (Sutskever et al., 2013; Cutkosky and Orabona, 2019), adaptivity (Kingma and Ba, 2014; Zaheer et al., 2018; Zhang et al., 2019), and clipping (You et al., 2017; 2019; Zhang et al., 2020) are critical to the success of deep learning applications and need to be incorporated into the client updates, replacing the SGD update of FEDAVG. Perhaps due to such deficiencies, there exists a large gap in performance between the centralized setting, where data is centrally collected on the server, and the federated setting (Zhao et al., 2018; Hsieh et al., 2019; Hsu et al., 2019; Karimireddy et al., 2020).

To overcome these deficiencies, we propose a new framework, MIME, that mitigates client drift and adapts arbitrary centralized optimization algorithms, e.g. SGD with momentum or Adam, to the federated setting. In each local client update, MIME uses global statistics, e.g. momentum, and an SVRG-style correction to mimic the updates of the centralized algorithm run on i.i.d. data. These global statistics are computed only at the server level and kept fixed throughout the local steps, thereby avoiding a bias due to the atypical local data of any single client.

Contributions. We summarize our main results below.
• We formalize the cross-device federated learning problem and propose a new framework, MIME, that can adapt arbitrary centralized algorithms to this setting.
• We prove that incorporating server momentum into each local client update reduces client drift and leads to optimal statistical rates.
• Further, we quantify the usefulness of performing multiple local updates on a single client by carefully tracking the bias (client drift) introduced. This is the first analysis showing improved rates from taking multiple local steps for general smooth functions.
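The MIME client update described above can be sketched as follows, with SGD-with-momentum as the base centralized optimizer. This is a simplified illustration, not the paper's exact algorithm: we reuse a single `grad_fn` for the minibatch gradients at both the current iterate and the server model, and the variable names are our own.

```python
import numpy as np

def mime_client_update(y, m_server, c_server, grad_fn,
                       steps=5, lr=0.1, beta=0.9):
    """Illustrative MIME local update with SGD+momentum as base optimizer.

    `m_server` (momentum) and `c_server` (the full gradient at the server
    model, used as an SVRG-style control variate) are computed at the
    server and held FIXED throughout the local steps.
    `grad_fn(x)` returns the client's minibatch gradient at x."""
    x_server = y.copy()  # local steps start from the server model
    for _ in range(steps):
        # SVRG-style correction: the client's gradient at y, debiased
        # toward the global gradient at the server model
        d = grad_fn(y) - grad_fn(x_server) + c_server
        # centralized optimizer step using the *fixed* server momentum
        y = y - lr * ((1 - beta) * d + beta * m_server)
    return y
```

Because `m_server` and `c_server` are never updated locally, every local step resembles a step of the centralized optimizer on i.i.d. data rather than a step biased by one client's atypical data.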
Charles and Konečnỳ (2020) derived a tight characterization of FEDAVG on quadratic functions and demonstrated the sensitivity of the algorithm to both client and server step sizes. Matching upper and lower bounds were recently given by Karimireddy et al. (2020) and Woodworth et al. (2020a) for general functions, proving that FEDAVG can be slower than even SGD on heterogeneous data, due to client drift.

Comparison to SCAFFOLD: For the cross-silo setting, where the number of clients is relatively low, Karimireddy et al. (2020) proposed the SCAFFOLD algorithm, which uses control variates (similar to SVRG) to correct for client drift. However, their algorithm crucially relies on stateful clients which repeatedly participate in the training process. In contrast, we focus on the cross-device setting, where clients may be visited only once during training and are stateless. This is akin to the difference between the finite-sum and stochastic settings in traditional centralized optimization.
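The statefulness that SCAFFOLD requires can be seen in a one-line sketch of its local step (illustrative only; variable names are ours):

```python
import numpy as np

def scaffold_local_step(y, c_i, c, grad_fn, lr=0.1):
    """One SCAFFOLD-style local step (illustrative sketch): the client's
    persistent control variate c_i and the server control variate c
    correct the local gradient. The client must keep c_i across rounds,
    i.e. it must be STATEFUL -- exactly what the cross-device setting
    rules out."""
    return y - lr * (grad_fn(y) - c_i + c)
```

MIME avoids the per-client state `c_i` by relying only on server-computed statistics, which is what makes it applicable when each client may be seen only once.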

2. PROBLEM SETUP

This section formalizes the problem of cross-device federated learning. We first examine some key challenges of this setting (cf. Kairouz et al., 2019) to ensure our formalism captures the difficulty:
1. Communication cost between the server and the clients is a major concern and the main bottleneck in federated learning; thus, a key metric for optimization in this setting is the number of communication rounds.



• Finally, we also propose a simpler variant, MIMELITE, with empirical performance similar to MIME. We report the results of a thorough experimental analysis demonstrating that both MIME and MIMELITE are faster than FEDAVG.

Related work. Analysis of FedAvg: Much of the recent work in federated learning has focused on analyzing FEDAVG. For identical clients, FEDAVG coincides with parallel SGD, for which Zinkevich et al. (2010) derived an asymptotic convergence analysis. Sharper and more refined analyses of the same method, sometimes called local SGD, were provided for identical functions by Stich (2019), and more recently by Stich and Karimireddy (2019), Patel and Dieuleveut (2019), Khaled et al. (2020), and Woodworth et al. (2020b). Their analysis was extended to heterogeneous clients in (Wang et al., 2019; Yu et al., 2019b; Karimireddy et al., 2020; Khaled et al., 2020; Koloskova et al., 2020).

FedAvg: Hsu et al. (2019) and Wang et al. (2020c) observed that using server momentum significantly improves over vanilla FEDAVG. This idea was generalized by Reddi et al. (2020), who replaced the server update with an arbitrary optimizer, e.g. Adam. However, these methods only modify the server update while using SGD for the client updates. MIME, on the other hand, ensures that every local client update resembles the optimizer: e.g. MIME would apply momentum in every client update and not just at the server level. Beyond this, Li et al. (2018) proposed to add a regularizer to ensure client updates remain close. However, its usefulness is unclear (cf. Fig. 5, Karimireddy et al., 2020; Wang et al., 2020b). Other orthogonal directions which can be combined with MIME include tackling computation heterogeneity, where some clients perform many more updates than others (Wang et al., 2020b), improving fairness by modifying the objective (Mohri et al., 2019; Li et al., 2019), incorporating differential privacy (Geyer et al., 2017; Agarwal et al., 2018; Thakkar et al., 2020), Byzantine adversaries (Pillutla et al., 2019; Wang et al., 2020a; He et al., 2020a), secure aggregation (Bonawitz et al., 2017; He et al., 2020b), etc. We refer the reader to the extensive survey by Kairouz et al. (2019) for additional discussion.
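To illustrate the distinction drawn above, here is a sketch of server-only momentum in the style of these methods (illustrative; names and defaults are ours, not any specific paper's pseudocode). Momentum is applied once per round to the averaged pseudo-gradient, whereas MIME applies the momentum term inside every local client step:

```python
import numpy as np

def server_momentum_round(x_server, m, avg_client_delta, lr=1.0, beta=0.9):
    """Illustrative server-side momentum round: clients run plain local
    SGD; momentum enters only once per round, applied by the server to
    the averaged client update (pseudo-gradient)."""
    m = beta * m + avg_client_delta   # momentum over round-level deltas
    x_server = x_server - lr * m      # single momentum-corrected step
    return x_server, m
```

Because the client updates themselves are plain SGD here, the momentum statistic cannot counteract drift during the local steps, which is precisely the gap MIME targets.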

