MIME: MIMICKING CENTRALIZED STOCHASTIC ALGORITHMS IN FEDERATED LEARNING

Abstract

Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients. This heterogeneity has been shown to cause client drift, which can significantly degrade the performance of algorithms designed for the FL setting. In contrast, centralized learning with centrally collected data is unaffected by such drift and has seen great empirical and theoretical progress, with innovations such as momentum and adaptivity. In this work, we propose a general algorithmic framework, MIME, which mitigates client drift and adapts arbitrary centralized optimization algorithms, such as SGD and Adam, to the federated learning setting. MIME uses a combination of control variates and server-level statistics (e.g., momentum) at every client-update step to ensure that each local update mimics that of the centralized method run on i.i.d. data. Our thorough theoretical and empirical analyses strongly establish MIME's superiority over other baselines.

1. INTRODUCTION

Federated learning has become an important paradigm in large-scale machine learning where the training data remains distributed over a large number of clients, which may be mobile phones or network sensors (Konečný et al., 2016a;b; McMahan et al., 2017; Mohri et al., 2019; Kairouz et al., 2019). A centralized model, here referred to as a server model, is then trained without ever transmitting client data over the network, thereby providing a basic level of data privacy and security. Two important settings are distinguished in federated learning (Kairouz et al., 2019, Table 1): the cross-silo and the cross-device settings. The cross-silo setting corresponds to a relatively small number of reliable clients, typically organizations such as medical or financial institutions. In contrast, in the cross-device setting, the number of clients may be extremely large and include, for example, all 3.5 billion active Android phones (Holst, 2019). Thus, in that setting, we may never make even a single pass over the entire clients' data during training. The cross-device setting is further characterized by resource-poor clients communicating over a highly unreliable network. Together, these essential features give rise to unique challenges not present in the cross-silo setting. Here, we are interested in the cross-device setting, for which we formalize and study stochastic optimization algorithms.

The de facto standard algorithm for this setting is FEDAVG (McMahan et al., 2017), which performs multiple SGD updates on the available clients before communicating to the server. While this approach can reduce the total amount of communication required, performing multiple steps on the same client can lead to 'over-fitting' to its atypical local data, a phenomenon known as client drift (Karimireddy et al., 2020).
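To make the FEDAVG round concrete, the following is a minimal sketch of multiple local SGD steps followed by server-side averaging. The function names, the quadratic example objective, and all hyperparameter values are our own illustrative choices, not from the paper:

```python
import numpy as np

def local_sgd(w_server, data, grad_fn, lr=0.1, local_steps=5):
    """One client's contribution in FedAvg: several SGD steps starting
    from the server model, using only the client's own (possibly
    non-iid) data. Running many steps on skewed local data is what
    causes client drift."""
    w = w_server.copy()
    for _ in range(local_steps):
        x = data[np.random.randint(len(data))]  # sample a local example
        w -= lr * grad_fn(w, x)
    return w

def fedavg_round(w_server, clients, grad_fn, lr=0.1, local_steps=5):
    """One communication round: each client updates locally, then the
    server averages the resulting models."""
    updates = [local_sgd(w_server, d, grad_fn, lr, local_steps)
               for d in clients]
    return np.mean(updates, axis=0)
```

With a simple quadratic loss 0.5‖w − x‖², two clients whose data have opposite means pull the local models in opposite directions, yet the averaged server model still converges toward the global mean over many rounds.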
Furthermore, algorithmic innovations such as momentum (Sutskever et al., 2013; Cutkosky and Orabona, 2019), adaptivity (Kingma and Ba, 2014; Zaheer et al., 2018; Zhang et al., 2019), and clipping (You et al., 2017; 2019; Zhang et al., 2020) are critical to the success of deep learning applications and need to be incorporated into the client updates, replacing the SGD update of FEDAVG. Perhaps due to such deficiencies, there exists a large gap in performance between the centralized setting, where data is centrally collected on the server, and the federated setting (Zhao et al., 2018; Hsieh et al., 2019; Hsu et al., 2019; Karimireddy et al., 2020). To overcome such deficiencies, we propose a new framework, MIME, that mitigates client drift and adapts arbitrary centralized optimization algorithms, e.g. SGD with momentum or Adam, to the federated setting. In each local client update, MIME uses global statistics, e.g. momentum, and a control-variate correction so that each local update mimics that of the centralized method run on i.i.d. data.
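As an illustration of the mechanism just described, the sketch below shows one possible MIME-style local update for SGD with momentum: the server momentum and a control variate (the full gradient at the server model) are computed once per round and held fixed during the client's local steps. This is our own hedged reconstruction from the description above; the function names and hyperparameters are hypothetical, not the paper's implementation:

```python
import numpy as np

def mime_client_update(w_server, m_server, c_server, data, grad_fn,
                       lr=0.1, beta=0.9, local_steps=5):
    """Sketch of a MIME-style local update (illustrative, not the
    paper's exact algorithm).

    m_server: server-level momentum, FROZEN during local steps.
    c_server: full gradient at w_server (control variate), also frozen.

    Each local step corrects the client's stochastic gradient with an
    SVRG-style control variate, g_i(w) - g_i(w_server) + c_server, and
    then applies the frozen server momentum, mimicking one step of
    centralized momentum SGD on i.i.d. data."""
    w = w_server.copy()
    for _ in range(local_steps):
        x = data[np.random.randint(len(data))]  # local stochastic sample
        g = grad_fn(w, x) - grad_fn(w_server, x) + c_server
        w -= lr * ((1 - beta) * g + beta * m_server)
    return w
```

Note that for a quadratic loss the correction cancels the sampling noise exactly (g no longer depends on x), which is the variance-reduction effect the control variate is meant to provide.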

