ADAPTIVE FEDERATED OPTIMIZATION

Abstract

Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FEDAVG) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including ADAGRAD, ADAM, and YOGI, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.

1. INTRODUCTION

Federated learning (FL) is a machine learning paradigm in which multiple clients cooperate to learn a model under the orchestration of a central server (McMahan et al., 2017). In FL, raw client data is never shared with the server or with other clients. This distinguishes FL from traditional distributed optimization and requires contending with heterogeneous data. FL has two primary settings: cross-silo (e.g., FL between large institutions) and cross-device (e.g., FL across edge devices) (Kairouz et al., 2019, Table 1). In cross-silo FL, most clients participate in every round and can maintain state between rounds. In the more challenging cross-device FL, our primary focus, only a small fraction of clients participate in each round, and clients cannot maintain state across rounds. For a more in-depth discussion of FL and the challenges involved, we defer to Kairouz et al. (2019) and Li et al. (2019a).

Standard optimization methods, such as distributed SGD, are often unsuitable in FL and can incur high communication costs. To remedy this, many federated optimization methods use local client updates, in which clients update their models multiple times before communicating with the server. This can greatly reduce the amount of communication required to train a model. One such method is FEDAVG (McMahan et al., 2017), in which clients perform multiple epochs of SGD on their local datasets. The clients communicate their models to the server, which averages them to form a new global model. While FEDAVG has seen great success, recent works have highlighted its convergence issues in some settings (Karimireddy et al., 2019; Hsu et al., 2019). This is due to a variety of factors, including (1) client drift (Karimireddy et al., 2019), where local client models move away from globally optimal models, and (2) a lack of adaptivity.
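To make the FEDAVG procedure concrete, the following is a minimal sketch of one communication round on a synthetic least-squares problem: each client runs several epochs of SGD on its local data, and the server averages the resulting client models, weighted by local dataset size. The function names (`local_sgd`, `fedavg_round`) and the toy problem are illustrative choices, not from the paper.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=5):
    """One client's update: several epochs of per-sample SGD on local data."""
    w = w.copy()
    for _ in range(epochs):
        for i in range(len(X)):
            grad = (X[i] @ w - y[i]) * X[i]  # least-squares gradient
            w -= lr * grad
    return w

def fedavg_round(w_global, client_data):
    """One FEDAVG round: broadcast, local training, weighted averaging."""
    models, sizes = [], []
    for X, y in client_data:
        models.append(local_sgd(w_global, X, y))
        sizes.append(len(X))
    # Server model = average of client models, weighted by local dataset size.
    return np.average(models, axis=0, weights=np.array(sizes, dtype=float))

# Synthetic setup: four clients holding noisy samples of the same linear model.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
clients = []
for _ in range(4):
    X = rng.normal(size=(20, 2))
    y = X @ w_true + 0.01 * rng.normal(size=20)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(10):  # ten communication rounds
    w = fedavg_round(w, clients)
```

Note that only model parameters cross the network; raw client data stays local, which is the defining constraint of FL.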
FEDAVG is similar in spirit to SGD, and may be unsuitable for settings with heavy-tailed stochastic gradient noise distributions, which often arise when training language models (Zhang et al., 2019a). Such settings benefit from adaptive learning rates, which incorporate knowledge of past iterations to perform more informed optimization.

In this paper, we focus on the second issue and present a simple framework for incorporating adaptivity in FL. In particular, we propose a general optimization framework in which (1) clients perform multiple epochs of training using a client optimizer to minimize the loss on their local data, and (2) the server updates its global model by applying a gradient-based server optimizer to the average of the clients' model updates. We show that FEDAVG is the special case in which SGD is used as both the client and server optimizer, with a server learning rate of 1. This framework can also seamlessly incorporate

