FEDDA: FASTER FRAMEWORK OF LOCAL ADAPTIVE GRADIENT METHODS VIA RESTARTED DUAL AVERAGING

Abstract

Federated learning (FL) is an emerging learning paradigm for tackling massively distributed data. In FL, a set of clients jointly perform a machine learning task under the coordination of a server. The FedAvg algorithm is one of the most widely used methods for solving FL problems. In FedAvg, the learning rate is a constant rather than changing adaptively. Adaptive gradient methods show superior performance over a constant learning-rate schedule; however, there is still no general framework for incorporating adaptive gradient methods into the federated setting. In this paper, we propose FedDA, a novel framework for local adaptive gradient methods. The framework adopts a restarted dual averaging technique and is flexible with respect to both the gradient estimation method and the adaptive learning rate formulation. In particular, we analyze FedDA-MVR, an instantiation of our framework, and show that it achieves gradient complexity Õ(ϵ^{-1.5}) and communication complexity Õ(ϵ^{-1}) for finding an ϵ-stationary point. This matches the best known rate for first-order FL algorithms, and FedDA-MVR is the first adaptive FL algorithm to achieve this rate. We also perform extensive numerical experiments to verify the efficacy of our method.

1. INTRODUCTION

Federated Learning denotes the process in which a set of distributed clients jointly perform a machine learning task, over their privately held data, under the coordination of a central server. A widely used method in FL is the FedAvg (Local-SGD) McMahan et al. (2017) algorithm. As indicated by its name, FedAvg performs (stochastic) gradient descent steps on each client and averages the local states periodically. This method can be shown to converge Stich (2018); Haddadpour & Mahdavi (2019); Woodworth et al. (2020) when the distributions of the clients are homogeneous or have bounded heterogeneity. Recently, a large body of literature has focused on accelerating FedAvg. In particular, many works use momentum-based methods to accelerate FL, and significant progress has been made in this direction, with improved gradient complexity and communication complexity Das et al. (2020); Karimireddy et al. (2019a); Khanduri et al. (2021a). However, another important category of methods, adaptive gradient methods, has received much less attention, and there is still no general framework for incorporating adaptive gradient methods into the federated setting.

Adaptive gradient methods such as Adagrad Duchi et al. (2011), Adam Kingma & Ba (2014), and AMSGrad Reddi et al. (2018) are widely used in the non-distributed setting. Gradient descent uses either a fixed learning rate or a fixed learning rate schedule. In contrast, adaptive gradient methods set the learning rate to be inversely proportional to the magnitude of the gradient, which incorporates the local curvature structure of the problem. Adaptive gradient methods perform well in practice; they also enjoy useful theoretical properties that allow them to outperform vanilla gradient descent Duchi et al. (2011); Guo et al. (2021). For example, a recent study Staib et al. (2019) showed that adaptive gradients help escape saddle points. Furthermore, some studies Loshchilov & Hutter (2018); Chen et al. (2018a) showed that adaptive gradients improve the generalization performance of the model. Adaptive gradient methods can be viewed as a type of generalized mirror descent Huang et al. (2021), where the associated mirror map is defined according to the adaptive learning rates. However,
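The FedAvg scheme described above (local gradient steps on each client, followed by periodic server-side averaging) can be sketched as follows. This is a minimal illustration on toy quadratic client objectives, not the FedDA method proposed in this paper; the function names and the toy problem are ours.

```python
import numpy as np

def fedavg(grad_fns, w0, lr=0.1, rounds=50, local_steps=5):
    """Minimal FedAvg sketch: each client runs `local_steps` gradient
    steps starting from the shared model, then the server averages the
    resulting local states (one communication round)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(rounds):
        local_models = []
        for grad in grad_fns:                 # one gradient oracle per client
            w_local = w.copy()
            for _ in range(local_steps):
                w_local -= lr * grad(w_local)  # local (stochastic) gradient step
            local_models.append(w_local)
        w = np.mean(local_models, axis=0)      # periodic server averaging
    return w

# Toy heterogeneous clients: client i minimizes 0.5 * ||w - c_i||^2,
# so the global minimizer is the mean of the client optima c_i.
centers = [np.array([1.0, 0.0]), np.array([3.0, 2.0])]
grads = [lambda w, c=c: w - c for c in centers]
w_star = fedavg(grads, np.zeros(2))
```

On this toy problem the iterates converge to the average of the client optima, [2.0, 1.0]; with genuinely heterogeneous and nonconvex objectives, the convergence guarantees cited above require bounded heterogeneity.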
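To make the adaptivity concrete, the following is a sketch of a single-machine Adagrad-style update, where the per-coordinate step size is inversely proportional to the square root of the accumulated squared gradient magnitude. This is the textbook Adagrad rule of Duchi et al. (2011) in its simplest diagonal form, shown for illustration only.

```python
import numpy as np

def adagrad(grad, w0, lr=1.0, steps=100, eps=1e-8):
    """Diagonal Adagrad sketch: the effective step size for each
    coordinate, lr / (sqrt(G) + eps), shrinks as the accumulated
    squared gradient G for that coordinate grows."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                       # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        G += g * g
        w -= lr * g / (np.sqrt(G) + eps)       # adaptive, coordinate-wise step
    return w

# Example: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w_min = adagrad(lambda w: w, np.array([5.0, -3.0]), steps=200)
```

Coordinates with persistently large gradients thus receive small steps and vice versa, which is the sense in which the method adapts to the local curvature of the problem.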

