FEDDA: FASTER FRAMEWORK OF LOCAL ADAPTIVE GRADIENT METHODS VIA RESTARTED DUAL AVERAGING

Abstract

Federated learning (FL) is an emerging learning paradigm for tackling massively distributed data, in which a set of clients jointly perform a machine learning task under the coordination of a server. The FedAvg algorithm is one of the most widely used methods for solving FL problems. In FedAvg, the learning rate is constant rather than changing adaptively. Adaptive gradient methods show superior performance over constant learning rate schedules; however, there is still no general framework for incorporating adaptive gradient methods into the federated setting. In this paper, we propose FedDA, a novel framework for local adaptive gradient methods. The framework adopts a restarted dual averaging technique and is flexible with respect to both the gradient estimation method and the adaptive learning rate formulation. In particular, we analyze FedDA-MVR, an instantiation of our framework, and show that it achieves gradient complexity Õ(ϵ^{-1.5}) and communication complexity Õ(ϵ^{-1}) for finding an ϵ-stationary point. This matches the best known rate for first-order FL algorithms, and FedDA-MVR is the first adaptive FL algorithm to achieve it. We also perform extensive numerical experiments to verify the efficacy of our method.

1. INTRODUCTION

Federated Learning denotes the process in which a set of distributed clients jointly perform a machine learning task over their privately held data under the coordination of a central server. A widely used method in FL is the FedAvg (Local SGD) algorithm McMahan et al. (2017). As its name indicates, FedAvg performs (stochastic) gradient descent steps on each client and averages the local states periodically. This method can be shown to converge Stich (2018); Haddadpour & Mahdavi (2019); Woodworth et al. (2020) when the client distributions are homogeneous or have bounded heterogeneity. Recently, a large body of literature has focused on accelerating FedAvg. In particular, many works use momentum-based methods to accelerate FL, and significant progress has been made in this direction with improved gradient and communication complexities Das et al. (2020); Karimireddy et al. (2019a); Khanduri et al. (2021a). However, another important category of methods, adaptive gradient methods, has received much less attention, and there is still no general framework for incorporating them into the federated setting. Adaptive gradient methods such as Adagrad Duchi et al. (2011), Adam Kingma & Ba (2014), and AMSGrad Reddi et al. (2018) are widely used in the non-distributed setting. The gradient descent method uses either a fixed learning rate or a fixed learning rate schedule. In contrast, adaptive gradient methods set the learning rate to be inversely proportional to the magnitude of the gradient, which incorporates the local curvature structure of the problem. Adaptive gradient methods perform well in practice; they also enjoy useful theoretical properties that make them outperform the vanilla gradient descent method Duchi et al. (2011); Guo et al. (2021). For example, a recent study Staib et al. (2019) showed that adaptive gradients help escape saddle points.
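To make the contrast with a fixed learning rate concrete, the following minimal sketch shows an AdaGrad-style coordinate-wise update; the function name, base learning rate, and toy gradients are our illustrative choices, not taken from any of the cited papers:

```python
import numpy as np

def adagrad_step(x, grad, accum, base_lr=0.1, eps=1e-8):
    # Accumulate squared gradients; coordinates with large historical
    # gradient magnitudes get a proportionally smaller effective step.
    accum = accum + grad ** 2
    x = x - base_lr * grad / (np.sqrt(accum) + eps)
    return x, accum

x, accum = np.zeros(2), np.zeros(2)
g = np.array([10.0, 0.1])
x1, accum = adagrad_step(x, g, accum)   # first step
x2, accum = adagrad_step(x1, g, accum)  # repeated steps in the same direction shrink
```

Note how the per-coordinate scaling equalizes the first step across coordinates with very different gradient magnitudes, then damps coordinates whose gradients stay large.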
Furthermore, some studies Loshchilov & Hutter (2018); Chen et al. (2018a) showed that adaptive gradients improve the generalization performance of the model. Adaptive gradient methods can be viewed as a type of generalized mirror descent Huang et al. (2021), where the associated mirror map is defined by the adaptive learning rates. However, the mirror map is dynamic and changes at every training step. As a special case, the gradient descent method can be viewed as a mirror descent method whose mirror map is the L2 distance function. Following the convention in the mirror descent literature, we refer to the parameter space as the primal space and the gradient space as the dual space. The primal and dual spaces differ in adaptive gradient methods, but they coincide in the gradient descent method. We can exploit this primal-dual view to understand existing FL algorithms and to design new ones. FedAvg in fact exploits the usefulness of averaging dual states: the gradient average approximates the true gradient evaluated at the average of the client states, and the approximation error is upper-bounded by the difference between client states; therefore, clients can perform multiple local steps without communication. Although FedAvg does not need to differentiate between the primal and dual spaces, since they coincide, the dual-state average and the primal-state average are not equivalent for adaptive gradient methods.

Table 1:
Algorithm | Gc(f, ϵ) | Cc(f, ϵ) | State | Local-Adaptive | Constrained
FedAvg McMahan et al. (2017) | O(ϵ^{-2}) | O(ϵ^{-1.5}) | Primal/Dual | ✗ | ✗
FedCM Khanduri et al. (2021a) | Õ(ϵ^{-1.5}) | Õ(ϵ^{-1}) | Primal/Dual | ✗ | ✗
STEM Khanduri et al. (2021a) | Õ(ϵ^{-1.5}) | Õ(ϵ^{-1}) | Primal/Dual | ✗ | ✗
FedAdam Reddi et al. (2020) | O(ϵ^{-2}) | O(ϵ^{-1}) | Primal | ✗ | ✗
Local-AMSGrad Chen et al. (2020b) | O(ϵ^{-2}) | O(ϵ^{-1.5}) | Primal | ✓ | ✗
MIME-MVR Karimireddy et al. (2020a) | Õ(ϵ^{-1.5}) | O(ϵ^{-1.5}) | Primal | ✓ | ✗
FedDA-MVR (Ours) | Õ(ϵ^{-1.5}) | Õ(ϵ^{-1}) | Dual | ✓ | ✓
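The non-equivalence of primal-state and dual-state averaging under adaptive scalings can be seen in a small numerical example. The diagonal preconditioners `D1`, `D2`, and `D` below are hypothetical values chosen purely for illustration, standing in for per-client and shared adaptive learning rates:

```python
import numpy as np

# Two clients hold dual states (accumulated gradients) z1 and z2.
z1, z2 = np.array([4.0, 1.0]), np.array([2.0, 3.0])

# If each client builds its OWN adaptive scaling (its own mirror map),
# the local dual spaces are not aligned, and averaging primal states
# mixes incompatible geometries.
D1, D2 = np.array([2.0, 1.0]), np.array([1.0, 2.0])
primal_avg = 0.5 * (-z1 / D1 + -z2 / D2)

# Averaging dual states and mapping back through ONE shared scaling D
# (a single mirror map for all clients) gives a different iterate.
D = np.array([1.5, 1.5])
dual_avg_primal = -(0.5 * (z1 + z2)) / D
```

The two aggregation rules disagree whenever the per-client scalings differ, which is the primal-dual nuance discussed above.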
Current federated adaptive gradient methods in the literature either perform adaptive gradient steps only on the server side, or ignore this primal-dual nuance when supporting local adaptive gradient steps. In an early work, Reddi et al. (2020) proposed applying adaptive gradients in the server-averaging step while performing ordinary gradient descent updates locally. This method is simple to implement and performs better than FedAvg, but the adaptive information is not exploited during local updates, which weakens the benefit of adaptive gradients. Recently, some works Karimireddy et al. (2020a); Chen et al. (2020b) exploited adaptive information during local update steps; however, a common characteristic of these methods is that they average the primal states (parameters) during the synchronization step. This causes two problems. First, since the adaptive learning rates define the mirror map, updating them locally leaves the dual spaces misaligned, so we cannot average the primal states directly. Second, even if the adaptive learning rates are fixed locally, the primal space might be nonlinear w.r.t. the dual space, e.g., when we solve a constrained optimization problem. In summary, we propose two principles for applying adaptive gradients in FL: first, the local dual spaces should be aligned with each other; second, we should average dual states. More specifically, we propose the FL adaptive gradient framework FedDA, short for Federated Dual-averaging Adaptive-gradient. In each global round of FedDA, the clients aggregate gradients (dual states) locally, and the server averages the clients' dual states in the synchronization step. Local weights (primal states) are used as gradient query points in local updates and are recovered through the inverse mirror map (defined by the adaptive gradients). The global primal state is updated on the basis of the averaged dual states and the inverse mirror map.
In addition, we utilize a restarting technique to make sure that all clients share the same dual space during local updates; more precisely, we refresh the adaptive gradients at every global epoch and use a fixed one in the local updates. Our FedDA framework is general and can incorporate a large family of adaptive gradient methods into the FL setting. In particular, FedDA-MVR, an instantiation of our framework, achieves the best-known gradient complexity and communication complexity in the FL setting. Finally, we highlight our contributions as follows:
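The round structure described above (freeze the mirror map per round, aggregate dual states locally, average dual states at the server, then restart) can be sketched as follows. This is a simplified illustration under an AdaGrad-type diagonal preconditioner; the function name, update rule details, and restart statistics are our assumptions, not the paper's exact algorithm:

```python
import numpy as np

def fedda_round(x_global, client_grads, accum_sq, base_lr=0.1, eps=1e-8):
    # Restart: freeze the adaptive mirror map at the start of the round,
    # so every client works in the same (aligned) dual space.
    D = np.sqrt(accum_sq) + eps
    dual_states = []
    for grads in client_grads:               # grads: one client's local gradients
        z = np.zeros_like(x_global)
        for g in grads:                      # local dual averaging: aggregate grads
            z = z + g
            x_local = x_global - base_lr * z / D  # primal query point via inverse map
        dual_states.append(z)
    z_avg = np.mean(dual_states, axis=0)     # server averages DUAL states
    x_global = x_global - base_lr * z_avg / D     # recover the global primal state
    accum_sq = accum_sq + z_avg ** 2         # refresh statistics for the next restart
    return x_global, accum_sq
```

Here the preconditioner `D` is only updated between rounds, so local steps never desynchronize the clients' dual spaces, which is exactly the alignment principle stated above.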



Table 1: Comparison of representative Federated Learning algorithms for finding an ϵ-stationary point of the objective in equation 1, i.e., ∥∇f(x)∥² ≤ ϵ, or its equivalent variants. Gc(f, ϵ) denotes the number of gradient queries w.r.t. f^(k)(x) for k ∈ [K]; Cc(f, ϵ) denotes the number of communication rounds; State indicates which state the algorithm maintains locally (Primal/Dual); Local-Adaptive indicates whether the algorithm performs adaptive gradient descent locally; Constrained indicates whether the algorithm can solve both constrained and unconstrained problems. The first three algorithms are not adaptive gradient methods; the last four support some form of adaptive gradients.

