FEDEXP: SPEEDING UP FEDERATED AVERAGING VIA EXTRAPOLATION

Abstract

Federated Averaging (FedAvg) remains the most popular algorithm for Federated Learning (FL) optimization due to its simple implementation, stateless nature, and privacy guarantees when combined with secure aggregation. Recent work has sought to generalize the vanilla averaging in FedAvg to a gradient descent step at the server by treating client updates as pseudo-gradients and introducing a server step size. While the server step size has been shown to improve performance in theory, its practical benefit has not been observed in most existing works. In this work, we present FedExP, a method to adaptively determine the server step size in FL based on the dynamically varying pseudo-gradients throughout the FL process. We begin by considering the overparameterized convex regime, where we reveal an interesting similarity between FedAvg and the Projection Onto Convex Sets (POCS) algorithm. We then show how FedExP can be motivated as a novel extension of the extrapolation mechanism used to speed up POCS. Our theoretical analysis also discusses the implications of FedExP in underparameterized and non-convex settings. Experimental results show that FedExP consistently converges faster than FedAvg and competing baselines on a range of realistic FL datasets.

1. INTRODUCTION

Federated Learning (FL) has emerged as a key distributed learning paradigm in which a central server orchestrates the training of a machine learning model across a network of devices. FL is built on the fundamental premise that data never leaves a client's device; clients only communicate model updates to the server. Federated Averaging (FedAvg), first introduced by McMahan et al. (2017), remains the most popular algorithm in this setting due to the simplicity of its implementation, its stateless nature (i.e., clients do not maintain local parameters during training), and its ability to incorporate privacy-preserving protocols such as secure aggregation (Bonawitz et al., 2016; Kadhe et al., 2020).

Slowdown Due to Heterogeneity. One of the most persistent problems in FedAvg is the slowdown in model convergence due to data heterogeneity across clients. In FedAvg, clients usually perform multiple steps of gradient descent on their heterogeneous local objectives before communicating with the server, which leads to what is colloquially known as client drift error (Karimireddy et al., 2019). The effect of heterogeneity is further exacerbated by the constraint that only a fraction of the total number of clients may be available for training in every round (Kairouz et al., 2021).
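To make the client-drift phenomenon concrete, the following sketch simulates one FedAvg round on two clients with heterogeneous quadratic objectives. The objectives, step sizes, and constants here are illustrative choices, not from the paper: with multiple local steps, each client moves toward its own local minimizer, so the averaged model need not track the global minimizer.

```python
import numpy as np

def local_sgd(w, c, lr=0.1, tau=10):
    """Run tau local gradient steps on the local objective f(w) = 0.5 * ||w - c||^2."""
    for _ in range(tau):
        w = w - lr * (w - c)  # gradient of the local quadratic is (w - c)
    return w

# Heterogeneous clients: each local minimizer c_i is different.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w_global = np.zeros(2)

# One FedAvg round: local training followed by vanilla averaging.
local_models = [local_sgd(w_global.copy(), c) for c in centers]
w_global = np.mean(local_models, axis=0)

# The global minimizer of the averaged objective is mean(centers) = [0.5, 0.5];
# after one round of drifted local updates, the average falls short of it.
print(w_global)
```

Each local model has drifted toward its own center, and their average lies on the segment toward [0.5, 0.5] without reaching it in one round.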

Server Step Size. Recent work has sought to address this slowdown by using two separate step sizes in FedAvg: a client step size used by the clients to minimize their local objectives, and a server step size used by the server to update the global model by treating client updates as pseudo-gradients (Karimireddy et al., 2019; Reddi et al., 2021). To achieve the fastest convergence rate, these works propose keeping the client step size at O(1/(τ√T)) and the server step size at O(√(τM)), where T is the number of communication rounds, τ is the number of local steps, and M is the number of clients. A small client step size mitigates client drift, and a large server step size prevents global slowdown. While this idea may be asymptotically optimal, it is not always effective in practical non-asymptotic and communication-limited settings (Charles & Konečný, 2020). In practice, a small client step size severely slows down convergence in the initial rounds and cannot be fully compensated for by a large server step size (see Figure 1). Moreover, if local objectives differ significantly, it may be beneficial to use a smaller server step size (Malinovsky et al., 2022). We therefore seek to answer the following question: for a moderate client step size, can we adapt the server step size according to the local progress made by the clients and the heterogeneity of their objectives? Answering this question is challenging in general, because it is difficult to measure the heterogeneity between local objectives and to use that knowledge appropriately when adapting the server step size.

Our Contributions. In this paper, we take a novel approach to address the question posed above. We begin by considering the case where the models are overparameterized, i.e., the number of model parameters is larger than the total number of data points across all clients.
This is often true for modern deep neural network models (Zhang et al., 2017; Jacot et al., 2018) and the small datasets collected by edge clients in the FL setting. In this overparameterized regime, the global minimizer becomes a common minimizer for all local objectives, even though they may be arbitrarily heterogeneous. Using this fact, we establish a novel connection between FedAvg and the Projection Onto Convex Sets (POCS) algorithm, which finds a point in the intersection of a collection of convex sets. Based on this connection, we find an interesting analogy between the server step size and the extrapolation parameter used to speed up POCS (Pierra, 1984). We propose new extensions to the extrapolated POCS algorithm that support inexact and noisy projections, as in FedAvg. In particular, we derive a time-varying bound on the progress made by clients towards the global minimum and show how this bound can be used to adaptively estimate a good server step size at each round. The result is our proposed algorithm FedExP, which adaptively determines the server step size in each round of FL based on the pseudo-gradients in that round. Although motivated by the overparameterized regime, FedExP performs well (both theoretically and empirically) in the general case, where the model can be either overparameterized or underparameterized. For this general case, we derive convergence upper bounds for both convex and non-convex objectives. Some highlights of our work are as follows.

• We reveal a novel connection between FedAvg and the POCS algorithm for finding a point in the intersection of convex sets.

• The proposed FedExP algorithm is simple to implement, with virtually no additional communication, computation, or storage required at clients or the server. It is well suited for both cross-device and cross-silo FL, and is compatible with partial client participation.
• Experimental results show that FedExP converges 1.4-2× faster than FedAvg and most competing baselines on standard FL tasks. 
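As a concrete illustration of the server-step-size idea, the sketch below implements the generalized server update, treating the averaged client updates as a pseudo-gradient. The adaptive rule shown (a ratio of client-update norms, with `eps` as a hypothetical stabilizer) is an extrapolation-style heuristic in the spirit described above, not necessarily the exact rule derived in the paper.

```python
import numpy as np

def server_update(w_global, client_updates, eps=1e-12):
    """Generalized FedAvg server step: w <- w + eta * mean(client_updates)."""
    deltas = np.stack(client_updates)    # (M, d) array of client pseudo-gradients
    avg_delta = deltas.mean(axis=0)
    # Extrapolation-style heuristic (illustrative, not the paper's exact rule):
    # take a larger server step when client updates point in diverse directions,
    # i.e., when averaging shrinks the update norm. Never go below eta = 1,
    # which recovers vanilla FedAvg averaging.
    mean_sq_norm = np.mean(np.sum(deltas**2, axis=1))
    eta = max(1.0, mean_sq_norm / (2.0 * np.sum(avg_delta**2) + eps))
    return w_global + eta * avg_delta, eta
```

With identical client updates this rule reduces to eta = 1 (vanilla averaging); the more the client updates disagree, the larger the extrapolated server step.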



Figure 1: Test accuracy (%) achieved by different server and client step sizes on EMNIST dataset (Cohen et al., 2017) after 50 rounds (details of experimental setup are in Section 6 and Appendix D).

Related Work. Popular algorithms for adaptively tuning the step size when training neural networks include Adagrad (Duchi et al., 2011) and its variants RMSProp (Tieleman et al., 2012) and Adadelta (Zeiler, 2012). These algorithms adopt a notion of coordinate-wise adaptivity, adapting the step size separately for each dimension of the parameter vector based on the magnitude of the accumulated gradients. While these algorithms can be extended to the federated setting using the concept of pseudo-gradients, as done by Reddi et al. (2021), such extensions are agnostic to the inherent data heterogeneity across clients, which is central to FL. In contrast, FedExP is explicitly designed for FL settings and uses a client-centric notion of adaptivity that exploits the heterogeneity of client updates in each round. The work closest to ours is Johnson et al. (2020), which proposes a method to adapt the step size for large-batch training by estimating the gradient diversity (Yin et al., 2018) of a minibatch; this result was recently improved by Horváth et al. (2022). However, both Johnson et al. (2020) and Horváth et al. (2022) focus on the centralized setting. In
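For reference, gradient diversity (Yin et al., 2018) is the ratio of the sum of squared per-gradient norms to the squared norm of the summed gradient. The sketch below (with an assumed `eps` guard for the zero-sum case) computes this quantity for a batch of gradients; it only illustrates the estimator itself, not the step-size schemes built on it.

```python
import numpy as np

def gradient_diversity(grads, eps=1e-12):
    """Gradient diversity of a batch: sum_i ||g_i||^2 / ||sum_i g_i||^2."""
    g = np.stack(grads)                   # (n, d) per-example gradients
    num = np.sum(g**2)                    # sum of squared gradient norms
    den = np.sum(g.sum(axis=0)**2) + eps  # squared norm of the summed gradient
    return num / den

# Identical gradients give the minimum diversity 1/n;
# mutually orthogonal gradients give diversity 1.
```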

