FEDEXP: SPEEDING UP FEDERATED AVERAGING VIA EXTRAPOLATION

Abstract

Federated Averaging (FedAvg) remains the most popular algorithm for Federated Learning (FL) optimization due to its simple implementation, stateless nature, and privacy guarantees when combined with secure aggregation. Recent work has sought to generalize the vanilla averaging in FedAvg to a gradient descent step at the server by treating client updates as pseudo-gradients and applying a server step size. While the server step size has been shown to provide performance improvements theoretically, its practical benefit has not been observed in most existing works. In this work, we present FedExP, a method to adaptively determine the server step size in FL based on dynamically varying pseudo-gradients throughout the FL process. We begin by considering the overparameterized convex regime, where we reveal an interesting similarity between FedAvg and the Projection Onto Convex Sets (POCS) algorithm. We then show how FedExP can be motivated as a novel extension of the extrapolation mechanism that is used to speed up POCS. Our theoretical analysis also discusses the implications of FedExP in underparameterized and non-convex settings. Experimental results show that FedExP consistently converges faster than FedAvg and competing baselines on a range of realistic FL datasets.

1. INTRODUCTION

Federated Learning (FL) has emerged as a key distributed learning paradigm in which a central server orchestrates the training of a machine learning model across a network of devices. FL is based on the fundamental premise that data never leaves a client's device, as clients only communicate model updates with the server. Federated Averaging (FedAvg), first introduced by McMahan et al. (2017), remains the most popular algorithm in this setting due to the simplicity of its implementation, its stateless nature (i.e., clients do not maintain local parameters during training), and its ability to incorporate privacy-preserving protocols such as secure aggregation (Bonawitz et al., 2016; Kadhe et al., 2020).

Slowdown Due to Heterogeneity. One of the most persistent problems in FedAvg is the slowdown in model convergence due to data heterogeneity across clients. In FedAvg, clients usually perform multiple steps of gradient descent on their heterogeneous local objectives before communicating with the server, which leads to what is colloquially known as client drift error (Karimireddy et al., 2019). The effect of heterogeneity is further exacerbated by the constraint that only a fraction of the total number of clients may be available for training in every round (Kairouz et al., 2021).
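A minimal numerical sketch of this client drift effect (a toy example of our own, not an experiment from the paper): two clients hold scalar quadratics with different curvatures, so the fixed point that vanilla FedAvg converges to drifts away from the true global minimizer as the number of local steps grows. All names and constants here are illustrative assumptions.

```python
import numpy as np

# Two clients with scalar quadratics f_i(w) = 0.5 * a_i * (w - c_i)^2.
# The minimizer of f_1 + f_2 is the curvature-weighted mean of the c_i.
a = np.array([1.0, 10.0])          # heterogeneous curvatures (assumed)
c = np.array([0.0, 1.0])           # heterogeneous local optima (assumed)
w_star = (a * c).sum() / a.sum()   # true global minimizer = 10/11

def fedavg_fixed_point(tau, eta_l=0.05, rounds=2000):
    """Run vanilla FedAvg (plain averaging, no server step size) to its fixed point."""
    w = 0.0
    for _ in range(rounds):
        local_models = []
        for a_i, c_i in zip(a, c):
            w_i = w
            for _ in range(tau):              # tau local gradient descent steps
                w_i -= eta_l * a_i * (w_i - c_i)
            local_models.append(w_i)
        w = float(np.mean(local_models))      # vanilla averaging at the server
    return w

# With one local step FedAvg tracks the true minimizer; with many local
# steps its fixed point drifts toward the unweighted mean of the c_i.
print(fedavg_fixed_point(tau=1), fedavg_fixed_point(tau=50), w_star)
```

In this sketch, increasing `tau` from 1 to 50 moves the converged model visibly away from `w_star`, which is exactly the client drift phenomenon described above.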



Various techniques have been proposed to combat this slowdown, among the most popular being variance reduction techniques such as those of Karimireddy et al. (2019); Mishchenko et al. (2022); Mitra et al. (2021), but they either make clients stateful, add extra computation or communication requirements, or have privacy limitations.

Server Step Size. Recent work has sought to deal with this slowdown by using two separate step sizes in FedAvg: a client step size used by the clients to minimize their local objectives, and a server step size used by the server to update the global model by treating client updates as pseudo-gradients (Karimireddy et al., 2019; Reddi et al., 2021). To achieve the fastest convergence rate, these works propose keeping the client step size as O(1/(τ√T)) and the server step size as O(√(τM)), where T is the number of communication rounds, τ is the number of local steps, and M is the number of clients. Using a small client step size mitigates client drift, and a large server

