ANALYSIS OF ERROR FEEDBACK IN COMPRESSED FEDERATED NON-CONVEX OPTIMIZATION

Abstract

Communication cost between the clients and the central server can be a bottleneck in real-world Federated Learning (FL) systems. In classical distributed learning, Error Feedback (EF) has been a popular technique to remedy the downsides of biased gradient compression, but the literature on applying EF to FL is still limited. In this work, we propose a compressed FL scheme equipped with error feedback, named Fed-EF, with two variants depending on the global optimizer. We provide a theoretical analysis showing that Fed-EF matches the convergence rate of its full-precision FL counterparts for non-convex optimization under data heterogeneity. Moreover, we initiate the first analysis of EF under partial client participation, an important scenario in FL, and demonstrate that the convergence rate of Fed-EF exhibits an extra slow-down factor due to the "stale error compensation" effect. Experiments are conducted to validate the efficacy of Fed-EF in practical FL tasks and to justify our theoretical findings.

1. INTRODUCTION

Federated Learning (FL) has seen numerous applications in, e.g., computer vision, language processing, public health, and the Internet of Things (IoT) [19; 44; 62; 39; 49; 29; 25]. A centralized FL system consists of multiple clients, each with local data, and one central server that coordinates the training process. The goal of FL is for $n$ clients to collaboratively find a global model, parameterized by $\theta$, such that
$$\theta^* = \arg\min_{\theta \in \mathbb{R}^d} f(\theta) := \arg\min_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f_i(\theta),$$
where $f_i(\theta) := \mathbb{E}_{D \sim \mathcal{D}_i}[F_i(\theta; D)]$ is a non-convex loss function for the $i$-th client w.r.t. its local data distribution $\mathcal{D}_i$.

Our contributions. Despite the rich literature on EF in classical distributed training, it has not been well explored in the context of federated learning. In this paper, we provide a thorough analysis of EF in FL. In particular, three key features of FL, namely local steps, data heterogeneity, and partial participation, pose interesting questions regarding the performance of EF in federated learning: (i) Can EF still achieve the same convergence rate as full-precision FL algorithms, possibly with highly non-iid local data distributions? (ii) How does partial participation change the situation and the results? We present a new algorithm and results to address these questions:

• We study an FL framework with biased compression and error feedback, called Fed-EF, with two variants (Fed-EF-SGD and Fed-EF-AMS) depending on the global optimizer (SGD and the adaptive AMSGrad [46], respectively). Under data heterogeneity, Fed-EF has asymptotic convergence rate $O(1/\sqrt{TKn})$, where $T$ is the number of communication rounds, $K$ is the number of local training steps, and $n$ is the number of clients. Our new analysis matches the convergence rate of the full-precision FL counterparts, improving the previous convergence result [6] on error-compensated FL (see detailed comparisons in Section 3). Moreover, Fed-EF-AMS is the first compressed adaptive FL algorithm in the literature.
• Partial participation (PP) has not been considered for standard error feedback in distributed learning. We initiate a new analysis of Fed-EF in this setting, with local steps and non-iid data all considered at the same time. We prove that under PP, Fed-EF exhibits a slow-down factor of $n/m$ compared with the best full-precision rate, where $m$ is the number of active clients per round; this is caused mainly by the "stale error compensation" effect.

• Experiments are conducted to illustrate the effectiveness of the proposed methods. We show that Fed-EF matches the performance of full-precision FL with a significant reduction in communication, and compares favorably against algorithms using unbiased compression without error feedback. Numerical examples are also provided to justify our theory.
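To make the scheme concrete, the following is a minimal numerical sketch of one Fed-EF-SGD communication round with full participation, assuming a Top-K compressor on the client model updates. The function and variable names (`fed_ef_sgd_round`, `top_k`, the learning rates) are illustrative choices, not the paper's notation, and this is a sketch rather than a faithful implementation.

```python
import numpy as np

def top_k(v, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def fed_ef_sgd_round(theta, client_grads, errors, k, local_lr=0.1, global_lr=1.0, K=1):
    """One communication round of a Fed-EF-SGD sketch (full participation).

    client_grads: list of callables g_i(theta) returning client i's stochastic gradient.
    errors: list of per-client residuals kept locally across rounds (error feedback).
    """
    agg = np.zeros_like(theta)
    for i, g in enumerate(client_grads):
        local = theta.copy()
        for _ in range(K):                  # K local SGD steps
            local -= local_lr * g(local)
        update = theta - local              # local model update to be communicated
        corrected = update + errors[i]      # error feedback: add stored residual
        msg = top_k(corrected, k)           # transmit only the compressed update
        errors[i] = corrected - msg         # keep the new residual locally
        agg += msg
    theta = theta - global_lr * agg / len(client_grads)  # server applies averaged update
    return theta, errors
```

On a toy quadratic objective (each client minimizing the distance to its own center), repeated rounds drive the global model toward the average of the centers even though each message keeps only k coordinates.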



Unlike in classical distributed training, the local data distributions in FL ($\mathcal{D}_i$) are typically heterogeneous (non-iid) across clients. Another key feature of FL is partial client participation: only a fraction of clients are involved in each training round, which may also slow down the convergence of the global model [10; 12].

FL under compression. To overcome the main challenge of the communication bottleneck, several works have considered federated learning with compressed message passing. Examples include FedPaQ [47], FedCOM [18], and FedZip [40]. All these algorithms are built upon directly compressing the model updates communicated from clients to the server. In particular, [47; 18] proposed to use unbiased stochastic compressors such as stochastic quantization [3] and sparsification [57], and showed that, with considerable communication savings, applying unbiased compression in FL can approach the learning performance of uncompressed FL algorithms. However, unbiased (stochastic) compressors typically require additional computation (sampling), which is less efficient in real-world large training systems. Biased gradients/compressors, e.g., Top-K sparsification and sign-based compression, are also common in many applications [2]. In the classical distributed learning literature, it has been shown that directly updating with biased gradients may slow down convergence or even lead to divergence [28; 2]. A popular remedy is the so-called error feedback (EF) strategy [54]: in each iteration, the local worker sends a compressed gradient to the server and records the local compression error, which is subsequently used to adjust the gradient computed in the next iteration, conceptually "correcting the bias" due to compression. It is known that biased gradient compression with EF can achieve the same convergence rate as the full-precision counterparts [28; 35].
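The EF strategy described above can be sketched in a few lines. This is an illustrative snippet, not code from the cited works; `top_k` and `ef_step` are our own names, and a Top-K operator stands in for the generic biased compressor.

```python
import numpy as np

def top_k(v, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_step(grad, error, lr, k):
    """One error-feedback iteration on a worker: compress the error-corrected
    (scaled) gradient, send the compressed message, keep the new residual."""
    corrected = lr * grad + error       # add back the previous compression error
    message = top_k(corrected, k)       # only this compressed vector is transmitted
    new_error = corrected - message     # residual stored locally for the next iteration
    return message, new_error
```

Because `message + new_error` always equals the corrected gradient exactly, coordinates dropped by the compressor are re-injected in later iterations rather than lost, which is the sense in which EF "corrects the bias."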

Distributed SGD with compressed gradients. In distributed SGD training systems, extensive works have considered compression applied to the communicated gradients. Unbiased stochastic compressors include stochastic rounding and QSGD [3; 66; 58; 38], and magnitude-based random sparsification [57]. The works [50; 7; 8; 28; 24] analyzed communication compression using only the sign (1-bit) information of the gradients. Unbiased compressors can be combined with variance reduction techniques for acceleration, e.g., [17]. On the other hand, examples of popular biased compressors include Top-K [37; 54; 52], which only transmits the gradient coordinates with the largest magnitudes, and fixed (or learned) quantization [13; 66]. See [9] for a summary of more biased compressors.
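To illustrate the biased/unbiased distinction drawn above, the sketch below contrasts Top-K with rescaled random-k sparsification (the names `top_k` and `rand_k` are ours). Random-k is unbiased because each coordinate survives with probability $k/d$ and is rescaled by $d/k$, so the compressed vector equals $v$ in expectation; Top-K is deterministic, so its expectation differs from $v$ whenever $k < d$.

```python
import numpy as np

def top_k(v, k):
    """Biased: deterministically keeps the k largest-magnitude coordinates."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def rand_k(v, k, rng):
    """Unbiased: keeps k uniformly random coordinates, rescaled by d/k
    so that E[rand_k(v)] = v."""
    d = v.shape[0]
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out
```

Averaging many `rand_k` samples recovers the original vector, at the price of the extra sampling step and added variance; this trade-off is one motivation for pairing the cheaper, biased Top-K with error feedback instead.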

