ANALYSIS OF ERROR FEEDBACK IN COMPRESSED FEDERATED NON-CONVEX OPTIMIZATION

Abstract

Communication cost between the clients and the central server can be a bottleneck in real-world Federated Learning (FL) systems. In classical distributed learning, Error Feedback (EF) has been a popular technique to remedy the downsides of biased gradient compression, but the literature on applying EF to FL is still limited. In this work, we propose a compressed FL scheme equipped with error feedback, named Fed-EF, with two variants depending on the global optimizer. We provide theoretical analysis showing that Fed-EF matches the convergence rate of its full-precision FL counterparts in non-convex optimization under data heterogeneity. Moreover, we initiate the first analysis of EF under partial client participation, an important scenario in FL, and demonstrate that the convergence rate of Fed-EF exhibits an extra slowdown factor due to the "stale error compensation" effect. Experiments are conducted to validate the efficacy of Fed-EF in practical FL tasks and to justify our theoretical findings.
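To make the error-feedback mechanism concrete, the following is a minimal sketch (not the paper's exact Fed-EF algorithm) of one client-side step of biased top-k gradient compression with an error-compensation buffer, using hypothetical helper names `topk_compress` and `ef_step`:

```python
import numpy as np

def topk_compress(v, k):
    """Biased top-k compressor: keep only the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]  # indices of the k largest |v_j|
    out[idx] = v[idx]
    return out

def ef_step(grad, error, k):
    """One error-feedback step: compress the error-corrected gradient and
    carry the compression residual into the next round's buffer."""
    corrected = grad + error            # add back previously dropped mass
    compressed = topk_compress(corrected, k)
    new_error = corrected - compressed  # residual to compensate later
    return compressed, new_error
```

The key invariant is that `compressed + new_error` always equals the error-corrected gradient, so no gradient information is permanently discarded; it is merely delayed, which is what makes biased compressors like top-k converge in this scheme.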

1. INTRODUCTION

Federated Learning (FL) has seen numerous applications in, e.g., computer vision, language processing, public health, and the Internet of Things (IoT) [19; 44; 62; 39; 49; 29; 25]. A centralized FL system includes multiple clients, each with local data, and one central server that coordinates the training process. The goal of FL is for $n$ clients to collaboratively find a global model, parameterized by $\theta$, such that
$$\theta^* = \arg\min_{\theta\in\mathbb{R}^d} f(\theta) := \arg\min_{\theta\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f_i(\theta), \quad \text{where } f_i(\theta) := \mathbb{E}_{D\sim\mathcal{D}_i}\left[F_i(\theta; D)\right], \tag{1}$$



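As a toy numerical illustration of the objective in (1), consider a hypothetical setup where each client $i$ holds a local quadratic loss $f_i(\theta) = \frac{1}{2}\|\theta - c_i\|^2$; the global loss is then the average of the local losses, and its minimizer is the mean of the client centers $c_i$ (all names below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical toy setup: n clients, each with a local quadratic loss
# f_i(theta) = 0.5 * ||theta - c_i||^2, so the global minimizer is mean(c_i).
rng = np.random.default_rng(0)
n, d = 4, 3
centers = rng.normal(size=(n, d))  # one center c_i per client (stand-in for local data)

def local_loss(theta, i):
    """Client i's local objective f_i(theta)."""
    return 0.5 * np.sum((theta - centers[i]) ** 2)

def global_loss(theta):
    """Global objective f(theta) = (1/n) * sum_i f_i(theta), as in Eq. (1)."""
    return np.mean([local_loss(theta, i) for i in range(n)])

theta_star = centers.mean(axis=0)  # closed-form minimizer of the averaged quadratic
```

Even in this simple case, the minimizer of each $f_i$ (its own center $c_i$) differs from the global minimizer, which previews the data-heterogeneity issue discussed next.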
There are two primary benefits of FL: (i) the clients train the model simultaneously, which is efficient in terms of computational resources; (ii) each client's data are kept local throughout training and never transmitted to other parties, which promotes data privacy. However, this efficiency and these broad application scenarios also bring challenges for FL method design:

• Data heterogeneity: Unlike in classical distributed training, the local data distributions in FL ($\mathcal{D}_i$ in (1)) can differ across clients (non-iid), reflecting many real-world scenarios where the local data held by different clients (e.g., app/website users) are highly personalized. When multiple local training steps are taken, the local models can become "biased" towards minimizing the local losses instead of the global loss. This data heterogeneity may hinder the global model from converging to a good solution [34; 67; 33].

• Partial participation (PP): Another practical issue, especially for cross-device FL, is partial participation (PP), where the clients do not join training consistently, e.g., due to

