z-SIGNFEDAVG: A UNIFIED STOCHASTIC SIGN-BASED COMPRESSION FOR FEDERATED LEARNING

Abstract

Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm, but it suffers from high communication cost when training large-scale machine learning models. Sign-based methods, such as SignSGD (Bernstein et al., 2018), have been proposed as a biased gradient compression technique for reducing the communication cost. However, sign-based algorithms can diverge under heterogeneous data, which has motivated advanced techniques, such as the error-feedback method and stochastic sign-based compression, to fix this issue. Nevertheless, these methods still suffer from slower convergence rates, and none of them allows multiple local SGD updates as in FedAvg (McMahan et al., 2017). In this paper, we propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression, which not only allows one to flexibly control the tradeoff between gradient bias and convergence performance, but also provides a unified viewpoint on existing stochastic sign-based methods. More importantly, the unified noisy perturbation scheme enables the development of the first sign-based FedAvg algorithm (z-SignFedAvg) to accelerate convergence. Theoretically, we show that z-SignFedAvg achieves a faster convergence rate than existing sign-based methods and, under uniformly distributed noise, enjoys the same convergence rate as its uncompressed counterpart. Extensive experiments demonstrate that z-SignFedAvg achieves competitive empirical performance on real datasets and outperforms existing schemes.

1. INTRODUCTION

We consider a Federated Learning (FL) network with one parameter server and $n$ clients (McMahan et al., 2017; Li et al., 2020a), with a focus on solving the following distributed learning problem:

$$\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where $f_i(\cdot)$ is the local objective function of the $i$-th client, for $i = 1, \ldots, n$. Throughout this paper, we assume that each $f_i$ is smooth and possibly non-convex. The local objective functions are induced by the local dataset owned by each client.

When designing distributed algorithms to solve (1), communication efficiency is a crucial concern, since a massive number of clients must transmit their local gradients to the server frequently (Li et al., 2020a). As one of the most popular FL algorithms, the federated averaging (FedAvg) algorithm (McMahan et al., 2017; Konečnỳ et al., 2016) performs multiple local SGD updates with periodic communication to reduce the communication cost. Another approach is to compress the local gradients before sending them to the server (Li et al., 2020a; Alistarh et al., 2017; Reisizadeh et al., 2020). Among existing compression methods, a simple yet elegant technique is to take the sign of each coordinate of the local gradient, which requires only one bit per coordinate. For any $x \in \mathbb{R}$, we define the sign operator as $\mathrm{Sign}(x) = 1$ if $x \geq 0$ and $-1$ otherwise. It has been shown recently that optimization algorithms with sign-based compression enjoy great communication efficiency while still achieving empirical performance comparable to that of uncompressed algorithms (Bernstein et al., 2018; Karimireddy et al., 2019; Safaryan & Richtárik, 2021). However, for distributed learning, especially in scenarios with heterogeneous data, i.e.,
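To make the one-bit-per-coordinate claim concrete, the following is a minimal sketch of the coordinate-wise sign operator and its bit-level packing; the function names are illustrative and not from the paper, and the scheme shown is plain sign compression (no stochastic perturbation or error feedback).

```python
import numpy as np

def sign_compress(grad: np.ndarray) -> np.ndarray:
    """Apply Sign(x) = 1 if x >= 0, -1 otherwise, to every coordinate."""
    return np.where(grad >= 0, 1, -1).astype(np.int8)

def pack_signs(signs: np.ndarray) -> bytes:
    """Pack the +/-1 signs into a bit string: one bit per coordinate."""
    bits = (signs > 0).astype(np.uint8)
    return np.packbits(bits).tobytes()

def unpack_signs(payload: bytes, d: int) -> np.ndarray:
    """Recover the +/-1 signs of a d-dimensional gradient from the bit string."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8), count=d)
    return np.where(bits == 1, 1, -1).astype(np.int8)
```

A client would send `pack_signs(sign_compress(grad))` instead of the full-precision gradient, reducing the payload from 32 (or 64) bits per coordinate to 1 bit, at the cost of the bias that the rest of the paper is concerned with.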

