z-SIGNFEDAVG: A UNIFIED STOCHASTIC SIGN-BASED COMPRESSION FOR FEDERATED LEARNING

Abstract

Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm, but it suffers from high communication cost when training large-scale machine learning models. Sign-based methods, such as SignSGD (Bernstein et al., 2018), have been proposed as biased gradient compression techniques for reducing the communication cost. However, sign-based algorithms can diverge under heterogeneous data, which has motivated the development of advanced techniques, such as the error-feedback method and stochastic sign-based compression, to fix this issue. Nevertheless, these methods still suffer from slower convergence rates, and none of them allows multiple local SGD updates like FedAvg (McMahan et al., 2017). In this paper, we propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression, which not only allows one to flexibly control the tradeoff between gradient bias and convergence performance, but also provides a unified viewpoint on existing stochastic sign-based methods. More importantly, the unified noisy perturbation scheme enables the development of the very first sign-based FedAvg algorithm (z-SignFedAvg) to accelerate the convergence. Theoretically, we show that z-SignFedAvg achieves a faster convergence rate than existing sign-based methods and, under uniformly distributed noise, enjoys the same convergence rate as its uncompressed counterpart. Extensive experiments demonstrate that z-SignFedAvg achieves competitive empirical performance on real datasets and outperforms existing schemes.

1. INTRODUCTION

We consider the Federated Learning (FL) network with one parameter server and n clients (McMahan et al., 2017; Li et al., 2020a), with the focus on solving the following distributed learning problem

min_{x ∈ R^d} f(x) = (1/n) Σ_{i=1}^{n} f_i(x),   (1)

where f_i(·) is the local objective function for the i-th client, for i = 1, ..., n. Throughout this paper, we assume that each f_i is smooth and possibly non-convex. The local objective functions are generated from the local dataset owned by each client. When designing distributed algorithms to solve (1), a crucial aspect is communication efficiency, since a massive number of clients need to transmit their local gradients to the server frequently (Li et al., 2020a). As one of the most popular FL algorithms, the federated averaging (FedAvg) algorithm (McMahan et al., 2017; Konečnỳ et al., 2016) performs multiple local SGD updates with periodic communications to reduce the communication cost. Another way is to compress the local gradients before sending them to the server (Li et al., 2020a; Alistarh et al., 2017; Reisizadeh et al., 2020). Among the existing compression methods, a simple yet elegant technique is to take the sign of each coordinate of the local gradients, which requires only one bit for transmitting each coordinate. For any x ∈ R, we define the sign operator as: Sign(x) = 1 if x ≥ 0 and -1 otherwise. It has been shown recently that optimization algorithms with sign-based compression can enjoy great communication efficiency while still achieving empirical performance comparable to uncompressed algorithms (Bernstein et al., 2018; Karimireddy et al., 2019; Safaryan & Richtárik, 2021). However, for distributed learning, especially scenarios with heterogeneous data, i.e., f_i ≠ f_j for every i ≠ j, a naive application of the sign-based algorithm may end up with divergence (Karimireddy et al., 2019; Chen et al., 2020a; Safaryan & Richtárik, 2021).
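For concreteness, the one-bit sign compressor defined above can be sketched as follows. Note that `np.sign` returns 0 at zero, so it cannot be used directly; the text's convention Sign(0) = 1 is enforced explicitly. The sample gradient values are arbitrary.

```python
import numpy as np

def sign_compress(x: np.ndarray) -> np.ndarray:
    """Elementwise sign compressor: Sign(x) = 1 if x >= 0, else -1.

    Each coordinate of the result needs only one bit to transmit.
    Sign(0) = 1, matching the definition in the text.
    """
    return np.where(x >= 0, 1.0, -1.0)

grad = np.array([0.3, -1.7, 0.0, 2.4])
print(sign_compress(grad))  # [ 1. -1.  1.  1.]
```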
A counterexample for sign-based distributed gradient descent. Consider the one-dimensional problem with two clients:

min_{x ∈ R} (x - A)^2 + (x + A)^2,

where A > 0 is some constant. For any x ∈ [-A, A], the averaged sign gradient at x is Sign(x - A) + Sign(x + A) = 0, i.e., the algorithm never moves. Similar examples are also discussed by (Chen et al., 2020a; Safaryan & Richtárik, 2021). The fundamental reason for this undesirable result is the uncontrollable bias brought by the sign-based compression. There are mainly two approaches to fixing this issue in the existing literature. The first is the stochastic sign-based method, which introduces stochasticity into the sign operation (Jin et al., 2020; Safaryan & Richtárik, 2021; Chen et al., 2020a), and the second is the Error-Feedback (EF) method (Karimireddy et al., 2019; Vogels et al., 2019; Tang et al., 2019). However, these works are still unsatisfactory. On the one hand, both the theoretical convergence rates and the empirical performance of these algorithms remain worse than those of uncompressed algorithms (Ghadimi & Lan, 2013; Yu et al., 2019). On the other hand, none of them allows the clients to perform multiple local SGD updates within one communication round as in FedAvg, and they are therefore less communication efficient. This work aims at addressing these issues and closing the gaps for sign-based methods.

Main contributions. Our contributions are summarized as follows.

(1) A unified family of stochastic sign operators. We show an intriguing fact: the bias brought by the sign-based compression can be flexibly controlled by injecting a proper amount of random noise before the sign operation. In particular, our analysis is based on a novel noisy perturbation scheme with a general symmetric noise distribution, which also provides a unified framework for understanding existing stochastic sign-based methods, including (Jin et al., 2020; Safaryan & Richtárik, 2021; Chen et al., 2020a).

(2) The first sign-based FedAvg algorithm. In contrast to the existing sign-based methods, which do not allow multiple local SGD updates within one communication round, we build on the proposed stochastic sign-based compression to design a novel family of sign-based federated averaging algorithms (z-SignFedAvg) that achieve the best of both worlds: high communication efficiency and a fast convergence rate.

(3) New theoretical convergence rate analyses. By leveraging the asymptotic unbiasedness of the stochastic sign-based compression, we derive a series of theoretical results for z-SignFedAvg and demonstrate its improved convergence rates over the existing sign-based methods. In particular, we show that by injecting a sufficiently large uniform noise, z-SignFedAvg matches the convergence rate of the uncompressed algorithms.

Organization. In Section 2, the proposed general noisy perturbation scheme for the sign-based compression and its key property, i.e., asymptotic unbiasedness, are presented. Inspired by this result, the main algorithms are devised in Section 3, together with their convergence analyses under different noise distribution parameters. We evaluate the proposed algorithms on real datasets and benchmark them against existing sign-based methods in Section 4. Finally, conclusions are drawn in Section 5.

Notations. For any x ∈ R^d, we denote x(j) as the j-th element of the vector x. We define the ℓ_p-norm for p ≥ 1 as ∥x∥_p = (Σ_{j=1}^{d} |x(j)|^p)^{1/p}. We denote ∥·∥ = ∥·∥_2 and ∥x∥_∞ = max_{j ∈ {1,...,d}} |x(j)|. For any function f(x), we denote f^(k)(x) as its k-th derivative, and for a vector x = [x(1), ..., x(d)]^⊤ ∈ R^d, we define Sign(x) = [Sign(x(1)), ..., Sign(x(d))]^⊤.

1.1. RELATED WORKS

Stochastic sign-based method. Our proposed algorithm belongs to this category. Among the existing works (Safaryan & Richtárik, 2021; Jin et al., 2020; Chen et al., 2020a), the setting considered by Safaryan & Richtárik (2021) is closest to ours, since the latter two consider gradient compression not only in the uplink but also in the downlink. Despite this difference and the use of different convergence metrics, the algorithms therein achieve the same convergence rate O(τ^{-1/4}), where τ is the total number of gradient queries to the local objective function. Compared to existing works, our proposed z-SignFedAvg requires a slightly stronger assumption on the mini-batch gradient noise, but achieves a faster convergence rate of O(τ^{-1/3}), or even O(τ^{-1/2}), with the standard squared ℓ_2-norm of the gradient as the convergence metric.
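The stalling behaviour in the counterexample above can be checked numerically. The sketch below runs plain sign-based distributed gradient descent on min_x (x - A)^2 + (x + A)^2 with two clients; the starting point and step size are arbitrary toy choices, and the iterate indeed never leaves its initial position inside (-A, A).

```python
def sign(x: float) -> float:
    """Sign(x) = 1 if x >= 0, else -1, as defined in the text."""
    return 1.0 if x >= 0 else -1.0

A = 1.0   # problem constant: f_1(x) = (x - A)^2, f_2(x) = (x + A)^2
x = 0.5   # any starting point strictly inside (-A, A)
lr = 0.1

for _ in range(1000):
    g1 = 2.0 * (x - A)  # client 1 gradient (negative for x < A)
    g2 = 2.0 * (x + A)  # client 2 gradient (positive for x > -A)
    # Averaged sign update: Sign(g1) = -1 and Sign(g2) = +1 cancel exactly.
    x -= lr * (sign(g1) + sign(g2)) / 2.0

print(x)  # still 0.5: the algorithm never moves
```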


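As a small illustration of the noisy perturbation idea under uniformly distributed noise, the Monte Carlo sketch below checks, for a single scalar coordinate, that if z ~ Uniform(-B, B) and |g| ≤ B, then P(g + z ≥ 0) = (B + g)/(2B), so the scaled stochastic sign B·Sign(g + z) has expectation g. The specific values of g, B, and the sample count are arbitrary toy choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

g, B = 0.3, 1.0   # gradient coordinate and uniform-noise magnitude, |g| <= B
n = 200_000

z = rng.uniform(-B, B, size=n)                 # symmetric uniform perturbation
stoch_sign = np.where(g + z >= 0, 1.0, -1.0)   # Sign(g + z), with Sign(0) = 1

# Monte Carlo estimate of E[B * Sign(g + z)]; should be close to g.
estimate = B * stoch_sign.mean()
print(estimate)
```

In contrast, the unperturbed compressor B·Sign(g) always returns ±B, so the injected noise is what restores (asymptotic) unbiasedness at the cost of extra variance.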