QUIC-FL: QUICK UNBIASED COMPRESSION FOR FEDERATED LEARNING

Abstract

Distributed Mean Estimation (DME) is a fundamental building block in communication-efficient federated learning. In DME, clients communicate their lossily compressed gradients to the parameter server, which estimates the average and updates the model. State-of-the-art DME techniques apply either unbiased quantization methods, which incur large estimation errors, or biased quantization methods, where unbiasing the result requires the server to decode each gradient individually, markedly slowing the aggregation. In this paper, we propose QUIC-FL, a DME algorithm that achieves the best of all worlds: QUIC-FL is unbiased, offers fast aggregation, and is competitive in accuracy with the most accurate (slow-aggregation) DME techniques. To achieve this, we formalize the problem in a novel way that allows us to use standard solvers to design near-optimal unbiased quantization schemes.

1. INTRODUCTION

In federated learning McMahan et al. (2017); Kairouz et al. (2019), clients periodically send their gradients to the parameter server, which calculates their mean. This communication is often a network bottleneck, and methods that approximate the mean using little communication are desirable. The Distributed Mean Estimation (DME) problem Suresh et al. (2017) formalizes this fundamental building block as follows: each of n clients communicates a representation of a d-dimensional vector to a parameter server, which estimates the vectors' mean.

Various DME methods have been studied (e.g., Suresh et al. (2017); Konečnỳ & Richtárik (2018); Vargaftik et al. (2021); Davies et al. (2021); Vargaftik et al. (2022)), examining tradeoffs between the required bandwidth and performance metrics such as the estimation accuracy, the learning speed, and the eventual accuracy of the model. These works utilize lossy compression techniques that use only a small number of bits per coordinate, which has been shown to accelerate the training process Bai et al. (2021); Zhong et al. (2021). For example, in Suresh et al. (2017), each client randomly rotates its vector before applying stochastic quantization. When receiving the clients' messages, the server sums up the estimates of the rotated vectors and applies the inverse rotation. As the largest coordinates are asymptotically larger than the mean, the resulting Normalized Mean Squared Error (NMSE) is bounded by O(log d / n). The authors also propose an entropy-encoding method that reduces the NMSE to O(1/n) but is slow and not GPU-friendly. A different approach to DME computes the Kashin's representation Lyubarskii & Vershynin (2010) of a client's vector before applying quantization Caldas et al. (2018); Safaryan et al. (2020). Intuitively, this replaces the input d-dimensional vector with λ·d coefficients, for some λ > 1, each bounded by O(‖x‖₂/√d). Applying quantization to the coefficients instead of the original vector allows the server to estimate the mean using λ > 1 bits per coordinate with an NMSE of O(λ² / ((√λ − 1)⁴ · n)). However, it requires applying multiple randomized Hadamard transforms, slowing down the encoding. The recently introduced DRIVE Vargaftik et al. (2021) (which uses b = 1 bits per coordinate) and its generalization EDEN Vargaftik et al. (2022) (which works with any b > 0) also randomly rotate the input vector but, unlike Suresh et al. (2017), use biased (deterministic) quantization on the rotated coordinates. Interestingly, both yield unbiased estimates of the input vector after multiplying the estimated vector by a real-valued "scale" that each client sends together with the quantization. Both solutions have an NMSE of O(1/n) and are empirically more accurate than Kashin's representation. However, to achieve unbiasedness, each client must generate a distinct rotation matrix independently from the other clients. In turn, the server must invert the rotation for each vector before aggregating them, resulting in O(n) rotations instead of one, asymptotically increasing the decoding time.

Here, we attempt to resolve the decoding-time slowdown of these recent state-of-the-art DME techniques Vargaftik et al. (2021; 2022). Again, this slowdown arises because unbiasing the estimates requires each client to use its own independent random rotation, and accordingly the server must invert the rotation for each quantized gradient. Building on the rotation-based approach of Suresh et al. (2017), we present two key improvements: (1) Instead of quantizing all coordinates, we allow the algorithm to send an expected p-fraction of the rotated coordinates exactly (up to precision), for some small p (e.g., p = 1/512). This limits the range of the other coordinates to [−T_p, T_p], where T_p = O(1) for any constant p > 0, thus significantly reducing the possible quantization error. (2) We study how to leverage client-specific shared randomness Ben Basat et al. (2021) to reduce the error further. Specifically, we model the problem of transmitting a "bounded-support" normal random variable Z ~ N(0, 1) | Z ∈ [−T_p, T_p] using b ∈ ℕ bits, with the goal of obtaining an unbiased estimate at the server. Our model considers both a client's private randomness and randomness shared between the clients and the server, allowing us to derive an input to an optimization-problem solver whose output yields algorithms with a near-optimal accuracy-to-bandwidth tradeoff.

We note that our algorithm derives near-optimal stochastic quantizations for a specific distribution by formulating a mathematical program (a set of constraints) that is fed to an optimization solver. We believe this approach will prove useful for other problems that use stochastic quantization. While we have surveyed the most relevant related work above, we review other techniques in Appendix A. (All appendices appear in the supplementary material.)
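To make improvement (1) concrete, the following is a minimal NumPy sketch of the rotate-then-quantize pipeline with exact transmission of out-of-range coordinates. It is illustrative only, not the paper's algorithm: it uses a dense Haar-random orthogonal matrix instead of a fast randomized Hadamard transform, a uniform quantization grid instead of the solver-optimized one, and the function names, the threshold T, and the bit budget are all assumptions made for the example.

```python
import numpy as np

def encode(x, rng, T=2.0, bits=2):
    """Toy client encoder: rotate x, send out-of-range rotated coordinates
    exactly, and apply unbiased stochastic quantization to the rest."""
    d = len(x)
    # A random rotation (the paper uses a fast randomized Hadamard transform;
    # a dense orthogonal matrix from a QR decomposition is used for brevity).
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    scale = np.linalg.norm(x) / np.sqrt(d)  # rotated coords are approx. N(0, scale^2)
    z = (Q @ x) / scale                     # approx. standard-normal coordinates
    exact = np.abs(z) > T                   # the expected p-fraction sent exactly
    # Unbiased stochastic quantization of the in-range coordinates onto a
    # uniform grid of 2^bits levels over [-T, T]: round up with probability
    # equal to the fractional offset, so the expectation equals z.
    levels = np.linspace(-T, T, 2 ** bits)
    step = levels[1] - levels[0]
    idx = np.clip((z - levels[0]) / step, 0, 2 ** bits - 1)
    lo = np.floor(idx)
    q = lo + (rng.random(d) < idx - lo)
    z_hat = np.where(exact, z, levels[0] + q * step)
    return Q, scale * z_hat

def decode(Q, y):
    # The server inverts the rotation; with a shared rotation this is done
    # once on the aggregate rather than once per client.
    return Q.T @ y
```

Averaging many independent encodings of the same vector recovers it, illustrating that the estimate is unbiased even though each individual encoding is coarse.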

2. PRELIMINARIES

Problems and Metrics. Given a non-zero vector x ∈ R^d, a vector compression protocol consists of a client that computes a message X and a server that, given the message, estimates x̂ ∈ R^d. The vector Normalized Mean Squared Error (vNMSE) of the protocol is defined as vNMSE = E[‖x̂ − x‖²] / ‖x‖².
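As a sanity check on this metric, the following sketch empirically measures the vNMSE of plain unbiased stochastic rounding (assuming the standard definition vNMSE = E‖x̂ − x‖² / ‖x‖²); the function name, grid step, and dimensions are illustrative choices, not values from the paper.

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Unbiased rounding of each coordinate to the grid {k * step}:
    round up with probability equal to the fractional offset,
    so E[x_hat] = x coordinate-wise."""
    lo = np.floor(x / step)
    return step * (lo + (rng.random(x.shape) < x / step - lo))

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
# Average the squared error over many independent encodings of the same x.
errs = [np.sum((stochastic_round(x, 0.25, rng) - x) ** 2) for _ in range(5000)]
vnmse = np.mean(errs) / np.sum(x ** 2)  # empirical E||x_hat - x||^2 / ||x||^2
```

Since each coordinate's error variance is at most step²/4, the measured vNMSE stays below step²/4 divided by the mean squared coordinate, matching the worst-case bound for uniform stochastic quantization.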

