QUIC-FL: QUICK UNBIASED COMPRESSION FOR FEDERATED LEARNING

Abstract

Distributed Mean Estimation (DME) is a fundamental building block in communication-efficient federated learning. In DME, clients communicate their lossily compressed gradients to the parameter server, which estimates the average and updates the model. State-of-the-art DME techniques apply either unbiased quantization methods, resulting in large estimation errors, or biased quantization methods, where unbiasing the result requires that the server decode each gradient individually, which markedly slows the aggregation time. In this paper, we propose QUIC-FL, a DME algorithm that achieves the best of all worlds. QUIC-FL is unbiased, offers fast aggregation time, and is competitive with the most accurate (slow-aggregation) DME techniques. To achieve this, we formalize the problem in a novel way that allows us to use standard solvers to design near-optimal unbiased quantization schemes.

1. INTRODUCTION

In federated learning (McMahan et al., 2017; Kairouz et al., 2019), clients periodically send their gradients to the parameter server, which computes their mean. This communication is often a network bottleneck, and methods that approximate the mean using little communication are desirable. The Distributed Mean Estimation (DME) problem (Suresh et al., 2017) formalizes this fundamental building block as follows: each of n clients communicates a representation of a d-dimensional vector to a parameter server, which estimates the vectors' mean.

Various DME methods have been studied (e.g., Suresh et al., 2017; Konečný & Richtárik, 2018; Vargaftik et al., 2021; Davies et al., 2021; Vargaftik et al., 2022), examining tradeoffs between the required bandwidth and performance metrics such as the estimation accuracy, the learning speed, and the eventual accuracy of the model. These works utilize lossy compression techniques, using only a small number of bits per coordinate, which has been shown to accelerate the training process (Bai et al., 2021; Zhong et al., 2021). For example, in Suresh et al. (2017), each client randomly rotates its vector before applying stochastic quantization. When receiving the messages from the clients, the server sums up the estimates of the rotated vectors and applies the inverse rotation. As the largest coordinates are asymptotically larger than the mean, the Normalized Mean Squared Error (NMSE) is bounded by O(log d / n). They also propose an entropy-encoding method that reduces the NMSE to O(1/n) but is slow and not GPU-friendly. A different approach to DME computes the Kashin's representation (Lyubarskii & Vershynin, 2010) of a client's vector before applying quantization (Caldas et al., 2018; Safaryan et al., 2020). Intuitively, this replaces the d-dimensional input vector by λ·d coefficients, for some λ > 1, each bounded by O(∥x∥₂ / √d). Applying quantization to the coefficients instead of the original vector allows the server to estimate the mean using λ > 1 bits per coordinate with an NMSE of O(λ² / ((√λ − 1)⁴ · n)). However, it requires applying multiple randomized Hadamard transforms, slowing down its encoding. The recently introduced DRIVE (Vargaftik et al., 2021), which uses b = 1 bits per coordinate, and its generalization EDEN (Vargaftik et al., 2022), which works with any b > 0, also randomly rotate the input vector but, unlike Suresh et al. (2017), use biased (deterministic) quantization on the rotated coordinates. Interestingly, both yield unbiased estimates of the input vector after multiplying the estimated vector by a real-valued "scale" that is sent by each client together with the quantization. Both solutions have an NMSE of O(1/n) and are empirically more accurate than Kashin's representation. However, to achieve unbiasedness, each client must generate a distinct rotation matrix independently
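To make the rotate-then-quantize pipeline concrete, the following is a minimal sketch, not the implementation from any of the cited papers, of DME in the spirit of Suresh et al. (2017): each client applies a random sign flip and a Hadamard rotation, then unbiased stochastic quantization; the server dequantizes, inverts the rotation, and averages. All helper names (`hadamard`, `client_encode`, `server_decode`) and the naive O(d²) transform are illustrative assumptions for readability.

```python
# Sketch of DME with a randomized Hadamard rotation and unbiased
# stochastic quantization (illustrative, not the papers' code).
import numpy as np

rng = np.random.default_rng(0)

def hadamard(d):
    # Naive O(d^2) orthonormal Walsh-Hadamard matrix; d must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])

def client_encode(x, H, signs, levels=2):
    # Randomized rotation: random sign flip followed by the Hadamard transform.
    r = H @ (signs * x)
    lo, hi = r.min(), r.max()
    scale = (hi - lo) / (levels - 1)
    # Stochastic rounding onto `levels` equally spaced values: E[q] recovers r,
    # so the quantization is unbiased.
    p = (r - lo) / scale
    q = np.floor(p) + (rng.random(r.shape) < (p - np.floor(p)))
    return q, lo, scale  # the client sends q plus the two floats lo, scale

def server_decode(q, lo, scale, H, signs):
    r_hat = lo + q * scale
    return signs * (H.T @ r_hat)  # inverse rotation, then undo the sign flip

d, n = 8, 1000
xs = [rng.standard_normal(d) for _ in range(n)]
H = hadamard(d)
est = np.zeros(d)
for x in xs:
    signs = rng.choice([-1.0, 1.0], size=d)
    q, lo, scale = client_encode(x, H, signs)
    est += server_decode(q, lo, scale, H, signs)
est /= n
true_mean = np.mean(xs, axis=0)
# Because each estimate is unbiased, `est` concentrates around `true_mean`
# as n grows; the per-coordinate error shrinks like 1/sqrt(n).
```

The rotation spreads the vector's energy roughly evenly across coordinates, which shrinks the quantization range (and hence the variance) compared with quantizing the raw vector directly; this is the effect behind the O(log d / n) bound discussed above.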

