QUIC-FL: QUICK UNBIASED COMPRESSION FOR FEDERATED LEARNING

Abstract

Distributed Mean Estimation (DME) is a fundamental building block in communication-efficient federated learning. In DME, clients communicate their lossily compressed gradients to the parameter server, which estimates the average and updates the model. State-of-the-art DME techniques apply either unbiased quantization methods, resulting in large estimation errors, or biased quantization methods, where unbiasing the result requires that the server decodes each gradient individually, which markedly slows the aggregation time. In this paper, we propose QUIC-FL, a DME algorithm that achieves the best of all worlds. QUIC-FL is unbiased, offers fast aggregation time, and is competitive with the most accurate (slow aggregation) DME techniques. To achieve this, we formalize the problem in a novel way that allows us to use standard solvers to design near-optimal unbiased quantization schemes.

1. INTRODUCTION

In federated learning McMahan et al. (2017); Kairouz et al. (2019), clients periodically send their gradients to the parameter server, which calculates their mean. This communication is often a network bottleneck, and methods that approximate the mean using little communication are desirable. The Distributed Mean Estimation (DME) problem Suresh et al. (2017) formalizes this fundamental building block as follows: each of n clients communicates a representation of a d-dimensional vector to a parameter server, which estimates the vectors' mean. Various DME methods have been studied (e.g., Suresh et al. (2017); Konečný & Richtárik (2018); Vargaftik et al. (2021); Davies et al. (2021); Vargaftik et al. (2022)), examining tradeoffs between the required bandwidth and performance metrics such as the estimation accuracy, the learning speed, and the eventual accuracy of the model. These works utilize lossy compression techniques, using only a small number of bits per coordinate, which is shown to accelerate the training process Bai et al. (2021); Zhong et al. (2021). For example, in Suresh et al. (2017), each client randomly rotates its vector before applying stochastic quantization. When receiving the messages from the clients, the server sums up the estimates of the rotated vectors and applies the inverse rotation. As the largest coordinates are asymptotically larger than the mean, the Normalized Mean Squared Error (NMSE) is bounded by O(log d / n). They also propose an entropy encoding method that reduces the NMSE to O(1/n) but is slow and not GPU-friendly. A different approach to DME computes the Kashin's representation Lyubarskii & Vershynin (2010) of a client's vector before applying quantization Caldas et al. (2018); Safaryan et al. (2020). Intuitively, this replaces the input d-dimensional vector by λ·d coefficients, for some λ > 1, each bounded by O(∥x∥₂/√d).
Applying quantization to the coefficients instead of the original vectors allows the server to estimate the mean using λ > 1 bits per coordinate with an NMSE of O(λ² / ((√λ − 1)⁴ · n)). However, it requires applying multiple randomized Hadamard transforms, slowing down its encoding. The recently introduced DRIVE Vargaftik et al. (2021) (which uses b = 1 bits per coordinate) and its generalization EDEN Vargaftik et al. (2022) (which works with any b > 0) also randomly rotate the input vector, but unlike Suresh et al. (2017) use biased (deterministic) quantization on the rotated coordinates. Interestingly, both yield unbiased estimates of the input vector after multiplying the estimated vector by a real-valued "scale" that is sent by each client together with the quantization. Both solutions have an NMSE of O(1/n) and are empirically more accurate than Kashin's representation. However, to achieve unbiasedness, each client must generate a distinct rotation matrix independently, so the server must decode each gradient individually, which markedly slows the aggregation time Vargaftik et al. (2021; 2022).

Metrics. We denote by x̂_avg our estimate of the average x_avg = (1/n) · ∑_{c=1}^n x_c. Note that for unbiased algorithms and independent estimates, we have that NMSE = vNMSE / n Vargaftik et al. (2021).

Shared randomness. We use both global shared randomness (common to all clients and the server) and client-specific shared randomness (shared between one client and the server). Client-only randomness is termed private randomness.

3. THE QUIC-FL ALGORITHM

3.1. BOUNDED SUPPORT QUANTIZATION

Our first contribution is the introduction of bounded-support quantization (BSQ). For a parameter p ∈ (0, 1], we pick a threshold T_p such that at most d · p coordinates can fall outside [−T_p, T_p]. BSQ separates the vector into two parts: the large coordinates, whose absolute value is at least T_p, and the small ones. The large values are sent exactly (matching the precision of the input gradient), whereas the small values are quantized and transmitted using a small number of bits each.
This simple approach decreases the error of every quantized coordinate by bounding the quantized coordinates' support, at the cost of transmitting some entries exactly. As stated in Appendix G, we formally show that BSQ, without further assumptions, admits a worst-case vNMSE of 1 / (p · (2^b − 1)²). In particular, when p and b are constants, we get an NMSE of O(1/n) with encoding and decoding times of O(d) and O(nd), respectively. However, the inverse dependence on p means that the hidden constant in the O(1/n) NMSE is too large to be practical. For example, if p = 2⁻⁵ and b = 1, we need two bits per coordinate on average: one for sending the exact values (assuming coordinates are single-precision floats) and another for stochastically quantizing the remaining coordinates. In turn, we get a vNMSE bound of 1 / (2⁻⁵ · (2¹ − 1)²) = 32. In the following section, we show that combining BSQ with a random rotation allows us to get an O(1/n) NMSE with a low constant even for small values of p. For example, with p = 2⁻⁹ and an additional one bit per coordinate for the quantization, we reach a vNMSE of 1.52, a 21× improvement despite using less bandwidth.
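To make the scheme concrete, here is a minimal NumPy sketch of BSQ. The names `bsq_encode`/`bsq_decode` and the uniform-level grid are our own illustration, not the paper's implementation: coordinates with absolute value at least T are kept exactly, and the rest are stochastically rounded to one of 2^b uniformly spaced levels in [−T, T], which makes the estimate unbiased.

```python
import numpy as np

def bsq_encode(x, T, b, rng):
    """Bounded-support quantization (illustrative): send coordinates with
    |x_i| >= T exactly; stochastically quantize the rest to 2**b uniform
    levels on [-T, T] so the estimate is unbiased."""
    large = np.abs(x) >= T                   # sent exactly (full precision)
    levels = np.linspace(-T, T, 2 ** b)      # quantization grid
    step = levels[1] - levels[0]
    pos = (x[~large] - levels[0]) / step     # fractional grid position
    low = np.clip(np.floor(pos), 0, 2 ** b - 2)
    # round up with probability equal to the fractional offset (unbiased)
    idx = (low + (rng.random(low.shape) < (pos - low))).astype(int)
    return large, x[large], idx, levels

def bsq_decode(large, exact, idx, levels):
    x_hat = np.empty(large.shape)
    x_hat[large] = exact                     # exact coordinates
    x_hat[~large] = levels[idx]              # quantized coordinates
    return x_hat
```

Averaging many independent encodings of the same vector recovers it, since each quantized coordinate is unbiased and each large coordinate is exact.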

3.2. ROTATIONS WITH BOUNDED SUPPORT QUANTIZATION

Similarly to previous works Suresh et al. (2017); Vargaftik et al. (2021; 2022), our algorithm QUIC-FL begins by randomly rotating the input vector, after which the coordinates' distribution approaches that of independent normal random variables in high dimensions Vargaftik et al. (2021). This effectively turns every input into the average case. We note that, unlike in Vargaftik et al. (2021; 2022), all clients use the same rotation, generated with global shared randomness. QUIC-FL then applies a near-optimal unbiased quantization for the normal distribution to each coordinate. We emphasize that QUIC-FL is unbiased for any input; the quantization is merely tuned for the normal distribution, as after rotation each coordinate is well-approximated by a normal random variable. Unlike previous algorithms, we combine the rotation with bounded-support quantization. QUIC-FL achieves unbiasedness using both private randomness at the client and client-specific shared randomness (shared between it and the server). As another comparison point, Suresh et al. (2017), given a bit budget of b(1 + o(1)) bits per coordinate, stochastically quantizes each rotated coordinate into one of 2^b levels. The algorithm uses a max-min normalization, and the levels are uniformly spaced between the minimal and maximal coordinates. Their algorithm then communicates the max and min, together with b bits per coordinate indicating its quantized level, and is shown to have an NMSE of O(log d / n) for any b = O(1). We begin by analyzing the value of rotation with BSQ. Let Z ~ N(0, 1) be a normal random variable, modeling a rotated (and scaled) coordinate. Given a user-defined parameter p, we can compute a threshold T_p such that Pr[Z ∉ [−T_p, T_p]] = p.
For example, by picking p = 2⁻⁹ (i.e., less than 0.2% of the coordinates), we get a threshold of T_p ≈ 3.097. In general, for any constant p > 0, T_p is constant, and using b bits for each coordinate in [−T_p, T_p] we get an NMSE of O(1/n) for any constant b (due to the unbiased and independent quantization among clients). For example, consider sending each coordinate in [−T_p, T_p] using b = 1 bit per coordinate. One solution would be to use stochastic quantization, i.e., given a coordinate Z ∈ [−T_p, T_p], send the bit for which Ẑ = T_p with probability (Z + T_p) / (2 · T_p), and Ẑ = −T_p otherwise. We can view the algorithm expressed so far as a special case of QUIC-FL without shared randomness. As shown in Appendix B, for QUIC-FL (with or without shared randomness) on any d-dimensional input vector (and any quantization scheme for Z ∈ [−T_p, T_p]),

vNMSE = E[(Z − Ẑ)²] + O(√(log d / d)).

The additional additive term occurs because we chose to optimize for the normal distribution; again, this holds for any initial vector because QUIC-FL starts with a random rotation. Thus, using the above quantization for each coordinate results, for large gradients, in NMSE ≈ 8.58 / n. We next show that additionally using client-specific shared randomness can decrease E[(Z − Ẑ)²] and thus the NMSE.
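The threshold and the 1-bit error above can be reproduced with the Python standard library. This is our own simplified computation (conditioning on |Z| ≤ T_p and using that the 1-bit estimate satisfies Ẑ² = T_p²), which yields ≈ 8.6, consistent with the ≈ 8.58 figure in the text; the function names are ours.

```python
from statistics import NormalDist

def bsq_threshold(p):
    """Threshold T_p with Pr[|Z| > T_p] = p for Z ~ N(0, 1)."""
    return NormalDist().inv_cdf(1 - p / 2)

def one_bit_error(p):
    """E[(Z - Zhat)^2] for stochastic 1-bit quantization of Z | |Z| <= T_p
    to {-T_p, +T_p}: since the estimate is unbiased and Zhat^2 = T_p^2,
    the error equals T_p^2 - E[Z^2 | |Z| <= T_p]."""
    T = bsq_threshold(p)
    phi = NormalDist().pdf(T)
    ez2 = 1 - 2 * T * phi / (1 - p)   # second moment of the truncated normal
    return T * T - ez2
```

For p = 2⁻⁹, `bsq_threshold` returns ≈ 3.097 and `one_bit_error` returns ≈ 8.6.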

3.3. LEVERAGING CLIENT-SPECIFIC SHARED RANDOMNESS

We now provide an example to show how shared randomness can improve the vNMSE, leading to §3.4, where we formalize our approach to finding near-optimal unbiased compression schemes for bounded-support N(0, 1) variables. Using a single shared random bit (i.e., H ∈ {0, 1}), we can use the following algorithm, where X is the sent message and α = 0.8, β = 5.4 are constants:

X = 1 if H = 0 and Z ≥ 0
X = 0 if H = 1 and Z < 0
X = Bernoulli(2Z / (α + β)) if H = 1 and Z ≥ 0
X = 1 − Bernoulli(−2Z / (α + β)) if H = 0 and Z < 0

Ẑ = −β if H = X = 0
Ẑ = −α if H = 1 and X = 0
Ẑ = α if H = 0 and X = 1
Ẑ = β if H = X = 1

For example, if Z = 1, then with probability 1/2 we have that H = 0 and thus X = 1, and otherwise (H = 1) the client sends X = 1 with probability 2/(α + β) (and X = 0 otherwise). Accordingly, the reconstruction is Ẑ = α with probability 1/2 (when H = 0), Ẑ = β with probability 1/2 · 2/(α + β) ≈ 0.16, and Ẑ = −α with probability 1/2 · (α + β − 2)/(α + β) ≈ 0.34. Indeed, the estimate is unbiased since:

E[Ẑ | Z = 1] = α · 1/2 + β · 1/2 · 2/(α + β) + (−α) · 1/2 · (α + β − 2)/(α + β) = 1.

We calculate the quantization's expected squared error, conditioned on Z ∈ [−T_p, T_p]. (By symmetry, we integrate over positive z.)

E[(Z − Ẑ)²] = √(2/π) · ∫₀^{T_p} 1/2 · ((z − α)² + (2z/(α + β)) · (z − β)² + ((α + β − 2z)/(α + β)) · (z + α)²) · e^{−z²/2} dz.

Using the same p = 2⁻⁹ parameter (T_p ≈ 3.097), we get an error of E[(Z − Ẑ)²] ≈ 3.29, 61% lower than without shared randomness. This algorithm was derived with our solver, which numerically approximates the optimal unbiased algorithm with a single shared random bit, in terms of expected squared error, for this p. We present our general approach for using the solver in the following sections.
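The one-shared-bit scheme above can be sketched directly. The `encode`/`decode` helpers and the midpoint-rule integration below are our own illustrative choices; they reproduce the error value reported in the text.

```python
import math, random

ALPHA, BETA, T = 0.8, 5.4, 3.097

def encode(z, h, rng=random):
    """Client: map input z and shared bit h to the transmitted bit x."""
    if h == 0 and z >= 0:
        return 1
    if h == 1 and z < 0:
        return 0
    if h == 1:  # z >= 0
        return int(rng.random() < 2 * z / (ALPHA + BETA))
    # h == 0, z < 0
    return 1 - int(rng.random() < -2 * z / (ALPHA + BETA))

def decode(x, h):
    """Server: reconstruct Zhat from the message x and shared bit h."""
    return [[-BETA, ALPHA], [-ALPHA, BETA]][h][x]

def expected_error(n=20000):
    """Midpoint-rule integration of sqrt(2/pi) * f(z) * exp(-z^2/2) on [0, T],
    where f(z) is the conditional squared error of the scheme."""
    dz = T / n
    total = 0.0
    for i in range(n):
        z = (i + 0.5) * dz
        w = 2 * z / (ALPHA + BETA)
        f = 0.5 * ((z - ALPHA) ** 2 + w * (z - BETA) ** 2
                   + (1 - w) * (z + ALPHA) ** 2)
        total += f * math.exp(-z * z / 2) * dz
    return math.sqrt(2 / math.pi) * total
```

Averaging `decode(encode(z, h), h)` over the shared bit and the private randomness returns z itself, and `expected_error()` evaluates to ≈ 3.3.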

3.4. DESIGNING NEAR-OPTIMAL UNBIASED COMPRESSION SCHEMES

In order to design our post-rotation compression scheme, we first model the problem as follows:
• We first choose a parameter p > 0, the expected fraction of coordinates allowed to be sent exactly.
• The input, known to the client, is a coordinate Z ~ N(0, 1). The parameter p further restricts the distribution to Z ∈ [−T_p, T_p].
• The client-specific shared randomness H is known to both the client and the server, and without loss of generality, we assume that H ~ U[0, 1]. We denote by H = [0, 1] the domain of H.
• We use a bit budget of b ∈ N⁺ bits per coordinate, and accordingly assume that the messages are in the set X_b = {0, ..., 2^b − 1}. Again, coordinates outside the range [−T_p, T_p] are sent exactly.
• The client is modeled as S : H × R → Δ(X_b). That is, the client observes the shared randomness H and the input Z, and chooses a distribution over the messages. We further denote by S_x(h, z) the probability that the client sends x ∈ X_b given h and z (i.e., ∀h, z : ∑_x S_x(h, z) = 1). For example, it may choose S_x(0, 0) = 1/2 if x ∈ {0, 1}, and 0 otherwise. That is, given z = h = 0, the client uses private randomness to decide whether to send x = 0 or x = 1, each with probability 1/2.
• The server is modeled as a function R : H × X_b → R, such that if the shared randomness is h ∈ H and the server receives the message x ∈ X_b, it produces an estimate ẑ = R(h, x).
• We require that the estimates are unbiased, i.e., E[Ẑ | Z] = Z, where the expectation is taken over both the client-specific shared randomness H and the private randomness of the client.

We are now ready to formally define the optimal unbiased quantization problem:

minimize_{S,R} (1/√(2π)) · ∫_{−T_p}^{T_p} ∫₀¹ ∑_x S_x(h, z) · (z − R(h, x))² · e^{−z²/2} dh dz
subject to ∫₀¹ ∑_x S_x(h, z) · R(h, x) dh = z, ∀z ∈ [−T_p, T_p].

We are unaware of methods for solving the above problem analytically.
Instead, we propose a discrete relaxation of the problem, allowing us to approach it with a solver. Namely, we model the algorithm as an optimization problem and let the solver output the optimal algorithm. To that end, we need to discretize the problem. Specifically, we make the following relaxations:
• The shared randomness H is selected uniformly at random from a finite set of values H_ℓ ≔ {0, ..., 2^ℓ − 1}, i.e., using ℓ shared random bits.
• The bounded-support distribution of a rotated and scaled Z ~ N(0, 1) coordinate is approximated using a finite set of quantiles Q_m = {q_0, ..., q_{m−1}}, for a parameter m ∈ N⁺. In particular, the quantile q_i is the point on the CDF of the bounded-support normal distribution (restricted to [−T_p, T_p]) such that Pr[Z ≤ q_i | Z ∈ [−T_p, T_p]] = i/(m−1). Notice that we have m such quantiles, corresponding to the probabilities {0, 1/(m−1), 2/(m−1), ..., 1}. For example, for p = 2⁻⁹ and m = 4 we get the quantile set Q_4 ≈ {−3.097, −0.4298, 0.4298, 3.097}.
• The client is now modeled as S : H_ℓ × Q_m → Δ(X_b). That is, for each shared randomness value h ∈ H_ℓ and quantile q ∈ Q_m, the client has a probability distribution over the messages from which it samples, using private randomness, at encoding time.
• The server is modeled as a function R : H_ℓ × X_b → R, such that if the shared randomness is H and the server receives the message X, it produces an estimate Ẑ = R(H, X).

Given this modeling, we use the following variables:
• s = {s_{h,q,x} | h ∈ H_ℓ, q ∈ Q_m, x ∈ X_b}, where s_{h,q,x} denotes the probability of sending the message x given the quantile q and shared randomness value h. We note that the solver's solution only instructs us what to do if all our coordinates were quantiles in Q_m. In what follows, we show how to interpolate the result and obtain a practical algorithm for any Z ∈ [−T_p, T_p].
• r = {r_{h,x} | h ∈ H_ℓ, x ∈ X_b}, where r_{h,x} denotes the server's estimate given the shared randomness h and the received message x.

Accordingly, the discretized unbiased quantization problem is defined as:

minimize_{s,r} (1/m) · (1/2^ℓ) · ∑_{h,q,x} s_{h,q,x} · (q − r_{h,x})²
subject to (Unbiasedness) (1/2^ℓ) · ∑_{h,x} s_{h,q,x} · r_{h,x} = q, ∀q
(Probability) ∑_x s_{h,q,x} = 1, ∀h, q and s_{h,q,x} ≥ 0, ∀h, q, x.

As mentioned, the solver's output does not directly yield an implementable algorithm, as it only associates probabilities with each ⟨h, q, x⟩ tuple. A natural option is to first stochastically quantize Z to a quantile. For example, when Z = 1 and using the Q_4 described above, before applying the algorithm, we quantize it to q⁻ = 0.4298 with probability ≈ 0.786 or to q⁺ = 3.097 with probability ≈ 0.214. This approach gives an algorithm whose pseudo-code is given in Algorithm 1. The resulting algorithm is near-optimal in the sense that as the number of quantiles and shared random bits tends to infinity, we converge to an optimal algorithm. In practice, the solver is only able to produce an output for finite m, ℓ values; this means that the algorithm would be optimal if the coordinates were uniformly distributed over Q_m, rather than N(0, 1)-distributed. In words, in Algorithm 1 each client c uses shared randomness to compute a global random rotation R (note that all clients use the same rotation). Next, it computes the rotated vector R(x_c); for sufficiently large dimensions, the distribution of each entry of R(x_c) converges to N(0, ∥x_c∥₂²/d). The client then normalizes it, Z_c = (√d / ∥x_c∥₂) · R(x_c), to have the coordinates roughly distributed as N(0, 1). Next, it stochastically quantizes the vector to Q_m. Namely, for a given coordinate Z, let q⁻, q⁺ ∈ Q_m denote the largest quantile smaller than or equal to Z and the smallest quantile larger than Z, respectively. Then we denote by Q_m(Z) the stochastic quantization operation that returns q⁺ with probability (Z − q⁻)/(q⁺ − q⁻), and q⁻ otherwise.
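The quantile set Q_m and the stochastic quantization operator Q_m(Z) can be sketched with the standard library; the function names are ours, and `NormalDist.inv_cdf` inverts the standard normal CDF.

```python
import bisect
import random
from statistics import NormalDist

def quantiles(m, p):
    """Quantile set Q_m of N(0,1) restricted to [-T_p, T_p]:
    q_i satisfies Pr[Z <= q_i | |Z| <= T_p] = i/(m-1)."""
    nd = NormalDist()
    lo = p / 2  # CDF mass below -T_p
    return [nd.inv_cdf(lo + (i / (m - 1)) * (1 - p)) for i in range(m)]

def stochastic_quantize(z, Q, rng=random):
    """Return q+ with probability (z - q-)/(q+ - q-), else q- (unbiased)."""
    j = bisect.bisect_right(Q, z)      # Q[j-1] <= z < Q[j]
    if j == 0:
        return Q[0]
    if j == len(Q):
        return Q[-1]
    qlo, qhi = Q[j - 1], Q[j]
    return qhi if rng.random() < (z - qlo) / (qhi - qlo) else qlo
```

For p = 2⁻⁹ and m = 4 this reproduces Q_4 ≈ {−3.097, −0.4298, 0.4298, 3.097}, and for Z = 1 it returns q⁺ = 3.097 with probability ≈ 0.214, matching the example above.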
The stochastic quantization of the vector applies coordinate-wise, i.e., Q_m(Z_c) = (Q_m(Z_c[0]), ..., Q_m(Z_c[d−1])). The next step is to generate a client-specific shared randomness vector H_c in which each entry is drawn uniformly and independently from H_ℓ. Finally, the client follows the client algorithm produced by the solver. That is, for each coordinate Z, the client takes the mapped quantile q = Q_m(Z) ∈ Q_m, considers the set of probabilities {s_{h,q,x} | x ∈ X_b}, and samples a message accordingly. We denote applying this operation coordinate-wise by X_c = {x with prob. s_{H_c, Q_m(Z_c), x} | x ∈ X_b}. It then sends the resulting vector X_c to the server, together with the norm ∥x_c∥₂. In turn, for each client c, the server estimates its rotated vector by looking up the shared randomness and message for each coordinate. That is, given H_c = (H_c[0], ..., H_c[d−1]) and X_c = (X_c[0], ..., X_c[d−1]), we denote r_{H_c,X_c} = (r_{H_c[0],X_c[0]}, ...). The server then estimates R(x_c) as (∥x_c∥₂/√d) · r_{H_c,X_c} and averages across all clients before performing the inverse rotation:

2: Compute Ẑ_avg = (1/n) · (1/√d) · ∑_{c=1}^n ∥x_c∥₂ · Ẑ_c
3: Estimate x̂_avg = R⁻¹(Ẑ_avg)

In the next section, we analyze the solver's output and show how to improve this method.

Further optimization. A different approach to yield an implementable algorithm from the optimal solution to the discrete problem is to calculate the message distribution directly from the rotated values, without stochastically quantizing as we do in Line 2. Indeed, we have found this approach to be somewhat faster and more accurate. Due to space constraints, we defer the details to Appendix C.

[Figure 1: vNMSE as a function of the bit budget b ∈ {1, 2, 3, 4} and the fraction of accurately sent coordinates p ∈ {2⁻⁴, ..., 2⁻⁹}, for ℓ ∈ {0, ..., 5} shared random bits.]

3.5. THE RANDOMIZED HADAMARD TRANSFORM

As a uniform random rotation is computationally expensive, we use the Randomized Hadamard Transform (RHT) instead of uniform random rotations.
Although the RHT does not induce a uniform distribution on the sphere (and the coordinates are not exactly normally distributed), it is considerably more efficient to compute and, under mild assumptions, the resulting distribution is sufficiently close to the normal distribution Vargaftik et al. (2021). Here, we are interested in how using the RHT affects the guarantees of our algorithm. We start by noting that our algorithm remains unbiased for any input vector. However, adversarial inputs may (1) increase the probability that a rotated coordinate falls outside [−T_p, T_p] and (2) increase the vNMSE as the coordinates' distribution deviates from the normal distribution. We show in Appendix D that QUIC-FL with the RHT has guarantees similar to those with uniform random rotations, albeit somewhat weaker (constant-factor increases in the fraction of accurately sent coordinates and in the vNMSE). We note that these guarantees are still stronger than those of DRIVE Vargaftik et al. (2021) and EDEN Vargaftik et al. (2022), which only prove RHT bounds for input vectors whose coordinates are sampled i.i.d. from a distribution with finite moments, and are not applicable to adversarial vectors. In practice, as shown in the evaluation, the actual performance is close to the theoretical results for uniform rotations; improving the bounds is left as future work. In our evaluation, we use QUIC-FL (Algorithm 2) with RHT-based vector rotation.
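For concreteness, here is a minimal sketch of an RHT: multiply by random signs shared with the server, then apply the fast Walsh-Hadamard transform and normalize. This is our own illustration (assuming d is a power of two), not the paper's implementation.

```python
import numpy as np

def rht(x, signs):
    """Randomized Hadamard Transform: y = (1/sqrt(d)) * H * (signs * x),
    computed in O(d log d) with the fast Walsh-Hadamard butterfly.
    d must be a power of two; `signs` are +-1 values shared with the server."""
    y = (signs * x).astype(float)
    d = len(y)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h]
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(d)

def rht_inverse(y, signs):
    # H/sqrt(d) is orthogonal and symmetric, so it is its own inverse;
    # then undo the random signs.
    return signs * rht(y, np.ones_like(signs))
```

Since H/√d is orthonormal and the signs have unit magnitude, the transform preserves the Euclidean norm, which is exactly the property the vNMSE analysis relies on.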

4.1. THEORETICAL EVALUATION: NMSE AND SPEED MEASUREMENTS

Parameter selection. We experiment with how the different parameters (the number of quantiles m, the fraction of coordinates sent exactly p, the number of shared random bits ℓ, etc.) affect the performance of our algorithm. As shown in Figure 1, introducing shared randomness decreases the vNMSE significantly compared with ℓ = 0. Additionally, the benefit of each additional shared random bit diminishes, and the gain beyond ℓ = 4 is negligible, especially for large b. Accordingly, we hereafter use ℓ = 6 for b = 1, ℓ = 5 for b = 2, and ℓ = 4 for b ∈ {3, 4}. With respect to p, we determined 1/512 to be a good balance between the vNMSE and the bandwidth overhead.

Comparison to state-of-the-art DME techniques. Next, we compare the performance of QUIC-FL to the baseline algorithms in terms of NMSE, encoding speed, and decoding speed, using an NVIDIA RTX 3080 GPU machine with 32GB RAM and an i7-10700K CPU @ 3.80GHz. Specifically, we compare with Hadamard Suresh et al. (2017), Kashin's representation Caldas et al. (2018); Safaryan et al. (2020), QSGD Alistarh et al. (2017), and EDEN Vargaftik et al. (2022). We evaluate two variants of Kashin's representation: (1) the TensorFlow (TF) implementation that, by default, limits the decomposition to three iterations, and (2) the theoretical algorithm that requires O(log(nd)) iterations. As shown in Figure 2, QUIC-FL has the second-lowest NMSE, slightly higher than EDEN's, which has a far slower decode time. Further, QUIC-FL is significantly more accurate than approaches with similar speeds. We observed that the default TF configuration of Kashin's representation suffers from a bias, and therefore its NMSE does not decrease inversely proportionally to n. In contrast, the theoretical algorithm is unbiased but has a markedly higher encoding time. We observed similar trends for different n, b, and d values. We consider the algorithms' bandwidth over all coordinates (e.g., b + 64/512 bits per coordinate for QUIC-FL).
Overall, the empirical measurements fall in line with the bounds in Table 1. The models are trained with learning rates of 0.1 and 0.05, respectively. For both datasets, the clients perform a single optimization step at each round. Our setting includes an SGD optimizer with a cross-entropy loss criterion, a batch size of 128, and a bit budget of b = 1. The results are shown in Figure 4, with a rolling-mean window of 500 rounds. As shown, QUIC-FL is competitive with EDEN and the Float32 baseline and is more accurate than the other methods. Figure 5 shows the results with a rolling-mean window of 200 rounds. Again, QUIC-FL is competitive with EDEN and the uncompressed baseline; Kashin-TF is less accurate, followed by Hadamard.

Additional evaluation. Due to lack of space, we defer additional evaluation results to Appendix F.

5. DISCUSSION

In this work, we presented QUIC-FL, a quick unbiased compression algorithm for federated learning. Both theoretically and empirically, QUIC-FL achieves an NMSE comparable with the most accurate DME techniques while allowing an asymptotically faster decode time. We point out a few challenging directions for future work. QUIC-FL optimizes the worst-case error, and it is compatible with orthogonal directions such as sparsification Konečný & Richtárik (2018).

A ALTERNATIVE COMPRESSION METHODS

This paper focused on the Distributed Mean Estimation (DME) problem, where clients send lossily compressed gradients to a centralized server for averaging. While this problem is worthy of study on its own merits, we are particularly interested in applications to federated learning, which has many variations and practical considerations that have led to many alternative compression methods. For example, when the encoding and decoding time is less important, different approaches suggest using entropy encodings such as Huffman or arithmetic encoding to improve the accuracy (e.g., Suresh et al. (2017); Vargaftik et al. (2022); Alistarh et al. (2017)). Intuitively, such encodings allow us to losslessly compress the lossily compressed vector to reduce its representation size, thereby allowing less aggressive quantization. However, we are unaware of an available GPU-friendly entropy encoding implementation, and thus such methods incur a significant time overhead. Critically, for the basic DME problem, the assumption is that this is a one-shot process where the goal is to optimize the accuracy without relying on client-side memory. This model naturally fits cross-device federated learning, where different clients are sampled in each round. We focused on unbiased compression, which is standard in prior works Suresh et al. (2017); Konečný & Richtárik (2018); Vargaftik et al. (2021); Davies et al. (2021); Mitchell et al. (2022). However, if the compression error is low enough, and under some assumptions, SGD can be proven to converge even with biased compression Beznosikov et al. (2020). In other settings, such as distributed learning or cross-silo federated learning, we may assume that clients are persistent and have a memory that keeps state between rounds. A prominent option to leverage such state is Error Feedback (EF). In EF, clients track the compression error and add it to the gradient computed in the consecutive round.
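A minimal sketch of one EF round, under our own naming (`ef_step`) and with an arbitrary compressor plugged in:

```python
import numpy as np

def ef_step(grad, memory, compress):
    """One Error Feedback step: add the residual carried over from the
    previous round, compress the corrected gradient, and return the new
    residual to be fed back next round."""
    corrected = grad + memory
    compressed = compress(corrected)
    return compressed, corrected - compressed
```

Because the residual is bounded for a reasonable compressor, the sum of transmitted updates tracks the sum of true gradients up to a bounded leftover, which is the intuition behind EF's convergence guarantees.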
This scheme is often shown to recover the model's convergence rate and resulting accuracy Seide et al. (2014). An orthogonal proposal that works with persistent clients, and is also applicable to QUIC-FL, is to encode the difference between the current gradient and the previous one, instead of directly compressing the gradient Mishchenko et al. (2019); Gorbunov et al. (2021). Broadly speaking, this allows a compression error that is proportional to the size of the difference rather than the gradient, and can decrease the error if consecutive gradients are similar to each other. When running distributed learning in cluster settings, recent works show how in-network aggregation can accelerate the learning process Sapio et al. (2021); Lao et al. (2021); Segal et al. (2021). Intuitively, switches are designed to move data at high speed, and recent advances in switch programmability enable them to perform simple aggregation operations, such as summation, while processing the data. Another line of work focuses on sparsifying the gradients before compressing them Fei et al. (2021); Stich et al. (2018b); Aji & Heafield (2017). Intuitively, in some learning settings, many of the coordinates are small, and we can improve the accuracy-to-bandwidth tradeoff by removing all small coordinates prior to compression. Another form of sparsification is random sampling, which allows us to avoid sending the coordinate indices Vargaftik et al. (2022); Konečný et al. (2017). We note that combining such approaches with QUIC-FL is straightforward, as we can compress the non-zero entries of the sparsified vectors. Combining several techniques, including warm-up training, gradient clipping, momentum factor masking, momentum correction, and deep gradient compression Lin et al. (2018), reportedly saves two orders of magnitude in the bandwidth required for distributed learning.
Another promising orthogonal approach is to leverage shared randomness to get the clients' compression to yield errors in opposite directions, thus making them cancel out and lowering the overall NMSE Suresh et al. (2022). The QUIC-FL algorithm, based on the output of the solver (see §3), uses non-uniform quantization, i.e., quantization levels that are not uniformly spaced. Indeed, recent works observed that non-uniform quantization improves the estimation accuracy and accelerates the learning convergence Vargaftik et al. (2022); Ramezani-Kebrya et al. (2021). We refer the reader to the survey by Kairouz et al. (2019).

B QUIC-FL'S vNMSE PROOF

In this appendix, we analyze how the expected squared quantization error for a bounded-support normal random variable relates to the vNMSE of our algorithm. Namely, let χ = E[(Z − Ẑ)²] denote the error of the quantization of a normal random variable Z ~ N(0, 1). Our analysis is general: it covers the quantization methods presented in Sections 3.2-3.3, but it is applicable to any unbiased method that is used following a uniform random rotation. We show that QUIC-FL's vNMSE is essentially χ plus a small additional additive error term (arising because the rotation does not yield exactly normally distributed coordinates, as explained) that goes to 0 quickly as the dimension increases. We bound the additional error caused by using the Randomized Hadamard Transform in Section 3.5 and Appendix D.

Theorem 1. It holds that vNMSE ≤ χ + O(√(log d / d)).

Proof. The proof follows similar lines to those of Vargaftik et al. (2021; 2022). However, here the vNMSE expression is different and somewhat simpler, as it takes advantage of our unbiased quantization technique. A rotation preserves a vector's Euclidean norm. Thus, according to Algorithms 1 and 2, it holds that

∥x − x̂∥₂² = ∥R(x − x̂)∥₂² = ∥R(x) − R(x̂)∥₂² = ∥(∥x∥₂/√d) · Z − (∥x∥₂/√d) · Ẑ∥₂² = (∥x∥₂²/d) · ∥Z − Ẑ∥₂².

Taking expectation and dividing by ∥x∥₂² yields

vNMSE ≔ E[∥x − x̂∥₂² / ∥x∥₂²] = (1/d) · E[∥Z − Ẑ∥₂²] = (1/d) · E[∑_{i=1}^d (Z[i] − Ẑ[i])²] = (1/d) · ∑_{i=1}^d E[(Z[i] − Ẑ[i])²].

Let Z̃ = (Z̃₁, ..., Z̃_d) be a vector of independent N(0, 1) random variables. Then, the distribution of each coordinate Z[i] is given by Z[i] = √d · Z̃[i] / ∥Z̃∥₂ (e.g., see Vargaftik et al. (2021); Muller (1959)). This means that all coordinates of Z, and thus all coordinates of Ẑ, follow the same distribution. Thus, without loss of generality, we obtain

vNMSE ≔ E[∥x − x̂∥₂² / ∥x∥₂²] = E[(Z[0] − Ẑ[0])²] = E[((√d/∥Z̃∥₂) · Z̃[0] − Ẑ[0])²].
For some 0 < α < 1/2, denote the event A = {d · (1 − α) ≤ ∥Z̃∥₂² ≤ d · (1 + α)}, and let A^c be the complementary event of A. By Lemma D.2 in Vargaftik et al. (2022), it holds that P(A^c) ≤ 2 · e^{−α²·d/8}. Also, by the law of total expectation,

E[((√d/∥Z̃∥₂) · Z̃[0] − Ẑ[0])²] ≤ E[((√d/∥Z̃∥₂) · Z̃[0] − Ẑ[0])² | A] · P(A) + E[((√d/∥Z̃∥₂) · Z̃[0] − Ẑ[0])² | A^c] · P(A^c) ≤ E[((√d/∥Z̃∥₂) · Z̃[0] − Ẑ[0])² | A] · P(A) + M · P(A^c),

where M = (vNMSE_max)² and vNMSE_max is the maximal value in the server's reconstruction table (i.e., max(r)), which is a constant independent of the vector's dimension. Next,

E[((√d/∥Z̃∥₂) · Z̃[0] − Ẑ[0])² | A] = E[((Z̃[0] − Ẑ[0]) + (√d/∥Z̃∥₂ − 1) · Z̃[0])² | A] = E[(Z̃[0] − Ẑ[0])² | A] + 2 · E[(Z̃[0] − Ẑ[0]) · (√d/∥Z̃∥₂ − 1) · Z̃[0] | A] + E[((√d/∥Z̃∥₂ − 1) · Z̃[0])² | A].

C.1. AN ILLUSTRATIVE EXAMPLE

Based on our examination of solver outputs, we determined an alternative approach that does not stochastically quantize each coordinate to a quantile as above and empirically performs better. We explain the process by first considering an example. We consider the setting of p = 1/512 (T_p ≈ 3.097), m = 512 quantiles, b = 2 bits per coordinate, and ℓ = 2 bits of shared randomness. One optimal solution for the server is given below:

        x = 0   x = 1   x = 2   x = 3
h = 0   -5.48   -1.23    0.164   1.68
h = 1   -3.04   -0.831   0.490   2.18
h = 2   -2.18   -0.490   0.831   3.04
h = 3   -1.68   -0.164   1.23    5.48

Table 2: Optimal server values (r_{h,x}) for x ∈ X_2, h ∈ H_2 when p = 1/512 and m = 512, rounded to 3 significant digits. For example, when Z = 0, the server will estimate one of the values −1.23, −0.831, 0.831, or 1.23, based on the shared randomness and the message received from the client.

Given this table, by symmetry, if Z = 0 we can send X = 1 if H ≤ 1 and X = 2 otherwise, which is formally written as S_x(H, 0) = 1 if (x = 1 ∧ H ≤ 1) ∨ (x = 2 ∧ H > 1), and 0 otherwise. Indeed, we have that E[Ẑ] = (1/4) · ∑_h r_{h,X} = 0.
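As a sanity check, the unbiasedness of this client rule at Z = 0 can be verified directly from Table 2 (the helper name is ours):

```python
# Server reconstruction table r[h][x] from Table 2 (p = 1/512, m = 512)
R = [
    [-5.48, -1.23, 0.164, 1.68],   # h = 0
    [-3.04, -0.831, 0.490, 2.18],  # h = 1
    [-2.18, -0.490, 0.831, 3.04],  # h = 2
    [-1.68, -0.164, 1.23, 5.48],   # h = 3
]

def client_message_for_zero(h):
    """The deterministic rule from the text: X = 1 if H <= 1, else X = 2."""
    return 1 if h <= 1 else 2

# Average over the four equally likely shared-randomness values
estimate = sum(R[h][client_message_for_zero(h)] for h in range(4)) / 4
```

The four selected estimates are −1.23, −0.831, 0.831, and 1.23, which average to 0, so the rule is unbiased at Z = 0.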
Now, suppose that Z > 0 (the negative case is symmetric); the client can increase the server estimate's expected value (compared with the above choice of X) by moving probability mass to larger x values for some (or all) of the options for H. For any Z ∈ (−T_p, T_p), there are infinitely many client alternatives that would yield an unbiased estimate; for example, for Z = 0.1 the solver's solution admits multiple client options, and as Z increases, probability mass keeps shifting toward messages with larger estimates (subsequently from x = 2, h = 0, and so on). This process is visualized in Figure 6. Note that the S_x(h, z) values are piecewise linear as a function of z and, further, these values either go from 0 to 1, from 1 to 0, or from 0 to 1 and back again (all of which follow from our description). We can turn this description into formulae; we defer this mathematical interpretation to §C.2. The final algorithm, named QUIC-FL, is given by Algorithm 2 (based on the formula given in the appendix).

[Figure 6: The solver's client algorithm (for b = ℓ = 2, m = 512, p = 1/512) for the quantiles {s_{h,q,x}}_{h,q,x}. Markers correspond to quantiles in Q_m, and the lines illustrate our interpolation.]

C.2 DERIVATION OF EQUATIONS FOR ALGORITHM 2

As we described in §C.1 through an example, and as illustrated by Figure 6, we have found, by examining the solver's solutions for our parameter range, that the optimal approach for the client has a structure that we can generalize. This allows us to readily interpolate the algorithm to non-quantile values $Z \notin Q_m$ without stochastically quantizing the coordinate to a quantile. Recall the example from §C.1 with $\ell = b = 2$ and its server table (Table 2).

As described earlier, while ideally we would like to use a fully random rotation on the $d$-dimensional sphere as the first step of our algorithms, this is computationally expensive. Instead, we suggest using a randomized Hadamard transform (RHT), which is computationally more efficient. We formally show below that we maintain some performance bounds using a single RHT. We note that some works suggest using two or three successive randomized Hadamard transforms to obtain something that should be closer to a uniform random rotation Yu et al. (2016); Andoni et al. (2015). This naturally takes more computation time.

Let $Z = R_{RHT}(x)[i]$ be a coordinate in the transformed vector. Denoting by $E_b = \mathbb{E}\left[(Z - \hat{Z}_b)^2\right]$ the mean squared error using $b$ bits per coordinate, we have $E_1 \le 4.831$, $E_2 \le 0.692$, $E_3 \le 0.131$, and $E_4 \le 0.0272$.

Proof. We present an approach to bound the MSE of encoding the coordinate $Z$, leveraging Theorem 3. Since we believe that it provides only a loose bound, we do not optimize the argument beyond showing the technique. Since the MSE, as a function of $Z$, is symmetric around 0 (as illustrated in Figure 7), we analyze the $Z \ge 0$ case. The first option is to split $[0, T]$ into intervals, e.g., $I_1 = [0, 1.5]$, $I_2 = (1.5, 2.2]$, $I_3 = (2.2, T)$.
Using Theorem 3, we get that $P_1 \triangleq \Pr\left[|Z| > 1.5\right] \le 3.2\cdot\Pr\left[|G| > 1.5\right] \le 0.427$, where $G \sim \mathcal{N}(0, 1)$, and similarly, $P_2 \triangleq \Pr\left[|Z| > 2.2\right] \le 3.2\cdot\Pr\left[|G| > 2.2\right] \le 0.089$. Next, we provide the maximal error for each bit budget $b$ and each such interval:

        b = 1   b = 2   b = 3   b = 4
[0,
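The constants $0.427$ and $0.089$ can be reproduced numerically, assuming the bound compares the tail of $Z$ against a standard normal tail scaled by the $3.2$ constant of Theorem 3:

```python
import math

def two_sided_gaussian_tail(t: float) -> float:
    """Pr[|G| > t] for G ~ N(0, 1), via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))

# Theorem 3 bounds the Rademacher-sum tail by 3.2 times the Gaussian tail.
P1 = 3.2 * two_sided_gaussian_tail(1.5)
P2 = 3.2 * two_sided_gaussian_tail(2.2)
print(P1, P2)  # roughly 0.4276 and 0.0890, matching 0.427 and 0.089 in the text
```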

F ADDITIONAL EVALUATION

Our code appears in the supplementary material and will be released as open source upon publication. We simulate 10 clients that distributively compute the top eigenvector of a matrix (i.e., the matrix rows are distributed among the clients). Specifically, each client executes a power iteration, compresses its top-eigenvector estimate, and sends it to the server. The server updates the next estimated eigenvector by the averaged differences (of each client's estimate from the previous round's eigenvector) and scales the update by a learning rate of 0.1. The server then sends the estimated eigenvector to the clients, and the next round begins.
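A minimal sketch of this simulation is given below. It is not the paper's released code: the compressor here is a plain unbiased stochastic quantizer standing in for QUIC-FL, and the matrix, client split, and learning rate merely follow the description above.

```python
import numpy as np

def stochastic_quantize(v: np.ndarray, bits: int = 4) -> np.ndarray:
    """Unbiased stochastic quantization onto a uniform grid over v's range.
    A simple stand-in for QUIC-FL, used only to illustrate the protocol."""
    levels = 2 ** bits - 1
    lo, hi = v.min(), v.max()
    if hi == lo:
        return v.copy()
    scaled = (v - lo) / (hi - lo) * levels                 # map to [0, levels]
    floor = np.floor(scaled)
    q = floor + (np.random.rand(v.size) < scaled - floor)  # round up w.p. frac
    return lo + q / levels * (hi - lo)

def distributed_power_iteration(rows_per_client, rounds=50, lr=0.1, bits=4):
    """Clients hold row slices of a matrix; the server aggregates compressed
    differences between their power-iteration estimates and its current one."""
    d = rows_per_client[0].shape[1]
    v = np.ones(d) / np.sqrt(d)                  # server's current estimate
    for _ in range(rounds):
        diffs = []
        for A_c in rows_per_client:              # one local power iteration
            u = A_c.T @ (A_c @ v)
            u /= np.linalg.norm(u)
            diffs.append(stochastic_quantize(u - v, bits))
        v = v + lr * np.mean(diffs, axis=0)      # averaged compressed diffs
        v /= np.linalg.norm(v)
    return v

np.random.seed(0)                                # seeds the quantizer
rng = np.random.default_rng(0)
spike = rng.standard_normal(32)
spike /= np.linalg.norm(spike)                   # planted dominant direction
A = rng.standard_normal((100, 32)) + 8.0 * rng.standard_normal((100, 1)) * spike
clients = np.array_split(A, 10)                  # 10 clients share the rows
v = distributed_power_iteration(clients, rounds=200)
# v approximates the top eigenvector of A^T A (up to sign)
```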

F.1 DISTRIBUTED POWER ITERATION

Figure 8 presents the L2 error of the eigenvector obtained by each compression scheme, compared to the eigenvector computed without compression. The results cover bit budgets $b$ from one to four bits for both the MNIST and CIFAR-10 datasets Krizhevsky et al. (2009); LeCun et al. (1998; 2010). Each distributed power iteration simulation runs for 50 rounds on MNIST and for 200 rounds on CIFAR-10. As shown, QUIC-FL's accuracy is competitive with that of EDEN (especially for $b \ge 2$) and considerably better than that of the other algorithms that offer fast decoding time. Also, Kashin-TF is not unbiased (as illustrated by Figure 1) and is therefore less competitive for a larger number of clients.

F.2 FEDERATED LEARNING: EXTENDED COMPARISON WITH QSGD

We repeat the experiments from Figure 4 and Figure 5, adding another curve for QSGD, which uses twice the bandwidth of the other algorithms (one bit for the sign and another for the stochastic quantization). As shown in Figures 9 and 10, even with the additional bandwidth, QSGD's accuracy is lower than that of all algorithms except CS. We note that QSGD also has a more accurate variant that uses variable-length encoding Alistarh et al. (2017). However, it is not GPU-friendly and, as with the other variable-length encoding schemes discussed previously, we do not include it in the experiment.

G ANALYSIS OF THE BOUNDED STOCHASTIC QUANTIZATION TECHNIQUE

In this appendix, we analyze the Bounded Stochastic Quantization (BSQ) approach, which sends all coordinates outside a range $[-T, T]$ exactly and performs a standard stochastic quantization for the rest. Let $p \in (0, 1)$ and denote $T_p = \frac{\|x\|_2}{\sqrt{d\cdot p}}$; notice that there can be at most $d\cdot p$ coordinates outside $[-T_p, T_p]$. Using $b$ bits, we split this range into $2^b - 1$ intervals of size $\frac{2T_p}{2^b-1}$.



For any fixed dimension, we could optimize the quantization (using the solver) for the exact distribution of a rotated coordinate, which is known to be a shifted beta distribution. However, this would require a different quantization for each dimension, and the difference from using a normal distribution is negligible even for dimensions in the hundreds. Note that, as we begin with a normal random variable $Z$, bounding its support is effective at removing the long, small-probability tails. Additionally, as the learning process typically uses 16-64 bit floats, and we further need to send the coordinate indices, sending each coordinate exactly is expensive; we thus focus on small $p$ values. If instead we were quantizing a shifted-beta random variable $B$, we would get $vNMSE = \mathbb{E}\left[(B - \hat{B})^2\right]$. We note that using entropy encoding, one may use more than $2^b$ messages (and thereby reduce the error) as long as the resulting entropy is bounded by $b$ (e.g., Suresh et al. (2017); Vargaftik et al. (2022); Alistarh et al. (2017)). As we aim to design a quick and GPU-friendly compression scheme, we do not investigate entropy encoding further. We used the Gekko Beal et al. (2018) software package, which provides a Python wrapper to the APMonitor Hedengren et al. (2014) environment, running the solvers IPOPT IPO and APOPT APO. A crucial ingredient in getting a human-readable solution from the solver is that we, without loss of generality, force monotonicity in both $h$ and $x$, i.e., $(x \ge x') \wedge (h \ge h') \implies r_{h,x} \ge r_{h',x'}$. Further, note that Table 2 is symmetric. We found the tables were symmetric for small $\ell$ and $m$, and then forced symmetry in order to reduce the model size for larger values. We use this symmetry in our interpolation.
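The monotonicity and symmetry properties can be checked directly on Table 2; the snippet below does so (the table values are copied from the example in Appendix C):

```python
import numpy as np

# Server reconstruction values r[h, x] from Table 2 (b = l = 2, p = 1/512, m = 512).
r = np.array([
    [-5.48, -1.23, 0.164, 1.68],
    [-3.04, -0.831, 0.490, 2.18],
    [-2.18, -0.490, 0.831, 3.04],
    [-1.68, -0.164, 1.23, 5.48],
])

# Monotonicity constraint: (x >= x') and (h >= h') implies r[h, x] >= r[h', x'].
monotone = all(
    r[h, x] >= r[hp, xp]
    for h in range(4) for x in range(4)
    for hp in range(h + 1) for xp in range(x + 1)
)

# Point symmetry: r[h, x] = -r[2^l - 1 - h, 2^b - 1 - x].
symmetric = bool(np.allclose(r, -r[::-1, ::-1]))

print(monotone, symmetric)  # True True
```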



Figure 1: The vNMSE of QUIC-FL as a function of the bit budget, fraction p, and shared random bits ℓ.

3.5 HADAMARD

Similarly to previous rotation-based compression algorithms Suresh et al. (2017); Vargaftik et al. (2021; 2022), we propose to use the Randomized Hadamard Transform (RHT) (Ailon & Chazelle, 2009)

Figure 2: Comparison to alternatives with $n$ clients that have the same $LogNormal(0, 1)$ input vector Vargaftik et al. (2021; 2022). The default values are $n = 256$ clients, $b = 4$ bit budget, and $d = 2^{20}$ dimensions.

Figure 3: FedAvg over the Shakespeare next-word prediction task at various bit budgets (rows). We report training accuracy per round with a rolling mean window of 200 rounds. The second row zooms in on the last 100 rounds (QSGD is not included in the zoom since it performed poorly).

Figure 4: Train and test accuracy for CIFAR-10 and CIFAR-100 with 10 persistent clients (i.e., silos) and $b = 1$.

); Alistarh et al. (2018); Richtárik et al. (2021) and enables biased compressors such as Top-k Stich et al. (2018a).

); Konečný et al. (2017); Wang et al. (2021) for an extensive review of the current state of the art and challenges.

Figure 7: Expected squared error as a function of the value of $Z$ (for $p = \frac{1}{512}$, $m = 512$).

Figure 8: Distributed power iteration of MNIST and CIFAR-10 with 10 and 100 clients.

Figure 9: Train and test accuracy for CIFAR-10 and CIFAR-100 with 10 persistent clients (i.e., silos) and $b = 1$.

Each interval has size $\frac{2T_p}{2^b-1}$, meaning that each coordinate's expected squared error is at most $\left(\frac{2T_p}{2^b-1}\right)^2 / 4$. The MSE of the algorithm is therefore

This problem generalizes to the Distributed Mean Estimation (DME) problem, where $n$ clients have vectors $\left\{x_c \in \mathbb{R}^d\right\}$ that they communicate to a centralized server. We are interested in minimizing the Normalized Mean Squared Error (NMSE), defined as
\[
NMSE = \frac{\mathbb{E}\left[\left\|\frac{1}{n}\sum_{c=1}^{n} x_c - \hat{x}_{avg}\right\|_2^2\right]}{\frac{1}{n}\sum_{c=1}^{n}\|x_c\|_2^2},
\]
where $\hat{x}_{avg}$ is the server's estimate of the average.
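For concreteness, the snippet below computes the NMSE of a mean estimate, assuming the normalization by the clients' average squared norm that this line of work uses:

```python
import numpy as np

def nmse(client_vecs: np.ndarray, estimate: np.ndarray) -> float:
    """NMSE of a mean estimate: squared error of the estimated average,
    normalized by the clients' average squared norm (assumed convention)."""
    true_mean = client_vecs.mean(axis=0)
    denom = np.mean(np.sum(client_vecs ** 2, axis=1))
    return float(np.sum((true_mean - estimate) ** 2) / denom)

rng = np.random.default_rng(0)
xs = rng.standard_normal((16, 128))          # n = 16 clients, d = 128
exact = xs.mean(axis=0)
noisy = exact + 0.01 * rng.standard_normal(128)
print(nmse(xs, exact), nmse(xs, noisy) > 0)  # 0.0 True
```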

Algorithm 2 QUIC-FL

In our case, and in line with previous works Vargaftik et al. (2021; 2022), we find empirically that one RHT appears to suffice. Unlike these works, our algorithm remains provably unbiased and maintains strong NMSE guarantees in this case. Determining better provable bounds using two or more RHTs is left as an open problem.

Theorem 2. Let $x \in \mathbb{R}^d$, let $R_{RHT}(x)$ be its randomized Hadamard transform, and let $Z = R_{RHT}(x)[i]$ be a coordinate in the transformed vector. For any $p$, $\Pr\left[Z \notin [-T_p, T_p]\right] \le 3.2p$.

Proof. Follows from the following theorem.

Theorem 3 (Bentkus & Dzindzalieta (2015)). Let $\epsilon_1, \ldots, \epsilon_d$ be i.i.d. Rademacher random variables and let $a \in \mathbb{R}^d$ such that $\|a\|$
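Theorem 2 can be checked empirically with an explicit randomized Hadamard transform. The sketch below (a textbook fast Walsh-Hadamard transform, not the paper's implementation) measures the fraction of rotated coordinates that fall outside $[-T_p, T_p]$:

```python
import numpy as np

def fwht(v: np.ndarray) -> np.ndarray:
    """Iterative fast Walsh-Hadamard transform (unnormalized);
    len(v) must be a power of two."""
    v = v.copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a, b = v[i:i + h].copy(), v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v

def rht(x: np.ndarray, rng) -> np.ndarray:
    """Randomized Hadamard transform: random sign flips, then H / sqrt(d),
    which is norm-preserving."""
    signs = rng.choice([-1.0, 1.0], size=len(x))
    return fwht(signs * x) / np.sqrt(len(x))

rng = np.random.default_rng(0)
d, p = 1024, 1 / 16
x = rng.standard_normal(d) * rng.exponential(1.0, size=d)  # arbitrary input
T_p = np.linalg.norm(x) / np.sqrt(d * p)

# Average fraction of rotated coordinates outside [-T_p, T_p] over 200 draws;
# Theorem 2 bounds the per-coordinate probability by 3.2 * p.
outside = np.mean([np.mean(np.abs(rht(x, rng)) > T_p) for _ in range(200)])
print(outside <= 3.2 * p)  # True
```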

For each interval $I$ and bit budget $b$, the table gives the maximal MSE, i.e., $\max_{z \in I} \mathbb{E}\left[(z - \hat{z})^2\right]$. Note that for any $b \in \{1, 2, 3, 4\}$, the MSEs in $I_3$ are strictly larger than those in $I_2$, which are strictly larger than those in $I_1$. This allows us to derive formal bounds on the error. For example, for $b = 1$, we have that the error is bounded by

Hyperparameters for the Shakespeare next-word prediction experiments.


Note that while both $S^1$ and $S^2$ produce unbiased estimates, their expected squared errors differ. Further, since $0.1 \notin Q_m$, the solver's output does not directly indicate what the optimal client algorithm is, even if the server table is fixed. Unlike Algorithm 1, which stochastically quantizes $Z$ to either $q^-$ or $q^+$, we studied the solver's output $\{s_{h,q,x}\}_{h,q,x}$ to interpolate the client to non-quantile values.

The approach we take corresponds to the following process. We move probability mass from the leftmost, then uppermost, entry with mass to its right neighbor in the server table. So, for example, in Table 2, as $Z$ increases from 0, we first move mass from the entry $x = 1, h = 2$ to the entry $x = 2, h = 2$. That is, the client, based on its private randomness, increases the probability of message $x = 2$ and decreases the probability of message $x = 1$ when $h = 2$. The amount of mass moved is always chosen to maintain unbiasedness. At some point, as $Z$ increases, all of the probability mass will have moved, and then we start moving mass from $x = 1, h = 3$ similarly.

When $Z = 0$, the client would send $X = 2$ if $H \le 1$ and $X = 1$ otherwise, leading to $\mathbb{E}\left[\hat{Z}\right] = 0$. Now consider $Z = 0.1$; the client can increase the expected server's estimate by changing its behaviour for the leftmost, then uppermost, cell, namely $X = 1, H = 2$. Specifically, by following

In general, by applying the monotonicity constraints, we observed a common pattern in the optimal solution found by the solver for any $b$ and $\ell$ (in the range we tested). Namely, when the server table is monotone, the optimal solution deterministically selects the message to send for all but (at most) one shared randomness value. For example, $S^2_x$ above deterministically selects the message if $H \ne 2$ (sending 1 if $H = 3$ and 2 if $H \in \{0, 1\}$) and stochastically selects between $x = 1$ and $x = 2$ when $H = 2$. Furthermore, the shared randomness value for which we should stochastically select the message is easy to calculate.
Specifically, let $x(Z) \in \mathcal{X}_b$ denote the maximal value such that sending $x(Z)$ for all $H$ would result in not overestimating $Z$ in expectation, that is,
\[
x(Z) = \max\left\{x \in \mathcal{X}_b : \frac{1}{2^\ell}\sum_{h \in \mathcal{H}_\ell} r_{h,x} \le Z\right\}.
\]
We note that such an $x(Z)$ must exist because, by design, the solver's output for the optimal solution always satisfies $\frac{1}{2^\ell}\cdot\sum_{h=0}^{2^\ell-1} r_{h,0} = -T$ and $\frac{1}{2^\ell}\cdot\sum_{h=0}^{2^\ell-1} r_{h,2^b-1} = T$. (Otherwise, the solution would be either infeasible or suboptimal.) In particular, this means that we get $x(Z) \in \mathcal{X}_b \setminus \left\{2^b - 1\right\}$ for all $Z \in [-T, T)$.

Next, let $h(Z) \in \mathcal{H}_\ell$ denote the maximal value for which sending $x(Z) + 1$ for all $H < h$ and $x(Z)$ for $H \ge h$ would underestimate $Z$ in expectation (for convenience, we consider $r_{h,2^b} = \infty$ for all $h \in \mathcal{H}_\ell$). Formally,
\[
h(Z) = \max\left\{h \in \mathcal{H}_\ell : \frac{1}{2^\ell}\left(\sum_{h' < h} r_{h',x(Z)+1} + \sum_{h' \ge h} r_{h',x(Z)}\right) \le Z\right\}.
\]
For example, consider $Z = 0.1$ in the $b = \ell = 2$ example above. We have that $x(0.1) = 1$ since $\frac{1}{2^2}\cdot\sum_{h \in \mathcal{H}_2} r_{h,1} \le 0.1$ and $\frac{1}{2^2}\cdot\sum_{h \in \mathcal{H}_2} r_{h,2} > 0.1$. We also have that $h(0.1) = 2$, as $\frac{1}{4}\left(0.164 + 0.49 + (-0.49) + (-0.164)\right) \le 0.1$ and $\frac{1}{4}\cdot\left(0.164 + 0.49 + 0.831 + (-0.164)\right) > 0.1$. Similarly, for $Z = 3$ we get $x(3) = 2$ and $h(3) = 3$. Finally, for $H = h(Z)$, the client stochastically selects between $x(Z)$ and $x(Z) + 1$. However, notice that the expected value of this quantization (given that $H = h$) may not be exactly $Z$. Namely, as the algorithm needs to satisfy $\mathbb{E}\left[\hat{Z}\right] = Z$,
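Using the Table 2 values, $x(Z)$ and $h(Z)$ can be computed directly. The sketch below reproduces the worked examples $x(0.1) = 1$, $h(0.1) = 2$ and $x(3) = 2$, $h(3) = 3$:

```python
import math

# Table 2 server values r[h][x] (b = l = 2, p = 1/512, m = 512).
r = [
    [-5.48, -1.23, 0.164, 1.68],
    [-3.04, -0.831, 0.490, 2.18],
    [-2.18, -0.490, 0.831, 3.04],
    [-1.68, -0.164, 1.23, 5.48],
]
H = range(4)  # shared-randomness values, each with probability 1/4

def x_of(z: float) -> int:
    """Maximal x whose column average does not overestimate z."""
    return max(x for x in range(4) if sum(r[h][x] for h in H) / 4 <= z)

def h_of(z: float) -> int:
    """Maximal h such that sending x(z)+1 for H < h and x(z) for H >= h still
    underestimates z in expectation (column 2^b is treated as +infinity)."""
    x = x_of(z)
    entry = lambda h, j: r[h][j] if j < 4 else math.inf
    return max(
        h for h in H
        if sum(entry(hp, x + 1) if hp < h else entry(hp, x) for hp in H) / 4 <= z
    )

print(x_of(0.1), h_of(0.1))  # 1 2
print(x_of(3.0), h_of(3.0))  # 2 3
```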

The interpolated algorithm works as follows: If

Knowing the desired expectation, the overall client's algorithm is then defined as follows. Indeed, by our choice of $\mu$, the algorithm is guaranteed to be unbiased for all $Z \in [-T, T]$. This gives the final algorithm, whose pseudo-code appears in Algorithm 2. As in Algorithm 1, each client $c$ rotates its vector and scales it by

This gives a result of $vNMSE \le \frac{1}{p\cdot(2^b-1)^2}$. Let $r$ be the representation length of each coordinate in the input vector (e.g., $r = 32$ for single-precision floats, $r = 16$ for half-precision, or $r = 64$ for double precision); we get that BSQ sends a message with less than $p\cdot r + b$ bits per coordinate. Further, this method has $O(d)$ time for encoding and decoding and is GPU-friendly.
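A minimal BSQ sketch, written from the description above rather than taken from the paper's code, that empirically checks the $vNMSE \le \frac{1}{p(2^b-1)^2}$ bound:

```python
import numpy as np

def bsq(x: np.ndarray, b: int, p: float, rng) -> np.ndarray:
    """Bounded Stochastic Quantization sketch: coordinates outside [-T_p, T_p]
    are sent exactly; the rest are stochastically quantized onto a uniform grid
    with 2^b - 1 intervals over [-T_p, T_p]. Returns the decoded vector."""
    d = len(x)
    T = np.linalg.norm(x) / np.sqrt(d * p)
    out = x.copy()
    inside = np.abs(x) <= T
    levels = 2 ** b - 1
    scaled = (x[inside] + T) / (2 * T) * levels             # map to [0, levels]
    floor = np.floor(scaled)
    q = floor + (rng.random(scaled.size) < scaled - floor)  # unbiased rounding
    out[inside] = q / levels * (2 * T) - T
    return out

rng = np.random.default_rng(0)
d, b, p = 4096, 2, 1 / 64
x = rng.standard_normal(d)

# Empirical vNMSE over repetitions vs. the 1 / (p * (2^b - 1)^2) bound.
errs = [np.sum((bsq(x, b, p, rng) - x) ** 2) for _ in range(50)]
vnmse = np.mean(errs) / np.sum(x ** 2)
print(vnmse <= 1 / (p * (2 ** b - 1) ** 2))  # True: the bound holds
```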

