Federated Learning With Quantized Global Model Updates

Abstract

We study federated learning (FL), which enables mobile devices to utilize their local datasets to collaboratively train a global model with the help of a central server, while keeping the data localized. At each iteration, the server broadcasts the current global model to the devices for local training, and aggregates the local model updates from the devices to update the global model. Previous work on the communication efficiency of FL has mainly focused on the aggregation of model updates from the devices, assuming perfect broadcasting of the global model. In this paper, we instead consider broadcasting a compressed version of the global model. This further reduces the communication cost of FL, which can be particularly limited when the global model is transmitted over a wireless medium. We introduce a lossy FL (LFL) algorithm, in which both the global model and the local model updates are quantized before being transmitted. We analyze the convergence behavior of the proposed LFL algorithm assuming the availability of accurate local model updates at the server. Numerical experiments show that the proposed LFL scheme, which quantizes the global model update (with respect to the global model estimate at the devices) rather than the global model itself, significantly outperforms existing schemes that quantize the global model in the PS-to-device direction. Moreover, the performance loss of the proposed scheme is marginal compared to the fully lossless approach, in which the PS and the devices transmit their messages without any quantization.

Under review as a conference paper at ICLR 2021

There is a fast-growing body of literature on the communication efficiency of FL targeting bandwidth-restricted devices. Several studies address this issue by considering communications with rate limitations, and propose different compression and quantization techniques.

1. Introduction

Federated learning (FL) enables wireless devices to collaboratively train a global model by utilizing locally available data and computational capabilities under the coordination of a parameter server (PS), while the data never leaves the devices McMahan & Ramage (2017). In FL with M devices, the goal is to minimize a loss function F(θ) = Σ_{m=1}^{M} (B_m/B) F_m(θ) with respect to the global model θ ∈ R^d, where F_m(θ) = (1/B_m) Σ_{u∈B_m} f(θ, u) is the loss function at device m, with B_m representing device m's local dataset of size B_m, B ≜ Σ_{m=1}^{M} B_m, and f(·,·) an empirical loss function. Having access to the global model θ, device m utilizes its local dataset and performs multiple iterations of stochastic gradient descent (SGD) to minimize the local loss function F_m(θ); it then sends the local model update to the server, which aggregates the local updates from all the devices to update the global model. FL mainly targets mobile applications at the network edge, where the wireless communication links connecting the devices to the network are typically limited in bandwidth and power, and suffer from various channel impairments such as fading, shadowing, and interference; developing an FL framework with limited communication requirements is therefore vital. While communication-efficient FL has been widely studied, prior works mainly focused on the device-to-PS links, assuming perfect broadcasting of the global model to the devices at each iteration. In this paper, we design an FL algorithm aiming to reduce the cost of both PS-to-device and device-to-PS communications. To highlight the importance of quantization in the PS-to-device direction, note that some devices simply may not have sufficient bandwidth to receive the global model update when the model size is relatively large, particularly in the wireless setting where the devices are far from the base station. This would lead to the consistent exclusion of these devices, resulting in significant performance loss. Moreover, the impact of quantization in the device-to-PS direction is less severe, thanks to the averaging of the local updates at the PS. FL with compressed global model transmission has been studied recently in Caldas et al. (2019); Tang et al. (2019), aiming to alleviate the communication footprint from the PS to the devices. The global model parameters are relatively skewed/diverse, and the efficiency of quantization diminishes significantly when the peak-to-average ratio of the parameters is large. To overcome this, in Caldas et al.
(2019) the PS first employs a linear transform to spread the information of the global model vector more evenly among its dimensions, and broadcasts a quantized version of the resulting vector; the devices then apply the inverse linear transform to estimate the global model. We highlight that this approach entails a relatively high computational overhead, due to employing the linear transform at the PS and its inverse at the devices, and this overhead grows with the size of the model. Furthermore, the performance evaluation in Caldas et al. (2019) is limited to experimental results. On the other hand, in Tang et al. (2019) the PS broadcasts the quantized global model with error accumulation to compensate for the quantization error. Our contributions With the exception of Caldas et al. (2019); Tang et al. (2019), the literature on FL assumes perfect broadcasting of the global model from the PS to the devices. Under this assumption, no matter what type of local update or device-to-PS communication strategy is used, all the devices are synchronized with the same global model at each iteration. In this paper, we instead consider broadcasting a quantized version of the global model update by the PS, which provides the devices with a lossy estimate of the global model (rather than an accurate one) with which to perform local training. This further reduces the communication cost of FL, which can be particularly limited for transmission over a wireless medium while serving a massive number of devices. It is also interesting to investigate the impact of various hyperparameters on the performance of FL with lossy broadcasting of the global model, since FL involves transmission over wireless networks with limited bandwidth. We introduce a lossy FL (LFL) algorithm, where at each iteration the PS broadcasts a compressed version of the global model update to all the devices through quantization.
To be precise, the PS exploits the knowledge of the last global model estimate available at the devices as side information to quantize the global model update. The devices recover an estimate of the current global model by combining the received quantized global model update with their previous estimate, perform local training using this estimate, and return their local model updates, again employing quantization. The PS updates the global model after receiving the quantized local model updates from the devices. We provide a convergence analysis of the LFL algorithm, investigating the impact of lossy broadcasting on the performance of FL. Numerical experiments on the MNIST and CIFAR-10 datasets illustrate the efficiency of the proposed LFL algorithm. We observe that the proposed LFL scheme, which yields significant communication cost savings, provides a promising performance with no visible gap to that of the fully lossless scenario, where the communication in both the PS-to-device and device-to-PS directions is assumed to be perfect. It is also shown that the proposed LFL scheme significantly outperforms the schemes introduced in Caldas et al. (2019) and Tang et al. (2019) considering compression from the PS to the devices. The proposed LFL algorithm differs from the approaches in Caldas et al. (2019); Tang et al. (2019), since we propose broadcasting the global model update, with respect to the previous estimate at the devices, rather than the global model itself. We remark that the global model update has less variability/variance and a smaller peak-to-average ratio than the global model (see Figure 2), and hence, for the same communication load, the devices can obtain a more accurate estimate of the global model. However, this requires all the devices to track the global model at each iteration, even if they do not participate in the learning process by sending their local updates.
We argue that broadcasting the global model update to the whole set of devices, rather than to a randomly chosen subset, introduces limited additional communication cost, as broadcasting is typically more efficient than sending independent information to individual devices. Moreover, in practice, the subset of participating devices remains the same for a number of iterations, until a device leaves or joins. Our algorithm can easily be adapted to such scenarios by sending the global model, rather than the model update, every time the subset of devices changes. Also, compared to the approach in Caldas et al. (2019), the LFL algorithm requires a significantly smaller computational overhead. Furthermore, unlike Caldas et al. (2019), we provide an in-depth convergence analysis of the proposed LFL algorithm. The advantage of the proposed LFL algorithm over the approaches introduced in Caldas et al. (2019); Tang et al. (2019) is shown numerically: despite its significantly smaller communication load, it provides considerably higher accuracy.

Notation

The set of real numbers is denoted by R. For x ∈ R, |x| returns the absolute value of x. For a vector of real numbers x, the largest and smallest absolute values among the entries of x are represented by max{|x|} and min{|x|}, respectively. For an integer i, we let [i] ≜ {1, 2, . . . , i}. The ℓ₂-norm of a vector x is denoted by ‖x‖₂.

2. Lossy Federated Learning (LFL) Algorithm

We consider a lossy PS-to-device transmission, in which the PS sends a compressed version of the global model to the devices. This reduces the communication cost, and can be particularly beneficial when the PS resources are limited and/or the communication takes place over a bandwidth-constrained medium. We denote the estimate of the global model θ(t) at the devices by θ̂(t), where t represents the global iteration count. Having recovered θ̂(t), the devices perform a τ-step SGD with respect to their local datasets, and transmit their local model updates to the PS using quantization while accumulating the quantization error.

2.1. Global Model Broadcasting

In the proposed LFL algorithm, the PS performs stochastic quantization, similarly to the QSGD algorithm introduced in Alistarh et al. (2017) with a slight modification, to broadcast the information about the global model to the devices. In particular, at global iteration t, the PS aims to broadcast the global model update θ(t) − θ̂(t−1) to the devices. We present the stochastic quantization technique we use, denoted by Q(·,·), in Appendix A.

Lemma 1. For the quantization function ϕ(x, q) and vector Q(x, q) given in (21b) and (22), respectively, we have

E_ϕ[ϕ(x, q)] = x,  E_ϕ[ϕ²(x, q)] ≤ x² + 1/(4q²),

E_ϕ[Q(x, q)] = x,  E_ϕ‖Q(x, q)‖₂² ≤ ‖x‖₂² + εd‖x‖₂²/(4q²),

where E_ϕ represents expectation with respect to the quantization function ϕ(·,·), and 0 ≤ ε ≤ 1 is defined as ε ≜ (max{|x|} − min{|x|})² / ‖x‖₂².

The proof of Lemma 1 is provided in Appendix B. We highlight that the value of ε depends on the skewness of the magnitudes of the entries of x: it increases for more skewed entries with a higher variance. We have ε = 0 if and only if all the entries of x have the same magnitude, and ε = 1 if and only if x has only one non-zero entry.

Given a quantization level q₁, the PS broadcasts Q(θ(t) − θ̂(t−1), q₁) to the devices at global iteration t. The devices then obtain the following estimate of θ(t):

θ̂(t) = θ̂(t−1) + Q(θ(t) − θ̂(t−1), q₁),  (2)

which is equivalent to θ̂(t) = θ(0) + Σ_{i=1}^{t} Q(θ(i) − θ̂(i−1), q₁), where we assumed θ̂(0) = θ(0). We note that, having the knowledge of the compressed vectors Q(θ(i) − θ̂(i−1), q₁), ∀i ∈ [t], the PS can also track θ̂(t) at each iteration.

Algorithm 1: Lossy federated learning (LFL)
1: for t = 0, 1, . . . do
   • Global model broadcasting
2:   PS broadcasts Q(θ(t) − θ̂(t−1), q₁)
3:   θ̂(t) = θ̂(t−1) + Q(θ(t) − θ̂(t−1), q₁)
   • Local update aggregation
4:   for m = 1, . . . , M in parallel do
5:     Device m transmits Q(Δθ_m(t) + δ_m(t), q₂) = Q(θ_m^{τ+1}(t) − θ̂(t) + δ_m(t), q₂)
6:   end for
7:   θ(t+1) = θ(t) + Σ_{m=1}^{M} (B_m/B) Q(Δθ_m(t) + δ_m(t), q₂)
8: end for
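As a concrete illustration, the following Python sketch simulates the broadcast recursion θ̂(t) = θ̂(t−1) + Q(θ(t) − θ̂(t−1), q₁): the PS quantizes only the global model update, and both the devices and the PS apply the same additive correction to track the estimate. The quantizer here is a simple deterministic stand-in (not the stochastic scheme of Appendix A), and all constants are illustrative.

```python
import numpy as np

def quantize(v, q):
    # Stand-in uniform quantizer (deterministic rounding to a grid of step
    # max|v|/q); the paper's stochastic quantizer can be dropped in here.
    vmax = np.abs(v).max()
    if vmax == 0.0:
        return v.copy()
    scale = vmax / q
    return np.round(v / scale) * scale

rng = np.random.default_rng(0)
d, q1, T = 8, 4, 5
theta = rng.normal(size=d)      # global model theta(0) at the PS
theta_hat = theta.copy()        # devices initialized with theta_hat(0) = theta(0)

for t in range(1, T + 1):
    theta = theta + 0.1 * rng.normal(size=d)   # toy global model update at the PS
    # PS broadcasts Q(theta(t) - theta_hat(t-1), q1); every device, and the PS
    # itself, adds the same quantized message to the running estimate:
    theta_hat = theta_hat + quantize(theta - theta_hat, q1)
```

Because only the (small) update is quantized, the estimation error stays on the order of the update's quantization step rather than the model's full dynamic range.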

2.2. Local Update Aggregation

After recovering θ̂(t), device m performs a τ-step local SGD, where the i-th step corresponds to

θ_m^{i+1}(t) = θ_m^{i}(t) − η_m^{i}(t) ∇F_m(θ_m^{i}(t), ξ_m^{i}(t)),  i ∈ [τ],

where θ_m^{1}(t) = θ̂(t), and ξ_m^{i}(t) denotes the local mini-batch chosen uniformly at random from the local dataset B_m. It then aims to transmit the local model update Δθ_m(t) = θ_m^{τ+1}(t) − θ̂(t) through quantization with error compensation, and transmits Q(Δθ_m(t) + δ_m(t), q₂) using a quantization level q₂, where δ_m(t) retains the quantization error and is updated as

δ_m(t+1) = Δθ_m(t) + δ_m(t) − Q(Δθ_m(t) + δ_m(t), q₂),

where we set δ_m(0) = 0. Having received Q(Δθ_m(t) + δ_m(t), q₂) from device m, ∀m ∈ [M], the PS updates the global model as

θ(t+1) = θ(t) + Σ_{m=1}^{M} (B_m/B) Q(Δθ_m(t) + δ_m(t), q₂).

Algorithm 1 summarizes the proposed LFL algorithm. Remark 1. We do not consider error compensation at the PS in LFL, since we have observed numerically that compensating the quantization error at the PS degrades performance. We argue that LFL naturally accumulates the quantization error at the PS, since it sends the quantized global model update with respect to the last global model estimate at the devices. We further highlight that the proposed approach is not limited to any specific quantization technique, and any compression technique can be used within the proposed framework.
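The error-compensation recursion above has a useful invariant: the sum of transmitted messages always equals the sum of true updates minus the current residual, so no information is permanently lost. The sketch below demonstrates this with a toy deterministic quantizer standing in for Q(·, q₂); all constants are illustrative.

```python
import numpy as np

def quantize(v, q):
    # Stand-in uniform quantizer; any bounded-error compressor works here.
    vmax = np.abs(v).max()
    if vmax == 0.0:
        return v.copy()
    scale = vmax / q
    return np.round(v / scale) * scale

rng = np.random.default_rng(0)
d, q2, T = 6, 2, 50
delta = np.zeros(d)       # delta_m(t): the carried-over quantization error
sent_sum = np.zeros(d)    # sum of quantized messages received by the PS
true_sum = np.zeros(d)    # sum of the device's actual local model updates

for t in range(T):
    update = 0.05 * rng.normal(size=d)     # toy local update Delta theta_m(t)
    msg = quantize(update + delta, q2)     # transmit Q(Delta + delta, q2)
    delta = update + delta - msg           # delta_m(t+1): residual carried forward
    sent_sum += msg
    true_sum += update
```

By construction, sent_sum + delta equals true_sum exactly, so the aggregate error at the PS never exceeds a single quantization step, regardless of how many iterations are run.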

3. Convergence Analysis of LFL Algorithm

Here we analyze the convergence behavior of LFL. For simplicity of the analysis, we assume that the devices can transmit their local updates Δθ_m(t), ∀m, accurately (in a lossless fashion) to the PS, and focus on the impact of lossy broadcasting on the convergence.

3.1. Preliminaries

We denote the optimal solution minimizing the loss function F(θ) by θ*, and the minimum loss by F*, i.e., θ* ≜ arg min_θ F(θ) and F* ≜ F(θ*). We also denote the minimum value of the local loss function at device m by F*_m. We further define Γ ≜ F* − Σ_{m=1}^{M} (B_m/B) F*_m, where Γ ≥ 0, and its magnitude indicates the bias in the data distribution across devices. For ease of analysis, we set η_m^{i}(t) = η(t). Thus, the i-th SGD step at device m is given by

θ_m^{i+1}(t) = θ_m^{i}(t) − η(t) ∇F_m(θ_m^{i}(t), ξ_m^{i}(t)),  i ∈ [τ], m ∈ [M],

where θ_m^{1}(t) = θ̂(t), given in (2). Device m transmits the local model update

Δθ_m(t) = θ_m^{τ+1}(t) − θ̂(t) = −η(t) Σ_{i=1}^{τ} ∇F_m(θ_m^{i}(t), ξ_m^{i}(t)),  m ∈ [M],

and the PS updates the global model as

θ(t+1) = θ̂(t) − η(t) Σ_{m=1}^{M} Σ_{i=1}^{τ} (B_m/B) ∇F_m(θ_m^{i}(t), ξ_m^{i}(t)).  (7)

Assumption 1. The expected squared ℓ₂-norm of the stochastic gradients is bounded, i.e.,

E_ξ ‖∇F_m(θ_m^{i}(t), ξ_m^{i}(t))‖₂² ≤ G²,  ∀i ∈ [τ], ∀m ∈ [M], ∀t.  (8)

Assumption 2. The loss functions F_1, . . . , F_M are L-smooth; that is, ∀v, w ∈ R^d,

2(F_m(v) − F_m(w)) ≤ 2⟨v − w, ∇F_m(w)⟩ + L‖v − w‖₂²,  ∀m ∈ [M].  (9)

3.2. Strongly Convex Loss Function

Here we provide a convergence analysis assuming that the loss functions F_1, . . . , F_M are µ-strongly convex; that is, ∀v, w ∈ R^d,

2(F_m(v) − F_m(w)) ≥ 2⟨v − w, ∇F_m(w)⟩ + µ‖v − w‖₂²,  ∀m ∈ [M].

In the following theorem, whose proof is provided in Appendix C, we present the convergence rate of the LFL algorithm assuming that the devices can send their local updates accurately.

Theorem 1. Let 0 < η(t) ≤ min{1, 1/(µτ)}, ∀t. We have

E‖θ(t) − θ*‖₂² ≤ (Π_{i=0}^{t−1} A(i)) ‖θ(0) − θ*‖₂² + Σ_{j=0}^{t−1} B(j) Π_{i=j+1}^{t−1} A(i),  (11a)

where

A(i) ≜ 1 − µη(i)(τ − η(i)(τ − 1)),  (11b)

B(i) ≜ (1 − µη(i)(τ − η(i)(τ − 1))) (η(i−1)τG/(2q₁))² εd + η²(i)(τ² + τ − 1)G² + (1 + µ(1 − η(i))) η²(i)G² τ(τ − 1)(2τ − 1)/6 + 2η(i)(τ − 1)Γ,  (11c)

for some 0 ≤ ε ≤ 1, and the expectation is with respect to the stochastic gradient function and the stochastic quantization.

Corollary 1. From the L-smoothness of the loss function, for 0 < η(t) ≤ min{1, 1/(µτ)}, ∀t, and a total of T global iterations, it follows that

E[F(θ(T))] − F* ≤ (L/2) E‖θ(T) − θ*‖₂² ≤ (L/2) (Π_{i=0}^{T−1} A(i)) ‖θ(0) − θ*‖₂² + (L/2) Σ_{j=0}^{T−1} B(j) Π_{i=j+1}^{T−1} A(i),  (12)

where the last inequality follows from (11a). Considering η(t) = η and τ = 1, we have

E[F(θ(T))] − F* ≤ (L/2)(1 − µη)^T ‖θ(0) − θ*‖₂² + (L/2) ((1 − µη) εd/(4q₁²) + 1) (1 − (1 − µη)^T) ηG²/µ.  (13)

Asymptotic convergence analysis Here we show that, for a learning rate decreasing over time such that lim_{t→∞} η(t) = 0, and given small enough ε, lim_{T→∞} E[F(θ(T))] − F* = 0. For 0 < η(t) ≤ min{1, 1/(µτ)}, we have 0 ≤ A(t) < 1, and lim_{T→∞} Π_{i=0}^{T−1} A(i) = 0. For simplicity, assume η(t) = α/(t + β) for constant values α and β. For j ≫ 0, B(j) → 0, and for bounded j values, Π_{i=j+1}^{T−1} A(i) → 0; hence, according to (12), lim_{T→∞} E[F(θ(T))] − F* = 0.
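To make the shape of this bound concrete, the sketch below evaluates the constant-step-size, τ = 1 form of Corollary 1: a geometrically decaying term plus an error floor set by the quantization. All constants are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical problem constants, chosen only to illustrate the bound's shape.
L_s, mu, eta = 10.0, 1.0, 0.05      # smoothness, strong convexity, step size
G, d, q1, eps = 1.0, 1000, 2, 1e-3  # gradient bound, dimension, level, epsilon
D0 = 4.0                            # ||theta(0) - theta*||_2^2

def corollary1_bound(T):
    # Bound on E[F(theta(T))] - F* for eta(t) = eta and tau = 1 (Corollary 1):
    # a (1 - mu*eta)^T decay plus a floor inflated by eps*d/(4*q1^2).
    decay = (1.0 - mu * eta) ** T
    floor = (L_s / 2) * ((1.0 - mu * eta) * eps * d / (4 * q1 ** 2) + 1.0) \
            * (1.0 - decay) * eta * G ** 2 / mu
    return (L_s / 2) * decay * D0 + floor
```

As T grows, the decaying term vanishes and the bound settles at a floor proportional to ηG²/µ, inflated by the quantization factor εd/(4q₁²); smaller ε or larger q₁ shrinks that inflation.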

3.3. Non-Convex Loss Function

Next, we provide convergence guarantees for the proposed LFL scheme for L-smooth and non-convex loss functions F_1, . . . , F_M. For the non-convex case, we consider a weaker notion of convergence Liu & Wright (2015): lim_{T→∞} E‖∇F(θ(T))‖₂² = 0. In the following theorem, we bound (1/Σ_{t=0}^{T−1} η(t)) Σ_{t=0}^{T−1} η(t) E‖∇F(θ(t))‖₂², with the proof provided in Appendix F.

Theorem 2. Performing the LFL algorithm for T ≥ 1 global iterations, assuming that the PS receives the local model updates accurately, leads to

(1/Σ_{t=0}^{T−1} η(t)) Σ_{t=0}^{T−1} η(t) E‖∇F(θ(t))‖₂²
≤ 2(F(θ(0)) − F*)/(τ Σ_{t=0}^{T−1} η(t)) + 2Γ/(τ Σ_{t=0}^{T−1} η(t))
+ (1/Σ_{t=0}^{T−1} η(t)) Σ_{t=0}^{T−1} (η(t−1)G/(2q₁))² (η(t)(2τ−1)L + 2) εdτL
+ 2G²τL (Σ_{t=0}^{T−1} η²(t))/(Σ_{t=0}^{T−1} η(t))
+ (L²G²(τ−1)(2τ−1)/3) (Σ_{t=0}^{T−1} η³(t))/(Σ_{t=0}^{T−1} η(t)).  (14)

Choice of ε We highlight that ε appears in the convergence analysis of the LFL algorithm through inequalities (45) and (63), in which we have

E[(max{|Σ_{m=1}^{M} Σ_{i=1}^{τ} (B_m/B) ∇F_m(θ_m^{i}(t−1), ξ_m^{i}(t−1))|} − min{|Σ_{m=1}^{M} Σ_{i=1}^{τ} (B_m/B) ∇F_m(θ_m^{i}(t−1), ξ_m^{i}(t−1))|})²]
≤ ε E‖Σ_{m=1}^{M} Σ_{i=1}^{τ} (B_m/B) ∇F_m(θ_m^{i}(t−1), ξ_m^{i}(t−1))‖₂²,  (15)

which follows from (26b), where we note that

θ(t) − θ̂(t−1) = −η(t−1) Σ_{m=1}^{M} Σ_{i=1}^{τ} (B_m/B) ∇F_m(θ_m^{i}(t−1), ξ_m^{i}(t−1)).  (16)

On average, the entries of θ(t) − θ̂(t−1), given in (16), are not expected to have very diverse magnitudes. Thus, the inequality in (15) should hold for a relatively small value of ε. We have observed numerically that ε ≈ 10⁻³ satisfies inequality (15) for the LFL algorithm.

Impact of number of local SGD steps τ

For the non-convex case, assuming η(t) = η, ∀t, it is easy to verify that the upper bound on (1/T) Σ_{t=0}^{T−1} E‖∇F(θ(t))‖₂² given in Theorem 2 simplifies to

h(τ) = a₋₁/τ + a₀ + a₁τ + a₂τ²,  (17)

where

a₋₁ ≜ 2(F(θ(0)) − F* + Γ)/(ηT),
a₀ ≜ η²L²G²/3,
a₁ ≜ (2 − ηL)ηLG²(1 + εd/(4q₁²)),
a₂ ≜ η²L²G²(2/3 + εd/(2q₁²)).  (18)

We have dh(τ)/dτ = −a₋₁/τ² + a₁ + 2a₂τ, where we note that a₁ + 2a₂ ≥ 0. For relatively small a₋₁ values, particularly a₋₁ ≤ a₁ + 2a₂, h(τ) increases with τ, and τ = 1 minimizes h(τ); that is, when training starts close to the optimal solution (F(θ(0)) − F* is relatively small) and/or η is relatively large, τ = 1 may be the best choice. On the other hand, for relatively large a₋₁ values, the best τ is the nearest integer to the positive solution of (a₁ + 2a₂τ)τ² − a₋₁ = 0.
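The trade-off above is easy to probe numerically: evaluate h(τ) over integer τ and take the minimizer. The constants below are hypothetical, chosen only to illustrate a regime where a₋₁ is large enough that τ > 1 wins.

```python
# Hypothetical constants for illustration; in practice they come from the
# problem (L, G, epsilon, d, q1) and the training budget (eta, T).
F0_gap, Gamma, eta, T = 2.0, 0.1, 0.01, 500   # F(theta(0)) - F*, bias, step, iters
L_s, G, eps, d, q1 = 5.0, 1.0, 1e-3, 10000, 2

a_m1 = 2 * (F0_gap + Gamma) / (eta * T)
a_0 = eta ** 2 * L_s ** 2 * G ** 2 / 3
a_1 = (2 - eta * L_s) * eta * L_s * G ** 2 * (1 + eps * d / (4 * q1 ** 2))
a_2 = eta ** 2 * L_s ** 2 * G ** 2 * (2 / 3 + eps * d / (2 * q1 ** 2))

def h(tau):
    # Upper bound on the average squared gradient norm as a function of tau.
    return a_m1 / tau + a_0 + a_1 * tau + a_2 * tau ** 2

# Small a_{-1} (start near the optimum, or large eta) pushes the best tau to 1;
# otherwise the minimizer balances a_{-1}/tau against the increasing terms.
best_tau = min(range(1, 51), key=h)
```

With these constants the 1/τ term dominates at τ = 1, so the minimizer moves to a small τ > 1, consistent with the discussion of the root of (a₁ + 2a₂τ)τ² = a₋₁.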

4. Numerical Experiments

Here we investigate the performance of the proposed LFL algorithm for image classification on both the MNIST LeCun et al. (1998) and CIFAR-10 Krizhevsky & Hinton (2009) datasets, utilizing the ADAM optimizer Kingma & Ba (2017). We consider M = 40 devices, and we measure the performance as the accuracy with respect to the test samples, referred to as the test accuracy. Network architecture We train different convolutional neural networks (CNNs) on the MNIST and CIFAR-10 datasets. The architectures of these CNNs are described in Table 1.

Data distribution

We consider two data distribution scenarios. In the non-iid scenario, we split the training data samples with the same label (from the same class) into M/10 disjoint subsets (assuming M is divisible by 10). We then assign each subset of data samples, selected at random, to a different device. In the iid scenario, we randomly split the training data samples into M disjoint subsets, and assign each subset to a distinct device. We consider the non-iid and iid data distributions while training on MNIST and CIFAR-10, respectively. For quantization of the global model, we employ the scheme presented in Appendix A with the LGM scheme, and assume the same technique for transmission in the device-to-PS direction introduced in Section 2.2.

Benchmark approaches

We consider the performance of the lossless broadcasting (LB) scenario, where the devices receive the current global model accurately, and perform the quantization-with-error-compensation approach described in Section 2.2. We highlight that this approach requires the transmission of R_LB = 33d bits from the PS, where we assume that each entry of the global model is represented by 33 bits. Thus, the saving ratio in the communication bits broadcast from the PS using LFL versus LB is

R_LB/R_Q = 33d/(64 + d(1 + log₂(q₁ + 1))) ≈(a) 33/(1 + log₂(q₁ + 1)),

where (a) follows by assuming d ≫ 1. We further consider the performance of the fully lossless approach, where, in addition to having the accurate global model at the devices, we assume that the PS receives the local model updates from the devices accurately. In Figure 1 we illustrate the performance of the different approaches for the non-iid and iid scenarios using MNIST and CIFAR-10, respectively, for training with M = 40 devices. Figure 1a demonstrates the test accuracy of the different approaches for non-iid data using MNIST with local mini-batch size |ξ_m^i(t)| = 500 and τ = 4 local iterations. We set q₂ = 2 for all the approaches in which the devices perform quantization, and q₁ = 2 for the LFL and LGM schemes. We observe that the proposed LFL algorithm with (q₁, q₂) = (2, 2) performs as well as the fully lossless and LB approaches, despite a factor of 12.77 saving in the number of bits that need to be broadcast compared to the LB approach. This illustrates the efficiency of the LFL algorithm in the non-iid scenario, providing significant communication cost savings without any visible performance degradation. On the other hand, the performance of the LGM algorithm drops after an intermediate number of training iterations, which shows that the quantization level q₁ = 2 does not provide the devices with a sufficiently accurate estimate of the global model to rely on for local training.
This is particularly harmful in later iterations, as the algorithm approaches the optimal point, where a more accurate estimate of the global model is required for training. We highlight that the proposed LFL algorithm resolves this deficiency of the LGM algorithm by quantizing the global model update rather than the global model, providing a more accurate estimate of the global model to the devices even with a relatively small quantization level q₁ = 2. Throughout our experiments, we found that the random linear transform of the LTGM scheme is not highly efficient in providing a transformed vector with a relatively small peak-to-average ratio, and the quantization level q₁ should be relatively large to guarantee that the algorithm succeeds in learning. Therefore, we set q₁ = 50 for the LTGM scheme, which is a relatively large quantization level. The advantage of the proposed LFL algorithm over the LTGM and LGM algorithms for the non-iid scenario can be clearly seen in the figure. A similar observation is made in Figure 1b, illustrating the performance of the different approaches for iid data using CIFAR-10 with local mini-batch size |ξ_m^i(t)| = 250 and τ = 5 local iterations. The LFL algorithm with (q₁, q₂) = (5, 3) provides a 9.2× smaller communication load compared to LB with q₂ = 3, without any visible performance degradation with respect to the fully lossless and LB approaches. It also significantly outperforms the LGM algorithm with (q₁, q₂) = (5, 3), which shows the advantage of quantizing the global model update rather than the global model for iid data. We also observe that the accuracy of the LTGM algorithm drops significantly after around 200 global iterations, even for a large quantization level q₁ = 1000, which shows the deficiency of the linear transform in providing a relatively small peak-to-average ratio for the transformed vector.
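The saving ratios quoted above (12.77 for q₁ = 2 and 9.2 for q₁ = 5) follow directly from the saving-ratio expression; a minimal helper to check them, with a hypothetical function name:

```python
import math

def lfl_saving_ratio(d, q1):
    # R_LB / R_Q = 33d / (64 + d * (1 + log2(q1 + 1))); for d >> 1 this
    # approaches 33 / (1 + log2(q1 + 1)).
    r_lb = 33 * d
    r_q = 64 + d * (1 + math.log2(q1 + 1))
    return r_lb / r_q
```

For a model with d on the order of millions of parameters, q₁ = 2 gives roughly 33/(1 + log₂3) ≈ 12.77, and q₁ = 5 gives roughly 33/(1 + log₂6) ≈ 9.2, matching the MNIST and CIFAR-10 settings, respectively.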
In Figure 2, we investigate the empirical variance and the peak-to-average ratio of the vector (considering the absolute values of its entries) to be quantized at the PS with the different schemes, for the experimental settings used in Figure 1. This result is provided to better justify the benefits of the proposed LFL scheme over LGM and LTGM shown in Figure 1. We observe that the global model update, which is quantized at the PS with LFL, has a significantly smaller empirical variance than the global model, which is quantized at the PS with LGM. This justifies the improvement of LFL over LGM, reflecting a smaller quantization error when quantizing the global model update rather than the global model, particularly towards the end of training, where the empirical variance of the global model with LGM has an increasing trend over time. Also, both the empirical variance and the peak-to-average ratio of the transformed vector with LTGM increase over time, particularly for training on CIFAR-10. This illustrates that the quantization error increases with time, which may be more harmful towards the end of training, while approaching the optimal solution. We note that the relatively small empirical variance of the transformed vector with LTGM is due to the linear transform applied at the PS, which scales down the entries of the global model vector. The relatively large peak-to-average ratio indicates that the quantized vector with LTGM may not provide an accurate estimate of the actual transformed vector at the PS.

A Stochastic quantization

Given x ∈ R^d, with the i-th entry denoted by x_i, we define

x_max ≜ max{|x|},  (20a)
x_min ≜ min{|x|}.  (20b)

Given a quantization level q ≥ 1, we have

Q(x_i, q) ≜ sign(x_i) · [x_min + (x_max − x_min) · ϕ((|x_i| − x_min)/(x_max − x_min), q)],  for i ∈ [d],  (21a)

where ϕ(·,·) is a quantization function defined as follows. For 0 ≤ x ≤ 1 and q ≥ 1, let l ∈ {0, 1, . . . , q−1} be an integer such that x ∈ [l/q, (l+1)/q). We then define

ϕ(x, q) ≜ l/q, with probability 1 − (xq − l), and (l+1)/q, with probability xq − l.  (21b)

We define

Q(x, q) ≜ [Q(x₁, q), . . . , Q(x_d, q)]^T,  (22)

and we highlight that it is represented by R_Q = 64 + d(1 + log₂(q + 1)) bits, where 64 bits are used to represent x_max and x_min, d bits are used for sign(x_i), ∀i ∈ [d], and d·log₂(q + 1) bits represent ϕ((|x_i| − x_min)/(x_max − x_min), q), ∀i ∈ [d]. We note that we have modified the QSGD scheme proposed in Alistarh et al. (2017) by normalizing the entries of vector x by x_max − x_min rather than ‖x‖₂.
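A NumPy sketch of the quantizer in (20)-(22), with hypothetical function names: each magnitude is normalized by x_max − x_min, rounded stochastically to one of q + 1 grid levels, rescaled, and re-signed; a second helper returns the bit count R_Q.

```python
import numpy as np

def stochastic_quantize(x, q, rng=None):
    # Quantizer of (21): normalize |x_i| by (x_max - x_min), round stochastically
    # to a grid of q+1 levels via phi(., q), rescale, and restore the sign.
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x, dtype=float)
    x_max, x_min = np.abs(x).max(), np.abs(x).min()
    if x_max == x_min:                 # all magnitudes equal: exact representation
        return np.sign(x) * x_max
    u = (np.abs(x) - x_min) / (x_max - x_min)        # normalized values in [0, 1]
    l = np.minimum(np.floor(u * q), q - 1)           # level index l in {0,...,q-1}
    phi = (l + (rng.random(x.size) < (u * q - l))) / q  # stochastic rounding
    return np.sign(x) * (x_min + (x_max - x_min) * phi)

def quantized_bits(d, q):
    # R_Q = 64 + d*(1 + log2(q+1)): 64 bits for x_max and x_min, d sign bits,
    # and d*log2(q+1) bits for the quantization levels.
    return 64 + d * (1 + np.log2(q + 1))
```

The output magnitudes always lie in [x_min, x_max], and the rounding probabilities make the quantizer unbiased, which is what Lemma 1 exploits.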

B Proof of Lemma 1

Given ϕ(x, q) in (21b), we have

E_ϕ[ϕ(x, q)] = (l/q)(1 + l − xq) + ((l+1)/q)(xq − l) = x.  (24)

Also, we have

E_ϕ[ϕ²(x, q)] = (l/q)²(1 + l − xq) + ((l+1)/q)²(xq − l) = (1/q²)(−l² + 2lxq + xq − l) = x² + (1/q²)(xq − l)(1 − xq + l) ≤(a) x² + 1/(4q²),  (25)

where (a) follows since (xq − l)(1 − xq + l) ≤ 1/4. According to (24), (25), and the definition of Q(x, q) given in (22), it follows that

E_ϕ[Q(x, q)] = x,  (26a)

and

E_ϕ‖Q(x, q)‖₂² = Σ_{i=1}^{d} E_ϕ|Q(x_i, q)|²
= (x_max − x_min)² Σ_{i=1}^{d} E_ϕ[ϕ²((|x_i| − x_min)/(x_max − x_min), q)] + d x_min² + 2x_min(x_max − x_min) Σ_{i=1}^{d} E_ϕ[ϕ((|x_i| − x_min)/(x_max − x_min), q)]
≤(b) (x_max − x_min)² Σ_{i=1}^{d} [((|x_i| − x_min)/(x_max − x_min))² + 1/(4q²)] + d x_min² + 2x_min Σ_{i=1}^{d} (|x_i| − x_min)
= ‖x‖₂² + d(x_max − x_min)²/(4q²)
≤(c) ‖x‖₂² + εd‖x‖₂²/(4q²),  (26b)

where (b) follows from (24) and (25), and (c) follows since ε = (x_max − x_min)²/‖x‖₂².
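Both claims of Lemma 1 can be sanity-checked by Monte Carlo simulation. The sketch below re-implements the quantizer of Appendix A for self-containedness (hypothetical function name) and estimates the mean and second moment of Q(x, q) over many draws.

```python
import numpy as np

def stochastic_quantize(x, q, rng):
    # Quantizer of Appendix A: normalize |x_i| by (x_max - x_min), round
    # stochastically to a grid of q+1 levels, rescale, restore the sign.
    x_max, x_min = np.abs(x).max(), np.abs(x).min()
    u = (np.abs(x) - x_min) / (x_max - x_min)
    l = np.minimum(np.floor(u * q), q - 1)
    phi = (l + (rng.random(x.size) < (u * q - l))) / q
    return np.sign(x) * (x_min + (x_max - x_min) * phi)

rng = np.random.default_rng(1)
x = rng.normal(size=32)
q, trials = 4, 20000
samples = np.stack([stochastic_quantize(x, q, rng) for _ in range(trials)])

# First claim: E[Q(x, q)] = x (unbiasedness).
bias = np.abs(samples.mean(axis=0) - x).max()

# Second claim: E||Q(x, q)||_2^2 <= ||x||_2^2 + eps*d*||x||_2^2 / (4 q^2),
# with eps = (max|x| - min|x|)^2 / ||x||_2^2.
eps = (np.abs(x).max() - np.abs(x).min()) ** 2 / np.sum(x ** 2)
second_moment = np.mean(np.sum(samples ** 2, axis=1))
bound = np.sum(x ** 2) * (1 + eps * x.size / (4 * q ** 2))
```

The empirical bias shrinks with the number of trials, and the empirical second moment sits strictly below the bound, since the worst-case factor 1/4 in (25) is attained only when every normalized entry falls exactly midway between two grid levels.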

C Proof of Theorem 1

We have

E‖θ(t+1) − θ*‖₂² = E‖θ̂(t) − θ*‖₂² + E‖Σ_{m=1}^{M} (B_m/B) Δθ_m(t)‖₂² + 2E⟨θ̂(t) − θ*, Σ_{m=1}^{M} (B_m/B) Δθ_m(t)⟩.  (27)

In the following, we bound the last two terms on the right-hand side (RHS) of (27). From the convexity of ‖·‖₂², it follows that

E‖Σ_{m=1}^{M} (B_m/B) Δθ_m(t)‖₂² ≤ Σ_{m=1}^{M} (B_m/B) E‖Δθ_m(t)‖₂²
= η²(t) Σ_{m=1}^{M} (B_m/B) E‖Σ_{i=1}^{τ} ∇F_m(θ_m^{i}(t), ξ_m^{i}(t))‖₂²
≤ η²(t) τ Σ_{m=1}^{M} Σ_{i=1}^{τ} (B_m/B) E‖∇F_m(θ_m^{i}(t), ξ_m^{i}(t))‖₂²
≤(a) η²(t) τ² G²,  (28)

where (a) follows from Assumption 1. We rewrite the third term on the RHS of (27) as follows:

2E⟨θ̂(t) − θ*, Σ_{m=1}^{M} (B_m/B) Δθ_m(t)⟩
= 2η(t) Σ_{m=1}^{M} (B_m/B) E⟨θ* − θ̂(t), Σ_{i=1}^{τ} ∇F_m(θ_m^{i}(t), ξ_m^{i}(t))⟩
= 2η(t) Σ_{m=1}^{M} (B_m/B) E⟨θ* − θ̂(t), ∇F_m(θ̂(t), ξ_m^{1}(t))⟩ + 2η(t) Σ_{m=1}^{M} (B_m/B) E⟨θ* − θ̂(t), Σ_{i=2}^{τ} ∇F_m(θ_m^{i}(t), ξ_m^{i}(t))⟩.  (29)

We have

2η(t) Σ_{m=1}^{M} (B_m/B) E⟨θ* − θ̂(t), ∇F_m(θ̂(t), ξ_m^{1}(t))⟩
=(a) 2η(t) Σ_{m=1}^{M} (B_m/B) E⟨θ* − θ̂(t), ∇F_m(θ̂(t))⟩
≤(b) 2η(t) Σ_{m=1}^{M} (B_m/B) E[F_m(θ*) − F_m(θ̂(t)) − (µ/2)‖θ̂(t) − θ*‖₂²]
= 2η(t) (F* − E[F(θ̂(t))] − (µ/2) E‖θ̂(t) − θ*‖₂²)
≤(c) −µη(t) E‖θ̂(t) − θ*‖₂²,  (30)

where (a) follows since E_ξ[∇F_m(θ_m^{i}(t), ξ_m^{i}(t))] = ∇F_m(θ_m^{i}(t)), ∀i, m, (b) is the result of the µ-strong convexity of the loss functions, and (c) follows since F* ≤ F(θ̂(t)), ∀t.

Lemma 2. For 0 < η(t) ≤ 1, we have

2η(t) Σ_{m=1}^{M} (B_m/B) E⟨θ* − θ̂(t), Σ_{i=2}^{τ} ∇F_m(θ_m^{i}(t), ξ_m^{i}(t))⟩
≤ −µη(t)(1 − η(t))(τ − 1) E‖θ̂(t) − θ*‖₂² + η²(t)(τ − 1)G² + (1 + µ(1 − η(t))) η²(t)G² τ(τ−1)(2τ−1)/6 + 2η(t)(τ − 1)Γ.  (31)

Proof. See Appendix D.

By substituting (30) and (31) into (29), it follows that

2E⟨θ̂(t) − θ*, Σ_{m=1}^{M} (B_m/B) Δθ_m(t)⟩ ≤ −µη(t)(τ − η(t)(τ − 1)) E‖θ̂(t) − θ*‖₂² + η²(t)(τ − 1)G² + (1 + µ(1 − η(t))) η²(t)G² τ(τ−1)(2τ−1)/6 + 2η(t)(τ − 1)Γ,

which, together with the inequality in (28), leads to the following upper bound on E‖θ(t+1) − θ*‖₂² when substituted into (27):

E‖θ(t+1) − θ*‖₂² ≤ (1 − µη(t)(τ − η(t)(τ − 1))) E‖θ̂(t) − θ*‖₂² + η²(t)(τ² + τ − 1)G² + (1 + µ(1 − η(t))) η²(t)G² τ(τ−1)(2τ−1)/6 + 2η(t)(τ − 1)Γ.  (34)

Lemma 3. For θ̂(t) given in (2), we have

E‖θ̂(t) − θ*‖₂² ≤ E‖θ(t) − θ*‖₂² + (η(t−1)τG/(2q₁))² εd,

for some 0 ≤ ε ≤ 1. Proof. See Appendix E.

According to Lemma 3, the inequality in (34) can be rewritten as follows:

E‖θ(t+1) − θ*‖₂² ≤ (1 − µη(t)(τ − η(t)(τ − 1))) E‖θ(t) − θ*‖₂² + (1 − µη(t)(τ − η(t)(τ − 1))) (η(t−1)τG/(2q₁))² εd + η²(t)(τ² + τ − 1)G² + (1 + µ(1 − η(t))) η²(t)G² τ(τ−1)(2τ−1)/6 + 2η(t)(τ − 1)Γ.  (35)

Theorem 1 follows from the inequality in (35), having 0 < η(t) ≤ min{1, 1/(µτ)}, ∀t.

D Proof of Lemma 2

We have

2η(t) Σ_{m=1}^{M} (B_m/B) Σ_{i=2}^{τ} E⟨θ* − θ̂(t), ∇F_m(θ_m^{i}(t), ξ_m^{i}(t))⟩
= 2η(t) Σ_{m=1}^{M} (B_m/B) Σ_{i=2}^{τ} E⟨θ_m^{i}(t) − θ̂(t), ∇F_m(θ_m^{i}(t), ξ_m^{i}(t))⟩ + 2η(t) Σ_{m=1}^{M} (B_m/B) Σ_{i=2}^{τ} E⟨θ* − θ_m^{i}(t), ∇F_m(θ_m^{i}(t), ξ_m^{i}(t))⟩.  (36)

We first bound the first term on the RHS of (36). We have

2η(t) Σ_{m=1}^{M} (B_m/B) Σ_{i=2}^{τ} E⟨θ_m^{i}(t) − θ̂(t), ∇F_m(θ_m^{i}(t), ξ_m^{i}(t))⟩
≤ η(t) Σ_{m=1}^{M} (B_m/B) Σ_{i=2}^{τ} E[(1/η(t))‖θ_m^{i}(t) − θ̂(t)‖₂² + η(t)‖∇F_m(θ_m^{i}(t), ξ_m^{i}(t))‖₂²]
≤(a) Σ_{m=1}^{M} (B_m/B) Σ_{i=2}^{τ} E‖θ_m^{i}(t) − θ̂(t)‖₂² + η²(t)(τ − 1)G²,

where the first inequality follows from the Cauchy-Schwarz inequality and (a) follows from Assumption 1. Plugging (41) into (40) yields

2η(t) Σ_{m=1}^{M} (B_m/B) Σ_{i=2}^{τ} E⟨θ* − θ_m^{i}(t), ∇F_m(θ_m^{i}(t), ξ_m^{i}(t))⟩ ≤ −µη(t)(1 − η(t))(τ − 1) E‖θ̂(t) − θ*‖₂² + µ(1 − η(t))η²(t)G² τ(τ−1)(2τ−1)/6 + 2η(t)(τ − 1)Γ,  (42)

where we used the inequality in (38) and η(t) ≤ 1. Plugging (39) and (42) into (36) completes the proof of Lemma 2.

E Proof of Lemma 3

We have E θ(t) -θ * 2 2 = E θ(t) 2 2 + E θ * 2 2 -2E θ(t), θ * (a) = E θ(t) 2 2 + E θ * 2 2 -2E [ θ(t), θ * ] , where (a) follows since E θ(t) = E θ(t -1) + E Q θ(t) -θ(t -1), q 1 = E [θ(t)] , ( ) where the last equality follows from (26a). In the following, we upper bound E θ(t) 2 2 . We have E θ(t) 2 2 =E θ(t -1) 2 2 + E Q θ(t) -θ(t -1), q 1 2 2 + 2E θ(t -1), Q θ(t) -θ(t -1), q 1 (a) ≤E θ(t -1) 2 2 + E θ(t) -θ(t -1) 2 2 + ε(t)d 4q 2 1 E θ(t) -θ(t -1) 2 2 + 2E θ(t -1), θ(t) -θ(t -1) (b) ≤E θ(t) 2 2 + εd 4q 2 1 (t) E θ(t) -θ(t -1) 2 2 , where (a) follows from (26) for some 0 ≤ ε(t) ≤ 1 defined as ε(t) E max θ(t) -θ(t -1) -min θ(t) -θ(t -1) 2 E θ(t) -θ(t -1) 2 , ( ) noting that θ(t) -θ(t -1) = -η(t -1) M m=1 τ i=1 B m B ∇F m θ i m (t -1), ξ i m (t -1) , and in (b) we define ε max t {ε(t)}. According to (47), from the convexity of • 2 2 , it follows that E θ(t) -θ(t -1) 2 2 ≤ η 2 (t -1) M m=1 τ i=1 B m B E ∇F m θ i m (t -1), ξ i m (t -1) 2 2 (a) ≤ η 2 (t -1)τ 2 G 2 , ( ) where (a) follows from Assumption 1. Accordingly, (45) reduces to E θ(t) 2 2 ≤ E θ(t) 2 2 + η(t -1)τ G 2q 1 2 εd. ( ) Substituting the above inequality into (43) yields E θ(t) -θ * 2 2 ≤ E θ(t) 2 2 + E θ * 2 2 -2E [ θ(t), θ * ] + η(t -1)τ G 2q 1 2 εd = E θ(t) -θ * 2 2 + η(t -1)τ G 2q 1 2 εd. ( ) F Proof of Theorem 2 According to the L-smoothness of the loss functions F 1 , . . . , F m , we have F (θ(t + 1)) -F (θ(t)) ≤ θ(t + 1) -θ(t), ∇F (θ(t)) + L 2 θ(t + 1) -θ(t) 2 2 . ( ) In the following we bound the average of the two terms on the RHS of the above inequality. Lemma 4. We have E θ(t + 1) -θ(t), ∇F (θ(t)) ≤ η(t -1)τ GL 2q 1 2 εdη(t)(2τ -1) 2 + η 3 (t)L 2 G 2 τ (τ -1)(2τ -1) 6 - η(t)τ 2 E ∇F (θ(t)) 2 2 . ( ) Proof. See Appendix G. Lemma 5. We have E θ(t + 1) -θ(t) 2 2 ≤ 2η 2 (t)τ 2 G 2 + η(t -1)τ G 2q 1 2 2εd. ( ) Proof. See Appendix I. 
Substituting the results in Lemmas 4 and 5 into (51) yields
\[
\eta(t)\mathbb{E}\big\|\nabla F(\theta(t))\big\|_2^2 \le \frac{2}{\tau}\Big(\mathbb{E}\big[F(\theta(t))\big] - \mathbb{E}\big[F(\theta(t+1))\big]\Big) + \Big(\frac{\eta(t-1) G}{2 q_1}\Big)^2\big(\eta(t)(2\tau-1)L + 2\big)\varepsilon d \tau L + 2\eta^2(t) G^2 \tau L + \eta^3(t) L^2 G^2 \frac{(\tau-1)(2\tau-1)}{3}. \tag{54}
\]
For any $T$, by summing the above inequality over $t$ we have
\[
\sum_{t=0}^{T-1}\eta(t)\mathbb{E}\big\|\nabla F(\theta(t))\big\|_2^2 \le \frac{2}{\tau}\Big(F(\theta(0)) - \mathbb{E}\big[F(\theta(T))\big]\Big) + \sum_{t=0}^{T-1}\Big(\frac{\eta(t-1) G}{2 q_1}\Big)^2\big(\eta(t)(2\tau-1)L + 2\big)\varepsilon d \tau L + 2 G^2 \tau L \sum_{t=0}^{T-1}\eta^2(t) + L^2 G^2 \frac{(\tau-1)(2\tau-1)}{3}\sum_{t=0}^{T-1}\eta^3(t). \tag{55}
\]
We bound the first term on the RHS of the above inequality as follows:
\[
F(\theta(0)) - \mathbb{E}\big[F(\theta(T))\big] \le F(\theta(0)) - \sum_{m=1}^{M}\frac{B_m}{B}F_m^* = F(\theta(0)) - F^* + F^* - \sum_{m=1}^{M}\frac{B_m}{B}F_m^* = F(\theta(0)) - F^* + \Gamma. \tag{56}
\]
Substituting the above result in (55) and dividing both sides of the inequality in (55) by $\sum_{t=0}^{T-1}\eta(t)$ complete the proof of Theorem 2.
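The two properties of the stochastic quantizer $Q(\cdot, q_1)$ used throughout these proofs are its unbiasedness, $\mathbb{E}_{\phi}[Q(x, q_1)] = x$, and the error bound $\mathbb{E}\|Q(x, q_1) - x\|_2^2 \le d\big(\max(x) - \min(x)\big)^2/(4q_1^2)$. The paper's quantizer is defined in (26); the sketch below uses a standard stochastic uniform quantizer with $q$ levels over $[\min(x), \max(x)]$, which satisfies both properties, and checks them empirically:

```python
import numpy as np

def stochastic_quantize(x, q, rng):
    """Unbiased stochastic uniform quantization of x onto q + 1 levels
    spanning [min(x), max(x)]: each entry is rounded down or up at
    random so that E[Q(x)] = x elementwise."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    delta = (hi - lo) / q                 # quantization step
    k = np.floor((x - lo) / delta)        # index of the lower level
    p = (x - lo) / delta - k              # probability of rounding up
    k = k + (rng.random(x.shape) < p)
    return lo + k * delta

rng = np.random.default_rng(0)
x, q = rng.standard_normal(1000), 8
reps = np.stack([stochastic_quantize(x, q, rng) for _ in range(2000)])
err = ((reps - x) ** 2).sum(axis=1).mean()           # E ||Q(x) - x||^2
bound = x.size * (x.max() - x.min()) ** 2 / (4 * q ** 2)
assert np.allclose(reps.mean(axis=0), x, atol=0.05)  # unbiasedness
assert err <= bound                                  # variance bound
```

Per element, the rounding error has variance $\delta^2 p(1-p) \le \delta^2/4$ with $\delta = (\max(x)-\min(x))/q$; summing over the $d$ coordinates gives the bound above.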

G Proof of Lemma 4

We have
\[
\mathbb{E}\big\langle \theta(t+1) - \theta(t), \nabla F(\theta(t)) \big\rangle = \mathbb{E}\big\langle \theta(t+1) - \hat{\theta}(t) - \theta(t) + \hat{\theta}(t), \nabla F(\theta(t)) \big\rangle = \mathbb{E}\big\langle \theta(t+1) - \hat{\theta}(t), \nabla F(\theta(t)) \big\rangle - \mathbb{E}\big\langle \theta(t) - \hat{\theta}(t), \nabla F(\theta(t)) \big\rangle = \mathbb{E}\big\langle \theta(t+1) - \hat{\theta}(t), \nabla F(\theta(t)) \big\rangle - \mathbb{E}\big\langle \theta(t) - \hat{\theta}(t-1) - Q\big(\theta(t) - \hat{\theta}(t-1), q_1\big), \nabla F(\theta(t)) \big\rangle,
\]
where (a) follows since $\mathbb{E}_{\phi}\big[Q(x, q_1)\big] = x$ and the fact that $\theta(t) - \hat{\theta}(t-1)$ is independent of the stochastic quantization noise in $Q\big(\theta(t) - \hat{\theta}(t-1), q_1\big)$, and $\mathbb{E}_{\xi}\big[\nabla F_m\big(\theta_m^i(t), \xi_m^i(t)\big)\big] = \nabla F_m\big(\theta_m^i(t)\big)$, $\forall i, m$, results in (b). We bound the first term on the RHS of (57) as follows:

H Proof of Lemma 6

We have
\[
\mathbb{E}\big\|\theta(t) - \hat{\theta}(t)\big\|_2^2 = \mathbb{E}\big\|\theta(t) - \hat{\theta}(t-1) - Q\big(\theta(t) - \hat{\theta}(t-1), q_1\big)\big\|_2^2 = \mathbb{E}\big\|\theta(t) - \hat{\theta}(t-1)\big\|_2^2 + \mathbb{E}\big\|Q\big(\theta(t) - \hat{\theta}(t-1), q_1\big)\big\|_2^2 - 2\mathbb{E}\big\langle \theta(t) - \hat{\theta}(t-1), Q\big(\theta(t) - \hat{\theta}(t-1), q_1\big) \big\rangle \stackrel{(a)}{=} -\mathbb{E}\big\|\theta(t) - \hat{\theta}(t-1)\big\|_2^2 + \mathbb{E}\big\|Q\big(\theta(t) - \hat{\theta}(t-1), q_1\big)\big\|_2^2 \stackrel{(b)}{\le} \frac{\varepsilon d}{4 q_1^2}\mathbb{E}\big\|\theta(t) - \hat{\theta}(t-1)\big\|_2^2 = \frac{\varepsilon d}{4 q_1^2}\mathbb{E}\Big\|\eta(t-1)\sum_{m=1}^{M}\sum_{i=1}^{\tau}\frac{B_m}{B}\nabla F_m\big(\theta_m^i(t-1), \xi_m^i(t-1)\big)\Big\|_2^2 \stackrel{(c)}{\le} \Big(\frac{\eta(t-1)\tau G}{2 q_1}\Big)^2 \varepsilon d,
\]
where (a) follows since $\theta(t) - \hat{\theta}(t-1)$ is independent of the stochastic quantization noise in $Q\big(\theta(t) - \hat{\theta}(t-1), q_1\big)$ and $\mathbb{E}_{\phi}[Q(x, q_1)] = x$, the second inequality in

I Proof of Lemma 5

We have
\[
\mathbb{E}\big\|\theta(t+1) - \theta(t)\big\|_2^2 = \mathbb{E}\big\|\theta(t+1) - \hat{\theta}(t) + \hat{\theta}(t) - \theta(t)\big\|_2^2 \le 2\mathbb{E}\big\|\theta(t+1) - \hat{\theta}(t)\big\|_2^2 + 2\mathbb{E}\big\|\theta(t) - \hat{\theta}(t)\big\|_2^2 \le 2\eta^2(t)\tau^2 G^2 + \Big(\frac{\eta(t-1)\tau G}{2 q_1}\Big)^2 2\varepsilon d, \tag{61}
\]
where the first inequality follows from $\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$, and the second from the bound in (48) applied to $\theta(t+1) - \hat{\theta}(t)$ together with Lemma 6.



$\sum_{m=1}^{M} \frac{B_m}{B} F_m(\theta)$ with respect to the global model $\theta \in \mathbb{R}^d$, where $F_m(\theta) = \frac{1}{B_m}\sum_{u \in \mathcal{B}_m} f(\theta, u)$ is the loss function at device $m$, with $\mathcal{B}_m$ representing device $m$'s local dataset of size $B_m$, $B \triangleq \sum_{m=1}^{M} B_m$, and $f(\cdot, \cdot)$ an empirical loss function. Having access to the global model $\theta$, device $m$ utilizes its local dataset and performs multiple iterations of stochastic gradient descent (SGD) in order to minimize the local loss function $F_m(\theta)$. It then sends the local model update to the server, which aggregates the local updates from all the devices to update the global model.
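The procedure described above can be sketched end-to-end. Below is a minimal toy implementation of one LFL training loop under the analysis' assumption of accurate local model updates at the PS; the quantizer, linear-regression losses, dataset sizes, and step sizes are illustrative stand-ins, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, q):
    """Stand-in unbiased stochastic uniform quantizer with q levels."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    delta = (hi - lo) / q
    k = np.floor((x - lo) / delta)
    k = k + (rng.random(x.shape) < (x - lo) / delta - k)
    return lo + k * delta

def local_sgd(theta, A, b, eta, tau, batch):
    """tau mini-batch SGD steps on a local quadratic loss ||A theta - b||^2."""
    for _ in range(tau):
        idx = rng.choice(len(b), size=batch, replace=False)
        grad = 2 * A[idx].T @ (A[idx] @ theta - b[idx]) / batch
        theta = theta - eta * grad
    return theta

d, M, n, tau, q1 = 20, 4, 200, 5, 16
A = [rng.standard_normal((n, d)) for _ in range(M)]
b = [Am @ np.ones(d) + 0.1 * rng.standard_normal(n) for Am in A]

theta = np.zeros(d)       # global model theta(t) at the PS
theta_hat = np.zeros(d)   # devices' estimate of the global model
for t in range(100):
    eta = 0.05 / np.sqrt(t + 1)
    # Devices run tau local SGD steps from the shared estimate; equal
    # dataset sizes here, so the aggregation weights B_m / B are 1 / M.
    local_models = [local_sgd(theta_hat, A[m], b[m], eta, tau, 32)
                    for m in range(M)]
    theta = sum(local_models) / M   # PS aggregation (accurate local updates)
    # The PS broadcasts the quantized global model UPDATE w.r.t. theta_hat,
    # and every device refines its estimate of the global model.
    theta_hat = theta_hat + quantize(theta - theta_hat, q1)
```

After training, `theta_hat` tracks `theta` to within the quantizer's resolution, and both approach the least-squares solution of the toy problem (close to the all-ones vector used to generate the data).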

Figure 1: Test accuracy on MNIST and CIFAR-10 for training with local mini-batch sizes $|\xi_m^i(t)| = 500$ and $|\xi_m^i(t)| = 250$, respectively.

Figure 2: Empirical variance and peak-to-average ratio of the vector quantized at the PS.

FL is demanding in terms of bandwidth, particularly when deep networks with huge numbers of parameters are trained across a large number of devices. Communication is typically the major bottleneck, since training involves iterative transmission over a bandwidth-limited wireless medium between the PS and a massive number of devices at the edge. With the goal of reducing the communication cost, we have studied FL with lossy broadcasting, where, in contrast to most of the existing work in the literature, the PS broadcasts a compressed version of the global model to the devices. We have considered broadcasting quantized global model updates from the PS, from which the devices estimate the current global model for their local SGD iterations. The PS aggregates the quantized local model updates from the devices, according to which it updates the global model. We have derived convergence guarantees for the proposed LFL algorithm to analyze the impact of lossy broadcasting on the FL performance, assuming accurate local model updates at the PS. Numerical experiments have shown the efficiency of the proposed LFL algorithm in providing an accurate estimate of the global model to the devices: it performs as well as the fully lossless and LB approaches for both non-iid and iid data despite the significant reduction in the communication load. It also significantly outperforms the LTGM Caldas et al. (2019) and LGM Tang et al. (2019) algorithms, which study compression in the PS-to-device direction, thanks to quantizing the global model update rather than the global model at the PS.
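The advantage of quantizing the global model update rather than the global model itself can be seen from the quantization error bound $d\big(\max(x) - \min(x)\big)^2/(4q_1^2)$: a one-round update has a much narrower dynamic range than the model, so the same number of quantization levels yields a far smaller absolute error. A synthetic comparison (stand-in vectors and scales, not the paper's measurements):

```python
import numpy as np

def error_bound(x, q):
    """Worst-case variance of unbiased stochastic uniform quantization
    with q levels: d * (max(x) - min(x))^2 / (4 * q**2)."""
    return x.size * (x.max() - x.min()) ** 2 / (4 * q ** 2)

rng = np.random.default_rng(0)
d, q1 = 10_000, 16
model = 5.0 + rng.standard_normal(d)     # global model: wide dynamic range
update = 0.01 * rng.standard_normal(d)   # one-round update: narrow range
ratio = error_bound(model, q1) / error_bound(update, q1)
print(ratio)  # roughly 1e4: quantizing the update is far more accurate
```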

\[
\stackrel{(a)}{=} \mathbb{E}\big\langle \theta(t+1) - \hat{\theta}(t), \nabla F(\theta(t)) \big\rangle = -\eta(t)\sum_{m=1}^{M}\frac{B_m}{B}\sum_{i=1}^{\tau}\mathbb{E}\big\langle \nabla F_m\big(\theta_m^i(t), \xi_m^i(t)\big), \nabla F(\theta(t)) \big\rangle \stackrel{(b)}{=} -\eta(t)\mathbb{E}\big\langle \nabla F\big(\hat{\theta}(t)\big), \nabla F(\theta(t)) \big\rangle - \eta(t)\sum_{m=1}^{M}\frac{B_m}{B}\sum_{i=2}^{\tau}\mathbb{E}\big\langle \nabla F_m\big(\theta_m^i(t)\big), \nabla F(\theta(t)) \big\rangle, \tag{57}
\]

(1b) leads to (b), and (c) is the result of the convexity of $\|\cdot\|_2^2$ and Assumption 1.

CNN architecture for image classification on MNIST and CIFAR-10.

