Federated Learning With Quantized Global Model Updates

Abstract

We study federated learning (FL), which enables mobile devices to utilize their local datasets to collaboratively train a global model with the help of a central server, while keeping the data localized. At each iteration, the server broadcasts the current global model to the devices for local training, and aggregates the local model updates from the devices to update the global model. Previous work on the communication efficiency of FL has mainly focused on the aggregation of model updates from the devices, assuming perfect broadcasting of the global model. In this paper, we instead consider broadcasting a compressed version of the global model. This further reduces the communication cost of FL, which can be particularly limited when the global model is transmitted over a wireless medium. We introduce a lossy FL (LFL) algorithm, in which both the global model and the local model updates are quantized before being transmitted. We analyze the convergence behavior of the proposed LFL algorithm assuming the availability of accurate local model updates at the server. Numerical experiments show that the proposed LFL scheme, which quantizes the global model update (with respect to the global model estimate at the devices) rather than the global model itself, significantly outperforms existing schemes that quantize the global model in the server-to-device direction. Moreover, the performance loss of the proposed scheme is marginal compared to the fully lossless approach, in which the server and the devices transmit their messages without any quantization.

1. Introduction

Federated learning (FL) enables wireless devices to collaboratively train a global model by utilizing locally available data and computational capabilities under the coordination of a parameter server (PS), while the data never leaves the devices McMahan & Ramage (2017). In FL with $M$ devices, the goal is to minimize a loss function $F(\theta) = \sum_{m=1}^{M} \frac{B_m}{B} F_m(\theta)$ with respect to the global model $\theta \in \mathbb{R}^d$, where $F_m(\theta) = \frac{1}{B_m} \sum_{u \in \mathcal{B}_m} f(\theta, u)$ is the loss function at device $m$, with $\mathcal{B}_m$ representing device $m$'s local dataset of size $B_m$, $B = \sum_{m=1}^{M} B_m$, and $f(\cdot, \cdot)$ an empirical loss function. Having access to the global model $\theta$, device $m$ utilizes its local dataset and performs multiple iterations of stochastic gradient descent (SGD) to minimize the local loss function $F_m(\theta)$. It then sends the local model update to the server, which aggregates the local updates from all the devices to update the global model.

FL mainly targets mobile applications at the network edge, where the wireless communication links connecting the devices to the network are typically limited in bandwidth and power and suffer from various channel impairments such as fading, shadowing, and interference; developing an FL framework with limited communication requirements is therefore vital. While communication-efficient FL has been widely studied, prior works mainly focus on the devices-to-PS links, assuming perfect broadcasting of the global model to the devices at each iteration. In this paper, we design an FL algorithm that reduces the cost of both PS-to-device and devices-to-PS communications. To underline the importance of quantization in the PS-to-device direction, we note that some devices may simply lack sufficient bandwidth to receive the global model when the model size is relatively large, particularly in the wireless setting, where devices may be far from the base station. This would result in the consistent exclusion of these devices, causing a significant performance loss. Moreover, the impact of quantization in the devices-to-PS direction is less severe, due to the averaging of the local updates at the PS.

There is a fast-growing body of literature on the communication efficiency of FL targeting bandwidth-restricted devices. Several studies address this issue by considering communications with rate limitations, and propose different compression and quantization techniques Konecny et al. (2016); McMahan et al. (2017); Konecny & Richtarik (2018); Dowlin et al. (2016); Konecny et al. (2015); Lin et al. (2018b); He et al. (2018); M. M. Amiri & Gündüz (2020), as well as performing local updates to reduce the frequency of communications from the devices to the PS Lin et al. (2018a); Stich (2019). Statistical challenges arise in FL since the data samples may not be independent and identically distributed (iid) across devices. Common sources of dependence or bias in the data distribution are the participating devices being located in a particular geographic region and/or in a particular time window P. Kairouz et al. (2019). Different approaches have been studied to mitigate the effect of non-iid data in FL McMahan et al. (2017); Hsieh et al. (2019); Li et al. (2020a); Wang et al. (2020); Eichner et al. (2019); Zhao et al. (2018). FL also suffers from significant system variability, mainly due to the hardware, network connectivity, and available power of different devices Li et al. (2019). Active device selection schemes have been introduced to alleviate this variability, where a subset of devices share the resources and participate at each iteration of training Kang et al. (2019); Nishio & Yonetani (2019); Amiri et al. (2020b); Yang et al. (2020; 2019). There have also been efforts in developing convergence guarantees for FL under various scenarios, considering iid data across the devices Stich (2019); Wang & Joshi (2019); Woodworth et al. (2019); Zhou & Cong (2018); Koloskova et al. (2020), non-iid data Koloskova et al. (2020); Li et al. (2020a); Haddadpour & Mahdavi (2019); Li et al. (2020c), participation of all the devices Khaled et al. (2020); Wang et al. (2019); Yu et al. (2018); Huo et al. (2020), or only a subset of devices at each iteration Li et al. (2020b); Karimireddy et al. (2020); Rizk et al. (2020); Li et al. (2020c); Amiri et al. (2020a), and FL under limited communication constraints Amiri et al. (2020a); Recht et al. (2011); Alistarh et al. (2018). FL with compressed global model transmission has been studied recently in Caldas et al. (2019); Tang et al. (2019), aiming to alleviate the communication footprint from the PS to the devices. The global model parameters are relatively skewed/diverse, and the efficiency of quantization diminishes significantly when the peak-to-average ratio of the parameters is large. To overcome this, in Caldas et al. (2019) the PS first employs a linear transform to spread the information of the global model vector more evenly among its dimensions and broadcasts a quantized version of the resultant vector, and the devices apply the inverse linear transform to estimate the global model. We highlight that this approach incurs a relatively high computational overhead due to the linear transform at the PS and its inverse at the devices, an overhead that grows with the size of the model. Furthermore, the performance evaluation in Caldas et al. (2019) is limited to experimental results. On the other hand, in Tang et al. (2019) the PS broadcasts a quantized global model with error accumulation to compensate for the quantization error.

Our contributions. With the exception of Caldas et al. (2019); Tang et al. (2019), the literature on FL considers perfect broadcasting of the global model from the PS to the devices. Under this assumption, no matter what type of local update or device-to-PS communication strategy is used, all the devices are synchronized with the same global model at each iteration. In this paper, we instead consider broadcasting a quantized version of the global model update by the PS, which provides the devices with a lossy estimate of the global model (rather than its accurate value) with which to perform local training. This further reduces the communication cost of FL, which can be particularly limited for transmission over a wireless medium while serving a massive number of devices. It is also interesting to investigate the impact of various hyperparameters on the performance of FL with lossy broadcasting of the global model, since FL involves transmission over wireless networks with limited bandwidth. We introduce a lossy FL (LFL) algorithm, where at each iteration the PS broadcasts a compressed version of the global model update to all the devices through quantization. To be precise, the PS exploits the knowledge of the last global model estimate available at the devices as side information to quantize the global model update. The devices recover an estimate of the current global model by combining the received quantized global model update with their previous estimate, perform local training using this estimate, and return the local model updates, again employing quantization.
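The baseline (lossless) FL iteration formalized in the introduction — local SGD on $F_m$ followed by aggregation weighted by $B_m/B$ — can be sketched as follows. This is a minimal illustration with a toy quadratic loss; all function names, step counts, and learning rates are our own assumptions, not part of the paper.

```python
import numpy as np

np.random.seed(0)

def local_sgd(theta, data, grad_fn, lr=0.1, steps=5):
    """Device-side training: a few SGD steps on the local loss F_m,
    returning the local model update (trained model minus received model)."""
    theta_local = theta.copy()
    for _ in range(steps):
        u = data[np.random.randint(len(data))]  # sample one local data point
        theta_local -= lr * grad_fn(theta_local, u)
    return theta_local - theta

def fl_round(theta, datasets, grad_fn):
    """One FL iteration: broadcast theta, collect the local updates,
    and aggregate them with weights B_m / B."""
    B = sum(len(d) for d in datasets)
    updates = [local_sgd(theta, d, grad_fn) for d in datasets]
    aggregate = sum((len(d) / B) * upd for d, upd in zip(datasets, updates))
    return theta + aggregate

# Toy problem: f(theta, u) = 0.5 * ||theta - u||^2, whose global
# minimizer is the weighted mean of all local samples.
grad_fn = lambda th, u: th - u
datasets = [np.random.randn(20, 3) + 1.0, np.random.randn(30, 3) - 1.0]
theta = np.zeros(3)
for _ in range(50):
    theta = fl_round(theta, datasets, grad_fn)
```

With perfect broadcasting, every device starts each round from the same $\theta$; LFL replaces exactly this broadcast step with a quantized one.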


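The LFL broadcast step described above — quantizing the global model update $\theta - \hat{\theta}$ against the devices' last estimate $\hat{\theta}$, rather than quantizing $\theta$ itself — can be sketched as follows. The stochastic uniform quantizer and all names below are illustrative assumptions, not the paper's exact operator.

```python
import numpy as np

def quantize(x, bits=4):
    """Stochastic uniform quantizer: map x onto a 2^bits-level grid over
    its range, rounding each entry randomly to a neighboring level
    (unbiased). Returns integer levels plus two scalars to dequantize."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # guard against hi == lo
    levels = (x - lo) / scale
    floor = np.floor(levels)
    levels = floor + (np.random.rand(*x.shape) < (levels - floor))
    return levels, lo, scale  # transmitted: integer levels + two scalars

def dequantize(levels, lo, scale):
    return levels * scale + lo

def broadcast_lfl(theta, theta_hat, bits=4):
    """PS side: quantize the global model *update* relative to the
    devices' last global model estimate theta_hat (side information)."""
    return quantize(theta - theta_hat, bits)

def device_recover(theta_hat, levels, lo, scale):
    """Device side: combine the received quantized update with the
    previous estimate to recover the new global model estimate."""
    return theta_hat + dequantize(levels, lo, scale)
```

Since the update typically has a much smaller dynamic range than the model itself, the same bit budget yields a much finer grid, which is consistent with the paper's observation that quantizing the update outperforms quantizing the model.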