AQUILA: COMMUNICATION EFFICIENT FEDERATED LEARNING WITH ADAPTIVE QUANTIZATION OF LAZILY-AGGREGATED GRADIENTS

Abstract

The development and deployment of federated learning (FL) have been bottlenecked by the heavy communication overhead of exchanging high-dimensional models between the distributed device nodes and the central server. To achieve better error-communication trade-offs, recent efforts have been made to either adaptively reduce the communication frequency by skipping unimportant updates, e.g., lazy aggregation, or adjust the quantization bits for each communication. In this paper, we propose a unifying communication-efficient framework for FL based on adaptive quantization of lazily-aggregated gradients (AQUILA), which adaptively balances two mutually dependent factors: the communication frequency and the quantization level. Specifically, we start with a careful investigation of the classical lazy aggregation scheme and formulate AQUILA as an optimization problem in which the optimal quantization level is selected by minimizing the model deviation caused by update skipping. Furthermore, we devise a new lazy aggregation strategy to better fit the novel quantization criterion and retain the communication frequency at an appropriate level. The effectiveness and convergence of the proposed AQUILA framework are theoretically verified. The experimental results demonstrate that AQUILA can reduce the overall transmitted bits by around 60% compared to existing methods while achieving identical model performance in a number of non-homogeneous FL scenarios, including Non-IID data and heterogeneous model architectures.

1. INTRODUCTION

With the deployment of ubiquitous sensing and computing devices, the Internet of Things (IoT), as well as many other distributed systems, has gradually grown from concept to reality, bringing dramatic convenience to people's daily lives (Du et al., 2020; Liu et al., 2020; Hard et al., 2018). To fully utilize such distributed computing resources, distributed learning provides a promising framework that can achieve performance comparable to the traditional centralized learning scheme. However, the privacy and security of sensitive data during the updating and transmission processes in distributed learning have become a growing concern. In this context, federated learning (FL) (McMahan et al., 2017) has been developed, allowing distributed devices to collaboratively learn a global model without privacy leakage by keeping private data isolated and masking transmitted information with secure approaches. On account of its privacy-preserving property and great potential in distributed but privacy-sensitive fields such as finance and health, FL has attracted tremendous attention from both academia and industry in recent years.

Unfortunately, in many FL applications, such as image classification and object recognition, the trained model tends to be high-dimensional, resulting in significant communication costs. Hence, communication efficiency has become one of the key bottlenecks of FL. To this end, Sun et al. (2020) proposes the lazily-aggregated quantization (LAQ) method, which skips unnecessary parameter uploads by estimating the value of the gradient innovation, i.e., the difference between the current unquantized gradient and the previously quantized gradient. Moreover, Mao et al. (2021) devises an adaptive quantized gradient (AQG) strategy based on LAQ that dynamically selects the quantization level from a set of manually specified values during the training process. Nevertheless, AQG is still not sufficiently adaptive because the pre-determined quantization levels are difficult to choose in complicated FL environments. In another line of work, Jhunjhunwala et al. (2021) introduces an adaptive quantization rule for FL (AdaQuantFL), which searches a given range for an optimal quantization level and achieves a better error-communication trade-off (a minimal sketch of such an adjustable-precision quantizer is given after the contribution list below).

Most previous research has investigated optimizing the communication frequency or adjusting the quantization level in a highly adaptive manner, but not both. This naturally raises a question: can we adaptively adjust the quantization level in the lazy aggregation fashion to simultaneously reduce the amount of transmitted data and the communication frequency? In this paper, we select the optimal quantization level for every participating device by minimizing the model deviation caused by skipping quantized gradient updates (i.e., lazy aggregation), which yields a novel quantization criterion that, combined with a newly proposed lazy aggregation strategy, further reduces overall communication costs while still offering a convergence guarantee.

The contributions of this paper are threefold.

• We propose an innovative FL procedure with adaptive quantization of lazily-aggregated gradients, termed AQUILA, which adjusts the communication frequency and the quantization level simultaneously and synergistically.

• Instead of naively combining LAQ and AdaQuantFL, AQUILA adopts a completely different device selection method and quantization-level calculation method. Specifically, we derive an adaptive quantization strategy from a new perspective that minimizes the model deviation introduced by lazy aggregation. Subsequently, we present a new lazy aggregation criterion that is more precise and saves more device storage. Furthermore, we provide a convergence analysis of AQUILA for the general non-convex case and under the Polyak-Łojasiewicz condition.

• In addition to standard FL settings, such as the independent and identically distributed (IID) data environment, we experimentally evaluate the performance of AQUILA in a number of non-homogeneous FL settings, such as non-independent and non-identically distributed (Non-IID) local datasets and heterogeneous model architectures. The evaluation results reveal that AQUILA considerably mitigates the communication overhead compared to a variety of state-of-the-art algorithms.
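The primitive that all of the above schemes adjust is a finite-precision quantizer whose bit-width trades accuracy for communication. The following is a minimal sketch of a generic b-bit uniform stochastic quantizer, written only to illustrate this trade-off; the function name and the exact operator are illustrative assumptions, not the quantizer used by LAQ, AdaQuantFL, or AQUILA.

```python
import numpy as np

def stochastic_quantize(v, bits, rng=None):
    """Illustrative b-bit uniform stochastic quantization of a vector."""
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** bits - 1
    vmin, vmax = float(v.min()), float(v.max())
    if vmax == vmin:                      # constant vector: nothing to quantize
        return v.copy()
    scale = (vmax - vmin) / levels
    normalized = (v - vmin) / scale       # in [0, levels]
    lower = np.floor(normalized)
    prob_up = normalized - lower          # round up with this probability (unbiased)
    q = lower + (rng.random(v.shape) < prob_up)
    return vmin + q * scale               # dequantized value, ~`bits` bits per coordinate

# Higher `bits` means a smaller quantization error but more transmitted bits;
# adaptive schemes tune exactly this error-communication trade-off.
g = np.random.default_rng(0).standard_normal(1000)
for b in (2, 4, 8):
    err = np.linalg.norm(stochastic_quantize(g, b) - g)
    print(f"bits={b:d}  quantization error={err:.4f}")
```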

2. BACKGROUND AND RELATED WORK

Consider an FL system with one central parameter server and a device set $\mathcal{M}$ of $M = |\mathcal{M}|$ distributed devices that collaboratively train a global model parameterized by $\theta \in \mathbb{R}^d$. Each device $m \in \mathcal{M}$ holds a private local dataset $\mathcal{D}_m = \{(x_{n_m}, y_{n_m})\}$ of $n_m$ samples. The federated training process is typically performed by solving the following optimization problem

$$\min_{\theta} f(\theta) := \frac{1}{M} \sum_{m \in \mathcal{M}} f_m(\theta) \quad \text{with} \quad f_m(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}_m}\left[ l\left(h_{\theta}(x), y\right) \right], \qquad (1)$$

where $f: \mathbb{R}^d \to \mathbb{R}$ denotes the empirical risk, $f_m: \mathbb{R}^d \to \mathbb{R}$ denotes the local objective based on the private data $\mathcal{D}_m$ of device $m$, $l$ denotes the local loss function, and $h_{\theta}$ denotes the local model.

The FL training process is conducted by iteratively performing local updates and global aggregation, as proposed in (McMahan et al., 2017). First, at communication round $k$, each device $m$ receives the global model $\theta^k$ from the parameter server and trains it with its local data $\mathcal{D}_m$. Subsequently, it sends the local gradient $\nabla f_m(\theta^k)$ to the central server, and the server updates the global model with learning rate $\alpha$ by

$$\theta^{k+1} := \theta^k - \frac{\alpha}{M} \sum_{m \in \mathcal{M}} \nabla f_m(\theta^k).$$

Definition 2.1 (Quantized gradient innovation). For more efficiency, each device only uploads the quantized deflection between the full gradient $\nabla f_m(\theta^k)$ and the last quantized value $q_m^{k-1}$, utilizing a quantization operator $Q: \mathbb{R}^d \to \mathbb{R}^d$, i.e., $\Delta q_m^k = Q\left(\nabla f_m(\theta^k) - q_m^{k-1}\right)$.

For communication frequency reduction, the lazy aggregation strategy allows device $m \in \mathcal{M}$ to upload its newly quantized gradient innovation at epoch $k$ only when the change in its local gradient is sufficiently large.
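The interaction between Definition 2.1 and lazy aggregation can be summarized by a short device-side sketch. All names below (`device_step`, `local_grad`, `quantize`, `threshold`) are hypothetical placeholders; the skipping test shown is the generic LAQ-style "upload only if the gradient change is large enough" rule, not AQUILA's criterion, which is derived later in the paper.

```python
import numpy as np

def device_step(theta, local_grad, prev_q, quantize, bits, threshold):
    """One communication round on device m (generic LAQ-style sketch).

    local_grad : callable returning the full local gradient grad f_m(theta)
    prev_q     : last quantized gradient q_m^{k-1}, kept by device and server
    quantize   : quantization operator Q: R^d -> R^d (e.g. a b-bit quantizer)
    threshold  : skipping threshold; a simple norm test is used here only
                 for illustration
    """
    grad = local_grad(theta)                      # grad f_m(theta^k)
    innovation = grad - prev_q                    # change since the last upload
    if np.linalg.norm(innovation) ** 2 <= threshold:
        return None, prev_q                       # skip upload; server reuses q_m^{k-1}
    delta_q = quantize(innovation, bits)          # Delta q_m^k = Q(grad - q_m^{k-1})
    new_q = prev_q + delta_q                      # q_m^k = q_m^{k-1} + Delta q_m^k
    return delta_q, new_q                         # transmit Delta q_m^k, store q_m^k
```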


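On the server side, LAQ-style methods typically keep the last quantized gradient of every device and apply received innovations before forming the global update above. The sketch below assumes exactly this bookkeeping; it is not the exact AQUILA aggregation rule, and `server_round` is a hypothetical name.

```python
import numpy as np

def server_round(theta, q_store, received, lr):
    """One global aggregation with lazily aggregated quantized gradients (sketch).

    q_store  : dict mapping device m -> last quantized gradient q_m^{k-1}
    received : dict mapping device m -> quantized innovation Delta q_m^k
               (devices that skipped this round simply do not appear here)
    lr       : learning rate alpha
    """
    for m, delta_q in received.items():
        q_store[m] = q_store[m] + delta_q         # q_m^k = q_m^{k-1} + Delta q_m^k
    avg = sum(q_store.values()) / len(q_store)    # (1/M) * sum_m q_m^k
    return theta - lr * avg                       # theta^{k+1} = theta^k - alpha * avg
```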