AQUILA: COMMUNICATION EFFICIENT FEDERATED LEARNING WITH ADAPTIVE QUANTIZATION OF LAZILY-AGGREGATED GRADIENTS

Abstract

The development and deployment of federated learning (FL) have been bottlenecked by the heavy communication overhead of exchanging high-dimensional models between distributed device nodes and the central server. To achieve better error-communication trade-offs, recent efforts either adaptively reduce the communication frequency by skipping unimportant updates (e.g., lazy aggregation) or adjust the quantization bits for each communication. In this paper, we propose a unifying communication-efficient framework for FL based on adaptive quantization of lazily-aggregated gradients (AQUILA), which adaptively balances two mutually-dependent factors: the communication frequency and the quantization level. Specifically, we start with a careful investigation of the classical lazy aggregation scheme and formulate AQUILA as an optimization problem in which the optimal quantization level is selected by minimizing the model deviation caused by update skipping. Furthermore, we devise a new lazy aggregation strategy that better fits the novel quantization criterion and keeps the communication frequency at an appropriate level. The effectiveness and convergence of the proposed AQUILA framework are theoretically verified. The experimental results demonstrate that AQUILA can reduce around 60% of the overall transmitted bits compared to existing methods while achieving identical model performance in a number of non-homogeneous FL scenarios, including Non-IID data and heterogeneous model architectures.

1. INTRODUCTION

With the deployment of ubiquitous sensing and computing devices, the Internet of Things (IoT), as well as many other distributed systems, has gradually grown from concept to reality, bringing dramatic convenience to people's daily life (Du et al., 2020; Liu et al., 2020; Hard et al., 2018). To fully utilize such distributed computing resources, distributed learning provides a promising framework that can achieve performance comparable to the traditional centralized learning scheme. However, the privacy and security of sensitive data during the updating and transmission processes in distributed learning have been a growing concern. In this context, federated learning (FL) (McMahan et al., 2017) has been developed, allowing distributed devices to collaboratively learn a global model without privacy leakage by keeping private data isolated and masking transmitted information with secure approaches. On account of its privacy-preserving property and great potential in distributed but privacy-sensitive fields such as finance and health, FL has attracted tremendous attention from both academia and industry in recent years. Unfortunately, in many FL applications, such as image classification and object recognition, the trained model tends to be high-dimensional, resulting in significant communication costs. Hence, communication efficiency has become one of the key bottlenecks of FL. To this end, Sun et al. (2020) propose the lazily-aggregated quantization (LAQ) method, which skips unnecessary parameter uploads by estimating the value of the gradient innovation, i.e., the difference between the current unquantized gradient and the previously quantized gradient. Moreover, Mao et al. (2021) devise an adaptive quantized gradient (AQG) strategy based on LAQ that dynamically selects the quantization level from a set of artificially given values during the training process.
Nevertheless, AQG is still not sufficiently adaptive because the pre-determined quantization levels are difficult to choose in complicated FL environments. In a separate line of work, Jhunjhunwala et al. (2021) introduce an adaptive quantization rule for FL (AdaQuantFL), which searches a given range for an optimal quantization level and achieves a better error-communication trade-off. Most previous research has investigated either optimizing the communication frequency or adjusting the quantization level in a highly adaptive manner, but not both. Intuitively, we ask: can we adaptively adjust the quantization level in the lazy aggregation fashion to simultaneously reduce the transmitted amount and the communication frequency? In this paper, we select the optimal quantization level for every participating device by minimizing the model deviation caused by skipping quantized gradient updates (i.e., lazy aggregation), which yields a novel quantization criterion coupled with a newly proposed lazy aggregation strategy that further reduces overall communication costs while still offering a convergence guarantee. The contributions of this paper are threefold. • We propose an innovative FL procedure with adaptive quantization of lazily-aggregated gradients, termed AQUILA, which adjusts the communication frequency and quantization level in a synergistic fashion. • Instead of naively combining LAQ and AdaQuantFL, AQUILA uses a completely different device selection method and quantization level computation. Specifically, we derive an adaptive quantization strategy from a new perspective that minimizes the model deviation introduced by lazy aggregation. Subsequently, we present a new lazy aggregation criterion that is more precise and saves more device storage. Furthermore, we provide a convergence analysis of AQUILA for the generally non-convex case and under the Polyak-Łojasiewicz condition.
• Beyond standard FL settings, such as the independent and identically distributed (IID) data environment, we experimentally evaluate the performance of AQUILA in a number of non-homogeneous FL settings, such as non-independent and non-identically distributed (Non-IID) local datasets and various heterogeneous model architectures. The evaluation results reveal that AQUILA considerably mitigates the communication overhead compared to a variety of state-of-the-art algorithms.

2. BACKGROUND AND RELATED WORK

Consider an FL system with one central parameter server and a device set $\mathcal{M}$ of $M = |\mathcal{M}|$ distributed devices that collaboratively train a global model parameterized by $\theta \in \mathbb{R}^d$. Each device $m \in \mathcal{M}$ holds a private local dataset $\mathcal{D}_m = \{(x_1^{(m)}, y_1^{(m)}), \cdots, (x_{n_m}^{(m)}, y_{n_m}^{(m)})\}$ of $n_m$ samples. The federated training process is typically performed by solving the following optimization problem:
$$\min_{\theta \in \mathbb{R}^d} f(\theta) = \frac{1}{M} \sum_{m=1}^{M} f_m(\theta) \quad \text{with} \quad f_m(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}_m}\left[\, l\left(h_\theta(x), y\right) \right], \tag{1}$$
where $f: \mathbb{R}^d \to \mathbb{R}$ denotes the empirical risk, $f_m: \mathbb{R}^d \to \mathbb{R}$ denotes the local objective based on the private data $\mathcal{D}_m$ of device $m$, $l$ denotes the local loss function, and $h_\theta$ denotes the local model. The FL training process iteratively performs local updates and global aggregation as proposed in McMahan et al. (2017). At communication round $k$, each device $m$ receives the global model $\theta^k$ from the parameter server, trains it with its local data $\mathcal{D}_m$, and sends the local gradient $\nabla f_m(\theta^k)$ to the central server, which updates the global model with learning rate $\alpha$:
$$\theta^{k+1} := \theta^k - \frac{\alpha}{M} \sum_{m \in \mathcal{M}} \nabla f_m(\theta^k).$$
Definition 2.1 (Quantized gradient innovation). For more efficiency, each device uploads only the quantized deflection between the full gradient $\nabla f_m(\theta^k)$ and the last quantized value $q_m^{k-1}$, using a quantization operator $Q: \mathbb{R}^d \to \mathbb{R}^d$, i.e., $\Delta q_m^k = Q(\nabla f_m(\theta^k) - q_m^{k-1})$.
For communication frequency reduction, the lazy aggregation strategy allows device $m \in \mathcal{M}$ to upload its newly-quantized gradient innovation at epoch $k$ only when the change in the local gradient is sufficiently larger than a threshold. Hence, the quantization $q_m^k$ of the local gradient of device $m$ at epoch $k$ is
$$q_m^k := \begin{cases} q_m^{k-1}, & \text{if } Q(\nabla f_m(\theta^k) - q_m^{k-1}) \text{ falls below the threshold in (4)}, \\ q_m^{k-1} + \Delta q_m^k, & \text{otherwise.} \end{cases}$$
For adaptive quantization, AdaQuantFL adjusts the quantization levels during the FL training process. Specifically, AdaQuantFL computes the optimal quantization level $(b^k)^*$ by $(b^k)^* = \left\lfloor \sqrt{f(\theta^0)/f(\theta^k)} \cdot b^0 \right\rfloor$, where $f(\theta^0)$ and $f(\theta^k)$ are the global objective losses defined in (1). However, AdaQuantFL transmits quantized gradients at every communication round. To jointly skip unnecessary communication rounds and adaptively adjust the quantization level for each communication, a naive approach is to quantize lazily-aggregated gradients with AdaQuantFL. Nevertheless, it fails to achieve efficient communication for several reasons. First, given the descending trend of the training loss, AdaQuantFL's criterion may lead to a quantization bit-width that even exceeds 32 bits during training (assuming a floating point is represented by 32 bits in our case), which is too large for cases where global convergence is already approaching and makes the quantization meaningless.
Second, a higher quantization level results in a smaller quantization error, which lowers the communication threshold in the lazy aggregation criterion (4) and thus raises the transmission frequency. Consequently, it is desirable to develop a more efficient adaptive quantization method in the lazily-aggregated setting to systematically improve communication efficiency in FL.
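The bit-width blow-up described above can be sketched numerically. The helper below is our own toy illustration (the loss values are hypothetical), assuming AdaQuantFL's published rule $(b^k)^* = \lfloor \sqrt{f(\theta^0)/f(\theta^k)} \cdot b^0 \rfloor$:

```python
import math

def adaquantfl_level(loss_0: float, loss_k: float, b_0: int = 2) -> int:
    """AdaQuantFL's adaptive rule: (b^k)* = floor(sqrt(f(theta^0)/f(theta^k)) * b^0)."""
    return math.floor(math.sqrt(loss_0 / loss_k) * b_0)

# As the global loss shrinks, the prescribed bit-width grows without bound,
# eventually exceeding the 32 bits of a full-precision float:
loss_0 = 2.3
for loss_k in (2.3, 0.23, 0.023, 2.3e-5):
    print(loss_k, adaquantfl_level(loss_0, loss_k))  # 2, 6, 20, 632
```

With a near-zero loss the rule prescribes hundreds of bits per coordinate, more than transmitting the raw float, which is the first failure mode of the naive combination.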

3. ADAPTIVE QUANTIZATION OF LAZILY-AGGREGATED GRADIENTS

Given the above limitations of the naive joint use of the existing adaptive quantization criterion and lazy aggregation strategy, this paper aims to design a unifying procedure for communication efficiency optimization where the quantization level and communication frequency are considered synergistically and interactively.

3.1. OPTIMAL QUANTIZATION LEVEL

First, we introduce the definitions of a deterministic rounding quantizer and a fully-aggregated model.
Definition 3.1 (Deterministic rounding quantizer). Each coordinate of the gradient innovation is mapped to a point on an integer grid:
$$\psi_m^k[i] = \left\lfloor \frac{[\nabla f_m(\theta^k)]_i - [q_m^{k-1}]_i + R_m^k}{2\tau_m^k R_m^k} + \frac{1}{2} \right\rfloor, \quad \forall i \in \{1, 2, \dots, d\}, \tag{6}$$
where $\nabla f_m(\theta^k)$ denotes the current unquantized gradient, $R_m^k = \|\nabla f_m(\theta^k) - q_m^{k-1}\|_\infty$ denotes the quantization range, $b_m^k$ denotes the quantization level, and $\tau_m^k := 1/(2^{b_m^k} - 1)$ denotes the quantization granularity. More explanations on this quantizer are given in Appendix A.2.
Definition 3.2 (Fully-aggregated model). The fully-aggregated model $\tilde{\theta}$ without lazy aggregation at epoch $k$ is computed by
$$\tilde{\theta}^{k+1} = \theta^k - \frac{\alpha}{M} \sum_{m \in \mathcal{M}} \left( q_m^{k-1} + \Delta q_m^k \right).$$
Lemma 3.1. The influence of lazy aggregation at communication round $k$ can be bounded by
$$\|\tilde{\theta}^k - \theta^k\|_2^2 \leqslant \frac{4\alpha^2 |\mathcal{M}_c^k|}{M^2} \sum_{m \in \mathcal{M}_c^k} \left[ \left\| \nabla f_m(\theta^k) - q_m^{k-1} - \tau_m^k R_m^k \mathbf{1} \right\|_2^2 + 4(R_m^k)^2 d + \frac{d}{2} \right]. \tag{8}$$
Corresponding to Lemma 3.1, since $R_m^k$ is independent of $\tau_m^k$, we can formulate an optimization problem that minimizes the upper bound of this model deviation caused by update skipping for each device $m$:
$$\underset{0 < \tau_m^k \leqslant 1}{\text{minimize}} \; \left\| \nabla f_m(\theta^k) - q_m^{k-1} - \tau_m^k R_m^k \mathbf{1} \right\|_2^2 \quad \text{subject to} \quad \tau_m^k = \frac{1}{2^{b_m^k} - 1}. \tag{9}$$
Solving this optimization problem gives AQUILA an adaptive strategy (10) that selects the optimal quantization level based on the quantization range $R_m^k$, the dimension $d$ of the local model, the current gradient $\nabla f_m(\theta^k)$, and the last uploaded quantized gradient $q_m^{k-1}$:
$$(b_m^k)^* = \left\lceil \log_2\left( \frac{R_m^k \sqrt{d}}{\left\|\nabla f_m(\theta^k) - q_m^{k-1}\right\|_2} + 1 \right) \right\rceil. \tag{10}$$
The superiority of (10) comes from three aspects. First, since $R_m^k \geqslant [\nabla f_m(\theta^k)]_i - [q_m^{k-1}]_i \geqslant -R_m^k$, the optimal quantization level $(b_m^k)^*$ is always greater than or equal to 1. Second, AQUILA personalizes an optimal quantization level for each device according to its own gradient, whereas in AdaQuantFL every device merely uses an identical quantization level determined by the global loss. Third, the gradient innovation and quantization range $R_m^k$ tend to fluctuate during training rather than keep descending, which prevents the quantization level from increasing tremendously as in AdaQuantFL.
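The level selection (10) can be implemented per device in a few lines. The following is a sketch rather than the authors' released code; the function and argument names are our own:

```python
import math
import numpy as np

def optimal_bits(grad: np.ndarray, q_prev: np.ndarray) -> int:
    """Per-device quantization level of Eq. (10):
    (b_m^k)* = ceil(log2(R_m^k * sqrt(d) / ||grad - q_prev||_2 + 1)),
    with R_m^k = ||grad - q_prev||_inf."""
    v = grad - q_prev                      # gradient innovation
    R = float(np.abs(v).max())             # quantization range R_m^k
    if R == 0.0:                           # nothing new to quantize
        return 1
    d = v.size
    return math.ceil(math.log2(R * math.sqrt(d) / float(np.linalg.norm(v)) + 1.0))
```

Since $\|v\|_2 \leqslant \sqrt{d}\,\|v\|_\infty = \sqrt{d}\,R_m^k$, the argument of the logarithm is at least 2, so the returned level is at least 1, matching the first point above.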

3.2. PRECISE LAZY AGGREGATION CRITERION

Definition 3.3 (Quantization error). The global quantization error $\varepsilon^k$ is defined as the difference between the current unquantized gradient $\nabla f(\theta^k)$ and its quantized value $q^{k-1} + \Delta q^k$, i.e., $\varepsilon^k = \nabla f(\theta^k) - q^{k-1} - \Delta q^k$, where $\nabla f(\theta^k) = \sum_{m\in\mathcal{M}} \nabla f_m(\theta^k)$, $q^{k-1} = \sum_{m\in\mathcal{M}} q_m^{k-1}$, and $\Delta q^k = \sum_{m\in\mathcal{M}} \Delta q_m^k$. To better accommodate the larger quantization errors induced by the fewer quantization bits in (10), AQUILA adopts a new communication criterion that avoids a potential expansion of the group of skipped devices:
$$\left\|\Delta q_m^k\right\|_2^2 + \left\|\varepsilon_m^k\right\|_2^2 \leqslant \frac{\beta}{\alpha^2} \left\|\theta^k - \theta^{k-1}\right\|_2^2, \quad \forall m \in \mathcal{M}_c^k, \tag{12}$$
where $\beta \geqslant 0$ is a tuning factor. Note that this skipping rule is employed at epoch $k$: each device $m$ calculates its quantized gradient innovation $\Delta q_m^k$ and quantization error $\varepsilon_m^k$, then uses this rule to decide whether to upload $\Delta q_m^k$. A comparison of AQUILA's skip rule with LAQ's is given in Appendix A.2. Instead of storing a large number of previous model parameters as LAQ does, the strength of (12) is that AQUILA directly uses the global models of two adjacent rounds as the skip condition. This avoids estimating the global gradient (and is thus more precise), requires fewer hyperparameters to tune, and considerably reduces the storage pressure on local devices, which is especially important for small-capacity devices (e.g., sensors) in practical IoT scenarios.

Algorithm 1 Communication-Efficient FL with AQUILA
Input: the number of communication rounds K, the learning rate α.
Initialize: the initial global model parameter θ^0.
1: Server broadcasts θ^0 to all devices. ▷ For the initial round k = 0.
2: for each device m ∈ M in parallel do
3:     Compute the local gradient ∇f_m(θ^0).
4:     Compute (b_m^0)^* by setting q_m^{-1} = 0 in (10) and the quantized gradient innovation Δq_m^0, and transmit Δq_m^0 back to the server.
5: end for
6: for k = 1, 2, ..., K do
7:     Server broadcasts θ^k to all devices.
8:     for each device m ∈ M in parallel do
9:         Compute the local gradient ∇f_m(θ^k), the optimal local quantization level (b_m^k)^* by (10), and the quantized gradient innovation Δq_m^k.
10:        if (12) does not hold for device m then ▷ If (12) holds, skip uploading.
11:            Device m transmits Δq_m^k to the server.
12:        end if
13:    end for
14:    Server updates θ^{k+1} using the saved previous global quantized gradient q^{k-1} and the received quantized gradient innovations: θ^{k+1} := θ^k − α ( q^{k−1} + (1/M) Σ_{m∈M^k} Δq_m^k ).
15:    Server saves the averaged quantized gradient q^k for the next aggregation.
16: end for

The detailed process of AQUILA is summarized in Algorithm 1. At epoch $k = 0$, each device computes $b_m^0$ by setting $q_m^{-1} = 0$ and uploads $\Delta q_m^0$ to the server, since (12) is not satisfied. At each epoch $k \in \{1, 2, \dots, K\}$, the server first broadcasts the global model $\theta^k$ to all devices. Each device $m$ computes $\nabla f_m(\theta^k)$ with its local training data and then uses it to calculate an optimal quantization level by (10). Subsequently, each device computes its quantized gradient innovation and determines whether or not to upload it based on the communication criterion (12). Finally, the server updates the new global model $\theta^{k+1}$ with the up-to-date quantized gradients $q_m^{k-1} + \Delta q_m^k$ for the devices that transmit uploads at epoch $k$, while reusing the old quantized gradients $q_m^{k-1}$ for those that skip.
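The device-side skip decision of criterion (12) can be sketched as follows. This is a minimal illustration with our own names; in AQUILA proper, $\Delta q_m^k$ and $\varepsilon_m^k$ come from the quantizer of Section 3.1:

```python
import numpy as np

def should_skip(delta_q: np.ndarray, eps: np.ndarray,
                theta_k: np.ndarray, theta_prev: np.ndarray,
                alpha: float, beta: float) -> bool:
    """AQUILA's lazy-aggregation rule, Eq. (12): device m stays silent when
    ||dq_m^k||_2^2 + ||eps_m^k||_2^2 <= (beta/alpha^2) * ||theta^k - theta^{k-1}||_2^2."""
    lhs = float(np.sum(delta_q ** 2) + np.sum(eps ** 2))
    rhs = (beta / alpha ** 2) * float(np.sum((theta_k - theta_prev) ** 2))
    return lhs <= rhs
```

When `should_skip` returns True, the upload is skipped and the server simply reuses $q_m^{k-1}$; note that the right-hand side needs only the two most recent global models, not a window of past iterates as in LAQ.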

4. THEORETICAL DERIVATION AND ANALYSIS OF AQUILA

As aforementioned, we bound the model deviation caused by skipping updates with respect to the quantization bits. Specifically, if the communication criterion (12) holds for device $m$ at epoch $k$, the device does not contribute to epoch $k$'s gradient; otherwise, the loss caused by device $m$ is minimized through the optimal quantization level selection criterion (10). In this section, the theoretical convergence derivation of AQUILA is based on the following standard assumptions.
Assumption 4.1 ($L$-smoothness). Each local objective function $f_m$ is $L_m$-smooth, i.e., there exists a constant $L_m > 0$ such that, $\forall x, y \in \mathbb{R}^d$,
$$\|\nabla f_m(x) - \nabla f_m(y)\|_2 \leqslant L_m \|x - y\|_2, \tag{13}$$
which implies that the global objective function $f$ is $L$-smooth with $L \leqslant \bar{L} = \frac{1}{M}\sum_{m=1}^{M} L_m$.
Assumption 4.2 (Uniform lower bound). For all $x \in \mathbb{R}^d$, there exists $f^* \in \mathbb{R}$ such that $f(x) \geqslant f^*$.
Lemma 4.1. Under the assumption that $f$ is $L$-smooth, we have
$$f(\theta^{k+1}) - f(\theta^k) \leqslant -\frac{\alpha}{2}\left\|\nabla f(\theta^k)\right\|_2^2 + \alpha\left( \frac{1}{M}\sum_{m\in\mathcal{M}_c^k} \left\|\Delta q_m^k\right\|_2^2 + \left\|\varepsilon^k\right\|_2^2 \right) + \left(\frac{L}{2} - \frac{1}{2\alpha}\right)\left\|\theta^{k+1} - \theta^k\right\|_2^2. \tag{14}$$

4.1. CONVERGENCE ANALYSIS FOR THE GENERALLY NON-CONVEX CASE.

Theorem 4.1. Suppose Assumptions 4.1, 4.2, and B.1 (29) are satisfied. If $\mathcal{M}_c^k \neq \emptyset$, the global objective function $f$ satisfies
$$f(\theta^{k+1}) - f(\theta^k) \leqslant -\frac{\alpha}{2}\left\|\nabla f(\theta^k)\right\|_2^2 + \left(\frac{L}{2} - \frac{1}{2\alpha}\right)\left\|\theta^{k+1} - \theta^k\right\|_2^2 + \frac{\beta\gamma}{\alpha}\left\|\theta^k - \theta^{k-1}\right\|_2^2. \tag{15}$$
Corollary 4.1. Let all the assumptions of Theorem 4.1 hold and $\frac{L}{2} - \frac{1}{2\alpha} + \frac{\beta\gamma}{\alpha} \leqslant 0$. Then AQUILA requires
$$K = \mathcal{O}\left(\frac{2\omega_1}{\alpha\epsilon^2}\right) \tag{16}$$
communication rounds, with $\omega_1 = f(\theta^1) - f(\theta^*) + \frac{\beta\gamma}{\alpha}\left\|\theta^1 - \theta^0\right\|_2^2$, to achieve $\min_k \|\nabla f(\theta^k)\|_2^2 \leqslant \epsilon^2$.
Compared to LAG. Corresponding to Eq. (70) in Chen et al. (2018), LAG defines a Lyapunov function $V^k := f(\theta^k) - f(\theta^*) + \sum_{d=1}^{D} \beta_d \left\|\theta^{k+1-d} - \theta^{k-d}\right\|_2^2$ and claims that it satisfies
$$V^{k+1} - V^k \leqslant -\left(\frac{\alpha}{2} - c(\alpha, \beta_1)(1+\rho)\alpha^2\right)\left\|\nabla f(\theta^k)\right\|_2^2, \tag{17}$$
where $c(\alpha, \beta_1) = L/2 - 1/(2\alpha) + \beta_1$, $\beta_1 = D\xi/(2\alpha\eta)$, and $\xi < 1/D$. Hence LAG requires
$$K_{\mathrm{LAG}} = \mathcal{O}\left(\frac{2\omega_1}{\left(\alpha - 2c(\alpha, \beta_1)(1+\rho)\alpha^2\right)\epsilon^2}\right)$$
communication rounds to converge.
Given the non-negativity of the term $c(\alpha, \beta_1)(1+\rho)\alpha^2$, we have $\alpha - 2c(\alpha, \beta_1)(1+\rho)\alpha^2 \leqslant \alpha$, and hence $K \leqslant K_{\mathrm{LAG}}$, which demonstrates that AQUILA achieves a better convergence rate than LAG with an appropriate selection of $\alpha$.

4.2. CONVERGENCE ANALYSIS UNDER POLYAK-ŁOJASIEWICZ CONDITION.

Assumption 4.3 (µ-PŁ condition). The function $f$ satisfies the PŁ condition with a constant $\mu > 0$, that is,
$$\left\|\nabla f(\theta^k)\right\|_2^2 \geqslant 2\mu\left(f(\theta^k) - f(\theta^*)\right). \tag{19}$$
Theorem 4.2. Suppose Assumptions 4.1, 4.2, and 4.3 are satisfied and $\mathcal{M}_c^k \neq \emptyset$. If the hyperparameters satisfy $\frac{\beta\gamma}{\alpha} \leqslant (1-\alpha\mu)\left(\frac{1}{2\alpha} - \frac{L}{2}\right)$, then the global objective function satisfies
$$f(\theta^{k+1}) - f(\theta^k) \leqslant -\alpha\mu\left(f(\theta^k) - f(\theta^*)\right) + \left(\frac{L}{2} - \frac{1}{2\alpha}\right)\left\|\theta^{k+1} - \theta^k\right\|_2^2 + \frac{\beta\gamma}{\alpha}\left\|\theta^k - \theta^{k-1}\right\|_2^2,$$
and AQUILA requires
$$K = \mathcal{O}\left(-\frac{1}{\log(1-\alpha\mu)}\log\frac{\omega_1}{\epsilon}\right) \tag{21}$$
communication rounds, with $\omega_1 = f(\theta^1) - f(\theta^*) + \left(\frac{1}{2\alpha} - \frac{L}{2}\right)\left\|\theta^1 - \theta^0\right\|_2^2$, to achieve $f(\theta^{K+1}) - f(\theta^*) + \left(\frac{1}{2\alpha} - \frac{L}{2}\right)\left\|\theta^{K+1} - \theta^K\right\|_2^2 \leqslant \epsilon$.
Compared to LAG. According to Eq. (50) in Chen et al. (2018), we have
$$V^K \leqslant \left(1 - \alpha\mu + \alpha\mu\sqrt{D\xi}\right)^K V^0, \tag{22}$$
where $\xi < 1/D$. Thus LAG requires
$$K_{\mathrm{LAG}} = \mathcal{O}\left(-\frac{1}{\log\left(1-\alpha\mu+\alpha\mu\sqrt{D\xi}\right)}\log\frac{\omega_1}{\epsilon}\right) \tag{23}$$
communication rounds to converge. Compared to Theorem 4.2, we can derive that $\log(1-\alpha\mu) < \log(1-\alpha\mu+\alpha\mu\sqrt{D\xi})$, which indicates that AQUILA converges faster than LAG under the PŁ condition.
Remark. We want to emphasize that LAQ introduces a Lyapunov function into its proof, making it extremely complicated. In addition, LAQ can only guarantee that the final objective function converges to a neighborhood of the optimal solution rather than the exact optimum $f(\theta^*)$. By contrast, as discussed in Section 3.2, AQUILA uses the precise model difference as a surrogate for the global gradient and thus simplifies the proof.
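The gap between the two linear rates can likewise be checked with a toy calculation (our own sketch; $D$, $\xi$, $\mu$, and $\omega_1$ below are illustrative values satisfying $\xi < 1/D$):

```python
import math

def rounds_pl(alpha: float, mu: float, omega_1: float, eps: float,
              slack: float = 0.0) -> float:
    """Rounds K such that rho^K * omega_1 <= eps, where the contraction factor is
    rho = 1 - alpha*mu + slack; slack = 0 for AQUILA (Theorem 4.2),
    slack = alpha*mu*sqrt(D*xi) for LAG (Eq. (23))."""
    rho = 1.0 - alpha * mu + slack
    return math.log(omega_1 / eps) / (-math.log(rho))

alpha, mu = 0.1, 1.0
lag_slack = alpha * mu * math.sqrt(10 * 0.05)       # D = 10, xi = 0.05 < 1/D
print(rounds_pl(alpha, mu, 10.0, 0.01))             # AQUILA: ~65.6 rounds
print(rounds_pl(alpha, mu, 10.0, 0.01, lag_slack))  # LAG: ~232 rounds
```

The positive slack term pushes LAG's contraction factor closer to 1, inflating the required number of rounds, which is exactly the comparison made after (23).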

5. EXPERIMENTS

5.1. EXPERIMENT SETUP

In this paper, we evaluate AQUILA on the CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and WikiText-2 (Merity et al., 2016) datasets, considering the IID and Non-IID data scenarios as well as heterogeneous model architectures (also a crucial challenge in FL). The FL environment is simulated in Python 3.9 with a PyTorch 1.11 (Paszke et al., 2019) implementation. For diversity of neural network structures, we train ResNet-18 (He et al., 2016) on CIFAR-10, MobileNet-v2 (Sandler et al., 2018) on CIFAR-100, and a Transformer (Vaswani et al., 2017) on WikiText-2. As for the FL system setting, in the majority of our experiments the system consists of M = 10 devices. However, considering the large-scale nature of FL, we also validate AQUILA on larger systems of M = 100 devices for the CIFAR datasets and M = 80 devices for WikiText-2. The hyperparameters and additional details of our experiments are provided in Appendix A.3.

5.2. HOMOGENEOUS ENVIRONMENT

We first evaluate AQUILA in homogeneous settings where all local models share the same architecture as the global model. To better demonstrate the effectiveness of AQUILA, its performance is compared with several state-of-the-art methods, including AdaQuantFL, LAQ with fixed quantization levels, LENA (Ghadikolaei et al., 2021), MARINA (Gorbunov et al., 2021), and the naive combination of AdaQuantFL with LAQ. Based on this homogeneous setting, we conduct both IID and Non-IID evaluations on the CIFAR-10 and CIFAR-100 datasets, and an IID evaluation on WikiText-2. To simulate the Non-IID FL setting as in Diao et al. (2020), each device is allocated at most two classes of data in CIFAR-10 and ten classes in CIFAR-100, with the amount of data per label balanced. The experimental results are presented in Fig. 1, where 100% means all local models share the same structure as the global model (i.e., homogeneity), 100% (80 devices) denotes an experiment conducted in an 80-device system, and LAdaQ denotes the naive combination of AdaQuantFL and LAQ. For better illustration, the results have been smoothed by their standard deviation: solid lines represent values after smoothing, and transparent shades of the same colors around them represent the true values. Additionally, Table 2 reports the total number of bits transmitted by all devices throughout the FL training process. Comprehensive experimental results are given in Appendix A.4.

5.3. NON-HOMOGENEOUS SCENARIO

In this section, we also evaluate AQUILA with heterogeneous model structures as in HeteroFL (Diao et al., 2020), where the local models trained on the device side have heterogeneous architectures. Suppose the global model at epoch $k$ is $\theta^k$ with size $d = w_g \times h_g$; then the local model of each device $m$ is selected as $\theta_m^k = \theta^k[:w_m, :h_m]$, where $w_m = r_m w_g$ and $h_m = r_m h_g$, respectively. In this paper, we choose the model complexity level $r_m = 0.5$. Performance Analysis. First of all, AQUILA achieves a significant transmission reduction compared to the naive combination of LAQ and AdaQuantFL on all datasets, which demonstrates the superiority of AQUILA's efficiency. Specifically, Table 2 indicates that, compared to the naive combination, AQUILA saves 57.49% of transmitted bits in the 80-device system on WikiText-2 and 23.08% in the 100-device system on CIFAR-100. The other results in Table 3 likewise show an obvious reduction in the total transmitted bits required for convergence. Second, in Fig. 1 and Fig. 2, the changing trend of AQUILA's communication bits per round clearly verifies the necessity and effectiveness of our adaptive quantization level and skip criterion. In these two figures, the number of bits transmitted in each round of AQUILA fluctuates mildly, indicating the effectiveness of AQUILA's selection rule, while remaining at quite a low level, suggesting that the adaptive quantization principle makes training more efficient. Moreover, the figures show that the quantization level selected by AQUILA does not continuously increase during training, unlike AdaQuantFL. From these two figures we can also conclude that AQUILA converges faster under the same communication costs. Finally, AQUILA is capable of adapting to a wide range of challenging FL circumstances.
In the Non-IID scenario and with heterogeneous model structures, AQUILA still outperforms the other algorithms, significantly reducing the overall transmitted bits while maintaining the same convergence property and objective function value. In particular, AQUILA reduces overall communication costs by 60.4% compared to LENA and by 57.2% compared to MARINA on average. These experimental results in non-homogeneous FL settings show that AQUILA can be stably employed in more general and complicated FL scenarios.
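The HeteroFL-style sub-model selection used above can be sketched as follows. This is a simplification for a single 2-D weight matrix; the function name is ours:

```python
import numpy as np

def submodel(theta: np.ndarray, r_m: float) -> np.ndarray:
    """Slice a local model theta_m^k = theta^k[:w_m, :h_m] from the global weight,
    with w_m = r_m * w_g and h_m = r_m * h_g (HeteroFL, Diao et al., 2020)."""
    w_g, h_g = theta.shape
    return theta[: int(r_m * w_g), : int(r_m * h_g)]

theta = np.arange(16.0).reshape(4, 4)
local = submodel(theta, 0.5)   # the 2x2 upper-left corner of the 4x4 global weight
```

With $r_m = 0.5$ as in our experiments, each heterogeneous device trains a quarter-size corner of the global weight, and the server aggregates the overlapping regions.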

5.4. ABLATION STUDY ON THE IMPACT OF TUNING FACTOR β

One key contribution of AQUILA is the new lazy aggregation criterion (12) for reducing communication frequency. In this part, we evaluate the effect of the tuning factor β on the training loss in Fig. 3. As β grows within a certain range, the convergence speed of the model slows down (due to lazy aggregation), but the model eventually converges to the same performance while considerably reducing the communication overhead. Nevertheless, increasing β too far decreases the final model performance, since too many essential uploads are skipped and training becomes deficient. The accuracy (perplexity) comparison of AQUILA under various selections of the tuning factor β is shown in Fig. 10, which indicates the same trend. To sum up, β should be chosen to maintain the model's performance while minimizing the total number of transmitted bits. Specifically, we select β = 0.1, 0.25, and 1.25 for the CIFAR-10, CIFAR-100, and WikiText-2 datasets in our evaluation, respectively.

6. CONCLUSIONS AND FUTURE WORK

This paper proposes a communication-efficient FL procedure that simultaneously adjusts two mutually-dependent degrees of freedom: communication frequency and quantization level. Through the close cooperation of the novel adaptive quantization criterion and the adjusted lazy aggregation strategy derived in this paper, the proposed AQUILA is proven to reduce transmission costs while maintaining the convergence guarantee and model performance compared to existing methods. The evaluation with Non-IID data distributions and various heterogeneous model architectures demonstrates that AQUILA is compatible with non-homogeneous FL environments.

REPRODUCIBILITY

We present the overall theorem statements and proofs for our main results in the Appendix, as well as the necessary experimental figures. Furthermore, we submit the code of AQUILA as supplementary material, including all hyperparameters and a requirements file, to help the public reproduce our experimental results. Our algorithm is straightforward, well-described, and easy to implement.

ETHICS STATEMENT

All evaluations of AQUILA are performed on publicly available datasets for reproducibility purposes. This paper empirically studies the performance of various state-of-the-art algorithms and therefore likely introduces no new ethical or cultural problems. It does not introduce any new dataset.

A APPENDIX

The appendix includes supplementary experimental results, mathematical proofs of the aforementioned theorems, and a detailed derivation of the novel adaptive quantization criterion and lazy aggregation strategy. Compared to Fig. 1 and Fig. 2 in the main text, the figures in the appendix give a more comprehensive evaluation of AQUILA, containing more detailed information including, but not limited to, accuracy vs. steps and training loss vs. steps curves.

A.1 OVERALL FRAMEWORK OF AQUILA

The cooperation of the novel adaptive quantization criterion (10) and lazy aggregation strategy (12) is illustrated in Fig. 4a. Compared to the naive combination of AdaQuantFL and LAQ, which ignores the mutual influence between adaptive quantization and lazy aggregation (Fig. 4b), AQUILA adaptively optimizes the allocation of quantization bits throughout training to promote the convergence of lazy aggregation, and simultaneously uses the lazy aggregation strategy to improve the efficiency of adaptive quantization by compressing transmissions with a lower quantization level.

A.2 EXPLANATION OF THE QUANTIZER AND THE SKIP RULE OF LAQ'S

The quantizer (6) is a deterministic quantizer that, at each dimension, maps the gradient innovation to the closest point on a one-dimensional grid. The range of the grid is $R_m^k$, and its granularity is determined by the quantization granularity $\tau_m^k$. Each dimension of the gradient innovation is mapped to an integer in $\{0, 1, 2, \dots, 2^b - 1\}$. More precisely, the $1/2$ ensures mapping to the closest integer instead of flooring to a smaller one, and the $R_m^k$ in the numerator ensures that the mapped integer is non-negative. As a result, when the gradient innovation is transmitted to the central server, 32 bits are used for the range and $b \cdot d$ bits for the mapped integers, so $32 + b \cdot d$ bits are transmitted in total. The difference between (6) and (32) (Lemma B.2) is that (6) encodes the raw gradient innovation vector into an integer vector, whilst (32) decodes the integer vector back into a quantized gradient innovation vector. Specifically, during training, each client uses (6) to encode the gradient innovation into an integer at each dimension; afterwards, the integer vector $\psi_m^k$ and $\tau_m^k$ are sent to the central server, which decodes the quantized gradient innovation as (32) states. The skip rule of LAQ is measured by the sum of the accumulated model differences and quantization errors:
$$\left\|\Delta q_m^k\right\|_2^2 \leqslant \frac{1}{\alpha^2 M^2} \sum_{d'=1}^{D} \xi_{d'} \left\|\theta^{k+1-d'} - \theta^{k-d'}\right\|_2^2 + 3\left(\left\|\varepsilon_m^k\right\|_2^2 + \left\|\tilde{\varepsilon}_m^{k-1}\right\|_2^2\right), \tag{24}$$
where $\{\xi_{d'}\}$ is a series of manually selected scalars and $D$ is also predetermined; $\varepsilon_m^k$ is the quantization error of client $m$ at epoch $k$, and $\tilde{\varepsilon}_m^{k-1}$ is the quantization error of client $m$ at the last time it uploaded its gradient innovation. Please refer to Sun et al. (2020) for more details on (24). To compute the LAQ skip threshold, each client has to store a large amount of previous information. The differences between the AQUILA and LAQ skipping criteria are as follows.
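The encode/decode pair described above can be sketched as follows. This is our own minimal implementation of the deterministic quantizer (6) and its server-side inverse; the function names are assumptions:

```python
import numpy as np

def encode(grad: np.ndarray, q_prev: np.ndarray, b: int):
    """Client side, Eq. (6): map each coordinate of the gradient innovation onto
    the integer grid {0, ..., 2^b - 1}; also return the range R needed to decode."""
    v = grad - q_prev
    R = float(np.abs(v).max())
    if R == 0.0:
        return np.zeros(v.shape, dtype=np.int64), 0.0
    tau = 1.0 / (2 ** b - 1)               # quantization granularity
    psi = np.floor((v + R) / (2 * tau * R) + 0.5).astype(np.int64)
    return psi, R

def decode(psi: np.ndarray, R: float, b: int) -> np.ndarray:
    """Server side (cf. Eq. (32)): recover the quantized gradient innovation."""
    tau = 1.0 / (2 ** b - 1)
    return 2 * tau * R * psi - R

# Round trip: the per-coordinate reconstruction error is at most tau * R.
v = np.array([1.0, -1.0, 0.5])
psi, R = encode(v, np.zeros(3), b=2)
recovered = decode(psi, R, b=2)
```

Only the integer vector, the 32-bit range $R$, and the level $b$ cross the network, matching the $32 + b \cdot d$ bit count above.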
First, the AQUILA threshold is easier for a local client to compute: compared to the LAQ skipping criterion, the AQUILA criterion is more concise and thus requires less storage and computing power. Second, the AQUILA criterion is easier to tune because it introduces far fewer hyperparameters: in the LAQ criterion, $\alpha$, $D$, and $\{\xi_{d'}\}_{d'=1}^{D}$ are all manually selected, whilst only two hyperparameters, $\alpha$ and $\beta$, appear in the AQUILA criterion. Third, with the given threshold, AQUILA has good theoretical properties: the analysis of AQUILA is easier to follow since no Lyapunov function is introduced as in LAQ, and the results show that AQUILA achieves a better convergence rate in the non-convex case and under the PŁ condition.

A.3 EXPERIMENT SETUP

In this section, we provide extra hyperparameter settings for our evaluation. For LAQ, we set $D = 10$ and $\xi_1 = \xi_2 = \cdots = \xi_D = 0.8/D$, the same as the setting in their paper. For LENA, we set $\beta_{\mathrm{LENA}} = 40$ in their trigger condition. For MARINA, we calculate the uploading probability of the Bernoulli distribution as $p = \xi_Q/d$ as announced in their paper. In addition, we choose the cross-entropy function as the objective function in the experimental part. Table 1 lists the hyperparameter details of our evaluation.

A.4 ADDITIONAL EXPERIMENTAL RESULTS

This section covers the remaining experimental results of our paper.

If the coefficient of $\|\varepsilon^k\|_2^2$ in (45) is less than or equal to 0, that is, $(\alpha L - 1)\left(1 + p^{-1}\right) + 2 \leqslant 0$, then the coefficient of $\|\nabla f(\theta^k)\|_2^2$ is less than $-\frac{\alpha}{2}$, which indicates that
$$f(\theta^{k+1}) - f(\theta^k) \leqslant -\frac{\alpha}{2}\left\|\nabla f(\theta^k)\right\|_2^2. \tag{46}$$
Note that it is not difficult to demonstrate that (46) and $\frac{L}{2} - \frac{1}{2\alpha} + \frac{\beta\gamma}{\alpha} \leqslant 0$ can actually be satisfied at the same time. For instance, $p = 0.1$, $\alpha = 0.1$, $\beta = 0.25$, $\gamma = 2$, $L = 2.5$ satisfies both of them.

F MISSING PROOF OF THEOREM 4.2

Based on the intermediate result (40) of Theorem 4.1 and Assumption 4.3 (µ-PŁ condition), we have
$$f(\theta^{k+1}) - f(\theta^k) \leqslant -\frac{\alpha}{2}\left\|\nabla f(\theta^k)\right\|_2^2 + \left(\frac{L}{2} - \frac{1}{2\alpha}\right)\left\|\theta^{k+1} - \theta^k\right\|_2^2 + \frac{\beta\gamma}{\alpha}\left\|\theta^k - \theta^{k-1}\right\|_2^2 \overset{(19)}{\leqslant} -\alpha\mu\left(f(\theta^k) - f(\theta^*)\right) + \left(\frac{L}{2} - \frac{1}{2\alpha}\right)\left\|\theta^{k+1} - \theta^k\right\|_2^2 + \frac{\beta\gamma}{\alpha}\left\|\theta^k - \theta^{k-1}\right\|_2^2, \tag{48}$$
which is equivalent to
$$f(\theta^{k+1}) - f(\theta^*) \leqslant (1 - \alpha\mu)\left(f(\theta^k) - f(\theta^*)\right) + \left(\frac{L}{2} - \frac{1}{2\alpha}\right)\left\|\theta^{k+1} - \theta^k\right\|_2^2 + \frac{\beta\gamma}{\alpha}\left\|\theta^k - \theta^{k-1}\right\|_2^2. \tag{49}$$






Figure 1: Comparison of AQUILA with other communication-efficient algorithms in IID and Non-IID settings with a homogeneous model structure. (a)-(c): training loss vs. total transmitted bits; (d)-(f): transmitted bits per epoch vs. global epoch.

Figure 2: Comparison of AQUILA with other communication-efficient algorithms in IID and Non-IID settings with a heterogeneous model structure. (a)-(c): training loss vs. total transmitted bits; (d)-(f): transmitted bits per epoch vs. global epoch.

Figure 3: Comparison of AQUILA with various selections of the tuning factor β in three datasets.

Figure 4: The schematic illustration of the communication-efficient FL with AQUILA in comparison with the naive combination of AdaQuantFL and LAQ. The blue lines indicating the transmission of quantized gradient innovation in AQUILA are drawn in different thicknesses to represent various sizes of quantized gradient innovation, considering the heterogeneous FL environment as in our evaluation part. For instance, the gradient innovation ∆q k 3 of a PC is larger than ∆q k 2 of a mobile phone.

Figure 5: Comparison on 100 devices for CIFAR-10 and CIFAR-100 and on 80 devices for WikiText-2 in the IID, homogeneous scenario.

Figure 10: Accuracy (Perplexity) comparison of AQUILA with various selections of the tuning factor β in three datasets.


Table 1: Hyperparameters for the CIFAR-10, CIFAR-100, and WikiText-2 datasets in the FL training process.

Table 2: Total number of communication bits in the homogeneous environment.

Table 3: Total number of communication bits in the heterogeneous environment.


Suppose $\frac{\beta\gamma}{\alpha} \leqslant (1-\alpha\mu)\left(\frac{1}{2\alpha} - \frac{L}{2}\right)$; we can then show that
$$f(\theta^{k+1}) - f(\theta^*) + \left(\frac{1}{2\alpha} - \frac{L}{2}\right)\left\|\theta^{k+1} - \theta^k\right\|_2^2 \leqslant (1-\alpha\mu)\left[f(\theta^k) - f(\theta^*) + \left(\frac{1}{2\alpha} - \frac{L}{2}\right)\left\|\theta^k - \theta^{k-1}\right\|_2^2\right].$$
Therefore, unrolling this contraction over $k = 1, 2, \cdots, K$, we have
$$f(\theta^{K+1}) - f(\theta^*) + \left(\frac{1}{2\alpha} - \frac{L}{2}\right)\left\|\theta^{K+1} - \theta^K\right\|_2^2 \leqslant (1-\alpha\mu)^K \omega_1,$$
which demonstrates that AQUILA requires $K = \mathcal{O}\left(-\frac{1}{\log(1-\alpha\mu)}\log\frac{\omega_1}{\epsilon}\right)$ communication rounds.

