FADE: ENABLING LARGE-SCALE FEDERATED ADVERSARIAL TRAINING ON RESOURCE-CONSTRAINED EDGE DEVICES

Abstract

Federated adversarial training can effectively bring adversarial robustness to privacy-preserving federated learning systems. However, the high demand for memory capacity and computing power makes large-scale federated adversarial training infeasible on resource-constrained edge devices. Few previous studies in federated adversarial training have tried to tackle both memory and computational constraints at the same time. In this paper, we propose a new framework named Federated Adversarial DEcoupled Learning (FADE) to enable adversarial training (AT) on resource-constrained edge devices. FADE decouples the entire model into small modules so that each module fits into the resource budget of an individual edge device, and each device only needs to perform AT on a single module in each communication round. We also propose an auxiliary weight decay to alleviate objective inconsistency and achieve a better accuracy-robustness balance in FADE. FADE offers theoretical guarantees for convergence and adversarial robustness, and our experimental results show that FADE can significantly reduce the consumption of memory and computing power while maintaining accuracy and robustness.

1. INTRODUCTION

As a privacy-preserving distributed learning paradigm, Federated Learning (FL) makes a meaningful step toward secure and trustworthy artificial intelligence (Konečnỳ et al., 2015; 2016; McMahan et al., 2017; Kairouz et al., 2019). In contrast to traditional centralized training, FL pushes the training to edge devices (clients), and client models are locally trained and uploaded to the server for aggregation. Since no private data is shared with other clients or the server, FL substantially improves data privacy during the training process. While FL can preserve the privacy of the participants, other threats can still impact the reliability of the machine learning model running on the FL system. One such threat is adversarial samples, which aim to cause misclassifications by adding imperceptible noise to the input data (Szegedy et al., 2013; Goodfellow et al., 2014). Previous research has shown that performing adversarial training (AT) on a large model is an effective method to attain robustness against adversarial samples while maintaining high accuracy on clean samples (Liu et al., 2020). However, large-scale AT also places high demands on both memory capacity and computing power, which is unaffordable for many edge devices with limited resources, such as mobile phones and IoT devices, in FL scenarios (Kairouz et al., 2019; Li et al., 2020; Wong et al., 2020; Zizzo et al., 2020; Hong et al., 2021). Table 1 shows that strong robustness of the whole FL system cannot be attained by allowing only a small portion (e.g., 20%) of the clients to perform AT. Therefore, enabling resource-constrained edge devices (which usually constitute the majority of the participants in cross-device FL (Kairouz et al., 2019)) to perform AT is necessary for achieving strong robustness in FL. Some previous works have tried to tackle client-wise systematic heterogeneity in FL (Li et al., 2018; Lu et al., 2020; Wang et al., 2020b; Xie et al., 2019).
The most common way to deal with slow devices is to allow them to perform fewer epochs of local training than the others (Li et al., 2018; Wang et al., 2020b). While this can reduce the computational costs on the slow devices, the memory capacity limitation of edge devices has not been well discussed in these works. To tackle both memory and computational constraints, recent studies propose a training scheme named Decoupled Greedy Learning (DGL), which decouples the entire neural network into several small modules and trains each module separately (Belilovsky et al., 2019; Wang et al., 2021). DGL can be naturally deployed in FL since the training of decoupled modules can be parallelized across different computing nodes (Belilovsky et al., 2020). However, vanilla DGL only supports a single, uniform model partition on all computing nodes, which cannot fit the different resource budgets of different clients in heterogeneous FL. Additionally, no previous studies have explored whether DGL can be combined with AT to confer joint adversarial robustness on the entire model. Achieving joint robustness of the entire model when applying AT in DGL is non-trivial, since the modules are trained separately with different locally supervised losses. In this paper, we propose Federated Adversarial DEcoupled Learning (FADE), the first adversarial decoupled learning scheme for heterogeneous FL. Our main contributions are: 1. We propose a more flexible decoupled learning scheme for heterogeneous Federated Learning, which allows different model partitions on devices with different resource budgets, and we give a theoretical guarantee for the convergence of our Federated Decoupled Learning. 2. We propose Federated Adversarial DEcoupled Learning (FADE) to attain theoretically guaranteed joint adversarial robustness of the entire model.
Our experimental results show that FADE can significantly reduce the memory and computational requirements while maintaining natural accuracy and adversarial robustness comparable to joint training. 3. We analyze the trade-off between objective consistency (natural accuracy) and adversarial robustness (adversarial accuracy) in FADE, and we propose an effective method to achieve a better accuracy-robustness balance point via weight decay on the auxiliary models.

2. PRELIMINARY

Federated Learning (FL) In FL, different clients collaboratively train a shared global model w with locally stored data (McMahan et al., 2017). The objective of FL can be formulated as:

$$\min_w L(w) = \frac{1}{\sum_i |D_i|} \sum_{k=1}^{N} \sum_{(x,y)\in D_k} l(x, y; w) = \sum_{k=1}^{N} q_k L_k(w),$$

where $L_k(w) = \frac{1}{|D_k|}\sum_{(x,y)\in D_k} l(x,y;w) = \mathbb{E}_{(x,y)\sim D_k}[l(x,y;w)]$, and l is the task loss, e.g., the cross-entropy loss for a classification task. D_k is the dataset of client k and its weight is q_k = |D_k| / (Σ_i |D_i|). To solve for the optimal solution of this objective, in each communication round, FL first samples a subset of clients S^(t) to perform local training. These clients initialize their models with the global model w^(t) and then run τ iterations of local SGD. After all these clients complete training in this round, their models are uploaded and averaged to become the new global model (McMahan et al., 2017). We summarize this procedure as follows:

$$w_k^{(t,0)} = w^{(t)}, \qquad w_k^{(t+1)} = w^{(t)} - \eta_t \sum_{j=0}^{\tau-1} \nabla L_k\big(w_k^{(t,j)}\big), \qquad w^{(t+1)} = \frac{1}{\sum_{i\in S^{(t)}} q_i} \sum_{k\in S^{(t)}} q_k\, w_k^{(t+1)},$$

where $w_k^{(t,j)}$ is the local model of client k at the j-th iteration of round t.
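The local-update and aggregation rule above can be sketched in a few lines. The following is a minimal NumPy illustration (the function names and the toy least-squares setup in the usage are our own, not from the paper): every sampled client runs τ local SGD steps from the current global model, and the server forms the q_k-weighted average of the returned models.

```python
import numpy as np

def local_sgd(w_global, grad_fn, data, eta, tau):
    """Run tau iterations of local SGD starting from the global model."""
    w = w_global.copy()
    for _ in range(tau):
        w = w - eta * grad_fn(w, data)
    return w

def fedavg_round(w_global, clients, grad_fn, eta, tau):
    """One FedAvg communication round: every client in this round trains
    locally, then the server takes the data-size-weighted average of the
    returned models (the aggregation rule above, with all clients sampled)."""
    updates = [local_sgd(w_global, grad_fn, data, eta, tau) for data in clients]
    sizes = np.array([len(data) for data in clients], dtype=float)
    q = sizes / sizes.sum()                # q_k = |D_k| / sum_i |D_i|
    return sum(qk * wk for qk, wk in zip(q, updates))
```

On a toy shared least-squares objective, repeated rounds drive the global model toward the common minimizer even though each client only sees its own data.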

Adversarial Training (AT)

The goal of AT is to achieve robustness against small perturbations of the inputs. We define (ϵ, c)-robustness as follows:

Definition 1. We say a model w is (ϵ, c)-robust in a loss function l at input x if ∀δ ∈ {δ : ∥δ∥_p ≤ ϵ}, l(x + δ, y; w) − l(x, y; w) ≤ c, where ∥·∥_p is the ℓ_p norm of a vector and ϵ is the perturbation tolerance.

AT trains a model with adversarial samples to achieve adversarial robustness, which can be formulated as a min-max problem (Goodfellow et al., 2014; Madry et al., 2017):

$$\min_w \max_{\delta: \|\delta\|_p \le \epsilon} l(x+\delta, y; w). \qquad (6)$$

To solve Eq. 6, one usually alternately solves the inner maximization and the outer minimization. For the inner maximization, Projected Gradient Descent (PGD) is shown to introduce the strongest robustness in AT (Madry et al., 2017; Wong et al., 2020; Wang et al., 2021).
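The PGD inner maximization can be sketched as follows — a minimal NumPy illustration with an ℓ∞ constraint (the `loss_grad` callback and the step size `alpha` are our own naming, not the paper's interface): ascend the loss along the sign of the input gradient, then project the perturbation back onto the ϵ-ball.

```python
import numpy as np

def pgd_attack(x, loss_grad, eps, alpha, steps):
    """Inner maximization of the min-max problem via Projected Gradient
    Descent: step in the sign of the gradient of the loss w.r.t. the
    input, then project the perturbation onto the l_inf ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = loss_grad(x + delta)              # gradient of the loss w.r.t. the input
        delta = delta + alpha * np.sign(g)    # ascent step on the loss
        delta = np.clip(delta, -eps, eps)     # projection: ||delta||_inf <= eps
    return x + delta
```

For a loss that grows away from the origin, such as l(v) = ∥v∥², each coordinate is pushed to the boundary of the ϵ-ball, as expected of the worst-case perturbation.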

Decoupled Greedy Learning (DGL)

The key idea of DGL is to decouple the entire model into several non-overlapping small modules. By introducing a locally supervised loss to each module, we can load and train each module independently without accessing the other parts of the entire model (Belilovsky et al., 2019; 2020). This enables devices with small memory to train large models. As shown in Fig. 1, each module m usually contains one or multiple adjacent layers w_m of the backbone neural network, together with a small auxiliary model θ_m that provides the locally supervised loss. We denote Θ_m = (w_m, θ_m) to be all the parameters in module m. Module m accepts the features z_{m−1} from the previous module as input, and it outputs features z_m = f_m(z_{m−1}; w_m) for the following modules, as well as a locally supervised loss l_m(z_{m−1}, y; Θ_m). At round t, the averaged locally supervised loss L_m^(t) will be used for training this module:

$$L_m^{(t)}\big(\Theta_m^{(t)}\big) = \mathbb{E}_{(z_{m-1}^{(t)},\, y)}\Big[ l_m\big(z_{m-1}^{(t)}, y; \Theta_m^{(t)}\big) \Big]. \qquad (7)$$

The joint loss of the entire backbone can equivalently be written as a function of any intermediate feature, i.e., $l(z_m, y; w_{m+1}, \cdots, w_M) = l(x, y; w_1, \cdots, w_M)$. Unless otherwise specified, we will omit all parameters (w_m, θ_m and Θ_m) in the following sections for notational simplicity.
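To make the decoupling concrete, here is a toy NumPy sketch of one module: a single tanh layer plays the role of the backbone piece w_m, and a linear auxiliary head θ_m supplies a local squared-error loss. The class, its dimensions, and the scalar target are illustrative assumptions, not the paper's architecture; the point is that each module updates only its own parameters and hands a detached feature to the next module.

```python
import numpy as np

class DecoupledModule:
    """One decoupled module: a backbone layer w_m plus a small linear
    auxiliary head theta_m that provides the locally supervised loss l_m."""

    def __init__(self, d_in, d_out, rng):
        self.w = rng.normal(scale=0.5, size=(d_out, d_in))   # backbone layer w_m
        self.theta = rng.normal(scale=0.5, size=d_out)       # auxiliary head theta_m

    def forward(self, z):
        return np.tanh(self.w @ z)                           # z_m = f_m(z_{m-1}; w_m)

    def local_step(self, z_in, y, lr):
        """One gradient step on the local loss l_m = (theta_m . f_m(z_in) - y)^2,
        touching no other module's parameters."""
        h = self.forward(z_in)
        err = float(self.theta @ h - y)
        g_theta = 2.0 * err * h                          # d l_m / d theta_m
        g_h = 2.0 * err * self.theta                     # d l_m / d h
        g_w = np.outer(g_h * (1.0 - h ** 2), z_in)       # chain rule through tanh
        self.theta -= lr * g_theta
        self.w -= lr * g_w
        return self.forward(z_in)   # feature handed (detached) to the next module
```

Chaining two such modules and training each greedily on its own loss drives both local losses down without any end-to-end backpropagation.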

3. FEDERATED ADVERSARIAL DECOUPLED LEARNING

In this section, we present our method, Federated Adversarial Decoupled Learning (FADE), which aims at enabling all clients with different computing resources to participate in adversarial training. We first introduce Federated Decoupled Learning (FDL) with flexible model partitions for heterogeneous FL in Section 3.1, and we also give a convergence analysis for it. In Section 3.2, we integrate AT into FDL to achieve joint adversarial robustness of the entire model, and we give a theoretical guarantee for its robustness. In Section 3.3, we discuss the objective inconsistency in FDL and propose an effective method to achieve a better accuracy-robustness balance point.

3.1. FEDERATED DECOUPLED LEARNING

In cross-device FL, the main participants are usually small edge devices that have limited hardware resources and may not be able to afford large-scale AT, which requires large memory and high computing power (Li et al., 2018; Kairouz et al., 2019; Li et al., 2020; Wang et al., 2020b).

Figure 2: In contrast to the original unique partition (Belilovsky et al., 2020), we allow different model partitions among devices according to their resource budgets. In each communication round, each device randomly selects one module (highlighted) for training, and then the updates of each layer are averaged respectively.

A solution to tackle the resource constraints on edge devices is to deploy DGL in FL, where each device only needs to load and train a single module instead of the entire model in each communication round. However, vanilla DGL only supports a unique model partition on all the devices (Belilovsky et al., 2020). Considering the systematic heterogeneity, we would prefer various model partitions to fit the different resource budgets of different clients. A device with limited resources (such as an IoT device) can train small modules of the entire model, while a device with more resources (such as a mobile phone or a computer) can train larger modules or even the entire model. Accordingly, we propose our Federated Decoupled Learning (FDL) framework, as shown in Fig. 2. We denote the set of all modules on client k as M_k, where M_i ≠ M_j if client i uses a different model partition from client j. Here, we consider the update and aggregation rule for each layer n with parameters ω_n in the model, since a single layer is the "atom" in FDL and cannot be further decoupled. We use m_k(n) to denote the module on client k that contains this layer, and we define L_{n,k} = L_{m_k(n),k} as the locally supervised loss for training this layer. In each communication round t, each client k randomly samples a module m_k^t from M_k for training (Eq. 8).
After the local training, the update of each layer n is averaged over the clients in S_n^(t), where S_n^(t) is the set of clients whose trained module m_k^t contains layer n in this round (Eq. 9):

$$\omega_{n,k}^{(t+1)} = \begin{cases} \omega_n^{(t)} - \eta_t \sum_{j=0}^{\tau-1} \nabla_{\omega_n} L_{n,k}^{(t)}, & \text{if } n \in m_k^t; \\[2pt] \omega_n^{(t)}, & \text{otherwise,} \end{cases} \qquad (8)$$

$$\omega_n^{(t+1)} = \frac{1}{\sum_{i\in S_n^{(t)}} q_i} \sum_{k\in S_n^{(t)}} q_k\, \omega_{n,k}^{(t+1)}, \quad \text{where } S_n^{(t)} = \{k \in S^{(t)} : n \in m_k^t\}. \qquad (9)$$

Theorem 1 guarantees the convergence of FDL; the full version with proof is in Appendix A.

Theorem 1. Under some common assumptions, for any layer n in the entire model, its locally supervised loss L_n = Σ_k q_k L_{n,k} converges in Federated Decoupled Learning:

$$\lim_{T\to\infty} \inf_{t\le T} \mathbb{E}\,\|\nabla_{\omega_n} L_n\|^2 = 0. \qquad (10)$$

Theorem 1 guarantees the convergence of the locally supervised loss L_n. However, because of the objective inconsistency ∥∇L − ∇L_n∥ ≥ 0, we cannot guarantee the convergence of the joint loss L with this result. We discuss the objective inconsistency in Section 3.3, and we show how we can reduce this gap so that the joint loss gradient ∇L becomes small when the locally supervised loss L_n converges.
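The layer-wise aggregation of Eq. 9 can be sketched directly: each layer is averaged only over the clients whose sampled module contained it this round, with the weights renormalized over that subset, and layers trained by nobody keep their global value. This is a minimal pure-Python sketch; the dict-of-layers representation and layer names are our own simplification.

```python
def fdl_aggregate(global_layers, client_updates, client_weights):
    """Layer-wise aggregation of Eq. 9.

    global_layers:  current global value per layer name.
    client_updates: one dict per client mapping layer name -> new value,
                    containing only the layers inside the module that
                    client actually trained this round.
    client_weights: the q_k of each client."""
    aggregated = {}
    for name, value in global_layers.items():
        # S_n for this layer: clients whose trained module contains it
        members = [(q, upd[name])
                   for q, upd in zip(client_weights, client_updates)
                   if name in upd]
        if members:   # renormalize the q_k over S_n only
            total = sum(q for q, _ in members)
            aggregated[name] = sum(q * v for q, v in members) / total
        else:         # nobody trained this layer: keep the global value
            aggregated[name] = value
    return aggregated
```

With equal weights, a layer trained by two clients becomes their plain average, a layer trained by one client is overwritten by that client, and an untouched layer is left alone.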

3.2. ADVERSARIAL DECOUPLED LEARNING

Adversarial decoupled learning can achieve local robustness of each module by performing AT in each module m separately on its own locally supervised loss:

$$\min_{\Theta_m} \max_{\delta_{m-1}} l_m(z_{m-1} + \delta_{m-1}, y; \Theta_m), \quad \text{subject to } \|\delta_{m-1}\| \le \epsilon_{m-1}.$$

However, two concerns have not been addressed in adversarial decoupled learning: 1. Since different modules are trained with different locally supervised losses, can local robustness of each module guarantee the joint robustness of the entire (backbone) model? 2. When applying AT on a module m, what value of the perturbation tolerance ϵ_{m−1} should we use to ensure the joint robustness of the entire model? Theorem 2 reveals the relationship between the local robustness of each module and the joint robustness of the entire model, and it gives a lower bound on the perturbation tolerance ϵ_{m−1} for each module m that is sufficient to guarantee the joint robustness. Theorem 2 is proved in Appendix B.1.

Theorem 2. Assume that l̃_m(z_m, y) is µ_m-strongly convex in z_m for each module m. If each module m ≤ M has local (ϵ_{m−1}, c_m)-robustness in l_m(z_{m−1}, y), and

$$\forall m \le M, \quad \epsilon_m \ge \frac{g_m}{\mu_m} + \sqrt{\frac{2c_m}{\mu_m} + \frac{g_m^2}{\mu_m^2}},$$

where g_m = ∥∇_{z_m} l̃_m(z_m, y)∥, then we can guarantee that the entire model has joint (ϵ_0, c_M)-robustness in l(x, y).

Remark 1. In Theorem 2, we assume that the loss function of the auxiliary model l̃_m(z_m, y) is strongly convex in its input z_m. This assumption is realistic since the auxiliary model is usually a very simple model, e.g., only a linear layer followed by a cross-entropy loss. We also theoretically analyze the sufficiency of a simple auxiliary model in Section 3.3 (see Remark 2). Theorem 2 shows that a larger µ_m and a smaller g_m lead to stronger joint robustness of the entire model, since the lower bound on ϵ_m for ensuring joint robustness becomes smaller.
In Section 3.3, we further discuss how we can control these two parameters to attain a better accuracy-robustness balance with the weight decay on the auxiliary model.
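The per-module tolerance bound is easy to evaluate numerically, which also makes the monotonicity claims above checkable. The helper below assumes the bound ϵ_m ≥ g_m/µ_m + √(2c_m/µ_m + g_m²/µ_m²) as stated in Theorem 2 (the square root is our reading of the source); the function name is our own.

```python
import math

def eps_lower_bound(g_m, c_m, mu_m):
    """Perturbation tolerance that suffices for joint robustness (Theorem 2):
        eps_m >= g_m/mu_m + sqrt(2*c_m/mu_m + g_m**2/mu_m**2)."""
    return g_m / mu_m + math.sqrt(2.0 * c_m / mu_m + (g_m / mu_m) ** 2)
```

As the text argues, increasing the strong-convexity constant µ_m or decreasing the gradient norm g_m shrinks the required tolerance, i.e., makes the joint robustness easier to guarantee.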

3.3. OBJECTIVE INCONSISTENCY AND ACCURACY-ROBUSTNESS TRADE-OFF

As we mentioned in Section 3.1, there exists objective inconsistency between a module and the entire model because the module is trained with the locally supervised loss l_m instead of the joint loss l (Wang et al., 2021). The objective inconsistency in FDL is defined as the difference between the gradients of the locally supervised loss (∇_{w_m} l_m) and the joint loss (∇_{w_m} l). Because of this inconsistency, the optimal parameters that minimize the locally supervised loss l_m do not necessarily minimize the joint loss l. Furthermore, the objective inconsistency can enlarge the heterogeneity among clients and hinder the convergence of FL (Li et al., 2019; Wang et al., 2020b); thus it is important to alleviate the objective inconsistency in FDL to improve its performance. Theorem 3 shows a non-trivial relationship between adversarial robustness and objective inconsistency: strong joint adversarial robustness also implies small objective inconsistency in FDL. We prove Theorem 3 in Appendix B.2.

Theorem 3.

$$\|\nabla_{w_m} l - \nabla_{w_m} l_m\| \le \left\|\frac{\partial z_m}{\partial w_m}\right\| \sqrt{2(c_m + c'_m)(\beta_m + \beta'_m)}.$$

Theorem 3 suggests that we can alleviate the objective inconsistency by reducing β_m, β′_m, c_m and c′_m (regularizing ∥∂z_m/∂w_m∥ usually requires the second derivative, which introduces high memory and computational overhead, so we do not consider it here). Notice that c′_m is small given the joint robustness of the backbone network, which is guaranteed by adversarial decoupled learning in Theorem 2. Furthermore, Moosavi-Dezfooli et al. (2019) show that adversarial robustness also implies a smoother loss function. Therefore, the joint robustness also leads to a small β′_m. Accordingly, with adversarial decoupled learning, we only need to reduce β_m and c_m to alleviate the objective inconsistency. We notice that both β_m and c_m are only related to the auxiliary model, and we show in Appendix B.3 that we can reduce them by adding a large weight decay on the auxiliary model θ_m when the auxiliary model is simple (e.g., only a single linear layer).

Remark 2. It is noteworthy that we do not use any conditions on the difference between l̃′_m and l̃_m in either Theorem 2 or 3. This implies that the auxiliary model is not required to perform as well as the joint backbone model. Thus, a simple auxiliary model is sufficient to achieve high joint robustness and low objective inconsistency in adversarial decoupled learning.

Based on all the analysis above, we propose Federated Adversarial Decoupled Learning (FADE), where we replace the original loss function l_m in Eq. 7 with the following adversarial loss with weight decay:

$$l_m^{\mathrm{FADE}}\big(z_{m-1}^{(t)}, y; w_m^{(t)}, \theta_m^{(t)}\big) = \max_{\delta_{m-1}^{(t)}} l_m\big(z_{m-1}^{(t)} + \delta_{m-1}^{(t)}, y; w_m^{(t)}, \theta_m^{(t)}\big) + \lambda_m \big\|\theta_m^{(t)}\big\|^2,$$

where λ_m is the hyperparameter that controls the weight decay on the auxiliary model θ_m. Our framework is summarized in Algorithm 1.

Algorithm 1 FADE: Federated Adversarial Decoupled Learning
1: Initialize w^(0) and θ_m^(0) for each module m.
2: for t = 1, 2, · · · , T do
3:   Randomly sample a group of clients S^(t) for training.
4:   for each client k ∈ S^(t) in parallel do
5:     Randomly select a module m_k^t that will be trained in this round.
6:     Request the current global model w^(t).
⋮
The server aggregates ω_k^(t+1) to get ω^(t+1) according to Eq. 9 for each ω.
12: end for

Trade-off Between Joint Accuracy and Joint Robustness. As we discussed in Section 3.2 and this section, four parameters (µ_m, g_m, c_m and β_m), all of which relate only to the auxiliary model θ_m, influence the joint robustness and objective consistency.
We show in Appendix B.3 that applying a larger λ_m decreases all of them. Smaller c_m and β_m alleviate the objective inconsistency and thus increase the joint accuracy, and a smaller g_m improves the joint robustness. However, a smaller µ_m leads to weaker robustness by increasing the lower bound on ϵ_m. Therefore, there exists an accuracy-robustness trade-off when we apply the weight decay, and the value of λ_m plays an important role in balancing the joint accuracy and the joint robustness.
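The FADE objective — the adversarial module loss plus a weight decay applied only to the auxiliary head — can be sketched as below. This is a minimal NumPy illustration; `module_loss` and `attack` are placeholder callables standing in for the locally supervised loss and the inner maximization (e.g., PGD), not the paper's implementation.

```python
import numpy as np

def fade_local_loss(z, y, w_m, theta_m, module_loss, attack, lam):
    """FADE's locally supervised objective: worst-case module loss on the
    perturbed input plus weight decay on the auxiliary head theta_m ONLY
    (the backbone weights w_m are not decayed)."""
    z_adv = attack(z, y, w_m, theta_m)        # inner maximization over delta
    adv = module_loss(z_adv, y, w_m, theta_m)
    return adv + lam * float(np.sum(theta_m ** 2))
```

With the attack disabled, the objective reduces to the clean module loss plus λ_m∥θ_m∥², which is exactly the knob the accuracy-robustness analysis above turns.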

4. EXPERIMENTS

4.1. EXPERIMENT SETTINGS

We conduct our experiments on two datasets, FMNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky et al., 2009). To simulate the statistical heterogeneity in FL, we partition the whole dataset into N = 100 clients with the same non-IID data partition as Shah et al. (2021), where 80% of each client's data comes from only two classes while the other 20% comes from the remaining eight classes. We sample C = 30 clients for local training in each communication round. We conduct two groups of experiments. For FMNIST, we use a 7-layer CNN (CNN-7) with five convolutional layers and two fully connected layers, and we adopt a model partition with 2 modules. For CIFAR-10, we use VGG-11 (Simonyan & Zisserman, 2014) as the model, and we adopt two different model partitions, with 2 modules and 3 modules respectively. See Appendix C for more details. In the following sections, we compare FADE with three baselines. Full FedDynAT (Shah et al., 2021) represents the ideal performance of federated adversarial training when all the clients are able to perform AT on the entire model. Since FedDynAT with 100% AT is not feasible under our limitation that only a small portion of clients can afford AT on the entire model, we also adopt partial FedDynAT, where clients with insufficient resources only perform standard training (ST). Another baseline, FedRBN (Hong et al., 2021), also allows resource-constrained devices to perform ST only, and robustness is propagated by transferring the batch-normalization statistics from the clients who can afford AT to the clients who only perform ST.

4.2. RESOURCE REQUIREMENTS

We measure the minimum resource requirements of FADE and all baselines on resource-constrained devices. We use the number of loaded parameters as the metric for memory, and FLOPs as the metric for computation. For partial FedDynAT and FedRBN, the minimum memory requirement is the number of parameters in the entire model, since they always load the entire model for training, and the computing power requirement is the FLOPs of ST on the entire model, since the resource-constrained devices only perform ST. For FADE, the minimum memory requirement is the number of parameters in the largest module, while the computing power requirement is the mean FLOPs of PGD-10 AT across all modules. The results are shown in Fig. 3. We can see that FADE reduces the memory requirement by more than 40% on both CNN-7 and VGG-11, while FADE with 2 modules and 3 modules reduces the computation by 50% and 67% respectively. Although partial FedDynAT and FedRBN can largely reduce the amount of computation, they are far less efficient than they appear when training a large model that exceeds the memory limit, since they need to repeatedly fetch and load small parts of the entire model from the cloud or external storage during each forward and backward propagation. We will also see in the following experiments that neither of them can maintain adversarial robustness, while FADE still achieves the same level of robustness as full FedDynAT.
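The memory metric of this section (parameters of the largest module versus the whole model) reduces to simple arithmetic. In the usage below, the per-layer parameter counts are hypothetical numbers chosen only to illustrate the computation, not the actual CNN-7 or VGG-11 counts, and the sketch ignores auxiliary-model parameters.

```python
def memory_requirements(layer_params, partition):
    """Full-model training must load every parameter; FADE only needs to
    hold the largest module of the partition.

    layer_params: per-layer parameter counts of the backbone.
    partition: (start, end) layer-index ranges, one per module."""
    full = sum(layer_params)
    largest = max(sum(layer_params[s:e]) for s, e in partition)
    return full, largest, 1.0 - largest / full
```

For example, a four-layer backbone with counts [10, 20, 30, 40] split into two modules of two layers each needs only 70 of its 100 parameters in memory at once, a 30% saving; more balanced partitions save more.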

4.3. PERFORMANCE OF FADE

We first compare our method with the three baselines under the limitation that only 20% of the clients can afford AT on the entire model, while the other 80% can only afford standard training (ST) on the entire model or AT on a single module. The natural and adversarial accuracy on FMNIST and CIFAR-10 is shown in Tables 2 and 3 respectively. While neither partial FedDynAT nor FedRBN can maintain robustness under the resource constraint, FADE consistently outperforms the other baselines and achieves almost the same or even higher accuracy and robustness compared to full FedDynAT (the constraint-free case). In addition, we mix clients with one module (joint training), clients with two modules and clients with three modules in a ratio of 2:3:5 as the setting "FADE (Mixing)". FADE still attains high accuracy and robustness in this case, which verifies the compatibility of our flexible FDL framework. We also conducted experiments with different proportions of resource-sufficient clients who perform joint AT on the entire model; the adversarial accuracy is shown in Fig. 4. Even in the worst case, where none of the clients have enough resources to complete AT on the entire model, FADE achieves robustness comparable to full FedDynAT. With only 40% resource-sufficient clients, FADE already attains the same robustness as full FedDynAT in all our experiments, while the other baselines still show significant robustness gaps relative to full FedDynAT.

4.4. THE INFLUENCE OF WEIGHT DECAY ON THE AUXILIARY MODEL

As suggested in Section 3.3, the auxiliary-model weight decay hyperparameter λ_m plays an important role in balancing natural accuracy and adversarial robustness. To show its influence, we conduct experiments with FADE (2 Modules) for different λ_m between 0.0001 and 0.1, and we plot the natural and adversarial accuracy in Fig. 5. We observe that in all our settings the natural accuracy first increases as we increase λ_m, and then drops quickly. The growing part matches our theory in Section 3.3 that a larger auxiliary weight decay can alleviate the objective inconsistency and improve performance. However, when the weight decay is too large, it drives the model away from the optimum and leads to a performance drop, which is also commonly observed in joint training. For the adversarial accuracy, the effect of λ_m is more complicated, since a larger λ_m decreases both g_m and µ_m, which affect the robustness in opposite ways. Increasing adversarial accuracy suggests that the effect of g_m dominates, while decreasing adversarial accuracy suggests that the effect of µ_m dominates. Similar to the natural accuracy, we observe that the adversarial accuracy usually grows before going down, which implies that the effect of g_m is usually stronger when λ_m is small. Considering the simultaneous increase in natural accuracy, adopting a moderately large λ_m usually attains better overall performance on clean and adversarial samples.

5. RELATED WORKS

Federated Learning Client-wise heterogeneity is one of the challenges that hinders the practice of Federated Learning (FL). Many studies have tried to overcome the statistical heterogeneity in data (Karimireddy et al., 2019; Liang et al., 2019; Tang et al., 2022; Wang et al., 2020a) and the systematic heterogeneity in hardware (Li et al., 2021a; 2018; Wang et al., 2020b). Beyond heterogeneity, FL is also vulnerable to several kinds of attacks, such as model poisoning attacks (Bhagoji et al., 2019; Sun et al., 2021) and adversarial sample attacks (Zizzo et al., 2020; Shah et al., 2021). In this paper, we mainly focus on adversarial sample attacks and address the challenge of federated adversarial training under client-wise heterogeneity (Hong et al., 2021).

Adversarial Training AT is well known for its high demand for computing resources (Wong et al., 2020). Several fast AT algorithms have been proposed to reduce the computation in AT (Shafahi et al., 2019; Zhang et al., 2019), such as replacing PGD with FGSM (Andriushchenko & Flammarion, 2020; Wong et al., 2020) or using other regularization methods for robustness (Moosavi-Dezfooli et al., 2019; Qin et al., 2019). FADE can easily be combined with these fast AT algorithms to further reduce the computing cost, which we leave as future work. In addition, AT decreases the model performance on clean samples, and thus a larger model is usually required to maintain the same natural accuracy (Liu et al., 2020). This makes AT memory-demanding as well.

Decoupled Greedy Learning As deeper and deeper neural networks are used for better performance, the low efficiency of end-to-end (joint) training becomes apparent: it hinders model parallelization and requires large memory for model parameters and intermediate results (Belilovsky et al., 2020; Hettinger et al., 2017).
As an alternative, Decoupled Greedy Learning (DGL) has been proposed, which decouples the whole neural network into several modules and trains them separately without gradient dependency (Belilovsky et al., 2019; 2020; Marquez et al., 2018; Wang et al., 2021). As a more flexible DGL framework, FADE fits better in heterogeneous FL while offering guarantees on both convergence and joint adversarial robustness.

A CONVERGENCE ANALYSIS OF FEDERATED DECOUPLED LEARNING

A.1 PRELIMINARY

In this section, we analyze the convergence property of Federated Decoupled Learning (FDL). Since FDL partitions the entire model with layers as the smallest unit, we only need to prove the convergence of each layer. We use ω_n to denote all the parameters in layer n, and m_k(n) to denote the module that contains layer n on client k. We define the parameters other than ω_n in module m_k(n) as Ω_{n,k}. We also denote the input feature of layer n on client k as z_{n−1,k} = z_{m_k(n)−1,k}. We define the locally supervised loss of layer n on client k as:

$$l_{n,k}^{(t,j)}\big(z_{n-1}^{(t)}, y; \omega_n^{(t,j)}\big) = l_{m_k(n)}\big(z_{m_k(n)-1}^{(t)}, y; \omega_n^{(t,j)}, \Omega_{n,k}^{(t,j)}\big),$$

where $l_{n,k}^{(t,j)}$ changes every iteration because of the update of $\Omega_{n,k}^{(t,j)}$. For simplicity, from now on we abbreviate (z, y) as z. We let $z_{n-1,k}^{(t)}$ follow the distribution with probability density $p_{n-1,k}^{(t)}(z)$ at the j-th iteration of communication round t, and we define its converged density as $p_{n-1,k}^*(z)$ under converged previous layers and $\Omega_{n,k}^*$ (Belilovsky et al., 2020). With these notations, we define

$$L_{n,k}^{(t)}\big(\omega_n^{(t)}\big) = \mathbb{E}_{z_{n-1,k}^{(t)} \sim p_{n-1,k}^{(t)}}\left[\frac{1}{\tau}\sum_{j=0}^{\tau-1} l_{n,k}^{(t,j)}\big(z_{n-1,k}^{(t)}; \omega_n^{(t)}\big)\right]; \qquad L_n^{(t)}\big(\omega_n^{(t)}\big) = \sum_{k=1}^{N} q_k L_{n,k}^{(t)}\big(\omega_n^{(t)}\big); \qquad (17)$$

$$L_{n,k}\big(\omega_n^{(t)}\big) = \mathbb{E}_{z_{n-1,k} \sim p_{n-1,k}^*}\left[ l_{n,k}^*\big(z_{n-1,k}; \omega_n^{(t)}\big) \right] = \mathbb{E}_{z_{m_k(n)-1,k} \sim p_{m_k(n)-1,k}^*}\left[ l_{m_k(n)}\big(z_{m_k(n)-1,k}; \omega_n^{(t)}, \Omega_{n,k}^*\big) \right]; \qquad (18)$$

$$L_n\big(\omega_n^{(t)}\big) = \sum_{k=1}^{N} q_k L_{n,k}\big(\omega_n^{(t)}\big). \qquad (19)$$

Following Belilovsky et al. (2020), we use the distance between the current density and the converged density below for our analysis:

$$\rho_{n-1}^{(t)} \triangleq \sum_{k=1}^{N} q_k \int \left| p_{n-1,k}^{(t)}(z) - p_{n-1,k}^*(z) \right| dz, \qquad (20)$$

and we also define the following gap between $l_{n,k}^{(t)}$ and $l_{n,k}^*$:

$$\xi_n^{(t)} \triangleq \sum_{k=1}^{N} \sum_{j=0}^{\tau-1} \frac{q_k}{\tau}\, \mathbb{E}_{z_{n-1,k} \sim p_{n-1}^*}\left\| \nabla l_{n,k}^{(t,j)}\big(z_{n-1,k}; \omega_n^{(t)}\big) - \nabla l_{n,k}^*\big(z_{n-1,k}; \omega_n^{(t)}\big) \right\|^2. \qquad (21)$$

We will discuss the convergence of L_n(ω_n) for each layer n.
Unless otherwise specified, all the gradients (∇L or ∇l) in the following analysis are with respect to ω_n. Following Belilovsky et al. (2020) and Wang et al. (2020b), we make the common assumptions below.

Assumption 1 (L-smoothness (Belilovsky et al., 2020; Wang et al., 2020b)). L_n is differentiable with respect to $\omega_n^{(t)}$ and its gradient is $L_n$-Lipschitz for all t. Similarly, L_{n,k} is differentiable with respect to $\omega_{n,k}^{(t)}$ and its gradient is $L_n$-Lipschitz for all t and k.

Assumption 2 (Robbins-Monro conditions (Belilovsky et al., 2020)). The learning rates satisfy $\sum_{t=0}^{\infty} \eta_t = \infty$ yet $\sum_{t=0}^{\infty} \eta_t^2 < \infty$.

Assumption 3 (Finite variance (Belilovsky et al., 2020; Wang et al., 2020b)). There exists some positive constant G such that ∀t, j and ∀k, $\mathbb{E}_{z_{n-1,k}^{(t,j)} \sim p_{n-1,k}^{(t,j)}} \big\|\nabla l_{n,k}^{(t,j)}\big(z_{n-1,k}^{(t,j)}; \omega_n\big)\big\|^2 \le G$ and $\mathbb{E}_{z_{n-1,k} \sim p_{n-1}^*} \big\|\nabla l_{n,k}^*\big(z_{n-1,k}; \omega_n\big)\big\|^2 \le G$ at any ω_n.

Assumption 4 (Bounded dissimilarity (Wang et al., 2020b)). There exist constants β² ≥ 1, κ² ≥ 0 such that ∀t and ω_n, $\sum_{k=1}^{N} q_k \big\|\nabla L_{n,k}^{(t)}(\omega_n)\big\|^2 \le \beta^2 \big\|\sum_{k=1}^{N} q_k \nabla L_{n,k}^{(t)}(\omega_n)\big\|^2 + \kappa^2$.

Assumption 5 (Convergence of the previous modules and Ω_n (Belilovsky et al., 2020)). We assume that $\sum_{t=0}^{\infty} \rho_{n-1}^{(t)} < \infty$ and $\sum_{t=0}^{\infty} \xi_n^{(t)} < \infty$.

A.2 PROOF OF THEOREM 1

With all above assumptions, we get the following theorem that guarantees the convergence of Federated Decoupled Learning. Theorem 1. Under Assumption 1-5, Federated Decoupled Learning converges as follows: inf t≤T E ∇L n ω (t) n 2 ≤O 1 T t=0 η t + O T t=0 ρ (t) n η t T t=0 η t + O T t=0 ξ (t) n η t T t=0 η t + O T t=0 η 2 t T t=0 η t . ( ) Proof. We consider the SGD scheme in Eq. 9 with learning rate {η t } t : ω (t+1) n = ω (t) n -η t k∈S (t) n q k h (t) n,k k∈S (t) n q k , where S (t) n = S (t) ωn which is defined in Eq. 9. And h (t) n,k is defined as h (t) n,k = τ -1 j=0 ∇l According to the Lipschitz-smooth assumption for the global objective function L n , it follows that E L n ω (t+1) n -L n ω (t) n ≤ -η t E   ∇L n ω (t) n , k∈S (t) n q k h (t) n,k k∈S (t) n q k   T1 + η 2 t L n 2 E    k∈S (t) n q k h (t) n,k k∈S (t) n q k 2    T2 . Similar to the proof in (Wang et al., 2020b) , to bound the T 1 in Inequality 24, we should notice that T 1 =E   ∇L n ω (t) n , k∈S (t) n q k h (t) n,k k∈S (t) n q k   = ∇L n ω (t) n , N k=1 q k Eh (t) n,k = 1 2 ∇L n ω (t) n 2 + 1 2 N k=1 q k Eh (t) n,k 2 - 1 2 ∇L n ω (t) n - N k=1 q k Eh (t) n,k 2 (25) ≥ 1 2 ∇L n ω (t) n 2 - 1 2 ∇L n ω (t) n - N k=1 q k Eh (t) n,k 2 ≥ 1 2 ∇L n ω (t) n 2 -∇L (t) n ω (t) n - N k=1 q k Eh (t) n,k 2 -∇L n ω (t) n -∇L (t) n (ω (t) n ) 2 (26) ≥ 1 2 ∇L n ω (t) n 2 - N k=1 q k Eh (t) n,k -∇L (t) n,k (ω (t) n ) 2 -∇L n ω (t) n -∇L (t) n (ω (t) n ) 2 . Eq. 25 uses the fact: 2⟨a, b⟩ = ∥a∥ 2 + ∥b∥ 2 -∥a -b∥ 2 , and Inequality 26 uses the fact: ∥a + b∥ 2 ≤ 2∥a∥ 2 + 2∥b∥ 2 . Inequality 27 uses L (t) n = N k=1 q k L (t) n,k and Jenson's inequality ∥ m i=1 b i a i ∥ 2 ≤ m i=1 b i ∥a i ∥ 2 . Based on the proof of Lemma 3.2 in Belilovsky et al. 
(2020), we have
$$\begin{aligned}
\big\|\nabla L_n(\omega_n^{(t)}) - \nabla L_n^{(t)}(\omega_n^{(t)})\big\|^2 &= \Big\|\sum_{k} q_k\,\mathbb{E}_{z\sim p^{*}_{n-1,k}}\nabla l^{*}_{n,k}(z;\omega_n^{(t)}) - \sum_{k} q_k\,\mathbb{E}_{z\sim p^{(t)}_{n-1,k}}\frac1\tau\sum_{j}\nabla l^{(t,j)}_{n,k}(z;\omega_n^{(t)})\Big\|^2 \\
&= \Big\|\frac1\tau\sum_{k} q_k\sum_{j}\Big(\mathbb{E}_{z\sim p^{*}_{n-1,k}}\nabla l^{*}_{n,k}(z;\omega_n^{(t)}) - \mathbb{E}_{z\sim p^{(t)}_{n-1,k}}\nabla l^{(t,j)}_{n,k}(z;\omega_n^{(t)})\Big)\Big\|^2 \\
&\le \frac1\tau\sum_{k} q_k\sum_{j}\Big\|\mathbb{E}_{z\sim p^{*}_{n-1,k}}\nabla l^{*}_{n,k}(z;\omega_n^{(t)}) - \mathbb{E}_{z\sim p^{(t)}_{n-1,k}}\nabla l^{(t,j)}_{n,k}(z;\omega_n^{(t)})\Big\|^2 \\
&\le \frac2\tau\sum_{k} q_k\sum_{j}\Big\|\int\nabla l^{(t,j)}_{n,k}(z;\omega_n^{(t)})\,p^{(t)}_{n-1,k}(z)\,dz - \int\nabla l^{(t,j)}_{n,k}(z;\omega_n^{(t)})\,p^{*}_{n-1,k}(z)\,dz\Big\|^2 \\
&\qquad + \frac2\tau\sum_{k} q_k\sum_{j}\Big\|\mathbb{E}_{z\sim p^{*}_{n-1,k}}\big(\nabla l^{(t,j)}_{n,k}(z;\omega_n^{(t)}) - \nabla l^{*}_{n,k}(z;\omega_n^{(t)})\big)\Big\|^2 \\
&\le \frac2\tau\sum_{k} q_k\sum_{j}\Big(\int\big\|\nabla l^{(t,j)}_{n,k}(z;\omega_n^{(t)})\big\|\,\sqrt{|p^{(t)}_{n-1,k}(z)-p^{*}_{n-1,k}(z)|}\,\sqrt{|p^{(t)}_{n-1,k}(z)-p^{*}_{n-1,k}(z)|}\,dz\Big)^2 + 2\xi_n^{(t)} \\
&\le \frac2\tau\sum_{k} q_k\sum_{j}\Big(\int\big\|\nabla l^{(t,j)}_{n,k}(z;\omega_n^{(t)})\big\|^2\,|p^{(t)}_{n-1,k}(z)-p^{*}_{n-1,k}(z)|\,dz\Big)\Big(\int|p^{(t)}_{n-1,k}(z)-p^{*}_{n-1,k}(z)|\,dz\Big) + 2\xi_n^{(t)} \\
&\le \frac2\tau\sum_{k} q_k\Big(\int|p^{(t)}_{n-1,k}(z)-p^{*}_{n-1,k}(z)|\,dz\Big)\sum_{j}\int\big\|\nabla l^{(t,j)}_{n,k}(z;\omega_n^{(t)})\big\|^2\big(p^{(t)}_{n-1,k}(z)+p^{*}_{n-1,k}(z)\big)\,dz + 2\xi_n^{(t)} \\
&\le 4G\rho_n^{(t)} + 2\xi_n^{(t)}. \tag{28}
\end{aligned}$$
Hence, we have
$$T_1 \ge \frac12\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 - \sum_{k=1}^{N} q_k\big\|\mathbb{E}h_{n,k}^{(t)} - \nabla L_{n,k}^{(t)}(\omega_n^{(t)})\big\|^2 - 4G\rho_n^{(t)} - 2\xi_n^{(t)}. \tag{29}$$
Similar to the proof in Section C.3, we have the following bound for $T_2$:
$$\begin{aligned}
T_2 &\le 2\,\mathbb{E}\Big\|\frac{\sum_{k\in S_n^{(t)}} q_k h_{n,k}^{(t)}}{\sum_{k\in S_n^{(t)}} q_k} - \frac{\sum_{k\in S_n^{(t)}} q_k\,\mathbb{E}h_{n,k}^{(t)}}{\sum_{k\in S_n^{(t)}} q_k}\Big\|^2 + 2\,\mathbb{E}\Big\|\frac{\sum_{k\in S_n^{(t)}} q_k\,\mathbb{E}h_{n,k}^{(t)}}{\sum_{k\in S_n^{(t)}} q_k}\Big\|^2 \\
&\le 4\Big(\mathbb{E}\Big\|\frac{\sum_{k\in S_n^{(t)}} q_k h_{n,k}^{(t)}}{\sum_{k\in S_n^{(t)}} q_k}\Big\|^2 + \mathbb{E}\Big\|\frac{\sum_{k\in S_n^{(t)}} q_k\,\mathbb{E}h_{n,k}^{(t)}}{\sum_{k\in S_n^{(t)}} q_k}\Big\|^2\Big) + 2\,\mathbb{E}\Big\|\frac{\sum_{k\in S_n^{(t)}} q_k\,\mathbb{E}h_{n,k}^{(t)}}{\sum_{k\in S_n^{(t)}} q_k}\Big\|^2 \\
&\le 4\Big(\mathbb{E}\frac{\sum_{k\in S_n^{(t)}} q_k\|h_{n,k}^{(t)}\|^2}{\sum_{k\in S_n^{(t)}} q_k} + \mathbb{E}\frac{\sum_{k\in S_n^{(t)}} q_k\,\mathbb{E}\|h_{n,k}^{(t)}\|^2}{\sum_{k\in S_n^{(t)}} q_k}\Big) + 2\,\mathbb{E}\Big\|\frac{\sum_{k\in S_n^{(t)}} q_k\,\mathbb{E}h_{n,k}^{(t)}}{\sum_{k\in S_n^{(t)}} q_k}\Big\|^2 \\
&\le 8\tau^2 G + 2\,\mathbb{E}\Big\|\frac{\sum_{k\in S_n^{(t)}} q_k\,\mathbb{E}h_{n,k}^{(t)}}{\sum_{k\in S_n^{(t)}} q_k}\Big\|^2 \tag{30} \\
&\le 8\tau^2 G + 6\sum_{k=1}^{N} q_k\big\|\mathbb{E}h_{n,k}^{(t)} - \nabla L_{n,k}^{(t)}(\omega_n^{(t)})\big\|^2 + 6\big\|\nabla L_n^{(t)}(\omega_n^{(t)})\big\|^2 + \frac{6}{S_t}\Big(\beta^2\big\|\nabla L_n^{(t)}(\omega_n^{(t)})\big\|^2 + \kappa^2\Big), \tag{31}
\end{aligned}$$
where $S_t = |S_n^{(t)}|$. Inequality 30 is based on Assumption 3 and the definition of $h_{n,k}^{(t)}$, while Inequality 31 follows from Lemma 5 in Wang et al. (2020b).
According to Assumption 3, for all $t, j, k$ and any $\omega_n$, we have
$$\mathbb{E}_{z\sim p^{(t)}_{n-1,k}}\Big\|\nabla l^{(t,j)}_{n,k}(z;\omega_n) - \mathbb{E}_{z\sim p^{(t)}_{n-1,k}}\nabla l^{(t,j)}_{n,k}(z;\omega_n)\Big\|^2 \le \mathbb{E}_{z\sim p^{(t)}_{n-1,k}}2\big\|\nabla l^{(t,j)}_{n,k}(z;\omega_n)\big\|^2 + 2\big\|\mathbb{E}_{z\sim p^{(t)}_{n-1,k}}\nabla l^{(t,j)}_{n,k}(z;\omega_n)\big\|^2 \le 4\,\mathbb{E}_{z\sim p^{(t)}_{n-1,k}}\big\|\nabla l^{(t,j)}_{n,k}(z;\omega_n)\big\|^2 \le 4G. \tag{32}$$
With the result in Inequality 32 and the results in Section C.5 of Wang et al. (2020b), we have the following bound:
$$\frac12\sum_{k=1}^{N} q_k\big\|\mathbb{E}h_{n,k}^{(t)} - \nabla L_{n,k}^{(t)}(\omega_n^{(t)})\big\|^2 \le \frac{4\eta_t^2 L_n^2 G}{1-D}(\tau^2-1) + \frac{D\beta^2}{2(1-D)}\big\|\nabla L_n^{(t)}(\omega_n^{(t)})\big\|^2 + \frac{D\kappa^2}{2(1-D)},$$
where $D = 4\eta_t^2 L_n^2\tau(\tau-1) < 1$. If $D\le\frac{1}{12\beta^2+1}$, then it follows that $\frac{1}{1-D}\le 1+\frac{1}{12\beta^2}\le 2$ and $\frac{3D\beta^2}{1-D}\le\frac14$. In this case, we can further simplify the inequality:
$$6\sum_{k=1}^{N} q_k\big\|\mathbb{E}h_{n,k}^{(t)} - \nabla L_{n,k}^{(t)}(\omega_n^{(t)})\big\|^2 \le 96\eta_t^2 L_n^2 G(\tau^2-1) + \frac12\big\|\nabla L_n^{(t)}(\omega_n^{(t)})\big\|^2 + 48\eta_t^2 L_n^2\kappa^2\tau(\tau-1)$$
$$\le 96\eta_t^2 L_n^2 G(\tau^2-1) + \big\|\nabla L_n(\omega_n^{(t)})\big\|^2 + 4G\rho_n^{(t)} + 2\xi_n^{(t)} + 48\eta_t^2 L_n^2\kappa^2\tau(\tau-1). \tag{34}$$
Then we can bound $T_2$ as follows:
$$\begin{aligned}
T_2 &\le 8\tau^2 G + 6\sum_{k=1}^{N} q_k\big\|\mathbb{E}h_{n,k}^{(t)} - \nabla L_{n,k}^{(t)}(\omega_n^{(t)})\big\|^2 + 6\big\|\nabla L_n^{(t)}(\omega_n^{(t)})\big\|^2 + \frac{6}{S_t}\Big(\beta^2\big\|\nabla L_n^{(t)}(\omega_n^{(t)})\big\|^2 + \kappa^2\Big) \\
&\le 8\tau^2 G + 96\eta_t^2 L_n^2 G(\tau^2-1) + \big\|\nabla L_n(\omega_n^{(t)})\big\|^2 + 4G\rho_n^{(t)} + 2\xi_n^{(t)} + 48\eta_t^2 L_n^2\kappa^2\tau(\tau-1) \\
&\quad + 6\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 + 48G\rho_n^{(t)} + 24\xi_n^{(t)} + \frac{6}{S_t}\Big(\beta^2\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 + 8\beta^2 G\rho_n^{(t)} + 4\beta^2\xi_n^{(t)} + \kappa^2\Big), \tag{35}
\end{aligned}$$
where Inequality 35 uses the difference bound in Inequality 28.
Plugging Inequality 29 and Inequality 35 back into Inequality 24, and using $S_t\ge 1$, we have
$$\begin{aligned}
\mathbb{E}\,L_n(\omega_n^{(t+1)}) - L_n(\omega_n^{(t)}) \le{}& -\frac12\eta_t\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 + 4\eta_t G\rho_n^{(t)} + 2\eta_t\xi_n^{(t)} \\
&+ \eta_t\Big(16\eta_t^2 L_n^2 G(\tau^2-1) + \frac16\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 + \frac23 G\rho_n^{(t)} + \frac13\xi_n^{(t)} + 8\eta_t^2 L_n^2\kappa^2\tau(\tau-1)\Big) \\
&+ \frac{\eta_t^2 L_n}{2}\Big(8\tau^2 G + 96\eta_t^2 L_n^2 G(\tau^2-1) + \big\|\nabla L_n(\omega_n^{(t)})\big\|^2 + 4G\rho_n^{(t)} + 2\xi_n^{(t)} + 48\eta_t^2 L_n^2\kappa^2\tau(\tau-1) \\
&\qquad + 6\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 + 48G\rho_n^{(t)} + 24\xi_n^{(t)} + 6\big(\beta^2\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 + 8\beta^2 G\rho_n^{(t)} + 4\beta^2\xi_n^{(t)} + \kappa^2\big)\Big) \tag{36} \\
\le{}& -\Big(\frac{5}{12}\eta_t - \frac{7\eta_t^2 L_n}{2} - 3\eta_t^2 L_n\beta^2\Big)\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 \\
&+ \eta_t\Big(\frac{14}{3}G\rho_n^{(t)} + \frac73\xi_n^{(t)} + 16\eta_t^2 L_n^2 G(\tau^2-1) + 8\eta_t^2 L_n^2\kappa^2\tau(\tau-1)\Big) \\
&+ \frac{\eta_t^2 L_n}{2}\Big(8\tau^2 G + 52G\rho_n^{(t)} + 26\xi_n^{(t)} + 96\eta_t^2 L_n^2 G(\tau^2-1) + 48\eta_t^2 L_n^2\kappa^2\tau(\tau-1) + 48\beta^2 G\rho_n^{(t)} + 24\beta^2\xi_n^{(t)} + 6\kappa^2\Big).
\end{aligned}$$
When we set $\eta_t\le\min\{\frac{1}{(21+18\beta^2)L_n},\,1\}$, we get
$$\frac14\eta_t\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 \le L_n(\omega_n^{(t)}) - \mathbb{E}\,L_n(\omega_n^{(t+1)}) + \Big(\frac{14}{3}G + 26L_nG + 24\beta^2 L_nG\Big)\eta_t\rho_n^{(t)} + \Big(\frac73 + 13L_n + 12\beta^2 L_n\Big)\eta_t\xi_n^{(t)} + \Big(16L_n^2 G(\tau^2-1) + 8L_n^2\kappa^2\tau(\tau-1) + 4L_n\tau^2 G + 48L_n^3 G(\tau^2-1) + 24L_n^3\kappa^2\tau(\tau-1) + 3L_n\kappa^2\Big)\eta_t^2.$$
Taking the expectation and summing across all rounds, one can obtain
$$\frac14\sum_{t=0}^{T}\eta_t\,\mathbb{E}\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 \le L_n(\omega_n^{(0)}) - \mathbb{E}\,L_n(\omega_n^{(T+1)}) + A\sum_{t=0}^{T}\eta_t^2 + B\sum_{t=0}^{T}\eta_t\rho_n^{(t)} + C\sum_{t=0}^{T}\eta_t\xi_n^{(t)} \le L_n(\omega_n^{(0)}) + A\sum_{t=0}^{T}\eta_t^2 + B\sum_{t=0}^{T}\eta_t\rho_n^{(t)} + C\sum_{t=0}^{T}\eta_t\xi_n^{(t)},$$
where $A$, $B$ and $C$ are some positive constants. Now we get our final result:
$$\inf_{t\le T}\mathbb{E}\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 \le \frac{1}{\sum_{t=0}^{T}\eta_t}\sum_{t=0}^{T}\eta_t\,\mathbb{E}\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 \le O\Big(\frac{1}{\sum_{t=0}^{T}\eta_t}\Big) + O\Big(\frac{\sum_{t=0}^{T}\rho_n^{(t)}\eta_t}{\sum_{t=0}^{T}\eta_t}\Big) + O\Big(\frac{\sum_{t=0}^{T}\xi_n^{(t)}\eta_t}{\sum_{t=0}^{T}\eta_t}\Big) + O\Big(\frac{\sum_{t=0}^{T}\eta_t^2}{\sum_{t=0}^{T}\eta_t}\Big).$$
It is simple to verify that $\frac{1}{\sum_{t=0}^{T}\eta_t}\to 0$ and $\frac{\sum_{t=0}^{T}\eta_t^2}{\sum_{t=0}^{T}\eta_t}\to 0$ as $T\to\infty$. As for $\frac{\sum_{t=0}^{T}\rho_n^{(t)}\eta_t}{\sum_{t=0}^{T}\eta_t}$, according to the Cauchy-Schwarz inequality, we have
$$\sum_{t=0}^{T}\rho_n^{(t)}\eta_t = \sum_{t=0}^{T}\sqrt{\rho_n^{(t)}}\cdot\sqrt{\rho_n^{(t)}}\,\eta_t \le \sqrt{\sum_{t=0}^{T}\rho_n^{(t)}}\,\sqrt{\sum_{t=0}^{T}\rho_n^{(t)}\eta_t^2} \le \sqrt{\sum_{t=0}^{T}\rho_n^{(t)}}\,\sqrt{\sum_{t=0}^{T}\rho_n^{(t)}}\,\sqrt{\sum_{t=0}^{T}\eta_t^2} < \infty.$$
Hence, we also have $\frac{\sum_{t=0}^{T}\rho_n^{(t)}\eta_t}{\sum_{t=0}^{T}\eta_t}\to 0$ as $T\to\infty$. Similarly, we get the same result for $\frac{\sum_{t=0}^{T}\xi_n^{(t)}\eta_t}{\sum_{t=0}^{T}\eta_t}$. In conclusion, we get the result in Section 3.1:
$$\lim_{T\to\infty}\inf_{t\le T}\mathbb{E}\big\|\nabla L_n(\omega_n^{(t)})\big\|^2 = 0.$$

B PROOF OF THEOREM 2

With $g_m = \|\nabla_{z_m}\tilde l_m(z_m, y)\|$, if $\epsilon_m \ge \frac{g_m}{\mu_m} + \sqrt{\frac{2c_m}{\mu_m} + \frac{g_m^2}{\mu_m^2}}$ for every module $m$, then we can guarantee that the entire model has a joint $(\epsilon_0, c_M)$-robustness in $l(x, y)$.

Proof. We only need to prove the joint robustness of the concatenation of modules $m$ and $(m+1)$ given their local robustness separately; induction over $m$ then yields the joint robustness of the entire model given the local robustness of all modules. For a module $m$ and any perturbation $\delta_{m-1}\in\{\delta_{m-1}:\|\delta_{m-1}\|\le\epsilon_{m-1}\}$ at its input, let $r = f_m(z_{m-1}+\delta_{m-1}) - f_m(z_{m-1})$. Given $\mu_m$-strong convexity and $(\epsilon_{m-1}, c_m)$-robustness in $\tilde l_m(z_m, y)$, we have
$$\nabla_{z_m}\tilde l_m(z_m, y)^{\mathsf T} r + \frac{\mu_m}{2}\|r\|^2 \le \tilde l_m(z_m + r, y) - \tilde l_m(z_m, y) \le c_m \tag{40}$$
$$\Rightarrow\ \frac{\mu_m}{2}\Big(\Big\|r + \frac{\nabla_{z_m}\tilde l_m(z_m, y)}{\mu_m}\Big\|^2 - \frac{\|\nabla_{z_m}\tilde l_m(z_m, y)\|^2}{\mu_m^2}\Big) \le c_m \tag{41}$$
$$\Rightarrow\ \Big\|r + \frac{\nabla_{z_m}\tilde l_m(z_m, y)}{\mu_m}\Big\| \le \sqrt{\frac{2c_m}{\mu_m} + \frac{\|\nabla_{z_m}\tilde l_m(z_m, y)\|^2}{\mu_m^2}} \tag{42}$$
$$\Rightarrow\ \|r\| \le \frac{\|\nabla_{z_m}\tilde l_m(z_m, y)\|}{\mu_m} + \sqrt{\frac{2c_m}{\mu_m} + \frac{\|\nabla_{z_m}\tilde l_m(z_m, y)\|^2}{\mu_m^2}} = \frac{g_m}{\mu_m} + \sqrt{\frac{2c_m}{\mu_m} + \frac{g_m^2}{\mu_m^2}}.$$
And we know that
$$\epsilon_m \ge \frac{g_m}{\mu_m} + \sqrt{\frac{2c_m}{\mu_m} + \frac{g_m^2}{\mu_m^2}} \ge \|r\|,$$
so the perturbation at the input of module $(m+1)$ stays within $\epsilon_m$ and the local $(\epsilon_m, c_{m+1})$-robustness of module $(m+1)$ applies, which completes the induction step.

For a linear auxiliary output model $\theta_m = \{W_m, b_m\}$ (the case study below), the cross-entropy loss is $\tilde l_m(z_m, y) = L(\sigma(W_m^{\mathsf T} z_m + b_m), y)$, where $L(p, y) = -\sum_i y_i\log(p_i)$ and $\sigma(q)_i = \exp(q_i)/\sum_j\exp(q_j)$ are the cross-entropy loss and the softmax function, respectively. Let $p_m = \sigma(W_m^{\mathsf T} z_m + b_m)$; then $\nabla_{z_m}\tilde l_m(z_m, y) = W_m(p_m - y)$ and $H_m = \nabla^2_{z_m}\tilde l_m(z_m, y) = W_m J_m W_m^{\mathsf T}$, where $J_m = \mathrm{diag}(p_m) - p_m p_m^{\mathsf T}$ is the Jacobian of the softmax function. We have the following properties related to the robustness and objective consistency in Theorem 2 and Theorem 3:

1. (First Order Property) Smaller $\|W_m\|$ leads to smaller $g_m$ and $c_m$:
$$g_m = \|\nabla_{z_m}\tilde l_m(z_m, y)\| = \|W_m(p_m - y)\| \le \sqrt{2}\,\|W_m\|, \qquad c_m = \max_{\|\delta_m\|\le r}\big|\tilde l_m(z_m + \delta_m, y) - \tilde l_m(z_m, y)\big| \le \sqrt{2}\,r\,\|W_m\|.$$
2. (Second Order Property) Smaller $\|W_m\|_F$ leads to smaller $\mu_m$ and $\beta_m$:
$$\sum_i\lambda_i(H_m) = \mathrm{tr}(H_m) = \mathrm{tr}(W_m J_m W_m^{\mathsf T}) = \mathrm{tr}(W_m^{\mathsf T} W_m J_m) \le \|W_m\|_F^2\sum_j\big(p_{m,j} - p_{m,j}^2\big),$$
where $\lambda_i(H_m)$ denotes the eigenvalues of $H_m$ in increasing order, with $\lambda_1(H_m) = \mu_m$ and $\lambda_{-1}(H_m) = \beta_m$ (the largest eigenvalue).

We notice that when increasing $\lambda_m$, namely, decreasing $\|W_m\|$ and $\|W_m\|_F$, we decrease $g_m$, $c_m$, $\mu_m$ and $\beta_m$. According to Theorem 2, a smaller $g_m$ leads to stronger robustness, while a smaller $\mu_m$ leads to weaker robustness. And according to Theorem 3, smaller $c_m$ and $\beta_m$ lead to smaller objective inconsistency and thus better natural accuracy.
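The trace bound in the Second Order Property follows from $\mathrm{tr}(AB) \le \lambda_{\max}(A)\,\mathrm{tr}(B) \le \mathrm{tr}(A)\,\mathrm{tr}(B)$ for PSD matrices, since both $W_m^{\mathsf T} W_m$ and the softmax Jacobian are PSD. A small numerical sanity check (toy values of our choosing, not from the paper) confirms it:

```python
import math

# Check of the Second Order Property: for the softmax Jacobian
# J = diag(p) - p p^T, the trace of H = W J W^T is bounded by
# ||W||_F^2 * sum_j (p_j - p_j^2).
def softmax(q):
    m = max(q)
    e = [math.exp(v - m) for v in q]
    s = sum(e)
    return [v / s for v in e]

def trace_H(W, p):
    # H = W J W^T is d x d; tr(H) = sum_i sum_{a,b} W[i][a] * J[a][b] * W[i][b]
    tr = 0.0
    for row in W:
        for a in range(len(p)):
            for b in range(len(p)):
                J_ab = (p[a] if a == b else 0.0) - p[a] * p[b]
                tr += row[a] * J_ab * row[b]
    return tr

W = [[0.3, -0.2, 0.5], [0.1, 0.4, -0.6]]   # d = 2 features, k = 3 classes
p = softmax([1.0, 0.2, -0.5])
fro2 = sum(w * w for row in W for w in row)  # ||W||_F^2
assert 0.0 <= trace_H(W, p) <= fro2 * sum(pj - pj * pj for pj in p) + 1e-12
```

Since the bound involves $\sum_j (p_j - p_j^2) \le 1 - 1/k$, shrinking $\|W_m\|_F$ via the auxiliary weight decay directly shrinks the whole eigenvalue spectrum of $H_m$, which is the mechanism the discussion above relies on.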

C EXPERIMENT SETTINGS AND DETAILS

We run all the experiments on a server with a single NVIDIA TITAN RTX GPU and an Intel Xeon Gold 6254 CPU.

C.1 DETAILS OF BASELINES

FedDynAT (Shah et al., 2021). FedDynAT uses an annealed number of local training iterations to alleviate the slow-convergence issue of Federated Adversarial Training (FAT) (Zizzo et al., 2020). More specifically, the number of local training iterations at round $t$ is annealed as $\tau_t = \tau_0\,\gamma_E^{\lfloor t/F_E\rfloor}$, where $\gamma_E$ is the decay rate and $F_E$ is the decay period. When implementing FedDynAT, we use FedNOVA instead of FedCurv (Shoham et al., 2019) to avoid extra communication in our resource-constrained settings.

FedRBN (Hong et al., 2021). FedRBN adopts Dual Batch Normalization (DBN) layers (Xie et al., 2020) with two sets of batch normalization (BN) statistics for clean samples and adversarial samples, respectively. When propagating robustness from the clients who perform AT to the clients who perform ST, FedRBN uses the adversarial BN statistics of the AT clients to estimate those of the ST clients; $\lambda_{RBN}$ is the propagation hyperparameter and $\epsilon$ is a small constant. With these estimates of the adversarial BN statistics, the ST clients can also attain some adversarial robustness without performing AT.
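The FedDynAT annealing schedule can be sketched in a few lines (the function name is ours, not from the paper; we round to the nearest whole iteration and floor at one, which is an assumption about implementation details the paper does not spell out):

```python
# tau_t = tau_0 * gamma_E ** (t // F_E): decay the local-iteration budget
# by a factor gamma_E once every F_E communication rounds.
def annealed_local_iters(tau_0, gamma_E, F_E, t):
    """Number of local training iterations at communication round t."""
    return max(1, round(tau_0 * gamma_E ** (t // F_E)))

# e.g. tau_0 = 16, decay rate 0.5 applied every 10 rounds:
schedule = [annealed_local_iters(16, 0.5, 10, t) for t in (0, 9, 10, 25, 40)]
# -> [16, 16, 8, 4, 1]
```

Fewer local iterations late in training reduce client drift, which is the effect FedDynAT exploits to speed up convergence of FAT.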



For simplicity, when $p$ is not specified, we use $\|\cdot\|$ for the $\ell_2$ norm. Our conclusions in the following sections can be extended to any $\ell_p$ norm by the equivalence of vector norms.

CONCLUSIONS

In this paper, we proposed Federated Adversarial Decoupled Learning (FADE), a novel framework that reduces the memory and computing power requirements of large-scale federated adversarial training for resource-constrained edge devices. Our theory guarantees the convergence and joint adversarial robustness of FADE, and based on this theory we develop an effective regularizer that reduces the objective inconsistency in FADE. Our experimental results show that FADE can significantly reduce both memory and computing power consumption on small edge devices, while maintaining almost the same accuracy as joint federated adversarial training on both clean and adversarial samples.



Figure 1: An illustration of the module m.

Different from joint training, the input of a module can vary across epochs in DGL, since the previous modules keep being updated during training. Thus we use $z_{m-1}^{(t)}$ to denote the input of module $m$ in epoch $t$; only the input of the first module, $z_0^{(t)} = x$, is invariant. For each module $m\in\{1, 2, \cdots, M\}$ in the entire model, we define the loss function of its auxiliary model as $\tilde l_m(z_m, y; \theta_m) = l_m(z_{m-1}, y; \Theta_m)$, and the loss function of its following layers in the backbone network as $\tilde l'_m(z_m, y; \cdot)$.

Figure 2: A framework of Federated Decoupled Learning. In contrast to the original unique partition (Belilovsky et al., 2020), we allow different model partitions across devices according to their resource budgets. In each communication round, each device randomly selects one module (highlighted) for training, and then the updates of each layer are averaged respectively.
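The per-round aggregation described above can be sketched as follows (a minimal scalar sketch with hypothetical names; in FADE the averaged quantities are module weights, and the schedule and client objectives here are toy choices of ours):

```python
# One aggregation step of the scheme omega <- omega - eta * (sum q_k h_k) / (sum q_k),
# where h_k is client k's accumulated local gradient and q_k its data fraction.
def aggregate_step(omega, client_updates, weights, eta):
    """Apply one weighted-average SGD step to the global module weights."""
    total_q = sum(weights)
    avg = [sum(q * h[i] for q, h in zip(weights, client_updates)) / total_q
           for i in range(len(omega))]
    return [w - eta * g for w, g in zip(omega, avg)]

# Toy usage: two equally weighted clients whose local minima are at +1 and -1,
# so the weighted global objective is minimized at 0.
omega = [10.0]
for t in range(200):
    eta = 1.0 / (t + 1) ** 0.6                   # Robbins-Monro-style schedule
    h = [[2 * (omega[0] - 1.0)],                  # grad of (w - 1)^2 on client 1
         [2 * (omega[0] + 1.0)]]                  # grad of (w + 1)^2 on client 2
    omega = aggregate_step(omega, h, [0.5, 0.5], eta)
# omega drifts toward 0, the minimizer of the average of the two client losses
```

The point of the sketch is that each client only ever computes gradients for the single module it selected, and the server-side averaging is the only place where the clients' updates interact.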

Assume that $\tilde l_m(z_m, y)$ and $\tilde l'_m(z_m, y)$ are $\beta_m$- and $\beta'_m$-smooth in $z_m$ for a module $m$. If there exist $c_m$, $c'_m$, and $r\ge 2$ such that the auxiliary model has $(r, c_m)$-robustness in $\tilde l_m(z_m, y)$ and the backbone network has $(r, c'_m)$-robustness in $\tilde l'_m(z_m, y)$, then we have:

Figure 3: Minimum memory and computational requirements of baselines and FADE. The results are shown as the percentage of the resource requirement of full FedDynAT with PGD-10 AT.

Figure 4: Adversarial accuracy when training with different portions of resource-sufficient clients.

Figure 5: Natural (blue lines with triangle markers) and adversarial (red lines with circle markers) accuracy with different auxiliary weight decay hyperparameter λ m .

Assume that $\tilde l_m(z_m, y)$ is $\mu_m$-strongly convex in $z_m$ for each module $m$. If each module $m\le M$ has local $(\epsilon_{m-1}, c_m)$-robustness in $l_m(z_{m-1}, y)$, and $\forall m\le M$, $\epsilon_m \ge \frac{g_m}{\mu_m} + \sqrt{\frac{2c_m}{\mu_m} + \frac{g_m^2}{\mu_m^2}}$, where $g_m = \|\nabla_{z_m}\tilde l_m(z_m, y)\|$,
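The per-module perturbation bound $\|r\| \le g/\mu + \sqrt{2c/\mu + g^2/\mu^2}$ used in the proof of Theorem 2 can be checked numerically on a toy one-dimensional strongly convex loss (the example and function names are ours, not from the paper):

```python
import math

# For l(z) = (mu/2) z^2 (mu-strongly convex, grad l(z) = mu * z, g = mu * |z|),
# scan for the largest |r| such that l(z + r) - l(z) <= c still holds on at
# least one side, and compare it to the closed-form bound.
def max_admissible_r(z, mu, c, step=1e-4, r_max=5.0):
    """Largest |r| (up to r_max) with l(z +/- r) - l(z) <= c on either side."""
    l = lambda v: 0.5 * mu * v * v
    best, r = 0.0, 0.0
    while r < r_max:
        if min(l(z - r), l(z + r)) - l(z) <= c:
            best = r
        r += step
    return best

mu, c, z = 2.0, 1.0, 0.5
g = mu * abs(z)
bound = g / mu + math.sqrt(2 * c / mu + (g / mu) ** 2)
# the brute-force maximum matches the closed-form bound up to the scan step
assert abs(max_admissible_r(z, mu, c) - bound) < 1e-3
```

For a quadratic the bound is tight, which matches the proof: strong convexity is exactly what turns the loss-deviation budget $c_m$ into a norm bound on the feature perturbation $r$.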

CASE STUDY: LINEAR AUXILIARY OUTPUT MODEL

For a linear auxiliary output model $\theta_m = \{W_m, b_m\}$, the cross-entropy loss is given as $\tilde l_m(z_m, y) = L(\sigma(W_m^{\mathsf T} z_m + b_m), y)$.
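The input gradient of this linear head, $\nabla_{z_m}\tilde l_m = W_m(p_m - y)$, is easy to verify by finite differences (toy numbers of our choosing):

```python
import math

# Check that for l(z) = CE(softmax(W^T z + b), y), grad_z l = W (p - y).
def softmax(q):
    m = max(q)
    e = [math.exp(v - m) for v in q]
    s = sum(e)
    return [v / s for v in e]

def loss(z, W, b, y):
    logits = [sum(W[i][j] * z[i] for i in range(len(z))) + b[j]
              for j in range(len(b))]
    p = softmax(logits)
    return -sum(y[j] * math.log(p[j]) for j in range(len(y)))

z = [0.5, -1.0]
W = [[0.2, -0.3, 0.1], [0.4, 0.0, -0.2]]   # d x k; logits = W^T z + b
b = [0.0, 0.1, -0.1]
y = [0.0, 1.0, 0.0]

p = softmax([sum(W[i][j] * z[i] for i in range(2)) + b[j] for j in range(3)])
analytic = [sum(W[i][j] * (p[j] - y[j]) for j in range(3)) for i in range(2)]

eps = 1e-6
for i in range(2):
    zp = list(z); zp[i] += eps
    zm = list(z); zm[i] -= eps
    fd = (loss(zp, W, b, y) - loss(zm, W, b, y)) / (2 * eps)
    assert abs(fd - analytic[i]) < 1e-6   # central difference matches W (p - y)
```

Since $\|p_m - y\| \le \sqrt{2}$ for one-hot $y$, this identity is also what yields the bound $g_m \le \sqrt{2}\,\|W_m\|$ in the First Order Property.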

The model architecture and the 3-module partition of VGG-11. Convolutional blocks consist of Conv2D (kernel size = 3, padding = 1, stride = 1), BN2D, ReLU, and MaxPool2D (kernel size = 2, stride = 2).

Table 1: Results of partial federated adversarial training with 100 clients. "20% AT + 80% ST" means that 20% of the clients perform AT while 80% of the clients perform standard training (ST).


The natural accuracy (clean samples) and adversarial accuracy (adversarial samples) on FMNIST. Results are reported in the mean and the standard deviation over 3 random seeds.

The natural accuracy (clean samples) and adversarial accuracy (adversarial samples) on CIFAR-10. Results are reported in the mean and the standard deviation over 3 random seeds.

The hyperparameters of FADE. The last module does not have an auxiliary model, since it directly uses the loss of the backbone network; thus it does not have a weight decay hyperparameter. Each group of three values corresponds to one module:

FedNOVA: 8/255, 2/255, 0.01 | 4/255, 1/255, 0.01 | 3/255, 0.75/255, n/a
FedBN: 8/255, 2/255, 0.001 | 4/255, 1/255, 0.001 | 3/255, 0.75/255, n/a

For FedRBN, $\mu^a_{ST}$ and $\mu^n_{ST}$ are the means in the adversarial BN and the natural BN, respectively, on an ST client, and $(\sigma^a_{ST})^2$ and $(\sigma^n_{ST})^2$ are the variances. Similarly, for AT clients we have $\mu^a_{AT}$, $\mu^n_{AT}$, $(\sigma^a_{AT})^2$ and $(\sigma^n_{AT})^2$.


Following Hong et al. (2021), we adopt the same setting where $\lambda_{RBN} = 0.1$. We loosen the requirement of a noise detector and allow an optimal noise detector for FedRBN, such that it always uses the correct BN statistics during testing (this makes its robustness stronger than with a real noise detector).

C.3 MODEL ARCHITECTURES AND MODEL PARTITIONS

The model architectures and model partitions used in our experiments are shown in Tables 5, 6 and 7. We skip all Batch Normalization layers in the model when training with FedNOVA. We also report the number of parameters in the tables. In most cases the auxiliary models are small enough that they introduce only negligible extra parameters and computation; therefore they do not increase the memory and computing power requirements on resource-constrained devices.

