FADE: ENABLING LARGE-SCALE FEDERATED ADVERSARIAL TRAINING ON RESOURCE-CONSTRAINED EDGE DEVICES

Abstract

Federated adversarial training can effectively incorporate adversarial robustness into privacy-preserving federated learning systems. However, the high demand for memory capacity and computing power makes large-scale federated adversarial training infeasible on resource-constrained edge devices. Few previous studies on federated adversarial training have tried to tackle both the memory and the computational constraints at the same time. In this paper, we propose a new framework named Federated Adversarial DEcoupled Learning (FADE) to enable adversarial training (AT) on resource-constrained edge devices. FADE decouples the entire model into small modules so that each module fits into the resource budget of an edge device, and each device only needs to perform AT on a single module in each communication round. We also propose an auxiliary weight decay to alleviate objective inconsistency and achieve a better accuracy-robustness balance in FADE. FADE offers theoretical guarantees for both convergence and adversarial robustness, and our experimental results show that FADE can significantly reduce the consumption of memory and computing power while maintaining accuracy and robustness.

1. INTRODUCTION

As a privacy-preserving distributed learning paradigm, Federated Learning (FL) makes a meaningful step toward the practice of secure and trustworthy artificial intelligence (Konečnỳ et al., 2015; 2016; McMahan et al., 2017; Kairouz et al., 2019). In contrast to traditional centralized training, FL pushes the training to edge devices (clients): client models are trained locally and uploaded to the server for aggregation. Since no private data is shared with other clients or the server, FL substantially improves data privacy during the training process.

While FL can preserve the privacy of the participants, other threats can still impact the reliability of the machine learning model running on the FL system. One such threat is adversarial examples, which aim to cause misclassifications by adding imperceptible noise to the input data (Szegedy et al., 2013; Goodfellow et al., 2014). Previous research has shown that performing adversarial training (AT) on a large model is an effective way to attain robustness against adversarial examples while maintaining high accuracy on clean samples (Liu et al., 2020). However, large-scale AT also puts a high demand on both memory capacity and computing power, which is unaffordable for many edge devices with limited resources, such as mobile phones and IoT devices, in FL scenarios (Kairouz et al., 2019; Li et al., 2020; Wong et al., 2020; Zizzo et al., 2020; Hong et al., 2021). Table 1 shows that strong robustness of the whole FL system cannot be attained by allowing only a small portion (e.g., 20%) of the clients to perform AT. Therefore, enabling resource-constrained edge devices (which usually constitute the majority of the participants in cross-device FL (Kairouz et al., 2019)) to perform AT is necessary for achieving strong robustness in FL. Some previous works have tried to tackle client-wise systematic heterogeneity in FL (Li et al., 2018; Lu et al., 2020; Wang et al., 2020b; Xie et al., 2019).
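As a toy illustration of such adversarial examples (not the paper's attack setup), consider a linear model with logistic loss: the one-step fast gradient sign method of Goodfellow et al. (2014) perturbs the input in the sign direction of the loss gradient. The model, data, and perturbation budget below are hypothetical.

```python
import numpy as np

def logistic_loss(x, y, w):
    """l(x, y; w) = log(1 + exp(-y * <w, x>)) for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

def fgsm_example(x, y, w, epsilon):
    """One-step FGSM: shift x by epsilon in the sign direction of the
    input gradient of the loss. For this linear model, this is the
    worst-case perturbation under an L-infinity budget of epsilon."""
    grad_x = -y * w / (1.0 + np.exp(y * np.dot(w, x)))  # d l / d x
    return x + epsilon * np.sign(grad_x)
```

Adversarial training then minimizes the loss on such perturbed inputs instead of (or in addition to) the clean ones, which is what makes its memory and compute footprint larger than standard training.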
The most common method to deal with slow devices is to allow them to perform fewer epochs of local training than the others (Li et al., 2018; Wang et al., 2020b). While this method can reduce the computational costs on slow devices, the memory capacity limitation of edge devices has not been well discussed in these works. In this paper, we propose Federated Adversarial DEcoupled Learning (FADE), which is the first adversarial decoupled learning scheme for heterogeneous FL. Our main contributions are:

1. We propose a more flexible decoupled learning scheme for heterogeneous Federated Learning, which allows different model partitions on devices with different resource budgets. We give a theoretical guarantee for the convergence of our Federated Decoupled Learning.

2. We propose Federated Adversarial DEcoupled Learning (FADE) to attain theoretically guaranteed joint adversarial robustness of the entire model. Our experimental results show that FADE can significantly reduce the memory and computational requirements while maintaining the same natural accuracy and adversarial robustness as joint training.

3. We analyze the trade-off between objective consistency (natural accuracy) and adversarial robustness (adversarial accuracy) in FADE, and we propose an effective method to achieve a better accuracy-robustness balance point via weight decay on the auxiliary models.

2. PRELIMINARY

Federated Learning (FL). In FL, different clients collaboratively train a shared global model $w$ with locally stored data (McMahan et al., 2017). The objective of FL can be formulated as:

$$\min_w L(w) = \frac{1}{\sum_i |D_i|} \sum_{k=1}^{N} \sum_{(x,y)\in D_k} l(x, y; w) = \sum_{k=1}^{N} q_k L_k(w),$$

where

$$L_k(w) = \frac{1}{|D_k|} \sum_{(x,y)\in D_k} l(x, y; w) = \mathbb{E}_{(x,y)\sim D_k}\left[l(x, y; w)\right],$$

and $l$ is the task loss, e.g., the cross-entropy loss for classification. $D_k$ is the dataset of client $k$ and its weight is $q_k = |D_k| / \sum_i |D_i|$. To solve for the optimal solution of this objective, in each communication round $t$, FL first samples a subset of clients $S^{(t)}$ to perform local training. These clients initialize their local models with the global model, $w_k^{(t,0)} = w^{(t)}$, and then run $\tau$ iterations of local SGD. After all sampled clients complete training in this round, their models are uploaded and averaged to form the new global model (McMahan et al., 2017). We summarize this procedure as follows:

$$w_k^{(t+1)} = w^{(t)} - \eta_t \sum_{j=0}^{\tau-1} \nabla L_k(w_k^{(t,j)}), \qquad w^{(t+1)} = \frac{1}{\sum_{i\in S^{(t)}} q_i} \sum_{k\in S^{(t)}} q_k\, w_k^{(t+1)},$$

where $w_k^{(t,j)}$ is the local model of client $k$ at the $j$-th iteration of round $t$.
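The round structure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the two clients, their quadratic losses, and the hyperparameters are made up for the example.

```python
import numpy as np

def local_update(w_global, grad_fn, lr, tau):
    """tau iterations of local SGD, starting from the global model."""
    w = w_global.copy()
    for _ in range(tau):
        w = w - lr * grad_fn(w)
    return w

def fedavg_round(w_global, clients, lr, tau):
    """One communication round: every sampled client trains locally,
    then the server takes the average weighted by q_k = |D_k| / sum_i |D_i|."""
    total = sum(q for q, _ in clients)
    updates = [(q, local_update(w_global, g, lr, tau)) for q, g in clients]
    return sum(q * w for q, w in updates) / total

# Two hypothetical clients with quadratic losses L_k(w) = 0.5 * (w - c_k)^2,
# so grad L_k(w) = w - c_k; the q_k values play the role of dataset sizes.
clients = [(1.0, lambda w: w - 2.0), (3.0, lambda w: w + 1.0)]
w = np.zeros(1)
for _ in range(30):
    w = fedavg_round(w, clients, lr=0.1, tau=5)
```

With these quadratic losses the aggregate converges to the $q_k$-weighted average of the client optima, which makes the effect of the weighting $q_k$ easy to check by hand.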



Table 1: Results of partial federated adversarial training with 100 clients. "20% AT + 80% ST" means that 20% of the clients perform AT while 80% perform standard training (ST).

To tackle both memory capacity and computational constraints, recent studies propose a novel training scheme named Decoupled Greedy Learning (DGL), which decouples the entire neural network into several small modules and trains each module separately (Belilovsky et al., 2019; Wang et al., 2021). DGL can be naturally deployed in FL, since the training of the decoupled modules can be parallelized on different computing nodes (Belilovsky et al., 2020). However, vanilla DGL only supports a single, uniform model partition across all computing nodes, which cannot fit the different resource budgets of different clients in heterogeneous FL. Additionally, no previous studies have explored whether DGL can be combined with AT to confer joint adversarial robustness on the entire model. Achieving joint robustness of the entire model when applying AT in DGL is not trivial, since the modules in DGL are trained separately with different locally supervised losses.
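As a rough sketch of the decoupling idea, the snippet below trains two modules greedily, each with its own auxiliary head and a local squared-error loss, and passes only "detached" activations forward so that no gradient crosses module boundaries. The layer sizes, constant initialization, and loss are illustrative assumptions, not the DGL or FADE implementation.

```python
import numpy as np

class DecoupledModule:
    """One decoupled module: a ReLU feature layer plus an auxiliary head.
    It is trained only on its own local loss; its input is treated as a
    constant, so no gradient flows back to the previous module."""

    def __init__(self, d_in, d_out):
        # Small constant initialization (illustrative choice)
        self.W = np.full((d_out, d_in), 0.1)   # feature layer
        self.a = np.full(d_out, 0.1)           # auxiliary (scalar-output) head

    def forward(self, x):
        return np.maximum(self.W @ x, 0.0)

    def local_step(self, x, y, lr=0.05):
        """One SGD step on the local squared-error loss; returns the
        features (detached) that feed the next module."""
        h = self.forward(x)
        err = float(self.a @ h) - y             # local prediction error
        mask = (self.W @ x > 0.0).astype(float) # ReLU derivative
        grad_a = err * h
        grad_W = np.outer(err * self.a * mask, x)
        self.a -= lr * grad_a
        self.W -= lr * grad_W
        return h                                # next module sees a constant

# Greedy training: each module minimizes only its own auxiliary loss.
modules = [DecoupledModule(4, 8), DecoupledModule(8, 8)]
x, y = np.array([0.5, 0.2, 0.1, 0.3]), 1.0
for _ in range(300):
    h = x
    for m in modules:
        h = m.local_step(h, y)
```

Because each module's memory footprint covers only its own weights, activations, and auxiliary head, a device never needs to hold the full backpropagation graph of the entire network, which is what makes the scheme attractive for constrained clients.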

