SSELF: ROBUST FEDERATED LEARNING AGAINST STRAGGLERS AND ADVERSARIES

Abstract

While federated learning allows efficient model training with local data at edge devices, two major issues remain to be resolved: slow devices known as stragglers, and malicious attacks launched by adversaries. While the presence of both stragglers and adversaries raises serious concerns for the deployment of practical federated learning systems, no known schemes or combinations of schemes, to the best of our knowledge, effectively address these two issues at the same time. In this work, we propose Sself, a semi-synchronous entropy- and loss-based filtering/averaging scheme, to tackle both stragglers and adversaries simultaneously. Stragglers are handled by exploiting staleness (arrival delay) information when combining locally updated models during periodic global aggregation. Various adversarial attacks are tackled by utilizing a small amount of public data collected at the server in each aggregation step: the server first filters out model-poisoned devices using computed entropies, and then performs weighted averaging based on estimated losses to combat data poisoning and backdoor attacks. A theoretical convergence bound is established to provide insights on the convergence of Sself. Extensive experimental results show that Sself outperforms various combinations of existing methods aiming to handle stragglers/adversaries.

1. INTRODUCTION

Large volumes of data collected at edge devices (e.g., smartphones) are valuable resources for training machine learning models with good accuracy. Federated learning (McMahan et al., 2017; Li et al., 2019a;b; Konečnỳ et al., 2016) is a promising direction for large-scale learning, enabling training of a shared global model with fewer privacy concerns. However, current federated learning systems suffer from two major issues. The first is stragglers, devices that are considerably slower than average; the second is adversaries that launch various adversarial attacks. Regarding the first issue, waiting for all the stragglers at each global round can significantly slow down the overall training process in a synchronous setup. To address this, an asynchronous federated learning scheme was proposed in (Xie et al., 2019a), where the global model is updated every time the server receives a local model from a device, in the order of arrivals; the global model is updated asynchronously based on the device's staleness $t - \tau$, the difference between the current round $t$ and the round $\tau$ at which the device received the global model from the server. However, among the results received at each global round, a significant portion with large staleness does not help the global model in a meaningful way, potentially making the scheme ineffective. Moreover, since the model update is performed one-by-one asynchronously, the scheme in (Xie et al., 2019a) is vulnerable to various adversarial attacks; combining this type of asynchronous scheme with existing adversary-resilient ideas is unlikely to be fruitful. There are different forms of adversarial attacks that significantly degrade the performance of current federated learning systems.
First, in untargeted attacks, an attacker can poison the updated model at the devices before it is sent to the server (model update poisoning) (Blanchard et al., 2017; Lamport et al., 2019) or can poison the dataset of each device (data poisoning) (Biggio et al., 2012; Liu et al., 2017), degrading the accuracy of the model. In targeted attacks (or backdoor attacks) (Chen et al., 2017a; Bagdasaryan et al., 2018; Sun et al., 2019), the adversaries cause the model to misclassify only the targeted subtasks, while not degrading the overall test accuracy. To resolve these issues, a robust federated averaging (RFA) scheme was recently proposed in (Pillutla et al., 2019), which utilizes the geometric median of the received results for aggregation. However, RFA tends to lose performance rapidly once the portion of adversaries exceeds a certain threshold. In this sense, RFA is not an ideal candidate to combine with known straggler-mitigating strategies (e.g., ignoring stragglers) where a relatively small number of devices are utilized for global aggregation; the attack ratio can then be very high, significantly degrading performance. To our knowledge, there are currently no existing methods or known combinations of ideas that can effectively handle both stragglers and adversaries at the same time, an issue that is becoming increasingly important in practical scenarios.

Contributions. In this paper, we propose Sself, semi-synchronous entropy and loss based filtering/averaging, a robust federated learning strategy that tackles both stragglers and adversaries simultaneously. In the proposed scheme, straggler effects are mitigated by semi-synchronous global aggregation at the server, and in each aggregation step, the impact of adversaries is countered by a new aggregation method utilizing public data collected at the server. The details of our key ideas are as follows.
Targeting the straggler issue, our strategy is to perform periodic global aggregation while allowing the results sent from stragglers to be aggregated in later rounds. The key is a judicious mix of synchronous and asynchronous approaches. At each round, as a first step, we aggregate the results that start from the same initial model (i.e., have the same staleness), as in the synchronous scheme. Then, we take a weighted sum of these aggregated results with different staleness, i.e., coming from different initial models, as in the asynchronous approach. Regarding adversarial attacks, robust aggregation is realized via entropy-based filtering and loss-weighted averaging. These can be employed in the first step of the semi-synchronous strategy described above, enabling protection against model/data poisoning and backdoor attacks. To this end, our key idea is to utilize public IID (independent, identically distributed) data collected at the server. We can imagine a practical scenario where the server has some global data uniformly distributed over classes, as in the setup of (Zhao et al., 2018). This is generally a reasonable setup, since data centers usually have some collected data for the learning task (although there may be only a few samples). For example, various types of medical data are open to the public in many countries. Based on the public data, the server computes the entropy and loss of each received model. We use the entropy of each model to filter out the devices whose models are poisoned. In addition, by taking the loss-weighted average of the surviving models, we protect the system against local data poisoning and backdoor attacks. We derive a theoretical bound for Sself to ensure acceptable convergence behavior. Experimental results on different datasets show that Sself outperforms various combinations of straggler/adversary defense methods with only a small portion of public data at the server.

Related works.
The authors of (Li et al., 2019c; Wu et al., 2019; Xie et al., 2019a) have recently tackled the straggler issue in a federated learning setup. The basic idea is to allow the devices and the server to update the models asynchronously. In particular, the authors of (Xie et al., 2019a) proposed an asynchronous scheme where the global model is updated every time the server receives a local model from a device. However, a fair portion of the received models with large staleness does not help the global model in meaningful ways, potentially slowing down convergence. A more critical issue is that robust methods designed to handle adversarial attacks, such as RFA (Pillutla et al., 2019), Multi-Krum (Blanchard et al., 2017), or the entropy/loss based idea proposed here, are hard to implement in conjunction with this asynchronous scheme. To combat adversaries, various aggregation methods have been proposed in the distributed learning setup with IID data across nodes (Yin et al., 2018a;b; Chen et al., 2017b; Blanchard et al., 2017; Xie et al., 2018). The authors of (Chen et al., 2017b) suggest a geometric-median-based aggregation rule for the received models or gradients. In (Yin et al., 2018a), a trimmed mean approach is proposed, which removes a fraction of the largest and smallest values of each element among the received results. In Multi-Krum (Blanchard et al., 2017), among $N$ workers in the system, the server tolerates $f$ Byzantine workers under the assumption $2f + 2 < N$. Targeting federated learning with non-IID data, the recently introduced RFA method of (Pillutla et al., 2019) utilizes the geometric median of the models sent from devices, similar to (Chen et al., 2017b). However, as mentioned above, these methods are ineffective when combined with a straggler-mitigation scheme, potentially degrading learning performance.
Compared to Multi-Krum and RFA, our entropy/loss based scheme can tolerate adversaries even at a high attack ratio, showing remarkable advantages, especially when combined with straggler-mitigation schemes. Finally, we note that the authors of (Xie et al., 2019c) considered both stragglers and adversaries, but in a distributed learning setup with IID data across the nodes. In contrast, we target a non-IID data distribution in a federated learning scenario.

2. PROPOSED FEDERATED LEARNING WITH SSELF

We consider the following federated optimization problem:

$$w^* = \arg\min_w F(w) = \arg\min_w \sum_{k=1}^{N} \frac{m_k}{m} F_k(w), \qquad (1)$$

where $N$ is the number of devices, $m_k$ is the number of data samples in device $k$, and $m = \sum_{k=1}^{N} m_k$ is the total number of data samples across all $N$ devices in the system. Letting $x_{k,j}$ be the $j$-th data sample in device $k$, the local loss function of device $k$ is written as $F_k(w) = \frac{1}{m_k} \sum_{j=1}^{m_k} \ell(w; x_{k,j})$. In the following, we provide solutions to the above problem under the existence of stragglers (Section 2.1) and adversaries (Section 2.2), and finally propose Sself to handle both issues (Section 2.3).
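As a concrete illustration of this objective, the sketch below evaluates $F(w)$ as the data-size-weighted average of the local losses $F_k(w)$. A toy scalar quadratic loss $\ell(w; x) = \frac{1}{2}(w - x)^2$ stands in for a real model loss; the function names and the loss choice are our own assumptions, not part of Sself:

```python
import numpy as np

def local_loss(w, samples):
    """F_k(w): average of a toy quadratic loss over device k's m_k samples."""
    return np.mean(0.5 * (w - samples) ** 2)

def global_loss(w, device_data):
    """F(w) = sum_k (m_k / m) * F_k(w): data-size-weighted average of local losses."""
    m = sum(len(d) for d in device_data)
    return sum(len(d) / m * local_loss(w, d) for d in device_data)
```

With this weighting, the federated objective coincides with the empirical loss over the pooled data, which is the property the $m_k/m$ coefficients are designed to preserve.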

2.1. SEMI-SYNCHRONOUS SCHEME AGAINST STRAGGLERS

In the $t$-th global round, the server sends the current model $w_t$ to $K$ devices in $S_t$ ($|S_t| = K \le N$), a set of indices randomly selected from the $N$ devices in the system. We let $C = K/N$ be the ratio of devices that participate in each global round. Each device in $S_t$ performs $E$ local updates with its own data and sends the updated model back to the server. In conventional federated averaging (FedAvg), the server waits until the results of all $K$ devices in $S_t$ arrive and then performs aggregation to obtain

$$w_{t+1} = \sum_{k \in S_t} \frac{m_k}{\sum_{k' \in S_t} m_{k'}} w_t(k), \qquad (2)$$

where $w_t(k)$ is the model after $E$ local updates at device $k$ starting from $w_t$. However, due to the effect of stragglers, waiting for all $K$ devices can significantly slow down the overall training process. To resolve this issue, our scheme performs periodic global aggregation at the server. At each global round $t$, the server transmits the current model/round $(w_t, t)$ to the devices in $S_t$. Instead of waiting for all devices in $S_t$, the server aggregates the models that arrive before a fixed time deadline $T_d$ to obtain $w_{t+1}$, and moves on to the next global round $t+1$. Hence, model aggregation is performed periodically every $T_d$. A key feature is that we do not ignore the results sent from stragglers (those that miss the deadline $T_d$); these results are utilized at the next global aggregation step, or even later, depending on the delay or staleness. Let $U^{(t)}_i$ be the set of devices 1) that are selected by the server at global round $t$, i.e., $U^{(t)}_i \subseteq S_t$, and 2) that successfully send their results to the server at global round $i$, for $i \ge t$. Then we can write $S_t = \cup_{i=t}^{\infty} U^{(t)}_i$, where $U^{(t)}_i \cap U^{(t)}_j = \emptyset$ for $i \ne j$. Here, $U^{(t)}_\infty$ can be viewed as the devices that are selected at round $t$ but fail to send their results back to the server.
According to these notations, the devices whose training results arrive at the server during global round $t$ belong to one of the following $t+1$ sets: $U^{(0)}_t, U^{(1)}_t, \ldots, U^{(t)}_t$. The result sent from a device $k \in U^{(i)}_t$ is the model after $E$ local updates starting from $w_i$, which we denote by $w_i(k)$. At each round $t$, we first perform FedAvg within each staleness group as

$$v^{(i)}_{t+1} = \sum_{k \in U^{(i)}_t} \frac{m_k}{\sum_{k' \in U^{(i)}_t} m_{k'}} w_i(k), \quad i = 0, 1, \ldots, t,$$

where $v^{(i)}_{t+1}$ is the aggregate of the locally updated models (starting from $w_i$) received at round $t$ with staleness $t - i + 1$. Then, from $v^{(0)}_{t+1}, v^{(1)}_{t+1}, \ldots, v^{(t)}_{t+1}$, we take the weighted average $\sum_{i=0}^{t} \alpha_t(i) v^{(i)}_{t+1}$ of the results with different staleness. Here,

$$\alpha_t(i) \propto \frac{\sum_{k \in U^{(i)}_t} m_k}{(t - i + 1)^c}$$

is a normalized coefficient that is proportional to the number of data samples in $U^{(i)}_t$ and inversely proportional to $(t - i + 1)^c$, for a given hyperparameter $c \ge 0$. Hence, $v^{(i)}_{t+1}$ with smaller staleness $t - i + 1$ receives a larger weight, giving more weight to more recent results. Based on this weighted sum, we finally obtain

$$w_{t+1} = (1 - \gamma) w_t + \gamma \sum_{i=0}^{t} \alpha_t(i) v^{(i)}_{t+1}, \qquad (3)$$

where $\gamma$ combines the aggregated result with the latest global model $w_t$. We then move on to round $t+1$, where the server selects $S_{t+1}$ and sends $(w_{t+1}, t+1)$ to these devices. If the server knows the set of active devices (those still performing computation), $S_{t+1}$ can be constructed to be disjoint from the active devices.

[Figure 1: Overall procedure for Sself at the server. At global round $t$, the received models belong to one of the $t+1$ sets $U^{(0)}_t, U^{(1)}_t, \ldots, U^{(t)}_t$. After entropy-based filtering, the server performs loss-weighted averaging over the results in each $U^{(i)}_t$ to obtain $v^{(i)}_{t+1}$ for $i = 0, 1, \ldots, t$; combining with $w_t$ yields $w_{t+1}$, and the server moves on to round $t+1$.]
If not, the server randomly chooses $S_{t+1}$ among all devices in the system, and any selected devices that are still active can ignore the current request of the server. The left-hand side of Fig. 1 describes our semi-synchronous scheme, whose key characteristics can be summarized as follows. First, thanks to periodic global aggregation at the server, our scheme is not delayed by the effect of stragglers. Second, our scheme fully utilizes the results sent from stragglers in future global rounds: we first perform federated averaging over the devices with the same staleness (as in the synchronous scheme), and then take the weighted sum of these averaged results across different staleness (as in the asynchronous scheme).
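A minimal sketch of the semi-synchronous aggregation rule described above, assuming each arrival has been reduced to a (sample count, model vector) pair grouped by its starting round $i$; the helper name and data layout are illustrative, not from the paper:

```python
import numpy as np

def aggregate_semi_sync(w_t, arrivals, t, c=1.0, gamma=0.5):
    """One Sself-style aggregation step (without the adversary defenses).

    arrivals: dict mapping start round i -> list of (m_k, w_i(k)) pairs
    received during round t, i.e., the groups U_t^{(i)}.
    """
    vs, alphas = [], []
    for i, models in arrivals.items():
        m_tot = sum(m for m, _ in models)
        # FedAvg within one staleness group: v^{(i)}_{t+1}
        v = sum(m * w for m, w in models) / m_tot
        vs.append(v)
        # alpha_t(i) proportional to m_tot / (t - i + 1)^c
        alphas.append(m_tot / (t - i + 1) ** c)
    alphas = np.array(alphas) / np.sum(alphas)
    v_bar = sum(a * v for a, v in zip(alphas, vs))
    # Mix with the latest global model: w_{t+1} = (1-gamma) w_t + gamma * v_bar
    return (1 - gamma) * w_t + gamma * v_bar
```

With two groups of equal data size and staleness 1 and 2 (and $c = 1$), the fresher group receives twice the weight of the staler one, matching the $(t - i + 1)^{-c}$ discounting.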

2.2. ENTROPY AND LOSS BASED FILTERING/AVERAGING AGAINST ADVERSARIES

In this subsection, we propose entropy-based filtering and loss-weighted averaging, which not only show better performance with or without attacks but also combine well with the semi-synchronous scheme, in contrast to existing adversary-resilient aggregation methods. Our key idea is to utilize the public IID data collected at the server. We can imagine a practical scenario where the server (or data center) has its own data samples as in (Zhao et al., 2018), e.g., various medical data that are open to the public. Using these public data, we provide the following two solutions, which can protect the system against model update poisoning, data poisoning, and backdoor attacks.

1) Entropy-based filtering: Let $n_{pub}$ be the number of public data samples at the server, and let $x_{pub,j}$ be the $j$-th public sample. When the server receives the locally updated models from the devices, it measures the entropy of each device $k$ using the public data as

$$E_{avg}(k) = \frac{1}{n_{pub}} \sum_{j=1}^{n_{pub}} E_{x_{pub,j}}(k), \qquad (4)$$

where $E_{x_{pub,j}}(k)$ is the Shannon entropy of the $k$-th device's model on the sample $x_{pub,j}$, written as

$$E_{x_{pub,j}}(k) = -\sum_{q=1}^{Q} P^{(q)}_{x_{pub,j}}(k) \log P^{(q)}_{x_{pub,j}}(k).$$

Here, $Q$ is the number of classes of the dataset and $P^{(q)}_{x_{pub,j}}(k)$ is the predicted probability of the $q$-th class on sample $x_{pub,j}$ under the $k$-th device's model. In supervised learning tasks, the model produces a highly confident prediction for the ground-truth label of trained samples and thus has a low prediction entropy. However, if the local model is poisoned, e.g., by a reverse-sign attack, the model is more likely to predict all classes near-uniformly and thus has a high entropy. Based on this observation, the server filters out the models whose entropy exceeds some threshold value $E_{th}$.
As seen later in Section 4, $E_{th}$ is a hyperparameter that can be tuned easily, since there is a large gap between the entropy values of benign and adversarial devices for all datasets. Note that this method is robust against model update poisoning even with a large portion of adversaries, because it simply filters out any result whose entropy exceeds $E_{th}$. This is a significant advantage over the median-based method (Pillutla et al., 2019), whose performance degrades significantly when the attack ratio is high.

2) Loss-weighted averaging: The server also measures the loss of each received model using the public data. Based on the loss values, the server then aggregates the received models as

$$w_{t+1} = \sum_{k \in S_t} \beta_t(k) w_t(k), \quad \text{where} \quad \beta_t(k) \propto \frac{m_k}{\{F_{pub}(w_t(k))\}^{\delta}} \ \text{ and } \ \sum_{k \in S_t} \beta_t(k) = 1. \qquad (5)$$

Here, $w_t(k)$ is the locally updated model of the $k$-th device at global round $t$, and $F_{pub}(w_t(k))$ is the average loss of $w_t(k)$ on the public data at the server, i.e., $F_{pub}(w_t(k)) = \frac{1}{n_{pub}} \sum_{j=1}^{n_{pub}} \ell(w_t(k); x_{pub,j})$. The exponent $\delta \ge 0$ in $\{F_{pub}(\cdot)\}^{\delta}$ controls the impact of the public-data loss; setting $\delta = 0$ in (5) reduces our loss-weighted averaging to conventional FedAvg. Under data poisoning or backdoor attacks, the models of malicious devices have relatively larger losses than the others; by the definition of $\beta_t(k)$, such devices receive small weights and thus have less impact on the next global update. Replacing federated averaging with the above loss-weighted averaging makes the system robust against local data poisoning and backdoor attacks.
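The two server-side statistics can be sketched as follows, assuming the models' softmax outputs on the public data and their public-data losses have already been computed; the function names are our own:

```python
import numpy as np

def prediction_entropy(probs):
    """E_avg(k): average Shannon entropy of a model's class probabilities
    on the public data. probs: (n_pub, Q) array of softmax outputs."""
    return float(np.mean(-np.sum(probs * np.log(probs + 1e-12), axis=1)))

def loss_weighted_average(models, losses, sizes, delta=1.0):
    """Loss-weighted averaging: beta_k proportional to m_k / F_pub^delta,
    normalized to sum to one. Returns the aggregate and the weights."""
    beta = np.asarray(sizes, dtype=float) / np.asarray(losses, dtype=float) ** delta
    beta = beta / beta.sum()
    aggregate = sum(b * w for b, w in zip(beta, models))
    return aggregate, beta
```

A model predicting uniformly over $Q$ classes attains the maximum entropy $\log Q$ (the filtering signal for reverse-sign poisoning), while `delta=0` recovers plain FedAvg weights $m_k/m$.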
Algorithm 1: Sself — algorithm at the server
1: for $t = 0, 1, \ldots$ do
2:   Choose $S_t$ and send the current model and the global round $(w_t, t)$ to the devices
3:   Wait for $T_d$ and then:
4:   for $i = 0, 1, \ldots, t$ do
5:     for $k \in U^{(i)}_t$ do
6:       $U^{(i)}_t \leftarrow U^{(i)}_t \setminus \{k\}$, if $E_{avg}(k) > E_{th}$   // Entropy-based filtering
7:     end for
8:     $v^{(i)}_{t+1} = \sum_{k \in U^{(i)}_t} \beta_t(k) w_i(k)$   // Loss-weighted averaging within each staleness group
9:   end for
10:   $w_{t+1} = (1 - \gamma) w_t + \gamma \sum_{i=0}^{t} \alpha_t(i) v^{(i)}_{t+1}$   // Weighted average of results with different staleness
11: end for

Algorithm at the devices: If device $k$ receives $(w_t, t)$ from the server, it performs $E$ local updates to obtain $w_t(k)$. Each benign device $k$ then sends $(w_t(k), t)$ to the server, while an adversarial device transmits a poisoned model depending on the type of attack. The two methods above are easily combined to tackle model update poisoning, data poisoning, and backdoor attacks: the server first filters out the model-poisoned devices based on entropy, and then takes the loss-weighted average over the surviving devices to combat data poisoning and backdoor attacks.

2.3. SEMI-SYNCHRONOUS ENTROPY AND LOSS BASED FILTERING/AVERAGING (SSELF)

The details of the overall Sself operation are described in Algorithm 1 and Fig. 1. At global round $t$, the server chooses $S_t$ and sends $(w_t, t)$ to the devices. The server collects the results from the devices for a time period $T_d$, and calculates the entropy $E_{avg}(k)$ and loss $F_{pub}(w_t(k))$ as in (4) and (5), respectively. Based on the entropy, the server first filters out the results sent from model-poisoned devices. Then, the server aggregates the models with the same staleness to obtain $v^{(i)}_{t+1}$ for $i = 0, 1, \ldots, t$. In this aggregation, we apply loss-weighted averaging as in (5) instead of the conventional averaging of FedAvg, to defend the system against data poisoning and backdoor attacks. Using $v^{(0)}_{t+1}, v^{(1)}_{t+1}, \ldots, v^{(t)}_{t+1}$, we finally obtain $w_{t+1}$ as in (3). We note that the server can compute the entropy and loss whenever a model is received, i.e., in the order of arrival. After computing the entropy and loss of the last model of global round $t$, the server only needs to compute the weighted sum of the results. Hence, in practical setups where cloud servers have sufficient computing power, Sself does not cause a significant time delay at the server compared to FedAvg. The computational complexity of Sself depends on the number of received models at each global round and the running time for computing the entropy/loss of each model. Although direct comparison with other baselines is tricky, if we assume that the complexity of computing entropy or loss is linear in the number of model parameters, as in (Xie et al., 2019b), Sself has larger complexity than RFA by a factor of $n_{pub}$. This additional computational cost is the price of better robustness against adversaries. At the device side, each device starts its local model update whenever it receives $(w_t, t)$ from the server. After performing $E$ local updates, device $k$ transmits $(w_t(k), t)$ to the server.
These processes at the server and the devices are performed in parallel and asynchronously, until the last global round ends.

3. CONVERGENCE ANALYSIS

In this section, we provide insights on the convergence of Sself under the following standard assumptions in federated learning (Li et al., 2019b; Xie et al., 2019a).

Assumption 1. The global loss function $F$ defined in (1) is $\mu$-strongly convex and $L$-smooth.

Assumption 2. Let $\xi^i_t(k)$ be a set of data samples randomly selected at the $k$-th device during the $i$-th local update of global round $t$. Then $\mathbb{E}\|\nabla F_k(w_t(k); \xi^i_t(k)) - \nabla F(w_t(k))\|^2 \le \rho_1$ holds for all $t$, $k = 1, \ldots, N$, and $i = 1, \ldots, E$.

Assumption 3. The second moments of the stochastic gradients at each device are bounded, i.e., $\mathbb{E}\|\nabla F_k(w_t(k); \xi^i_t(k))\|^2 \le \rho_2$ for all $t$, $k = 1, \ldots, N$, and $i = 1, \ldots, E$.

We also need an assumption describing the bound on the error due to adversaries. Let $B^{(i)}_t$ and $M^{(i)}_t$ be the sets of benign and adversarial devices in $U^{(i)}_t$, respectively, satisfying $U^{(i)}_t = B^{(i)}_t \cup M^{(i)}_t$ and $B^{(i)}_t \cap M^{(i)}_t = \emptyset$. Let

$$\Omega^{(i)}_t = \sum_{k \in M^{(i)}_t} \beta_i(k) \qquad (6)$$

be the sum of loss weights of the adversarial devices in $U^{(i)}_t$. Now we have the following assumption.

Assumption 4. For an adversarial device $k \in M^{(i)}_t$, there exists an arbitrarily large $\Gamma$ such that $\mathbb{E}[F(w_t(k)) - F(w^*)] \le \Gamma < \infty$ holds for all $i = 1, \ldots, t$.

Based on the above assumptions, we state the following theorem, which provides the convergence bound of our scheme. The proof can be found in the Supplementary Material.

Theorem 1. Suppose Assumptions 1, 2, 3, 4 hold and the learning rate $\eta$ is set to be less than $\frac{1}{L}$. If $U^{(t)}_t \ne \emptyset$ for all $t \in \{0, 1, \ldots, T\}$, then Sself satisfies

$$\mathbb{E}[F(w_T) - F(w^*)] \le \nu^T [F(w_0) - F(w^*)] + (1 - \nu^T) C,$$

where $\nu = 1 - \gamma + \gamma (1 - \eta\mu)^E$, $C = \frac{\rho_1 + \rho_2 + 2\mu \Omega_{max} \Gamma}{2\eta\mu^2}$, and $\Omega_{max} = \max_{0 \le i \le t,\, 0 \le t \le T} \Omega^{(i)}_t$.

We have the following important observations from Theorem 1. First, we can observe a trade-off between the convergence rate $\nu^T$ and the error term $(1 - \nu^T) C$: if we increase $\gamma$, the convergence rate improves but the error term increases, as in (Xie et al., 2019a).
By adjusting $\gamma$, we can make convergence faster at the beginning of training while reducing the error at the end of training. Another important observation concerns the impact of the adversaries. For a fixed $\nu$, a large $\Omega_{max}$ yields, via the definition of $C$, a large error term $(1 - \nu^T)C$. However, if the entropy-based filtering method successfully filters out the model-poisoned devices, and the loss weights $\beta_i(k)$ of the adversaries are significantly small under data poisoning and backdoor attacks, then $\Omega^{(i)}_t$ in (6) is close to zero. This means that $\Omega_{max}$ is significantly small, i.e., the error term $(1 - \nu^T)C$ is small. In the next section, we show via experiments that Sself successfully combats both stragglers and adversaries simultaneously and achieves fast convergence with a small error term.
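To make the trade-off concrete, the sketch below evaluates the Theorem 1 bound $\nu^T[F(w_0) - F(w^*)] + (1-\nu^T)C$ for different $\gamma$, using illustrative made-up constants (the defaults are our own, not from the paper). Since $(1-\eta\mu)^E < 1$, a larger $\gamma$ shrinks $\nu$, speeding up the contraction term while inflating the residual error term:

```python
def sself_bound(gamma, eta=0.1, mu=1.0, E=5, T=5, F0=10.0, C=2.0):
    """Evaluate nu, the contraction term nu^T * F0, and the error term
    (1 - nu^T) * C from a Theorem-1-style bound, with
    nu = 1 - gamma + gamma * (1 - eta * mu)^E. All constants illustrative."""
    nu = 1 - gamma + gamma * (1 - eta * mu) ** E
    return nu, nu ** T * F0, (1 - nu ** T) * C
```

Comparing a small and a large $\gamma$ exhibits exactly the trade-off discussed above: the large-$\gamma$ run contracts the initial suboptimality faster but pays a larger error floor.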

4. EXPERIMENTS

In this section, we validate Sself on MNIST (LeCun et al., 1998), FMNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky et al., 2009). The dataset is split into 60,000 training and 10,000 test samples for MNIST and FMNIST, and into 50,000 training and 10,000 test samples for CIFAR-10. A simple convolutional neural network (CNN) with 2 convolutional layers and 2 fully connected layers is utilized for MNIST, while a CNN with 2 convolutional layers and 1 fully connected layer is used for FMNIST. For CIFAR-10, we utilize VGG-11.

Experiments with stragglers. To confirm the advantage of Sself, we first consider the scenario with only stragglers; adversarial attacks are not considered here. We compare Sself with the following methods. The first is the wait for stragglers approach, where FedAvg is applied after waiting for all the devices at each global round. The second is the ignore stragglers approach, where FedAvg is applied after waiting for a fixed timeout threshold and the results sent from slower devices are ignored. Finally, we consider the asynchronous scheme (FedAsync) (Xie et al., 2019a), where the global model is updated every time the result of a device arrives. For Sself and FedAsync, $\gamma$ is decayed, while the learning rate is decayed in the other schemes. In Fig. 2, we plot test accuracy versus running time on different datasets and $C$ values. For a fair comparison, global aggregation at the server is performed periodically with period $T_d = 1$ for Sself and the deadline-based comparison schemes (ignore stragglers, FedAsync). To model stragglers, each device has a delay of 0, 1, or 2 rounds, determined independently and uniformly at random; in other words, at each global round $t$, we have $S_t = U^{(t)}_t \cup U^{(t)}_{t+1} \cup U^{(t)}_{t+2}$. Our first observation from Fig. 2 is that the ignore stragglers scheme can lose significant data at each round and often converges to a suboptimal point with lower accuracy.
The wait for stragglers scheme requires the longest running time until convergence due to the delays caused by slow devices. Finally, it is observed that Sself performs the best, even better than the state-of-the-art FedAsync.

Experiments with adversaries. Next, we confirm the performance of Sself in Fig. 3 under the scenario with only adversaries in a synchronous setup. We compare our method with geometric-median-based RFA (Pillutla et al., 2019) and FedAvg under model update poisoning, data poisoning, and backdoor attacks. Comparison with Multi-Krum is provided in the Supplementary Material. For the data poisoning attack, we conduct label-flipping (Biggio et al., 2012), where each label $i$ is flipped to label $i+1$. For model update poisoning, each adversarial device takes the opposite sign of all weights and scales them up 10 times before transmitting the model to the server. For both attacks, we set $C = 0.2$, and the portion of adversarial devices is assumed to be $r = 0.2$ at each global round. For the backdoor, we use the model replacement method (Bagdasaryan et al., 2018), in which adversarial devices transmit scaled versions of the corrupted model to replace the global model with a bad one. We conduct the pixel-pattern backdoor attack (Gu et al., 2017), in which specific pixels are embedded in a fraction of images so that these images are classified as a targeted label. We embed 12 white pixels in the top-left corner of each poisoned image and set the labels of these images to 2. We utilize the Dirichlet distribution with parameter 0.5 for distributing training samples to $N = 100$ devices. We let $C = 0.1$, $r = 0.1$, and set the local batch size to 64. The number of poisoned images in a batch is set to 20, and we do not decay the learning rate here. In this backdoor scenario, we additionally compare Sself with the norm-thresholding strategy (Sun et al., 2019), in which the server ignores models whose norm exceeds a pre-defined threshold.
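For reference, the two untargeted attacks used in these experiments admit very short sketches; the function names are ours, label-flipping maps label $i$ to $i+1$ (modulo the number of classes, our assumption for the last class), and the reverse-sign attack negates all weights and scales them up 10 times:

```python
import numpy as np

def flip_labels(labels, num_classes=10):
    """Label-flipping data poisoning: label i -> i + 1 (wrapping the last class)."""
    return (np.asarray(labels) + 1) % num_classes

def reverse_sign_attack(weights, scale=10.0):
    """Model update poisoning: negate every weight tensor and scale up 10x."""
    return [-scale * w for w in weights]
```

Both attacks are what the entropy/loss statistics of Sself are designed to expose: a sign-reversed model produces near-uniform (high-entropy) predictions, and a label-flipped model incurs a high loss on the server's public data.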
We measure the attack success rate of the backdoor task by embedding the pixel pattern into all test samples (except data with label 2) and then comparing the predicted label with the targeted label 2. The backdoor attack is applied in every round after the 10-th global round for MNIST and FMNIST, and after the 1000-th global round for CIFAR-10. Fig. 3 shows the performance of each scheme over global rounds under the three attack scenarios. For both data and model poisoning attacks, Sself shows better performance than the other schemes. FedAvg does not work well on any dataset, and the performance of RFA worsens as the dataset and neural network model become more complex. In the backdoor attack scenario, Sself and the norm-thresholding method achieve low attack success rates on all datasets, while the other schemes cannot defend against the backdoor attack; their attack success rates grow as global rounds proceed.

Experiments with both stragglers and adversaries. Finally, in Fig. 4, we consider the setup with both stragglers and adversaries. We compare Sself with various straggler/adversary defense combinations; comparison with Multi-Krum is provided in the Supplementary Material. We set $C = 0.2$, $r = 0.2$ for model/data poisoning, while results on the backdoor attack are also shown in the Supplementary Material. The stragglers and adversaries are modeled as in Figs. 2 and 3, respectively. We have the following observations from Fig. 4. First, FedAsync (Xie et al., 2019a) does not perform well when combined with entropy-based filtering and loss-weighted averaging, since the model update is conducted one-by-one in the order of arrivals. Due to the same issue, FedAsync cannot be combined with RFA. Our second observation is that the semi-synchronous or ignore stragglers methods combined with RFA exhibit poor performance. The reason is that the attack ratio can often be very high (larger than $r$) for these deadline-based schemes, which degrades the performance of RFA.
Compared to RFA, our entropy and loss based filtering/averaging can be applied even with a high attack ratio. It can also be seen that the wait for stragglers scheme combined with RFA suffers from the straggler issue. Overall, the proposed Sself algorithm performs best, confirming the significant advantages of our scheme under the existence of both stragglers and adversaries.

5. CONCLUSION

We proposed Sself, a robust federated learning scheme against both stragglers and adversaries. The semi-synchronous component allows the server to fully utilize the results sent from stragglers by taking advantage of both synchronous and asynchronous elements. In each aggregation step of the semi-synchronous approach, entropy-based filtering screens out the model-poisoned devices, and loss-weighted averaging reduces the impact of data poisoning and backdoor attacks. Extensive experimental results show that Sself enables fast and robust federated learning in practical scenarios with a large number of slow devices and adversaries.

A HYPERPARAMETER SETTING

A.1 SCENARIO WITH ONLY STRAGGLERS

The hyperparameter settings for Sself are shown in Table 1. For the ignore stragglers and wait for stragglers schemes combined with FedAvg, we decayed the learning rate during training. For the FedAsync scheme of (Xie et al., 2019a), we take a polynomial strategy with hyperparameters $a = 0.5$, $\alpha = 0.8$, and decayed $\gamma$ during training.

A.2 SCENARIO WITH ONLY ADVERSARIES

Data poisoning and model update poisoning attacks: The hyperparameter details for Sself are shown in Table 2. For RFA (Pillutla et al., 2019), the maximum number of iterations is set to 10. In this setup, the learning rate is decayed for all three schemes (Sself, RFA, FedAvg). Backdoor attack: In this backdoor attack scenario, recall that we utilized the Dirichlet distribution with parameter 0.5 for distributing training samples to $N = 100$ devices. The local batch size is set to 64 and the number of poisoned images is 20. In this experiment, we additionally compared our scheme with the norm-thresholding strategy (Sun et al., 2019), where the threshold value is set to 2. The hyperparameter details for Sself are shown in Table 3.

A.3 SCENARIO WITH BOTH STRAGGLERS AND ADVERSARIES

Data poisoning and model update poisoning attacks: The hyperparameters for Sself are exactly the same as in Table 2. Backdoor attack: The hyperparameter details are shown in Table 4. For the comparison schemes, we considered: 1) semi-synchronous + RFA, 2) FedAsync + ELF (entropy and loss based filtering/averaging), 3) ignore stragglers + RFA, 4) wait for stragglers + RFA. Each setting is the same as in the previous experiments.

B ADDITIONAL EXPERIMENTS UNDER BACKDOOR ATTACK

B.1 EXPERIMENTS WITH BOTH STRAGGLERS AND ADVERSARIES UNDER BACKDOOR ATTACK

Based on the hyperparameters described in Table 4, we show experimental results with both stragglers and adversaries under the backdoor attack. It can be observed from Fig. B.1 that Sself successfully defends against the backdoor attack, while the other schemes show high attack success ratios as the global round increases.

B.2 EXPERIMENTS UNDER NO-SCALED BACKDOOR ATTACK

In addition to the model replacement backdoor attack we have considered so far, we perform additional experiments under the no-scaled backdoor attack (Bagdasaryan et al., 2018), where the adversarial devices do not scale their weights and simply transmit the corrupted model to the server. Fig. B.2 shows the performance under the no-scaled backdoor attack with only adversaries (no stragglers). It can be seen that Sself consistently achieves low attack success rates compared to the others. Since the adversaries do not scale the weights, the norm-thresholding approach cannot defend against this attack.
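To make the failure mode concrete, the following minimal sketch (our own illustration, not the authors' code) implements the norm-thresholding defense with threshold 2 as used above: an update is shrunk only when its L2 norm exceeds the threshold, so a corrupted but unscaled update of ordinary norm passes through unchanged.

```python
import math

def l2_norm(update):
    """L2 norm of a flattened model update."""
    return math.sqrt(sum(x * x for x in update))

def norm_threshold(update, threshold=2.0):
    """Shrink an update so its L2 norm is at most `threshold`.

    A model-replacement attacker scales its update up to dominate the
    average, so its large norm gets clipped; a no-scaled attacker submits
    a corrupted update of ordinary norm, which is returned unchanged.
    """
    n = l2_norm(update)
    if n <= threshold:
        return list(update)                      # no-scaled attack slips through
    return [x * threshold / n for x in update]   # scaled attack is shrunk
```

For instance, a scaled update (3, 4) with norm 5 is shrunk to norm 2, while a corrupted but unscaled update (0.3, 0.4) is returned as-is, which is exactly why this defense is ineffective in the no-scaled scenario.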

C EXPERIMENTAL RESULTS FOR VARYING HYPERPARAMETERS

To observe the impact of the hyperparameter settings, we performed additional experiments with various δ and E_th values, the key hyperparameters of Sself. The results are shown in Fig. C.1 with only adversaries. We performed the data poisoning attack for varying δ and the model update poisoning attack for varying E_th. It can be seen that our scheme still performs well (better than RFA) even with naively chosen hyperparameters, confirming the advantage of Sself in terms of reducing the overhead associated with hyperparameter tuning.
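To make the roles of δ and E_th concrete, here is a minimal sketch of one filtering/averaging step (our own illustration, not the authors' implementation; in particular we read the loss weights as inversely proportional to the public-data loss raised to the power δ). Entropy is computed from each uploaded model's average softmax output on the public data; devices whose entropy exceeds E_th are filtered out, and the survivors are averaged with weights proportional to m_k / F_pub^δ.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sself_step(updates, pub_probs, pub_losses, sizes, E_th=1.0, delta=1.0):
    """One entropy-based filtering + loss-weighted averaging step (sketch).

    updates:    per-device flattened parameter vectors
    pub_probs:  per-device average softmax outputs on the public data
    pub_losses: per-device losses F_pub on the public data
    sizes:      per-device local dataset sizes m_k
    """
    # Model-poisoned updates tend to produce near-uniform (high-entropy)
    # predictions on the public data, so they are filtered out first.
    kept = [i for i in range(len(updates)) if entropy(pub_probs[i]) <= E_th]
    # Loss-weighted averaging: weight ∝ m_k / F_pub^delta (assumed form).
    raw = [sizes[i] / pub_losses[i] ** delta for i in kept]
    total = sum(raw)
    dim = len(updates[0])
    return [sum(raw[j] / total * updates[i][d] for j, i in enumerate(kept))
            for d in range(dim)]
```

With E_th = 1, an update whose public-data predictions are uniform over 3 classes (entropy ln 3 ≈ 1.10) is discarded, while confident updates survive and are combined with larger weight given to the lower-loss devices.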

D PERFORMANCE COMPARISON WITH MULTI-KRUM

While we compared Sself with RFA in our main manuscript, here we compare our scheme with Multi-Krum (Blanchard et al., 2017), a Byzantine-resilient aggregation method targeting the conventional distributed learning setup with IID data across nodes. In Multi-Krum, among the N workers in the system, the server tolerates f Byzantine workers under the assumption 2f + 2 < N. After filtering out f worker nodes based on squared distances, the server chooses the M workers with the best scores among the N − f remaining workers and aggregates their results; we set M = N − f for comparing our scheme with Multi-Krum. As shown in Figs. C.2(a) and C.2(b), if the number of adversaries exceeds f, the performance of Multi-Krum significantly decreases. In contrast, the proposed Sself can filter out the poisoned devices and then take the weighted sum of the surviving results even when the portion of adversaries is high. Figs. C.2(c) and C.2(d) show the results in the presence of both stragglers and adversaries, under the model update poisoning attack. The parameter f of Multi-Krum is set to the maximum value satisfying 2f + 2 < N, where N depends on the number of received results for both the semi-synchronous and ignore stragglers approaches. However, even when we set f to this maximum value, the number of adversaries can still exceed f, which degrades the performance of Multi-Krum combined with the semi-synchronous and ignore stragglers approaches. Obviously, Multi-Krum can be combined with the wait for stragglers approach by setting f large enough. However, this scheme still suffers from the effect of stragglers, which significantly slows down the overall training process.
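For reference, a compact sketch of the Multi-Krum rule described above (our own illustration): each update receives a score equal to the sum of squared distances to its N − f − 2 closest other updates, and the M best-scoring updates are averaged.

```python
def multi_krum(updates, f, M):
    """Average the M updates with the lowest Krum scores.

    Score of update i = sum of squared L2 distances to its N - f - 2
    nearest other updates; requires 2f + 2 < N (Blanchard et al., 2017).
    """
    N = len(updates)
    assert 2 * f + 2 < N, "Multi-Krum requires 2f + 2 < N"

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for i in range(N):
        # Distances from update i to every other update, smallest first.
        dists = sorted(sq_dist(updates[i], updates[j]) for j in range(N) if j != i)
        scores.append(sum(dists[: N - f - 2]))

    chosen = sorted(range(N), key=lambda i: scores[i])[:M]
    dim = len(updates[0])
    return [sum(updates[i][d] for i in chosen) / M for d in range(dim)]
```

With f = 1 and five updates, a single far-away poisoned update gets the worst score and is excluded; once the number of adversaries exceeds f, poisoned updates can survive the selection, matching the degradation discussed above.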

E IMPACT OF PUBLIC DATA

In the experiments in our main manuscript, we utilized 2% of the training data samples as public data to defend against adversarial attacks. In this section, to observe the impact of the portion of public data, we performed additional experiments varying this portion under the three attack scenarios in a synchronous setup. In the main manuscript, we let 2% of the entire training set be the public data and the remaining data be the training data at the devices, for a fair comparison with other schemes. Here, the overall training set is utilized at the devices, and a certain portion of the data is additionally collected at the server. Fig. D.1 shows the results with various portions of public data on FMNIST. From the results, it can be seen that Sself protects the system against adversarial attacks with only a small amount of public data. However, as shown in the plot where the portion of public data is 0.03%, if the amount of public data becomes smaller than a certain threshold, the robustness of Sself does suffer.

G EXPERIMENTS IN A MORE SEVERE STRAGGLER SCENARIO

When modeling stragglers in the experiments of the main manuscript, we gave a delay of 0, 1, or 2 to each device. In this section, each device can have a delay of 0 to 4, again determined independently and uniformly at random. In Fig. G.1, we show the results with both stragglers and adversaries under data and model-update poisoning on the FMNIST dataset. We set C to 0.4 and r to 0.2. It can be seen that Sself still shows the best performance under both data poisoning and model-update poisoning compared to the other baseline schemes.

H EXPERIMENTS WITH VARYING PORTION OF ADVERSARIES

In this section, we show the performance of Sself with a varying portion of adversaries under data and model update poisoning attacks. We do not consider stragglers here. We set δ to 1 and E_th to 1 as in the experiments of the main manuscript. Fig. H.1 shows the results with different attack ratios on the FMNIST dataset. For data poisoning, Sself shows robustness up to an attack ratio of 0.4, but with a ratio of 0.5 or higher, the performance is degraded. For model update poisoning, Sself performs well even with a higher attack ratio.

I PROOF OF THEOREM 1

Each k-th benign device performs E local updates of stochastic gradient descent (SGD) with learning rate η:

$$w_t^j(k) \leftarrow w_t^{j-1}(k) - \eta \nabla F_k\left(w_t^{j-1}(k),\, \xi_t^{j-1}(k)\right), \quad j = 1, \ldots, E,$$

where $\xi_t^j(k)$ is a set of data samples randomly selected from the k-th device during the j-th local update at global round t. After E local updates, the k-th benign device transmits $w_t^E(k)$ to the server, whereas in each round the adversarial devices transmit poisoned model parameters. Using these notations, the parameters defined in Section 2 can be rewritten as follows:

$$v_{t+1}^{(i)} = \sum_{k \in U_t^{(i)}} \beta_i(k)\, w_i^E(k), \quad \text{where } \beta_i(k) \propto m_k \left\{F_{\mathrm{pub}}\left(w_i^E(k)\right)\right\}^{-\delta} \text{ and } \sum_{k \in U_t^{(i)}} \beta_i(k) = 1, \qquad (9)$$

$$z_{t+1} = \sum_{i=0}^{t} \alpha_t(i)\, v_{t+1}^{(i)}, \quad \text{where } \alpha_t(i) \propto \sum_{k \in U_t^{(i)}} m_k\, (t-i+1)^{-c} \text{ and } \sum_{i=0}^{t} \alpha_t(i) = 1, \qquad (10)$$

$$w_{t+1} = (1-\gamma)\, w_t + \gamma\, z_{t+1}. \qquad (11)$$

I.2 KEY LEMMA

We introduce the following key lemma for proving Theorem 1. Our proof is largely based on the convergence proof of FedAsync in (Xie et al., 2019a).

Lemma 1. Suppose Assumptions 1 and 2 hold and the learning rate η is set to be less than 1/L. Consider the k-th benign device that received the current global model $w_t$ from the server at global round t. After E local updates, the following holds:

$$\mathbb{E}\left[F(w_t^E(k)) - F(w^*) \mid w_t^0(k)\right] \le (1-\eta\mu)^E \left[F(w_t^0(k)) - F(w^*)\right] + \frac{E\eta\rho_1}{2}. \qquad (12)$$

Proof of Lemma 1. First, consider one step of SGD at the k-th local device. For a given $w_t^{j-1}(k)$, for all global rounds t and all local updates $j \in \{1, \ldots, E\}$, the one-step bound (13) holds. Applying (13) repeatedly to the E local updates, we have

$$\begin{aligned}
\mathbb{E}\left[F(w_t^E(k)) - F(w^*) \mid w_t^0(k)\right]
&\le (1-\eta\mu)^E \left[F(w_t^0(k)) - F(w^*)\right] + \frac{\eta\rho_1}{2} \sum_{j=1}^{E} (1-\eta\mu)^{j-1} \\
&= (1-\eta\mu)^E \left[F(w_t^0(k)) - F(w^*)\right] + \frac{\eta\rho_1}{2} \cdot \frac{1-(1-\eta\mu)^E}{\eta\mu}.
\end{aligned}$$

From $\eta < 1/L \le 1/\mu$ we have $\eta\mu < 1$, so that $1-(1-\eta\mu)^E \le E\eta\mu$ and hence

$$\mathbb{E}\left[F(w_t^E(k)) - F(w^*) \mid w_t^0(k)\right] \le (1-\eta\mu)^E \left[F(w_t^0(k)) - F(w^*)\right] + \frac{E\eta\rho_1}{2},$$

which completes the proof of Lemma 1.

Splitting the devices in $U_{t-1}^{(i)}$ into the benign set $B_{t-1}^{(i)}$ and the malicious set $M_{t-1}^{(i)}$, the per-round chain of inequalities (14) continues as

$$\begin{aligned}
\mathbb{E}[F(w_t) - F(w^*) \mid w_{t-1}]
&\le (1-\gamma)[F(w_{t-1}) - F(w^*)] \\
&\quad + \gamma \sum_{i=0}^{t-1} \alpha_{t-1}(i) \left( \sum_{k \in B_{t-1}^{(i)}} \beta_i(k)\, \mathbb{E}\left[F(w_i^E(k)) - F(w^*) \mid w_{t-1}\right] + \sum_{k \in M_{t-1}^{(i)}} \beta_i(k)\, \mathbb{E}\left[F(w_i^E(k)) - F(w^*) \mid w_{t-1}\right] \right) \\
&\overset{(d)}{\le} \left(1-\gamma+\gamma\,\alpha_{t-1}(t-1)\left(1-\Omega_{t-1}^{(t-1)}\right)(1-\eta\mu)^E\right)[F(w_{t-1}) - F(w^*)] + \frac{E\eta\rho_1\gamma}{2} \\
&\quad + \gamma (1-\eta\mu)^E \sum_{i=0}^{t-2} \alpha_{t-1}(i) \sum_{k \in B_{t-1}^{(i)}} \beta_i(k) \left[F(w_i^0(k)) - F(w^*)\right] + \gamma \sum_{i=0}^{t-1} \alpha_{t-1}(i) \sum_{k \in M_{t-1}^{(i)}} \beta_i(k)\, \mathbb{E}\left[F(w_i^E(k)) - F(w^*) \mid w_{t-1}\right] \\
&\overset{(e)}{\le} \left(1-\gamma+\gamma\,\alpha_{t-1}(t-1)\left(1-\Omega_{t-1}^{(t-1)}\right)(1-\eta\mu)^E\right)[F(w_{t-1}) - F(w^*)] + \gamma\,\Omega_{\max}\Gamma + \frac{E\eta\rho_1\gamma}{2} \\
&\quad + \gamma (1-\eta\mu)^E \sum_{i=0}^{t-2} \alpha_{t-1}(i) \sum_{k \in B_{t-1}^{(i)}} \beta_i(k) \left[F(w_i^0(k)) - F(w^*)\right] \\
&\overset{(f)}{\le} \left(1-\gamma+\gamma\,\alpha_{t-1}(t-1)\left(1-\Omega_{t-1}^{(t-1)}\right)(1-\eta\mu)^E\right)[F(w_{t-1}) - F(w^*)] + \gamma\,\Omega_{\max}\Gamma + \frac{E\eta\rho_1\gamma}{2} \\
&\quad + \gamma (1-\eta\mu)^E \sum_{i=0}^{t-2} \alpha_{t-1}(i) \sum_{k \in B_{t-1}^{(i)}} \beta_i(k)\, \frac{1}{2\mu} \left\|\nabla F(w_i^0(k))\right\|^2 \\
&\overset{(g)}{\le} \left(1-\gamma+\gamma\,\alpha_{t-1}(t-1)\left(1-\Omega_{t-1}^{(t-1)}\right)(1-\eta\mu)^E\right)[F(w_{t-1}) - F(w^*)] + \gamma\,\Omega_{\max}\Gamma + \frac{E\eta\rho_1\gamma}{2} + \gamma\left(1-\alpha_{t-1}(t-1)\right)(1-\eta\mu)^E \frac{\rho_2}{2\mu} \\
&\overset{(h)}{\le} \left(1-\gamma+\gamma\,\alpha_{t-1}(t-1)\left(1-\Omega_{t-1}^{(t-1)}\right)(1-\eta\mu)^E\right)[F(w_{t-1}) - F(w^*)] + \frac{\gamma\left(E\rho_1 + \left(1-\alpha_{t-1}(t-1)\right)\rho_2 + 2\mu\,\Omega_{\max}\Gamma\right)}{2\mu}.
\end{aligned}$$

Applying this bound recursively over the T global aggregations, we obtain

$$\begin{aligned}
\mathbb{E}[F(w_T) - F(w^*) \mid w_0]
&\overset{(c)}{\le} \prod_{\tau=0}^{T-1}\left(1-\gamma+\gamma\,\alpha_\tau(\tau)\left(1-\Omega_\tau^{(\tau)}\right)(1-\eta\mu)^E\right)[F(w_0) - F(w^*)] + \frac{\gamma\left(E\rho_1 + \left(1-\alpha_{T-1}(T-1)\right)\rho_2 + 2\mu\,\Omega_{\max}\Gamma\right)}{2\mu} \\
&\quad + \sum_{\tau=1}^{T-1} \frac{\gamma\left(E\rho_1 + \left(1-\alpha_{T-1-\tau}(T-1-\tau)\right)\rho_2 + 2\mu\,\Omega_{\max}\Gamma\right)}{2\mu} \prod_{k=1}^{\tau}\left(1-\gamma+\gamma\,\alpha_{T-k}(T-k)\left(1-\Omega_{T-k}^{(T-k)}\right)(1-\eta\mu)^E\right) \\
&\overset{(d)}{\le} \left(1-\gamma+\gamma(1-\eta\mu)^E\right)^T [F(w_0) - F(w^*)] + \left(1-\left\{1-\gamma+\gamma(1-\eta\mu)^E\right\}^T\right) \frac{E\rho_1 + \rho_2 + 2\mu\,\Omega_{\max}\Gamma}{2\mu\left(1-(1-\eta\mu)^E\right)} \\
&\overset{(e)}{\le} \left(1-\gamma+\gamma(1-\eta\mu)^E\right)^T [F(w_0) - F(w^*)] + \left(1-\left\{1-\gamma+\gamma(1-\eta\mu)^E\right\}^T\right) \frac{\rho_1 + \rho_2 + 2\mu\,\Omega_{\max}\Gamma}{2\eta\mu^2} \\
&= \nu^T [F(w_0) - F(w^*)] + (1-\nu^T)\, C,
\end{aligned}$$

which completes the proof. Here, (a) comes from the law of total expectation, (b) and (c) are due to inequality (14), (d) comes from $0 \le \alpha_t(i) \le 1$ and $0 \le \Omega_t^{(i)} < 1$ for all i, t, and (e) follows from $\eta\mu \le 1$ and the fact that E is a positive integer.
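A small numeric sketch of the aggregation rules (9)–(11) (our own illustration, not the authors' implementation; we read the exponents as −δ and −c so that high-loss and stale results are down-weighted, consistent with the filtering/averaging the paper describes):

```python
def loss_weights(sizes, pub_losses, delta=1.0):
    """beta_i(k) ∝ m_k * F_pub(w)^(-delta), normalized as in eq. (9)."""
    raw = [m * loss ** -delta for m, loss in zip(sizes, pub_losses)]
    s = sum(raw)
    return [r / s for r in raw]

def staleness_weights(group_sizes, t, c=0.5):
    """alpha_t(i) ∝ (sum of m_k in staleness group i) * (t - i + 1)^(-c), eq. (10)."""
    raw = [sum(ms) * (t - i + 1) ** -c for i, ms in enumerate(group_sizes)]
    s = sum(raw)
    return [r / s for r in raw]

def global_update(w_t, z_next, gamma=0.5):
    """w_{t+1} = (1 - gamma) * w_t + gamma * z_{t+1}, eq. (11)."""
    return [(1 - gamma) * w + gamma * z for w, z in zip(w_t, z_next)]
```

Both weight vectors are normalized to sum to one, matching the constraints in (9) and (10), and fresher groups (smaller t − i) receive larger α.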



https://www.kaggle.com/pranavraikokte/covid19-image-dataset



Algorithm 1: Semi-Synchronous Entropy and Loss based Filtering/Averaging (Sself)
Input: Initialized model w_0. Output: Final global model w_T.
Algorithm at the Server
1: for each global round t = 0, 1, …, T − 1 do
2:

Figure 2: Test accuracy versus training time with only stragglers. Sself is our scheme.

Figure 3: Performance of different schemes with only adversaries.

Figure 4: Performance of different schemes with both stragglers and adversaries.

Figure B.1: Performance of different schemes with both stragglers and adversaries under backdoor attack. We set C = 0.1, r = 0.1.

Figure B.2: Performance comparison with no-scaled backdoor attack. We set C = 0.1, r = 0.1.

Figure C.1: Impact of varying hyperparameters under model update poisoning and data poisoning attacks. We set C = 0.2, r = 0.2.

Fig. C.2 compares Sself with Multi-Krum under model update poisoning. Figs. C.2(a) and C.2(b) show the results with only adversaries, and Figs. C.2(c) and C.2(d) show the results with both stragglers and adversaries; the detailed discussion is given in Appendix D.

Figure C.2: Performance comparison with Multi-Krum under model update poisoning. We set C = 0.2 and r = 0.2.

Figure D.1: Impact of portion of public data at the server using FMNIST. We set C = 0.1, r = 0.1 for the backdoor attack and C = 0.2, r = 0.2 for the others.

Fig. C.3 compares Sself with Multi-Krum under scaled backdoor attack. The results are consistent with the results in Fig. C.2, confirming the advantage of Sself over Multi-Krum combined with straggler-mitigating schemes.

Figure F.1: Performance of different schemes on medical dataset (Covid-19 image dataset) under data and model update poisoning attacks. We set C = 1, r = 0.1.

By µ-strong convexity, for a given $w_t^{j-1}(k)$ we have the one-step SGD bound

$$\mathbb{E}\left[F(w_t^j(k)) - F(w^*) \mid w_t^{j-1}(k)\right] \le (1-\eta\mu)\left[F(w_t^{j-1}(k)) - F(w^*)\right] + \frac{\eta\rho_1}{2}. \qquad (13)$$

Applying the above result to the E local updates at the k-th local device via the law of total expectation, we have

$$\mathbb{E}\left[F(w_t^E(k)) - F(w^*) \mid w_t^0(k)\right] = \mathbb{E}\left[\,\mathbb{E}\left[F(w_t^E(k)) - F(w^*) \mid w_t^{E-1}(k)\right] \,\middle|\, w_t^0(k)\right] \le (1-\eta\mu)^E\left[F(w_t^0(k)) - F(w^*)\right] + \frac{\eta\rho_1}{2}\sum_{j=1}^{E}(1-\eta\mu)^{j-1}.$$

From $\eta\mu < 1$, we have $1-(1-\eta\mu)^E \le E\eta\mu$.

I.3 PROOF OF THEOREM 1

Now utilizing Lemma 1, we provide the proof of Theorem 1. First, consider one round of global aggregation at the server. For a given $w_{t-1}$, the server updates the global model according to equation (11). Then for all $t \in \{1, \ldots, T\}$, we have

$$\begin{aligned}
\mathbb{E}[F(w_t) - F(w^*) \mid w_{t-1}]
&\overset{(a)}{\le} (1-\gamma)[F(w_{t-1}) - F(w^*)] + \gamma\, \mathbb{E}[F(z_t) - F(w^*) \mid w_{t-1}] \\
&\overset{(b)}{\le} (1-\gamma)[F(w_{t-1}) - F(w^*)] + \gamma \sum_{i=0}^{t-1} \alpha_{t-1}(i)\, \mathbb{E}\left[F(v_t^{(i)}) - F(w^*) \mid w_{t-1}\right] \\
&\overset{(c)}{\le} (1-\gamma)[F(w_{t-1}) - F(w^*)] + \gamma \sum_{i=0}^{t-1} \alpha_{t-1}(i) \sum_{k \in U_{t-1}^{(i)}} \beta_i(k)\, \mathbb{E}\left[F(w_i^E(k)) - F(w^*) \mid w_{t-1}\right],
\end{aligned}$$

and the resulting chain of inequalities is referred to as (14). Here, (a), (b), (c) come from convexity, (d) follows from Lemma 1, and (e) comes from $\Omega_{\max} = \max_{0 \le i \le t,\, 0 \le t \le T} \Omega_t^{(i)}$ and Assumption 4. (f) is due to µ-strong convexity, (g) is from Assumption 3, and (h) comes from $\eta\mu < 1$. Note that $\sum_{i=0}^{t-1} \alpha_{t-1}(i) = 1$ for all t.

Applying the above result to the T global aggregations at the server, we have

$$\begin{aligned}
\mathbb{E}[F(w_T) - F(w^*) \mid w_0]
&\overset{(a)}{=} \mathbb{E}\left[\,\mathbb{E}[F(w_T) - F(w^*) \mid w_{T-1}] \,\middle|\, w_0\right] \\
&\overset{(b)}{\le} \mathbb{E}\left[\left(1-\gamma+\gamma\,\alpha_{T-1}(T-1)\left(1-\Omega_{T-1}^{(T-1)}\right)(1-\eta\mu)^E\right)[F(w_{T-1}) - F(w^*)] \,\middle|\, w_0\right] + \frac{\gamma\left(E\rho_1 + \left(1-\alpha_{T-1}(T-1)\right)\rho_2 + 2\mu\,\Omega_{\max}\Gamma\right)}{2\mu}.
\end{aligned}$$
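The recursion above contracts because the per-round factor ν = 1 − γ + γ(1 − ημ)^E is strictly less than 1 whenever 0 < γ ≤ 1, 0 < ημ < 1, and E ≥ 1, and the lemma relies on the elementary bound 1 − (1 − ημ)^E ≤ Eημ. A quick numeric sanity check of these two facts (our own illustration, with arbitrarily chosen constants):

```python
def contraction(gamma, eta, mu, E):
    """Per-round contraction factor nu = 1 - gamma + gamma * (1 - eta*mu)**E."""
    return 1 - gamma + gamma * (1 - eta * mu) ** E

def bernoulli_gap(eta, mu, E):
    """Slack of the bound 1 - (1 - eta*mu)**E <= E*eta*mu (nonnegative when eta*mu < 1)."""
    x = eta * mu
    return E * x - (1 - (1 - x) ** E)
```

For example, with γ = 0.5, η = 0.1, µ = 1, and E = 5, ν ≈ 0.795, so the optimality gap shrinks geometrically in T up to the constant C; increasing E pushes ν further below 1.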

Table 1: Hyperparameters for Sself with only stragglers

Data poisoning and model update poisoning attacks: Table 2 describes the hyperparameters for Sself with only adversaries, under data poisoning and model update poisoning attacks.

Table 2: Hyperparameters for Sself with only adversaries, under data and model update poisoning

Table 3: Hyperparameters for Sself with only adversaries, under backdoor attack

Table 4: Hyperparameters for Sself with both stragglers and adversaries, under backdoor attack

F EXPERIMENTS ON A COVID-19 DATASET OPEN TO THE PUBLIC

In this section, we performed additional experiments on Kaggle's Covid-19 dataset¹, which is open to the public. We consider both model update poisoning and data poisoning attacks in a synchronous setup. Image classification is performed to detect Covid-19 using chest X-ray images. The dataset consists of 317 color images of 3480 × 4248 pixels in 3 classes (Normal, Covid, and Viral-Pneumonia). There are 251 training images and 66 test images. We resized the images to 224 × 224 pixels and used a convolutional neural network with 6 convolutional layers and 1 fully connected layer. We used 6% of the training data as the public data. We divided the remaining training samples among 10 devices and set C = 1 and r = 0.1. Fig. F.1 shows the results of the different schemes under data poisoning and model update poisoning attacks on the Covid-19 dataset. Like the other baseline schemes, Sself shows robustness against the model update poisoning attack. Under the data poisoning attack, Sself shows the best performance compared to the other schemes. In conclusion, utilizing part of an open medical dataset as public data, we show that Sself can effectively defend against model update and data poisoning attacks.

Figure H.1: Performance with varying portion of adversaries. Data and model-update poisoning attacks are considered with FMNIST. We set C = 0.2.

At global round t, each device receives the current global model $w_t$ and the round index t from the server, and sets its initial model to $w_t$, i.e., $w_t^0(k) \leftarrow w_t$ for all k = 1, …, N. Then each k-th benign device performs E local updates of stochastic gradient descent (SGD) with learning rate η.

