F 2 ED-LEARNING: GOOD FENCES MAKE

Abstract

In this paper, we present F 2 ED-LEARNING, the first federated learning protocol that simultaneously defends against both a semi-honest server and Byzantine malicious clients. Using a robust mean estimator called FilterL2, F 2 ED-LEARNING is the first FL protocol providing dimension-free estimation error against Byzantine malicious clients. In addition, F 2 ED-LEARNING leverages secure aggregation to protect the clients from a semi-honest server that attempts to infer the clients' information from their legitimate updates. The main challenge stems from the incompatibility between FilterL2 and secure aggregation: to run FilterL2, the server needs access to individual updates from clients, while secure aggregation hides those updates from it. We propose to split the clients into shards, securely aggregate each shard's updates, and run FilterL2 on the aggregated updates from different shards. The evaluation shows that F 2 ED-LEARNING consistently achieves optimal or close-to-optimal performance among five robust FL protocols under three attacks.

1. INTRODUCTION

Federated learning (FL) has attracted considerable attention in the past few years as a new distributed learning paradigm. In federated learning, users collaboratively train a model with the help of a centralized server while all data is held locally to preserve the users' privacy. The privacy guarantee can be further enhanced using the secure aggregation technique (Bonawitz et al., 2017), which hides the individual local updates and only reveals the aggregated global update. This graceful balance between utility and privacy has popularized federated learning in a variety of sensitive applications such as Google GBoard, healthcare services and self-driving cars.

The above threat model assumes that all users honestly upload their local updates. However, a small number of clients are likely to be malicious in a large-scale FL system with tens of thousands of clients. Moreover, in most SGD-based FL algorithms used today (McMahan & Ramage, 2017), the centralized server averages the local updates to obtain the global update, a procedure that is vulnerable to even a single malicious client. Such a client can arbitrarily craft its update to either prevent the global model from converging or lead it to a sub-optimal minimum. Attacks of this kind on federated learning are well studied (Bhagoji et al., 2019; Fang et al., 2019; Bagdasaryan et al., 2020; Sun et al., 2020).

To mitigate these attacks, various Byzantine-robust FL protocols (Blanchard et al., 2017; Yin et al., 2018; Fu et al., 2019; Pillutla et al., 2019) have been proposed to reduce the impact of contaminated updates. These protocols replace trivial averaging with well-designed Byzantine-robust mean estimators, which suppress the influence of malicious updates and output as accurate a mean estimate as possible. Nevertheless, almost all of these aggregators suffer from the curse of dimensionality: the estimation error scales up with the size of the model in a square-root fashion.
As a concrete example, a three-layer MLP on MNIST contains more than 50,000 parameters, leading to a more than 223-fold increase in the estimation error, which is prohibitive in practice. Draco (Chen et al., 2018), Bulyan (Mhamdi et al., 2018) and ByzantineSGD (Alistarh et al., 2018) are the only three works that claim to achieve dimension-free estimation error. However, Draco is designed for distributed learning and is incompatible with federated learning because it requires redundant updates from each worker. On the other hand, although Bulyan and ByzantineSGD provide dimension-free estimation error, their guarantees rest on much stronger assumptions than those of other works. When the assumptions are relaxed to the common case, Bulyan's estimation error still scales with the square root of the model size, as discussed in Section 2.

In addition, these robust FL protocols are incompatible with secure aggregation techniques: the robust estimators have to access individual local updates, while secure aggregation hides them from the server. Consequently, the system cannot simultaneously protect the server and the clients, but has to place complete trust in one of them. The lack of two-way protection severely undermines confidence in FL systems and prevents federated learning from being used in many sensitive applications such as home monitoring and self-driving cars.

Contribution. In this paper, we propose FEDERATED LEARNING WITH FENCE, abbreviated as F 2 ED-LEARNING. F 2 ED-LEARNING integrates a robust mean estimator with dimension-free error (Steinhardt, 2018) and secure aggregation (Bonawitz et al., 2017) to defend against both Byzantine malicious clients and a semi-honest server. In particular, F 2 ED-LEARNING is the first Byzantine-robust FL system with dimension-free estimation error.
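The 223-fold figure cited above follows directly from the square-root scaling: for d ≈ 50,000 parameters, an O(√d) estimator inflates the error by a factor of √50,000 ≈ 223.6. A quick check:

```python
import math

d = 50_000              # parameter count of the three-layer MLP example
blowup = math.sqrt(d)   # O(sqrt(d)) estimation-error factor
print(f"{blowup:.1f}")  # -> 223.6
```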
To address the incompatibility, the clients are split into multiple shards, the local updates within the same shard are securely aggregated at the centralized server, and the robust estimator is run on the aggregated updates from different shards. Perhaps surprisingly, sharding also helps satisfy the independent and identically distributed (IID) assumption required by the robust estimator, even under heterogeneous data distributions. By the Lindeberg central limit theorem (Lindeberg, 1922), despite the heterogeneity of the individual local updates, the aggregated updates from the shards approximately follow an IID Gaussian distribution.

2. RELATED WORK

Byzantine-robust aggregation has drawn enormous attention in the past few years due to the emergence of various distributed attacks on federated learning. Fang et al. (2019) formalize the attack as an optimization problem and successfully migrate data poisoning attacks to federated learning; the proposed attacks work even against Byzantine-robust federated learning. Sun et al. (2020) manage to launch a data poisoning attack on the multi-task federated learning framework. Bhagoji et al. (2019) and Bagdasaryan et al. (2020) even manage to insert backdoor functionalities into the model via local model poisoning or local model replacement.

A variety of Byzantine-robust FL protocols have been proposed to defend against these attacks. Krum (Blanchard et al., 2017) picks the subset of updates with enough close neighbors and averages that subset. Yin et al. (2018) leverage traditional robust estimators such as the trimmed mean or median to achieve an order-optimal statistical error rate under strong convexity assumptions. Yin et al. (2019) propose to use robust mean estimators to defend against the saddle-point attack. Mhamdi et al. (2018) point out that Krum, trimmed mean and median all suffer from an O(√d) estimation error (d is the model size) and propose a general framework, Bulyan, to reduce the error to O(1). However, we point out that Bulyan's improvement actually comes from its stronger assumption. In particular, Bulyan assumes that the expected distance between two benign updates is bounded by a constant σ1, while Krum assumes that the distance is bounded by σ2√d. If σ1 = σ2√d, Bulyan falls back to the same order of estimation error as Krum. The same loophole exists in the analysis of ByzantineSGD (Alistarh et al., 2018). Consequently, there is no known federated learning protocol with dimension-free estimation error against Byzantine adversaries.

3. PROBLEM SETUP

In this section, we review the general pipeline of federated learning, introduce the threat model, and establish the notation. We use bold lower-case letters (e.g. a, b, c) to denote vectors, and bold upper-case letters (e.g. A, B, C) for matrices. We denote 1, ..., n by [n].

Federated Learning Pipeline. In a federated learning system, there are one server S and n clients Ci, i ∈ [n]. Each client holds data samples drawn from some unknown distribution D. Let ℓ(w; z) be the loss function on the model parameter w ∈ R^d and a data sample z, and let L(w) = E_{z∼D}[ℓ(w; z)] be the population loss function. Our goal is to learn a model w that minimizes the population loss L(w).
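A minimal FedAvg-style loop makes this objective concrete. The squared loss and synthetic linear data below are illustrative stand-ins for ℓ(w; z) and D, not specifics from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, m, rounds, lr = 5, 8, 20, 200, 0.1
w_true = rng.normal(size=d)                 # optimum of the population loss
# Each client holds m samples z = (x, y) with y = <x, w_true> + noise.
xs = rng.normal(size=(n_clients, m, d))
ys = xs @ w_true + 0.01 * rng.normal(size=(n_clients, m))

def local_grad(w, x, y):
    """Gradient of the local loss ||x w - y||^2 / (2m)."""
    return x.T @ (x @ w - y) / len(y)

w = np.zeros(d)
for _ in range(rounds):
    grads = [local_grad(w, xs[i], ys[i]) for i in range(n_clients)]
    w = w - lr * np.mean(grads, axis=0)     # plain averaging: not robust

assert np.linalg.norm(w - w_true) < 0.1     # converges near the optimum
```

The `np.mean` on the last line of the loop is exactly the step a Byzantine-robust protocol replaces: a single adversarial gradient can shift this average arbitrarily.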

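The secure aggregation primitive referred to throughout can be illustrated with pairwise cancelling masks. This sketch omits the key agreement, secret sharing, dropout handling and finite-field arithmetic of the actual protocol (Bonawitz et al., 2017); all function names here are ours:

```python
import numpy as np

def masked_updates(updates, seed=0):
    """Client i adds mask r_ij for each j > i and subtracts r_ji for each
    j < i, so every mask cancels out in the sum of all masked updates."""
    n = len(updates)
    rng = np.random.default_rng(seed)
    masks = {(i, j): rng.normal(size=updates[0].shape)
             for i in range(n) for j in range(i + 1, n)}
    masked = []
    for i, u in enumerate(updates):
        v = u.copy()
        for j in range(n):
            if i < j:
                v += masks[(i, j)]
            elif j < i:
                v -= masks[(j, i)]
        masked.append(v)
    return masked

updates = [np.full(3, float(k)) for k in range(4)]   # toy client updates
masked = masked_updates(updates)
# The server sees only masked updates, yet their sum equals the true sum.
assert np.allclose(np.sum(masked, axis=0), np.sum(updates, axis=0))
```

In the real protocol the pairwise masks are derived from key agreement plus a pseudorandom generator rather than handed around in the clear.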

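Putting the pieces together, the shard-then-filter design from the introduction can be sketched as below. For brevity, secure aggregation is abstracted to a plain per-shard mean (the server only ever sees shard-level aggregates), and a coordinate-wise median stands in for FilterL2, which uses a more sophisticated filtering step with stronger guarantees:

```python
import numpy as np

def shard_then_filter(updates, num_shards, robust_mean, seed=0):
    """Randomly assign clients to shards, aggregate each shard's updates,
    and run a robust mean estimator over the shard-level aggregates."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(updates))
    shards = np.array_split(order, num_shards)
    shard_means = np.stack(
        [np.mean([updates[i] for i in s], axis=0) for s in shards])
    return robust_mean(shard_means)

# Coordinate-wise median: a simple robust stand-in for FilterL2.
coordwise_median = lambda x: np.median(x, axis=0)

rng = np.random.default_rng(1)
honest = [rng.normal(0.0, 0.1, size=5) for _ in range(27)]
byzantine = [np.full(5, 100.0) for _ in range(3)]   # crafted outliers
est = shard_then_filter(honest + byzantine, num_shards=10,
                        robust_mean=coordwise_median)
assert np.linalg.norm(est) < 1.0   # malicious updates are suppressed
```

One trade-off worth noting (our observation, not a claim from the paper): more shards give the robust estimator more sample points to work with, while larger shards make each revealed aggregate less informative about any single client.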