PRACTICAL LOCALLY PRIVATE FEDERATED LEARNING WITH COMMUNICATION EFFICIENCY

Anonymous

Abstract

Federated learning (FL) is a technique for training machine learning models from decentralized data sources. We study FL under local differential privacy constraints, which provide strong protection against sensitive data disclosure by obfuscating the data before it leaves the client. We identify two major concerns in designing practical privacy-preserving FL algorithms: communication efficiency and high-dimensional compatibility. We then develop a gradient-based learning algorithm called sqSGD (selective quantized stochastic gradient descent) that addresses both concerns. The proposed algorithm is based on a novel privacy-preserving quantization scheme that uses a constant number of bits per dimension per client. We improve the base algorithm in two ways: first, we apply a gradient subsampling strategy that simultaneously offers better training performance and smaller communication costs under a fixed privacy budget; second, we utilize randomized rotation as a preprocessing step to reduce quantization error. We also initiate a discussion of the roles of quantization and perturbation in FL algorithm design under privacy and communication constraints. Finally, the practicality of the proposed framework is demonstrated on benchmark datasets. Experimental results show that sqSGD successfully learns large models like LeNet and ResNet under local privacy constraints. In addition, at fixed privacy and communication levels, sqSGD significantly outperforms baseline algorithms.

1. INTRODUCTION

1.1 BACKGROUND

Federated learning (FL) (Kairouz et al., 2019; Konečnỳ et al., 2016) is a rapidly evolving application of distributed optimization to large-scale learning or estimation scenarios in which multiple entities, called clients, collaborate in solving a machine learning problem under the coordination of a central server. Each client's raw data is stored locally and never exchanged or transferred; to achieve the learning objective, the server collects only minimal information from the clients for immediate aggregation. FL is particularly suitable for mobile and edge-device applications, since sensitive individual data never directly leaves the device, and it has seen industrial deployments (Hard et al., 2019; Leroy et al., 2019).

While FL offers significant practical privacy improvements over centralizing all the training data, it lacks a formal privacy guarantee. As discussed in Melis et al. (2018), even if only model updates (i.e., gradient updates) are transmitted, it is easy to compromise the privacy of individual clients. Differential privacy (DP) (Dwork et al., 2014) is the state-of-the-art approach to addressing information disclosure. Differentially private algorithms mask the participation of any individual by injecting algorithm-specific random noise. In the FL setting, DP is suitable for protecting against external adversaries, i.e., a malicious analyst who tries to infer individual data by observing final or intermediate model results. However, DP paradigms typically assume a trusted curator, which corresponds to the server in the FL setting. This assumption is often not satisfied in practice: users acting as clients may not trust the service provider acting as the server.
Local differential privacy (LDP) (Kasiviswanathan et al., 2011; Dwork et al., 2014) provides privacy protection at the individual level by applying randomized mechanisms that obfuscate the data before it leaves the client. It is a more natural privacy model for distributed learning scenarios like FL and allows easier compliance with regulatory strictures (Bhowmick et al., 2018).

1.2. METHODOLOGY

We will focus on distributed learning procedures that aim to solve an empirical risk minimization problem in a decentralized fashion:

min_{θ∈Θ} L(θ),   L(θ) = (1/N) Σ_{m=1}^{M} L_m(θ).   (1)

At round t, the procedure iterates as follows:

1. The server distributes the current value θ_t among a subset S of clients.
2. Each client s ∈ S computes a local update Δ_s = g(θ_t, Y_s) and transmits it to the server.
3. The server aggregates {Δ_s}_{s∈S} to obtain a global update Δ and updates the parameter as θ_{t+1} = θ_t + Δ.

We distinguish two practical scenarios: cross-silo FL and cross-device FL (Kairouz et al., 2019). In cross-silo FL, M is relatively small (e.g., M ≤ 100) and each client usually holds a moderate or large amount of data (i.e., min_m N_m ≫ 1). As a consequence, all clients participate in the learning process (i.e., S = [M]), and in each iteration a client locally computes a negative stochastic gradient g(θ_t, Y_s) = -(1/R) Σ_{y∈R_s} ∇ℓ(θ_t, y) of L_s(θ_t)/N_s, based on a uniform random subsample R_s ⊂ Y_s of size R. In cross-device FL, S is a uniformly random subset of [M], and g(θ_t, Y_s) = -(1/N_s) Σ_{y∈Y_s} ∇ℓ(θ_t, y) is the average negative gradient over Y_s evaluated at θ_t. In this paper we focus on the FedSgd aggregation rule, which corresponds to a stochastic gradient step with learning rate η, i.e., Δ = η Σ_{s∈S} (N_s/N) g(θ_t, Y_s).

Generally speaking, there are two sources of data that need privacy protection: the parameter θ_t and the updates {Δ_s}_{s∈S}. In this paper we develop algorithms for protecting {Δ_s}_{s∈S} against inference attacks and reconstruction attacks, which will be defined later. Note that the protection of θ_t could be achieved by applying standard central DP techniques as in Bhowmick et al. (2018). To begin our discussion of suitably defined privacy models, we first review the local model of privacy.

Local differential privacy (Kasiviswanathan et al., 2011). We say a randomized algorithm A that maps the private data X ∈ X to some value Z = A(X) ∈ Z is ε-locally differentially private if the induced conditional probability measure P(·|X = x) satisfies, for any x, x′ ∈ X and any measurable subset Z of the output space,

e^{-ε} ≤ P(Z|X = x) / P(Z|X = x′) ≤ e^{ε}.

In the FL setting, the private data X is typically not the raw data Y but the gradient of the loss function evaluated at Y. Hereafter we will assume the private data to be a d-dimensional Euclidean vector, where d can be very large (i.e., on the order of millions). LDP is a very strong privacy protection model: it protects individual data against inference attacks that aim at inferring the membership of arbitrary data from the data universe. According to the discussion in Bhowmick et al. (2018), allowing adversaries of such strength may be overly pessimistic: to conduct an effective inference attack, the adversary must come up with "the true data" and use the privatized output only to verify his/her belief about the membership. If the adversary's prior information is reasonably constrained, we may adopt a weaker, but still meaningful, type of privacy protection that guards against reconstruction attacks. Heuristically, a reconstruction attack aims at recovering individual data with respect to a well-defined criterion, given a prior over the data domain that is not too concentrated. Formally, we adopt the definition in Bhowmick et al. (2018):
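As a toy illustration of the ε-LDP definition (our own sketch, not part of the paper's algorithm), consider binary randomized response, a classical ε-LDP mechanism for one-bit data; the function names and parameters below are hypothetical:

```python
import math
import random

def randomized_response(x: int, eps: float) -> int:
    """Report the true bit x ∈ {0, 1} with probability e^eps / (e^eps + 1);
    otherwise report the flipped bit. This mechanism satisfies eps-LDP."""
    p_true = math.exp(eps) / (math.exp(eps) + 1.0)
    return x if random.random() < p_true else 1 - x

def worst_case_ratio(eps: float) -> float:
    """Max over outputs z and inputs x, x' of P(Z = z | X = x) / P(Z = z | X = x'),
    the quantity that the eps-LDP definition bounds by e^eps."""
    p_true = math.exp(eps) / (math.exp(eps) + 1.0)
    return p_true / (1.0 - p_true)

# For randomized response the likelihood ratio meets the bound exactly:
assert abs(worst_case_ratio(1.0) - math.exp(1.0)) < 1e-12
```

Randomized response handles a single bit; the challenge addressed by the paper is designing mechanisms of this kind for d-dimensional gradients with d in the millions, under a communication budget.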



In (1), M denotes the number of clients. Y_m denotes the training data belonging to the m-th client, consisting of N_m data points, and N = Σ_{m=1}^{M} N_m. The term L_m(θ) = Σ_{y∈Y_m} ℓ(θ, y) stands for the locally aggregated loss evaluated at parameter θ, where ℓ : Θ × Y → R_+ is a loss function depending on the problem context, and Θ and Y denote the parameter and data domains respectively. We seek to optimize (1) via gradient-based methods. In what follows we use [n], n ∈ N_+, to denote the set {1, 2, . . . , n}.
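The round structure described above can be sketched as a minimal simulation. Everything below (the least-squares loss, client data, learning rate, and participation fraction) is a hypothetical setup of ours, following the cross-device variant with FedSgd aggregation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: M clients; client m holds N_m pairs (a, b) and the
# loss is ℓ(θ, (a, b)) = 0.5 * (aᵀθ - b)², so ∇ℓ(θ, (a, b)) = (aᵀθ - b) a.
M, d = 10, 5
clients = [(rng.normal(size=(int(n), d)), rng.normal(size=int(n)))
           for n in rng.integers(20, 50, size=M)]
N = sum(a.shape[0] for a, _ in clients)

def g(theta, client):
    """Average negative gradient over the client's full local data Y_s
    (the cross-device form of g(θ_t, Y_s))."""
    a, b = client
    return -(a.T @ (a @ theta - b)) / a.shape[0]

def fedsgd_round(theta, eta=0.1, frac=0.5):
    """One round: sample a client subset S, collect local updates, and
    apply the FedSgd aggregate Δ = η Σ_{s∈S} (N_s/N) g(θ_t, Y_s)."""
    S = rng.choice(M, size=int(frac * M), replace=False)
    delta = eta * sum((clients[s][0].shape[0] / N) * g(theta, clients[s])
                      for s in S)
    return theta + delta

theta = np.zeros(d)
for t in range(100):
    theta = fedsgd_round(theta)
```

In this sketch the updates Δ_s are sent in the clear; the algorithms developed in the paper replace them with quantized, privatized versions before transmission.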

