SINGLE SMPC INVOCATION DPHELMET: DIFFERENTIALLY PRIVATE DISTRIBUTED LEARNING ON A LARGE SCALE

Abstract

We introduce a distributed differentially private machine learning training protocol that locally trains support vector machines (SVMs) and computes their average using a single invocation of a secure summation protocol. With state-of-the-art secure summation protocols and a strong foundation model such as SimCLR, this approach scales to a large number of users and is applicable to non-trivial tasks, such as CIFAR-10. Our experimental results illustrate that for 1,000 users with 50 data points each, our scheme outperforms state-of-the-art scalable distributed learning methods (differentially private federated learning, DP-FL for short) while incurring around 500 times lower communication costs: for CIFAR-10, we achieve a classification accuracy of 79.7 % for ε = 0.59 while DP-FL achieves 57.6 %. More generally, we prove learnability properties for the average of such locally trained models: convergence and uniform stability. By only requiring strongly convex, smooth, and Lipschitz-continuous objective functions, locally trained via stochastic gradient descent (SGD), we achieve a strong utility-privacy trade-off.

1. INTRODUCTION

Scalable distributed privacy-preserving machine learning methods have a plethora of applications, ranging from medical institutions that want to learn from distributed patient data, over edge AI health applications, to decentralized recommendation systems. Preserving each person's privacy during distributed learning raises two challenges: (1) during the distributed learning process the inputs of all parties have to be protected and (2) the resulting model itself should not leak information about the contribution of any person to the training data. To tackle (1), secure multi-party computation (SMPC) protocols can protect data during distributed computation. To tackle (2), differentially private (DP) mechanisms provide guarantees for using or releasing the model in a privacy-preserving manner.

The literature contains a rich body of work on this kind of privacy-preserving distributed machine learning (PPDML), which is frequently evaluated with respect to three criteria: scalability with the number of users who participate in the distributed learning; expressivity of the learning method, with the goal of encompassing complex learning tasks; and a good utility-privacy trade-off, that is, protecting each person's data without a significant loss in accuracy, optimally with the same utility-privacy trade-off as the centralized training scheme while adding only little communication overhead.

Jayaraman et al. (2018) introduced a theoretical result where the model optimum is noised (output perturbation). Here, each of the n users locally trains a convex empirical risk minimization (ERM) model on m data points and contributes the carefully noised parameters of this model to a single SMPC invocation, resulting in an averaged differentially private model. This approach achieves DP (Chaudhuri et al., 2011), requires as little noise as the centralized setting (O(1/nm)), and incurs little communication overhead, with one SMPC invocation. However, they use untight utility bounds (Pathak et al., 2010) that scale with the number of local data points (O(1/m)) and not with the combined number of data points across all users (O(1/nm)).

Jayaraman et al. (2018) prove strong utility bounds with another scheme, gradient perturbation: each user contributes the carefully noised gradients of each local training iteration to an SMPC step, which results in an averaged differentially private gradient step. This construction adds as little noise as centralized training (O(1/nm)) and achieves strong utility bounds which scale with the number of data points across all users (O(1/nm)). However, it has considerable communication overhead since it requires one SMPC invocation per training iteration.

Federated learning (McMahan et al., 2017) with a DP-SGD approximation (Abadi et al., 2016) (DP-FL) constitutes another line of research with moderate utility bounds and moderate communication overhead. In DP-FL, an untrusted aggregator combines the gradient updates from each user, while each user satisfies DP. DP-FL does not require SMPC for similar security guarantees but needs O(#training_steps) communication rounds. The utility bounds are comparatively high since the noise scales with O(m√n). Appx. C discusses related work in more detail.

Concerning expressivity, Abadi et al. (2016); Tramèr & Boneh (2021); De et al. (2022) have shown that pre-trained models can improve the performance of a differentially private machine learning method (DP-SGD) for non-trivial tasks (e.g., CIFAR-10). While such models require sufficient public data, they exist and provide simplifying representations for various domains: SimCLR for pictures, Facenet for portrait pictures, UNet for medical segmentation imagery, or GPT-3 for natural language.

Yet, this prior work does not excel at all three metrics simultaneously: scalability, expressivity, and utility-privacy trade-off. This places current distributed training processes at an inherent disadvantage when compared to a centralized training process.

Our work, Secure Distributed DP-Helmet, extends prior work (Jayaraman et al., 2018) such that it is scalable, expressive, and has a good utility-privacy trade-off. Table 1 compares our approach with the approaches of Jayaraman et al. (2018) and with DP-FL. In summary, we make two tangible contributions:

1. For SGD-based strongly convex ERM, we prove a tighter utility bound which essentially states that we only need the average of locally trained models, e.g. support vector machines (SVMs) or logistic regression (LR), to converge to the optimal centrally trained model with rate O(1/M) for M iterations (cf. Thm. 21). We also show train-test generalization by proving uniform stability, which states that averaging our models linearly improves the stability bound (cf. Thm. 19).

2. In Cor. 10 we show how, with enough data, guarantees as in local DP can be achieved, even without assumptions on the training algorithm beyond a norm-bounded parameter space: we protect the entire input of a user while achieving strong utility bounds (> 80 % test accuracy for CIFAR-10).

2. OVERVIEW

Figure 1: Schematic overview of Secure Distributed DP-Helmet. Each user locally applies inference via a pre-trained feature extractor (SimCLR), trains a model, e.g. an SVM, via a learning algorithm T, and finally contributes a model which is carefully noised with a spherical Σ-parameterized Gaussian to a single secure summation invocation, which results in an averaged and (ε, δ)-DP model. ξ denotes some hyperparameters and K a set of classes.

Systems overview.

Secure Distributed DP-Helmet achieves scalable, distributed privacy-preserving training on sensitive data with a strong classification performance. A schematic overview of our work is illustrated in Fig. 1. Here each user (a protocol party) holds and protects a small dataset while all users jointly learn a model without leaking information about the local datasets. More precisely, the jointly computed model protects information about individual persons in two scenarios: first, each user is a local aggregator (e.g., a hospital) and each person contributes one data point (differential privacy, see Fig. 3); second, each user is a person and contributes a small dataset (local DP, see Fig. 2 for Υ = 50).

Essentially, we base our method on the following simple yet effective scheme introduced by Jayaraman et al. (2018): each user locally trains a model, e.g. an SVM, via a learning algorithm T, and contributes the carefully noised parameters of this model to a single secure summation (SMPC) invocation, which results in an averaged and (ε, δ)-DP model. This construction requires noise of only O(1/nm) for n users with m data points each, the same as in a centralized setting.

Thus we seek and subsequently achieve three criteria: (1) scalability with the number of users, which is measured by the number of communication rounds and, where required under the above privacy definition, the number of secure summation invocations; (2) high expressivity, which is measured by the classification accuracy as well as the utility degradation when compared to a centralized scheme where all data is stored at the aggregator; (3) a good utility-privacy trade-off, which is measured by how much more noise we need to add due to the distributed training scheme.

Related Work.

In Table 1 we detail our utility-bound improvement in comparison to prior work. Observe that our work matches the utility bound and privacy bound of centralized training while having constant secure summation overhead: we only require one invocation of secure summation, which is realistic even for smartphone-based applications since we do not have to deal with the issues of multiple consecutive communication rounds, such as dropouts or unstable connectivity. When using the secure summation construction of Bell et al. (2020), the number of communication rounds is fixed to 4 and the size of each communication round increases by only log(n_users), which is negligible compared to the constant overhead amounting to the model size. In comparison to DP-FL, we have a 500-fold decrease in communication cost: DP-FL has 1,920 rounds of size ℓ, where ℓ is the model size (roughly 60,000 floats for CIFAR-10), while we have 4 rounds of size log(n_users) + ℓ for roughly the same model size.

Evaluation.

Our evaluation on CIFAR-10 with SimCLR-based pre-training shows that for 1,000 users with 50 data points per user, our scheme achieves with SVMs a classification accuracy of 79.7 % for ε = 0.59. Extrapolated to hundreds of thousands and millions of protocol parties (see Fig. 2 for a thorough evaluation), we can protect the entire local dataset (say, 50-group DP) of a protocol party and achieve high accuracy: for local datasets of size 50, Secure Distributed DP-Helmet achieves ≥ 84 % accuracy and (ε, δ) = (0.01, 10^-10) for 200,000 users and ≥ 87 % accuracy and (ε, δ) = (2 · 10^-4, 10^-12) for 20,000,000 users while guaranteeing local DP-like properties (or 50-group DP). In our experiments, Secure Distributed DP-Helmet significantly improves upon DP-FL for n ≥ 400 users (for CIFAR-10), as it avoids the factor-√n noise overhead.

Table 1: Utility bound, noise scale, and number of SMPC invocations for n users with m data points each and M training iterations.

| Method | Utility bound | Noise scale | SMPC invocations |
| DP-FL | O(1/nm) | O(1/(m√n)) | none (O(M) rounds) |
| Jayaraman et al. (2018), gradient perturbation | O(1/nm) | O(1/nm) | O(M) |
| Jayaraman et al. (2018), output perturbation | O(1/m) | O(1/nm) | 1 |
| Secure Distributed DP-Helmet (ours) | O(1/nm) | O(1/nm) | 1 |
| Centralized training | O(1/nm) | O(1/nm) | 0 |

2.1 KEY IDEAS

Trustworthy distributed noise generation. One core requirement of SMPC-based distributed learning is honestly generated and unleakable noise, as otherwise our privacy guarantees would no longer hold. There is a rich body of work on distributed noise generation (Moran et al., 2009; Dwork et al., 2006a; Kairouz et al., 2015b; 2021; Goryczka & Xiong, 2015). So far, however, no distributed noise generation protocol scales to millions of users. To jointly create noise of a given magnitude, we can alternatively use a simple yet effective technique: utilizing the large number of users in our system, we can reasonably assume that at least a fraction of them (say t = 50 %) are not colluding to violate privacy by sharing the noise they generate with each other. As long as we combine the noise of each user in an oblivious fashion, every user can create noise separately and independently, and we are still guaranteed noise of a magnitude depending on t.

Strong utility-privacy trade-off via tight composition and convexity. Secure Distributed DP-Helmet utilizes differentially private SGD-based SVMs, for which strong utility-privacy trade-offs have been shown (Wu et al., 2017). We use SVMs for multi-class classification via the one-vs-rest (OVR) scheme: during training, each class is separated from the combination of all other classes. We rely on output perturbation: we derive a sensitivity bound on the resulting models and add noise calibrated to it. Meiser & Mohammadi (2018); Sommer et al. (2019); Balle et al. (2020a) show tight composition bounds for such sensitivity-bounded additive mechanisms.

Expressivity of our approach. A known limitation of convex SVMs is their limited expressivity.
As a remedy, we utilize transfer learning and operate in two phases: a pre-training phase in which a powerful representation model is trained on a public dataset, and a training phase in which we train a set of SVMs on a sensitive dataset. The datasets can have different, disjoint distributions; it suffices that the two datasets are comparable in structure. In our evaluation, we use the SimCLR representation model (Chen et al., 2020b) trained on ImageNet and then fine-tune it on CIFAR-10. Threat model & security goals. We separate the security assumptions of our protocol and those of the underlying secure summation. For our work, we assume security against malicious, global attackers that do not follow our protocol as long as we have a ratio of at least t honest users (say t = 50 %). In particular, we consider dishonest noise generation. The attacker in both variants tries to extract sensitive information about other parties from the interaction and the result. As in other strong security definitions, the attacker has strong background knowledge and knows everything about and can influence each user's dataset, except for one data point of one user. Our privacy goals are (ε, δ)-differential privacy (protecting single samples) and (ε, δ)-Υ-group differential privacy (protecting all samples of a user at once).
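The communication figures quoted in the Related Work paragraph (1,920 rounds of size ℓ for DP-FL versus 4 rounds of size log(n_users) + ℓ for our scheme) can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the evaluation code; the function names are hypothetical, and it assumes a base-2 logarithm and that the log term is counted in the same unit (floats) as the model size.

```python
import math

def dp_fl_cost(rounds: int, model_size: int) -> float:
    """Total communication of DP-FL: `rounds` messages of `model_size` floats."""
    return rounds * model_size

def dp_helmet_cost(n_users: int, model_size: int, rounds: int = 4) -> float:
    """Total communication with one secure summation: 4 rounds of size log(n_users) + model_size.

    Assumption: log base 2, counted in the same unit as the model size.
    """
    return rounds * (math.log2(n_users) + model_size)

# Values quoted in the text: 1,920 DP-FL rounds, roughly 60,000 floats, 1,000 users.
ratio = dp_fl_cost(1920, 60_000) / dp_helmet_cost(1_000, 60_000)
print(round(ratio))  # → 480
```

Under these assumptions the ratio is about 480, which is in line with the "around 500 times" figure; the log(n_users) term is dwarfed by the model size and barely affects the result.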

2.2. WHAT DOES THAT MEAN FOR PRACTICAL APPLICATIONS?

From a bird's eye view, our experiments show that the mean of several SVMs significantly improves the classification accuracy, even if the individual SVMs perform poorly on their own. We pushed this approach to its limit and trained each SVM on only 50 data points. Our experiments show that the SVM obtained by computing the mean of 1,000 such SVMs has very high accuracy.

Local DP. For applications with around 200,000 or more users, we can protect the entire local dataset of a user, i.e., the entire locally trained SVM. This result can be generalized: we can protect each local SVM independently of how many data points were used to train it. With this generalized view, our scheme protects not only data points but users, which makes our DP guarantees akin to a group-DP setting (comparable to local DP). Applications can leverage this method and let users train SVMs on their own devices instead of requiring local aggregators for sets of users.

For which other learning algorithms is this framework applicable? Our utility-privacy results apply, beyond SGD-based SVMs, to other learning methods; those methods improve significantly upon averaging locally trained models (cf. Cor. 10 and Appxs. M and N). We show that the following five requirements suffice for showing both differential privacy and learnability properties: (1) bounded output sensitivity, (2) strongly convex training objective, (3) smooth training objective, (4) Lipschitz-continuous training objective, and (5) SGD-based update routine. Notably, for differential privacy we formally only require that the norm of each model is bounded; we do not make any assumptions about the training procedure of each base learner, which learns a single model in the ensemble. In particular, base learners do not need to satisfy differential privacy.
Concerning learnability properties, we show that, for a certain class of learning methods, a generalizability property and a convergence property improve for the average of locally trained models when compared to the individual locally trained models. Whenever SGD is used with a strongly convex, smooth, and Lipschitz ERM objective (empirical risk minimization), averaging local models improves uniform stability, which is a form of generalizability property, and convergence, which measures the loss-distance to the optimal model for a given dataset.

As a privacy notion, we consider differential privacy (DP) (Dwork et al., 2006b). Intuitively, differential privacy quantifies the protection of any individual's data within a dataset against an arbitrarily strong attacker observing the output of a computation on said dataset. Strong protection is achieved by bounding the influence of each individual's data on the resulting SVMs. For the (standard) definition of differential privacy we utilize in our proofs, we refer to Appx. B.1.

Algorithm 1 (SGD_SVM):
Result: models (1d intercepts with p-dimensional hyperplanes): $\{f^{(k)}_M\}_{k \in K} \in \mathbb{R}^{(p+1) \times |K|}$
$\mathrm{clipped}(x) := c \cdot x / \max(c, \|x\|)$;
$J(f, D, k) := \frac{\Lambda}{2} f^T f + \frac{1}{N} \sum_{(x,y) \in D} \ell_{huber}\big(f^T \mathrm{clipped}(x) \cdot y \cdot (\mathbb{1}[y = k] - \mathbb{1}[y \neq k])\big)$;
for k in K:
    for m in 1, ..., M:
        $f^{(k)}_m \leftarrow \mathrm{SGD}(J(f_m, D, k), \alpha_m)$, with learning rate $\alpha_m := \min(\frac{1}{\beta}, \frac{1}{\Lambda m})$ and $\beta = \frac{1}{2h} + \Lambda$;
        $f^{(k)}_m := R \cdot f^{(k)}_m / \|f^{(k)}_m\|$;

We consider Support Vector Machines (SVMs), which can be made strongly convex, thus display a unique local minimum, and have a lower bound on the growth of the optimization function. Having a unique local minimum makes those methods ideal for computing tight differential privacy bounds and thus highly relevant machine learning predictors for our work. In fact, this differentially private SVM definition (DP_SGD_SVM) can be derived directly from the work of Wu et al. (2017) on empirical risk minimization using SGD-based optimization.
They rely on a smoothed version of the hinge loss: the Huber loss $\ell_{huber}$ (cf. Appx. B.4 for details). We additionally apply norm-clipping to all inputs. We use the one-vs-rest (OVR) method to achieve a multiclass classifier. Alg. 1 provides pseudocode for the sensitivity-bounded algorithm (before adding noise). In contrast to Wu et al. (2017), who assume ∥x∥ ≤ 1 for each data point, we use a generalization that holds for larger norm bounds c > 1: we assume ∥x∥ ≤ c, where c is a hyperparameter of the learning algorithm SGD_SVM. As a result, the optimization function J is (c + RΛ)-Lipschitz (instead of (1 + RΛ)-Lipschitz as in Wu et al. (2017)) and $((c^2/2h + \Lambda)^2 + p\Lambda^2)^{1/2}$-smooth (instead of (1/2h + Λ)-smooth).

Wu et al. (2017) showed a sensitivity bound for SGD_SVM from which we can conclude DP guarantees. The sensitivity proof follows from Wu et al. (2017, Lemma 8) with the Lipschitz constant L = c + RΛ, a smoothness $\beta = ((c^2/2h + \Lambda)^2 + p\Lambda^2)^{1/2}$, and a Λ-strong convexity.

Similarly, our work applies to L2-regularized logistic regression, where we adapt Alg. 1 with the optimization function $J'(f, D) := \frac{\Lambda}{2} f^T f + \frac{1}{N} \sum_{(x,y) \in D} \ln(1 + \exp(-f^T \mathrm{clipped}(x) \cdot y))$, which is Λ-strongly convex, L = (c + RΛ)-Lipschitz, and $\beta = ((c^2/4 + \Lambda)^2 + p\Lambda^2)^{1/2}$-smooth. We would also need to adapt the learning rate to accommodate the change in the smoothness parameter but continue to have the same sensitivity as for the classification case.

Definition 1 (Sensitivity). Let f be a function that maps datasets to the p-dimensional vector space $\mathbb{R}^p$. The sensitivity of f is defined as $\max_{D \sim_1 D'} \|f(D) - f(D')\|$, where $D \sim_1 D'$ denotes that the datasets D and D′ differ in at most one element. We say that f is an s-sensitivity-bounded function if its sensitivity is at most s.

The following lemma directly follows from Wu et al. (2017, Lemma 8).

Lemma 2.
With the input clipping bound c, the model clipping bound R, the strong convexity factor Λ, and the number of data points N, the learning algorithm SGD_SVM of Alg. 1 has a sensitivity bound of $s = \frac{2(c + R\Lambda)}{N\Lambda}$ for each of the |K| output models.

For sensitivity-bounded functions, there is a generic additive mechanism that adds Gaussian noise to the results of the function and achieves differential privacy, if the noise is calibrated to the sensitivity.

Lemma 3 (Gaussian mechanism is DP (Theorem A.1 & Theorem B.1 in Dwork & Roth (2014))). Let $q_k$ be functions with sensitivity s on the set of datasets D. For ε ∈ (0, 1) and $c^2 > 2 \ln(1.25/(\delta/|K|))$, the Gaussian mechanism $D \mapsto \{q_k(D)\}_{k \in K} + \mathcal{N}(0, (\sigma \cdot I_{(p+1) \times |K|})^2)$ with $\sigma \geq \frac{c \cdot s \cdot |K|}{\varepsilon}$ is (ε, δ)-DP, where $I_d$ is the d-dimensional identity matrix.

The mechanism that first learns |K| SVM models via SGD_SVM of Alg. 1 and then adds multivariate Gaussian noise $\mathcal{N}(0, (\sigma \cdot I_{(p+1) \times |K|})^2)$ is DP. Note that there are tighter composition results (Meiser & Mohammadi, 2018; Sommer et al., 2019; Balle et al., 2020a) where $\varepsilon \in O(\sqrt{|K|})$, which we do not formalize for brevity but follow in our experiments.

Corollary 4 (Gaussian mechanism on SGD_SVM is DP). With the s-sensitivity-bounded learning algorithm SGD_SVM (cf. Lem. 2), the dimension of each data point p, the set of classes K, and ε ∈ (0, 1), $\mathrm{DP\_SGD\_SVM}(D, \xi, K, \sigma) := \mathrm{SGD\_SVM}(D, \xi, K) + \mathcal{N}(0, (\sigma \cdot s \cdot I_{(p+1) \times |K|})^2)$ is (ε, δ)-DP, where $\varepsilon \geq \sqrt{2 \ln(1.25/(\delta/|K|))} \cdot |K| \cdot 1/\sigma$ and $I_d$ is the d-dimensional identity matrix.

Notation. A learning algorithm is a function from datasets to learned models. Subsequently, we consider the notion of a configuration in many theorems.

Definition 5 (Configuration ζ).
A configuration ζ(U, t, T, s, ξ, ℧, i, N, K, σ) consists of a set of users U of which t · |U| are honest, an s-sensitivity-bounded learning algorithm T on inputs (D, ξ, K), hyperparameters ξ, a local dataset $D^{(i)}$ of each user $U^{(i)} \in U$ with $N = \min_{i \in \{1, \dots, |U|\}} |D^{(i)}|$ and $\mho = \bigcup_{i=1}^{|U|} D^{(i)}$, a set of classes K, and a noise multiplier σ. avg(T) is the aggregation of |U| local models of algorithm T: $\mathrm{avg}(T(\mho)) = \frac{1}{|U|} \sum_{i=1}^{|U|} T(D^{(i)}, \xi, K)$. If unique, we simply write ζ.
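To make the pieces above concrete, the following is a minimal NumPy sketch of a sensitivity-bounded SGD-trained SVM with output perturbation, loosely following Alg. 1, Lem. 2, and Cor. 4. It is an illustration under stated assumptions, not the implementation used in the experiments: the exact form of the Huber loss is deferred to Appx. B.4, so the standard smoothed hinge with parameter h is assumed here; the intercept is omitted; single-sample SGD steps are used; and the model is projected onto the ball of radius R only when its norm exceeds R. All function names are hypothetical.

```python
import numpy as np

def clip_rows(X, c):
    """clipped(x) := c * x / max(c, ||x||), applied to every row of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return c * X / np.maximum(c, norms)

def huber_grad(z, h):
    """Derivative w.r.t. the margin z of the smoothed hinge loss (assumed form of l_huber)."""
    return np.where(z > 1 + h, 0.0, np.where(z < 1 - h, -1.0, -(1 + h - z) / (2 * h)))

def sgd_svm_binary(X, y, lam, R, c, h, M, rng):
    """Train one binary SVM (labels in {-1, +1}); the result has norm at most R."""
    N, p = X.shape
    Xc = clip_rows(X, c)
    f = np.zeros(p)
    beta = 1.0 / (2 * h) + lam              # smoothness constant as printed in Alg. 1
    for m in range(1, M + 1):
        i = rng.integers(N)                 # single-sample SGD step (assumption)
        z = y[i] * (Xc[i] @ f)
        grad = lam * f + huber_grad(z, h) * y[i] * Xc[i]
        f = f - min(1.0 / beta, 1.0 / (lam * m)) * grad
        norm = np.linalg.norm(f)
        if norm > R:                        # projection onto the R-ball (assumption)
            f = R * f / norm
    return f

def sensitivity(c, R, lam, N):
    """Sensitivity bound s = 2 (c + R * lam) / (N * lam) of Lem. 2."""
    return 2 * (c + R * lam) / (N * lam)

def noise_multiplier(eps, delta, n_classes):
    """Smallest sigma with eps >= sqrt(2 ln(1.25 / (delta / |K|))) * |K| / sigma (Cor. 4)."""
    return np.sqrt(2 * np.log(1.25 / (delta / n_classes))) * n_classes / eps

def dp_sgd_svm_ovr(X, y, classes, lam, R, c, h, M, sigma, rng):
    """One-vs-rest SVMs plus Gaussian noise with standard deviation sigma * s."""
    s = sensitivity(c, R, lam, len(X))
    models = np.stack(
        [sgd_svm_binary(X, np.where(y == k, 1.0, -1.0), lam, R, c, h, M, rng) for k in classes],
        axis=1,
    )
    return models + rng.normal(0.0, sigma * s, size=models.shape)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                # toy data: 50 points, 8 features
y = rng.integers(0, 3, size=50)             # 3 classes
sigma = noise_multiplier(eps=0.5, delta=1e-5, n_classes=3)
model = dp_sgd_svm_ovr(X, y, classes=[0, 1, 2], lam=1.0, R=1.0, c=2.0, h=0.1,
                       M=200, sigma=sigma, rng=rng)
print(model.shape)  # (8, 3)
```

With only 50 data points the calibrated noise dominates the model, which is exactly the regime in which averaging over many users becomes necessary.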

4. SECURE DISTRIBUTED DP-HELMET

This section presents Secure Distributed DP-Helmet in detail (cf. Alg. 2) and its utility-privacy properties. Here, each user separately trains a model with a sensitivity-bounded learning algorithm, e.g. DP_SGD_SVM, before the parameters are combined with the parameters trained by other users via a single round of secure summation. The single round of secure multi-party computation allows us to have the full benefit of securely aggregating data: we can show centralized-DP guarantees within a threat model akin to that of federated learning with differential privacy.

Active attacks and untrustworthy noise. Our threat model allows each user to place very little trust in other users. However, we focus on passive adversaries. Active attacks that, e.g., aim to poison the resulting model, are left for future work. Note that even passive adversaries can collude and exchange information about the randomness they used in their local computation. As we combine the noise added by different users, we need to take into account that not all of that noise is necessarily secret to the adversary. To compensate for untrustworthy users, we double the noise added by each user; as long as half of all users are honest, our guarantees thus are valid.

Next, we derive a tight output sensitivity bound. A naïve approach would be to release each individual predictor and determine the noise scale proportionally to $\tilde{\sigma} := \sigma$ (cf. Cor. 4), showing (ε, δ)-DP for every user. We can save a factor of $|U|^{1/2}$ by leveraging that |U| is known to the adversary and that at least a fraction t = 50 % of the users is honest, yielding $\tilde{\sigma} := \sigma \cdot 1/\sqrt{t \cdot |U|}$.

Corollary 6. Given a configuration ζ, Secure Distributed DP-Helmet(ζ) (cf. Alg. 2) without adding noise, i.e. avg(T(℧)), has a sensitivity of s · 1/|U| for each class k ∈ K.

The proof is placed in Appx. G.
Having bounded the sensitivity of the aggregate to s · 1/|U|, we show that locally adding noise per user proportional to $\sigma \cdot s \cdot 1/\sqrt{|U|}$ and taking the mean is equivalent to only centrally adding noise proportional to σ · s · 1/|U| (as if the central aggregator was honest).

Lemma 7. Given a configuration ζ and any noise scale σ, then $\frac{1}{|U|} \sum_{i=1}^{|U|} \mathcal{N}(0, (\sigma \cdot 1/\sqrt{|U|})^2) = \mathcal{N}(0, (\sigma \cdot 1/|U|)^2)$.

The proof is placed in Appx. H.

Algorithm 2 (Secure Distributed DP-Helmet):
Client code, run by each user on its local dataset D:
    Result: $M_{priv} := \{f^{(k)}_{priv}\}_{k \in K}$
    $M \leftarrow T(D, \xi)$; // T is s-sensitivity-bounded
    $M_{priv} \leftarrow M + \mathcal{N}(0, (\tilde{\sigma} \cdot s \cdot I_{(p+1) \times |K|})^2)$ with $\tilde{\sigma} := \sigma \cdot 1/\sqrt{t \cdot |U|}$;
    Run the client code of the secure summation protocol $\pi_{SecAgg}$ on input $M_{priv}/|U|$;
def Server Secure Distributed DP-Helmet(U):
    Data: users U
    Result: empty string
    Run the server protocol of $\pi_{SecAgg}$;

We can now prove differential privacy for Secure Distributed DP-Helmet of Alg. 2, where we have noise scale $\tilde{\sigma} := \sigma \cdot 1/\sqrt{t \cdot |U|}$ and thus $\varepsilon \in O(s/\sqrt{t \cdot |U|})$.

Theorem 8 (Main Theorem, simplified). Given a configuration ζ, Secure Distributed DP-Helmet(ζ) (cf. Alg. 2) satisfies computational (ε, δ + ν)-DP with $\varepsilon \geq \sqrt{2 \ln(1.25/(\delta/|K|))} \cdot |K| \cdot 1/\sigma$.

Cor. 9 and DP-FL. While the averaging will slightly offset this massive amount of noise, such a result does not hold for DP-FL because in the local training the sensitivity does not decrease. Hence, in contrast to Secure Distributed DP-Helmet, the standard deviation of the noise that is locally added will continuously increase, no matter how many users join the distributed training.

Cor. 9 generalizes to a more comprehensive Cor. 10 that is data-oblivious. If we can show that the training algorithm of every user has the same bounded sensitivity, i.e., that the norm of each model is bounded, then Secure Distributed DP-Helmet can apply to the granularity of users instead of that of data points.
We explicitly do not need to make any further assumptions about the training procedure of each base learner; it is sufficient that the local models are combined via a noisy arithmetic mean. This method renders a tighter sensitivity bound than SGD_SVM for certain settings of Υ or data points per user N. Moreover, it enables the use of other SVM optimizers or logistic regression. In particular, the training procedure of each base learner does not need to satisfy differential privacy.

Corollary 10. Given a learning algorithm T, we say that T is R-norm bounded if for any input dataset D with N = |D|, any hyperparameter ξ, and all classes k ∈ K, ∥T(D, ξ, k)∥ ≤ R. Any R-norm bounded learning algorithm T has a sensitivity s = 2R. In particular, $T + \mathcal{N}(0, (\sigma \cdot s \cdot I_d)^2)$ satisfies (ε, δ)-Υ-group differential privacy with Υ = N and $\varepsilon \geq \sqrt{2 \ln(1.25/(\delta/|K|))} \cdot |K| \cdot 1/\sigma$, where $\mathcal{N}(0, (\sigma \cdot s \cdot I_d)^2)$ is spherical multivariate Gaussian noise and σ a noise multiplier.

The proof is in Appx. J. Here the number of local data points N can vary among the users.

STABILITY & CONVERGENCE

The core of our approach is to locally train models and compute the average without further synchronization or fine-tuning of the models: avg(T). For T = SGD_SVM, we prove that the learnability properties (Shalev-Shwartz et al., 2010) uniform stability and convergence are comparable to those of a centrally trained SGD_SVM.
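The averaging step avg(T), combined with the per-user noise of Alg. 2, can be simulated numerically to see how the noise of individual users adds up. The sketch below is an illustration only: it replaces the secure summation protocol by a plain sum, uses all-zero vectors as stand-ins for the local models so that the output contains nothing but noise, and all function names are hypothetical.

```python
import numpy as np

def local_noise_std(sigma, s, t, n_users):
    """Per-user noise standard deviation sigma~ * s with sigma~ = sigma / sqrt(t * |U|)."""
    return sigma * s / np.sqrt(t * n_users)

def noisy_average(local_models, sigma, s, t, rng):
    """Plain-text stand-in for secure summation: mean of the locally noised models."""
    n_users = len(local_models)
    std = local_noise_std(sigma, s, t, n_users)
    noisy = [m + rng.normal(0.0, std, size=m.shape) for m in local_models]
    return np.sum(noisy, axis=0) / n_users

rng = np.random.default_rng(0)
n_users, dim, sigma, s, t = 100, 50_000, 4.0, 0.5, 0.5
models = [np.zeros(dim) for _ in range(n_users)]     # zero models: the output is pure noise
empirical_std = noisy_average(models, sigma, s, t, rng).std()

# Standard deviation of the averaged noise if the noise of every user stays secret.
all_users_std = sigma * s / (np.sqrt(t) * n_users)
# Standard deviation contributed by the t * |U| honest users alone (the target of Lemma 7).
honest_only_std = sigma * s / n_users

print(empirical_std, all_users_std, honest_only_std)
```

In this toy setting the measured standard deviation matches σ · s / (√t · |U|), and the share contributed by the honest fraction alone is σ · s / |U|, i.e. the amount a trusted central aggregator would add for an aggregate with sensitivity s / |U|.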

Uniform stability.

We show in Appx. M that the training generalizes well by proving uniform stability in the sense of Bousquet & Elisseeff (2002) for T = SGD_SVM:

$\left| \mathbb{E}\left[ J(\mathrm{avg}(T(\mho)), \mho, \_) - \mathbb{E}_{z \in Z}[J(\mathrm{avg}(T(\mho)), z, \_)] \right] \right| \in O(|\mho|^{-1})$

where J is the objective function (cf. Alg. 1) and Z an unknown data distribution where ℧ ∈ Z. In particular, we show that averaging the locally trained SGD_SVMs linearly improves the stability bound.

Convergence. In line with Zhang et al. (2013) on averaged ERM models, we show in Appx. N that avg(SGD_SVM) gracefully converges to the best model for the combined local datasets ℧:

$\mathbb{E}\left[ J(\mathrm{avg}(\mathrm{SGD\_SVM}(\mho)), \mho, \_) - \inf_f J(f, \mho, \_) \right] \in O(1/M)$

Figure 2: ... with 200,000 (left) and 20,000,000 (right) users. We train 1,000 models on 50 data points each; to emulate having more users we rescale the ε-values (ε′ := 1000 · ε · Υ / n_users) and report the respective (interpolated) accuracy values. We extrapolate the privacy guarantees due to the limited dataset size. Our accuracy values are pessimistic, as we keep the accuracy numbers that we got from averaging 1,000 models; actually taking the mean over 200,000 or even 20,000,000 users should provide better results. In our evaluation, Υ = 50 group privacy is comparable to local DP. A lower value of Υ < 50 places trust in users as local aggregators. For Υ ≥ 2, we can use a tighter group-privacy bound (cf. Cor. 10); hence, the accuracy values are the same as for Υ = 50 = N, where the entire local dataset of a user is protected.

We analyze four experimental questions: (RQ1) First, how do Secure Distributed DP-Helmet and the strongest alternative, DP-SGD-based federated learning, perform in terms of privacy-utility trade-off? Moreover, how does the performance change if we allow more users (cf. Fig. 3, right, and Fig. 4)? (RQ2) Second, what is the utility loss of applying both methods in a distributed fashion instead of centrally (cf. Fig. 3, left)?
(RQ3) Third, how does Secure Distributed DP-Helmet perform when we have truly many users (≥ 200,000 users) and when we are in a local-DP setting (cf. Fig. 2)? (RQ4) Fourth, how do learning algorithms other than DP_SGD_SVM perform (cf. Appx. F)?

Pretraining. We used a SimCLR model pre-trained on ImageNet ILSVRC-2012 (Russakovsky et al., 2015) for all experiments (cf. Fig. 6 in the appendix for an embedding view). It is built with a ResNet152 architecture with selective kernels (Li et al., 2019) and a width multiplier of 3, and it has been trained in the fine-tuned variation of SimCLR where 100 % of ImageNet's label information has been integrated during training. Overall, it totals 795 M parameters and achieves 83.1 % classification accuracy (1,000 classes) when applied to a linear prediction head. In comparison, a supervised-only model of the same size would only achieve 80.5 % classification accuracy.

Sensitive Dataset. CIFAR-10 (Krizhevsky, 2009) acts as our sensitive dataset, as it is frequently used as a benchmark dataset, especially in the differential privacy literature. CIFAR-10 is an MIT-licensed dataset consisting of 60,000 thumbnail-sized, colored images of 10 classes.

Evaluation. We evaluate the model performance in three scenarios. First, we evaluated a benchmark scenario in Fig. 3 (left) to compare our Secure Distributed DP-Helmet (cf. Section 4) to a DP-SGD-based federated learning approach (DP-FL) on a single-layer perceptron with softmax loss. There, approximately the same number of data points is split across a varying number of users, ranging from 1 to 1,000. Second, we also evaluated a realistic scenario in Fig. 3 (right) where we fixed the number of data points per user and report the performance increase obtained with more participating users. Fig. 4 depicts the setting of Fig. 3 (right) for a fixed privacy budget. Third, we evaluated a scenario with truly many users as well as a local-DP setting in Fig.
2 where we rescale the privacy budget to accommodate the changed parameters.

The experiments lead to three conclusions: (RQ1) First, performance improves with an increasing number of users (cf. Fig. 3 (right)). Although Secure Distributed DP-Helmet performs worse than DP-FL for few users, it overtakes DP-FL after about 400 users due to its steep performance gain with the number of users (cf. Fig. 4). (RQ2) Second, in a scenario with a globally fixed number of data points (cf. Fig. 3 (left)) that are distributed over the users, the performance of Secure Distributed DP-Helmet degrades more gracefully than that of DP-FL. Thm. 21 supports the more graceful decline; it states that averaging multiple of the SVM predictors used here eventually converges to the optimal SVM on all training data. The difference between 1 and 100 users is largely due to our assumption of t = 50 % dishonest users, which means noise is scaled by a factor of √2. In comparison, DP-FL performs worse the more users U participate, as the noise scales with $O(|U|^{1/2})$. (RQ3) Third, the advantage of our method over DP-FL becomes especially evident when considering significantly more users (cf. Fig. 2), as is common in distributed training via smartphones. Here, DP guarantees of ε ≤ 2 · 10^-4 become plausible with at least 87 % prediction performance for a task like CIFAR-10. Alternatively, leveraging Cor. 9 we can consider a local DP scenario (with Υ = 50) without a trusted aggregator, yielding an accuracy of 84 % for ε = 5 · 10^-4. Starting from Υ ≥ 2, a user-level sensitivity (cf. Cor. 10) is, in the evaluated setting, tighter than a data-point-dependent one; hence, the accuracy values are the same as for the local DP scenario. (RQ4) We refer to Appx. F for an ablation study in the centrally trained setting for learning algorithms other than DP_SGD_SVM.
In this setting, the DP_SGD_SVM used here has a worse privacy-utility trade-off than other DP learners like DP-SGD: for ε = 0.59, DP_SGD_SVM has an accuracy of 87.4 % while DP-SGD has 93.6 %. The reasons include leakage via sequential composition (through DP_SGD_SVM's one-versus-rest multi-class approach) compared to DP-SGD's joint learning of all classes, as well as DP-SGD's noise-correcting property from its iterative noise application.

Computation costs.

For Secure Distributed DP-Helmet with 1,000 users and a model size l ≈ 100,000 for CIFAR-10, we need less than 0.2 s for the client and 40 s for the server, determined by extrapolating the experiments of (Bell et al., 2020, Table 2 ). Experimental setup. Appx. D describes our experimental setup.

A LIMITATIONS & DISCUSSION

Distributional shifts between the public and sensitive datasets. For pre-training our models, we leverage contrastive learning. While very effective generally, contrastive learning is susceptible to performance loss if the shape of the sensitive data used to train the SVMs is significantly different from the shape of the initial public training data.

Multi-class classification. As we train separate SVMs for the different classes, this approach works best if the number of classes is limited. Distributed DP-Helmet can deal with multiple classes; CIFAR-10 has 10 classes. However, if a classification task has significantly more classes, non-SVM-based approaches might perform better.

Input Clipping. DP_SGD_SVM requires a norm bound on the input data as it directly influences the SVM training. In many pre-training methods like SimCLR no natural bound exists; thus, we have to artificially norm-clip the input data. To provide a non-data-dependent clipping bound for CIFAR-10 data, we determined the clipping bound on the CIFAR-100 dataset (here: 34.854); its similar data distribution encompasses the output distribution of the pretraining reasonably well.

Hyperparameter Search. In SGD_SVM, we deploy two performance-crucial hyperparameters: the regularization weight Λ as well as the predictor radius R, both of which influence noise scaling. In the noise scaling subterm $c/\Lambda + R$, the maximal predictor radius is naturally significantly smaller than $c/\Lambda$ due to the regularization penalty. Thus, an imperfect R resulting from a non-hyperparameter-tuned SVM does not have a large impact on the performance. Estimating the regularization weight for a fixed ε from public data is called hyperparameter freeness in prior work (Iyengar et al., 2019). For other ε values, we can fit a (linear) curve on a smaller but related public dataset (proposed by Chaudhuri et al. (2011)) or on synthetic data (proposed by AMP-NT (Iyengar et al., 2019)), as smaller ε values prefer higher regularization weights and vice versa.
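To make the role of R in the noise scaling concrete, the following minimal sketch evaluates the sensitivity for a few radii. It assumes the closed form $s = 2(c + R\Lambda)/(N\Lambda)$ stated for SGD_SVM alongside Alg. 2; the function name `svm_output_sensitivity` is hypothetical, and the combination of values (c = 34.854, Λ = 100, N = 50, R between 0.04 and 0.08) is picked from the reported setup purely for illustration.

```python
def svm_output_sensitivity(c: float, lam: float, radius: float, n: int) -> float:
    """Sensitivity s = 2 (c + R * Lambda) / (N * Lambda) = 2 (c / Lambda + R) / N.

    c: input clipping bound, lam: regularization weight Lambda,
    radius: predictor radius R, n: number of local data points N.
    """
    return 2.0 * (c + radius * lam) / (n * lam)


if __name__ == "__main__":
    c, lam, n = 34.854, 100.0, 50  # illustrative values, see lead-in
    for radius in (0.04, 0.06, 0.08):
        s = svm_output_sensitivity(c, lam, radius, n)
        print(f"R = {radius:.2f}: c/Lambda = {c / lam:.5f}, s = {s:.6f}")
```

Under these assumed values, c/Λ dominates the subterm, so moving R across the whole grid changes the sensitivity only moderately.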

B EXTENDED PRELIMINARIES B.1 DIFFERENTIAL PRIVACY

To ease our analysis, we consider a randomized mechanism M to be a function translating a database to a random variable over possible outputs. Running the mechanism then reduces to sampling from the random variable. With that in mind, the standard definition of differential privacy looks as follows.

Definition 11 ($\approx_{\varepsilon,\delta}$ relation). Let Obs be a set of observations, RV(Obs) be the set of random variables over Obs, and $\mathcal{D}$ be the set of all databases. For a randomized algorithm $M : \mathcal{D} \to \mathrm{RV}(\mathrm{Obs})$ and a pair of datasets $D, D'$, we write $M(D) \approx_{\varepsilon,\delta} M(D')$ if for all tests $S \subseteq \mathrm{Obs}$ we have

$$\Pr[M(D) \in S] \le \exp(\varepsilon) \Pr[M(D') \in S] + \delta. \tag{1}$$

Definition 12 (Differential Privacy). Let Obs be a set of observations, RV(Obs) be the set of random variables over Obs, and $\mathcal{D}$ be the set of all databases. A randomized algorithm $M : \mathcal{D} \to \mathrm{RV}(\mathrm{Obs})$ is an (ε, δ)-DP mechanism if for all pairs of databases $D, D'$ that differ in at most 1 element we have $M(D) \approx_{\varepsilon,\delta} M(D')$.

In the context of machine learning, the randomized algorithm represents the training procedure of a predictor. Our distinguishing element is one data record of the database.

Computational Differential Privacy. Note that because of the secure summation, we technically require the computational version of differential privacy (Mironov et al., 2009), where the differential privacy guarantees are defined against computationally bounded attackers; the resulting increase in δ is negligible, and arguments about computationally bounded attackers are omitted to simplify readability.

Definition 13 (Computational $\approx^{c}_{\varepsilon,\delta}$ Differential Privacy). Let $\mathcal{D}$ be the set of all databases and η a security parameter. For a randomized algorithm $M : \mathcal{D} \to \mathrm{RV}(\mathrm{Obs})$ and a pair of datasets $D, D'$, we write $M(D) \approx^{c}_{\varepsilon,\delta} M(D')$ if for any polynomial-time probabilistic attacker A

$$\Pr[A(M(D)) = 1] \le \exp(\varepsilon) \Pr[A(M(D')) = 1] + \delta(\eta).$$

M is a computational (ε, δ(η))-DP mechanism if for all pairs of databases $D, D'$ that differ in at most 1 element we have $M(D) \approx^{c}_{\varepsilon,\delta} M(D')$.
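For mechanisms with finitely many outputs, the $\approx_{\varepsilon,\delta}$ relation can be checked directly. The sketch below is only an illustration, not part of the protocol: it uses the fact that the worst-case test S collects exactly the outputs where $\Pr[M(D) = o] > \exp(\varepsilon) \Pr[M(D') = o]$. The helper names and the randomized-response example (answering truthfully with probability 0.75) are assumptions introduced here.

```python
import math


def smallest_delta(p: dict, q: dict, eps: float) -> float:
    """Smallest delta such that Pr[P in S] <= exp(eps) * Pr[Q in S] + delta for all S.

    p and q map each observation to its probability under M(D) and M(D').
    """
    outcomes = set(p) | set(q)
    return sum(max(0.0, p.get(o, 0.0) - math.exp(eps) * q.get(o, 0.0)) for o in outcomes)


def approx_indistinguishable(p: dict, q: dict, eps: float, delta: float) -> bool:
    """Check the one-sided relation M(D) ~(eps, delta) M(D') on finite distributions."""
    return smallest_delta(p, q, eps) <= delta + 1e-12  # small float tolerance


if __name__ == "__main__":
    # Hypothetical example: randomized response on one bit, truthful with probability 0.75.
    m_d = {"yes": 0.75, "no": 0.25}        # output distribution of M(D)
    m_d_prime = {"yes": 0.25, "no": 0.75}  # output distribution of M(D')
    print(approx_indistinguishable(m_d, m_d_prime, math.log(3), 0.0))
    print(smallest_delta(m_d, m_d_prime, 0.0))
```

Definition 12 requires the relation for all neighboring pairs, so in practice the check is run in both directions.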

B.2 SECURE SUMMATION

Hiding intermediary local training results as well as ensuring their integrity is provided by an instance of secure multi-party computation (SMPC) called secure summation (Bonawitz et al., 2017; Bell et al., 2020). It is designed for distributed summations across a huge number of parties. In fact, the protocol of Bell et al. (2020) has a computational complexity for n users on an l-sized input of $O(\log^2 n + l \log n)$ for the client and $O(n(\log^2 n + l \log n))$ for the server as well as a communication complexity of $O(\log n + l)$ for the client and $O(n(\log n + l))$ for the server, thus enabling an efficient run-through of roughly $10^9$ users without biasing towards computationally equipped users. Additionally, it offers resilience against client dropouts and colluding adversaries, both of which are substantial features for our distributed setting:

Theorem 14 (Secure Aggregation $\pi_{\mathrm{SecAgg}}$ in the semi-honest setting exists (Bell et al., 2020)). Let $s_1, \dots, s_n$ be the d-dimensional inputs of the clients $U^{(1)}, \dots, U^{(n)}$. Let $F$ be the ideal secure summation function: $F(s_1, \dots, s_n) := \frac{1}{n} \sum_{i=1}^{n} s_i$. If secure authenticated encryption schemes and an authenticated key agreement protocol exist, the fraction of dropouts (i.e., clients that abort the protocol) is at most $\rho \in [0, 1]$, at most a $\gamma \in [0, 1]$ fraction of clients is corrupted ($C \subseteq \{U^{(1)}, \dots, U^{(n)}\}$, $|C| = \gamma n$), and the aggregator is honest-but-curious, there is a secure summation protocol $\pi_{\mathrm{SecAgg}}$ for a central aggregator and n clients that securely emulates $F$ in the following sense: there is a probabilistic polynomial-time simulator $\mathrm{Sim}_F$ such that $\mathrm{Real}_{\pi_{\mathrm{SecAgg}}}(s_1, \dots, s_n)$ is statistically indistinguishable from $\mathrm{Sim}_F(C, F(s_1, \dots, s_n))$, i.e., for an unbounded attacker A there is a negligible function ν such that

$$\mathrm{Advantage}(A) = \left| \Pr[A(\mathrm{Real}_{\pi_{\mathrm{SecAgg}}}(s_1, \dots, s_n)) = 1] - \Pr[A(\mathrm{Sim}_F(C, F(s_1, \dots, s_n))) = 1] \right| \le \nu(\eta).$$

Recent work (Tramèr & Boneh, 2021; De et al., 2022) has shown that strong feature extractors (such as SimCLR (Chen et al., 2020a;b)), trained in an unsupervised manner, can be combined with simple learners to achieve strong utility-privacy trade-offs for high-dimensional data sources like images. As a variation of transfer learning, this is a two-step process (cf. Fig. 5), where a simplified representation of the high-dimensional data is learned first, before a tight privacy algorithm like DP_SGD_SVM conducts the prediction process on these simplified representations. For that, two data sources are compulsory: a public data source, which is used to learn a framework that yields pertinent simplified representations, and our sensitive data source, on which the prediction process is conducted in a differentially private manner. Thereby the sensitive dataset is protected while strong expressiveness is assured through the use of the feature reduction network. Also note that a homogeneous data distribution of the public and the sensitive data is not necessarily required.

Recent work has shown that for several applications such representation reduction frameworks can be found, such as SimCLR for pictures, FaceNet for face images, UNet for segmentation, or GPT-3 for language data. Without loss of generality, we focus in this work on the unsupervised SimCLR feature reduction network (Chen et al., 2020a;b). SimCLR uses a contrastive loss and image transformations to align the embeddings of similar images while keeping those of dissimilar images separate (Chen et al., 2020a). It is based upon a self-supervised training scheme in which no labeled data is required. Unlabeled data is especially useful as it makes it possible to include large-scale datasets that would otherwise be unattainable due to the labeling effort needed.
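Returning to the secure summation of Appx. B.2, the following toy sketch illustrates why an aggregator can learn the sum without seeing individual inputs. It shows plain pairwise additive masking, the basic idea behind Bonawitz et al. (2017); it is not the protocol of Bell et al. (2020) and omits key agreement, dropout handling, and corruption resilience. The modulus, the function names, and the use of a seeded `random.Random` in place of pairwise agreed secrets are assumptions made for illustration only.

```python
import random

MODULUS = 2 ** 32  # hypothetical ring size for the integer-encoded inputs


def masked_inputs(values, seed=0):
    """Each pair (i, j) shares a random mask r: client i adds r, client j subtracts r."""
    rng = random.Random(seed)  # stand-in for pairwise agreed secrets, not secure
    masked = [v % MODULUS for v in values]
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            r = rng.randrange(MODULUS)
            masked[i] = (masked[i] + r) % MODULUS
            masked[j] = (masked[j] - r) % MODULUS
    return masked


def aggregate(masked):
    """The aggregator only sums the masked values; all pairwise masks cancel."""
    return sum(masked) % MODULUS


if __name__ == "__main__":
    values = [3, 5, 7, 11]
    total = aggregate(masked_inputs(values))
    print(total, total / len(values))  # sum and the average F(s_1, ..., s_n)
```

Dividing the recovered sum by n gives the average computed by the ideal functionality F in Thm. 14.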

B.4 DP_SGD_SVM

Definition 15. The Huber loss according to Chaudhuri et al. (2011, Equation 7) is, with a smoothness parameter h, defined as

$$\ell_{\mathrm{huber}}(z) := \begin{cases} 0 & \text{if } z > 1 + h \\ \frac{1}{4h}(1 + h - z)^2 & \text{if } |1 - z| \le h \\ 1 - z & \text{if } z < 1 - h. \end{cases}$$
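As a quick illustration of Definition 15, the piecewise definition can be written down directly. This is a minimal sketch; the function name `huber_loss` and the convention that z denotes the margin are assumptions made here.

```python
def huber_loss(z: float, h: float) -> float:
    """Huber loss of Definition 15 with smoothness parameter h > 0.

    z is assumed to be the margin (label times prediction).
    """
    if z > 1 + h:
        return 0.0
    if z < 1 - h:
        return 1.0 - z
    # Quadratic interpolation on |1 - z| <= h.
    return (1 + h - z) ** 2 / (4 * h)


if __name__ == "__main__":
    h = 0.1  # the smoothness parameter used in the experiments
    for z in (0.0, 0.9, 1.0, 1.1, 2.0):
        print(z, huber_loss(z, h))
```

The quadratic piece meets the linear piece at z = 1 - h and the zero piece at z = 1 + h, which is what makes the loss smooth.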

C RELATED WORK C.1 PRIVACY-PRESERVING DISTRIBUTED MACHINE LEARNING

There is a rich body of literature about different differentially private distributed learning techniques that protect any individual data point (sometimes called distributed learning with global DP guarantees). One direction uses an untrusted central aggregator; users locally add noise to avoid leakage toward the aggregator. This method computationally scales well with the number of users. Another direction utilizes cryptographic protocols to jointly train a model without a central aggregator. This direction requires less noise for privacy, but the cryptographic protocols face scalability challenges. For local noising, the most prominent and flexible approach is federated learning (McMahan et al., 2017) with DP-SGD approximation (Abadi et al., 2016) (DP-FL). DP-FL proposes each of the n users locally train with the DP-SGD algorithm and share their local gradient updates with a central aggregator. This aggregator updates a global model with the average of the noisy local updates, leading to noise overhead in the order of √ n. This noise overhead can be completely avoided by PPDML protocols that rely on cryptographic methods to hide intermediary training updates from a central aggregator. There are several secure distributed learning methods that protect the contributions during training but do not come with privacy guarantees for the model such as DP: an attacker (e.g., a curious training party) can potentially extract information about the training data from the model. As we focus on differentially private distributed learning methods (PPDML in this paper), we will neglect those methods. cpSGD (Agarwal et al., 2018) is a PPDML protocol that utilizes SMPC methods to honestly generate noise and compute DP-SGD. While cpSGD provides the full flexibility of SGD, it does not scale to millions of users as it relies on expensive SMPC methods. Another recent PPDML work (Truex et al., 2019) relies on a combination of SMPC and DP methods. 
This work, however, also does not scale to millions of users. Another line of research aims for the stronger privacy goal of protecting a user's entire input (called local DP) during distributed learning (Balle et al., 2020b; Girgis et al., 2021) . Due to the strong privacy goal, federated learning with local DP tends to achieve weaker accuracy. With Cor. 9, evaluated in Fig. 2 in Section 5, we show how Secure Distributed DP-Helmet achieves a comparable guarantee via group privacy: given enough users, any user can protect their entire dataset at once while we still reach good accuracy. For DP training of SVMs, there exist other methods, such as objective perturbation and gradient perturbation. When performed under SMPC-based distributed training, both methods would require a significantly higher number of SMPC invocations; hence, they are unsuited for the goals of this work. Appx. C.2 discusses those approaches in detail.

C.2 DIFFERENTIALLY PRIVATE EMPIRICAL RISK MINIMIZATION

On differentially private empirical risk minimization for convex loss functions (Chaudhuri et al., 2011) , which is utilized in this work, the literature discusses three directions: output perturbation, objective perturbation, and gradient perturbation. Output perturbation (Chaudhuri et al., 2011; Wu et al., 2017) estimates a sensitivity on the final model without adding noise, and only in the end adds noise that is calibrated to this sensitivity. We rely on output perturbation because it enables us to only have a single invocation of an SMPC protocol at the end to merge the models while still achieving the same low sensitivity as if the model was trained at a trustworthy central party that collects all data points, trains a model and adds noise in the end. Objective perturbation (Chaudhuri et al., 2011; Kifer et al., 2012; Iyengar et al., 2019; Bassily et al., 2019) adds noise to the objective function instead of adding noise to the final model. In principle, SMPC could also be used to emulate the situation that a central party as above trains a model via objective perturbation. Yet, in that case, each party would have to synchronize with every other party far more often, as no party would be allowed to learn how exactly the objective function would be perturbed. That would result in far higher communication requirements. Concerning gradient perturbation (Bassily et al., 2014; Wang et al., 2017; Feldman et al., 2018; Bassily et al., 2019; Feldman et al., 2020) , recent work has shown tight privacy bounds. In order to achieve the same low degree of required noise as in a central setting, SMPC could be utilized. Yet, for SGD also multiple rounds of communication would be needed as the privacy proof (for convex optimization) does not take into account that intermediary gradients are leaked. Hence, the entire differentially private SGD algorithm for convex optimization would have to be computed in SMPC, similar to cpSGD (see above).

D EXPERIMENTAL SETUP

We leveraged 5-repeated 6-fold stratified cross-validation for all experiments unless stated differently. Privacy accounting has been undertaken either by using the privacy bucket (Meiser & Mohammadi, 2018; Sommer et al., 2019) toolbox² or, for Gaussians without subsampling, with Sommer et al. (2019, Theorem 5), where both can be extended to multivariate Gaussians (see Appx. L). We note that with either of these tactics, $\varepsilon \in O(|K|^{1/2})$. The δ parameter of differential privacy has been set to $\delta = 10^{-5}$ if not stated otherwise, which is for the CIFAR-10 dataset always below $1/n$, where n is the sum of the sizes of all local datasets. Concerning computation resources, for our experiments, we trained 1000 DP_SGD_SVM with 50 data points each, which took 10 minutes on a machine with 2x Intel Xeon Platinum 8168, 24 Cores @2.7 GHz with an Nvidia A100 and allocated 16GB RAM.

For DP_SGD_SVM-based experiments, we utilize the strongly convex projected stochastic gradient descent algorithm (PSGD) as used by Wu et al. (2017). More specifically, we chose a batch size of 20, the Huber loss with a smoothness parameter h = 0.1, a hypothesis space radius R ∈ {0.04, 0.05, 0.06, 0.07, 0.08}, a regularization parameter Λ ∈ {10, 100, 200}, and trained for 500 epochs; for the variant where we protect the whole local dataset, we have chosen a different Λ ∈ {0.5, 1, 2, 5} and R ∈ {0.06, 0.07}. In every experiment, we chose for each parameter combination the best performing regularization parameter Λ as well as R, i.e. those values that lead to the best mean accuracy. This is highly important, as the regularization parameter steers not only the utility but also the amount of noise needed: for each noise level there is a sweet spot where the amount of added noise is on the edge of still being bearable.

For the federated learning experiments, we utilized the opacus³ PyTorch library (Yousefpour et al., 2021), which implements DP-SGD (Abadi et al., 2016). We loosely adapted our hyperparameters to the ones reported by Tramèr & Boneh (2021), who already evaluated DP-SGD on SimCLR's embeddings for the CIFAR-10 dataset. In detail, the neural network is a single-layer perceptron with 61,450 trainable parameters on a 6,144-d input and a 10-d output. The loss function is the categorical cross-entropy on a softmax activation function and training has been performed with stochastic gradient descent. Furthermore, we set the learning rate to 4, the Poisson sample rate $q := 1024/50000$, which in expectation samples a batch size of 1024, trained for 40 epochs, and norm-clipped the gradients with a clipping bound c := 0.1.

In the distributed training scenario, instead of running an end-to-end experiment with full SMPC clients, we evaluate a functionally equivalent abstraction without cryptographic overhead. In our experiments, we randomly split the available data points among the users and emulated scenarios where not all data points were needed by taking the first training data points. However, the validation size remained constant. Moreover, for DP-SGD-based federated learning, we kept a constant batch size whenever enough data is available, i.e. we increased the sampling rate as follows: $q' := 1024/20000$ for 20000, $q'' := 1024/5000$ for 5000, and $q''' := 1023/1024$ for 500 available data points ($|U| \cdot N$).

For DP-SGD-based FL, we emulated a higher number of users by dividing the noise multiplier σ by $|U|^{1/2}$ to the benefit of DP-FL. The justification for dividing by $|U|^{1/2}$ is that in FL the model performance is not expected to differ, as the mean of the gradients of one user is the same as the mean of gradients from different users: SGD computes, just as FL, the mean of the gradients. Yet, the noise will increase by a factor of $|U|^{1/2}$. Hence, we optimistically assume that everything stays the same, just the noise increases by a factor of $|U|^{1/2}$.
et al., 2016) , and AMP (SVM with objective perturbation) (Iyengar et al., 2019) on CIFAR-10 benchmark dataset (left: δ = 10 -5 , right: δ = 2 • 10 -8 ≪ 1 /dataset_size). For comparison, we report a non-private SVM baseline.

F.1 SETUP OF THE ABLATION STUDY

For DP_SMO_SVM-based experiments, we used the liblinear (Fan et al., 2008) library via the Scikit-Learn method LinearSVC⁴ for classification. Liblinear is a fast C++ implementation that uses the SVM-agnostic sequential minimal optimization (SMO) procedure. However, it does not offer a guaranteed and private convergence bound. More specifically, we used the $L_2$-regularized hinge loss, an SMO convergence tolerance of $tol := 2 \cdot 10^{-12}$ with a maximum of 10,000 iterations, which were seldom reached, and a logarithmically spaced inverse regularization parameter $C \in \{3, 6\} \cdot 10^{-8}, \{1, 2, 3, 6\} \cdot 10^{-7}, \{1, 2, 3, 6\} \cdot 10^{-6}, \{1, 2, 3, 6\} \cdot 10^{-5}, \{1, 2\} \cdot 10^{-4}$. To better fit the LinearSVC implementation, the original loss function is rescaled by $1/\Lambda$ and C is set to $1/(\Lambda \cdot n)$ with n as the number of data points. Furthermore, for distributed DP_SMO_SVM training we extended the range of the hyperparameter C, whenever appropriate, up to $3 \cdot 10^{-3}$, which becomes relevant in a scenario with many users and few data points per user. Similar to the DP_SGD_SVM-based experiments, the best performing regularization parameter C was selected for each parameter combination. The non-private reference baseline uses a linear SVM optimized via SMO with the hinge loss and an inverse regularization parameter C = 2 (best performing of $C \in \{\le 5 \cdot 10^{-5}, 0.5, 1, 2\}$). For the ablation study, we also included the Approximate Minima Perturbation (AMP) algorithm⁵ (Iyengar et al., 2019), which resembles an instance of objective perturbation. There, we used an 80-20 train-test split with 10 repeats and the following hyperparameters: L ∈ {0.1, 1.0, 34.854}, eps_frac ∈ {.9, .95, .98, .99}, eps_out_frac ∈ {.001, .01, .1, .5}. We selected (L = 1, eps_out_frac = 0.001, eps_frac = 0.99) as a good performing parameter combination for AMP. For better performance, we resembled the GPU-capable bfgs_minimize from the Tensorflow Probability package.
To provide better privacy guarantees, we leveraged the results of Kairouz et al. (2015a) ; Murtagh & Vadhan (2016) for tighter composition bounds on arbitrary DP mechanisms.

F.2 RESULTS OF THE ABLATION STUDY

For the extended ablation study, we considered the centralized setting (only 1 user) and compare different algorithms as well as different values for the privacy parameter δ. The results are depicted in Fig. 7 and display four algorithms: firstly, the differentially private Support Vector Machine with SGD-based training DP_SGD_SVM (cf. Section 3.1), secondly, a similar differentially private SVM but with SMO-based training which does not offer a guaranteed and private convergence bound, thirdly, differentially private Stochastic Gradient descent (DP-SGD) (Abadi et al., 2016) applied on a 1-layer perceptron with the cross-entropy loss, and fourthly, approximate minima perturbation (AMP) (Iyengar et al., 2019) which is based upon an SVM with objective perturbation. Note that, only DP_SMO_SVM and DP_SGD_SVM have an output sensitivity and are thus suited for this efficient Secure Distributed DP-Helmet scheme. While all algorithms come close to the non-private baseline with rising privacy budgets ε, we observe that although DP-SGD performs best, DP_SMO_SVM comes considerably close, DP_SGD_SVM has a disadvantage above DP_SMO_SVM of about a factor of 2, and AMP a disadvantage of about a factor of 4. We suspect that DP-SGD is able to outperform the other variants as it is the only contestant which directly optimizes for the multi-class objective via the cross-entropy loss while others are only able to simulate it via the one-vs-rest (ovr) SVM training scheme. Although DP_SMO_SVM renders best of the variants with an output sensitivity, it does not offer a privacy guarantee when convergence is not reached. In the case of AMP, we have an inherent disadvantage of about a factor of 3 due to an unknown output distribution, and thus bad composition results in the multi-class SVM. Here, the privacy budget of AMP roughly scales linearly with the number of classes. For DP-SGD, DP_SGD_SVM, and DP_SMO_SVM, Fig. 
7 shows that a smaller and considerably more secure privacy parameter $\delta \ll 1/\text{dataset\_size}$ is supported, although this is reflected in the reported privacy budget ε.

Proof. Without loss of generality, we consider one arbitrary class k ∈ K. We know that T is an s-sensitivity bounded algorithm, thus

$$s = \max_{D_0^{(i)} \sim D_1^{(i)}} \left\| T(D_0^{(i)}, \xi, k) - T(D_1^{(i)}, \xi, k) \right\|$$

with […] $D^{(i)}, \xi, K)$. The challenge element (i.e. the element that differs between $D_0^{(i)}$ and $D_1^{(i)}$) is only contained in one of the |U| SGD_SVMs. By the application of the parallel composition theorem, we know that the sensitivity reduces to

$$\max_{D_0^{(i)} \sim D_1^{(i)},\ \forall i = 1, \dots, |U|} \left\| \frac{1}{|U|} \sum_{i=1}^{|U|} T(D_0^{(i)}, \xi, k) - \frac{1}{|U|} \sum_{i=1}^{|U|} T(D_1^{(i)}, \xi, k) \right\| = s \cdot \frac{1}{|U|}.$$

Hence, the constant $1/|U|$ factor reduces the sensitivity by a factor of $1/|U|$.

H PROOF OF LEM. 7

We recall Lem. 7:

Lemma 7. Given a configuration ζ and any noise scale σ, then

$$\frac{1}{|U|} \sum_{i=1}^{|U|} \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{1}{\sqrt{|U|}}\right)^2\right) = \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{1}{|U|}\right)^2\right).$$

Proof. We have to show that

$$\frac{1}{|U|} \sum_{i=1}^{|U|} \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{1}{\sqrt{|U|}}\right)^2\right) = \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{1}{|U|}\right)^2\right).$$

It can be shown that the sum of normally distributed random variables behaves as follows: Let $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ be two independent normally distributed random variables; then their sum $Z = X + Y$ is distributed as $Z \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$. Thus, in this case, we have

$$\frac{1}{|U|} \sum_{i=1}^{|U|} \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{1}{\sqrt{|U|}}\right)^2\right) = \frac{1}{|U|} \mathcal{N}\!\left(0, |U| \cdot \left(\sigma \cdot \tfrac{1}{\sqrt{|U|}}\right)^2\right) = \frac{1}{|U|} \mathcal{N}(0, \sigma^2).$$

As the normal distribution belongs to the location-scale family, we get $\mathcal{N}(0, (\sigma \cdot 1/|U|)^2)$.

I PROOF OF THM. 8

We state the full version of Thm. 8.

Proof. We first show (ε, δ)-DP for a variant $M_1$ of Secure Distributed DP-Helmet that uses the ideal summation protocol $F$ instead of $\pi_{\mathrm{SecAgg}}$. Then, we conclude from Thm. 14 that for Secure Distributed DP-Helmet (abbreviated as $M_2$), which uses the real secure summation protocol $\pi_{\mathrm{SecAgg}}$, $(\varepsilon, \delta + \nu_1)$-DP holds for some negligible function $\nu_1$.
Recall that we assume at least $t \cdot |U|$ many honest users. As we solely rely on the honest $t \cdot |U|$ users to contribute correctly distributed noise to the learning algorithm T, we have for each class, similar to Lem. 7,

$$\frac{1}{|U|} \sum_{i=1}^{t \cdot |U|} \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{1}{\sqrt{|U|}}\right)^2\right) = \sum_{i=1}^{t \cdot |U|} \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{1}{|U| \sqrt{|U|}}\right)^2\right) = \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{\sqrt{t \cdot |U|}}{|U| \sqrt{|U|}}\right)^2\right) = \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{\sqrt{t}}{|U|}\right)^2\right).$$

Hence, we scale the noise parameter σ with $1/\sqrt{t}$ and get

$$\frac{1}{|U|} \sum_{i=1}^{t \cdot |U|} \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{1}{\sqrt{t \cdot |U|}}\right)^2\right) = \mathcal{N}\!\left(0, \left(\sigma \cdot \tfrac{1}{|U|}\right)^2\right).$$

By Cor. 6, Lem. 7, and Lem. 3, we know that $M_1$ satisfies (ε, δ)-DP (with the parameters as described above). Hence, considering an unbounded attacker A and Thm. 14, we know that for any pair of neighboring data sets $D, D'$ the following holds:

$$\Pr[A(M_1(D)) = 1] \le \exp(\varepsilon) \Pr[A(M_1(D')) = 1] + \delta.$$

By Thm. 14, we know that $\pi_{\mathrm{SecAgg}}(s_1, \dots, s_n)$ securely emulates $F$ (w.r.t. an unbounded attacker). Hence, there is a negligible function ν such that for any neighboring data sets $D, D'$ (differing in at most one element) the following holds w.l.o.g.:

$$\Pr[A(M_2(D)) = 1] - \nu(\eta) \le \Pr[A(\mathrm{Sim}_F(M_1(D))) = 1].$$

For the attacker $A'$ that first applies Sim and then A, we get:

$$\Pr[A(M_2(D)) = 1] - \nu(\eta) \le \exp(\varepsilon) \Pr[A(\mathrm{Sim}_F(M_1(D'))) = 1] + \delta \le \exp(\varepsilon) \left( \Pr[A(M_2(D')) = 1] + \nu(\eta) \right) + \delta, \tag{8}$$

thus we have

$$\Pr[A(M_2(D)) = 1] \le \exp(\varepsilon) \Pr[A(M_2(D')) = 1] + \delta + (1 + \exp(\varepsilon)) \cdot \nu(\eta).$$

From a similar argumentation it follows that $\Pr[A(M_2(D')) = 1] \le \exp(\varepsilon) \Pr[A(M_2(D)) = 1] + \delta + (1 + \exp(\varepsilon)) \cdot \nu(\eta)$ holds. Hence, with $\nu_1 := (1 + \exp(\varepsilon)) \cdot \nu(\eta)$, the Secure Distributed DP-Helmet mechanism $M_2$, which uses $\pi_{\mathrm{SecAgg}}$, is $(\varepsilon, \delta + \nu_1)$-DP. As ν is negligible and ε is constant, $\nu_1$ is negligible as well.

J PROOF OF COR. 10

We recall Cor. 10:

Corollary 10 (User-level sensitivity). Given a learning algorithm T, we say that T is R-norm bounded if for any input dataset D with N = |D|, any hyperparameter ξ, and all classes k ∈ K, $\|T(D, \xi, k)\| \le R$.
Any R-norm bounded learning algorithm T has a sensitivity s = 2R. In particular, $T + \mathcal{N}(0, (\sigma \cdot s \cdot I_d)^2)$ satisfies (ε, δ), Υ-group differential privacy with Υ = N and $\varepsilon \ge \sqrt{2 \ln\!\left(1.25 / (\delta / |K|)\right)} \cdot |K| \cdot 1/\sigma$, where $\mathcal{N}(0, (\sigma \cdot s \cdot I_d)^2)$ is spherical multivariate Gaussian noise and σ a noise multiplier.

Proof. First observe that for any s-sensitivity-bounded function $q''$, two adjacent inputs $D, D'$ (differing in one element) with $\|q''(D) - q''(D')\|_2 = s$ are worst-case inputs. As a spherical Gaussian distribution (covariance matrix $\Sigma = \sigma^2 \cdot I_{p \times n}$) is rotation invariant, there is a rotation such that the difference only occurs in one dimension and has length s. Hence, it suffices to analyze a univariate Gaussian distribution with sensitivity s, and the privacy loss distribution of both mechanisms (for the worst-case inputs) is the same. As a result, for all ε ≥ 0, δ ∈ [0, 1] (i.e. the privacy profile is the same), if (ε, δ)-ADP holds for the univariate Gaussian mechanism, it also holds for the multivariate Gaussian mechanism.
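The noise arithmetic of Lem. 7 and of the honest-user scaling in the proof of Thm. 8 can be checked numerically. The sketch below is an illustration under assumed example values (σ = 2, 10 or 100 users, t = 0.5); the function names are hypothetical. It computes the standard deviation of the averaged noise in closed form and compares it against a small Monte Carlo estimate.

```python
import math
import random
import statistics


def std_of_average(sigma: float, n_users: int, n_noising=None) -> float:
    """Std of (1/n_users) * sum of n_noising i.i.d. N(0, (sigma / sqrt(n_users))^2) draws."""
    if n_noising is None:
        n_noising = n_users  # every user contributes noise, as in Lem. 7
    per_user_var = (sigma / math.sqrt(n_users)) ** 2
    return math.sqrt(n_noising * per_user_var) / n_users


def empirical_std(sigma: float, n_users: int, samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the std of the averaged per-user noise."""
    rng = random.Random(seed)
    scale = sigma / math.sqrt(n_users)
    draws = [sum(rng.gauss(0.0, scale) for _ in range(n_users)) / n_users
             for _ in range(samples)]
    return statistics.pstdev(draws)


if __name__ == "__main__":
    sigma, users, t = 2.0, 100, 0.5
    print(std_of_average(sigma, users))                                  # sigma / |U|
    print(std_of_average(sigma, users, int(t * users)))                  # sigma * sqrt(t) / |U|
    print(std_of_average(sigma / math.sqrt(t), users, int(t * users)))   # back to sigma / |U|
    print(empirical_std(sigma, 10))                                      # close to sigma / 10
```

Scaling σ by $1/\sqrt{t}$ restores the target noise level even when only a fraction t of the users adds noise honestly.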

M STABILITY OF AVERAGING MODELS

Definition 18 (Uniform Stability, Definition 2.1 in Hardt et al. (2016)). Let $f(h, z)$ denote a loss function on hypothesis h and instance z. A randomized algorithm A is $\epsilon_{\mathrm{stab}}$-uniformly stable if for all datasets $S, S' \in Z^n$ of size n such that S and S′ differ in at most one example, we have

$$\sup_z \mathbb{E}_A\left[ f(A(S); z) - f(A(S'); z) \right] \le \epsilon_{\mathrm{stab}}.$$

Theorem 19 (Averaging models is uniformly stable). Given a set of users $U^{(i)} \in U$ each with a local data set $D^{(i)} \in Z$ originating from an unknown data distribution Z, a learning algorithm T with a Λ-strongly convex, L-Lipschitz, and β-smooth training objective $J(f, D^{(i)}, K)$ on model parameters f (like SGD_SVM of Alg. 1), an averaging routine $\mathrm{avg}(T(℧)) = \frac{1}{|U|} \sum_{i=1}^{|U|} T(D^{(i)}, \xi, K)$ with $℧ := \bigcup_{i}^{|U|} D^{(i)}$ (like in Alg. 2), and the projected SGD update routine for a c-norm clipped data point $z_m^{(i)} \in D^{(i)}$ and class k ∈ K, i.e.

$$f_{m+1}^{(i)} = \Pi_{\|f\| \le R}\left( f_m^{(i)} - \alpha_m \frac{\partial}{\partial f} J(f_m^{(i)}, z_m, k) \right) =: G,$$

then for a constant learning rate $\alpha \le 1/\beta$, M steps, and $N := |℧|$ total data points, T is $\epsilon_{\mathrm{stab}}$-uniformly stable in the sense of Bousquet & Elisseeff (2002) with

$$\left| \mathbb{E}_{D, \mathrm{SGD\_SVM}}\left[ J(\mathrm{avg}(T(℧)), ℧, \_) - \mathbb{E}_{z \in Z}\left[ J(\mathrm{avg}(T(℧)), z, \_) \right] \right] \right| \le \epsilon_{\mathrm{stab}} \le \frac{2L^2}{\Lambda N} \in O(N^{-1}).$$

Proof. By definition of uniform stability (Hardt et al., 2016), we need to bound, for a given z, k,

$$\mathbb{E}_T\left[ J\!\left( \frac{1}{|U|} \sum_{i=1}^{|U|} f_M^{(i)}, z, k \right) - J\!\left( \frac{1}{|U|} \sum_{i=1}^{|U|} f_M'^{(i)}, z, k \right) \right] \le \epsilon_{\mathrm{stab}}$$

with $f_M^{(i)} = T(D^{(i)}, \xi, k)$ and $f_M'^{(i)} = T(D'^{(i)}, \xi, k)$ respectively after M steps, where $\bigcup_i D^{(i)}$, $\bigcup_i D'^{(i)}$ are 1-neighboring datasets. We know due to the Lipschitz condition that for a given z, k

$$\mathbb{E}\left[ J\!\left( \frac{1}{|U|} \sum_{i=1}^{|U|} f_M^{(i)}, z, k \right) - J\!\left( \frac{1}{|U|} \sum_{i=1}^{|U|} f_M'^{(i)}, z, k \right) \right] \le L\, \mathbb{E}[\delta_M]$$

with

$$\delta_m = \left\| \frac{1}{|U|} \sum_{i=1}^{|U|} \left( f_m'^{(i)} - f_m^{(i)} \right) \right\| \le \frac{1}{|U|} \sum_{i=1}^{|U|} \left\| f_m'^{(i)} - f_m^{(i)} \right\|.$$

Next, we need to bound $\mathbb{E}[\delta_M]$ by defining a modified growth recursion (Hardt et al., 2016, Lemma 2.5) for two arbitrary sequences of gradient updates $G_1, \dots, G_M$ and $G'_1, \dots, G'_M$, the starting point $f_0^{(i)} = f_0'^{(i)}$, any $i \in [1, |U|]$, and some $j \in [1, |U|]$ as

$$\delta_0 = 0, \qquad \delta_{m+1} \le \begin{cases} \eta \delta_m & \text{if } G_m^{(i)} = G_m'^{(i)} \text{ is } \eta\text{-expansive} \\ \eta \delta_m + \frac{2 \sigma_m}{|U|} & \text{if } G_m^{(j)} \text{ and } G_m'^{(j)} \text{ are } \sigma\text{-bounded, } G_m^{(i)} \text{ is } \eta\text{-expansive.} \end{cases}$$

Note that we consider the differing element occurring only in one local gradient update and not in each one. We recall the definition of a gradient update as $f_{m+1} = G_m(f_m)$ and $f'_{m+1} = G'_m(f'_m)$.

Proof, growth recursion (case I).

$$\delta_{m+1} = \left\| \frac{1}{|U|} \sum_{i=1}^{|U|} \left( G_m(f_m'^{(i)}) - G_m(f_m^{(i)}) \right) \right\| \le \frac{1}{|U|} \sum_{i=1}^{|U|} \eta \left\| f_m'^{(i)} - f_m^{(i)} \right\| \;[\dots]$$

Having established the growth recursion, we now combine the bounds with their probability of occurrence as well as calculate the corresponding η-expansiveness and σ-boundedness terms. […] $G_m^{(i)} \ne G_m'^{(i)}$. Thus, we have

$$\mathbb{E}[\delta_{m+1}] \le \left(1 - \frac{|U|}{N}\right) \eta\, \mathbb{E}[\delta_m] + \frac{|U|}{N} \left( \eta\, \mathbb{E}[\delta_m] + \frac{2 \sigma_m}{|U|} \right) \le \left(1 - \frac{|U|}{N}\right)(1 - \alpha \Lambda)\, \mathbb{E}[\delta_m] + \frac{|U|}{N} (1 - \alpha \Lambda)\, \mathbb{E}[\delta_m] + \frac{|U|}{N} \cdot \frac{2 \alpha L}{|U|} = (1 - \alpha \Lambda)\, \mathbb{E}[\delta_m] + \frac{2 \alpha L}{N}.$$

The remaining part follows the proof of Hardt et al. (2016, Theorem 3.9) with the Lipschitzness of the training objective as well as the growth recursion $\mathbb{E}[\delta_{m+1}]$. In short, we unfold the recursion:

$$\mathbb{E}[\delta_M] \le \frac{2 L \alpha}{N} \sum_{m=1}^{M} (1 - \alpha \Lambda)^m \le \frac{2L}{\Lambda N}$$

and insert it into our initial bound, which gets us for all k and any z

$$\mathbb{E}\left[ J\!\left( \frac{1}{|U|} \sum_{i=1}^{|U|} f_M^{(i)}, z, k \right) - J\!\left( \frac{1}{|U|} \sum_{i=1}^{|U|} f_M'^{(i)}, z, k \right) \right] \le \frac{2 L^2}{\Lambda N}.$$

Note that this proof permits the learning rate scheduling in Secure Distributed DP-Helmet, which has been set to $\alpha_m := \min(1/\beta, 1/(\Lambda m))$ for iteration m and the β-smooth and Λ-strongly convex objective. The bias term b depends on Z, β, Λ, L.
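The unfolding step at the end of the stability proof can be sanity-checked by iterating the recursion $\mathbb{E}[\delta_{m+1}] \le (1 - \alpha\Lambda)\,\mathbb{E}[\delta_m] + 2\alpha L / N$ with equality, starting from $\delta_0 = 0$. The sketch below is an illustration only; the function name and the concrete constants (L = 1, Λ = 10, β = 20, N = 50,000) are assumptions and not values from the experiments.

```python
def unrolled_delta(alpha: float, lam: float, lipschitz: float, n_points: int, steps: int) -> float:
    """Iterate delta_{m+1} = (1 - alpha * Lambda) * delta_m + 2 * alpha * L / N from delta_0 = 0."""
    delta = 0.0
    for _ in range(steps):
        delta = (1.0 - alpha * lam) * delta + 2.0 * alpha * lipschitz / n_points
    return delta


if __name__ == "__main__":
    lipschitz, lam, beta, n_points = 1.0, 10.0, 20.0, 50_000  # hypothetical constants
    alpha = 1.0 / beta                                        # constant rate alpha <= 1 / beta
    bound = 2.0 * lipschitz / (lam * n_points)                # 2L / (Lambda * N)
    for steps in (1, 10, 1000):
        print(steps, unrolled_delta(alpha, lam, lipschitz, n_points, steps), bound)
```

For $0 < \alpha\Lambda \le 1$ the iterates increase monotonically towards the fixed point $2L/(\Lambda N)$ and never exceed it, independent of the number of steps M.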

N CONVERGENCE OF AVERAGING MODELS

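Proof. The proof is based on Nemirovski et al.'s proof of convergence for strongly convex SGD_SVM training (Nemirovski et al., 2009, Section 2.1). Subsequently, we abbreviate the output of the learning algorithm T at iteration m for the i-th user with $f_m^{(i)} := T(D^{(i)}, \xi, k)$. First, we define the convergence criterion $A_m$ at the iterate m and then its recursive growth $A_{m+1}$:

$$A_m = \frac{1}{2} \mathbb{E}\left[ \left\| \frac{1}{|U|} \left( \sum_{i=1}^{|U|} f_m^{(i)} \right) - f^* \right\|^2 \right].$$

Our convergence criterion describes that we measure and subsequently seek to bound the difference in the weights between the averaged T's $\frac{1}{|U|} \sum_{i=1}^{|U|} f_m^{(i)}$ and the optimal weights $f^*$ for the loss J on the combined data of all users. Subsequently, we abbreviate $G(f) := \frac{\partial}{\partial f} J(f, \_, \_)$.

$$\begin{aligned}
A_{m+1} &= \frac{1}{2} \mathbb{E}\left[ \left\| \frac{1}{|U|} \left( \sum_{i=1}^{|U|} \Pi_{\|f\| \le R}\left( f_m^{(i)} - \alpha_m G(f_m^{(i)}) \right) \right) - f^* \right\|^2 \right] \\
&= \frac{1}{2} \mathbb{E}\left[ \left\| \frac{1}{|U|} \left( \sum_{i=1}^{|U|} \Pi_{\|f\| \le R}\left( f_m^{(i)} - \alpha_m G(f_m^{(i)}) \right) \right) - \Pi_{\|f\| \le R}(f^*) \right\|^2 \right] \\
&\le \frac{1}{2} \mathbb{E}\left[ \left\| \frac{1}{|U|} \left( \sum_{i=1}^{|U|} f_m^{(i)} - \alpha_m G(f_m^{(i)}) \right) - f^* \right\|^2 \right] \\
&= A_m + \frac{1}{2} \alpha_m^2\, \mathbb{E}\left[ \left\| \frac{1}{|U|} \sum_{i=1}^{|U|} G(f_m^{(i)}) \right\|^2 \right] - \alpha_m\, \mathbb{E}\left[ \left( \frac{1}{|U|} \left( \sum_{i=1}^{|U|} f_m^{(i)} \right) - f^* \right)^{T} \left( \frac{1}{|U|} \sum_{i=1}^{|U|} G(f_m^{(i)}) \right) \right],
\end{aligned}$$

where the last step uses the binomial expansion $\langle x + y, x + y \rangle = \langle x, x \rangle + 2 \langle x, y \rangle + \langle y, y \rangle$ and linearity of expectation. Recall that because of the L-Lipschitz continuity of J, $\|G(f)\| \le L$. Hence,

$$\mathbb{E}\left[ \left\| \frac{1}{|U|} \sum_{i=1}^{|U|} G(f_m^{(i)}) \right\|^2 \right] = \mathbb{E}\left[ \left\| G\!\left( \frac{1}{|U|} \sum_{i=1}^{|U|} f_m^{(i)} \right) \right\|^2 \right] \le L^2.$$

We now have for the recursion

$$A_{m+1} \le A_m - \alpha_m\, \mathbb{E}\left[ \left( \frac{1}{|U|} \left( \sum_{i=1}^{|U|} f_m^{(i)} \right) - f^* \right)^{T} G\!\left( \frac{1}{|U|} \sum_{i=1}^{|U|} f_m^{(i)} \right) \right] + \frac{1}{2} \alpha_m^2 L^2.$$

Recall that strong convexity states that $(f' - f)^T (\nabla J(f') - \nabla J(f)) \ge \mu \|f' - f\|^2, \forall f', f$. Hence, we also know for the optimal $f^*$ that $(f' - f^*)^T \nabla J(f') \ge \mu \|f' - f^*\|^2, \forall f'$. With $f' := \frac{1}{|U|} \sum_{i=1}^{|U|} f_m^{(i)}$, we conclude

$$\mathbb{E}\left[ \left( \frac{1}{|U|} \left( \sum_{i=1}^{|U|} f_m^{(i)} \right) - f^* \right)^{T} G\!\left( \frac{1}{|U|} \sum_{i=1}^{|U|} f_m^{(i)} \right) \right] \ge \mu\, \mathbb{E}\left[ \left\| \frac{1}{|U|} \left( \sum_{i=1}^{|U|} f_m^{(i)} \right) - f^* \right\|^2 \right] = 2 \mu A_m.$$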
The strong convexity constant µ = Λ can be determined by $H(J(f, \bigcup_i D^{(i)}, \_)) \succeq \Lambda I,\ \forall f$, where H is the Hessian matrix, I the identity matrix, and $B \succeq \iota I$ means that $B - \iota I$ is positive semidefinite. As the argument above about J's strong convexity holds for any f, it also holds for $f = \frac{1}{|U|}\sum_{i=1}^{|U|} f^{(i)}_m$. In summary, we now have

$$A_{m+1} \le (1 - 2\Lambda\alpha_m) A_m + \frac{1}{2}\alpha_m^2 L^2.$$

Note that the bias term b depends on how many iterations are conducted with the constant learning rate 1/β. The smoothness assumption can be equivalently formulated as

$$\|\nabla J(f) - \nabla J(f')\| \le \beta \|f - f'\|,\ \forall f, f' \;\Leftrightarrow\; J(f) \le J(f^*) + \frac{1}{2}\beta \|f - f^*\|^2,$$



accessible at https://github.com/google-research/simclr, Apache-2.0 license
accessible at https://github.com/sommerda/privacybuckets, MIT license
accessible at https://github.com/pytorch/opacus/, Apache-2.0 license
https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html, BSD-3-Clause license
reference implementation by the authors: https://github.com/sunblaze-ucb/dpml-benchmark, MIT license



Figure 1: Schematic overview of Secure Distributed DP-Helmet. Each user locally extracts a simplified data representation via a pre-training feature extractor (SimCLR), then trains a model, e.g. an SVM, via a learning algorithm T , and finally contributes a model which is carefully noised with a spherical Σ-parameterized Gaussian to a single invoked secure summation step which results in an averaged and (ε, δ)-DP model. ξ denotes some hyperparameters and K a set of classes.

Algorithm 1: SGD_SVM(D, ξ, K) with hyperparameters ξ := (h, c, Λ, R, M)
Data: dataset $D := \{(x_i, y_i)\}_{i=1}^{N}$ where $x_i$ is structured as $[1, x_{i,1}, \dots, x_{i,p}]$; set of classes K; Huber loss smoothness parameter $h \in \mathbb{R}_+$; input clipping bound $c \in \mathbb{R}_+$; number of iterations M; regularization parameter $\Lambda \in \mathbb{R}_+$; model clipping bound $R \in \mathbb{R}_+$
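The body of the algorithm is not reproduced in this section, so the following is only a sketch of what a routine with these inputs could look like. It assumes a one-vs-rest setup over the classes, a Huber-smoothed hinge loss with an L2 regularizer Λ/2·‖f‖², a smoothness constant β = c²/(2h) + Λ, and the learning rate min(1/β, 1/(Λm)) used elsewhere in the paper. The function names and the toy data are hypothetical.

```python
import numpy as np


def clip_to_norm(v, bound):
    """Scale v down so that its Euclidean norm is at most `bound`."""
    norm = np.linalg.norm(v)
    return v if norm <= bound else v * (bound / norm)


def huber_hinge_derivative(z, h):
    """Derivative of the Huber-smoothed hinge loss w.r.t. the margin z."""
    if z > 1.0 + h:
        return 0.0
    if z < 1.0 - h:
        return -1.0
    return -(1.0 + h - z) / (2.0 * h)


def sgd_svm(X, y, classes, h, c, lam, R, M, seed=0):
    """One-vs-rest projected SGD; rows of X are expected as [1, x_1, ..., x_p]."""
    rng = np.random.default_rng(seed)
    Xc = np.array([clip_to_norm(x, c) for x in X], dtype=float)  # input clipping
    beta = c ** 2 / (2.0 * h) + lam          # assumed smoothness constant
    models = np.zeros((len(classes), Xc.shape[1]))
    for idx, k in enumerate(classes):
        f = np.zeros(Xc.shape[1])
        for m in range(1, M + 1):
            i = rng.integers(len(Xc))        # sample one data point per iteration
            label = 1.0 if y[i] == k else -1.0
            z = label * (f @ Xc[i])
            grad = huber_hinge_derivative(z, h) * label * Xc[i] + lam * f
            alpha = min(1.0 / beta, 1.0 / (lam * m))
            f = clip_to_norm(f - alpha * grad, R)   # projection onto ||f|| <= R
        models[idx] = f
    return models


# Toy data: two perfectly separable classes on a single feature.
X = np.array([[1.0, 1.0], [1.0, -1.0]] * 20)
y = np.array([1, 0] * 20)
models = sgd_svm(X, y, classes=[0, 1], h=0.1, c=2.0, lam=0.1, R=10.0, M=200)
```

Each row of `models` is one hyperplane (intercept first), and every row has norm at most R because of the projection step.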

Algorithm 2: Secure Distributed DP-Helmet. For T = SGD_SVM (cf. Alg. 1) we have $s = \frac{2(c + R\Lambda)}{N\Lambda}$ with hyperparameters ξ := (h, c, Λ, R, M). $\pi_{\text{SecAgg}}$ is described in Bell et al. (2020, Algorithm 2) and can be extended to floating-point values using fixed-point arithmetic.
def Client Secure Distributed DP-Helmet(D, |U|, K, T, t, ξ, σ):
Data: local dataset D with N = |D|; number of users |U|; set of classes K; training algorithm T; ratio t of honest users; hyperparameters ξ; noise multiplier σ
Result: DP-models (intercepts with p-dimensional hyperplanes):
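The client and aggregation steps are only given by their signature here, so the sketch below illustrates the data flow under explicit assumptions: the secure summation is replaced by a plain sum, and each client is assumed to add spherical Gaussian noise with standard deviation s·σ/√(t·|U|), so that the honest clients alone contribute noise of standard deviation s·σ to the sum. The function names `client_message` and `aggregate` are hypothetical; only the sensitivity formula is taken from the text.

```python
import numpy as np


def sensitivity_sgd_svm(c, R, lam, n_local):
    """s = 2(c + R*Lambda) / (N*Lambda), as stated for T = SGD_SVM."""
    return 2.0 * (c + R * lam) / (n_local * lam)


def client_message(local_model, s, sigma, n_users, honest_ratio, rng):
    """Noise one local model before it enters the secure sum.

    Assumption: every client adds spherical Gaussian noise with standard
    deviation s * sigma / sqrt(honest_ratio * n_users).
    """
    std = s * sigma / np.sqrt(honest_ratio * n_users)
    return local_model + rng.normal(0.0, std, size=local_model.shape)


def aggregate(messages):
    """Plain-sum stand-in for the single secure summation, followed by averaging."""
    return np.sum(messages, axis=0) / len(messages)


rng = np.random.default_rng(0)
local_models = [rng.normal(size=(10, 5)) for _ in range(8)]   # 8 users, 10 classes
s = sensitivity_sgd_svm(c=1.0, R=1.0, lam=1.0, n_local=4)

# With sigma = 0 the pipeline reduces to plain model averaging.
noiseless = aggregate([client_message(f, s, 0.0, 8, 0.5, rng) for f in local_models])
# With sigma > 0 every client perturbs its model before the sum.
noisy = aggregate([client_message(f, s, 2.0, 8, 0.5, rng) for f in local_models])
```

A real deployment would replace `aggregate` with the secure aggregation protocol and encode the noised models in fixed-point arithmetic before summation.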

and a function ν negligible in the security parameter of the secure aggregation. The full statement and proof are in Appx. I. Simplified, the proof follows by applying the sensitivity (cf. Cor. 6) to the Gauss mechanism (cf. Lem. 3), where the noise is applied per user (cf. Lem. 7). If each user contributes 50 data points and we have 1000 users, $N \cdot |U| = 50{,}000$. Next, we show that we can protect the entire dataset of a single user (e.g., for distributed training via smartphones). The sensitivity-based bound on the Gaussian mechanism (see Appx. K) directly implies that we can achieve strong Υ-group privacy results, which is equivalent to local DP.

Corollary 9 (Group-private variant). Given a configuration ζ, Secure Distributed DP-Helmet(ζ) (cf. Alg. 2) satisfies computational (ε, δ + ν), Υ-group DP with $\varepsilon \ge \Upsilon \cdot \sqrt{2 \ln\big(1.25/(\delta/|K|)\big)} \cdot |K| \cdot 1/\sigma$ for ν as above: for any pair of datasets ℧, ℧′ that differ in at most Υ data points, Secure Distributed DP-Helmet(ζ(. . . , ℧, . . . )) $\approx_{\varepsilon,\delta}$ Secure Distributed DP-Helmet(ζ(. . . , ℧′, . . . )).
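To get a feeling for the bound, it can be evaluated directly. The helper below is a small sketch of the stated inequality ε ≥ Υ · √(2 ln(1.25/(δ/|K|))) · |K| / σ; the function name and the example values are illustrative. Note that the classical Gaussian-mechanism bound this expression resembles is usually stated for ε < 1, so large outputs should be read with care.

```python
import math


def epsilon_lower_bound(sigma, delta, n_classes, group_size=1):
    """Smallest epsilon permitted by eps >= Upsilon * sqrt(2 ln(1.25/(delta/|K|))) * |K| / sigma."""
    per_class_delta = delta / n_classes
    root = math.sqrt(2.0 * math.log(1.25 / per_class_delta))
    return group_size * root * n_classes / sigma


# Illustrative values: 10 classes, noise multiplier 100, delta = 1e-5.
eps_single = epsilon_lower_bound(sigma=100.0, delta=1e-5, n_classes=10)
eps_group = epsilon_lower_bound(sigma=100.0, delta=1e-5, n_classes=10, group_size=50)
print(eps_single, eps_group)
```

The group-private bound grows linearly in Υ, so protecting a full local dataset of 50 points costs a factor of 50 in ε at the same noise multiplier.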

Figure 2: (ε, Υ)-heatmap for classification accuracy of Secure Distributed DP-Helmet on the CIFAR-10 dataset (left: $\delta = 10^{-10}$; right: $\delta = 10^{-12}$) with 200,000 (left) and 20,000,000 (right) users. We train 1,000 models on 50 data points each; to emulate having more users we rescale the ε-values ($\varepsilon' := 1000 \cdot \varepsilon \cdot \Upsilon / n_{\text{users}}$) and report the respective (interpolated) accuracy values. We extrapolate the privacy guarantees due to the limited dataset size. Our accuracy values are pessimistic as we keep the accuracy numbers that we got from averaging 1,000 models. Actually taking the mean over 200,000 or even 20,000,000 users should provide better results. In our evaluation, Υ = 50 group privacy is comparable to local DP. A lower value of Υ < 50 places trust in users as local aggregators. For Υ ≥ 2, we can use a tighter group-privacy bound (cf. Cor. 10); hence, the accuracy values are the same as for Υ = 50 = N, where the entire local data set of a user is protected.

Figure 3: Classification accuracy compared to the privacy budget ε (in $\log_{10}$ scale) of Secure Distributed DP-Helmet (cf. Section 4) and DP-SGD-based federated learning (FL) on the CIFAR-10 dataset ($\delta = 10^{-5}$). (left) We use all available data points of CIFAR-10 for each line, spreading them among a differing number of users. (right) Different numbers of users with 50 data points per user.

Figure 4: Classification accuracy versus number of users with 50 data points per user for a fixed ε = 0.5885, $\delta = 10^{-5}$. Values for FL are interpolated.

Figure 5: Pre-training: Schematic overview. Dashed lines denote data flow in the training phase and solid lines in the inference phase.

Figure 6: 2-d projection of the CIFAR-10 dataset via t-SNE (Van der Maaten & Hinton, 2008) with colored labels. Note that t-SNE is defined on the local neighborhood; thus, global patterns or structures may be arbitrary.

neighboring datasets. For instance, for T = SGD_SVM we have $s = \frac{2(c + R\Lambda)}{N\Lambda}$ (cf. Lem. 2). By Alg. 2, we take the average of multiple local models, i.e. $\text{avg}(T(℧)) = \frac{1}{|U|}\sum_{i=1}^{|U|} T(D^{(i)}, \xi, K)$.

Theorem 8 (Main Theorem, full). Given a configuration ζ, a maximum fraction of dropouts ρ ∈ [0, 1], and a maximum fraction of corrupted clients γ ∈ [0, 1], if secure authenticated encryption schemes and authenticated key agreement protocols exist, then Secure Distributed DP-Helmet(ζ) (cf. Alg. 2) satisfies computational (ε, δ + ν₁)-DP with $\varepsilon \ge \sqrt{2 \ln\big(1.25/(\delta/|K|)\big)} \cdot |K| \cdot 1/\sigma$, for $\nu_1 := (1 + \exp(\varepsilon)) \cdot \nu(\eta)$ and a function ν negligible in the security parameter η used in secure aggregation.

Definition 20 (Convergence). Let f(h, z) denote a loss function on hypothesis h and instance z and $F_S(h) := \frac{1}{|S|}\sum_{z \in S} f(h, z)$ the empirical risk on some dataset S. An algorithm A converges with rate $\epsilon_{\text{conv}}$ under a data distribution Z if $\mathbb{E}_{S \in Z}[F_S(A(S)) - \inf_h F_S(h)] \le \epsilon_{\text{conv}}$.

Theorem 21 (Averaging models converges). Given a set of users $U^{(i)} \in U$ each with a local data set $D^{(i)}$, a learning algorithm T with a Λ-strongly convex, L-Lipschitz, and β-smooth training objective $J(f, D^{(i)}, K)$ on model parameters f (like SGD_SVM of Alg. 1), an averaging routine $\text{avg}(T(℧)) = \frac{1}{|U|}\sum_{i=1}^{|U|} T(D^{(i)}, \xi, K)$ with $℧ := \bigcup_i^{|U|} D^{(i)}$ (like in Alg. 2), and the projected SGD update routine for a c-norm clipped data point $z^{(i)}_m \in D^{(i)}$ and class k ∈ K, i.e. f ) =: G, then for a diminishing learning rate $\alpha_m = \min(\frac{1}{\beta}, \frac{1}{\Lambda m})$, M steps, a given Z := 1 |U | ( f * , and a bias term b, T converges to $f^* := \text{argmin}_f J(f, ℧, \_)$ with

$$\mathbb{E}[J(\text{avg}(T(℧)), ℧, \_) - J(f^*, ℧, \_)] \le \epsilon_{\text{conv}} \le \frac{\beta L^2}{2\Lambda^2}(M - 1)^{-1} + b(M - 1)^{-2} \in O(M^{-1}).$$
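The O(M⁻¹) rate can be illustrated by unrolling the recursion from the convergence proof, A_{m+1} ≤ (1 − 2Λα_m)A_m + ½α_m²L², with the learning rate α_m = min(1/β, 1/(Λm)). The sketch below treats the inequality as an equality (worst case) and assumes β ≥ 2Λ so that the contraction factor stays non-negative; the function name and the parameter values are illustrative.

```python
def unroll_convergence_recursion(beta, lam, lipschitz, a0, steps):
    """Iterate A_{m+1} = (1 - 2*lam*alpha_m) * A_m + 0.5 * alpha_m**2 * L**2.

    alpha_m = min(1/beta, 1/(lam*m)); assumes beta >= 2*lam so that the
    factor (1 - 2*lam*alpha_m) is never negative.
    """
    a = a0
    for m in range(1, steps + 1):
        alpha = min(1.0 / beta, 1.0 / (lam * m))
        a = (1.0 - 2.0 * lam * alpha) * a + 0.5 * alpha ** 2 * lipschitz ** 2
    return a


# Illustrative constants only.
beta, lam, lipschitz, a0 = 10.0, 1.0, 1.0, 1.0
a_1k = unroll_convergence_recursion(beta, lam, lipschitz, a0, 1_000)
a_10k = unroll_convergence_recursion(beta, lam, lipschitz, a0, 10_000)
print(a_1k, a_10k, a_1k / a_10k)
```

Increasing the number of iterations tenfold shrinks the criterion by roughly a factor of ten, as expected from a leading term proportional to 1/M.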

$\forall f \Leftrightarrow \|H(J(f))\| \le \beta,\ \forall f$. Similarly to the argumentation above, since β-smoothness holds for any f, it also holds for f = 1 M . By unraveling the recursive formula of $A_M$ (cf. Equation (21)) we get with the base case A 0 -2Λα n A 0 . Recall the learning rate $\alpha_m = \min(\frac{1}{\beta}, \frac{1}{\Lambda m})$. At $m_0 = \frac{\beta}{\Lambda}$ we switch the learning rate from $\frac{1}{\beta}$ to $\frac{1}{\Lambda m}$. First, we consider the case $m \le m_0$, where we rewrite for a constant learning rate $\alpha_m = \frac{1}{\beta}$ and $\varphi = \frac{L^2}{2\beta}$: m + m 0 (β -Λ)A 0 =: b ′ Next, we consider the case $m > m_0$, where we rewrite for a diminishing learning rate $\alpha_m = \frac{1}{\Lambda m}$ as well as ς = βL 2 0 -1)m 0 b ′ -m 0 ) =: b $\le \varsigma (M - 1)^{-1} + b(M - 1)^{-2}$. If we let M approach ∞, i.e. assume a sufficient number of iterations, we further simplify lim , ℧, _) -J (f * , ℧, _)

Comparison to related work for n users with m data points each: utility guarantee to the population optimum, DP noise scale, and number of SMPC invocations. In DP-FL an untrusted aggregator combines the updates from each user, while each user update satisfies DP (by adding noise and norm-clipping each gradient). It requires one communication round for each of the M training iterations.

for M many training iterations.

Since each user samples during each training iteration one data point, we have for a given iteration a probability of |U | /N that an individual data point of ℧ has been chosen resulting in differing gradient updates G


Proof. We know that the sensitivity of the learning algorithm T is defined as $s = \max_{D \sim D'} \|T(D, \xi, k) - T(D', \xi, k)\|$ for Υ-neighboring datasets D, D′. Thus, in our case we have s = 2R since any T(_, ξ, k) ∈ [-R, R]. As this holds independently of the dataset, and by Lem. 3 and Lem. 16, we can protect an arbitrary number of data points per user, i.e. we have Υ-group DP.

K GROUP PRIVACY REDUCTION OF MULTIVARIATE GAUSSIAN

Lemma 16. Let $\text{pdf}_{N(A,B)}[x]$ denote the probability density function of the multivariate Gaussian distribution with location and scale parameters A, B, evaluated on an atomic event x. For any atomic event x, any covariance matrix Σ, any group size k ∈ N, and any mean µ, we get

Proof.

As the Gaussian distribution belongs to the location-scale family, Lem. 16 directly implies that the (ε, δ)-DP guarantees of using $N(0, k^2\Sigma)$ noise for sensitivity k and using $N(0, \Sigma)$ noise for sensitivity 1 are the same.
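The scaling property behind this remark can be checked numerically. The sketch below does not reproduce the lemma's own equation; it assumes that a group of size k corresponds to scaling the mean shift by k and verifies that the log-density ratio (the privacy loss) of N(0, k²Σ) versus N(kµ, k²Σ) at the point k·x equals that of N(0, Σ) versus N(µ, Σ) at x. All names and values are illustrative.

```python
import numpy as np


def log_density_ratio(x, mean, cov):
    """log( pdf_N(0, cov)[x] / pdf_N(mean, cov)[x] ); normalizing constants cancel."""
    precision = np.linalg.inv(cov)
    shifted = x - mean
    return -0.5 * x @ precision @ x + 0.5 * shifted @ precision @ shifted


rng = np.random.default_rng(1)
p, k = 3, 50
root = rng.normal(size=(p, p))
cov = root @ root.T + np.eye(p)      # arbitrary positive definite covariance
mean = rng.normal(size=p)            # mean shift for sensitivity 1
x = rng.normal(size=p)               # arbitrary atomic event

base = log_density_ratio(x, mean, cov)
scaled = log_density_ratio(k * x, k * mean, k ** 2 * cov)
print(base, scaled)
```

Because the two privacy losses coincide pointwise under the rescaling x → k·x, the two mechanisms induce the same privacy-loss distribution and therefore the same (ε, δ) guarantees.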

L REPRESENTING MULTIVARIATE GAUSSIANS AS UNIVARIATE GAUSSIANS

For the sake of completeness, we rephrase a proof that we first saw in Abadi et al. (2016), which argues that the multivariate Gaussian mechanism can sometimes be reduced to the univariate Gaussian mechanism.

Lemma 17. Let $\text{pdf}_{N(\mu, \text{diag}(\sigma^2))}$ denote the probability density function of a multivariate (p ≥ 1) spherical Gaussian distribution with location and scale parameters $\mu \in \mathbb{R}^p$, $\sigma \in \mathbb{R}^p_+$. Let $M_{\text{gauss},p,q}$ be the p-dimensional Gaussian mechanism $D \mapsto q(D) + N(0, \sigma^2 \cdot I_p)$ for $\sigma^2 > 0$ of a function $q : \mathcal{D} \to \mathbb{R}^p$, where $\mathcal{D}$ is the set of datasets. Then, for any p ≥ 1, if q is s-sensitivity-bounded, there is another s-sensitivity-bounded function $q' : \mathcal{D} \to \mathbb{R}$ such that the following holds: for all ε ≥ 0, δ ∈ [0, 1], if $M_{\text{gauss},1,q'}$ satisfies (ε, δ)-ADP, then $M_{\text{gauss},p,q}$ satisfies (ε, δ)-ADP.
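The intuition for this reduction can be illustrated numerically: for spherical noise, the privacy loss of the p-dimensional mechanism at an output x depends only on the projection of x onto the direction q(D′) − q(D). The sketch below is an illustration under that reading, not the lemma's proof; the function names and values are hypothetical.

```python
import numpy as np


def privacy_loss_multivariate(x, q_d, q_d_prime, sigma):
    """log( pdf_N(q_d, sigma^2 I)[x] / pdf_N(q_d_prime, sigma^2 I)[x] )."""
    return (np.sum((x - q_d_prime) ** 2) - np.sum((x - q_d) ** 2)) / (2.0 * sigma ** 2)


def privacy_loss_univariate(t, distance, sigma):
    """log( pdf_N(0, sigma^2)[t] / pdf_N(distance, sigma^2)[t] )."""
    return ((t - distance) ** 2 - t ** 2) / (2.0 * sigma ** 2)


rng = np.random.default_rng(2)
p, sigma = 5, 1.5
q_d, q_d_prime = rng.normal(size=p), rng.normal(size=p)
distance = np.linalg.norm(q_d_prime - q_d)
direction = (q_d_prime - q_d) / distance

x = q_d + sigma * rng.normal(size=p)     # one output of the p-dimensional mechanism
t = direction @ (x - q_d)                # its projection onto the difference direction

loss_p = privacy_loss_multivariate(x, q_d, q_d_prime, sigma)
loss_1 = privacy_loss_univariate(t, distance, sigma)
print(loss_p, loss_1)
```

Both losses agree, and the projected output t is itself distributed as a univariate Gaussian with the same σ, which is why a one-dimensional analysis with the same sensitivity suffices.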

