CROSS-SILO TRAINING OF DIFFERENTIALLY PRIVATE MODELS WITH SECURE MULTIPARTY COMPUTATION

Abstract

We address the problem of learning a machine learning model from training data that originates at multiple data owners in a cross-silo federated setup, while providing formal privacy guarantees regarding the protection of each owner's data. Existing solutions based on Differential Privacy (DP) achieve this at the cost of a drop in accuracy. Solutions based on Secure Multiparty Computation (MPC) do not incur such accuracy loss but leak information when the trained model is made publicly available. We propose an MPC solution for training differentially private models. Our solution relies on an MPC protocol for model training, and an MPC protocol for perturbing the trained model coefficients with Laplace noise in a privacy-preserving manner. The resulting MPC+DP approach achieves higher accuracy than a pure DP approach, while providing the same formal privacy guarantees.

1. INTRODUCTION

The ability to induce a machine learning (ML) model from data that originates at multiple data owners (clients) in a cross-silo federated setup, while protecting the privacy of each data owner, is of great practical value in a wide range of applications. Most prominently, training on more data typically yields higher quality ML models. For instance, one could train a more accurate model to predict the length of hospital stay of COVID-19 patients when combining data from multiple clinics. This is a cross-silo application where the data is horizontally distributed, meaning that each data owner (clinic) holds records/rows of the data (horizontal federated learning, HFL). Furthermore, being able to combine different data sets enables new applications that pool together data from multiple data owners, or even from different data owners within the same organization. An example of this is an ML model that relies on lab test results as well as healthcare bill payment information about patients, which are usually managed by different departments within a hospital system. This is an example of a cross-silo application where the data is vertically distributed, i.e. each data owner holds their own columns (vertical federated learning, VFL). While there are clear advantages to training ML models over data that is distributed across multiple data owners, in practice these data owners often do not want to disclose their data to each other, either because the data in itself constitutes a competitive advantage, or because the data owners need to comply with data privacy regulations. Formal privacy guarantees can be provided by DP, albeit at the cost of an accuracy loss that is inversely proportional to the privacy budget (see Sec. 2). To mitigate this accuracy loss, we propose an MPC solution for training DP models.

Our Approach.
Rather than having each party train local models on its own data set, we have the parties run an MPC protocol over the totality of the data sets, without requiring any party to disclose its private information to anyone. Since we restrict our analysis to generalized linear models, the parties then use MPC to generate the necessary noise and privately add it to the weights of the trained classifier to satisfy the DP requirements. We show that this procedure yields the same accuracy and DP guarantees as in the global DP model, but without requiring the parties to reveal their data, model parameters, or gradients to a central aggregator, or to anyone else for that matter. Indeed, the MPC protocols effectively play the role of a trusted curator implementing global DP. The resulting classifier can be published in the clear, or used for private inference on top of MPC. Our solution is applicable in cross-silo federated scenarios in which the data is horizontally distributed as well as in cross-silo federated scenarios where the data is vertically distributed. It obtained the highest accuracy in the iDASH2021 Track III competition on confidential computing, where the challenge was to propose a federated learning algorithm for training a model to predict the risk of wild-type transthyretin amyloid cardiomyopathy using medical claims data from different hospitals, while providing DP guarantees.
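The output-perturbation step above can be sketched in plaintext Python. In the actual solution both the training and the noise generation happen inside MPC, so the coefficients and the noise are never visible to any single party; the sensitivity value, weights, and function names below are purely illustrative.

```python
import math
import random

def sample_laplace(scale):
    # Inverse-CDF sampling from Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturb_coefficients(coeffs, sensitivity, eps):
    # Output perturbation: adding Laplace noise with scale
    # sensitivity/eps to each trained coefficient makes the
    # released model eps-DP under the given sensitivity bound.
    scale = sensitivity / eps
    return [c + sample_laplace(scale) for c in coeffs]

trained = [0.8, -1.2, 0.05]  # hypothetical trained GLM weights
released = perturb_coefficients(trained, sensitivity=0.1, eps=1.0)
```

The released coefficients can then be published or used for inference; only the noisy values ever leave the (here simulated) trusted curator.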
An (ϵ, δ)-DP randomized algorithm A is commonly created out of an algorithm A* by adding noise that is proportional to the sensitivity of A*. We describe the Laplace noise technique that we use to this end in detail in Sec. 4.

Secure Multiparty Computation. MPC is an umbrella term for cryptographic approaches that allow two or more parties to jointly compute a specified output from their private information in a distributed fashion, without revealing the private information to each other (Cramer et al. (2015)). MPC is concerned with the protocol execution coming under attack by an adversary who may corrupt one or more of the parties to learn private information or to cause the result of the computation to be incorrect. MPC protocols are designed to prevent such attacks from succeeding, and can be mathematically proven to guarantee privacy and correctness. We follow the standard definition of the Universal Composability (UC) framework (Canetti (2000)), in which the security of protocols is analyzed by comparing a real world with an ideal world. For details, see Evans et al. (2018). An adversary can corrupt a certain number of parties. In a dishonest-majority setting the adversary may corrupt half or more of the parties, while in an honest-majority setting more than half of the parties are always honest (not corrupted). Furthermore, the adversary can have different levels of adversarial power. In the semi-honest model, even corrupted parties follow the instructions of the protocol, but the adversary attempts to learn private information from the internal state of the corrupted parties and the messages that they receive. MPC protocols that are secure against semi-honest or "passive" adversaries prevent such leakage of information. In the malicious adversarial model, the corrupted parties can arbitrarily deviate from the protocol specification. Providing security in the presence of malicious or "active" adversaries, i.e.
ensuring that no such adversarial attack can succeed, comes at a higher computational cost than in the passive case. The protocols in Sec. 4 are sufficiently generic to be used in dishonest-majority as well as honest-majority settings, with passive or active adversaries. This is achieved by changing the underlying MPC scheme to align with the desired security setting. As an illustration, we describe the well-known additive secret-sharing scheme for dishonest-majority 2PC with passive adversaries. In Sec. 5 we additionally present results for honest-majority 3PC and 4PC schemes with passive and active adversaries; for details about those MPC schemes we refer to Araki et al. (2016); Dalskov et al. (2021). In the additive secret-sharing 2PC scheme there are two computing parties, nicknamed Alice and Bob. All computations are done on integers modulo an integer q. The modulus q is a hyperparameter that defines the algebraic structure in which the computations are done. A value x in Z_q = {0, 1, ..., q-1} is secret shared between Alice and Bob
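A minimal sketch of this scheme (the modulus and variable names are illustrative choices, not the paper's) shows how sharing, reconstruction, and local addition of shares work:

```python
import random

Q = 2**64  # public modulus defining the ring Z_Q

def share(x):
    # Split x into two additive shares: Alice's share is uniformly
    # random, so on its own it reveals nothing about x.
    a = random.randrange(Q)
    b = (x - a) % Q
    return a, b

def reconstruct(a, b):
    return (a + b) % Q

# Secret-shared addition: each party adds its local shares;
# no communication is needed and neither party sees the other's input.
xa, xb = share(25)
ya, yb = share(17)
za, zb = (xa + ya) % Q, (xb + yb) % Q
```

Additions of secret-shared values are thus free of communication, whereas multiplications require interaction between the parties, which is one reason the choice of MPC scheme affects the runtime.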



http://www.humangenomeprivacy.org/2021/competition-tasks.html



The importance of enabling privacy-preserving model training in federated setups has spurred a large research effort in this domain, most notably in the development and use of Privacy-Enhancing Technologies (PETs), prominently including Federated Learning (FL) (Kairouz et al. (2021)), Differential Privacy (DP) (Dwork et al. (2014)), Secure Multiparty Computation (MPC) (Cramer et al. (2015)), and Homomorphic Encryption (HE) (Lauter (2021)). Each of these techniques has its own (dis)advantages. Approaches based on (combinations of) FL, MPC, or HE alone do not provide sufficient protection if the trained model is to be made publicly known, or even if it is only made available for black-box query access, because information about the model and its training data is leaked through the ability to query the model (Fredrikson et al. (2015); Tramèr et al. (2016); Song et al. (2017); Carlini et al. (2019)).

DP is concerned with providing aggregate information about a data set D without disclosing information about specific individuals in D (Dwork et al. (2014)). A data set D′ that differs in a single entry from D is called a neighboring database. A randomized algorithm A is called (ϵ, δ)-DP if for all pairs of neighboring databases D and D′, and for all subsets S of A's range, P(A(D) ∈ S) ≤ e^ϵ · P(A(D′) ∈ S) + δ. In other words, A is DP if A generates similar probability distributions over outputs on neighboring data sets D and D′. The parameter ϵ ≥ 0 denotes the privacy budget or privacy loss, while δ ≥ 0 denotes the probability of a privacy violation, with smaller values indicating stronger privacy guarantees in both cases. ϵ-DP is shorthand for (ϵ, 0)-DP. A can, for instance, be an algorithm that takes as input a data set D of training examples and outputs an ML model.
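The definition can be illustrated empirically with the Laplace mechanism applied to a counting query, whose sensitivity is 1 because neighboring data sets differ in a single entry: adding Laplace(1/ϵ) noise makes the query ϵ-DP, so the output probabilities on neighboring data sets stay within a factor e^ϵ of each other. The data sets, output set S, and sample size below are arbitrary choices for this sketch.

```python
import math
import random

def sample_laplace(scale):
    # Inverse-CDF sampling from Laplace(0, scale).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

eps = 1.0

def A(dataset):
    # Counting query with Laplace(1/eps) noise: eps-DP,
    # since the count's sensitivity is 1.
    return len(dataset) + sample_laplace(1.0 / eps)

D = list(range(100))        # a data set with 100 entries
D_prime = list(range(99))   # a neighboring data set (one entry removed)

N = 200_000
in_S = lambda y: y <= 99.5  # an arbitrary output set S
p = sum(in_S(A(D)) for _ in range(N)) / N
p_prime = sum(in_S(A(D_prime)) for _ in range(N)) / N
# (eps, 0)-DP predicts p <= exp(eps) * p_prime, and symmetrically
# p_prime <= exp(eps) * p; the estimated frequencies satisfy both.
```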

