Scrunch: PREVENTING SENSITIVE PROPERTY INFERENCE THROUGH PRIVACY-PRESERVING REPRESENTATION LEARNING

Abstract

Many tasks commonly performed by devices attached to the Internet are currently being offloaded to the cloud, following the Machine Learning as a Service (MLaaS) paradigm. While this paradigm is motivated by the reduced capacity of mobile terminals, it also endangers the privacy of the data exchanged over the network. The data exchanged among parties should therefore be conveniently anonymized to prevent possible confidentiality and privacy issues. While many privacy-enhancing algorithms have been proposed in the past, they usually rely on very complex models that hamper their applicability to real-world systems, or they assume overly friendly attacker models. In this paper, we propose a deep learning system that creates anonymized representations of the data while keeping the accuracy for the targeted MLaaS task high, under the assumption that the attacker can re-train an adversarial model. Our results show that the proposed algorithm i) is effective while using a lighter approach than the state of the art, ii) considers less friendly attacker models, and iii) outperforms the benchmark under different privacy metrics.

1. INTRODUCTION

The complexity and size of ML models keep growing over time. Recent examples, such as GPT-3 with 175B parameters [OpenAI (2022, accessed September 28, 2022)] or Megatron-Turing with 530B [Nvidia (2022, accessed September 28, 2022)], are models that are impossible to train or even maintain for most companies in the world, let alone academia or users with personal devices. Moreover, similar growth is expected in the coming years [Fedus et al. (2021)]. This progression, together with the slowdown in the production of new hardware, severely limits the capacity of small (and even big) enterprises to use the latest advances in Natural Language Processing (NLP), image recognition, or other complex ML tasks. In this scenario, big tech companies have started to offer their models in a Machine Learning as a Service (MLaaS) fashion: they run the gigantic ML models on their premises and allow customers to query them for a pre-negotiated fare. This model is convenient both for customers that cannot create their own complex model (e.g., because they do not have a tagged dataset) and for those that need to execute (even simple) ML tasks on limited devices such as mobile phones or IoT devices. However, to perform an MLaaS task, the customer must send the raw data (e.g., an image) to the service provider. While this operation may not present big problems for certain tasks (e.g., a connected vehicle sending telemetry data for predictive maintenance), it certainly has heavy privacy/confidentiality implications for others (e.g., a surveillance system requesting image classification services). Alternatively, the service provider could give the model to the customer to avoid the data transfer; nonetheless, this is typically not feasible in the case of limited devices or huge models.
Even in cases where the customer could execute the model, the MLaaS provider may have concerns, as the customer could black-box the model or use it without the provider's permission. In this paper, we present Scrunch, a technique that enables the usage of MLaaS without the privacy implications of sending raw data to third parties. In our technique, a previously trained model is split into two parts, and its first part is fine-tuned with an additional, second loss function to ensure that the information flowing past the split point is valuable only for the task at hand, and not for any other task. Starting from a pre-trained model makes it easy to reuse existing models without training them from scratch. After both parts have been trained with the new, combined loss function, the first part can be shipped to customers, who can execute it even with limited resources and transfer only the resulting data representations. The rest of the model stays with the service provider, ensuring that customers cannot make illegitimate use of the entire model. Scrunch is thus able to create privacy-preserving data representations: it provides accuracy similar to that of a neural network without privacy protection and, at the same time, higher privacy than state-of-the-art privacy solutions.

The remainder of this paper is organized as follows. We present the privacy model in Section 2; we then implement our solution for two different neural network architectures and data sets and evaluate its performance in Section 3. In Section 4, we show how the model parameters affect its behavior. Finally, Section 5 concludes the paper.

The application of privacy-preserving techniques to data sharing and ML has been widely studied in the past years, with solutions ranging from the already classic k-anonymity [Sweeney (2002)], l-diversity [Machanavajjhala et al. (2007)], or t-closeness [Li et al. (2007)] to more novel solutions such as z-anonymity [Jha et al. (2020)]. Among all of them, Differential Privacy (DP) [Dwork et al. (2006)] is certainly the most accepted and used by the ML community. DP grants a formal guarantee about how likely the data is to leak sensitive information, i.e., information beyond what the data owner legitimately intends to make publicly available. The problem to be solved in our scenario, instead, concerns 'inference privacy', i.e., reducing the amount of information that is sent or published in the first place. In addition, applying DP, e.g., in the form of noise, to the data with no further tweaks usually tends to quickly degrade the whole informational content, including what should be kept usable.

1.1. STATE OF THE ART

Other approaches that try to preserve the privacy of exchanged data employ advanced cryptographic techniques. Two particularly researched approaches today are Fully Homomorphic Encryption (FHE) and Secure Multi-Party Computation (SMC). FHE makes direct inference on encrypted data possible [Gilad-Bachrach et al. (2016); Gentry (2009); Bos et al. (2013)]; since the data is never decrypted, its privacy is guaranteed. FHE usually suffers from an accuracy drop with complex networks, since it works by approximating a neural network with a low-degree polynomial function. Its major drawback, however, is the computational cost: the complexity of the encryption schemes increases inference time by many orders of magnitude, making FHE impractical for real-time use cases. SMC, in turn, makes it possible for two entities to compute a function over their inputs while keeping those inputs perfectly private [Mohassel & Zhang (2017); Liu et al. (2017)]. SMC scenarios are usually based on garbled circuits, secret sharing, and oblivious transfer, and SMC likewise suffers from high cryptographic complexity. Another popular field of research concerns how to securely run an ML model on a cloud machine. Proposals from this field rely on trusted execution environments such as Intel SGX and ARM TrustZone [Tramer & Boneh (2018); Hanzlik et al. (2021)]. Nevertheless, such scenarios still require the client to trust the cloud servers with their data. Finally, closer to our work, another sub-field of the privacy-preserving research community tries to generate privacy-preserving data representations. AutoGAN [Nguyen et al. (2020)] proposes a non-linear dimension-reduction framework based on a GAN structure, in which a Generator and a Discriminator are iteratively trained in turn, in an adversarial manner, to enforce a certain distance between the original and potentially reconstructed data. Osia et al. (2018) protect the data against the execution of previously known ML tasks. Moreover, works like [Osia et al. (2020)], Cloak [Mireshghallah et al. (2020a)] and Shredder [Mireshghallah et al. (2020b)] apply a
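The split-model idea described above can be illustrated with a minimal sketch. The following toy NumPy snippet is purely illustrative (all layer shapes, names, and the weighting factor are our assumptions, not the paper's actual architecture): a client-side encoder produces the representation that leaves the device, and fine-tuning combines the legitimate task's loss with a penalty that rewards confusing an adversary trying to infer a sensitive attribute from that representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Client-side encoder: the only part shipped to the customer (toy linear layer).
W_enc = rng.normal(size=(16, 8))
# Provider-side task head, plus a simulated adversary head (illustrative shapes).
W_task = rng.normal(size=(8, 3))
W_adv = rng.normal(size=(8, 2))

def encode(x):
    # Representation sent over the network instead of the raw data.
    return np.tanh(x @ W_enc)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

x = rng.normal(size=(4, 16))       # a toy input batch
y_task = np.array([0, 1, 2, 0])    # legitimate task labels
y_priv = np.array([0, 1, 0, 1])    # sensitive attribute an attacker may target

z = encode(x)
task_loss = cross_entropy(softmax(z @ W_task), y_task)
adv_loss = cross_entropy(softmax(z @ W_adv), y_priv)

# Combined objective: keep the task accurate while making the sensitive
# attribute hard to infer; lam (hypothetical) balances the two terms.
lam = 0.5
combined_loss = task_loss - lam * adv_loss
```

In an actual training loop, the encoder would be updated to minimize `combined_loss` while the adversary head is re-trained to minimize `adv_loss`, mirroring the assumption that the attacker can re-train an adversarial model on the published representations.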


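The utility cost of applying DP noise to the data "with no further tweaks" can also be illustrated with a toy experiment (a sketch under our own assumptions, not the paper's evaluation): calibrating Laplace noise to a strong privacy budget swamps the very feature a downstream task would need.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "representation": 1000 samples of a unit-variance feature.
signal = rng.normal(size=1000)

def laplace_mechanism(values, sensitivity, epsilon, rng):
    # Classic output perturbation: noise scale = sensitivity / epsilon.
    scale = sensitivity / epsilon
    return values + rng.laplace(scale=scale, size=values.shape)

# A small epsilon (strong guarantee) forces a large noise scale, so the
# noisy output's spread is dominated by noise rather than by the feature.
noisy = laplace_mechanism(signal, sensitivity=1.0, epsilon=0.1, rng=rng)

print(float(np.std(signal)), float(np.std(noisy)))
```

Here the noise standard deviation (scale sqrt(2)/epsilon for unit sensitivity) is roughly an order of magnitude larger than the signal's, which is exactly the indiscriminate degradation that motivates task-aware representations instead.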