Scrunch: PREVENTING SENSITIVE PROPERTY INFERENCE THROUGH PRIVACY-PRESERVING REPRESENTATION LEARNING

Abstract

Many tasks that are commonly performed by devices attached to the Internet are currently being offloaded to the cloud under the Machine Learning as a Service (MLaaS) paradigm. While this paradigm is motivated by the limited capacity of mobile terminals, it also endangers the privacy of the data exchanged over the network. The data exchanged among parties should therefore be suitably anonymized to prevent confidentiality and privacy issues. While many privacy-enhancing algorithms have been proposed in the past, they usually rely on very complex models that hamper their applicability to real-world systems, or they assume overly friendly attacker models. In this paper, we propose a deep learning system that creates anonymized representations of the data while keeping accuracy high for the targeted MLaaS task, assuming that the attacker can re-train an adversarial model. Our results show that the proposed algorithm i) is effective while using a lighter approach than the state of the art, ii) considers less friendly attacker models, and iii) outperforms the benchmark under different privacy metrics.

1. INTRODUCTION

The complexity and size of ML models keep growing over time. Recent examples, such as GPT-3 with 175B parameters [OpenAI (2022), accessed September 28, 2022] or Megatron-Turing with 530B [Nvidia (2022), accessed September 28, 2022], are models that are impossible to train or even maintain for most companies in the world, let alone academia or users with personal devices. Moreover, similar growth is expected in the coming years [Fedus et al. (2021)]. This progression, together with the slowdown in the production of new hardware, severely limits the capacity of small (and even big) enterprises to use the latest advances in Natural Language Processing (NLP), image recognition, or other complex ML tasks.

In this scenario, big tech companies have started to offer their models in a Machine Learning as a Service (MLaaS) fashion. That is, they run the gigantic ML models on their premises and allow customers to query the model for a pre-negotiated fee. This model is convenient both for customers that cannot create their own complex model (e.g., because they do not have a labeled dataset) and for those that need to execute (even simple) ML tasks on limited devices such as mobile phones or IoT devices. However, to perform an MLaaS task, the customer must send the raw data (e.g., an image) to the service provider. While this operation may not pose big problems for certain tasks (e.g., a connected vehicle sending telemetry data for predictive maintenance), it certainly has heavy privacy/confidentiality implications for others (e.g., a surveillance system requesting image classification services). Alternatively, the service provider could give the model to the customer to avoid the data transfer. Nonetheless, this is typically not feasible in the case of limited devices or huge models. And even in cases where the customer could execute the model, the MLaaS provider may have concerns, as the customer could black-box the model or use it without the provider's permission.
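To make the setting concrete, the sketch below illustrates one common way an adversarially trained anonymized representation can be learned on the client side, in the spirit of the system described in the abstract. It is purely illustrative: the module names, dimensions, and the particular min-max loss are assumptions for exposition, not the actual Scrunch architecture.

```python
# Illustrative sketch (assumed PyTorch-style setup), not the paper's actual method.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Runs on the client; maps raw data to an anonymized representation."""
    def __init__(self, in_dim=784, rep_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, rep_dim))
    def forward(self, x):
        return self.net(x)

class Head(nn.Module):
    """Linear classifier head, used both for the MLaaS task and the adversary."""
    def __init__(self, rep_dim=64, n_classes=10):
        super().__init__()
        self.net = nn.Linear(rep_dim, n_classes)
    def forward(self, z):
        return self.net(z)

encoder, task_head, adversary = Encoder(), Head(n_classes=10), Head(n_classes=2)
opt_main = torch.optim.Adam(list(encoder.parameters()) +
                            list(task_head.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
lam = 1.0  # trade-off between task utility and privacy (illustrative value)

def training_step(x, y_task, y_private):
    # 1) Train the adversary to infer the sensitive property from the
    #    (detached) representation, mimicking an attacker that re-trains
    #    its own model on observed representations.
    z = encoder(x).detach()
    adv_loss = ce(adversary(z), y_private)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # 2) Train encoder + task head to keep task accuracy high while making
    #    the adversary's predictions uninformative (maximizing its loss).
    z = encoder(x)
    loss = ce(task_head(z), y_task) - lam * ce(adversary(z), y_private)
    opt_main.zero_grad(); loss.backward(); opt_main.step()
```

Under this kind of scheme, only the output of the encoder would leave the device, so the MLaaS provider operates on representations rather than raw inputs; how Scrunch itself achieves and evaluates this is detailed in the following sections.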

