PROPERTY INFERENCE ATTACKS AGAINST t-SNE PLOTS

Abstract

With the prevalence of machine learning (ML), researchers have shown that ML models are vulnerable to various privacy and security attacks. As one representative attack, the property inference attack aims to infer private/sensitive properties of the training data (e.g., its race distribution) given the output of an ML model. In this paper, we present a new side channel for property inference attacks: t-SNE plots, which are widely used to show feature distributions or to demonstrate model performance. We show for the first time that the private/sensitive properties of the data used to generate such a plot can be successfully predicted. Briefly, we leverage a publicly available model as the shadow model to generate t-SNE plots with different (known) properties. We use those plots to train an attack model, a simple image classifier, to infer the specific property of a given t-SNE plot. Extensive evaluation on four datasets shows that our attack effectively infers the undisclosed property of the data presented in a t-SNE plot, even when the shadow model differs from the target model used to generate the plot. We also show that the attack is robust in various scenarios, such as when it is constructed with fewer t-SNE plots or different density settings, and when it targets t-SNE plots generated by fine-tuned target models. The simplicity of our attack indicates that the risk of leaking sensitive properties through t-SNE plots is largely underestimated. As possible defenses, we observe that adding noise to the image embeddings or to the t-SNE coordinates effectively mitigates the attack, but such defenses can be bypassed by adaptive attacks, which prompts the need for more effective ones.
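To make the side channel concrete, the following minimal sketch shows how the kind of shareable t-SNE plot studied here is produced from model representations. It assumes a publicly available pretrained ResNet-18 as the feature extractor and scikit-learn's t-SNE with default-like settings; these stand in for whatever target or shadow model and plotting configuration are actually used.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Assumed feature extractor: a public pretrained ResNet-18 with its
# classification head removed, yielding 512-d embeddings.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
extractor.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> np.ndarray:
    """Map a batch of (N, 3, 224, 224) images to 512-d embeddings."""
    return extractor(images).squeeze(-1).squeeze(-1).numpy()

def render_tsne_plot(embeddings: np.ndarray, out_path: str) -> None:
    """Reduce embeddings to 2-D with t-SNE and save an unlabeled
    scatter plot -- the kind of figure routinely shared in papers."""
    coords = TSNE(n_components=2, init="pca",
                  perplexity=30).fit_transform(embeddings)
    plt.figure(figsize=(4, 4))
    plt.scatter(coords[:, 0], coords[:, 1], s=4)
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()
```

Nothing about the underlying data's sensitive attributes is explicitly drawn in such a plot; the attack exploits the fact that the geometry of the point cloud still correlates with them.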

1. INTRODUCTION

Machine learning (ML) models are becoming increasingly powerful and can be used as feature extractors to generate representations (also known as embeddings) for input data (He et al., 2016; Sandler et al., 2018; Huang et al., 2017). However, such representations still live in a high-dimensional space (e.g., 512 dimensions for ResNet-18). To better understand the representations of the data or to demonstrate a model's performance, practitioners usually apply dimensionality reduction techniques such as t-SNE (van der Maaten & Hinton, 2008) (abbreviation for t-distributed stochastic neighbor embedding) to reduce high-dimensional representations to a 2-dimensional space for visualization.

Despite being powerful, ML models have also been shown to be vulnerable to various privacy attacks that aim to reveal sensitive information about the training dataset given access to the target model. The property inference attack (Ganju et al., 2018; Zhou et al., 2022; Mahloujifar et al., 2022) is one representative attack, whereby the adversary aims to infer sensitive global properties of the training dataset (e.g., its race distribution) from the representations generated by the target model. t-SNE plots, on the other hand, are condensations of data representations. Such plots are usually considered safe and are shared with the public via scientific papers or blog posts. However, it is unclear whether such plots leak sensitive property information about the data as well.

Our Work. In this paper, we take the first step toward understanding privacy leakage from t-SNE plots through the lens of property inference attacks against such plots. Here, we consider a general property to be macro-level information about the dataset, e.g., its race distribution. Note that this property is not necessarily related to the purpose of the t-SNE plot: for instance, even when a t-SNE plot is made to show how distinguishable the gender classes are, the adversary can still infer the race distribution of the data behind it.
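To illustrate the attack side described above, property inference over t-SNE plots can be cast as plain image classification: the adversary renders shadow t-SNE plots whose underlying property value is known, then trains a classifier to predict that value for a victim plot. The tiny CNN and the five-bucket label granularity below are illustrative assumptions, not the exact attack configuration used in our evaluation.

```python
import torch
import torch.nn as nn

# Hypothetical attack model: a small CNN mapping a rendered t-SNE plot
# (as an RGB image) to one of K property buckets, e.g. K = 5 buckets
# for the proportion of one group in {0%, 25%, 50%, 75%, 100%}.
class AttackCNN(nn.Module):
    def __init__(self, num_buckets: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_buckets)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1))

def train_attack(model: nn.Module, loader, epochs: int = 10) -> None:
    """Standard supervised training: each example pairs a shadow t-SNE
    plot with the (known) property bucket of the data behind it."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for plots, buckets in loader:
            opt.zero_grad()
            loss_fn(model(plots), buckets).backward()
            opt.step()
```

At inference time, the adversary simply feeds the published t-SNE plot, rendered at the same resolution as the shadow plots, to the trained model and reads off the predicted property bucket.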

