PROPERTY INFERENCE ATTACKS AGAINST t-SNE PLOTS

Abstract

With the prevalence of machine learning (ML), researchers have shown that ML models are vulnerable to various privacy and security attacks. As one of the representative attacks, the property inference attack aims to infer the private/sensitive properties of the training data (e.g., the race distribution) given the output of ML models. In this paper, we present a new side channel for property inference attacks, i.e., t-SNE plots, which are widely used to show feature distributions or demonstrate model performance. We show for the first time that the private/sensitive properties of the data used to generate such plots can be successfully predicted. Briefly, we leverage a publicly available model as the shadow model to generate t-SNE plots with different properties. We then use these plots to train an attack model, a simple image classifier, to infer the specific property of a given t-SNE plot. Extensive evaluation on four datasets shows that our proposed attack can effectively infer the undisclosed property of the data presented in t-SNE plots, even when the shadow model differs from the target model used to generate the plots. We also show that the attack is robust in various scenarios, such as constructing the attack with fewer t-SNE plots or different density settings, and attacking t-SNE plots generated by fine-tuned target models. The simplicity of our attack method indicates that the potential risk of leaking sensitive properties through t-SNE plots is largely underestimated. As possible defenses, we observe that adding noise to the image embeddings or t-SNE coordinates effectively mitigates the attack but can be bypassed by adaptive attacks, which prompts the need for more effective defenses.

1. INTRODUCTION

Machine learning (ML) models have become powerful and are often used as feature extractors to generate representations (also known as embeddings) for input data (He et al., 2016; Sandler et al., 2018; Huang et al., 2017). However, such representations still lie in a high-dimensional space (e.g., 512 dimensions for ResNet-18). To better understand the representations of data or to demonstrate a model's performance, people usually use dimension-reduction techniques such as t-SNE (t-distributed stochastic neighbor embedding) (van der Maaten & Hinton, 2008) to project high-dimensional representations into a 2-dimensional space for visualization. Despite being powerful, ML models have also been shown to be vulnerable to various privacy attacks that aim to reveal sensitive information about the training dataset given access to the target model. The property inference attack (Ganju et al., 2018; Zhou et al., 2022; Mahloujifar et al., 2022) is one representative attack, whereby the adversary aims to infer sensitive global properties of the training dataset (e.g., the race distribution) from the representations generated by the target model. t-SNE plots, on the other hand, are condensations of data representations. Such plots are usually considered safe and are shared with the public via scientific papers or blog posts. However, it is unclear whether such plots also leak sensitive property information about the data.

Our Work. In this paper, we take the first step toward understanding the privacy leakage from t-SNE plots through the lens of the property inference attack against such plots. Here, we consider a general property to be macro-level information about the dataset, e.g., the race distribution. Note that this property is not necessarily related to what the t-SNE plot is intended to show; for instance, the plot may be published to demonstrate how distinguishable the gender attribute is, while the adversary infers the race distribution of the data used to generate it. A successful attack may cause severe consequences, as it provides the adversary with additional information that is often sensitive and should be protected. It can also be used to audit the fairness of a dataset (Buolamwini & Gebru, 2018). In this work, we first systematize the threat model and the attack methodology, which is straightforward. We assume that the adversary can access the victim's t-SNE plot and may have knowledge of the distribution of the target dataset, but does not necessarily have access to the target model. To infer the general property of the samples shown in a t-SNE plot, the adversary first samples groups of images from a shadow dataset with different properties (e.g., different proportions of males). These groups of images are then used to query the shadow model to obtain representations and generate groups of t-SNE plots. An attack model, which is a simple image classifier that distinguishes between t-SNE plots with different labels, is trained on the resulting <plot, property> pairs. Once trained, the attack model can be used to infer the property of public t-SNE plots; a minimal sketch of this pipeline is given below.
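To make the pipeline concrete, the following is a minimal sketch of the shadow side of the attack, assuming a publicly available pre-trained ResNet-18 as the shadow feature extractor and scikit-learn's t-SNE. The sampler `sample_with_male_ratio()`, the candidate property values, the number of plots, and the directory layout are hypothetical placeholders for illustration; the settings used in our experiments may differ.

```python
# Illustrative sketch of shadow t-SNE plot generation for the attack.
# sample_with_male_ratio() is a hypothetical helper that draws a group of
# preprocessed shadow images (a float tensor of shape (N, 3, 224, 224))
# containing the requested proportion of males.
import os
import torch
import torchvision.models as models
from sklearn.manifold import TSNE
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Shadow model: a public pre-trained feature extractor (ResNet-18 with the
# classification head removed, yielding 512-dimensional representations).
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
shadow_model = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

def render_tsne_plot(images, out_path):
    """Embed a group of images with the shadow model and save an unlabeled
    t-SNE scatter plot, mimicking the plots a victim would publish."""
    with torch.no_grad():
        reps = shadow_model(images).flatten(1).numpy()        # (N, 512)
    coords = TSNE(n_components=2, perplexity=30).fit_transform(reps)
    plt.figure(figsize=(4, 4))
    plt.scatter(coords[:, 0], coords[:, 1], s=4, c="black")
    plt.axis("off")
    plt.savefig(out_path)
    plt.close()

os.makedirs("shadow_plots", exist_ok=True)

# For each candidate property value (here the proportion of males), render
# many t-SNE plots from groups sampled with that proportion.
for prop in [0.0, 0.25, 0.5, 0.75, 1.0]:        # illustrative property values
    for trial in range(100):                     # illustrative number of plots
        images = sample_with_male_ratio(prop)    # hypothetical sampler
        render_tsne_plot(images, f"shadow_plots/{prop:.2f}_{trial}.png")

# The resulting <plot, property> pairs are then used to train an ordinary
# image classifier (over proportion bins) or a regressor (predicting the
# exact proportion) as the attack model.
```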
Our evaluations on both classification and regression tasks show that the proposed attack achieves high accuracy on several datasets, such as CelebA and LFW. For instance, the accuracy of predicting the proportion of males in CelebA t-SNE plots reaches 0.92, and the average regression error when predicting the precise proportion is 0.02. We also study the reasons for the relatively poor attack performance on other datasets (e.g., FairFace) or attributes (e.g., Oval Face) and find that it stems from the less distinguishable representations generated by the target model for these datasets/attributes. Moreover, we observe that the attack remains effective even when the shadow model differs from the target model. We further demonstrate that our attack is robust with fewer training plots and transfers to different t-SNE density settings. For instance, the regression attack model trained on t-SNE plots with 750 sample points achieves a low error of 0.04 on t-SNE plots with 1,000 or 500 sample points. We additionally show that, by mixing in only a small number of t-SNE plots from a new dataset, our attack transfers to the new dataset, and we demonstrate the validity of our attack when the target and shadow models are fine-tuned, which is common in practice (see Section 5.5 for details).

To mitigate the attack, we perturb the image embeddings/t-SNE coordinates and find that this indeed reduces the attack performance to a large extent; e.g., by adding Gaussian noise to the t-SNE coordinates, the inference error of the regression attack model increases significantly from 0.02 to 0.48. However, such defenses can still be bypassed by an adaptive attacker (see Section 5.4). In short, our work demonstrates that published t-SNE plots can serve as a valid side channel for property inference attacks. We call the community's attention to the privacy protection of published t-SNE plots.
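As a rough illustration of the coordinate-perturbation defense discussed above, the sketch below adds zero-mean Gaussian noise to the 2D t-SNE coordinates before the plot is rendered; the noise scale `sigma` is an illustrative parameter and not necessarily the setting used in our evaluation.

```python
import numpy as np
from sklearn.manifold import TSNE

def noisy_tsne_coordinates(representations, sigma=1.0, seed=0):
    """Project representations to 2D with t-SNE, then perturb the coordinates
    with zero-mean Gaussian noise before they are plotted or published.
    sigma is an illustrative noise scale, not the exact value from the paper."""
    coords = TSNE(n_components=2).fit_transform(representations)
    rng = np.random.default_rng(seed)
    return coords + rng.normal(loc=0.0, scale=sigma, size=coords.shape)
```

The same idea can also be applied one step earlier in the pipeline, i.e., perturbing the image embeddings before running t-SNE.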

2. BACKGROUND AND RELATED WORK

t-SNE Plots. Generally, t-SNE (van der Maaten & Hinton, 2008) is used to transform high-dimensional data into low-dimensional (e.g., 2D) data while preserving their relationships, i.e., similar data points are mapped close together while dissimilar data points are projected farther apart. To analyze how separable certain characteristics of the data are, one common practice is to leverage pre-trained ML models as feature extractors to generate representations for the data and then use t-SNE to reduce the high-dimensional representations into a 2D space for visualization, i.e., as t-SNE plots (see Figure 4(b) for an example). t-SNE plots can also be used to show the performance of fine-tuned ML models (see Figure 13 for an example).

Property Inference Attack. Property inference aims to gain insights into the global properties of a training dataset, which are unintentionally leaked. This poses a threat to the intellectual property of the data owner. In addition, property inference can be used to audit the fairness of datasets, e.g., gender fairness in benchmark datasets (Buolamwini & Gebru, 2018). Previous work has shown the vulnerability of machine learning models, both discriminative and generative, to property inference. Ateniese et al. (2015) proposes an inference attack against shallow machine learning models, e.g., Hidden Markov Models and Support Vector Machines. Ganju et al. (2018) proposes the first property inference attack against fully connected neural networks. Both works assume that the adversary has white-box access to the machine learning models. Concretely, the adversary first trains shadow models on datasets with different properties and then leverages the parameters of the shadow models to train a meta-classifier that identifies the property of the training dataset. Mahloujifar et al. (2022) conducts the property inference attack by injecting poisonous data into

