MASS: MULTI-ATTRIBUTE SELECTIVE SUPPRESSION

Abstract

The recent rapid advances in machine learning technologies largely depend on the vast richness of data available today. Along with the new services and applications enabled by machine learning models, various governmental policies have been put in place to regulate data usage and protect people's privacy and rights. As a result, data owners often opt for simple data obfuscation (e.g., blurring people's faces in images) or withhold data altogether, which severely degrades data quality and greatly limits the data's potential utility. Aiming for a sophisticated mechanism that gives data owners fine-grained control while retaining the maximal degree of data utility, we propose Multi-attribute Selective Suppression, or MaSS, a general framework for performing precisely targeted data surgery that simultaneously suppresses any selected set of attributes while preserving the rest for downstream machine learning tasks. MaSS learns a data modifier through adversarial games between two sets of networks, where one aims to suppress the selected attributes and the other ensures the retention of the remaining ones. We carried out an extensive evaluation of our proposed method using multiple datasets from different domains, including facial images, voice audio, and video clips, and obtained highly promising results demonstrating MaSS's generalizability and its capability of drastically suppressing targeted attributes while imposing virtually no impact on the data's usability in other downstream machine learning tasks.

1. INTRODUCTION

The recent rapid advances in machine learning (ML) can be largely attributed to powerful computing infrastructures as well as the availability of large-scale datasets, such as ImageNet1K (Deng et al., 2009) for computer vision, WMT (Foundation, 2019) for neural machine translation, and LibriLight (Kahn et al., 2020) for speech recognition. Studies have shown that ML models trained on large-scale datasets usually prove effective in many additional downstream tasks (Brown et al., 2020). On the other hand, ethical concerns have been raised about proper data usage in issues such as data privacy (Liu et al., 2021), data minimization (Goldsteen et al., 2021), etc. Therefore, if more data were available and could be used without concerns about improper handling, ML models could be further improved and the ML community could advance in many domains. Attempting to balance model performance and proper data usage, a common approach is to simply modify the data to remove its "sensitive" attributes and experimentally demonstrate that the targeted sensitive attributes are indeed removed. What is crucially important but usually omitted here, however, is the preservation of the data's "total utility": the suppression operation often also negatively impacts, or even completely destroys, the other "non-sensitive" attributes, greatly damaging the dataset's potential future utility. For example, DeepPrivacy (Hukkelås et al., 2019) demonstrates strong privacy protection, but the modified data it produces can no longer be utilized for additional downstream tasks like sentiment analysis, age detection, or gender classification.
Since data is one of the main driving forces behind the rapid advancement of machine learning research, we argue that the ideal scenario would be to have the flexibility of selecting an arbitrary set of attributes and suppressing only them while leaving all the other attributes completely intact. In this way, the community could unleash the potential utility of the modified data to develop more advanced algorithms. Toward this exact goal, we present Multi-attribute Selective Suppression (or MaSS) in this paper, enabling precise attribute suppression for multi-attribute datasets.

Figure 1: MaSS is able to precisely target any selected attributes in a multi-attribute dataset for suppression while leaving the rest of the attributes intact for any potential downstream ML-based analytic tasks. For example, as illustrated by the diagram, when operating on the original multi-attribute dataset and configured to suppress Attr. 0, MaSS transforms the dataset such that the model for detecting Attr. 0 can no longer reliably detect it, while the models for Attr. 1 and 2 are not affected.

The high-level objective of MaSS is illustrated in Figure 1, where MaSS is configured to suppress Attr. 0 without knowing in advance that Attr. 1 and 2 will be used for downstream tasks. After the data transformation performed by MaSS, Attr. 0 becomes suppressed and non-detectable by its corresponding machine learning model, while Attr. 1 and 2 are left intact and can still be extracted by their corresponding ML models. As a concrete example, suppose we are working with a facial-image dataset that contains attributes like age, gender, and sentiment, where age and gender are considered sensitive. MaSS would then transform this dataset such that age and gender information could no longer be inferred by the corresponding ML models, whereas sentiment information could still be extracted from the transformed data. The contributions of our work are threefold:

1. We propose the novel MaSS framework to enable the powerful flexibility of precisely suppressing arbitrary, selected data attributes.

2. We employ multiple learning mechanisms in MaSS to enable its attribute-specific as well as generic feature preservation capabilities, which help it achieve satisfactory data utility protection both with and without prior knowledge of the downstream tasks.

3. We thoroughly validate MaSS using a wide range of multi-attribute datasets, spanning image, audio, and video. All our results demonstrate MaSS' strong performance in its intended selective attribute suppression and preservation.
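The suppress-and-preserve objective described above can be sketched in a toy setting. The snippet below is purely illustrative and is not the paper's actual architecture: the fixed logistic "detectors," the hard-coded modifier that erases one coordinate, and the attribute definitions are all hypothetical stand-ins for the learned networks in MaSS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-attribute data: attribute 0 is the sign of x[0],
# attribute 1 is the sign of x[1] (illustrative definitions).
X = rng.normal(size=(200, 2))
y0 = (X[:, 0] > 0).astype(float)  # "sensitive" attribute to suppress
y1 = (X[:, 1] > 0).astype(float)  # attribute to preserve

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y, eps=1e-7):
    # Binary cross-entropy between predicted probabilities and labels.
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Stand-ins for the two sets of networks: each "detector" reads
# one coordinate through a logistic unit.
def detect_attr0(X):
    return sigmoid(5.0 * X[:, 0])

def detect_attr1(X):
    return sigmoid(5.0 * X[:, 1])

# A trivial data modifier that erases coordinate 0. In MaSS this
# transformation is learned through adversarial games, not hand-coded.
def modifier(X):
    Z = X.copy()
    Z[:, 0] = 0.0
    return Z

# Adversarial objective: the modifier wants the suppression loss to be
# HIGH (Attr. 0 undetectable, predictions at chance) while keeping the
# preservation loss LOW (Attr. 1 still recoverable).
Z = modifier(X)
s_loss = bce(detect_attr0(Z), y0)  # rises to log(2), i.e. chance level
p_loss = bce(detect_attr1(Z), y1)  # stays small: coordinate 1 untouched
```

After the modification, the Attr. 0 detector outputs 0.5 for every sample, so its loss sits exactly at the chance level log(2), while the Attr. 1 detector is unaffected; a learned modifier would be trained to reach this same trade-off on real data.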

2. RELATED WORK

Data Privacy. A large body of work has studied methods that apply generative adversarial networks (GANs) to generate and modify facial features in images so that identity-related sensitive features can be de-identified. DeepPrivacy (Hukkelås et al., 2019) proposed using a conditional GAN to generate realistic anonymized faces while taking into account the existing background and a sparse pose annotation. To further strengthen GAN-based face anonymization, CIAGAN (Maximov et al., 2020) proposed an identity control discriminator, which controls which fake identity is used in the anonymization process via an identity control vector. Instead of generating entire faces for anonymization, Li et al. proposed applying a conditional GAN to identify and modify only the five identity-sensitive attributes. To also enable face anonymization with selected semantic attribute manipulation, PI-Net (Chen et al., 2021) proposed generating realistic-looking faces with the selected attributes preserved. These works usually focus on suppression alone, without considering the data's future utility; our approach not only suppresses the targeted attributes but also concurrently preserves the data's utility. On the other hand, SPAct attempted to suppress multiple attributes in a video through contrastive learning while preserving utility for action recognition; however, their approach lacks the flexibility to handle individual attributes, can only process all attributes at once, and is limited to an action recognition dataset, whereas our method is fully configurable and validated in different data domains. Moriarty et al. proposed a method to suppress biometric information while preserving its utility; however, their approach requires knowledge of the downstream task, while our method does not.

