MASS: MULTI-ATTRIBUTE SELECTIVE SUPPRESSION

Abstract

The recent rapid advances in machine learning technologies largely depend on the vast richness of data available today. Along with the new services and applications enabled by those machine learning models, various governmental policies have been put in place to regulate such data usage and protect people's privacy and rights. As a result, data owners often opt for simple data obfuscation (e.g., blurring people's faces in images) or withhold data altogether, which leads to severe data quality degradation and greatly limits the data's potential utility. Aiming for a sophisticated mechanism that gives data owners fine-grained control while retaining the maximal degree of data utility, we propose Multi-attribute Selective Suppression, or MaSS, a general framework for performing precisely targeted data surgery to simultaneously suppress any selected set of attributes while preserving the rest for downstream machine learning tasks. MaSS learns a data modifier through adversarial games between two sets of networks, where one is aimed at suppressing the selected attributes and the other ensures the retention of the remaining attributes. We carried out an extensive evaluation of our proposed method using multiple datasets from different domains, including facial images, voice audio, and video clips, and obtained highly promising results regarding MaSS' generalizability and its capability of drastically suppressing targeted attributes while imposing virtually no impact on the data's usability in other downstream machine learning tasks.

1. INTRODUCTION

The recent rapid advances in machine learning (ML) can be largely attributed to powerful computing infrastructures as well as the availability of large-scale datasets, such as ImageNet1K (Deng et al., 2009) for computer vision, WMT (Foundation, 2019) for neural machine translation, and LibriLight (Kahn et al., 2020) for speech recognition. Studies have shown that ML models trained on large-scale datasets usually prove effective in many additional downstream tasks (Brown et al., 2020). On the other hand, ethical concerns have been raised surrounding proper data usage, in issues such as data privacy (Liu et al., 2021) and data minimization (Goldsteen et al., 2021). If more data were available and could be used without concern over whether it is handled properly, ML models could be further improved, helping the ML community advance in many domains. To balance model performance against proper data usage, a common approach is to simply modify the data to remove its "sensitive" attributes and then demonstrate experimentally that the targeted sensitive attributes are indeed removed. What is crucially important but usually overlooked, however, is the preservation of the "total utility" of the data: the suppression operation often also negatively impacts, or even completely destroys, the other "non-sensitive" attributes, greatly damaging the dataset's potential future utility. For example, DeepPrivacy (Hukkelås et al., 2019) is able to demonstrate its privacy protection capability, but the modified data it produces can no longer be utilized for additional downstream tasks like sentiment analysis, age detection, or gender classification.
Since data is one of the main driving forces behind the rapid advancement of machine learning research, we argue that the ideal scenario would be to have the flexibility of selecting an arbitrary set of attributes and suppressing only those, while leaving all the other attributes completely intact. In this way, the community could unleash the potential utility of the modified data to develop more advanced algorithms. Towards this exact goal, we present Multi-attribute Selective Suppression (or MaSS) in this paper, to enable this capability of precise attribute suppression for multi-attribute datasets. The high-level

