LEARNING PRIVATE REPRESENTATIONS WITH FOCAL ENTROPY

Abstract

How can we learn a representation with good predictive power while preserving user privacy? We present an adversarial representation learning method that sanitizes sensitive content from the learned representation. Specifically, we propose focal entropy, a variant of entropy embedded in an adversarial representation learning setting, to drive privacy sanitization. Focal entropy enforces maximum uncertainty, in terms of adversarial confusion, on the subset of privacy-related similar classes, separated from the dissimilar ones. As such, our sanitization method yields deep removal of private features yet is conceptually simple and empirically powerful. We showcase feasibility on classification of facial attributes and identity on the CelebA dataset, as well as on CIFAR-100. The results suggest that private components can be removed reliably.
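The core intuition behind focal entropy can be illustrated with a small sketch. The following is a minimal, hypothetical NumPy example, not the paper's exact formulation: the function name `focal_entropy` and the hard similar/dissimilar split are illustrative assumptions. The idea is that the adversary's uncertainty is measured only within the subset of privacy-related similar classes, so maximal confusion corresponds to a uniform distribution over that subset rather than over all classes.

```python
import numpy as np

def focal_entropy(probs, similar_mask):
    """Illustrative focal entropy (hypothetical formulation): Shannon
    entropy of the adversary's prediction restricted to, and
    renormalized over, the privacy-related similar classes."""
    p = probs[similar_mask]
    p = p / p.sum()  # renormalize over the similar subset
    return -np.sum(p * np.log(p + 1e-12))

# Adversary's softmax over 4 identities; classes 0 and 1 are "similar".
probs = np.array([0.4, 0.4, 0.1, 0.1])
mask = np.array([True, True, False, False])
# Uniform confusion within the similar subset yields maximal focal
# entropy (log 2 for a two-class subset), even though the full
# distribution is far from uniform.
print(focal_entropy(probs, mask))
```

Maximizing this quantity during adversarial training pushes the representation to be uninformative about which of the similar private classes generated the input, while leaving predictions over dissimilar classes unconstrained.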

1. INTRODUCTION

Lately, the topics of privacy and security have enjoyed increased interest in the machine learning community. This can largely be attributed to the success of big data in conjunction with deep learning and the urge to create and process ever-larger datasets for mining. However, with more and more machine learning services becoming part of our daily lives and making use of our data, special measures must be taken to protect privacy and decrease the risk of privacy creep Narayanan & Shmatikov (2006); Backstrom et al. (2007). Simultaneously, growing privacy concerns risk becoming a major deterrent to the widespread adoption of machine learning and the attainment of its concomitant benefits. Therefore, reliable and accurate privacy-preserving methodologies are needed, and several efforts have been made in machine learning to develop algorithms that preserve user privacy while achieving reasonable predictive power.

The solutions proposed in the research community are versatile. A standard approach to addressing privacy in the client-server setup is to anonymize the clients' data. This is often achieved by directly obfuscating the private part(s) of the data and/or adding random noise to the raw data; the noise level then controls the trade-off between predictive quality and user privacy (e.g., data-level Differential Privacy Dwork (2006)). These approaches associate a privacy budget with all operations on the dataset; however, complex training procedures run the risk of exhausting the budget before convergence. A recent solution to this problem is federated learning McMahan et al. (2016); Geyer et al. (2017), which allows a centralized model to be trained collaboratively while keeping the training data decentralized.
The idea behind this strategy is that clients transfer the parameters of the model being trained, in the form of gradient updates, to a server instead of the data itself. While such an approach is appealing for training a network on data hosted across different clients, transferring models between clients and server and averaging gradients across clients generates significant data transmission and extra computation, which considerably prolongs training. Another widely adopted solution relies on an encoded data representation: instead of transferring the client's data, a feature representation is learned on the client's side and transferred to the server. Unfortunately, the learned features may still contain rich information that can breach user privacy Osia et al. (2017; 2018), and the extracted features can be exploited by an attacker to infer private attributes Salem et al. (2019). Yet another approach is homomorphic encryption Armknecht et al. (2015). Despite providing strong cryptographic guarantees in theory, it incurs considerable computational overhead, which still prevents its applicability to SOTA deep learning architectures Srivastava et al. (2019).
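The gradient-exchange scheme described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration of FedSGD-style averaging on a toy linear-regression task; the function names and the single-step protocol are assumptions for illustration, not the exact algorithm of the cited works. Only gradients cross the client-server boundary; the raw data (X, y) never leaves a client.

```python
import numpy as np

def client_update(weights, X, y):
    """Compute a local least-squares gradient on a client's private
    data; only this gradient is sent to the server, never (X, y)."""
    return 2 * X.T @ (X @ weights - y) / len(y)

def server_round(weights, client_grads, lr=0.1):
    """Average the clients' gradients and take one global step."""
    return weights - lr * np.mean(client_grads, axis=0)

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])          # ground-truth model
data = [(X, X @ w_true)                 # 3 clients, 32 samples each
        for X in (rng.normal(size=(32, 2)) for _ in range(3))]

w = np.zeros(2)
for _ in range(100):                    # 100 communication rounds
    grads = [client_update(w, X, y) for X, y in data]
    w = server_round(w, grads)
# w now approaches w_true although no client ever shared raw data
```

The communication cost noted in the text is visible even here: every round moves one gradient per client, so wall-clock time is dominated by rounds of transfer rather than local computation.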

