MEMBERSHIP ATTACKS ON CONDITIONAL GENERATIVE MODELS USING IMAGE DIFFICULTY

Abstract

Membership inference attacks (MIA) try to detect whether data samples were used to train a neural network model. As training data is very valuable in machine learning, MIA can be used to detect the use of unauthorized data. Unlike traditional MIA approaches, which address classification models, we address conditional image generation models (e.g. image translation). Due to overfitting, reconstruction errors are typically lower for images used in training. A simple but effective approach to membership attacks can therefore use the reconstruction error. However, we observe that some images are "universally" easy, and others are difficult. Reconstruction error alone is less effective at discriminating between difficult images used in training and easy images that were never seen before. To overcome this, we propose a novel difficulty score that can be computed for each image, and whose computation does not require a training set. Our membership error, obtained by subtracting the difficulty score from the reconstruction error, is shown to achieve high MIA accuracy on an extensive number of benchmarks.

1. INTRODUCTION

Deep neural networks have been widely adopted in various computer vision tasks, e.g. image classification, semantic segmentation, and image translation and generation. The high sample complexity of such models requires large amounts of training data. However, obtaining many training images is often not easy: collection and annotation are frequently expensive and labor-intensive. In some domains, such as medical imaging, publicly available training data are particularly scarce due to privacy concerns. In such settings, it is common to grant access to private sensitive data for training purposes alone, while ensuring the data are not revealed at inference time. A common solution is to train the model privately and then provide black-box access to the trained model. However, even black-box access may leak sensitive information about the training data. Membership inference attacks (MIA) are one way to detect such leakage: given access to a data sample, an attacker attempts to determine whether or not the sample was used in the training process. MIA have been widely studied for image classification models, achieving high success rates (Shokri et al., 2017; Salem et al., 2018; Sablayrolles et al., 2019; Yeom et al., 2018; Li & Zhang, 2020; Choo et al., 2020). Due to overfitting in deep neural networks, prediction confidence tends to be higher for images used in training. This difference in prediction confidence helps MIA methods successfully determine which images were used for training. Therefore, in addition to detecting information leakage, MIA also provide insight into the degree of overfitting in the victim model. We address MIA in a new domain: conditional image generation models, e.g. image translation. While classification models output a probability vector over possible classes, generation models output a single color for every pixel.
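To make the attack concrete, the following is a minimal sketch (with hypothetical function names, not the exact formulation used in our experiments) of the idea developed in this paper: the attacker thresholds a membership error obtained by subtracting a per-image difficulty score from the pixel-wise reconstruction error. Here the difficulty score is approximated by the residual error of a least-squares linear predictor, fit on the query image alone, that maps the deep features at each spatial location to that location's pixel values.

```python
import numpy as np

def reconstruction_error(output, target):
    """Mean pixel-wise L1 error between the victim model's output
    and the ground-truth target image. Both arrays are (H, W, C)."""
    return np.abs(output - target).mean()

def difficulty_score(features, target):
    """Residual error of a per-image linear predictor mapping deep
    features at each location to pixel values. No training set needed:
    the predictor is fit on the query image itself.
    features: (H, W, D) deep feature map; target: (H, W, C) image."""
    H, W, D = features.shape
    X = features.reshape(H * W, D)
    X = np.concatenate([X, np.ones((H * W, 1))], axis=1)  # bias term
    Y = target.reshape(H * W, -1)
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)          # least squares fit
    return np.abs(X @ coef - Y).mean()

def membership_error(output, features, target):
    """Reconstruction error minus the image's intrinsic difficulty."""
    return reconstruction_error(output, target) - difficulty_score(features, target)

def is_member(output, features, target, threshold):
    """Predict 'used in training' when the membership error is low:
    overfitting makes training images easier to reconstruct than
    their intrinsic difficulty alone would suggest."""
    return membership_error(output, features, target) < threshold
```

The subtraction is the key design choice: an intrinsically easy image reaches a low reconstruction error even when unseen, but it also reaches a low difficulty score, so its membership error stays moderate; only images reconstructed better than their difficulty predicts fall below the threshold.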
We propose a MIA that uses the pixel-wise reconstruction error, as overfitting causes lower reconstruction error on images used for training. However, we observe that some images are "universally" easy, while others are universally difficult. Reconstruction error alone is therefore less accurate at discriminating between difficult images used in training and previously unseen easy images. To overcome this limitation, we add a novel image difficulty score that is computed for each query image. Our image difficulty score uses the accuracy of a linear predictor computed over a given image, predicting pixel values from deep features of that image. Together, the reconstruction error and the difficulty score allow us to discriminate between two factors of variation in the reconstruction error: (i) the "intrinsic" difficulty of the conditional generation task for each image, measured by its difficulty score, and (ii) the boost in accuracy due to overfitting to the training images. Defining a membership error that subtracts the difficulty score from the reconstruction error is shown empirically to achieve high success rates in MIA. Differently from other MIA approaches, we do not assume the existence of a large number of in-distribution data samples for training a shadow model, but rather operate on merely a single image. Our method is evaluated on an extensive number of benchmarks, demonstrating its effectiveness compared to strong baseline methods.

2. RELATED WORK

2.1. MEMBERSHIP INFERENCE ATTACKS (MIA)

Shokri et al. (2017) were the first to study MIA against classification models in a black-box setting, in which the attacker can only send queries to the victim model and receive the full probability vector in response, without being exposed to the model itself. They proposed training multiple shadow models to mimic the behavior of the victim model, and then using them to train a binary classifier to distinguish known samples from the training set from unknown samples. They assume the availability of new in-distribution training data and knowledge of the victim model architecture. Salem et al. (2018) further relaxed those assumptions and demonstrated that using only one shadow model is sufficient for a successful attack; they also proposed using an out-of-distribution dataset and different shadow model architectures, for a slightly inferior attack. Even more interestingly, they showed that without any training, a simple threshold on the victim model's confidence score is sufficient. This shows that classification models are more confident on samples that appeared in the training process than on unseen samples.

Sablayrolles et al. (2019) proposed an attack based on applying a threshold over the loss value rather than the confidence, and showed that black-box attacks are as good as white-box attacks. As the naive defense against such attacks is to modify the victim model's API to output only the predicted label, other works proposed label-only attacks (Yeom et al., 2018; Li & Zhang, 2020; Choo et al., 2020).

While most previous work has been around classification models, there has been some effort regarding MIA on generative models such as GANs and VAEs (Chen et al., 2019; Hayes et al., 2019; Hilprecht et al., 2019). An attack against semantic segmentation models was proposed by He et al. (2019), where a shadow semantic segmentation model is trained and then used to train a binary classifier. The classifier is trained on image patches, and the final decision regarding the query image is set by aggregating the per-patch classification scores. The input to the classifier is a structured loss map between the shadow model's output and the ground-truth segmentation map. Although this task is the closest to ours, our work is the first study of membership inference attacks on conditional image generation models. Besides membership inference attacks, other privacy attacks against neural networks exist; we refer the reader to Sec. A.1 for more details on such attacks.

2.2. CONDITIONAL IMAGE GENERATION

Image-to-image translation is the task of mapping an image from a source domain to a target domain, while preserving the semantic and geometric content of the input image. Over the last decade, with the advent of deep neural network models and increasing dataset sizes, significant progress has been made in this field. Currently, the most popular methods for training image-to-image translation models use Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and are used in two main scenarios: (i) unsupervised image translation between domains (Zhu et al., 2017a; Kim et al., 2017; Liu et al., 2017; Choi et al., 2018); (ii) serving as a perceptual image loss function (Isola et al., 2017; Wang et al., 2018; Zhu et al., 2017b). In this work we introduce the novel task of MIA on conditional image generation models.

3. MIA ON CONDITIONAL IMAGE GENERATION MODELS

In membership inference attacks (MIA), an adversary attacks a victim model by attempting to infer whether a query data sample was used to train the model. Such attacks exploit overfitting to

