CANARY IN A COALMINE: BETTER MEMBERSHIP INFERENCE WITH ENSEMBLED ADVERSARIAL QUERIES

Abstract

As industrial applications are increasingly automated by machine learning models, enforcing personal data ownership and intellectual property rights requires tracing training data back to their rightful owners. Membership inference algorithms approach this problem by using statistical techniques to discern whether a target sample was included in a model's training set. However, existing methods only utilize the unaltered target sample or simple augmentations of the target to compute statistics. Such a sparse sampling of the model's behavior carries little information, leading to poor inference capabilities. In this work, we use adversarial tools to directly optimize for queries that are discriminative and diverse. Our improvements achieve significantly more accurate membership inference than existing methods, especially in offline scenarios and in the low false-positive regime which is critical in legal settings. Code is available at https://github.com/YuxinWenRick/canary-in-a-coalmine 

1. INTRODUCTION

In an increasingly data-driven world, legislators have begun developing a slew of regulations with the intention of protecting data ownership. The right-to-be-forgotten written into the strict GDPR law passed by the European Union has important implications for the operation of ML-as-a-service (MLaaS) providers (Wilka et al., 2017; Truong et al., 2021). As one example, Veale et al. (2018) discuss that machine learning models could legally (in terms of the GDPR) fall into the category of "personal data", which equips all parties represented in the data with rights to restrict processing and to object to their inclusion. However, such rights are vacuous if enforcement agencies are unable to detect when they are violated.

Membership inference algorithms are designed to determine whether a given data point was present in the training data of a model. Though membership inference is often presented as a breach of privacy in situations where belonging to a dataset is itself sensitive information (e.g. a model trained on a group of people with a rare disease), such methods can also be used as a legal tool against a non-compliant or malicious MLaaS provider.

Because membership inference is a difficult task, the typical setting for existing work is generous to the attacker and assumes full white-box access to model weights. In the aforementioned legal scenario, this is not a realistic assumption. Organizations have an understandable interest in keeping their proprietary model weights secret and, short of a legal search warrant, often only provide black-box querying to their clients (OpenAI, 2020). Moreover, even if a regulatory agency forcibly obtained white-box access via an audit, for example, a malicious provider could adversarially spoof the reported weights to cover up any violations.

In this paper, we achieve state-of-the-art performance for membership inference in the black-box setting by using a new adversarial approach.
We observe that previous work (Shokri et al., 2017; Yeom et al., 2018; Salem et al., 2018; Carlini et al., 2022a) improves membership inference attacks through a variety of creative strategies, but these methods query the targeted model using only the original target data point or its augmentations. We instead learn query vectors that are maximally discriminative; they separate all models trained with the target data point from all models trained without it. We show that this strategy reliably results in more precise predictions than the baseline method for three different datasets, four different model architectures, and even models trained with differential privacy.
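The intuition behind learning discriminative queries can be sketched concretely. In the toy numpy example below, each shadow model is reduced to a linear scorer (a stand-in for a real network), and a query vector is optimized by gradient ascent to separate the scores of IN models (trained with the target point) from OUT models (trained without it). The names `w_in` and `w_out` and the linear scoring are illustrative assumptions for this sketch, not our actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for shadow models: each "model" is a linear scorer w.
# IN models (trained with the target) cluster around one direction,
# OUT models (trained without it) around another.
w_in = rng.normal(0.5, 0.1, size=(8, 16))    # 8 IN shadow models, 16-dim inputs
w_out = rng.normal(-0.5, 0.1, size=(8, 16))  # 8 OUT shadow models

# Gradient ascent on the query: maximize mean IN score minus mean OUT score.
# For linear scorers, that objective's gradient is mean(w_in) - mean(w_out).
x = rng.normal(size=16)
for _ in range(100):
    grad = w_in.mean(axis=0) - w_out.mean(axis=0)
    x = x + 0.1 * grad
    x = x / np.linalg.norm(x)  # keep the canary on the unit sphere

# The optimized query produces well-separated scores for IN vs. OUT models,
# so querying the target model with it reveals which group it belongs to.
gap = (w_in @ x).mean() - (w_out @ x).mean()
```

A random query would yield a gap near zero here; the optimized canary drives the two score populations apart, which is exactly the property that makes a single black-box query informative about membership.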

2. BACKGROUND AND RELATED WORK

Homer et al. (2008) originated the idea of membership inference attacks (MIAs) by using aggregated information about SNPs to isolate a specific genome present in the underlying dataset with high probability. Such attacks on genomics data are facilitated by small sample sizes and the richness of information present in each DNA sequence, which for humans can be up to three billion base pairs. Similarly, the overparametrized regime of deep learning makes it vulnerable to MIAs. Yeom et al. (2018) designed the first attacks on deep neural networks by leveraging overfitting to the training data: members exhibit statistically lower loss values than non-members. Since their inception, improved MIAs have been developed across different problem settings and threat models with varying levels of adversarial knowledge.

Broadly speaking, MIAs can be categorized into metric-based approaches and binary classifier approaches (Hu et al., 2021). The former compute a variety of statistics to ascertain membership, while the latter involve training shadow models and using a neural network to learn the correlation (Shokri et al., 2017; Truong et al., 2021; Salem et al., 2018). More specifically, existing metric-based approaches include: correctness (Yeom et al., 2018; Choquette-Choo et al., 2021; Bentley et al., 2020; Irolla & Châtel, 2019; Sablayrolles et al., 2019), loss (Yeom et al., 2018; Sablayrolles et al., 2019), confidence (Salem et al., 2018), and entropy (Song & Mittal, 2021; Salem et al., 2018).

Despite the vast literature on MIAs, all existing methods in both categories rely solely on the data point x* whose membership status is in question: metric-based approaches compute statistics based on x* or augmentations of x*, and binary classifiers take x* as an input and output membership status directly. Our work hinges on the observation that an optimized canary image x_mal can be a more effective litmus test for determining the membership of x*.
Note that this terminology is separate from its use in Zanella-Béguelin et al. (2020) and Carlini et al. (2019), where a canary refers to a sequence that serves as a proxy to test memorization of sensitive data in language models. It also differs from the canary-based gradient attack in Pasquini et al. (2021), where a malicious federated learning server sends adversarial weights to users to infer properties about individual user data (e.g. membership inference) even with secure aggregation.

The metric used for assessing the efficacy of an MIA has been the subject of some debate. A commonly used approach is balanced attack accuracy, an empirically determined probability of correctly ascertaining membership. However, Carlini et al. (2022a) point out that this metric is inadequate because it implicitly assigns equal weight to both classes of mistakes (i.e. false positives and false negatives) and because it is an average-case metric. The latter characteristic is especially troubling because meaningful privacy should protect minorities and not be measured solely by effectiveness for the majority. A good alternative that addresses these shortcomings is to report the receiver operating characteristic (ROC) curve, which gives the true positive rate (TPR) at each false positive rate (FPR) as the detection threshold is varied. One way to distill the information present in the ROC curve is to compute the area under the curve (AUC): more area means a higher TPR across all FPRs on average. However, the more meaningful violations of privacy occur at low FPR. Methods that optimize solely for AUC can overstress the importance of high TPR at high FPR, a regime inherently protected by plausible deniability. In our work, we report both AUC and numerical results at the FPR deemed acceptable by Carlini et al. (2022a) for ease of comparison.
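As a concrete illustration of why the low-FPR regime matters, the short sketch below computes TPR at a fixed FPR by thresholding on the non-member score distribution. The Gaussian attack scores are invented for illustration, not outputs of any real attack:

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """TPR at a fixed FPR: choose the threshold so that only a target_fpr
    fraction of non-members score above it, then measure members caught."""
    thresh = np.quantile(nonmember_scores, 1.0 - target_fpr)
    return float(np.mean(member_scores > thresh))

rng = np.random.default_rng(0)
members = rng.normal(2.0, 1.0, 1000)     # hypothetical attack scores for members
nonmembers = rng.normal(0.0, 1.0, 1000)  # and for non-members

low = tpr_at_fpr(members, nonmembers, 0.001)  # strict low-FPR regime
avg = tpr_at_fpr(members, nonmembers, 0.5)    # lenient regime that inflates AUC
```

An attack can look strong under the lenient threshold yet identify few members with confidence under the strict one, which is why a single AUC number can mask weak performance exactly where privacy violations matter most.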



The ability to query metrics such as loss at various points during training has been shown to further improve membership inference. Liu et al. (2022) devise a model distillation approach to simulate the loss trajectories during training, and Jagielski et al. (2022b) leverage continual updates to model parameters to obtain multiple trajectory points.
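The basic loss-threshold test that these trajectory methods build on can be sketched as follows; the probability vectors and the threshold are invented for illustration:

```python
import numpy as np

def nll(probs, label):
    """Cross-entropy loss of a single prediction (lower = better fit)."""
    return -np.log(probs[label] + 1e-12)

def loss_attack(probs, label, threshold):
    """Yeom et al.-style rule: predict 'member' when the loss on the
    target point falls below a chosen threshold."""
    return nll(probs, label) < threshold

# A confident, correct prediction, typical of a training member...
member_probs = np.array([0.01, 0.98, 0.01])
# ...versus an uncertain one, typical of a non-member.
nonmember_probs = np.array([0.40, 0.35, 0.25])
```

Trajectory-based attacks apply this kind of statistic not once but at multiple snapshots of the model, gaining signal from how the loss evolves rather than from its final value alone.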

