EXPLORING CONNECTIONS BETWEEN MEMORIZATION AND MEMBERSHIP INFERENCE

Abstract

Membership inference (MI) allows adversaries to query trained machine learning models and infer whether a particular data sample was used in training. Prior work has shown that the efficacy of MI is not the same for every sample in the training dataset, and broadly attributes this behavior to various data properties such as distributional difference. However, systematically analyzing the reasons for such disparate behavior has received little attention. In this work, we investigate the cause of this discrepancy and observe that the reason is more subtle and fundamental. We first provide empirical evidence that an MI adversary is very successful on those samples that are highly likely to be memorized, irrespective of whether the sample comes from the same or a different distribution. Next, we provide a game-based formulation that lower-bounds the advantage of an adversary able to determine whether a sample is memorized, under certain assumptions about the efficacy of the model on memorized samples. Finally, based on our theoretical results, we present a practical instantiation of a highly effective MI attack on memorized samples.

1. INTRODUCTION

Advances in machine learning (ML) are enabling a wide variety of new tasks that were previously deemed too complex for computerized systems. These tasks are powered by models trained on large volumes of data collected from a variety of sources, which often makes that data sensitive or private. For example, data used to customize (or fine-tune) large language models can often be sensitive (Carlini et al., 2021; Zanella-Béguelin et al., 2020). Hence, understanding and explaining the privacy risks to the data used to train these models is an important problem that needs to be solved before widespread adoption of these models. Several prior works (Shokri et al., 2017; Yeom et al., 2018) have established that such models are susceptible to privacy attacks, such as membership inference (MI), which aims to infer whether specific data-points were used during training. Even more concerning, they have shown that the efficacy of MI is not the same for every sample in the training dataset. Unfortunately, the problem of explaining this discrepancy has received less attention. Only recently have researchers proposed techniques to measure the susceptibility to attack per sample (Carlini et al., 2022a; Ye et al., 2022), and they coarsely attribute the disparate risks to distributional difference (Kulynych et al., 2019). Consequently, out-of-distribution (OOD) samples that are part of the training dataset were deemed to be at higher risk than other samples. In this work, we first systematically analyse the correctness of the above reasoning using representative techniques from the OOD detection (Hendrycks & Gimpel, 2016; Liu et al., 2020) and MI literature (Carlini et al., 2022a). Our empirical observations reveal that the relationship between OOD samples and higher MI risk is not straightforward (§ 3). The reasoning is more subtle and fundamental.
We bridge the gap in our understanding of this problem and provide reasons for the varying MI risk among training data-points. We demonstrate that MI and OOD samples are connected via the susceptibility of these samples to memorization (Feldman & Zhang, 2020; Brown et al., 2021). That is, we show that an adversary is highly successful in predicting the membership of those samples that are likely to be memorized, irrespective of the distribution from which the data-point is sampled. Moreover, as shown in previous work by Feldman (2020), and demonstrated in our evaluation, it is the OOD samples that have a higher tendency of being memorized by the model, thereby misleading previous work into concluding that distributional distance increases susceptibility to MI. Next, we formalize a new game-based definition for MI, and present connections between the memorization bounds of Brown et al. (2021) and the advantage of an MI adversary. We show that the ability of an adversary to predict whether a given sample is memorized lower-bounds the advantage of any adversary in predicting the membership of that sample (§ 4). Lastly, we propose a practical instantiation of a highly effective MI strategy on memorized samples that can be performed with existing attacks. Extensive evaluation on image recognition models using four popular MI attacks (Yeom et al., 2018; Shokri et al., 2017; Song & Mittal, 2021; Carlini et al., 2022a) confirms that MI risk is maximal, at times 100% (Advantage = 1), for memorized samples (§ 5).

2. BACKGROUND & RELATED WORK

Notation. The focus of our discussion is on supervised machine learning (ML). Here, one wishes to learn a model using data-points of the form z = (x, y), where x ∈ X (the space of inputs) and y ∈ Y (the space of outputs). A distribution 𝒟 captures the space of inputs and outputs, and a dataset D of size T is sampled from it (i.e., D ∼ 𝒟^T) such that D = {z_i}_{i=1}^T. Using this dataset, a learning algorithm L can produce a trained model θ ∼ L(D) by minimizing a suitable objective ℓ.

2.1. MEMBERSHIP INFERENCE ATTACKS

Membership inference attacks (MIAs) are training-distribution-aware, and aim to infer whether a particular sample z was present in the training dataset, given access to a model θ. One common theory for why such attacks exist relies on the belief that models overfit to their training data (Yeom et al., 2018), leading to numerous follow-up works aiming to bound the advantage of MIAs in determining membership (Mahloujifar et al., 2022; Humphries et al., 2020; Erlingsson et al., 2019; Jayaraman et al., 2020; Thudi et al., 2022). Prior MIAs approach this problem by training shadow models (i.e., models trained on different subsets of the training data) and observing their outputs to infer membership (Shokri et al., 2017). Recent work by Carlini et al. (2022a) employs a similar approach, combining shadow-model training with hypothesis testing (also performed in prior work (Murakonda et al., 2021; Ye et al., 2022)) to yield a strong attack. Their work, however, remarks that attack performance is most meaningfully evaluated in low false-positive-rate (FPR) regimes. In our work, we will use the MIA of Carlini et al. (2022a); more details are presented in Appendix B.1. Other approaches to estimating MI success involve measuring entropy (Song & Mittal, 2021).
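To make the overfitting intuition concrete, the following is a minimal sketch (not the attack of Carlini et al. (2022a)) of the simple loss-threshold MIA in the spirit of Yeom et al. (2018): samples whose loss falls below a threshold are flagged as members, and the adversary's advantage is the gap between the true- and false-positive rates. The synthetic loss distributions and function names are purely illustrative.

```python
import numpy as np

def loss_threshold_mia(losses, threshold):
    """Flag samples with loss below the threshold as training members
    (the simple loss-based attack in the spirit of Yeom et al., 2018)."""
    return losses < threshold

def mi_advantage(member_losses, nonmember_losses, threshold):
    """Membership advantage = TPR - FPR at the given loss threshold."""
    tpr = np.mean(loss_threshold_mia(member_losses, threshold))
    fpr = np.mean(loss_threshold_mia(nonmember_losses, threshold))
    return tpr - fpr

# Toy example: an overfit model assigns members lower loss on average.
rng = np.random.default_rng(0)
member_losses = rng.exponential(scale=0.2, size=1000)     # low loss on members
nonmember_losses = rng.exponential(scale=1.0, size=1000)  # higher loss otherwise
adv = mi_advantage(member_losses, nonmember_losses, threshold=0.5)
```

The larger the gap between the two loss distributions (i.e., the more the model overfits), the closer the advantage gets to 1; identical distributions give an advantage near 0.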

2.2. OOD DETECTION

The goal of out-of-distribution (OOD) detection is to determine whether an input is drawn from the same distribution as the training dataset (i.e., in-distribution) or not. In empirical settings, OOD samples are realized as data samples from different datasets (usually with disjoint label sets) (Nguyen et al., 2015; Ming et al., 2022). Most prior work formalizes the detection problem by designing a scoring function and applying a threshold. Performance is measured in terms of FPR95 (the false positive rate on OOD examples at the threshold where the true positive rate (TPR) on in-distribution examples is 95%), along with AUROC and AUPR. Starting with a baseline for OOD detection that uses the maximum softmax probability (MSP) (Hendrycks & Gimpel, 2016), recent studies have designed various scoring functions based on the outputs of the final or penultimate layer (Liu et al., 2020; DeVries & Taylor, 2018), or a combination of different intermediate layers of a DNN model (Lee et al., 2018; Raghuram et al., 2021).
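The MSP baseline and the FPR95 metric described above can be sketched as follows; the synthetic logits are illustrative stand-ins (in-distribution logits get an artificial margin on the true class), not outputs of any real model.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability (Hendrycks & Gimpel, 2016):
    higher scores indicate more confidently in-distribution inputs."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: fraction of OOD samples scoring at or above the threshold
    at which 95% of in-distribution samples are (correctly) accepted."""
    threshold = np.quantile(id_scores, 0.05)  # 95% of ID scores exceed this
    return np.mean(ood_scores >= threshold)

# Toy logits: ID inputs have a clear margin on one class, OOD inputs do not.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
id_logits = rng.normal(0.0, 1.0, size=(1000, 10)) + 6.0 * np.eye(10)[labels]
ood_logits = rng.normal(0.0, 1.0, size=(1000, 10))
fpr95 = fpr_at_95_tpr(msp_score(id_logits), msp_score(ood_logits))
```

A lower FPR95 means the score separates the two populations better; a scoring function no better than chance yields FPR95 near 0.95.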

2.3. LABEL MEMORIZATION

Label memorization, as noted by Feldman & Zhang (2020), is a phenomenon whereby ML models remember information from individual training samples (likely in its entirety) in order to perform well when evaluated on those or similar samples. Brown et al. (2021) note that for certain types of inputs (which we shall define shortly), memorization is essential to obtaining a performant model. We begin by defining certain salient concepts related to memorization as in their original work.
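As an illustration (not part of the original definitions, which follow below), Feldman's label-memorization score for sample i compares the model's chance of predicting y_i correctly when trained with versus without z_i. The sketch below uses a deliberately memorizing, deterministic 1-nearest-neighbour learner so that a single training run per case suffices; with a randomized learner one would average over repeated runs.

```python
import numpy as np

def nn_predict(train_X, train_y, x):
    """1-nearest-neighbour prediction: a deliberately memorizing learner."""
    dists = np.linalg.norm(train_X - x, axis=1)
    return train_y[np.argmin(dists)]

def memorization_score(X, y, i, learner=nn_predict):
    """mem(L, D, i) = Pr[theta(x_i) = y_i | theta trained on D]
                    - Pr[theta(x_i) = y_i | theta trained on D \ {z_i}],
    following Feldman (2020). For a deterministic learner each
    probability is 0 or 1, so no averaging over runs is needed."""
    keep = np.arange(len(y)) != i
    p_in = float(learner(X, y, X[i]) == y[i])
    p_out = float(learner(X[keep], y[keep], X[i]) == y[i])
    return p_in - p_out

# Toy data: an atypical point with a unique label is fully memorized
# (score 1), while a point consistent with its neighbours is not (score 0).
X = np.array([[0.0], [0.1], [0.2], [5.0]])
y = np.array([0, 0, 0, 1])  # the point at 5.0 carries a label seen nowhere else
outlier_mem = memorization_score(X, y, 3)
inlier_mem = memorization_score(X, y, 1)
```

This mirrors the intuition used throughout this work: atypical (e.g., OOD) samples carry labels the model cannot infer from the rest of the data, so their correct prediction depends entirely on their presence in the training set.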

