MEMBERSHIP LEAKAGE IN PRE-TRAINED LANGUAGE MODELS

Abstract

Pre-trained language models are becoming a dominant component in the NLP domain and have achieved state-of-the-art performance on various downstream tasks. Recent research has shown that language models are vulnerable to privacy leakage of their training data, such as text extraction and membership leakage. However, existing work on NLP applications mainly focuses on the privacy leakage of text generation and downstream classification, and the privacy leakage of pre-trained language models is largely unexplored. In this paper, we take the first step toward systematically auditing the privacy risks of pre-trained language models through the lens of membership leakage. In particular, we focus on membership leakage of pre-training data when only downstream models adapted from pre-trained language models are exposed. We conduct extensive experiments on a variety of pre-trained model architectures and different types of downstream tasks. Our empirical evaluations demonstrate that membership leakage of pre-trained language models exists even when only the downstream model output is exposed, thereby posing a more severe risk than previously thought. We further conduct sophisticated ablation studies to analyze the relationship between membership leakage of pre-trained models and the characteristics of downstream tasks, which can guide developers and researchers to be vigilant about the vulnerability of pre-trained language models. Lastly, we explore possible defenses against membership leakage of PLMs and propose two promising defenses based on empirical evaluations.

1. INTRODUCTION

Nowadays, pre-trained language models (PLMs), represented by BERT (Devlin et al., 2019), have revolutionized the natural language processing community (Wolf et al., 2019; Vaswani et al., 2017; Munikar et al., 2019). PLMs are typically pre-trained on large-scale corpora to learn universal linguistic representations and are then fine-tuned for downstream domain-specific tasks (Sun et al., 2019; Shen et al., 2021). Concretely, downstream model owners can add only a few task-specific layers on top of the PLMs to adapt them to their own tasks, such as text classification, named entity recognition (NER), and Q&A. This training paradigm not only avoids training new models from scratch, but also forms the basis of state-of-the-art results across NLP. Despite their advantage in adapting to downstream tasks, PLMs are essentially DNN models. Recent studies (Erkin et al., 2009; Liu et al., 2022; Choo et al., 2021; Li et al., 2021) have shown that machine learning models (e.g., image classifiers) are vulnerable to privacy attacks, such as attribute and membership inference attacks. Yet, existing privacy attacks against language models have mainly focused on text generation and downstream text classification (Song & Shmatikov, 2019; Shejwalkar et al., 2021). To our knowledge, the potential privacy risks of pre-training data for PLMs have never been explored. To fill this gap, we take the first step toward systematically auditing the privacy risks of PLMs through the lens of membership inference: an adversary aims to infer whether a data sample is part of a PLM's training data. In particular, given the realistic and common scenario that downstream service providers are likely to build models adapted from PLMs, we consider an adversary who can access only these downstream service models deployed online.
Here, PLMs allow adding task-specific layers to fit any type of downstream task, such as classification (Shejwalkar et al., 2021), NER (McCallum & Li, 2003), and Q&A (Bordes et al., 2014). We further consider another realistic scenario where no information about the target PLMs other than the output is available to the adversary, i.e., the black-box setting. We perform an extensive measurement study of membership inference over four different PLM architectures (BERT, ALBERT, RoBERTa, XLNet) and five different downstream datasets covering three downstream tasks. Our evaluations show that membership leakage of pre-training data exists even when only the output of the downstream model is exposed, regardless of the PLM architecture and downstream task, thereby posing a more severe risk than previously thought. We further analyze the relationship between membership leakage and the characteristics of downstream tasks, and conduct sophisticated ablation studies that can guide developers and researchers to be vigilant about the vulnerability of pre-trained NLP models. Lastly, we explore possible defenses that can prevent membership leakage of PLMs and propose two promising defenses based on empirical evaluation results.
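The black-box threat model above can be illustrated with one of the simplest score-based membership inference baselines: the adversary queries the deployed model, observes only its output distribution, and flags low-loss samples as likely training members. The sketch below is a generic illustration of this idea, not the specific attack studied in this paper; the threshold value is illustrative and would in practice be calibrated, e.g., on shadow models.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under the model's output."""
    return -math.log(max(probs[label], 1e-12))

def infer_membership(output_probs, label, threshold=0.5):
    """Toy black-box membership inference.

    The adversary sees only the model's output probabilities for a
    sample.  Training members tend to incur lower loss, so a simple
    attack predicts "member" when the loss falls below a threshold.
    """
    return cross_entropy(output_probs, label) < threshold

# A confidently predicted sample is flagged as a likely training member
# (loss = -ln(0.9) ~ 0.105 < 0.5) ...
assert infer_membership([0.05, 0.9, 0.05], label=1)
# ... while an uncertain prediction is not (loss = -ln(0.3) ~ 1.204).
assert not infer_membership([0.4, 0.3, 0.3], label=1)
```

The key point of the threat model is that nothing here requires access to the PLM's parameters or intermediate representations, only the downstream model's outputs.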

Contributions.

• We conduct the first investigation of membership leakage of PLMs' pre-training data when only the downstream model output is exposed.
• We conduct extensive experiments on a variety of PLM architectures and different types of downstream tasks. Our empirical evaluations demonstrate that membership leakage of PLMs exists even when only the downstream model output is exposed.
• We conduct sophisticated ablation studies to analyze the relationship between membership leakage and the characteristics of downstream tasks.
• We explore possible defenses against membership leakage of PLMs and propose two promising defenses.

2.1. PRE-TRAINED LANGUAGE MODELS

Pre-trained Encoder. Nowadays, large-scale pre-trained language models (PLMs) are pushing natural language processing into a new era. They are typically pre-trained on large corpora by self-supervised learning and fine-tuned for different types of downstream tasks. The first generation of PLMs was BERT (Devlin et al., 2019), whose pre-training tasks were masked language modeling (MLM) and next sentence prediction (NSP). Since then, many variants of BERT have emerged to improve the learning ability of language models, such as ALBERT (Lan et al., 2020), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019).

Downstream Tasks. A PLM trained on a large corpus can learn generic linguistic representations, which benefit a wide range of downstream tasks such as text classification, named entity recognition (NER), and question answering (Q&A). PLMs unify these tasks into a common pre-training and fine-tuning pipeline and achieve superior performance across them (Gururangan et al., 2020; Qiu et al., 2020); the pre-training and fine-tuning pipeline has thereby become the most widely applied downstream model construction paradigm. PLMs, despite their highly generic linguistic representations, also rely on large-scale corpora that contain private/sensitive pre-training data, such as phone numbers, addresses, and biomedical data (in BioBERT (Lee et al., 2020)). Therefore, the vulnerability of PLMs to privacy leakage deserves our attention, as well as proactive assessment. Further, this is also important in light of the latest regulations under the EU General Data Protection Regulation (GDPR) umbrella (https://gdpr-info.eu), which require data owners to have greater control over their data. In addition, since PLMs serve downstream tasks, downstream models adapted from PLMs are actually more common in the real world.
Therefore, these concerns and realities drive our attention to the privacy leakage of PLMs when only the downstream models built on them are exposed.
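The adaptation paradigm described above, a shared pre-trained encoder plus a few task-specific layers, can be sketched as follows. This is a minimal structural illustration, not the API of any real library: `PretrainedEncoder` stands in for a transformer such as BERT (its toy features replace real contextual embeddings), and `ClassificationHead` stands in for the task-specific layers a downstream owner would add.

```python
import random

random.seed(0)  # fixed seed so the toy head is deterministic

class PretrainedEncoder:
    """Stand-in for a pre-trained language model such as BERT.

    In practice this would be a transformer loaded with pre-trained
    weights; here it just maps a token sequence to a fixed-size
    feature vector so the adaptation pattern is visible.
    """
    def __init__(self, hidden_size=8):
        self.hidden_size = hidden_size

    def encode(self, tokens):
        # Deterministic toy features standing in for contextual embeddings.
        total = sum(len(t) for t in tokens)
        return [(total * (i + 1)) % 10 for i in range(self.hidden_size)]

class ClassificationHead:
    """Task-specific layer added on top of the (frozen or fine-tuned) encoder."""
    def __init__(self, hidden_size, num_classes):
        self.weights = [[random.uniform(-1, 1) for _ in range(hidden_size)]
                        for _ in range(num_classes)]

    def forward(self, features):
        # One logit per class: a plain linear projection of the features.
        return [sum(w * f for w, f in zip(row, features))
                for row in self.weights]

# Downstream model = pre-trained encoder + small task head.
encoder = PretrainedEncoder()
head = ClassificationHead(encoder.hidden_size, num_classes=3)
logits = head.forward(encoder.encode(["the", "movie", "was", "great"]))
prediction = max(range(len(logits)), key=lambda i: logits[i])
```

Swapping the head (e.g., per-token tagging for NER, span prediction for Q&A) while reusing the same encoder is what makes the pipeline so common, and also why a single set of pre-training data can be probed through many different downstream models.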

2.2. MEMBERSHIP INFERENCE ATTACKS

Membership inference attack is a type of data inference attack that aims to infer whether a data sample was used to train the target machine learning model (Hu et al., 2021; Carlini et al., 2021;




