EXPLORING VULNERABILITIES OF BERT-BASED APIS

Abstract

Natural language processing (NLP) tasks, ranging from text classification to text generation, have been revolutionised by pretrained BERT models. This allows corporations to easily build powerful APIs by encapsulating fine-tuned BERT models. These BERT-based APIs are often designed not only to provide reliable service but also to protect intellectual property and privacy-sensitive information in the training data. However, a series of privacy and robustness issues may still exist when a fine-tuned BERT model is deployed as a service. In this work, we first present an effective model extraction attack, where the adversary can practically steal a BERT-based API (the target/victim model). We then demonstrate: (1) how the extracted model can be further exploited to develop an effective attribute inference attack that exposes sensitive information about the training data of the victim model; and (2) how the extracted model can lead to highly transferable adversarial attacks against the victim model. Extensive experiments on multiple benchmark datasets under various realistic settings validate the potential privacy and adversarial vulnerabilities of BERT-based APIs.

1. INTRODUCTION

The emergence of Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) has revolutionised the natural language processing (NLP) field, leading to state-of-the-art performance on a wide range of NLP tasks with minimal task-specific supervision. In the meantime, with the increasing success of contextualised pretrained representations for transfer learning, powerful NLP models can be easily built by fine-tuning pretrained models like BERT or XLNet (Yang et al., 2019). Building NLP models on pretrained representations typically requires only a few task-specific layers, or even just a single feedforward layer, on top of BERT. To protect data privacy, system integrity and Intellectual Property (IP), commercial NLP models such as task-specific BERT models are often made indirectly accessible through pay-per-query prediction APIs (Krishna et al., 2019). This leaves the model prediction as the only information an attacker can access. Prior works have found that existing NLP APIs are still vulnerable to model extraction attacks, which reconstruct a copy of the remote NLP model based on carefully designed queries and the outputs of the API (Krishna et al., 2019; Wallace et al., 2020). Pretrained BERT models further make it easier to apply model extraction attacks to specialised NLP models obtained by fine-tuning pretrained BERT models (Krishna et al., 2019). Beyond model extraction, it is important to ask the following two questions: 1) does the extracted model also leak sensitive information about the training data of the target model; and 2) can the extracted model expose additional vulnerabilities of the target model (i.e. the black-box API)? To answer these two questions, in this work we first launch a model extraction attack, where the adversary queries the target model with the goal of stealing it and turning it into a white-box model.
With the extracted model, we further demonstrate that: 1) it is possible to infer sensitive information about the training data; and 2) the extracted model can be exploited to generate highly transferable adversarial examples against the remote victim model behind the API. Our results highlight the risk that publicly hosted NLP APIs built by fine-tuning BERT can be stolen and attacked. Contributions: First, we demonstrate that the extracted model can be exploited by an attribute inference attack to expose sensitive information about the original training data, leading to significant privacy leakage. Second, we show that adversarial examples crafted on the extracted model are highly transferable to the target model, exposing further adversarial vulnerabilities of the target model. Third, extensive experiments with the extracted model on benchmark NLP datasets highlight the potential privacy issues and adversarial vulnerabilities of BERT-based APIs. We also show that both attacks developed on the extracted model can evade the investigated defence strategies.
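The extraction step described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the paper's actual method: a toy keyword rule stands in for the remote fine-tuned BERT API, and a simple word-count substitute stands in for the local BERT copy that a real attacker would fine-tune on the collected (query, label) pairs. All names (`victim_api`, `extract_model`) are illustrative.

```python
from collections import Counter

def victim_api(text: str) -> int:
    """Black-box stand-in for a BERT-based sentiment API (1 = positive)."""
    return 1 if any(w in text.split() for w in ("good", "great", "love")) else 0

def extract_model(queries):
    # Step 1: query the API and record its outputs as pseudo-labels.
    labelled = [(q, victim_api(q)) for q in queries]
    # Step 2: train a local substitute on the pseudo-labels.
    # Toy learner: count which words co-occur with each label.
    pos, neg = Counter(), Counter()
    for text, label in labelled:
        (pos if label == 1 else neg).update(text.split())
    def substitute(text: str) -> int:
        words = text.split()
        # Predict the label whose training words dominate (ties -> positive).
        return 1 if sum(pos[w] for w in words) >= sum(neg[w] for w in words) else 0
    return substitute

queries = ["good movie", "great acting", "bad plot",
           "terrible film", "love it", "awful mess"]
stolen = extract_model(queries)
# Functional agreement between the stolen copy and the victim on the queries.
agreement = sum(stolen(q) == victim_api(q) for q in queries) / len(queries)
```

The key point the sketch preserves is that the attacker never sees the victim's parameters, only its predictions; agreement on held-in queries is how extraction success is typically measured.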

2. RELATED WORK

2.1. MODEL EXTRACTION ATTACK (MEA)

Model extraction attacks (also referred to as "stealing" or "reverse-engineering") have been studied both empirically and theoretically, for simple classification tasks (Tramèr et al., 2016), vision tasks (Orekondy et al., 2019), and NLP tasks (Krishna et al., 2019; Wallace et al., 2020). As opposed to stealing parameters (Tramèr et al., 2016), hyperparameters (Wang & Gong, 2018), architectures (Oh et al., 2019), training data information (Shokri et al., 2017) or decision boundaries (Tramèr et al., 2016; Papernot et al., 2017), in this work we attempt to create a local copy of, or steal the functionality of, a black-box victim model (Krishna et al., 2019; Orekondy et al., 2019), i.e. a model that replicates the performance of the victim model as closely as possible. If reconstruction is successful, the attacker has effectively stolen the intellectual property. Furthermore, the extracted model can serve as a reconnaissance step to facilitate later attacks (Krishna et al., 2019). For instance, the adversary could use the extracted model to infer private information about the training data of the victim model, or to construct adversarial examples that force the victim model to make incorrect predictions.

2.2. ATTRIBUTE INFERENCE ATTACK

Fredrikson et al. (2014) first proposed the model inversion attack on biomedical data, whose goal is to infer missing attributes of an input feature vector based on interaction with a trained ML model. Since deep neural networks can memorise arbitrary information (Zhang et al., 2017), private information can be memorised by BERT as well, which poses a threat of information leakage (Krishna et al., 2019). In NLP applications, the input text often provides sufficient clues to portray the author, such as gender, age, and other important attributes. For example, sentiment analysis tasks often have privacy implications for authors whose text is used to train models.
Prior works (Coavoux et al., 2018) have shown that user attributes are easily detectable from online review data, which is used extensively in sentiment analysis (Hovy et al., 2015). One might argue that sensitive information such as gender, age, location and passwords is not explicitly included in model predictions. Nonetheless, since model predictions are produced from the input text, they can still encode personal information that might be exploited for adversarial purposes, especially as modern deep learning models have more capacity than they need to perform well on their tasks (Zhang et al., 2017). The naive solution of removing protected attributes is insufficient: other features may be highly correlated with, and thus predictive of, the protected attributes (Pedreshi et al., 2008).
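The correlated-feature leakage described above can be illustrated with a deliberately simple, fully synthetic example (not from the paper): even after a protected attribute (here, a user's region) is stripped from released records, a correlated surface feature, in this case British vs. American spelling, still lets an attacker recover it.

```python
# Synthetic records: the protected attribute "region" will be removed
# before release, but spelling variants in the text correlate with it.
records = [
    {"text": "my favourite colour", "region": "UK"},
    {"text": "colour of the sky",   "region": "UK"},
    {"text": "my favorite color",   "region": "US"},
    {"text": "color of the wall",   "region": "US"},
]

def infer_region(text: str) -> str:
    # Attacker's rule, learned from auxiliary data: British spellings
    # predict "UK" even though the attribute itself was never released.
    return "UK" if ("colour" in text or "favourite" in text) else "US"

released = [r["text"] for r in records]            # attribute removed
recovered = [infer_region(t) for t in released]    # attribute inferred anyway
accuracy = sum(g == r["region"]
               for g, r in zip(recovered, records)) / len(records)
```

The design point mirrors the Pedreshi et al. (2008) observation: deleting the protected column does nothing when another feature is predictive of it.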

2.3. ADVERSARIAL TRANSFERABILITY AGAINST NLP SYSTEMS

An important property of adversarial examples is their transferability (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2017). It has been shown that adversarial examples generated against one network can also successfully fool other networks (Liu et al., 2016; Papernot et al., 2017), especially adversarial image examples in computer vision. Similarly, in the NLP domain, adversarial examples that are crafted against a substitute model and also cause the target model to misclassify are considered transferable (Papernot et al., 2017; Ebrahimi et al., 2018b). Adversarial transferability against NLP systems remains largely unexplored. A few recent works have attempted to transfer adversarial examples to NLP systems (Sun et al., 2020; Wallace et al., 2020); however, it remains unclear how transferability works against BERT-based APIs, and whether transfer would succeed when the victim model and the substitute (extracted) model have different architectures.
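The transfer setting above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's attack: two keyword rules stand in for the BERT-based victim and the extracted substitute, and a hypothetical synonym list drives a greedy word-substitution search, in the spirit of HotFlip-style token swaps (Ebrahimi et al., 2018b), run only against the white-box substitute.

```python
SYNONYMS = {"good": "decent", "great": "fine"}   # hypothetical swap list

def victim(text: str) -> int:
    # Black-box target: the attacker cannot inspect this model.
    return 1 if any(w in text.split() for w in ("good", "great", "love")) else 0

def substitute(text: str) -> int:
    # Extracted copy: white-box to the attacker.
    return 1 if any(w in text.split() for w in ("good", "great")) else 0

def attack(text: str) -> str:
    # Greedily swap one word until the *substitute's* prediction flips.
    words = text.split()
    for i, w in enumerate(words):
        if w in SYNONYMS:
            cand = " ".join(words[:i] + [SYNONYMS[w]] + words[i + 1:])
            if substitute(cand) != substitute(text):
                return cand
    return text

x = "good movie overall"
x_adv = attack(x)                      # crafted only against the substitute
transfers = victim(x) != victim(x_adv)  # does it also flip the victim?
```

Note that `attack` never queries `victim`; the example transfers because the substitute's decision boundary approximates the victim's, which is precisely why a successful extraction amplifies adversarial risk.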

