EXPLORING VULNERABILITIES OF BERT-BASED APIS

Abstract

Natural language processing (NLP) tasks, ranging from text classification to text generation, have been revolutionised by pretrained BERT models. This allows corporations to easily build powerful APIs by encapsulating fine-tuned BERT models. These BERT-based APIs are often designed not only to provide reliable service but also to protect intellectual property or privacy-sensitive information in the training data. However, a series of privacy and robustness issues may still exist when a fine-tuned BERT model is deployed as a service. In this work, we first present an effective model extraction attack, in which the adversary can practically steal a BERT-based API (the target/victim model). We then demonstrate: (1) how the extracted model can be further exploited to mount an effective attribute inference attack that exposes sensitive information in the training data of the victim model; and (2) how the extracted model can be used to craft highly transferable adversarial attacks against the victim model. Extensive experiments on multiple benchmark datasets under various realistic settings validate the potential privacy and adversarial vulnerabilities of BERT-based APIs.

1. INTRODUCTION

The emergence of Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) has revolutionised the natural language processing (NLP) field, leading to state-of-the-art performance on a wide range of NLP tasks with minimal task-specific supervision. Meanwhile, with the increasing success of contextualised pretrained representations for transfer learning, powerful NLP models can be easily built by fine-tuning pretrained models such as BERT or XLNet (Yang et al., 2019). Building NLP models on pretrained representations typically requires only a few task-specific layers, or even just a single feedforward layer, on top of BERT. To protect data privacy, system integrity and Intellectual Property (IP), commercial NLP models such as task-specific BERT models are often made indirectly accessible through pay-per-query prediction APIs (Krishna et al., 2019). This leaves the model prediction as the only information an attacker can access. Prior work has found that existing NLP APIs are still vulnerable to model extraction attacks, which reconstruct a copy of the remote NLP model from carefully designed queries and the outputs of the API (Krishna et al., 2019; Wallace et al., 2020). Pretrained BERT models make it even easier to apply model extraction attacks to specialised NLP models obtained by fine-tuning pretrained BERT models (Krishna et al., 2019). Beyond model extraction, it is important to ask two further questions: 1) does the extracted model also leak sensitive information about the training data of the target model; and 2) can the extracted model expose additional vulnerabilities of the target model (i.e., the black-box API)? To answer these two questions, we first launch a model extraction attack, in which the adversary queries the target model with the goal of stealing it and turning it into a white-box model.
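The query-then-train loop behind model extraction can be illustrated with a minimal sketch. This is not the paper's implementation: the black-box `victim_api` is a toy keyword-based stand-in for a fine-tuned BERT API, and the surrogate is a simple word-count classifier standing in for fine-tuning a local BERT copy. All names here are hypothetical; only the attack structure (label a query pool via the API, then train a local model on the API's outputs) mirrors the attack described above.

```python
from collections import Counter, defaultdict

def victim_api(text: str) -> str:
    # Toy stand-in for a black-box BERT-based sentiment API;
    # the attacker observes only its label output per query.
    return "positive" if "good" in text or "great" in text else "negative"

def extract_model(query_pool):
    """Model extraction sketch: label a query pool via the API,
    then fit a local surrogate on the (query, API label) pairs."""
    labeled = [(q, victim_api(q)) for q in query_pool]  # pay-per-query step
    word_label_counts = defaultdict(Counter)
    for text, label in labeled:
        for w in text.lower().split():
            word_label_counts[w][label] += 1

    def surrogate(text: str) -> str:
        # White-box copy: votes by word-label co-occurrence counts.
        score = Counter()
        for w in text.lower().split():
            score.update(word_label_counts[w])
        return score.most_common(1)[0][0] if score else "negative"

    return surrogate

queries = ["a good movie", "great acting", "bad plot", "boring and bad"]
stolen = extract_model(queries)
print(stolen("good acting"))  # surrogate mimics the victim's decisions
```

In the actual attack the query pool would be large and the surrogate would be a BERT model fine-tuned on the API's predictions, but the information flow is the same: only query-output pairs cross the API boundary.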
With the extracted model, we further demonstrate that: 1) it is possible to infer sensitive information about the training data; and 2) the extracted model can be exploited to generate highly transferable adversarial attacks against the remote victim model behind the API. Our results highlight the risk that publicly hosted NLP APIs built by fine-tuning BERT can be stolen and attacked.

Contributions: First, we demonstrate that the extracted model can be exploited by an attribute inference attack to expose sensitive information about the original training data, leading to significant privacy leakage. Second, we show that adversarial examples crafted on the extracted model are highly transferable to the victim model behind the API.
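The transfer attack described above can be sketched as follows: the adversary perturbs an input using only the white-box extracted model, then sends the result to the black-box victim. Everything here is illustrative and assumed, not the paper's method: a toy `victim_api`, a `surrogate` that (for simplicity) is assumed to match the victim, a tiny hand-written synonym table, and a greedy word-substitution search in place of a gradient-guided attack.

```python
def victim_api(text: str) -> str:
    # Toy black-box victim (the attacker cannot inspect this).
    return "positive" if "good" in text or "great" in text else "negative"

def surrogate(text: str) -> str:
    # The extracted white-box copy; assumed faithful for this sketch.
    return victim_api(text)

# Hypothetical substitution candidates; a real attack would use
# embedding neighbours or a synonym resource.
SYNONYMS = {"good": ["decent"], "great": ["fine"]}

def craft_adversarial(text: str) -> str:
    """Greedy word substitution guided only by the surrogate:
    return the first candidate that flips the surrogate's label."""
    orig = surrogate(text)
    words = text.split()
    for i, w in enumerate(words):
        for sub in SYNONYMS.get(w, []):
            cand = " ".join(words[:i] + [sub] + words[i + 1:])
            if surrogate(cand) != orig:  # flips the white-box copy
                return cand
    return text  # no flip found within the candidate set

adv = craft_adversarial("a good movie")
print(adv, victim_api(adv))  # check whether the flip transfers
```

The key point the sketch captures is that the search for the perturbation never queries the victim; transferability means a flip found on the extracted model often carries over to the API.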

