EXTRACTING ROBUST MODELS WITH UNCERTAIN EXAMPLES

Abstract

Model extraction attacks have proven to be a severe privacy threat to Machine Learning as a Service (MLaaS). A variety of techniques have been designed to steal a remote machine learning model with high accuracy and fidelity. However, how to extract a robust model with similar resilience against adversarial attacks has never been investigated. This paper presents the first study toward this goal. We first show that existing extraction solutions either fail to maintain model accuracy or model robustness, or suffer from robust overfitting. We then propose Boundary Entropy Searching Thief (BEST), a novel model extraction attack that achieves both accuracy and robustness extraction under restricted attack budgets. BEST generates a new kind of uncertain examples for querying and reconstructing the victim model. These samples have uniform confidence scores across different classes, which can perfectly balance the trade-off between model accuracy and robustness. Extensive experiments demonstrate that BEST outperforms existing attack methods over different datasets and model architectures under limited data. It can also effectively invalidate state-of-the-art extraction defenses. Our code can be found at https://github.com/GuanlinLee/BEST.

1. INTRODUCTION

Recent advances in deep learning (DL) and cloud computing technologies have boosted the popularity of Machine Learning as a Service (MLaaS), e.g., AWS SageMaker (sag, 2022) and Azure Machine Learning (azu, 2022). Such services can significantly simplify DL application development and deployment at a lower cost. Unfortunately, they also bring new privacy threats: an adversarial user can query a target model and then reconstruct it based on the responses (Tramèr et al., 2016; Orekondy et al., 2019; Jagielski et al., 2020b; Yuan et al., 2020; Yu et al., 2020). Such model extraction attacks can severely compromise the intellectual property of the model owner (Jia et al., 2021) and facilitate other black-box attacks, e.g., data poisoning (Demontis et al., 2019), adversarial examples (Ilyas et al., 2018), and membership inference (Shokri et al., 2017).

Existing model extraction attacks can be classified into two categories (Jagielski et al., 2020b). (1) Accuracy extraction aims to reconstruct a model with similar or superior accuracy compared with the target model. (2) Fidelity extraction aims to recover a model with similar prediction behaviors as the target one. In this paper, we propose and consider a new category of attacks: robustness extraction. As DNNs are well known to be vulnerable to adversarial attacks (Szegedy et al., 2014), it is common to train highly robust models for practical deployment, especially in critical scenarios such as autonomous driving (Shen et al., 2021), medical diagnosis (Rendle et al., 2016), and anomaly detection (Goodge et al., 2020). An interesting question then arises: given a remote robust model, how can the adversary extract this model with similar robustness as well as accuracy under limited attack budgets? We believe this question is important for two reasons.
(1) With the increased understanding of adversarial attacks, it has become a trend to deploy robust machine learning applications in the cloud (Goodman & Xin, 2020; Rendle et al., 2016; Shafique et al., 2020), giving the adversary opportunities to steal the model. (2) Training a robust model usually requires much more computation resources and data (Schmidt et al., 2018; Zhao et al., 2020), giving the adversary incentives to steal the model.

We review existing attack techniques and find that they are, unfortunately, incapable of achieving this goal. Particularly, there can be two kinds of attack solutions. (1) The adversary adopts clean samples to query and extract the victim model (Tramèr et al., 2016; Orekondy et al., 2019; Pal et al., 2020). However, past works have proved that it is impossible to obtain a robust model only from clean data (Zhao et al., 2020; Rebuffi et al., 2021). Thus, these methods cannot preserve the robustness of a robust victim model, although they can effectively steal the model's clean accuracy. (2) The adversary crafts adversarial examples (AEs) to query and rebuild the victim model (Papernot et al., 2017; Yu et al., 2020). Unfortunately, building models with AEs leads to two unsolved problems: (a) improving robustness with AEs inevitably sacrifices the model's clean accuracy (Tsipras et al., 2019); (b) with more training epochs, the model's robustness decreases as it overfits the generated AEs (Rice et al., 2020). We will conduct experiments to validate the limitations of prior works in Section 3.

To overcome these challenges in achieving robustness extraction, we design a new attack methodology: Boundary Entropy Searching Thief (BEST). The key insight of BEST is the introduction of uncertain examples (UEs). These samples are located close to the junctions of classification boundaries, making the model give uncertain predictions. We synthesize such samples based on their prediction entropy. Using UEs to query the victim model, the adversary can asymptotically shape the classification boundary of the extracted model to follow that of the victim model. With more extraction epochs, the boundaries of the two models become more similar, and the overfitting phenomenon is mitigated. We perform comprehensive experiments to show that BEST outperforms different types of baseline methods over various datasets and models. For instance, BEST achieves a 13% robust accuracy and 8% clean accuracy improvement over the JBDA attack (Papernot et al., 2017) on CIFAR10.
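To illustrate the intuition behind uncertain examples, the following is a minimal pure-Python sketch: it perturbs an input inside an L-infinity ball to maximize the entropy of a toy linear classifier's softmax output, pushing the sample toward the junction of the decision regions. The toy model, its parameters, and the search hyperparameters (`eps`, `step`, `iters`) are illustrative stand-ins, not the paper's actual algorithm.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    # Shannon entropy of a probability vector
    return -sum(q * math.log(q) for q in p if q > 0)

def logits(x, W, b):
    # toy linear "victim": one logit per class
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def uncertain_example(x0, W, b, eps=0.5, step=0.05, iters=100):
    """Signed-gradient ascent (numerical gradient) that perturbs x0 inside
    an L-inf ball of radius eps to maximize the prediction entropy."""
    x, h = list(x0), 1e-5
    for _ in range(iters):
        grad = []
        for i in range(len(x)):
            xp, xm = list(x), list(x)
            xp[i] += h
            xm[i] -= h
            grad.append((entropy(softmax(logits(xp, W, b))) -
                         entropy(softmax(logits(xm, W, b)))) / (2 * h))
        # take a signed step, then project back into the eps-ball around x0
        x = [min(max(xi + step * (1 if g > 0 else -1), x0i - eps), x0i + eps)
             for xi, g, x0i in zip(x, grad, x0)]
    return x

# toy 3-class model on 2-d inputs; the three decision boundaries meet at the origin
W = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
b = [0.0, 0.0, 0.0]
x0 = [0.6, -0.2]          # confidently classified clean sample
ue = uncertain_example(x0, W, b)
```

The returned point sits near the boundary junction of the toy model, where the class probabilities are close to uniform and the model's prediction is maximally uncertain.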

2. THREAT MODEL

We consider the standard MLaaS scenario, where the victim model M_V is deployed as a remote service for users to query. We further assume this model is established with adversarial training (Madry et al., 2018; Zhang et al., 2019; Li et al., 2022) and exhibits certain robustness against AEs. We consider adversarial training as it is still regarded as the most promising strategy for robustness enhancement, while some other solutions (Xu et al., 2017; Zhang et al., 2021a; Gu & Rigazio, 2014; Papernot et al., 2017) were subsequently proved to be ineffective against advanced adaptive attacks (Athalye et al., 2018; Tramer et al., 2020). We will consider more robustness approaches in future work (Section 7).

Prior works have made different assumptions about the adversary's knowledge of query samples. Some attacks assume the adversary has access to the original training set (Tramèr et al., 2016; Jagielski et al., 2020b; Pal et al., 2020), while others assume the adversary can obtain the distribution of training samples (Papernot et al., 2017; Orekondy et al., 2019; Chandrasekaran et al., 2020; Yu et al., 2020; Pal et al., 2020). Different from those works, we consider a more practical adversary capability: the adversary only needs to collect data samples from the same task domain as the victim model, which do not necessarily follow the distribution of the victim's training set. This is feasible as the adversary knows the task of the victim model and can crawl relevant images from the Internet. More advanced attacks (e.g., data-free attacks (Truong et al., 2021; Kariyappa et al., 2021)) will be considered as future work. The adversary can collect a small-scale dataset D_A with such samples to query the victim model M_V. We consider two practical scenarios for the MLaaS: the service can return the predicted logits vector Y (Tramèr et al., 2016; Orekondy et al., 2019; Jagielski et al., 2020b; Pal et al., 2020) or only a hard label.



Figure 1: Model extraction results on CIFAR10. The victim model is ResNet18 trained by PGD-AT on CIFAR10. The adversary model is ResNet18. Black solid and dashed lines in each figure denote the clean and robust accuracy of the victim model.

An adversarial user A aims to reconstruct this model based only on the returned responses. The extracted model M_A should have similar prediction performance as the target one, for both clean samples (clean accuracy) and AEs (robust accuracy). A has no prior knowledge of the victim model, including the model architecture, training algorithms, and hyperparameters. He is not aware of the adversarial training strategy used for robustness enhancement, either. A can adopt a different model architecture for building M_A, which can still achieve the same behaviors as the target model M_V.
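Under these assumptions, the extraction loop in the logits-return scenario amounts to distillation against the query responses. Below is a minimal pure-Python illustration with a toy linear victim; `query`, `train_step`, the datasets, and all parameters are hypothetical stand-ins rather than the paper's method. In the hard-label scenario, the training target would be a one-hot vector instead of the softmax of the returned logits.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def linear(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

# hypothetical victim parameters, standing in for the remote model M_V
W_V = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
b_V = [0.0, 0.0, 0.0]

def query(x, return_logits=True):
    """MLaaS response: the full logits vector Y, or only the hard label."""
    z = linear(x, W_V, b_V)
    return z if return_logits else max(range(len(z)), key=lambda k: z[k])

def train_step(x, target, W_A, b_A, lr=0.1):
    """One SGD step matching the adversary's softmax output to the target
    distribution; the gradient of cross-entropy w.r.t. the logits is p - t."""
    p = softmax(linear(x, W_A, b_A))
    for k in range(len(b_A)):
        g = p[k] - target[k]
        b_A[k] -= lr * g
        for i in range(len(x)):
            W_A[k][i] -= lr * g * x[i]

# adversary's task-domain samples (not the victim's training set)
D_A = [[0.9, 0.1], [0.8, -0.3], [0.1, 0.9],
       [-0.2, 0.8], [-0.8, -0.7], [-0.6, -0.9]]
W_A = [[0.0, 0.0] for _ in range(3)]
b_A = [0.0, 0.0, 0.0]
targets = [softmax(query(x)) for x in D_A]   # soft labels from the service
for _ in range(500):
    for x, t in zip(D_A, targets):
        train_step(x, t, W_A, b_A)
```

After training, the extracted model's predictions agree with the victim's on the query samples, illustrating why richer responses (logits) make extraction easier than hard labels alone.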

