EXTRACTING ROBUST MODELS WITH UNCERTAIN EXAMPLES

Abstract

Model extraction attacks have proven to be a severe privacy threat to Machine Learning as a Service (MLaaS). A variety of techniques have been designed to steal a remote machine learning model with high accuracy and fidelity. However, how to extract a robust model with similar resilience against adversarial attacks has never been investigated. This paper presents the first study toward this goal. We first show that existing extraction solutions either fail to maintain model accuracy or model robustness, or suffer from robust overfitting. We then propose Boundary Entropy Searching Thief (BEST), a novel model extraction attack that achieves both accuracy and robustness extraction under restricted attack budgets. BEST generates a new kind of query sample, the uncertain example, for querying and reconstructing the victim model. These samples have near-uniform confidence scores across different classes, which effectively balances the trade-off between model accuracy and robustness. Extensive experiments demonstrate that BEST outperforms existing attack methods over different datasets and model architectures under limited data. It can also effectively invalidate state-of-the-art extraction defenses. Our code is available at https://github.com/GuanlinLee/BEST.

1. INTRODUCTION

Recent advances in deep learning (DL) and cloud computing technologies have boosted the popularity of Machine Learning as a Service (MLaaS), e.g., AWS SageMaker (sag, 2022) and Azure Machine Learning (azu, 2022). Such services can significantly simplify DL application development and deployment at a lower cost. Unfortunately, they also bring new privacy threats: an adversarial user can query a target model and then reconstruct it based on the responses (Tramèr et al., 2016; Orekondy et al., 2019; Jagielski et al., 2020b; Yuan et al., 2020; Yu et al., 2020). Such model extraction attacks can severely compromise the intellectual property of the model owner (Jia et al., 2021) and facilitate other black-box attacks, e.g., data poisoning (Demontis et al., 2019), adversarial examples (Ilyas et al., 2018), and membership inference (Shokri et al., 2017). Existing model extraction attacks can be classified into two categories (Jagielski et al., 2020b). (1) Accuracy extraction aims to reconstruct a model with similar or superior accuracy compared with the target model. (2) Fidelity extraction aims to recover a model with similar prediction behaviors as the target one. In this paper, we propose and consider a new category of attacks: robustness extraction. As DNNs are well known to be vulnerable to adversarial attacks (Szegedy et al., 2014), it is common to train highly robust models for practical deployment, especially in critical scenarios such as autonomous driving (Shen et al., 2021), medical diagnosis (Rendle et al., 2016), and anomaly detection (Goodge et al., 2020). An interesting question then arises: given a remote robust model, how can the adversary extract it with similar robustness as well as accuracy under limited attack budgets? We believe this question is important for two reasons.
(1) With the increased understanding of adversarial attacks, it has become a trend to deploy robust machine learning applications in the cloud (Goodman & Xin, 2020; Rendle et al., 2016; Shafique et al., 2020), giving the adversary opportunities to steal the model. (2) Training a robust model usually requires much more computation and data (Schmidt et al., 2018; Zhao et al., 2020), giving the adversary incentives to steal the model. Unfortunately, we review existing attack techniques and find that they are incapable of achieving this goal. In particular, there are two kinds of attack solutions. (1) The adversary adopts clean samples to query and extract the victim model (Tramèr et al., 2016; Orekondy et al., 2019; Pal et al., 2020). However, past works have shown that it is impossible to obtain a robust model from clean data alone (Zhao et al., 2020; Rebuffi et al., 2021). Thus, these methods cannot preserve the robustness of a robust victim model, although they can effectively steal the model's clean accuracy. (2) The adversary crafts adversarial examples (AEs) to query and rebuild the victim model (Papernot et al., 2017; Yu et al., 2020). Unfortunately, building models with AEs leads to two unsolved problems: (a) improving robustness with AEs inevitably sacrifices the model's clean accuracy (Tsipras et al., 2019); (b) with more training epochs, the model's robustness decreases as it overfits the generated AEs (Rice et al., 2020). We conduct experiments to validate the limitations of prior works in Section 3. To overcome these challenges in achieving robustness extraction, we design a new attack methodology: Boundary Entropy Searching Thief (BEST). The key insight of BEST is the introduction of uncertain examples (UEs). These samples are located close to the junctions of classification boundaries, making the model give uncertain predictions. We synthesize such samples based on their prediction entropy.
Using UEs to query the victim model, the adversary can asymptotically shape the classification boundary of the extracted model to follow that of the victim model. With more extraction epochs, the boundaries of the two models become more similar, and the overfitting phenomenon is mitigated. We perform comprehensive experiments to show that BEST outperforms different types of baseline methods over various datasets and models. For instance, BEST achieves a 13% robust accuracy and 8% clean accuracy improvement over the JBDA attack (Papernot et al., 2017) on CIFAR10.

2. THREAT MODEL

We consider the standard MLaaS scenario, where the victim model M V is deployed as a remote service for users to query. We further assume this model is built with adversarial training (Madry et al., 2018; Zhang et al., 2019; Li et al., 2022) and exhibits certain robustness against AEs. We focus on adversarial training as it is still regarded as the most promising strategy for robustness enhancement, while some other solutions (Xu et al., 2017; Zhang et al., 2021a; Gu & Rigazio, 2014; Papernot et al., 2017) were subsequently proved to be ineffective against advanced adaptive attacks (Athalye et al., 2018; Tramer et al., 2020). We will consider more robustness approaches in future work (Section 7). An adversarial user A aims to reconstruct this model based solely on the returned responses. The extracted model M A should have a similar prediction performance as the target one, for both clean samples (clean accuracy) and AEs (robust accuracy). A has no prior knowledge of the victim model, including the model architecture, training algorithms, and hyperparameters. He is not aware of the adversarial training strategy used for robustness enhancement, either. A can adopt a different model architecture for building M A , which can still achieve the same behaviors as the target model M V . Prior works have made different assumptions about the adversary's knowledge of query samples. Some attacks assume the adversary has access to the original training set (Tramèr et al., 2016; Jagielski et al., 2020b; Pal et al., 2020), while others assume the adversary can obtain the distribution of training samples (Papernot et al., 2017; Orekondy et al., 2019; Chandrasekaran et al., 2020; Yu et al., 2020; Pal et al., 2020). Different from those works, we consider a more practical capability: the adversary only needs to collect data samples from the same task domain as the victim model, which do not necessarily follow the same distribution as the victim's training set.
This is feasible as the adversary knows the task of the victim model and can crawl relevant images from the Internet. More advanced attacks (e.g., data-free attacks (Truong et al., 2021; Kariyappa et al., 2021)) will be considered as future work. The adversary collects a small-scale dataset D A with such samples to query the victim model M V . We consider two practical scenarios for the MLaaS: the service returns either the predicted logits vector Y (Tramèr et al., 2016; Orekondy et al., 2019; Jagielski et al., 2020b; Pal et al., 2020) or a hard label Y (Tramèr et al., 2016; Papernot et al., 2017; Jagielski et al., 2020b; Pal et al., 2020) for every query sample. In either case, our attack is able to extract the model precisely. Attack cost. Two types of attack budgets are commonly considered in model extraction attacks. (1) Query budget B Q : the number of queries the adversary sends to the victim model. As commercial MLaaS systems adopt a pay-as-you-use business scheme, the adversary wishes to perform fewer queries while achieving satisfactory attack performance. (2) Synthesis budget B S : the computation cost (e.g., number of optimization iterations) to generate each query sample. A smaller B S is more cost-efficient for the adversary and reduces the attack time. The design of a model extraction attack needs to consider the reduction of both budgets. To reduce the attack cost, we assume the adversary can download a public pre-trained model and then build the extracted model from it. This assumption is reasonable, as there are many public model zoos offering pre-trained models for various AI tasks (e.g., Hugging Face (hug, 2022) and ModelZoo (mod, 2022)). It is also adopted in (Yu et al., 2020) and justified in (Jagielski et al., 2020a). The training set of the pre-trained model can be totally different from that of the victim model.

3. EXISTING ATTACK STRATEGIES AND THEIR LIMITATIONS

A variety of attack techniques have been proposed to extract models with high accuracy and fidelity; they fall into the following two categories. Extraction with Clean Samples. The adversary samples query data from a public dataset offline and trains the extracted model based on the data and the victim model's predictions. The earliest work (Tramèr et al., 2016) adopts this simple strategy, and we denote this attack as "Vanilla" in the rest of this paper. Later, advanced attacks were proposed that leverage active learning (Chandrasekaran et al., 2020) to generate samples for querying the victim model and refine the local copy iteratively. Typical examples include the Knockoff Nets (Orekondy et al., 2019) and ActiveThief (Pal et al., 2020) attacks. The adversary has access to a huge database of different natural images. In each iteration, he actively searches this database for the best samples for extraction, based on his current model. Extraction with Adversarial Examples. The adversary crafts AEs to identify the classification boundaries. A representative example is CloudLeak (Yu et al., 2020). The adversary generates AEs based on a local surrogate model as the query samples. These AEs, together with the victim model's predictions, form the training set for the adversary to train the extracted model. Some attacks also incorporate active learning to iteratively generate AEs. For instance, in the JBDA attack (Papernot et al., 2017), the adversary follows the FGSM (Goodfellow et al., 2015) idea to generate perturbed samples, queries them, and then refines his local model repeatedly. Limitations. These solutions may work well for accuracy or fidelity extraction. However, they are not effective for robustness extraction. We analyze their limitations from the following perspectives. First, according to previous studies (Zhao et al., 2020; Rebuffi et al., 2021), it is impossible to train a robust model only with clean samples.
Therefore, the techniques using clean samples cannot steal the victim model's robustness. To confirm this conclusion, we train a robust model using the PGD-AT approach (Madry et al., 2018). This model adopts the ResNet-18 architecture (He et al., 2016) and is trained on CIFAR10. The black solid and dashed lines in Figure 1a denote the clean accuracy and robust accuracy of this model. We consider the scenario where this model only returns the predicted hard label for each query. We then adopt the Vanilla and Knockoff Nets attack techniques to extract this model, using samples from part of the CIFAR10 test set. Figure 1a shows the model accuracy over different extraction epochs, evaluated on another part of the CIFAR10 test set, disjoint from the extraction set. We observe that for these two approaches, the clean accuracy of the stolen model is very high. However, the robust accuracy of the replicated model against the PGD20 attack is close to 0, which indicates that the extracted model does not inherit the robustness of the victim model at all. Second, we consider the techniques based on AEs. Training (extracting) models with AEs incurs two unsolved issues. (1) The participation of AEs in model training can sacrifice the model's clean accuracy (Tsipras et al., 2019). Figure 1b shows the attack results of CloudLeak and JBDA. We observe that the clean accuracy of the model extracted by JBDA drops significantly compared to that of Vanilla and Knockoff Nets (Figure 1a). (2) Training a model with AEs can easily make the model overfit the training data (Rice et al., 2020), which significantly decreases the robustness of the adversary's model with more query samples. From Figure 1b, we observe that the robust accuracy of JBDA decreases before the extraction is completed. It is worth noting that the clean and robust accuracy of CloudLeak is very close to that of the clean-sample-based approaches (Vanilla and Knockoff Nets).
The reason is that pre-generated AEs have lower transferability towards the victim model compared to queries generated in active learning, and can still be regarded as clean samples. (2) FineTune: the adversary first extracts the model with CloudLeak, and then fine-tunes it with AEs. We adopt the same training hyperparameters and protocol as (Rice et al., 2020). Furthermore, to avoid potential overfitting in the adversarial training process, we adopt Self-Adaptive Training (SAT) (Huang et al., 2020) combined with PGD-AT (Madry et al., 2018) in both attacks. The hyperparameters of SAT follow the original paper. Figure 1c shows the extraction results of these two attacks. We observe that their clean accuracy is still compromised, and their robust accuracy decays at the beginning (i.e., robust overfitting). The main reason is that the adversary does not have enough data to apply adversarial training under the attack budget constraint (5,000 samples in our experiments), which easily causes training overfitting and low clean accuracy. We provide more experimental results in Appendix C.7 to show the advantages of our BEST.

4. METHODOLOGY

We introduce a new attack methodology, which can extract both the clean accuracy and robustness of the victim model. Figure 1d shows the extraction results of our method under the same setting as other approaches. We observe it can effectively overcome the above challenges and outperform other attack strategies. Below we give the detailed mechanism and algorithm of our solution.

4.1. UNCERTAIN EXAMPLE

Our methodology is inspired by the limitations of AE-based attacks. It is well known that AEs are very close to the model's classification boundaries and can better depict the boundary shape (He et al., 2018). However, they also exhibit the γ-useful robust feature (Ilyas et al., 2019), i.e., having very high confidence scores (larger than γ) for the correct label. Such robust features can lead to clean accuracy degradation (Tsipras et al., 2019; Ilyas et al., 2019) as well as robust overfitting (Rice et al., 2020). Therefore, to precisely extract both the clean and robust accuracy of the victim model, the query samples should satisfy two properties: (P1) they must not carry the robust feature obtained from the AE generation process, to avoid overfitting; (P2) they should reflect the shape of the model's classification boundaries. These properties motivate us to design a new way to depict the victim model's boundary. We propose the novel uncertain example (UE), which meets the above requirements and is qualified for model extraction. It is formally defined as follows. Definition 1 (δ-uncertain example). Given a model M : R^N → R^n, an input x ∈ R^N is said to be a δ-uncertain example (δ-UE) if it satisfies the following relationship: max(softmax(M(x))) − min(softmax(M(x))) ≤ δ. (1) Figure 2 illustrates the positions of clean samples, AEs, and UEs. Compared to other types of samples, a UE aims to make the model confused about its label. Clearly, every sample in R^N is a 1-UE. To query and extract the victim model, we want δ to be as small as possible. On the one hand, a UE with a small δ does not carry the robust feature, satisfying property P1. On the other hand, a sample far away from the model's classification boundary normally has higher prediction confidence (Cortes & Vapnik, 1995; Mayoraz & Alpaydin, 1999). The uncertainty of UEs places them closer to the boundary, satisfying property P2.
Therefore, model extraction with UEs can better preserve the clean accuracy without causing robust overfitting, compared to AE-based approaches.
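As a concrete illustration, the δ-UE condition of Definition 1 reduces to a few lines of PyTorch. The function name below is ours, not from the released code; `model` is any callable returning raw logits.

```python
import torch
import torch.nn.functional as F

def is_uncertain_example(model, x, delta):
    """Check Definition 1: x is a delta-UE iff the gap between the largest
    and smallest softmax probability is at most delta."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    gap = probs.max(dim=-1).values - probs.min(dim=-1).values
    return bool((gap <= delta).all())
```

Since softmax probabilities lie in [0, 1], the gap never exceeds 1, so every input trivially passes the test for δ = 1, as noted above; a sharply peaked prediction fails it for any small δ.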

4.2. BOUNDARY ENTROPY SEARCHING THIEF ATTACK

We propose Boundary Entropy Searching Thief (BEST), a novel extraction attack based on UEs. In particular, similar to (Pal et al., 2020; Papernot et al., 2017; Orekondy et al., 2019), we adopt the active learning fashion, as it gives more accurate extraction results than non-active attacks. In each iteration, the adversary queries the victim model with δ-UEs. Based on the responses, he refines the local model to bring it closer to the victim one in terms of both clean and robust accuracy. In some active learning based attacks (e.g., Knockoff Nets, ActiveThief), the adversary searches a huge database for the best sample for each query. This strategy is not feasible under our threat model, where the adversary has a limited number of data samples. Besides, it is hard to directly sample qualified δ-UEs from the adversary's training set D A : according to previous studies (Li & Liang, 2018; Allen-Zhu et al., 2019), the trained model gradually converges on the training set, pushing the training samples away from the boundary and reducing the chance of finding UEs with a small δ. Instead, the adversary synthesizes UEs from natural data in each iteration. This can be formulated as a double-minimization problem with the following objective: min_{M_A} L(x, y, M_A), where x = argmin_x (max(softmax(M_A(x))) − min(softmax(M_A(x)))). In the inner minimization, we first identify the UE x whose confidence variance is as small as possible. Algorithm 1 describes the BEST attack in detail. We first define a new uncertain label Y p with the same confidence score for each class (Line 2). In each iteration within the query budget B Q , we collect some natural samples from the adversary's dataset D A and synthesize the corresponding UEs.
We adopt the Kullback-Leibler divergence KLD(•, •) to compute the distance between softmax(M A (X i )) and Y p (Line 7), and apply the PGD technique (Madry et al., 2018) under the synthesis budget B S to push softmax(M A (X i )) closer to Y p , i.e., to minimize δ (Line 8). We then query the victim model M V with the generated UEs and obtain the corresponding responses (Line 10). Different from previous works, we only need the hard label from M V , which is enough to minimize the Cross-Entropy loss L for model training (Line 12). This also makes our attack harder to defeat, as it invalidates the mainstream extraction defenses that perturb the logits vectors (Lee et al., 2019). Note that Line 12 represents the training process on a batch of data. Specifically, we use a batch size of 128 in our experiments, i.e., the adversary first queries the victim model with 128 data samples, uses the 128 sample-response pairs to train his local model, and then adds 128 to the query count b q .

5.1. CONFIGURATIONS

Datasets and Models. Our attack method is general across datasets, models, and adversarial training strategies. Without loss of generality, we choose two datasets: CIFAR10 (Krizhevsky et al., 2009) and CIFAR100 (Krizhevsky et al., 2009), which are the standard datasets in adversarial training studies (Madry et al., 2018; Zhang et al., 2019; Rice et al., 2020; Jang et al., 2019; Raghunathan et al., 2018; Xiao et al., 2020; Balaji et al., 2019). Prior model extraction attacks adopt data samples from, or following the same distribution as, the victim's training set (Tramèr et al., 2016; Jagielski et al., 2020b; Pal et al., 2020; Papernot et al., 2017; Orekondy et al., 2019; Chandrasekaran et al., 2020; Yu et al., 2020), which may not be possible in some practical scenarios. Our attack only requires data from the same task domain. In our implementation, we split the test sets of CIFAR10 and CIFAR100 into two disjoint parts: (1) an extraction set D A used by the adversary to steal the victim model; (2) a validation set D T used to evaluate the attack results and the victim model's performance during its training process. Both D A and D T contain 5,000 samples. We also evaluate other types of extraction sets in Section 5.3. The adversary adopts pre-trained models of various architectures trained on Tiny-ImageNet. The benefit of the pre-trained models is evaluated in Appendix C.3. We consider two types of extraction outcomes. (1) Best model with the highest robustness: the adversary picks the model with the highest robustness (against PGD20) during extraction. (2) Final model after extraction: the adversary picks the model from the last epoch. The victim model M V is either ResNet-18 (ResNet) (He et al., 2016) or WideResNet-28-10 (WRN) (Zagoruyko & Komodakis, 2016).
The adversary's model M A may differ from the victim model, and we use two more architectures in our experiments, i.e., MobileNetV2 (MobileNet) (Sandler et al., 2018) and VGG19BN (VGG) (Simonyan & Zisserman, 2015). We adopt two mainstream adversarial training approaches, i.e., PGD-AT (Madry et al., 2018) and TRADES (Zhang et al., 2019), to enhance the robustness of the victim models. This results in ResNet-AT (WRN-AT) and ResNet-TRADES (WRN-TRADES), respectively. The clean and robust accuracy of the victim models can be found in Appendix B.3. Baselines. We adopt five baseline methods for comparison. The first two are representatives of the model extraction attacks discussed in Section 3. (1) Vanilla (Tramèr et al., 2016) is the most basic extraction technique, using clean samples to query the victim model. (2) JBDA (Papernot et al., 2017) leverages active learning to generate AEs, and gives the best extraction performance among the methods in Section 3. We also choose three robust knowledge distillation methods as baselines: (3) ARD (Goldblum et al., 2020), (4) IAD (Zhu et al., 2022), and (5) RSLAD (Zi et al., 2021). Robust knowledge distillation aims to train a student model from a large teacher model such that the student obtains better robustness than the teacher. This is very similar to our robustness extraction goal. However, it requires the user to have the entire training set, as well as white-box access to the teacher model, which violates our threat model. We therefore modify these methods to use the same knowledge of the victim model and dataset for fair comparisons. We introduce the details of the robust knowledge distillation methods in Appendix A.2 and explain the differences between knowledge distillation and model extraction attacks there. Due to the page limit, we leave the baseline details and attack settings to Appendix B.1. Metrics.
We consider three metrics to comprehensively assess the attack performance. For clean accuracy evaluation, we measure clean accuracy (CA), i.e., the accuracy of the extracted models on clean samples in D T . We also consider relative clean accuracy (rCA), which checks whether M A gives the same label as M V for each x i in D T . For robustness evaluation, we measure robust accuracy (RA) against various adversarial attacks. We choose four L ∞ -norm non-targeted attacks: PGD20, PGD100 (Madry et al., 2018), CW100 (Carlini & Wagner, 2017), and AutoAttack (AA) (Croce & Hein, 2020). The attack settings are ϵ = 8/255 and η = 2/255. The number of attack steps is 20 for PGD20, and 100 for PGD100 and CW100. Results under L 2 -norm attacks can be found in Appendix C.1. The formal definitions of our metrics can be found in Appendix B.4.
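For concreteness, the CA and rCA metrics reduce to simple label comparisons. The sketch below uses plain Python lists of predicted labels and is our own phrasing of the metrics, not the formal definitions of Appendix B.4.

```python
def clean_accuracy(adv_labels, true_labels):
    """CA: fraction of D_T where the extracted model predicts the ground truth."""
    return sum(a == t for a, t in zip(adv_labels, true_labels)) / len(true_labels)

def relative_clean_accuracy(adv_labels, victim_labels):
    """rCA: fraction of D_T where M_A agrees with M_V, regardless of ground truth."""
    return sum(a == v for a, v in zip(adv_labels, victim_labels)) / len(victim_labels)
```

rCA is a fidelity-style metric: an extracted model can score high rCA even on inputs where both models are wrong, as long as they are wrong in the same way.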

5.2. MAIN RESULTS

Comparisons with Baselines. We compare the attack effectiveness of BEST with the other baselines. Due to the page limit, we only show the results under one configuration: M V is ResNet-AT, M A is ResNet, and the dataset is CIFAR10. The other configurations give the same conclusion, and the results can be found in Appendix C.8. The baseline methods exhibit a large gap between their best and final models, a symptom of robust overfitting (Rice et al., 2020). In contrast, our BEST reduces the accuracy gap between the best model and the final model, because the UEs generated in BEST give the extracted model a lower risk of overfitting the data. Moreover, when the victim model returns logits vectors, the (relative) clean accuracy (r)CA of the baseline methods increases, while the model robustness decreases. This is because the robust features make the victim model give more uncertain predictions, and learning such logits vectors directly is more difficult. Our BEST does not depend on the returned prediction type. Impact of Model Architecture and Adversarial Training Strategy. We first consider the case where the adversary adopts a different model architecture from the victim. Table 2 shows the results when we vary the architecture of the adversary's model. We observe that our methodology enables the adversary to obtain the maximal performance within the selected architecture. The clean and robust accuracy of VGG and MobileNet is a bit lower than that of ResNet and WRN, which is caused by the capability of the architectures themselves. Table 3 shows the attack performance against different victim model architectures with different adversarial training approaches. We observe that the attack performance is very stable across configurations: the deviations of rCA and robust accuracy RA do not exceed 7% and 4%, respectively. Impact of Attack Budgets. We first explore how the query budget B Q affects the performance of BEST. We perform model extraction with different sizes of D A , from 1,000 to 5,000.
Figure 3a shows the clean and robust accuracy trends during extraction under different query budgets. We clearly observe that a larger B Q increases both clean and robust accuracy. Importantly, even with a very small D A , the overfitting issue does not occur at the end of the attack, which indicates our method is stable and powerful. We give a detailed analysis in Appendix C.2. We further consider the impact of the synthesis budget. We vary the value of B S and measure the relative clean accuracy rCA and the robust accuracy RA against AA for the extracted model with the highest robustness during extraction. The results are shown in Figure 3b. First, we observe that BEST achieves excellent attack performance even under a very low synthesis budget. Second, a larger B S does not increase rCA and RA significantly. We believe this is because query samples are easy to generate, and increasing B S improves neither the quality nor the quantity of query samples with a smaller δ. This indicates our attack is much more efficient than previous works, which rely on larger synthesis budgets. We give a more detailed analysis in Appendix C.2, where we use a single V100 GPU to generate all required data in one epoch and report the GPU time (in seconds) to demonstrate our method's efficiency.

5.3. MODEL EXTRACTION WITH DIFFERENT TYPES OF DATA

In the above experiments, the adversary uses samples from the same distribution as the victim model's test data to synthesize uncertain examples. In this section, we consider and evaluate some alternatives for query sample generation. Incorporating Training Samples. In some cases, the adversary may have the victim's original training data, e.g., when the victim's model is trained on a public dataset. The adversary can then add the training samples to D A for model extraction. This threat model has been considered in prior works (Tramèr et al., 2016; Jagielski et al., 2020b; Pal et al., 2020). In our experiments, we first fill D A with 5,000 samples from the test data's distribution, and then add different numbers of the victim's training samples to D A . Figure 3c shows the extracted clean and robust accuracy under different configurations. We observe that incorporating training samples is very helpful for improving the attack performance, since they are directly related to the victim model. Even with 1,000 training samples, the clean and robust accuracy improve by 2.64% and 1.88%, respectively. Applying Data Augmentations. Data augmentation is a popular strategy for enhancing a model's robustness. We can also leverage this technique to generate uncertain examples, which can further improve the attack performance. Table 4 compares the results with and without augmentation. Details of the adopted augmentation operations can be found in Appendix B.1. Clearly, when the adversary first augments the clean sample and then generates the query sample, the clean accuracy and robustness are significantly higher than without data augmentation. Besides, data augmentation can also help the adversary bypass the victim's defense, as discussed in Section 5.4.
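The augment-then-synthesize pipeline can be mimicked with standard CIFAR-style augmentation (random horizontal flip plus a padded random crop). The function below is a self-contained stand-in for the operations listed in Appendix B.1, written in plain PyTorch; the exact operations used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def augment_cifar(x, padding=4):
    """Random horizontal flip, then reflect-pad and randomly crop back to
    the original spatial size (the usual CIFAR augmentation recipe).
    x has shape (batch, channels, height, width)."""
    if torch.rand(1).item() < 0.5:
        x = torch.flip(x, dims=[-1])  # horizontal flip
    _, _, h, w = x.shape
    padded = F.pad(x, (padding, padding, padding, padding), mode="reflect")
    top = int(torch.randint(0, 2 * padding + 1, (1,)))
    left = int(torch.randint(0, 2 * padding + 1, (1,)))
    return padded[..., top:top + h, left:left + w]
```

The adversary would feed the augmented sample into the UE synthesis step before querying; the extra randomness is also what disrupts PRADA's distribution test in Section 5.4.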

5.4. BYPASSING MODEL EXTRACTION DEFENSES

Past works have proposed several defense solutions to mitigate model extraction threats, which fall into three categories. The first kind adds perturbations to the logits vectors without changing the prediction labels (Lee et al., 2019). Since BEST only needs the hard labels of the query samples to extract models, such defenses do not work. The second type detects malicious query samples. Upon identifying a suspicious sample, the victim model returns an incorrect prediction. We consider two typical detection methods. (1) PRADA (Juuti et al., 2019) is a global detection approach. It detects malicious samples based on the hypothesis that the differences between normal samples in the same class obey a Gaussian distribution, while the differences between synthesized samples often follow long-tailed distributions. We reproduce this method and evaluate its effectiveness in detecting BEST. We observe that PRADA initially needs to establish knowledge about the anomalous distributions of malicious samples. After 6,180 queries, it is able to identify each uncertain example. To bypass such detection, the adversary can apply data augmentation when generating uncertain examples (Section 5.3). The randomness of these augmentation operations disrupts the defender's knowledge about anomalous queries. Our experiments show that PRADA fails to detect any of the adversary's query samples generated with data augmentation. (2) SEAT (Zhang et al., 2021b) is an account-based detection method. It detects and bans suspicious accounts that send similar query samples multiple times. To bypass SEAT, the adversary only needs to register more accounts and use them to query the victim model, which can also reduce the attack cost (Appendix C.2). The third kind of strategy increases the computational cost of model extraction. Dziedzic et al. (2022) introduced proof-of-work (POW) to increase the query time of malicious samples.
This is a strong defense against existing extraction attacks with specially crafted query samples: we observe that the query cost of our attack grows exponentially with the number of queries. It would be interesting to improve our attack to bypass this method. For instance, the adversary can try to behave like normal users when querying the model: he can set up a large number of accounts and ensure the queries from each account do not exceed the privacy budget. Since (Dziedzic et al., 2022) does not evaluate the possibility of bypassing POW with multiple accounts, we consider this an interesting direction for future work. We discuss more details in Appendix C.9.

5.5. MORE EVALUATIONS

We evaluate BEST from more perspectives; the detailed results are in the Appendix. In particular, Extracting Models with Out-Of-Distribution Data. We further consider a weaker adversary, who can only obtain out-of-distribution data to query the victim model. The results and analysis can be found in Appendix C.4: BEST can still restore the victim model to some degree. In Appendix C.5, we show that using additional out-of-distribution data can improve the extraction results. Extracting Non-robust Models. Our method is general and can extract non-robust models as well. Given a normally trained model, BEST is able to precisely extract its clean accuracy. Extraction results can be found in Appendix C.6. Transferability Stabilization. Our attack enjoys high transferability stabilization (Papernot et al., 2017), i.e., AEs generated from the victim model achieve similar accuracy on the extracted models, and vice versa. We demonstrate this feature with different model architectures in Appendix C.10.

6. DISCUSSIONS

The traditional model extraction problem was introduced many years ago and has been well studied. In contrast, this is the first work to propose robustness extraction. As an initial attempt, our attack method also has some limitations. We expect this paper to attract more researchers to explore this problem and come up with better solutions. Below, we discuss some open questions for future work. • Although our method outperforms existing SOTA solutions, there still exists a robustness gap between the extracted model and the victim model. One possible solution to reduce this gap is to increase the number of query samples (Appendix C.5). In the future, it is important to improve the extracted robustness in a more efficient way. • In this paper, we mainly consider adversarial training for building a robust model, which is the most popular strategy. There are other robust solutions, e.g., certified defense (Cohen et al., 2019; Li et al., 2019), which will be considered in the future. Besides, we mainly focus on the image classification task. It is also interesting to extend this problem to other AI tasks and domains. • Recent works proposed data-free attacks (Truong et al., 2021; Kariyappa et al., 2021), where the adversary trains a GAN to generate query samples from noise. We find these techniques cannot achieve promising results for extracting the model's robustness. How to design data-free techniques for robustness extraction is a challenging problem, and we leave it for future work.

There are three basic types of sampling-based active learning algorithms. The first type is random sampling: for each query, the adversary randomly sends some samples to the victim and uses the returned labels to train his model. The second one is the uncertainty strategy (Lewis & Gale, 1994): the adversary chooses the most uncertain samples to query the victim and uses them to train his model.
The third one is the k-center strategy (Sener & Savarese, 2018; Pal et al., 2020): the adversary generates cluster centers based on each sample's prediction, and then chooses the most distant sample for each cluster as the query sample. The model is trained on these samples, and the adversary updates the cluster centers after every training step. We consider two settings for the victim's MLaaS: returning the logits vector or the hard label for each query. For the former setting, all baseline methods adopt Kullback-Leibler divergence as the loss function. For the latter setting, we replace the Kullback-Leibler divergence with the Cross-Entropy loss, which is the same as previous works. For both our method and the baselines, we apply data augmentation when generating query samples, including central cropping, adding Gaussian noise, random image flipping, and random rotation. We show in the experiments that this augmentation can help increase the attack performance and bypass the defense. We restrict the perturbation size and the number of iterations in the perturbation generation process, which is consistent with the other baselines. Similar to our method, the query data in JBDA, ARD, IAD and RSLAD are derived from the clean samples. For JBDA, the query data are the clean data after Jacobian augmentation. For ARD, IAD and RSLAD, the query data are the clean data with adversarial perturbations added.
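As a minimal sketch, the uncertainty strategy above can be implemented by scoring each candidate sample with the entropy of the local model's predicted class distribution; the function name and the entropy-based score are our illustrative assumptions, not the original attack implementations:

```python
import numpy as np

def uncertainty_select(probs, k):
    """Sketch of the uncertainty sampling strategy (Lewis & Gale, 1994):
    pick the k samples whose predicted class distribution has the highest
    entropy, i.e. the samples the local model is least certain about.

    probs: (num_samples, num_classes) array of softmax outputs.
    Returns the indices of the k most uncertain samples."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:k]  # indices sorted by descending entropy
```

The selected indices would then be sent to the victim model as the next query batch, and the returned labels used to train the adversary's local model.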

B.3 DETAILS OF VICTIM MODELS

In Table 1 , we show the detailed information of all victim models used in our experiments, including their clean accuracy and robustness under various attacks. 

B.4 DETAILS OF METRICS

The formula of rCA is:

rCA(M_A, M_V, D_T) = (1/N) Σ_{i=1}^{N} 1(argmax M_A(x_i) = argmax M_V(x_i)), x_i ∈ D_T    (3)

where M_A is the adversary's model, M_V is the victim's model, and D_T is the validation set. The formula of RA is:

RA(M_A, D_T) = (1/N) Σ_{i=1}^{N} Pr[M_A(x_i + ε_i) = M_A(x_i) | ε_i ∈ B_p(0, ε)], x_i ∈ D_T    (4)

where p is the norm and ε is the maximum perturbation margin, which together constrain the perturbation ε_i to lie in the hypersphere B_p(0, ε) centered at the origin.
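For concreteness, the two metrics can be computed as in the NumPy sketch below; the function names are our own, and for RA we assume the perturbed predictions have already been produced by an attack within the B_p(0, ε) ball:

```python
import numpy as np

def rca(adv_logits, victim_logits):
    """Relative clean accuracy (Eq. 3): fraction of validation samples on
    which the adversary's model predicts the same class as the victim's."""
    return float(np.mean(
        np.argmax(adv_logits, axis=1) == np.argmax(victim_logits, axis=1)))

def ra(clean_preds, perturbed_preds):
    """Robust accuracy (Eq. 4): fraction of samples whose predicted class
    is unchanged after a bounded perturbation. `perturbed_preds` are the
    model's predictions on x_i + eps_i with eps_i constrained to B_p(0, eps)."""
    return float(np.mean(clean_preds == perturbed_preds))
```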

C MORE RESULTS

C.1 EXTRACTION RESULTS UNDER L2-NORM ATTACKS

C.2 ATTACK BUDGET ANALYSIS

Query Budget Analysis. Following the early learning decay in (Zhao et al., 2020), we can decrease the query budget. In our experiments, all models reach their highest robustness with query budgets of about 1K-5K × 100. With the early learning decay method, we restore the victim model with a query budget of 5K × 80. The results in Table 3 indicate that reducing the query budget does not significantly decrease the restored model's clean accuracy and robustness. On the other hand, we find that using more accounts can reduce query costs. For example, AWS provides a Free Tier for new accounts to analyze 5,000 images per month for free. Google provides all accounts with 1,000 free image predictions per month. Microsoft provides all accounts with 5,000 free image predictions per month. It is feasible to use more accounts to steal the victim model, which can significantly reduce the financial cost, as creating new accounts is easy. Hence, the query budget is not the principal limitation of model extraction attacks.

Synthesis Budget Analysis. Our method is computationally efficient because the number of UEs that the adversary needs to generate in one training epoch is small.

C.3 IMPACT OF PRE-TRAINED MODELS

We evaluate the improvement from a pre-trained model in Table 4. The results indicate that under limited data, the pre-trained model can significantly improve clean accuracy and robustness. In fact, using pre-trained models does not affect the superiority of our method, for the following reasons. First, all the baseline methods adopt the same pre-trained model to initialize the adversary's model, which gives us a fair comparison. Second, the adopted pre-trained models are normally trained, without any robustness features. This explains why other baseline methods using these models cannot extract the robustness of the victim model (Section 5.2 and Appendix C.8). Third, Table 4 shows that even without pre-trained models, our method can still restore robustness from the victim models. The main reason is that querying the victim model with UEs obtains the most informative outputs, and training the adversary's model on such samples better shapes its classification boundaries to fit the victim model's boundaries, which makes the adversary's model achieve higher robustness. On the other hand, using pre-trained models in model extraction attacks reflects the practical fact that they widely exist for various tasks, beyond the image domain and computer vision, and can be easily downloaded. For example, the well-known website ModelZoo (mod, 2022) provides pre-trained models for many tasks, including natural language processing, text-to-speech, audio generation, and image-to-text.

C.4 ADOPTING DIFFERENT DISTRIBUTIONS OF SAMPLES

We consider another scenario where the adversary does not know the distribution of the victim model's test data. He may use samples from a different distribution to synthesize the uncertain examples for extraction. Table 5 shows the evaluation results of such a case, where the victim model is trained over CIFAR10, while the adversary uses data from CIFAR10 (in-distribution) as well as SVHN, CIFAR100, and STL10 (out-of-distribution) to perform attacks. Specifically, the data distribution of STL10 is the closest to CIFAR10, while the data distribution of SVHN is the furthest from CIFAR10. We observe that model extraction with samples from a different distribution has much lower clean and robust accuracy. Combining these results, we conclude that reducing the gap between the distributions of the victim's training data and the adversary's extraction data increases the clean accuracy and robust accuracy. We provide more discussion about how to enhance the attack with out-of-distribution data in Appendix C.5.

C.5 ADOPTING MORE SAMPLES

We explore how to further improve the results under our threat model. In the experiments in our main paper, the adversary has only 5,000 samples to query the victim model, so the gap between the victim model and the restored model can be reduced by adding more data. We therefore compare the results of using 5,000 CIFAR10 samples alone against additionally using 5,000 CIFAR100 or 5,000 STL10 samples. The results in Table 6 indicate that increasing the number of query samples is an efficient way to improve our results, even when the additional data come from a different distribution. That is to say, our method suits a mixture of distributions, which is meaningful when the adversary cannot collect enough data from a single distribution.

C.6 EXTRACTING NON-ROBUST VICTIM MODEL

In addition to robust models, our approach can also extract non-robust models, just for clean accuracy (see Figure 4). In Tables 8 to 22, we display the results of different attack scenarios on CIFAR10, and in Tables 23 to 38 those on CIFAR100. Based on the results of JBDA, ARD, IAD, RSLAD and BEST, we find that robust features and robust overfitting are closely connected. First, the technique used to augment the clean samples in JBDA is very close to the FGSM attack, which introduces robust features into the generated samples. Similarly, when comparing our method with ARD, IAD and RSLAD (these three attacks first generate adversarial examples, which contain robust features), we find that the robust features in the query samples of ARD, IAD and RSLAD cause overfitting. Thus, Property P1 is supported by the experimental results. Our UEs do not contain robust features, and obtain the robustness of the victim model by shaping the restored classification boundaries to match the victim model's boundaries. Comparing the results of ARD, IAD, RSLAD and our method, we find that even with stronger AEs, ARD, IAD and RSLAD cannot beat our method; hence JBDA with FGSM will not outperform our method either. To summarize, BEST has three main advantages compared to the other baselines. First, BEST can restore high clean accuracy and relative clean accuracy, which is impossible for robust knowledge distillation methods (e.g., ARD, IAD and RSLAD).

C.9 ATTACK AGAINST POW

There are two ways to implement the POW defense (Dziedzic et al., 2022) in MLaaS. The first is to count the per-query cost for each user. In this case, the adversary cannot adopt multiple accounts to decrease the total time cost. Furthermore, the time cost grows linearly if the cost for each query is almost the same. For our model extraction attack, because the adversary needs many queries to restore the robustness of the victim model, the total time cost is not negligible. The second way is to count the cumulative cost of queries for each user. Under this implementation, the time cost for a query increases exponentially. For a normal user, it introduces additional waiting time (Dziedzic et al., 2022). For an adversary, due to the privacy leakage caused by the query samples, the time cost will be thousands of times larger than that of a normal query. For our BEST, because the uncertain examples extract boundary information from the victim model, which is a form of privacy leakage, the total query time becomes unacceptably long. Overall, the POW defense is robust against our model extraction attack because of its diverse implementation methods. Exploring how to overcome such a defense with a robust model extraction attack will be our future work. For instance, the adversary can try to behave like a normal user when querying the model. Since normal users may also send normal images on which the victim model has low confidence or high uncertainty (e.g., images in the wild following different distributions from the training set), the model owner should allow a certain privacy budget for each account to reduce such false positives and keep the service practical. Then the adversary can set up a large number of accounts and ensure the queries from each account do not exceed this privacy budget.
Although the POW paper discussed that the adoption of multiple accounts can be defeated by summing over all the users, we believe the adversary can still succeed if he tries to mimic normal users with each account. The more accounts he has, the more feasible it is to mimic normal users within the privacy budget.

C.10 TRANSFERABILITY STABILIZATION

Transferability stabilization means that AEs generated from the victim model achieve similar accuracy over the extracted models, and that extracted models with different structures generate AEs with similar transferability among each other. Our BEST helps M_A, under various architectures, obtain classification boundaries similar to the victim model's, thereby achieving transferability stabilization.



For all model extraction attacks in Figure 1, we adopt the same pre-trained model, which is trained on Tiny-ImageNet, to initialize the adversary's local model as an extraction start. More details of the pre-trained model can be found in Appendix B.1. ActiveThief adopts a similar active learning technique as Knockoff Nets, so we omit its results here.
https://aws.amazon.com/rekognition/pricing/?nc1=h_ls
https://cloud.google.com/vision/product-search/pricing
https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/



Figure 1: Model extraction results on CIFAR10. The victim model is ResNet18 trained by PGD-AT on CIFAR10. The adversary model is ResNet18. Black solid and dashed lines in each figure denote the clean and robust accuracy of the victim model.

Figure 2: An illustration of clean samples, AEs and UEs. Each color represents a class.

Third, we further consider a straightforward strategy specifically for the robustness extraction scenario: the adversary conducts accuracy extraction of the victim model, followed by adversarial training to obtain robustness. We implement two attacks following this strategy: (1) LabelDataset: the adversary first obtains a labeled dataset from the victim model with Vanilla or CloudLeak, and then adopts adversarial training to train his model. (2) FineTune: the adversary first extracts the model with CloudLeak, and then fine-tunes it with AEs. We adopt the same training hyperparameters and protocol as (Rice et al., 2020). Furthermore, to avoid potential overfitting in the adversarial training process, we adopt Self-Adaptive Training (SAT) (Huang et al., 2020) combined with PGD-AT (Madry et al., 2018) in both attacks. The hyperparameters of SAT follow its paper. Figure 1c shows the extraction results of these two attacks. We observe that their clean accuracy is still compromised, and robust accuracy decays at the beginning (i.e., robust overfitting). The main reason is that the adversary does not have enough data for adversarial training due to the attack budget constraint (5,000 samples in our experiments), which easily causes training overfitting and low clean accuracy. We provide more experimental results in Appendix C.7 to show the advantages of our BEST.

(a) Impact of the query budget. (b) Impact of the synthesis budget. (c) Adding victim's training data.

Figure 3: Exploration of the attack budget and training data. The dataset is CIFAR10. The victim model is ResNet-AT. The adversary model is ResNet.

B.1 DETAILS OF CONFIGURATIONS

In all experiments, the learning rate of model extraction is set to 0.1 at the beginning and decays at the 100th and 150th epoch by a factor of 0.1. The optimizer in all experiments is SGD, with an initial learning rate of 0.1, momentum of 0.9 and weight decay of 0.0001. The total number of extraction epochs is 200. In each epoch, the adversary queries all data in his training set D_A. The batch size is 128. The hyperparameter in JBDA for Jacobian matrix multiplication is β = 0.1. For ARD, IAD, RSLAD and our BEST, the hyperparameters for query sample generation under the L∞-norm are ε = 8/255, η = 2/255 and B_S = 10.

For all baseline methods and our BEST, the adversary adopts a pre-trained model to facilitate the model extraction process. Specifically, the pre-trained model is used to initialize the adversary's local model, which means that the adversary uses a pre-trained model as a start to restore the victim's model. All baseline methods and our attack follow the same attack pipeline, i.e., using the same pre-trained model as a start and restoring the victim model by training the pre-trained model with the query data. All the pre-trained models are downloaded from open repositories on GitHub. The pre-trained models are trained on Tiny-ImageNet. There are four network structures for the pre-trained models: ResNet, VGG, MobileNet and WideResNet.
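As a sketch, the reported optimizer and schedule can be reproduced in PyTorch as follows; the helper name is ours, and we assume the decay at epochs 100 and 150 is implemented as a standard multi-step schedule:

```python
import torch

def make_extraction_optimizer(model):
    """Optimizer and schedule matching the reported configuration:
    SGD (lr 0.1, momentum 0.9, weight decay 1e-4), decayed by a factor
    of 0.1 at epochs 100 and 150 over 200 total epochs."""
    opt = torch.optim.SGD(model.parameters(),
                          lr=0.1, momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[100, 150], gamma=0.1)
    return opt, sched
```

In training, `sched.step()` would be called once per epoch after the epoch's optimizer updates, so the learning rate drops from 0.1 to 0.01 after epoch 100 and to 0.001 after epoch 150.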

Figure 4: Clean and robust accuracy of extracting a nonrobust model.

In Tables 23 to 38, we display the results of different attack scenarios on CIFAR100. The victim models include ResNet-AT, ResNet-TRADES, WRN-AT, and WRN-TRADES. The adversary models include ResNet, WRN, VGG, and MobileNet. Clearly, our BEST outperforms the other baselines under various settings, and we draw the same conclusions as in the main paper. Notably, when the adversary adopts VGG as his model to steal a victim model with logits, the other baselines cannot make the model converge on CIFAR100. This is because using logits as labels introduces more noise during the training process, and training VGG is more difficult than training the other models. Our BEST stays stable when the adversary uses VGG, as our method only requires the hard labels.

Figure 5: Transferability stabilization of our BEST. The dataset of (a), (b), (e) and (f) is CIFAR10. The dataset of (c), (d), (g) and (h) is CIFAR100. The victim model of (a) and (c) is ResNet-AT. The victim model of (e) and (g) is WRN-AT. The victim model of (b) and (d) is ResNet-TRADES. The victim model of (f) and (h) is WRN-TRADES. We generate adversarial examples by using PGD100. The vertical axis represents the model which generates adversarial examples. The horizontal axis represents the model which is attacked by other models' adversarial examples. The number inside each square is the prediction accuracy.

Then in the outer minimization, we optimize the adversary's model M A with such UE x and its label y obtained from the victim model's prediction to minimize the loss function L. Our UEs are generated based on the adversary's restored model M A . Because we want the adversary to modify classification boundaries as much as possible to get close to the victim model's boundaries, the information obtained from the victim model should be maximized, which can be achieved by querying the victim model with UEs.
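To illustrate the inner minimization, below is a hedged PGD-style sketch of UE generation: it perturbs a sample so that the local model's softmax output approaches the uniform distribution, landing the sample near the classification boundary. The exact objective (KL divergence to the uniform distribution) and the function name are our assumptions; see Algorithm 1 for the actual procedure.

```python
import torch
import torch.nn.functional as F

def generate_ue(model, x, eps=8/255, eta=2/255, steps=10):
    """Sketch of uncertain-example generation (cf. Algorithm 1): a
    PGD-style loop that pushes the adversary's local model toward equal
    confidence over all classes, within an L_inf ball of radius eps."""
    model.eval()
    x_ue = x.clone().detach()
    num_classes = model(x[:1]).shape[1]
    uniform = torch.full((1, num_classes), 1.0 / num_classes, device=x.device)
    for _ in range(steps):
        x_ue.requires_grad_(True)
        log_probs = F.log_softmax(model(x_ue), dim=1)
        # Minimize KL(uniform || p(x_ue)): drives all class scores together.
        loss = F.kl_div(log_probs, uniform.expand_as(log_probs),
                        reduction='batchmean')
        grad = torch.autograd.grad(loss, x_ue)[0]
        with torch.no_grad():
            x_ue = x_ue - eta * grad.sign()                 # descend on the loss
            x_ue = x.clone() + (x_ue - x).clamp(-eps, eps)  # project into the ball
            x_ue = x_ue.clamp(0.0, 1.0)                     # keep a valid image
    return x_ue.detach()
```

The generated UEs would then be sent to the victim model, and the returned hard labels used in the outer minimization to train M_A.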


Table 1 shows the comparison results. First, our BEST generally performs much better than the baseline methods. It has similar clean accuracy (CA) as Vanilla, which is significantly higher than the other methods. For robustness, it also outperforms these baselines, especially Vanilla, which can only obtain clean accuracy but not robustness.

Results of BEST under different adversary model architectures. The victim model is ResNet-AT.

Results of BEST under different victim model architectures and adversarial training approaches. The adversary model is ResNet.

Attack results with and without data augmentation. The victim model is ResNet-AT trained on CIFAR10. The adversary model is ResNet. The adversary's dataset is from CIFAR10.

The sampling-based active learning methods require millions or billions of samples: their threat model assumes that the adversary has lots of data to query the victim model. However, this threat model gives the adversary too much power, and the sampling process needs heavy local computation, which makes stealing the victim model less attractive. That is why the robust performance of JBDA is better than that of CloudLeak. Compared with our BEST, conventional model extraction attacks cannot generate classification-boundary-sensitive queries. In our experiments, we show that previous works cannot steal the victim model's robustness and analyze the gap between robust model extraction and naive model extraction. The core insight behind these methods is to use the student model to generate adversarial examples and use the teacher model's predictions on clean images as the labels for these adversarial examples. Specifically, ARD adopts the PGD attack (Madry et al., 2018) to generate adversarial examples for the student model, and uses Kullback-Leibler divergence to minimize the differences between the predictions of the student model and those of the teacher model on the clean data and adversarial examples. IAD further proposes an adaptive distillation process to overcome the challenge that the teacher model may return unreliable answers, at the cost of longer training time: it adopts the PGD attack to generate adversarial examples for the student model, and combines a teacher guidance loss with a student introspection loss to train the student model. RSLAD replaces the PGD-generated adversarial examples in ARD with TRADES-generated (Zhang et al., 2019) ones to train the student model. There are two main differences between knowledge distillation and model extraction. First, in knowledge distillation, the user has full knowledge of the teacher model's training set and model details.
So the user can adopt the same training set to train a student model. Second, in knowledge distillation, the user can obtain the logits vectors from the teacher model, so the user can adopt a loss function such as Kullback-Leibler divergence or mean squared error to align the logits vectors of the student model and the teacher model. In contrast, in the model extraction scenario, the user normally does not have the same training set and does not get the full logits vectors for the query samples. Hence, the adversary has to use a different set of query samples and mainly adopts the cross-entropy loss to train the local model. So, blindly using robust knowledge distillation as a model extraction attack to steal the victim model's robustness can cause reliability problems. We perform comprehensive experiments to verify that under the model extraction threat model, robust knowledge distillation cannot guarantee good results.
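The contrast between the two settings can be sketched with the two loss functions below; the helper names are ours, and we assume the standard PyTorch formulations of KL divergence and cross-entropy:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits):
    """Knowledge distillation: align the student's full output distribution
    with the teacher's logits via KL divergence (soft-label setting)."""
    return F.kl_div(F.log_softmax(student_logits, dim=1),
                    F.softmax(teacher_logits, dim=1),
                    reduction='batchmean')

def extraction_loss(student_logits, hard_labels):
    """Model extraction under hard labels: only the victim's top-1 class
    is available, so the adversary falls back to cross-entropy."""
    return F.cross_entropy(student_logits, hard_labels)
```

The first loss requires the full logits vector from the victim, which MLaaS often withholds; the second needs only the returned class index, which is the setting BEST operates in.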

In terms of the query budgets, for all the methods in our experiments, we use all data in the extraction set in one training epoch, so the number of training epochs is proportional to the query budget. For instance, 1 training epoch corresponds to a query budget of 5,000, and 10 training epochs correspond to a query budget of 50,000. We use 200 training epochs for all methods as an upper bound and compare both the best results during the model extraction process and the final results each method obtains. Furthermore, we discuss how to reduce our query budget in Appendix C.2. The query budget will not cause much financial cost in practice, as many MLaaS providers offer a certain number of free queries for users.

B.2 DETAILS OF QUERY SAMPLES

In our experiments, our attack augments the clean data during the model extraction attack. For our BEST, the query data are the clean data with uncertain perturbations added, converting them into UEs. As shown in Algorithm 1 in Section 4.2, the perturbation is generated by solving a minimization problem.

The detailed information of victim models.



Model extraction attack results on CIFAR10.

Model extraction attack results under different Query Budgets.

To better quantify the impact of B_S on the model extraction attack, we measure the time cost of the UE generation process. In our experiments, the adversary only needs to generate 5,000 UEs in one epoch. Specifically, when B_S = 10, generating 5,000 UEs costs about 16s on a V100 for ResNet-18 and about 80s for WRN-28-10. The reason is that our method adopts a training pipeline similar to adversarial training. First, the UE generation process is similar to the AE generation process; we only modify the loss function of the original AE generation, so the time costs of AE generation and UE generation are the same. Second, model extraction requires the adversary to query the victim model, which does not cost much time if we ignore network latency. Third, the model training process is the same as adversarial training. Overall, model extraction attacks remain efficient even when restoring a huge deep learning model.

Model extraction attack results when the pre-trained model is adopted or not.

Results of different query distributions. The victim model is ResNet-AT trained on CIFAR10. The adversary model is ResNet.

Figure 4 shows the extraction results for a non-robust victim model (ResNet architecture and CIFAR10 dataset). The adversary uses ResNet for model extraction. Black solid and dashed lines denote the clean and robust accuracy of the victim model. We observe that the extracted model can inherit clean accuracy as well as non-robustness (against PGD20) from the victim model. Therefore, we can draw two conclusions: (1) our BEST is general for both robust and non-robust models. (2) For robustness extraction, the high robustness of the extracted model is indeed learned from the victim, rather than from the synthesized uncertain examples.

Model extraction attack results under different extraction datasets.

Comparisons between BEST and Extraction-AT. The victim model is ResNet-AT. The adversary's model is ResNet.

The reason is that BEST adopts UEs to reshape the local model's boundaries to be similar to the victim's boundaries, obtaining higher clean accuracy. Second, BEST can obtain high robustness under limited clean data when restoring a robust victim model. Because UEs help the local model obtain classification boundaries similar to the victim model's, models restored with BEST exhibit similar behaviors on clean data and adversarial examples, which is challenging for the other baseline methods. Third, BEST can alleviate the robust overfitting problem. Robust overfitting is very common and severe in ARD, IAD and RSLAD, whereas our method does not rely on adversarial examples. The results indicate that our proposed UEs successfully address the robust overfitting challenge. Overall, BEST is better than previous baselines and achieves higher clean accuracy and robust accuracy under limited clean data.

Results of model extraction attacks on CIFAR10. The victim model is WRN-AT. The adversary model is ResNet.

Results of model extraction attacks on CIFAR10. The victim model is ResNet-TRADES. The adversary model is ResNet.

To verify this point, we plot the adversarial examples' transferability in Figure 5. The results show that the adversary model's accuracy under the victim model's adversarial examples is close to the victim model's own accuracy. Furthermore, adversary models with different structures obtain similar accuracy under adversarial examples generated from models with other structures. These two points indicate that our BEST makes M_A achieve transferability stabilization.

Results of model extraction attacks on CIFAR10. The victim model is WRN-TRADES. The adversary model is ResNet.

Results of model extraction attacks on CIFAR10. The victim model is ResNet-AT. The adversary model is WRN.

Results of model extraction attacks on CIFAR10. The victim model is WRN-AT. The adversary model is WRN.

Results of model extraction attacks on CIFAR10. The victim model is ResNet-TRADES. The adversary model is WRN.

Results of model extraction attacks on CIFAR10. The victim model is WRN-TRADES. The adversary model is WRN.

Results of model extraction attacks on CIFAR10. The victim model is ResNet-AT. The adversary model is VGG.

Results of model extraction attacks on CIFAR10. The victim model is WRN-AT. The adversary model is VGG.

Results of model extraction attacks on CIFAR10. The victim model is ResNet-TRADES. The adversary model is VGG.

Results of model extraction attacks on CIFAR10. The victim model is WRN-TRADES. The adversary model is VGG.

Results of model extraction attacks on CIFAR10. The victim model is ResNet-AT. The adversary model is MobileNet.

Results of model extraction attacks on CIFAR10. The victim model is WRN-AT. The adversary model is MobileNet.

Results of model extraction attacks on CIFAR10. The victim model is ResNet-TRADES. The adversary model is MobileNet.

Results of model extraction attacks on CIFAR10. The victim model is WRN-TRADES. The adversary model is MobileNet.

Results of model extraction attacks on CIFAR100. The victim model is ResNet-AT. The adversary model is ResNet.

Results of model extraction attacks on CIFAR100. The victim model is WRN-AT. The adversary model is ResNet.

Results of model extraction attacks on CIFAR100. The victim model is ResNet-TRADES. The adversary model is ResNet.

Results of model extraction attacks on CIFAR100. The victim model is WRN-TRADES. The adversary model is ResNet.

Results of model extraction attacks on CIFAR100. The victim model is ResNet-AT. The adversary model is WRN.

Results of model extraction attacks on CIFAR100. The victim model is WRN-AT. The adversary model is WRN.

Results of model extraction attacks on CIFAR100. The victim model is ResNet-TRADES. The adversary model is WRN.

Results of model extraction attacks on CIFAR100. The victim model is WRN-TRADES. The adversary model is WRN.

Results of model extraction attacks on CIFAR100. The victim model is ResNet-AT. The adversary model is VGG.

Results of model extraction attacks on CIFAR100. The victim model is WRN-AT. The adversary model is VGG.

Results of model extraction attacks on CIFAR100. The victim model is ResNet-TRADES. The adversary model is VGG.

Results of model extraction attacks on CIFAR100. The victim model is WRN-TRADES. The adversary model is VGG.

Results of model extraction attacks on CIFAR100. The victim model is ResNet-AT. The adversary model is MobileNet.

Results of model extraction attacks on CIFAR100. The victim model is WRN-AT. The adversary model is MobileNet.

Results of model extraction attacks on CIFAR100. The victim model is ResNet-TRADES. The adversary model is MobileNet.

Results of model extraction attacks on CIFAR100. The victim model is WRN-TRADES. The adversary model is MobileNet.

8. ACKNOWLEDGEMENT

This work is supported under the RIE2020 Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). It is also supported in part by the Singapore Ministry of Education (MOE) AcRF Tier 2 MOE-T2EP20121-0006 and AcRF Tier 1 RS02/19. Furthermore, we appreciate the help from Dr. Adam Dziedzic and the anonymous reviewers during the rebuttal period.

