EXTRACTING ROBUST MODELS WITH UNCERTAIN EXAMPLES

Abstract

Model extraction attacks have proven to be a severe privacy threat to Machine Learning as a Service (MLaaS). A variety of techniques have been designed to steal a remote machine learning model with high accuracy and fidelity. However, how to extract a robust model with similar resilience against adversarial attacks has never been investigated. This paper presents the first study toward this goal. We first show that existing extraction solutions either fail to maintain the model's accuracy or robustness, or suffer from the robust overfitting issue. We then propose Boundary Entropy Searching Thief (BEST), a novel model extraction attack that achieves both accuracy and robustness extraction under restricted attack budgets. BEST generates a new kind of sample, uncertain examples, for querying and reconstructing the victim model. These samples have near-uniform confidence scores across different classes, which balance the trade-off between model accuracy and robustness. Extensive experiments demonstrate that BEST outperforms existing attack methods across different datasets and model architectures under limited data. It can also effectively invalidate state-of-the-art extraction defenses. Our code is available at https://github.com/GuanlinLee/BEST.
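The key idea above, uncertain examples whose predicted class probabilities are near-uniform, can be illustrated as entropy maximization over the input. The sketch below is not the BEST algorithm itself (which is defined later in the paper); it is a minimal toy with an assumed linear classifier `z = W @ x`, running gradient ascent on the softmax entropy so the prediction approaches the uniform distribution over classes.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy; small epsilon guards log(0)
    return -np.sum(p * np.log(p + 1e-12))

def uncertain_example(x, W, steps=300, lr=0.5):
    """Toy sketch: push input x toward maximal prediction entropy
    for a linear classifier z = W @ x (hypothetical model)."""
    x = x.copy()
    for _ in range(steps):
        p = softmax(W @ x)
        h = entropy(p)
        # analytic gradient: dH/dz = -p * (log p + H), chained through z = W @ x
        grad_x = W.T @ (-p * (np.log(p + 1e-12) + h))
        x += lr * grad_x
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # 3 classes, 5 input features
x0 = rng.normal(size=5)
x_u = uncertain_example(x0, W)
p_u = softmax(W @ x_u)
# p_u is close to uniform; its entropy approaches log(3)
```

For a deep network one would replace the analytic gradient with autodiff and constrain the perturbation (e.g., an L-infinity ball around a seed image), but the goal is the same: query points lying near the decision boundary where all class scores are comparable.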

1. INTRODUCTION

Recent advances in deep learning (DL) and cloud computing technologies have boosted the popularity of Machine Learning as a Service (MLaaS), e.g., AWS SageMaker (sag, 2022) and Azure Machine Learning (azu, 2022). This service can significantly simplify DL application development and deployment at a lower cost. Unfortunately, it also brings new privacy threats: an adversarial user can query a target model and then reconstruct it based on the responses (Tramèr et al., 2016; Orekondy et al., 2019; Jagielski et al., 2020b; Yuan et al., 2020; Yu et al., 2020). Such model extraction attacks can severely compromise the intellectual property of the model owner (Jia et al., 2021) and facilitate other black-box attacks, e.g., data poisoning (Demontis et al., 2019), adversarial examples (Ilyas et al., 2018), and membership inference (Shokri et al., 2017).

Existing model extraction attacks can be classified into two categories (Jagielski et al., 2020b). (1) Accuracy extraction aims to reconstruct a model with similar or superior accuracy compared with the target model. (2) Fidelity extraction aims to recover a model with similar prediction behaviors as the target one. In this paper, we propose and consider a new category of attacks: robustness extraction. As DNNs are well known to be vulnerable to adversarial attacks (Szegedy et al., 2014), it is common to train highly robust models for practical deployment, especially in critical scenarios such as autonomous driving (Shen et al., 2021), medical diagnosis (Rendle et al., 2016), and anomaly detection (Goodge et al., 2020). An interesting question then arises: given a remote robust model, how can the adversary extract this model with similar robustness as well as accuracy under limited attack budgets? We believe this question is important for two reasons.
(1) With the increased understanding of adversarial attacks, it has become a trend to deploy robust machine learning applications in the cloud (Goodman & Xin, 2020; Rendle et al., 2016; Shafique et al., 2020), giving the adversary opportunities to steal the model. (2) Training a robust model usually requires much more computational resources and data (Schmidt et al., 2018; Zhao et al., 2020), giving the adversary incentives to steal the model. Unfortunately, our review of existing attack techniques shows that they are incapable of achieving this goal. In particular, there are two kinds of attack solutions. (1) The adversary adopts clean samples to query and extract the victim model (Tramèr et al., 2016; Orekondy et al., 2019; Pal et al., 2020). However, past works have proved that it is impossible to obtain a robust model only from clean

