DUAL STUDENT NETWORKS FOR DATA-FREE MODEL STEALING

Abstract

Data-free model stealing aims to replicate a target model without access to its training data or parameters. To accomplish this, existing methods use a generator to produce samples with which a student model is trained to match the target model outputs. In this setting, the two main challenges are estimating gradients of the target model without access to its parameters, and generating a diverse set of training samples that thoroughly explores the input space. We propose a Dual Student method where two students are symmetrically trained, which gives the generator a criterion to produce samples on which the two students disagree. On one hand, disagreement on a sample implies that at least one student has classified the sample incorrectly with respect to the target model. This incentive towards disagreement implicitly encourages the generator to explore more diverse regions of the input space. On the other hand, our method utilizes gradients of the student models to indirectly estimate gradients of the target model. We show that this novel training objective for the generator network is equivalent to optimizing a lower bound on the loss the generator would incur if we had access to the target model gradients. In other words, our method alters the standard data-free model stealing paradigm by substituting a separate student model for the target model, thereby creating a lower bound which can be directly optimized without additional target model queries or separate synthetic datasets. We show that our new optimization framework provides more accurate gradient estimation of the target model and better accuracies on benchmark classification datasets. Additionally, our approach balances improved query efficiency against training computation cost. Finally, we demonstrate that our method serves as a better proxy model for transfer-based adversarial attacks than existing data-free model stealing methods.

1. INTRODUCTION

Model stealing has been shown to be a serious vulnerability in current machine learning models. Machine learning models are increasingly deployed in products where the model's output is accessible through APIs, a setting also known as Machine Learning as a Service. Companies put a large amount of effort into training these models through the collection and annotation of large amounts of data. However, recent work has shown that the ability to query a model and obtain its output - without access to the target model's weights - enables adversaries to employ a range of model stealing approaches, in which the attacker trains a student model to replicate the functionality of the target model (Kesarwani et al., 2018; Yu et al., 2020; Yuan et al., 2022; Truong et al., 2021; Sanyal et al., 2022). Two major motivations for stealing a private model are using the stolen model for downstream adversarial attacks and monetary gain; model stealing methods therefore present a growing problem (Tramèr et al., 2016; Zhang et al., 2022). Data-free model stealing is a generalization of black-box model stealing, which in turn is an extension of knowledge distillation. In all three areas, the aim is to obtain a student model that imitates the target model. However, while knowledge distillation typically assumes full knowledge of the target model's training data and weights, black-box model stealing eliminates the need for access to the target model's weights and training data. Going one step further, black-box model stealing typically uses real-world data samples to train the student network, whereas data-free model stealing removes that requirement by leveraging a generator. As the relationship between the substitute dataset and the true dataset is typically unknown, removing the substitute data results in more generalizable approaches.
Existing methods for data-free model stealing either use a min-max adversarial approach (Truong et al., 2021; Kariyappa et al., 2021; Zhang et al., 2022) or a generator-discriminator setup incorporating Generative Adversarial Networks (GANs) (Sanyal et al., 2022). DFME (Truong et al., 2021) and DFMS-HL (Sanyal et al., 2022) are the most recent and highest performing methods for these two approaches, respectively. In both, a student is optimized to minimize the distance between the student and target model outputs. However, in the former, a generator is optimized to maximize the distance between student and target model outputs, while in the latter, the generator-discriminator pair is optimized to generate samples similar to a synthesized dataset with balanced student classifications. Although utilizing a synthesized dataset relaxes the truly data-free requirement, the current state of the art (SOTA) is held by DFMS-HL, which does utilize synthetic data, while DFME, which requires no synthetic data, held the previous SOTA. Our proposed Dual Student method alters the generator-student min-max framework by training two student models, instead of one, to match the target model. This allows the generator to use one of the students as a proxy for the target model, resulting in a much more effective training criterion. The two students are symmetrically trained to match the target model, while the new generator objective produces samples on which the two students disagree. Disagreement between the two students implies disagreement between at least one of the students and the target model, making the generated sample a hard sample (cf. hard sample mining (Shrivastava et al., 2016)) for at least one student. Such hard samples provide the driving force to explore controversial regions of the input space on which the students and teacher disagree.
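The disagreement criterion can be illustrated with a short, self-contained sketch. The function names and the L1 distance between the two students' predicted distributions are our own illustrative choices, not necessarily the paper's exact loss; the point is only that disagreement between the two students gives the generator a directly differentiable score to maximize:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def disagreement_loss(logits_s1, logits_s2):
    """L1 distance between the two students' predicted distributions.

    The generator is trained to MAXIMIZE this quantity: a large value
    means the students disagree on the sample, so at least one student
    must differ from the target model on it (a hard sample).
    """
    p1 = softmax(logits_s1)
    p2 = softmax(logits_s2)
    return np.abs(p1 - p2).sum(axis=-1).mean()

# Toy logits: a pair of students that agree vs. a pair that disagree.
student_a = np.array([[2.0, 0.1, -1.0]])
student_b_agree = np.array([[1.9, 0.2, -0.9]])
student_b_disagree = np.array([[-1.0, 2.0, 0.1]])

low = disagreement_loss(student_a, student_b_agree)
high = disagreement_loss(student_a, student_b_disagree)
assert high > low  # the generator prefers samples like the second pair
```

Because both students are differentiable, this objective can be optimized by ordinary backpropagation, with no extra queries to the target model.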
Finding samples on which the student and teacher disagree is also an implicit goal of DFME. However, while DFME does optimize the generator to find disagreement between the student and target, it relies on gradient estimates of the target. Our method, on the other hand, uses the true gradients of the second student, which can be obtained by direct backpropagation. Using two student models as a proxy for the target model yields better target model gradient estimates than data-free model stealing methods like DFME, which use explicit gradient estimation. Formalizing our setup, substituting a second student model for the target model creates an alternative lower bound for the student-teacher optimization. Changing the training objective of the generator-student framework to more directly match the target model results in student model gradients that are better aligned with the target model gradients. In our experiments, we show that the gradients used when training dual students are closer to the true gradients of the target model than those produced by the estimation method used by DFME (Truong et al., 2021). This important change eliminates the need of some existing approaches (Truong et al., 2021; Kariyappa et al., 2021) to estimate the gradient through the target model. The gradient estimation techniques used in these works assume the target model provides soft-label outputs; removing the gradient estimation removes this dependence on soft labels and allows existing methods to be easily extended to the more difficult hard-label setting. Removing the gradient estimation also reduces the number of queries made to the target model. Finally, we explore a more difficult data-free setting where the number of classes is not known in advance. To the best of our knowledge, we are the first to address this setting and provide a solution to this new challenging data-free setup.
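The contrast between explicit gradient estimation and backpropagation through a differentiable proxy can be made concrete on a toy problem. Everything below is a hypothetical illustration: a simple quadratic loss stands in for the query-only target model, and a random-direction forward-difference estimator stands in for DFME-style zeroth-order estimation (a sketch of the technique, not their actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box_loss(x):
    """Stand-in for a query-only target: we may evaluate it,
    but cannot backpropagate through it."""
    return float((x ** 2).sum())

def true_grad(x):
    # Analytic gradient of the quadratic above. In the Dual Student
    # setup this role is played by backpropagating through the second
    # student, which costs no target queries at all.
    return 2.0 * x

def zeroth_order_grad(f, x, num_dirs=500, eps=1e-3):
    """Random-direction forward-difference estimator: each direction u
    contributes ((f(x + eps*u) - f(x)) / eps) * u, whose expectation is
    the true gradient when u is standard normal."""
    g = np.zeros_like(x)
    fx = f(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.size)
        g += (f(x + eps * u) - fx) / eps * u
    return g / num_dirs  # cost: num_dirs + 1 queries to the black box

x = rng.standard_normal(8)
est = zeroth_order_grad(black_box_loss, x)
exact = true_grad(x)
cos = est @ exact / (np.linalg.norm(est) * np.linalg.norm(exact))
assert cos > 0.5  # the estimate aligns with the true gradient
```

The estimator converges to the true gradient only as the number of probe directions grows, and every probe is an extra target query; replacing the estimated target gradient with the exact gradient of a student proxy avoids both the variance and the query cost.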
In summary, our contributions in this paper are as follows: (1) We propose the Dual Student method, which provides a tractable approach to the data-free model extraction problem with minimal added components. This includes a mathematical reframing of the data-free model stealing objective along with an empirical comparison against the previous formulation provided by DFME (Truong et al., 2021). (2) The Dual Student method can be incorporated into existing soft-label approaches to extend them to hard-label settings, and achieves better classification accuracy than existing methods. (3) We show the effectiveness of utilizing the Dual Student setup to fine-tune fully trained models by

