IDEAL: QUERY-EFFICIENT DATA-FREE LEARNING FROM BLACK-BOX MODELS

Abstract

Knowledge Distillation (KD) is a typical method for training a lightweight student model with the help of a well-trained teacher model. However, most KD methods require access to either the teacher's training data or its model parameters, which is unrealistic. To tackle this problem, recent works study KD under data-free and black-box settings. Nevertheless, these works require a large number of queries to the teacher model, which incurs significant monetary and computational costs. To address these problems, we propose a novel method called query-effIcient Data-free lEarning from blAck-box modeLs (IDEAL), which aims to query-efficiently learn from black-box model APIs to train a good student without any real data. In detail, IDEAL trains the student model in two stages: data generation and model distillation. Note that IDEAL does not require any query in the data generation stage and queries the teacher only once for each sample in the distillation stage. Extensive experiments on various real-world datasets show the effectiveness of the proposed IDEAL. For instance, IDEAL can improve the performance of the best baseline method DFME by 5.83% on the CIFAR10 dataset with only 0.02× the query budget of DFME.

1. INTRODUCTION

Knowledge Distillation (KD) has emerged as a popular paradigm for model compression and knowledge transfer Gou et al. (2021). The goal of KD is to train a lightweight student model with the help of a well-trained teacher model. The lightweight student model can then be easily deployed to resource-limited edge devices such as mobile phones. In recent years, KD has attracted significant attention from various research communities, e.g., computer vision Wang (2021); Passalis et al. (2020); Hou et al. (2020); Li et al. (2020), natural language processing Hinton et al. (2015); Mun et al. (2018); Nakashole & Flauger (2017); Zhou et al. (2020b), and recommendation systems Kang et al. (2020); Wang et al. (2021a); Kweon et al. (2021); Shen et al. (2021). However, most KD methods rest on several unrealistic assumptions: (1) users can directly access the teacher's training data; (2) the teacher model is a white-box model, i.e., its parameters and structure information can be fully utilized. For example, to facilitate the training process, FitNets Romero et al. (2015) uses not only the original training data but also the outputs of the teacher's intermediate layers. In real-world applications, however, the teacher model is usually provided by a third party, so the teacher's training data is usually not public and cannot be accessed. In fact, the teacher model is mostly trained by big companies with extensive amounts of data and plenty of computation resources, which constitute the core competitiveness of these companies. As a result, the specific parameters and structural information of the teacher model are never exposed in the real world. Consequently, the need to access the teacher model or the teacher's training data renders these KD methods impractical in reality. To solve these problems, some recent studies Truong et al. (2021b); Fang et al. (2021a) attempt to learn from a black-box teacher model without any real data, i.e., data-free black-box KD.
Table 1: An empirical study of previous methods with a limited number of queries (we set the query budget Q = 25K for MNIST, Q = 250K for CIFAR10, and Q = 2M for CIFAR100) in various scenarios. We also adopt CMI Fang et al. (2021b) for hard-label scenarios and name it "CMI*".

(3) the teacher model only returns a category index for each sample, i.e., hard-label; and (4) the number of queries is limited, i.e., query-efficient. To better understand the difficulty of this setting, we report the top-1 test accuracy of student models under different scenarios with a limited query budget (see footnote 2) in Table 1. As shown in Table 1, we make several observations: (1) In white-box scenarios, data-free KD can achieve satisfactory performance, but when the model API is restricted to only hard labels, CMI Fang et al. (2021b) suffers from serious performance degradation. This indicates that logits provide more information for training, whereas learning from hard labels is more difficult; (2) With the same number of queries, the performance of these methods decreases dramatically under the black-box scenarios. Furthermore, the accuracy of data-free black-box KD with hard labels is only 14.28% on the CIFAR10 dataset, which is close to random guessing (10%). Consequently, in this paper, we focus primarily on how to query-efficiently train a good student model from black-box models with hard labels, which is very practical but challenging.

For this purpose, we propose a novel method called query-effIcient Data-free lEarning from blAck-box modeLs (IDEAL), which trains the student model in two stages: a data generation stage and a model distillation stage. Instead of utilizing the teacher model (as in previous methods Truong et al. (2021b)), we propose to adopt the student model to train the generator in the first stage, which solves the hard-label issue and largely reduces the number of queries to the teacher model.
In the second stage, we train a student model that has similar predictions as the teacher model on the synthetic samples. As a result, IDEAL requires a much smaller query budget than previous methods, which saves a lot of money and is more practical in reality. In summary, our main contributions include:

• New Problem: We focus on how to query-efficiently train a good student model from black-box models with only hard labels. To the best of our knowledge, our setting is the most practical and challenging to date.
• More Efficient: We propose a novel method called IDEAL, which does not require any query in the data generation stage and queries the teacher only once for each sample in the distillation stage. Thus IDEAL can train a high-performance student with a small number of queries.

By contrast, our study considers a much more challenging scenario, in which a black-box teacher only returns the top-1 class. Moreover, in real-world scenarios, these black-box models usually charge for each query. To achieve good performance, these methods require millions of queries, which consume a lot of computing resources and money.

2.3. COMPARISON WITH RELATED WORKS

The most related work is ZSDB3 Wang (2021), which also studied data-free black-box distillation with hard labels. It proposes to generate pseudo samples distinguished by the teacher's decision boundaries and then reconstruct the soft labels for distillation. More specifically, it calculates the minimal ℓ2-norm distance between the current sample and those of other classes (measured by the teacher model) and uses a zeroth-order optimization method to estimate the gradient of the teacher model. This requires a large number of queries, making ZSDB3 impractical in real-world scenarios. By contrast, we consider a more challenging and practical setting where only a very small number of queries (to the teacher model) is allowed, i.e., query-efficient. For example, ZSDB3 requires about 1000 queries to reconstruct the soft label (logits) of a single sample on the MNIST dataset, while our method requires only one query, which reduces the number of queries by 1000×.

the synthetic sample (generated by G) and the corresponding prediction score. Subscript i denotes the i-th sample, e.g., $x_i$ denotes the i-th synthetic sample. Superscript j denotes the j-th epoch, e.g., $D^j$ denotes the set of synthetic samples generated in the j-th epoch. C is the number of classes and B is the batch size. We use $(\hat{y})_k$ to denote the k-th element of the output $\hat{y}$, i.e., the prediction score of the k-th class.

3.2. OVERVIEW

In data-free black-box KD, the teacher's model and data are not accessible, and we are only given the prediction of a sample by the teacher model. In particular, we focus on a more practical and challenging setting where only a category index for each sample (i.e., hard-label) is given by the teacher model. Since each query to the teacher costs money, we consider the scenario with limited queries, i.e., query-efficient. Our goal is to query-efficiently learn from black-box models to train a good student without any real data. To achieve this goal, we propose a novel method called IDEAL, which consists of two stages: a data generation stage and a model distillation stage. In the first stage, instead of utilizing the teacher model (as in previous methods Truong et al. (2021b) ; Kariyappa et al. (2021) ), we propose to adopt the student model to train the generator, which can solve the hard-label issue and largely reduce the number of queries to the teacher model. In the second stage, we utilize the teacher model and synthetic samples to train the student model. The generator and student model are iteratively trained for E epochs. The training procedure is demonstrated in the Appendix (see Algorithm 1) and the illustration of the training process of IDEAL is shown in Fig. 1 .
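The two-stage loop described above can be sketched as follows. This is a minimal, framework-free illustration of the query accounting only, not the authors' Algorithm 1: `CountingTeacher`, `train_generator`, `distill_step`, `run_ideal`, and the toy labelling rule are hypothetical placeholders, and the actual learning updates are left as stubs.

```python
import random


class CountingTeacher:
    """Hypothetical black-box teacher: returns a hard label and counts queries."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.queries = 0

    def predict(self, x):
        self.queries += 1
        # Toy labelling rule standing in for a remote API call.
        return int(sum(x) * 1000) % self.num_classes


def train_generator(student_state, batch_size, dim, rng):
    """Stage 1 (stub): relies only on the student, so it never queries the teacher."""
    # A real implementation would re-initialize G and run E_G gradient steps
    # on the generator loss computed from the student's outputs.
    return [[rng.random() for _ in range(dim)] for _ in range(batch_size)]


def distill_step(student_state, batch, labels):
    """Stage 2 (stub): one supervised update of the student on (x, y_T)."""
    student_state["seen"] += len(batch)


def run_ideal(teacher, epochs, batch_size, dim=8, seed=0):
    rng = random.Random(seed)
    student_state = {"seen": 0}
    for _ in range(epochs):
        batch = train_generator(student_state, batch_size, dim, rng)
        labels = [teacher.predict(x) for x in batch]  # one query per sample
        distill_step(student_state, batch, labels)
    return student_state


teacher = CountingTeacher(num_classes=10)
state = run_ideal(teacher, epochs=100, batch_size=250)
print(teacher.queries)  # 25000 = B x E
```

In this sketch the teacher is touched only inside the distillation stage, so the total number of queries grows as the batch size times the number of epochs, regardless of how many generator iterations are run per epoch.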

3.3. DATA GENERATION

In the data-free setting, we are unable to access the original training data for training the student model. Therefore, in the first stage, we aim to train a generator to generate the desired synthetic data (to train the student model). Following the finding in Zhang et al. (2022), we reinitialize the generator at each epoch. The data generation procedure is illustrated in Fig. 1(a). The first step is generating the synthetic sample. Given a random noise z (sampled from a standard Gaussian distribution) and a corresponding random one-hot label y (sampled from a uniform distribution), the generator G aims to generate a desired synthetic sample x corresponding to label y. Specifically, we feed z into the generator G and compute the synthetic sample as follows: $x = G(z; \theta_G)$, where $\theta_G$ denotes the parameters of G. The synthetic samples are used to train G. In the second step, we compute the prediction score of x with the student model: $\hat{y} = S(x; \theta_S)$, where S and $\theta_S$ are the student model and its parameters. The third step is optimizing the generator. We propose to train a generator that considers both confidence and balancing.

3.3.1. CONFIDENCE

First, we need to consider confidence, i.e., the synthetic sample should be classified into the specified class with high confidence. To achieve this goal, we minimize the difference between the prediction score $\hat{y}$ and the specified label y: $L_{ce} = CE(\hat{y}, y)$, where CE(•, •) is the cross-entropy (CE) loss. In practice, the generator G converges quickly when trained with $L_{ce}$. Since the generated data x is fitted to the student S, whereas we intend to generate data according to the knowledge of the teacher model, we must avoid overfitting to S. Therefore, we need to control the number of iterations $E_G$ in data generation. Too few iterations may lead to poor data, while too many iterations may lead to overfitting. See the detailed experiments in Section 4.1.

3.3.2. BALANCING

Second, we need to consider balancing, i.e., the number of synthetic samples in each class should be balanced. Although we uniformly sample the specified label y, we observe that the prediction score $\hat{y}$ is not balanced, i.e., the prediction score is high on some classes but low on the others. This leads to class imbalance in the generated synthetic samples. Motivated by Chen et al. (2019), we employ the information entropy loss to measure the class balance of the synthetic samples. In particular, given a batch of synthetic samples $\{x_i\}_{i=1}^{B}$ and corresponding prediction scores $\{\hat{y}_i\}_{i=1}^{B}$, where B is the batch size, we first compute the average of the prediction scores as follows: $\hat{y}_{avg} = \frac{1}{B} \sum_{i=1}^{B} \hat{y}_i$. Then, we compute the information entropy loss as follows: $L_{info} = \frac{1}{C} \sum_{k=1}^{C} (\hat{y}_{avg})_k \log((\hat{y}_{avg})_k)$, where $(\hat{y}_{avg})_k$ is the k-th element of $\hat{y}_{avg}$, i.e., the average prediction score of the k-th class. When $L_{info}$ attains its minimum, each element of $\hat{y}_{avg}$ equals $\frac{1}{C}$, which implies that G generates synthetic samples of each class with equal probability. By combining the above losses, we obtain the generator loss as follows: $L_{gen} = L_{ce} + \lambda L_{info}$, where λ is the scaling factor. By minimizing $L_{gen}$, we train a generator that generates the desired balanced synthetic samples.
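To make the generator objective concrete, the sketch below evaluates the confidence loss, the information entropy loss, and their weighted sum on two small hand-written batches of student prediction scores. It is an illustrative pure-Python calculation under the definitions above, not the authors' implementation: the function names and example batches are hypothetical, averaging the CE term over the batch is an assumption, and λ = 5 mirrors the default reported in the experimental settings.

```python
import math


def cross_entropy(probs, target):
    """CE between a probability vector and a class index (one-hot label)."""
    return -math.log(probs[target])


def info_loss(batch_probs):
    """L_info = (1/C) * sum_k p_k * log(p_k), with p = batch-average scores."""
    num_classes = len(batch_probs[0])
    batch_size = len(batch_probs)
    avg = [sum(row[k] for row in batch_probs) / batch_size
           for k in range(num_classes)]
    # Use the convention 0 * log(0) = 0.
    return sum(p * math.log(p) for p in avg if p > 0) / num_classes


def generator_loss(batch_probs, targets, lam=5.0):
    """L_gen = (batch-mean L_ce) + lam * L_info for one batch."""
    ce = sum(cross_entropy(p, t) for p, t in zip(batch_probs, targets))
    return ce / len(targets) + lam * info_loss(batch_probs)


# Balanced batch: each sample is confident in a different class.
balanced = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
# Collapsed batch: every sample is assigned to class 0.
collapsed = [[0.9, 0.05, 0.05]] * 3

print(info_loss(balanced))   # equals -log(3)/3, the minimum for C = 3
print(info_loss(collapsed))  # closer to 0, i.e. a larger loss
print(generator_loss(balanced, [0, 1, 2]) < generator_loss(collapsed, [0, 1, 2]))
```

The balanced batch has uniform average scores and therefore reaches the lower bound of the entropy term, whereas the collapsed batch is penalized both by that term and by the CE term on the classes it fails to produce.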

3.4. MODEL DISTILLATION

In the second stage, we train the student model S with the teacher model T and the synthetic samples. The training process is illustrated in Fig. 1(b). Our goal is to obtain a student model S that has the same predictions as the teacher model T on the synthetic samples (generated by generator G). In particular, we first sample random noise and generate a synthetic sample x with the generator. Second, we feed x into the black-box teacher model and obtain its label as follows: $y_T = T(x)$. We treat $y_T$ as the ground-truth label of x. Since the teacher model only returns a hard label, $y_T$ is a one-hot label. Afterwards, we feed x into the student model and obtain the prediction score as follows: $\hat{y} = S(x; \theta_S)$. Last, we optimize the student model by minimizing the CE loss as follows: $L_{md} = CE(\hat{y}, y_T)$. By minimizing $L_{md}$, the student model learns to make predictions similar to those of the teacher model on the synthetic samples, which leads to the desired student model.

In fact, these techniques are essentially data-free distillation methods in black-box scenarios. 3) Furthermore, we compare our method with ZSDB3 Wang (2021), which also focuses on improving the performance of black-box data-free distillation in label-only scenarios. Since we consider the limited query budget scenario, we adopt the same query budget Q for all methods. In particular, we set the query budget Q = 25K for MNIST and Q = 100K for FMNIST and SVHN. The default query budget is Q = 250K for CIFAR10 and ImageNet subset. For large datasets with a large number of classes (i.e., CIFAR100 and Tiny-ImageNet), we set the query budget Q = 2M. For our method, each sample only needs to query the teacher model once, so the total number of queries is Q = B × E, where B is the batch size and E denotes the number of training epochs. To update the generator, we use the Adam optimizer with learning rate $\eta_G$ = 1e-3.
To train the student model, we use the SGD optimizer with momentum 0.9 and learning rate $\eta_S$ = 1e-2. We set the batch size B = 250 for MNIST, FMNIST, SVHN, CIFAR10, and ImageNet subset, and B = 1000 for CIFAR100 and Tiny-ImageNet. By default, we set the number of iterations in data generation $E_G$ = 5 and the scaling factor λ = 5. The number of epochs E is computed according to the query budget. For evaluation, we run each experiment 3 times and report the average top-1 test accuracy.
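Because each synthetic sample is sent to the teacher exactly once, the number of epochs follows directly from Q = B × E. The sketch below works through that arithmetic for the query budgets and batch sizes listed above; the helper name `epochs_for_budget` is hypothetical, and the use of integer division for budgets that are not exact multiples of B is an assumption.

```python
def epochs_for_budget(query_budget, batch_size):
    """Number of epochs E such that B * E does not exceed the budget Q."""
    return query_budget // batch_size


# (query budget Q, batch size B) as listed in the experimental settings.
settings = {
    "MNIST": (25_000, 250),
    "FMNIST": (100_000, 250),
    "SVHN": (100_000, 250),
    "CIFAR10": (250_000, 250),
    "ImageNet subset": (250_000, 250),
    "CIFAR100": (2_000_000, 1000),
    "Tiny-ImageNet": (2_000_000, 1000),
}

for name, (q, b) in settings.items():
    print(f"{name}: E = {epochs_for_budget(q, b)}")
```

Under these settings the budgets are exact multiples of the batch size, so no queries are left unused.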

4.2.1. PERFORMANCE COMPARISON ON SMALL DATASET

First, we show the results of different KD methods on MNIST, FMNIST, SVHN, CIFAR10, and ImageNet subset using various teacher models in Table 2. From the table, we observe that: (1) Our proposed IDEAL outperforms all the baseline methods on all datasets. For instance, our method achieves 87.65% accuracy on the SVHN dataset when the teacher model is ResNet-18, whereas the best baseline method DFME achieves only 64.82% accuracy under the same query budget. In general, IDEAL improves the performance of the best baseline by at least 20% under the same settings. (2) The black-box teacher models trained on MNIST are much easier for the student to learn. Even with very few queries, the student model of our proposed IDEAL achieves over 96% accuracy on MNIST. We argue that this is reasonable because this task is simple for neural networks to solve, and the underlying representations are easy to learn. However, even for such a simple task, other methods cannot derive a good student model with the same small query budget. For example, when learning from the black-box AlexNet trained on MNIST, the best baseline DFME only achieves 66.45% accuracy. (3) DAFL and ZSKT have the worst performance on all datasets. For example, the accuracy of ZSKT is only 12.56% when the teacher model is AlexNet on CIFAR10, which is close to random guessing (10%). We conjecture that this is because white-box KD methods are not suitable for black-box scenarios. These methods mainly depend on white-box information, such as the model structure and the probabilities or logits returned by the teacher model. Therefore, using these methods in black-box scenarios significantly reduces their effectiveness.

Following the settings in DaST Zhou et al. (2020a), we also conduct KD in a real-world scenario. In particular, we adopt the API provided by Microsoft Azure (see footnote 3), trained on the MNIST dataset, as the teacher model and utilize LeNet Lecun et al. (1998) as the student model. As illustrated in Fig. 2, our method converges quickly and is very stable compared to other methods. In fact, our method achieves over 98% test accuracy after 10,000 queries, which implies that our proposed method is also effective and efficient for real-world APIs.

In previous experiments, we consider training the student model with a limited query budget.

4.2.4. PERFORMANCE UNDER DIFFERENT QUERY BUDGET

As described in previous studies Zhou et al. (2020a); Truong et al. (2021a); Wang (2021), these methods require millions of queries to the black-box model. Therefore, we increase the number of queries of the other baseline methods to provide a more comprehensive comparison, without increasing the number of queries of our method. More specifically, we increase the number of queries used by the baseline methods (ZSDB3 and DFME) on the CIFAR10 dataset from 200K to 10M. Fig. 3 illustrates the training curves of these methods with Q = 200K and Q = 10M, respectively. Note that ZSDB3 and DFME reach their highest accuracies of 56.39% and 57.94%, respectively (right panel in Fig. 3), when a large number of queries is involved. By contrast, our approach achieves 63.77% with only 0.02× the query budget of both ZSDB3 and DFME. This validates the effectiveness of our method for query-efficient KD.

Figure 4: Visualization of data generated by different methods on MNIST. Our approach can synthesize more diverse data, and there is a clear visual distinction between samples in different classes.

In this subsection, we present some synthesized examples of ZSDB3, DFME, and our method to evaluate the visual diversity. As can be seen in Fig. 4, images generated by ZSDB3 are all of very low quality and do not show any meaningful patterns. Moreover, the image samples generated by ZSDB3 and DFME both exhibit very similar patterns, which implies that the synthetic data has low sample diversity. By contrast, our proposed approach synthesizes more meaningful data: the images generated by our method show a wider range of patterns, which indicates that IDEAL can synthesize more diverse data. It also indicates that it is feasible and effective for our model to replace T with S in generator training without gradient estimation.

4.2.6. EFFECT INVESTIGATION OF DIFFERENT MODULES

In this section, we evaluate the contributions of the different loss functions in Equation 6 used during data generation, and discuss the effect of re-initializing the generator. As shown in Table 4, removing the generator's information loss $L_{info}$ can lead to significant performance degradation. Moreover, our model suffers from an obvious degradation when the generator re-initialization strategy is abandoned, especially on SVHN, CIFAR10, and ImageNet subset. In fact, since the generator is reinitialized in each epoch during training, our method does not depend on the generator from the previous round. In other words, we do not need to train the generator and the student model adversarially, and therefore we do not require a large number of training iterations to guarantee convergence. Besides, we find a significant degradation when we remove $L_{ce}$, which demonstrates its effectiveness in data generation. The experiments verify that all modules are essential to our method.

4.2.7. EFFECT INVESTIGATION OF E G

We also conduct an ablation study to investigate the effect of $E_G$ on the data generation stage. As shown in Table 5 in the Appendix, we vary the value of $E_G$ and report the top-1 test accuracy. We observe that a value of $E_G$ that is either too small or too large fails to reach the best accuracy. To better understand the impact of $E_G$, we show the t-SNE visualization of synthetic data in Fig. 5 in the Appendix. More detailed results can be found in Appendix A.0.1.

5. CONCLUSION

In this paper, we propose query-effIcient Data-free lEarning from blAck-box modeLs (IDEAL) in order to query-efficiently train a good student model from black-box teacher models under the data-free and hard-label setting. To the best of our knowledge, our setting is the most practical and challenging to date. Extensive experiments on various real-world datasets show the effectiveness of our proposed IDEAL. For instance, IDEAL can improve the performance of the best baseline method DFME by 5.83% on the CIFAR10 dataset with only 0.02× the query budget of DFME. We envision this work as a milestone for query-efficient and data-free learning from black-box models.

6. ACKNOWLEDGEMENT

This work is funded by Sony AI.

A APPENDIX

A.0.1 EFFECT INVESTIGATION OF E G

In Fig. 5, the student can clearly identify the synthetic data easily when $E_G$ = 50 (the training accuracy on synthetic data is 100%, while the test accuracy on CIFAR10 is 58.69%), but when $E_G$ = 10, the student cannot classify the synthetic data as accurately (the training accuracy is 63.78%, while the test accuracy is 68.82%). We conjecture that a small value of $E_G$ leads to poor quality of the generated data (a large loss), while a large value of $E_G$ leads to a student model that overfits to the synthetic data. Thus, from the empirical experiments in Table 5, we set $E_G$ = 5 for MNIST and $E_G$ = 10 for CIFAR10 and ImageNet subset.

For fair comparisons, we use the same generator, StyleGAN, for all methods in our experiments. We also study the effect of generator size, as shown in Table 6, where DCGAN, StyleGAN and Transformer-GAN have small, medium and large numbers of parameters, respectively. Different generative models have a negligible effect on the performance of our method. Besides, our method still outperforms the best baseline when using generators of different sizes.

A.0.3 CLASS IMBALANCE IN SYNTHETIC DATA

To avoid class imbalance, we generate the same number of samples per class. As illustrated in Table 7, even if we use some SOTA re-weighting methods to assign different weights to our model, the accuracy drop caused by class imbalance cannot be entirely eliminated. Hence, it is effective to consider class-balanced generation, i.e., keeping the number of synthetic samples per class balanced.



1 https://cloud.google.com/bigquery
2 The detailed settings can be found in Section 4.1.
3 https://azure.microsoft.com/en-us/services/machine-learning/



Figure 1: Illustration of the training process of our proposed IDEAL. The left panel demonstrates the data generation stage. In this stage, we train a generator, that can generate desired synthetic samples, with the student model. The right panel shows the model distillation stage, which trains a student model that has similar predictions as the teacher model on the synthetic samples.

Figure 2: Transferring the knowledge of the online model on Microsoft Azure to the student.

Figure 3: Analyses of our method and other comparison methods (ZSDB3, DFME) with a small query budget (Q = 200K) and a large query budget (Q = 10M ).

Figure 5: T-SNE visualization of synthetic data on CIFAR10 and the corresponding training loss of the generator. When E G = 10, the features are not well separated, indicating that the student can still learn from synthetic data.

need to access the private data and can train the student model with the class probabilities returned by the teacher model. However, in real-world scenarios, the pre-trained model on the remote server may only provide APIs for inference purposes (e.g., commercial cloud services), and these APIs usually return the top-1 class (i.e., hard label) of the given queries. For example, Google BigQuery (see footnote 1) provides APIs for several applications. Such APIs only return a category index for each sample instead of the class probabilities. Moreover, these APIs usually charge for each query to the teacher model, and thus the budget should be considered when issuing queries. Nevertheless, previous

SOTA Results: Extensive experiments on various real-world datasets demonstrate the efficacy of our proposed IDEAL. For instance, IDEAL can improve the performance of the best baseline method (DFME) by 33.46% on MNIST dataset.

model is not accessible in the black-box setting, so backpropagation through it is impossible. Previous black-box KD methods Truong et al. (2021b); Wang (2021); Kariyappa et al. (2021) used gradient estimation methods to obtain an approximate gradient. Nevertheless, they need to estimate the gradient from the black-box teacher model, which requires a large number of queries (to the teacher model) and is therefore not practical. Moreover, in the hard-label setting, the prediction score is not accessible. To this end, we propose to use the student model (instead of the teacher model) to compute the prediction score of x. The details of the student model are discussed in Section 3.4. Note that in this stage, we do not train the student model and keep its parameters fixed. By utilizing the student model, we can directly conduct backpropagation and compute the gradient without querying the teacher model. In this way, we avoid the hard-label problem and the large number of queries at the same time. The prediction score is computed as follows:

Table 2: Accuracy (%) of student models trained with various teacher models on MNIST, FMNIST, SVHN, CIFAR10, and ImageNet subset. Best results are in bold. Best results of the baselines are underlined. "Improvement" denotes the improvements of IDEAL compared with the best baseline.

Table 3: Accuracy (%) of student models on datasets with hundreds of classes. We use ResNet-18 as the default student network, and all results are tested under the same query budget Q = 2M.

In addition to the performance on small datasets, the performance of black-box distillation methods on large datasets deserves further investigation. Data-free knowledge distillation has historically performed poorly Zhou et al. (2020a); Chen et al. (2019) on datasets with a large number of classes (e.g., Tiny-ImageNet and CIFAR100), since it is very difficult to generate synthetic data with particularly rich class diversity. Thus, we also conduct experiments on datasets with more classes (at least 100 classes). Table 3 shows the results of all methods on CIFAR100 and Tiny-ImageNet. As shown in Table 3, it is difficult for all these methods to produce a good student model in the black-box scenario. However, our proposed IDEAL consistently achieves the best performance on these large datasets. For example, IDEAL outperforms the best baseline DFME by 16.95% on CIFAR100 with ResNet-34. When compared with the other baseline methods, our model achieves a significant performance improvement by a large margin of over 18%.

Table 4: Ablation studies by cutting off different modules.

Table 5, we set E G = 5 for MNIST, E G = 10 for CIFAR10 and ImageNet subset. ResNet-18 30.25 57.56 68.82 62.45 58.69 ImageNet subset VGG-16 ResNet-18 26.38 52.59 57.95 48.67 41.73

Table 5: The influence of the number of iterations E G on data generation. We report the top-1 test accuracy (%).

Table 6: The effect of the generator.

