DREAM: DOMAIN-FREE REVERSE ENGINEERING ATTRIBUTES OF BLACK-BOX MODEL

Abstract

Deep learning models are usually black boxes when deployed on machine learning platforms. Prior works have shown that the attributes (e.g., the number of convolutional layers) of a target black-box neural network can be exposed through a sequence of queries. A crucial limitation is that these works assume the dataset used for training the target model to be known beforehand, and leverage this dataset for the model attribute attack. However, it is difficult to access the training dataset of the target black-box model in reality, so it is unclear whether the attributes of a target black-box model can still be revealed in this case. In this paper, we investigate a new problem of Domain-free Reverse Engineering the Attributes of a black-box target Model, called DREAM, which does not require access to the target model's training dataset, and put forward a general and principled framework by casting this problem as an out-of-distribution (OOD) generalization problem. At the heart of our framework, we devise a multi-discriminator generative adversarial network (MDGAN) to learn domain-invariant features. Based on these features, we learn a domain-free model that inversely infers the attributes of a target black-box model with unknown training data. This enables our method to gracefully apply to an arbitrary domain for model attribute reverse engineering with strong generalization ability. Extensive experimental studies are conducted, and the results validate the superiority of our proposed method over the baselines.

1. INTRODUCTION

With its commercialization, machine learning as a service (MLaaS) is becoming increasingly popular, and providers are paying more attention to the privacy of models and the protection of intellectual property. Generally speaking, a machine learning service deployed on a cloud platform is a black box, where users can only obtain outputs by providing inputs to the model. The attributes of the model, such as its architecture, training set, and training method, are concealed by the provider. However, is such a deployment safe? Once the attributes of the model are revealed, many downstream attacks become easier, e.g., adversarial example generation (Moosavi-Dezfooli et al., 2016), model inversion (He et al., 2019), etc. Oh et al. (2018) conducted model reverse engineering to reveal model attributes, as shown in the left of Figure 1. They first collect a large set of white-box models trained on the same dataset as the target black-box model, e.g., the MNIST hand-written digit dataset (Lecun et al., 1998). Given a sequence of input queries, the outputs of the white-box models can be obtained. After that, a meta-classifier is trained to learn a mapping between model outputs and model attributes. For inference, outputs of the target black-box model are fed into the meta-classifier to predict its attributes. The promising results demonstrate the feasibility of model reverse engineering. However, a crucial limitation of (Oh et al., 2018) is that they assume the dataset used for training the target model is known in advance, and leverage this dataset for meta-classifier learning. In most application cases, the training data of a target black-box model is unknown. When the domain of the training data of the target black-box model is inconsistent with that of the constructed white-box models, the meta-classifier is usually unable to generalize well on the target black-box model.
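The query-based meta-classifier pipeline described above can be sketched as follows. Everything here is a hypothetical stand-in: the "models" are closed-form functions whose outputs depend on a hidden attribute, and the meta-classifier is a trivial nearest-centroid rule, whereas a real attack would query actual trained networks and typically train a neural meta-classifier.

```python
# Toy sketch of meta-classifier-based attribute inference (illustrative only).
import random

random.seed(0)

def fake_model(attribute, query):
    """Stand-in for a trained network: its output distribution depends on a
    hidden binary attribute (e.g., presence of a certain layer type)."""
    base0 = 0.2 + 0.6 * attribute          # attribute shifts the outputs
    noise = random.uniform(-0.05, 0.05)    # per-query stochasticity
    p0 = min(max(base0 + noise + 0.01 * query, 0.0), 1.0)
    return [p0, 1.0 - p0]

QUERIES = range(5)  # a fixed sequence of input queries sent to every model

def signature(attribute):
    """Concatenate a model's outputs over all queries into one feature vector."""
    return [p for q in QUERIES for p in fake_model(attribute, q)]

# "Training": collect output signatures of white-box models with known
# attributes and store per-attribute centroids (nearest-centroid classifier).
centroids = {}
for attr in (0, 1):
    sigs = [signature(attr) for _ in range(20)]
    centroids[attr] = [sum(col) / len(col) for col in zip(*sigs)]

def predict_attribute(black_box_signature):
    """Inference: assign the black-box model to the nearest attribute centroid."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(black_box_signature, c))
    return min(centroids, key=lambda k: dist(centroids[k]))
```

In this toy setting `predict_attribute(signature(1))` recovers the hidden attribute, because the white-box and "black-box" models come from the same distribution; the failure mode discussed next arises precisely when they do not.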
To verify this point, we train three black-box models with the same architecture on three different datasets, Photo, Cartoon and Sketch (Li et al., 2017), respectively. We use the method in (Oh et al., 2018) to train a meta-classifier on white-box models trained on the Cartoon dataset, and evaluate it on the three black-box models. As Figure 2 shows, the meta-classifier fails to generalize well when the target model is trained on a different domain.

An ideal meta-classifier should be well trained on the outputs of white-box models and predict well on the outputs of the target black-box model, even if the white-box and black-box models are trained on data of different domains. In light of this, we cast this problem as an out-of-distribution (OOD) generalization problem, and propose a novel framework DREAM: Domain-free Reverse Engineering the Attributes of a black-box Model.

In the field of computer vision, out-of-distribution generalization has been widely studied in recent years (Shen et al., 2021); its main goal is to learn a model on data of one or multiple domains that generalizes well on data of another domain unseen during training. One mainstream line of OOD learning approaches extracts domain-invariant features from data of multiple domains and utilizes these features for downstream tasks (Li et al., 2018; Kim et al., 2021; Zhou et al., 2021b). These methods mainly focus on image or video data, where they have shown strong performance. Back to our problem, the black-box models deployed on a cloud platform expose their functionality and the categories they can output. Therefore, we can collect data with the same labels but different distributions as domains, train white-box models on them, and obtain their probability outputs. However, since the data we concentrate on is the output of machine learning models, e.g., probability values, how to design an effective OOD learning method over this type of data has not been explored.
To this end, we design a multi-discriminator generative adversarial network (MDGAN) to learn domain-invariant features from the outputs of white-box models trained on multi-domain data. Based on the learnt domain-invariant features, we train a domain-free meta-classifier to inversely infer the attributes of the target black-box model.
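The multi-discriminator adversarial objective can be sketched as below. This is our reading of the high-level idea, not the paper's implementation: the linear encoder, logistic per-domain discriminators, and the exact loss terms are illustrative assumptions. Each discriminator tries to tell its own domain's features apart from the rest, while the encoder is rewarded when every discriminator's score is uninformative, i.e., the features are domain invariant.

```python
# Forward-pass sketch of a multi-discriminator adversarial objective
# (hypothetical MDGAN reading; sizes and losses are illustrative).
import math
import random

random.seed(1)

DOMAINS = 3    # e.g., outputs of white-box models trained on Photo/Cartoon/Sketch
FEAT_DIM = 4

def encoder(output_vec, w):
    """Toy linear 'generator' mapping a model's probability outputs to features."""
    return [sum(wi * x for wi, x in zip(row, output_vec)) for row in w]

def discriminator(feat, d):
    """Toy logistic discriminator for one domain: P(feature is from my domain)."""
    z = sum(di * f for di, f in zip(d, feat))
    return 1.0 / (1.0 + math.exp(-z))

def adversarial_losses(feats_by_domain, discs):
    """Discriminator k separates domain k from the rest (cross-entropy);
    the encoder wants every score near 0.5, i.e., indistinguishable domains."""
    d_loss, g_loss = 0.0, 0.0
    for k, d in enumerate(discs):
        for j, feats in enumerate(feats_by_domain):
            for f in feats:
                p = discriminator(f, d)
                target = 1.0 if j == k else 0.0
                d_loss += -(target * math.log(p + 1e-9)
                            + (1 - target) * math.log(1 - p + 1e-9))
                g_loss += (p - 0.5) ** 2  # encoder penalty for informative scores
    return d_loss, g_loss
```

Training would alternate minimizing `d_loss` over the discriminators and `g_loss` over the encoder; only this loss structure, not the optimization loop, is shown here.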



Figure 2: The performance of (Oh et al., 2018) on three datasets.

In this paper, we investigate the problem of black-box model attribute reverse engineering without requiring that the training data of the target model is available, as shown in the right of Figure 1. Obviously, when feeding the same input queries to models with the same architecture but trained on data of different domains, the output distributions of these models are usually different. Thus, in our problem setting, a key point is how to bridge the gap between the output distributions of white-box and target black-box models, given the lack of the target model's training data.
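The output-distribution gap can be made concrete with a toy numerical illustration. The two "models" below are hypothetical closed-form stand-ins for networks with identical architecture trained on different domains, and the gap is measured by average total variation distance over a shared query set.

```python
# Toy illustration of the output-distribution gap between same-architecture
# models trained on different domains (hypothetical stand-in models).
def model_output(domain_bias, query):
    """Two-class probability output; domain_bias mimics the domain's effect."""
    p0 = min(max(0.5 + domain_bias + 0.02 * query, 0.0), 1.0)
    return [p0, 1.0 - p0]

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

queries = range(4)                                       # shared input queries
photo_like = [model_output(+0.2, q) for q in queries]    # "Photo"-trained model
sketch_like = [model_output(-0.2, q) for q in queries]   # "Sketch"-trained model
gap = sum(total_variation(p, q)
          for p, q in zip(photo_like, sketch_like)) / len(queries)
```

Here `gap` is clearly nonzero even though both stand-ins share the same functional form, mirroring how a meta-classifier fit to one domain's outputs sees shifted inputs at test time.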

Figure 1: Previous work (left) assumes the dataset used to train the target black-box model is given beforehand, and requires the same dataset to train the white-box models. Our DREAM framework (right) removes this condition: the training data of the black-box model is no longer required to be available. Our idea is to cast the problem as an out-of-distribution learning problem and design a GAN (Goodfellow et al., 2014)-based network (MDGAN) to learn domain-invariant features for black-box model attribute inference.

