PROTECTING DNNS FROM THEFT USING AN ENSEMBLE OF DIVERSE MODELS

Abstract

Several recent works have demonstrated highly effective model stealing (MS) attacks on Deep Neural Networks (DNNs) in black-box settings, even when the training data is unavailable. These attacks typically use some form of Out of Distribution (OOD) data to query the target model and use the predictions obtained to train a clone model. Such a clone model learns to approximate the decision boundary of the target model, achieving high accuracy on in-distribution examples. We propose Ensemble of Diverse Models (EDM) to defend against such MS attacks. EDM is made up of models that are trained to produce dissimilar predictions for OOD inputs. By using a different member of the ensemble to service different queries, our defense produces predictions that are highly discontinuous in the input space for the adversary's OOD queries. Such discontinuities cause the clone model trained on these predictions to have poor generalization on in-distribution examples. Our evaluations on several image classification tasks demonstrate that the EDM defense can severely degrade the accuracy of clone models (by up to 39.7%). Our defense has minimal impact on the target accuracy, negligible computational costs during inference, and is compatible with existing defenses for MS attacks.

1. INTRODUCTION

MS attacks allow an adversary with black-box access to the predictions of the target model to copy its functionality and create a high-accuracy clone model, posing a threat to the confidentiality of proprietary DNNs. Such attacks also open the door to a wide range of security vulnerabilities, including adversarial attacks (Goodfellow et al., 2014) that cause misclassification, membership-inference attacks (Shokri et al., 2017) that leak membership, and model-inversion attacks (Fredrikson et al., 2015) that reveal the data used to train the model. MS is carried out using the principle of Knowledge Distillation (KD), wherein the adversary uses a dataset D to query the target model. The predictions of the target on D are then used to train a clone model that replicates the target model's functionality. Since access to the training data is limited in most real-world settings, attacks typically use some form of OOD data to perform KD. Clone models trained in this way closely approximate the decision boundaries of the target model, achieving high accuracy on in-distribution examples.

The goal of this paper is to defend against MS attacks by creating a target model that is inherently hard to steal using Knowledge Distillation with OOD data. Our key observation is that existing MS attacks (Orekondy et al., 2019a; Papernot et al., 2017; Juuti et al., 2019) implicitly assume that the target model produces continuous predictions in the input space. We hypothesize that making the predictions of the target model discontinuous makes MS attacks harder to carry out. To this end, we propose Ensemble of Diverse Models (EDM) to defend against MS attacks. The models in EDM are trained using a novel diversity loss to produce dissimilar predictions on OOD data. Each input query to EDM is serviced by a single model that is selected from the ensemble using an input-based hashing function.
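The intuition behind such a diversity objective can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact loss: the `diversity_loss` name and the pairwise cosine-similarity formulation are our assumptions. The loss penalizes agreement between the softmax outputs of ensemble members on a batch of OOD inputs, so minimizing it (e.g., as a regularizer alongside the usual classification loss on in-distribution data) pushes members toward dissimilar OOD predictions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def diversity_loss(member_logits):
    """Mean pairwise cosine similarity between the softmax outputs of
    ensemble members on a batch of OOD inputs.

    member_logits: array of shape (n_members, batch, n_classes).
    Returns a scalar in [0, 1]: 1.0 when all members agree exactly,
    near 0 when their predictions are (close to) orthogonal.
    """
    probs = softmax(member_logits)
    # L2-normalize each prediction vector so the dot product is a cosine.
    normed = probs / np.linalg.norm(probs, axis=-1, keepdims=True)
    n_members = normed.shape[0]
    sims = []
    for i in range(n_members):
        for j in range(i + 1, n_members):
            sims.append((normed[i] * normed[j]).sum(axis=-1).mean())
    return float(np.mean(sims))
```

Two members with identical logits yield a loss of 1.0 (maximal agreement), while members that confidently predict different classes on the same OOD inputs drive the loss toward 0.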
We develop a DNN-based perceptual hashing algorithm to perform this model selection; the hash is invariant to simple transformations of the input, which prevents adaptive attacks. Since different models in the ensemble service different queries and the models are trained to produce dissimilar predictions, the adversary obtains predictions that are highly discontinuous in the input space. The clone model, when trained on these predictions, tries to approximate the complex, discontinuous decision boundary of the EDM. Our empirical evaluations show that the resulting clone model generalizes poorly on in-distribution data, degrading clone accuracy and reducing the efficacy of MS attacks. In contrast to existing defenses that rely on perturbing output predictions, our defense does not require modifying the output; instead, it uses a diverse ensemble to produce discontinuous predictions that are inherently hard to steal.

We illustrate the working of our defense by comparing the results of MS attacks on two target models, (i) an undefended baseline and (ii) EDM, trained on the toy binary-classification problem shown in Fig. 1. The training data and the predictions of the target model under attack are shown in Fig. 1a. We use a set of OOD points, shown in Fig. 1b, to query the target model and obtain its predictions. The predictions of the target model on the OOD data are then used to train a clone model that replicates the functionality of the target, as shown in Fig. 1c. For the undefended baseline target, which uses a simple linear model to produce continuous predictions, the clone model obtained by the attack closely approximates the decision boundary of the target model, achieving good classification accuracy on in-distribution data. The EDM target consists of two diverse models (Model-A and Model-B) that produce dissimilar predictions on the OOD data. EDM uses an input-based hash function to select the model that services each input query.
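The hash-based routing can be sketched as follows. This is a deliberately simplified stand-in: the paper's selector is a DNN-based perceptual hash, whereas this sketch quantizes the pixels and takes a cryptographic digest, and the `select_member` and `edm_predict` names are illustrative assumptions. What it demonstrates is the key property: the mapping from input to ensemble member is deterministic, so repeated identical queries get consistent answers, while different queries are scattered across the diverse members.

```python
import hashlib
import numpy as np

def select_member(image, n_members):
    """Deterministically map an input image to an ensemble member index.

    Coarse quantization gives mild robustness to tiny pixel perturbations;
    the paper's DNN-based perceptual hash is instead invariant to much
    larger input transformations.
    image: float array with values in [0, 1].
    """
    # Quantize to a few levels so small perturbations hash identically.
    quantized = np.floor(np.clip(image, 0.0, 1.0) * 4).astype(np.uint8)
    digest = hashlib.sha256(quantized.tobytes()).digest()
    return int.from_bytes(digest[:4], "big") % n_members

def edm_predict(image, members):
    """Service the query with the single member chosen by the hash."""
    return members[select_member(image, len(members))](image)
```

Because adjacent inputs can land in different buckets served by models trained to disagree on OOD data, the adversary's query set sees predictions that are discontinuous in the input space.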
As a result, the target model produces highly discontinuous predictions on OOD data (Fig. 1a,b for EDM). The clone model trained on these predictions fails to capture any meaningful decision boundary (Fig. 1c for EDM) and achieves poor accuracy on the classification task, making MS attacks harder to carry out. In summary, our paper makes the following key contributions:

1. We propose a novel Diversity Loss function to train an Ensemble of Diverse Models (EDM) that produce dissimilar predictions on OOD data.

2. We propose using EDM to defend against model stealing attacks. Our defense creates discontinuous predictions for the adversary's OOD queries, making MS attacks harder to carry out without degrading the model's accuracy on benign queries.

3. We develop a DNN-based perceptual hash function, which produces the same hash value even under large changes to the input, making adaptive attacks harder to carry out.
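To make the invariance property of perceptual hashing concrete, the following is a toy classical sketch in the spirit of an average hash, not the paper's DNN-based construction: `perceptual_hash` is an illustrative name, and the pool-and-threshold scheme is our assumption. The point is that inputs which differ by small perturbations collapse to the same hash string, so an adversary cannot easily steer which ensemble member services a query.

```python
import numpy as np

def perceptual_hash(image, grid=4):
    """Toy perceptual hash: average-pool the image to a grid x grid
    array of blocks, then threshold each block at the global mean.

    Nearby images (small noise, shifts that preserve each block's
    relation to the mean) map to the same bit-string; the DNN-based
    hash in the paper targets invariance to far larger transformations.
    image: 2-D float array whose side lengths are divisible by `grid`.
    """
    h, w = image.shape
    blocks = image.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    bits = (blocks > blocks.mean()).astype(np.uint8).ravel()
    return "".join(map(str, bits))
```

The hash value (here a 16-bit string) can then be reduced modulo the ensemble size to pick the servicing model, so perceptually similar queries are consistently routed to the same member.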



Figure 1: Toy experiment comparing the efficacy of MS attacks on (i) an undefended baseline and (ii) an EDM target. (a) Predictions and training data of the target model. (b) OOD surrogate data used by the adversary to query the target. (c) Clone model trained using the predictions of the target on the surrogate data. The discontinuous predictions produced by EDM make MS attacks less effective than against the undefended baseline. (Best viewed in color.)

