DATASET INFERENCE: OWNERSHIP RESOLUTION IN MACHINE LEARNING

Abstract

With increasing amounts of data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks into a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key observation that the knowledge contained in the stolen model's training set is what is common to all stolen copies. The adversary's goal, irrespective of the attack employed, is always to extract this knowledge or its by-products. This gives the original model's owner a strong advantage over the adversary: model owners have access to the original training data. We thus introduce dataset inference, the process of identifying whether a suspected model copy has private knowledge from the original model's dataset, as a defense against model stealing. We develop an approach for dataset inference that combines statistical testing with the ability to estimate the distance of multiple data points to the decision boundary. Our experiments on CIFAR10, SVHN, CIFAR100 and ImageNet show that model owners can claim with confidence greater than 99% that their model (or, indeed, their dataset) was stolen, despite exposing only 50 of the stolen model's training points. Dataset inference defends against state-of-the-art attacks even when the adversary is adaptive. Unlike prior work, it does not require retraining or overfitting the defended model.

1. INTRODUCTION

Machine learning models have increasingly many parameters (Brown et al., 2020; Kolesnikov et al., 2019), requiring larger datasets and a significant investment of resources. For example, OpenAI's development of GPT-3 is estimated to have cost over USD 4 million (Li, 2020). Yet, models are often exposed to the public to provide services such as machine translation (Wu et al., 2016) or image recognition (Wu et al., 2019). This gives adversaries an incentive to steal models via the exposed interfaces using model extraction. This threat raises a question of ownership resolution: how can an owner prove that a suspect model stole their intellectual property? Specifically, we aim to determine whether a potentially stolen model was derived from an owner's model or dataset. An adversary may derive and steal intellectual property from a victim in many ways. A prominent one is (1) model extraction (Tramèr et al., 2016), where the adversary exploits access to a model's (1.a) prediction vectors (e.g., through an API) to reproduce a copy of the model at a lower cost than was incurred in developing it. Less directly, (1.b) the adversary could also use the victim model as a labeling oracle to train their own model on an initially unlabeled dataset, obtained either from a public source or collected by the adversary. In a more extreme threat model, (2) the adversary could gain access to the very dataset used to train the victim model, and train their own model by either (2.a) distilling the victim model or (2.b) training from scratch altogether. Finally, adversaries may gain (3) complete access to the victim model, but not the dataset. This may happen when a victim open-sources their work for academic purposes but disallows its commercialization, or simply via insider access. The adversary may then (3.a) fine-tune over the victim model, or (3.b) use the victim for data-free distillation (Fang et al., 2019).
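The labeling-oracle attack (1.b) can be made concrete in a few lines. The sketch below is illustrative only: `victim_predict`, `train_fn`, and the toy one-dimensional task are hypothetical stand-ins, not the attack implementations evaluated in this work.

```python
# Sketch of model extraction via a labeling oracle (threat 1.b).
# The adversary never sees the victim's training set: it queries the
# victim on its own unlabeled inputs and trains a surrogate on the
# returned labels. All names here are illustrative placeholders.

def steal_by_labeling(victim_predict, unlabeled_inputs, train_fn):
    labeled = [(x, victim_predict(x)) for x in unlabeled_inputs]
    return train_fn(labeled)

# Toy instantiation: a 1-D threshold "victim" and a nearest-neighbour
# "surrogate" trained purely from oracle labels.
victim = lambda x: int(x > 0.5)
inputs = [i / 10 for i in range(11)]

def nearest_neighbour_trainer(pairs):
    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]
    return predict

surrogate = steal_by_labeling(victim, inputs, nearest_neighbour_trainer)
```

Note that although the surrogate never touched the victim's (here nonexistent) training data directly, its behavior is entirely a by-product of the victim's knowledge; this is the property dataset inference later exploits.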
Preventing all forms of model stealing is impossible without decreasing model accuracy for legitimate users: model extraction adversaries can obfuscate malicious queries as legitimate ones from the expected distribution. Most prior efforts thus focus on watermarking models before deployment. Rather than preventing model stealing, they aim to detect theft by allowing the victim to claim ownership, verifying that a suspect model responds with the expected outputs on watermarked inputs. This strategy not only requires re-training and decreases model accuracy, but can also be vulnerable to adaptive attacks that lessen the impact of watermarks on the decision surface during extraction. Thus, even recent watermarking work that prevails (Yang et al., 2019) despite distillation (Hinton et al., 2015) or extraction (Jia et al., 2020) suffers a trade-off in model performance. In our work, we make the key observation that all stolen models necessarily contain direct or indirect information from the victim model's training set, regardless of how the adversary gained access to the stolen model. This leads us to propose a fundamentally different defense strategy: we identify stolen models because they possess knowledge contained in the private training set of the victim. Indeed, a successful model extraction attack distills the victim's knowledge of its training data into the stolen copy. Hence, we propose to identify stolen copies by showing that they were trained (at least partially and indirectly) on the same dataset as the victim. We call this process dataset inference (DI). In particular, we find that stolen models are more confident about points in the victim model's training set than about a random point drawn from the task distribution. The more an adversary interacts with the victim model to steal it, the easier it becomes to claim ownership by distinguishing the stolen model's behavior on the victim model's training set.
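This train-set confidence gap can be illustrated with a toy "margin" score, the gap between the top class score and the runner-up. The function is a minimal sketch and the numeric scores are invented for illustration; they are not the certainty metric or measurements used in this work.

```python
def prediction_margin(class_scores):
    """Gap between the top class score and the runner-up: a simple
    proxy for how far a point sits from the decision boundary."""
    top_two = sorted(class_scores)[-2:]
    return top_two[1] - top_two[0]

# Hypothetical scores a stolen model might assign to a victim
# training point (confidently classified) versus a fresh point from
# the task distribution (closer to the boundary):
train_point_scores = [0.25, 0.5, 4.0]
fresh_point_scores = [1.0, 1.5, 2.0]
gap = prediction_margin(train_point_scores) - prediction_margin(fresh_point_scores)
```

A positive `gap` on average over many points is precisely the behavioral signature that dataset inference aggregates into an ownership claim.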
We distinguish a model's behavior on its training data from its behavior on other subsets of data by measuring the 'prediction certainty' of each data point: the margin of that data point to neighbouring classes. At its core, DI builds on the premise of input memorization, albeit a weak one. One might think that DI succeeds only for models trained on small datasets, where overfitting is likely. Surprisingly, in practice, we find that even models trained on ImageNet memorize training data in some form. Among the related work discussed in § 2, distinguishing a classifier's behavior on examples from its train and test sets is closest to membership inference (Shokri et al., 2017). Membership inference (MI) is an attack that predicts whether individual examples were used to train a model. Dataset inference flips this situation and exploits information asymmetry: the potential victim of model theft is now the one testing for membership, and naturally has access to the training data. Whereas MI typically requires a large train-test gap, because such a gap allows a greater distinction between individual points in and out of the training set (Yeom et al., 2018; Choo et al., 2020), dataset inference succeeds even when the defender has only slightly better than random chance of guessing membership correctly, because the victim aggregates the result of DI over multiple points from the training set. In summary, our contributions are:
• We introduce dataset inference as a general framework for ownership resolution in machine learning. Our key observation is that knowledge of the training set leads to an information asymmetry that advantages legitimate model owners when resolving ownership.
• We theoretically show on a linear model that the success of MI decreases with the size of the training set (as overfitting decreases), whereas DI is independent of it. Even when MI fails on a binary classification task, DI still succeeds with high probability.
• We propose two different methods to characterize training vs. test behavior: targeted adversarial attacks in the white-box setting, and a novel 'Blind Walk' method for the black-box, label-only setting. We then create a concise embedding of each data point that is fed to a confidence regressor to distinguish between points inside and outside a model's training set. Hypothesis testing then returns the final ownership claim.
• Unlike prior efforts, our method not only helps defend ML services against model extraction attacks, but also covers extreme scenarios such as complete theft of the victim's model or training data. In § 7, we also introduce and evaluate our approach against adaptive attacks.
• We evaluate our method on the CIFAR10, SVHN, CIFAR100 and ImageNet datasets and obtain greater than 99% confidence in detecting model or data theft under all threat models studied in this work, while exposing as few as 50 random samples from our private dataset.
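The final aggregation step above can be sketched as a one-sided two-sample test over per-point scores. Everything below is an illustrative assumption: the Gaussian score distributions, the sample size of 50, and the threshold stand in for the outputs of the confidence regressor and the hypothesis test described in the contributions, not for their actual implementation.

```python
import math
import random
from statistics import mean, stdev

random.seed(0)

# Hypothetical per-point scores from a confidence regressor applied
# to a suspect model: higher means the point looks more "train-like".
private = [random.gauss(0.7, 0.15) for _ in range(50)]  # victim's train points
public = [random.gauss(0.4, 0.15) for _ in range(50)]   # points outside it

# Welch's t statistic for the one-sided alternative that the
# victim's private points receive higher scores than public ones.
se = math.sqrt(stdev(private) ** 2 / 50 + stdev(public) ** 2 / 50)
t_stat = (mean(private) - mean(public)) / se

# With roughly 98 degrees of freedom, t > 2.37 corresponds to
# p < 0.01, i.e. an ownership claim at >99% confidence.
claim_supported = t_stat > 2.37
```

This aggregation is why DI succeeds where per-example membership inference fails: a per-point signal only slightly better than chance becomes decisive once pooled over enough training points.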

