DATASET INFERENCE: OWNERSHIP RESOLUTION IN MACHINE LEARNING

Abstract

With ever more data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks into a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key observation that the knowledge contained in the stolen model's training set is what is common to all stolen copies. The adversary's goal, irrespective of the attack employed, is always to extract this knowledge or its by-products. This gives the original model's owner a strong advantage over the adversary: model owners have access to the original training data. We thus introduce dataset inference, the process of identifying whether a suspected model copy has private knowledge from the original model's dataset, as a defense against model stealing. We develop an approach for dataset inference that combines statistical testing with the ability to estimate the distance of multiple data points to the decision boundary. Our experiments on CIFAR10, SVHN, CIFAR100, and ImageNet show that model owners can claim with confidence greater than 99% that their model (or dataset, for that matter) was stolen, while exposing only 50 of the stolen model's training points. Dataset inference defends against state-of-the-art attacks even when the adversary is adaptive. Unlike prior work, it does not require retraining or overfitting the defended model.
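
The sketch below illustrates the core mechanism summarized above, under the assumption of a PyTorch classifier: distances to the suspect model's decision boundary are approximated for the victim's private training points and for public points, and a one-sided hypothesis test checks whether the private points sit significantly farther from the boundary, which would indicate that the suspect model embeds knowledge of the private dataset. The random-walk distance heuristic and all names (blind_walk_distance, dataset_inference, private_loader, public_loader) are illustrative stand-ins, not the paper's released implementation.

import torch
from scipy import stats  # requires scipy >= 1.6 for the 'alternative' argument

@torch.no_grad()
def blind_walk_distance(model, x, y, step=0.02, max_steps=50):
    """Approximate each point's distance to the decision boundary by walking
    in a random direction until the predicted label changes (CPU tensors
    assumed for simplicity)."""
    direction = torch.sign(torch.randn_like(x))
    dist = torch.full((x.shape[0],), float(max_steps * step))
    for k in range(1, max_steps + 1):
        pred = model(x + k * step * direction).argmax(dim=1)
        flipped = (pred != y) & (dist == max_steps * step)
        dist[flipped] = k * step
    return dist

def margins(model, loader, n_points=50):
    """Collect boundary-distance estimates for up to n_points samples."""
    out = []
    for x, y in loader:
        out.append(blind_walk_distance(model, x, y))
        if sum(len(d) for d in out) >= n_points:
            break
    return torch.cat(out)[:n_points]

def dataset_inference(suspect_model, private_loader, public_loader, alpha=0.01):
    """Flag the suspect model if it is significantly more confident (larger
    margins) on the victim's private training data than on public data."""
    d_private = margins(suspect_model, private_loader).numpy()
    d_public = margins(suspect_model, public_loader).numpy()
    # One-sided Welch's t-test: H1 is "private margins exceed public margins".
    _, p_value = stats.ttest_ind(d_private, d_public,
                                 equal_var=False, alternative='greater')
    return p_value < alpha, p_value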

1. INTRODUCTION

Machine learning models have increasingly many parameters (Brown et al., 2020; Kolesnikov et al., 2019), requiring larger datasets and significant investments of resources. For example, OpenAI's development of GPT-3 is estimated to have cost over USD 4 million (Li, 2020). Yet, models are often exposed to the public to provide services such as machine translation (Wu et al., 2016) or image recognition (Wu et al., 2019). This gives adversaries an incentive to steal models through the exposed interfaces using model extraction. This threat raises a question of ownership resolution: how can an owner prove that another party's model was stolen from their intellectual property? Specifically, we aim to determine whether a potentially stolen model was derived from an owner's model or dataset.

An adversary may derive and steal intellectual property from a victim in many ways. A prominent one is (1) model extraction (Tramèr et al., 2016), where the adversary exploits access to a model's (1.a) prediction vectors (e.g., through an API) to reproduce a copy of the model at a lower cost than what was incurred in developing it. Perhaps less directly, (1.b) the adversary could also use the victim model as a labeling oracle to train their own model on an initially unlabeled dataset obtained either from a public source or collected by the adversary. In a more extreme threat model, (2) the adversary could gain access to the very dataset used to train the victim model and train their own model by either (2.a) distilling the victim model or (2.b) training from scratch altogether. Finally, adversaries may gain (3) complete access to the victim model, but not the dataset. This may happen when a victim wishes to open-source their work for academic purposes but disallows its

