ON THE ROBUSTNESS OF DATASET INFERENCE

Anonymous

Abstract

Machine learning (ML) models are costly to train as they can require a significant amount of data, computational resources and technical expertise. Thus, they constitute valuable intellectual property that needs protection from adversaries wanting to steal them. Ownership verification techniques allow the victims of model stealing attacks to demonstrate that a suspect model was in fact stolen from theirs. Although a number of ownership verification techniques based on watermarking or fingerprinting have been proposed, most of them fall short either in terms of security guarantees (well-equipped adversaries can evade verification) or computational cost. A fingerprinting technique introduced at ICLR '21, Dataset Inference (DI), has been shown to offer better robustness and efficiency than prior methods. The authors of DI provided a correctness proof for linear (suspect) models. However, in a subspace of the same setting, we prove that DI suffers from high false positives (FPs): it can incorrectly identify an independent model trained with non-overlapping data from the same distribution as stolen. We further prove that DI also triggers FPs in realistic, non-linear suspect models. We then confirm empirically that DI in the black-box setting leads to FPs with high confidence. Second, we show that DI also suffers from false negatives (FNs): an adversary can fool DI by regularising a stolen model's decision boundaries using adversarial training, thereby causing an FN. To this end, we demonstrate that black-box DI fails to identify a model adversarially trained on a stolen dataset, the setting where DI is the hardest to evade. Finally, we discuss the implications of our findings and the viability of fingerprinting-based ownership verification in general, and suggest directions for future work.

1. INTRODUCTION

Machine learning (ML) models are being developed and deployed at an increasingly fast rate and in several application domains. For many companies, they are not just a part of the technological stack that offers an edge over competitors but a core business offering. Hence, ML models constitute valuable intellectual property that needs to be protected. Model stealing is considered one of the most serious attack vectors against ML models (Kumar et al., 2019). The goal of a model stealing attack is to obtain a functionally equivalent copy of a victim model that can be used, for example, to offer a competing service or to avoid paying for the use of the model. In a white-box attack, the adversary obtains an exact copy of the victim model, for example by reverse engineering an application containing an embedded model (Deng et al., 2022). In contrast, in black-box attacks (known as model extraction attacks) (Papernot et al., 2017; Orekondy et al., 2019; Tramèr et al., 2016), the adversary gleans information about the victim model via its predictive interface. Two possible approaches to defend against model extraction are 1) detection (Juuti et al., 2019; Atli et al., 2020; Zheng et al., 2022) and 2) prevention (Orekondy et al., 2020; Mazeika et al., 2022; Dziedzic et al., 2022). However, a powerful yet realistic attacker can circumvent these defenses (Atli et al., 2020).

An alternative defense, applicable to both white-box and black-box model theft, is based on deterrence. It concedes that the model will eventually get stolen. Therefore, an ownership verification technique that can identify and demonstrate that a suspect model has been stolen can serve as a deterrent against model theft. Early research in this field focused on watermarking based on embedding triggers or backdoors (Zhang et al., 2018; Uchida et al., 2017; Adi et al., 2018) into the weights of the model.
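To make the black-box threat model concrete, the following toy sketch shows a label-only model extraction attack: the adversary never sees the victim's parameters or training data, only the labels returned by its prediction interface. The models, data, and variable names here are illustrative assumptions, not the setup of any of the cited attacks.

```python
# Minimal sketch of black-box model extraction: the adversary queries the
# victim's prediction API with its own inputs, collects the returned labels,
# and trains a surrogate on them. Toy scikit-learn setup; all names are
# illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# One synthetic task; disjoint splits for the victim, adversary, and evaluation.
X, y = make_classification(n_samples=1200, n_features=10, random_state=0)
X_priv, y_priv = X[:500], y[:500]     # victim's private training data
X_query = X[500:1000]                  # adversary's unlabeled queries
X_test = X[1000:]                      # held out, to measure agreement

# Victim trains on its private data.
victim = LogisticRegression().fit(X_priv, y_priv)

# Adversary only calls the prediction interface (black-box access).
stolen_labels = victim.predict(X_query)
surrogate = LogisticRegression().fit(X_query, stolen_labels)

# Functional similarity of the surrogate to the victim on fresh inputs.
agreement = np.mean(surrogate.predict(X_test) == victim.predict(X_test))
print(f"surrogate/victim agreement: {agreement:.2f}")
```

Even this naive label-only strategy typically yields a surrogate that agrees with the victim on most inputs, which is why the detection and prevention defenses cited above focus on monitoring or perturbing the query interface.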
Unfortunately, all watermarking schemes were shown to be brittle (Lukas et al., 2022) in that an attacker can successfully remove the watermark from a protected stolen model without incurring a substantial loss in model utility. An alternative approach to ownership verification is fingerprinting. Instead of embedding a trigger or backdoor in the model, one can extract a fingerprint that matches only the victim model and models derived from it. Fingerprinting works against both white-box and black-box attacks, and does not affect the performance of the model. Although several fingerprinting schemes have been proposed, some are not rigorously tested against model extraction (Cao et al., 2021; Pan et al., 2022) and others can be computationally expensive to derive (Lukas et al., 2021). Against this backdrop, Dataset Inference (DI), which appeared at ICLR 2021 (Maini et al., 2021), promises to be an effective fingerprinting mechanism. Intuitively, it leverages the fact that if model owners trained their models on private data, knowledge about that data can be used to identify all stolen models. DI was shown to be effective against white-box and black-box attacks and is efficient to compute (Maini et al., 2021). It was also shown not to conflict with any other defenses (Szyller & Asokan, 2022). Given its promise, the guarantees provided by DI merit closer examination.

In this work, we first show that DI suffers from false positives (FPs): it can incorrectly identify an independent model trained with non-overlapping data from the same distribution as stolen. The authors of DI provided a correctness proof for a linear model. However, DI in fact suffers from high FPs unless two assumptions hold: (1) a large noise dimension, as explained in the original paper, and (2) a large proportion of the victim's training data is used during ownership verification, as we prove in this paper.
Both of these assumptions are unrealistic in a subspace of the linear case used by DI: (i) we prove that a large noise dimension can lead to low accuracy in the resulting model, and (ii) revealing too much of the victim's (private) training data is detrimental to privacy. Furthermore, we prove that DI also triggers FPs in realistic, non-linear models. We then confirm empirically that DI leads to FPs, with high confidence, in the black-box verification setting ("black-box DI"), where the DI verifier has access only to the inference interface of a suspect model but not its internals. We also show that black-box DI suffers from false negatives (FNs): an adversary who has in fact stolen a victim model can avoid detection by regularising their model with adversarial training. We provide empirical evidence that an adversary who steals the victim's dataset itself and adversarially trains a model can evade detection by DI.

We claim the following contributions:

• Following the same simplified theoretical analysis used by the original paper (Maini et al., 2021), in a subspace of the linear case used by DI, we show that for a linear suspect model, a) high-dimensional noise (as required in (Maini et al., 2021)) can lead to low accuracy in the resulting model, and b) verification requires revealing a large proportion of the victim's private training data, which is detrimental to privacy.
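As a concrete illustration of the evasion strategy discussed above, the following sketch adversarially trains a simple logistic-regression model with FGSM-style perturbations, which regularises its decision boundary by forcing correct predictions on worst-case perturbed inputs. The data, step sizes, and perturbation budget are illustrative choices, not the setup used in our experiments.

```python
# Toy FGSM adversarial training of a logistic-regression classifier in numpy.
# At each step the inputs are perturbed one signed-gradient step in the
# direction that increases the loss, and the weights are updated on the
# perturbed batch. All hyperparameters are illustrative.
import numpy as np

rng = np.random.RandomState(0)
n, d, eps, lr = 400, 5, 0.1, 0.5        # samples, features, budget, step size
X = rng.randn(n, d)
w_true = rng.randn(d)
y = (X @ w_true > 0).astype(float)      # linearly separable labels

w = np.zeros(d)
for _ in range(200):
    # FGSM step: d(loss)/dx for logistic loss is (p - y) * w per sample.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad_x = np.outer(p - y, w)
    X_adv = X + eps * np.sign(grad_x)
    # Gradient descent on the adversarial batch.
    p_adv = 1.0 / (1.0 + np.exp(-X_adv @ w))
    grad_w = X_adv.T @ (p_adv - y) / n
    w -= lr * grad_w

acc = np.mean((X @ w > 0) == (y > 0.5))  # clean accuracy of the robust model
```

The robust model still classifies the clean data well, but its decision boundary is pushed away from the training points, which is exactly the property that perturbs the margin signal a fingerprinting verifier relies on.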

2. DATASET INFERENCE PRELIMINARIES

Dataset Inference (DI) aims to determine whether a suspect model f_SP was obtained by an adversary A who has stolen a model (f_A) derived from a victim V's private data S_V, or belongs to an independent party I (f_I). DI relies on the intuition that if a model is derived from S_V, this information can be identified in any such model. DI measures the prediction margins of a suspect model around private and public samples: the distance from the samples to the model's decision boundaries. If

