ON THE ROBUSTNESS OF DATASET INFERENCE

Anonymous

Abstract

Machine learning (ML) models are costly to train, as they can require significant amounts of data, computational resources, and technical expertise. Thus, they constitute valuable intellectual property that needs protection from adversaries seeking to steal them. Ownership verification techniques allow the victims of model stealing attacks to demonstrate that a suspect model was in fact stolen from theirs. Although a number of ownership verification techniques based on watermarking or fingerprinting have been proposed, most of them fall short either in terms of security guarantees (well-equipped adversaries can evade verification) or computational cost. A fingerprinting technique introduced at ICLR '21, Dataset Inference (DI), has been shown to offer better robustness and efficiency than prior methods. The authors of DI provided a correctness proof for linear (suspect) models. However, in a subspace of the same setting, we prove that DI suffers from high false positives (FPs): it can incorrectly identify as stolen an independent model trained with non-overlapping data from the same distribution. We further prove that DI also triggers FPs in realistic, non-linear suspect models. We then confirm empirically that DI in the black-box setting leads to FPs with high confidence. Second, we show that DI also suffers from false negatives (FNs): an adversary can fool DI by regularising a stolen model's decision boundaries using adversarial training. To this end, we demonstrate that black-box DI fails to identify a model adversarially trained on a stolen dataset, the setting where DI is hardest to evade. Finally, we discuss the implications of our findings and the viability of fingerprinting-based ownership verification in general, and suggest directions for future work.

1. INTRODUCTION

Machine learning (ML) models are being developed and deployed at an ever faster rate and across many application domains. For many companies, they are not just a part of a technology stack that offers an edge over competitors, but a core business offering. Hence, ML models constitute valuable intellectual property that needs to be protected. Model stealing is considered one of the most serious attack vectors against ML models (Kumar et al., 2019). The goal of a model stealing attack is to obtain a functionally equivalent copy of a victim model that can be used, for example, to offer a competing service or to avoid paying for use of the original. In a white-box attack, the adversary obtains an exact copy of the victim model, for example by reverse engineering an application containing an embedded model (Deng et al., 2022). In contrast, in black-box attacks, known as model extraction attacks (Papernot et al., 2017; Orekondy et al., 2019; Tramèr et al., 2016), the adversary gleans information about the victim model via its prediction interface.

Two possible approaches to defend against model extraction are 1) detection (Juuti et al., 2019; Atli et al., 2020; Zheng et al., 2022) and 2) prevention (Orekondy et al., 2020; Mazeika et al., 2022; Dziedzic et al., 2022). However, a powerful yet realistic attacker can circumvent these defenses (Atli et al., 2020). An alternative defense, applicable to both white-box and black-box model theft, is based on deterrence: it concedes that the model will eventually get stolen. Therefore, an ownership verification technique that can identify and demonstrate that a suspect model has been stolen can serve as a deterrent.
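The black-box (model extraction) setting above can be sketched in a few lines. The example below is a minimal illustration, not any specific attack from the cited works: the victim is assumed to be a fixed linear classifier exposed only through a label-returning query interface, and the adversary trains a perceptron surrogate purely on query/label transcripts. The agreement rate on fresh inputs is a crude proxy for the "functionally equivalent copy" the attacker is after.

```python
import random

random.seed(0)

# Hypothetical victim: a fixed 2-D linear classifier known only to its owner.
W_VICTIM = (1.5, -2.0)

def victim_predict(x):
    """Black-box prediction interface: the adversary sees only these labels."""
    return 1 if W_VICTIM[0] * x[0] + W_VICTIM[1] * x[1] > 0 else 0

# 1) The adversary samples its own queries (no access to victim training data).
queries = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2000)]
labels = [victim_predict(x) for x in queries]

# 2) Train a surrogate on the query/label pairs with the perceptron rule.
w = [0.0, 0.0]
for _ in range(20):  # a few passes over the query transcript
    for x, y in zip(queries, labels):
        pred = 1 if w[0] * x[0] + w[1] * x[1] > 0 else 0
        if pred != y:  # mistake-driven update toward the victim's boundary
            sign = 1 if y == 1 else -1
            w[0] += sign * x[0]
            w[1] += sign * x[1]

def surrogate_predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] > 0 else 0

# 3) Agreement on fresh inputs approximates functional equivalence.
fresh = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2000)]
agreement = sum(surrogate_predict(x) == victim_predict(x) for x in fresh) / 2000
print(f"agreement: {agreement:.3f}")
```

Even this toy adversary recovers a near-identical decision boundary from labels alone, which is why detection and prevention defenses focus on the query stream, and why deterrence-based ownership verification treats the theft itself as unavoidable.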

