ON THE ROBUSTNESS OF DATASET INFERENCE

Anonymous

Abstract

Machine learning (ML) models are costly to train as they can require a significant amount of data, computational resources and technical expertise. Thus, they constitute valuable intellectual property that needs protection from adversaries wanting to steal them. Ownership verification techniques allow the victims of model stealing attacks to demonstrate that a suspect model was in fact stolen from theirs. Although a number of ownership verification techniques based on watermarking or fingerprinting have been proposed, most of them fall short either in terms of security guarantees (well-equipped adversaries can evade verification) or computational cost. A fingerprinting technique introduced at ICLR '21, Dataset Inference (DI), has been shown to offer better robustness and efficiency than prior methods. The authors of DI provided a correctness proof for linear (suspect) models. However, in a subspace of the same setting, we first prove that DI suffers from high false positives (FPs): it can incorrectly identify an independent model trained with non-overlapping data from the same distribution as stolen. We further prove that DI also triggers FPs in realistic, non-linear suspect models. We then confirm empirically that DI in the black-box setting leads to FPs, with high confidence. Second, we show that DI also suffers from false negatives (FNs): an adversary can fool DI by regularising a stolen model's decision boundaries using adversarial training, thereby leading to an FN. To this end, we demonstrate that black-box DI fails to identify a model adversarially trained from a stolen dataset, the setting where DI is the hardest to evade. Finally, we discuss the implications of our findings and the viability of fingerprinting-based ownership verification in general, and suggest directions for future work.

1. INTRODUCTION

Machine learning (ML) models are being developed and deployed at an increasingly fast rate and in several application domains.
For many companies, they are not just a part of the technological stack that offers an edge over competitors but a core business offering. Hence, ML models constitute valuable intellectual property that needs to be protected. Model stealing is considered one of the most serious attack vectors against ML models (Kumar et al., 2019). The goal of a model stealing attack is to obtain a functionally equivalent copy of a victim model that can be used, for example, to offer a competing service, or to avoid having to pay for the use of the model. In a white-box attack, the adversary obtains an exact copy of the victim model, for example by reverse engineering an application containing an embedded model (Deng et al., 2022). In contrast, in black-box attacks (known as model extraction attacks) (Papernot et al., 2017; Orekondy et al., 2019; Tramèr et al., 2016) the adversary gleans information about the victim model via its predictive interface. Two possible approaches to defend against model extraction are 1) detection (Juuti et al.,



deterrent against model theft. Early research in this field focused on watermarking based on embedding triggers or backdoors (Zhang et al., 2018; Uchida et al., 2017; Adi et al., 2018) into the weights of the model. Unfortunately, all watermarking schemes were shown to be brittle (Lukas et al., 2022) in that an attacker can successfully remove the watermark from a protected stolen model without incurring a substantial loss in model utility. An alternative approach to ownership verification is fingerprinting. Instead of embedding a trigger or backdoor in the model, one can extract a fingerprint that matches only the victim model and models derived from it. Fingerprinting works against both white-box and black-box attacks, and does not affect the performance of the model. Although several fingerprinting schemes have been proposed, some are not rigorously tested against model extraction (Cao et al., 2021; Pan et al., 2022) and others can be computationally expensive to derive (Lukas et al., 2021). Against this backdrop, Dataset Inference (DI), which appeared at ICLR 2021 (Maini et al., 2021), promises to be an effective fingerprinting mechanism. Intuitively, it leverages the fact that if model owners trained their models on private data, knowledge about that data can be used to identify all stolen models. DI was shown to be effective against white-box and black-box attacks and is efficient to compute (Maini et al., 2021). It was also shown not to conflict with any other defenses (Szyller & Asokan, 2022). Given its promise, the guarantees provided by DI merit closer examination. In this work, we first show that DI suffers from false positives (FPs): it can incorrectly identify an independent model trained with non-overlapping data from the same distribution as stolen. The authors of DI provided a correctness proof for a linear model.
However, DI in fact suffers from high FPs unless two assumptions hold: (1) the noise dimension is large, as explained in the original paper, and (2) a large proportion of the victim's training data is used during ownership verification, as we prove in this paper. Both of these assumptions are unrealistic in a subspace of the linear case used by DI: (i) we prove that a large noise dimension can lead to low accuracy in the resulting model, and (ii) revealing too much of the victim's (private) training data is detrimental to privacy. Furthermore, we prove that DI also triggers FPs in realistic, non-linear models. We then confirm empirically that DI leads to FPs, with high confidence, in the black-box verification setting ("black-box DI"), where the DI verifier has access only to the inference interface of a suspect model, but not its internals. We also show that black-box DI suffers from false negatives (FNs): an adversary who has in fact stolen a victim model can avoid detection by regularising their model with adversarial training. We provide empirical evidence that an adversary who steals the victim's dataset itself and adversarially trains a model can evade detection by DI.
We claim the following contributions:

• Following the same simplified theoretical analysis used by the original paper (Maini et al., 2021), in a subspace of the linear case used by DI, we show that for a linear suspect model, a) high-dimensional noise (as required in (Maini et al., 2021)) leads to low model accuracy (Lemma 1, Section 3.1), and b) DI suffers from FPs unless a large proportion of private data is revealed during ownership verification (Theorem 1, Section 3.1);
• Extending the analysis to non-linear suspect models, using a PAC-Bayesian framework (Neyshabur et al., 2018), we show that DI suffers from FPs in non-linear models regardless of how much private data is revealed (Theorem 2, Section 3.2.1);
• We empirically demonstrate the existence of FPs in a realistic black-box DI setting (Section 3.2.2);
• We show empirically that black-box DI also suffers from FNs: using adversarial training to regularise the decision boundaries of a stolen model can successfully evade detection by DI while incurring only a modest loss in accuracy (≈ 6pp) (Section 4).

2. DATASET INFERENCE PRELIMINARIES

Dataset Inference (DI) aims to determine whether a suspect model $f_{SP}$ was obtained by an adversary A who has stolen a model ($f_A$) derived from a victim V's private data $S_V$, or belongs to an independent party I ($f_I$). DI relies on the intuition that if a model is derived from $S_V$, this information can be identified in all such models. DI measures the prediction margins of a suspect model around private and public samples, that is, the distance from the samples to the model's decision boundaries. In the rest of this section, we explain the theoretical framework that DI uses (consisting of a linear suspect model), the embedding generation necessary for using DI with realistic non-linear suspect models, and the verification procedure. A summary of the notation used throughout this work appears in Table 1.

2.1. THEORETICAL FRAMEWORK

The original DI paper (Maini et al., 2021) used a linear suspect model to theoretically prove the guarantees provided by DI. We first explain how DI works in this setting.

Setup. Consider a data distribution $\mathcal{D}$, such that any input-label pair $(x, y)$ can be described as:

$$y \sim \{-1, +1\}, \quad x_1 = y \cdot u \in \mathbb{R}^K, \quad x_2 \sim \mathcal{N}(0, \sigma^2 I) \in \mathbb{R}^D,$$

where $x = (x_1, x_2) \in \mathbb{R}^{K+D}$ and $u \in \mathbb{R}^K$ is a fixed vector. The last $D$ dimensions of $x$ represent Gaussian noise (with variance $\sigma^2$).

Structure of the linear model. Assume a linear model $f$ with weights $w = (w_1, w_2)$, such that $f(x) = w_1 \cdot x_1 + w_2 \cdot x_2$; the final classification decision is $\mathrm{sgn}(f(x))$. With the weights initialized to zero, $f$ learns the weights using gradient descent with learning rate 1 until $y f(x)$ is maximized. Given a private training dataset $S_V \sim \mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1, \dots, m\}$ and a public dataset $S_0 \sim \mathcal{D}$ (both of size $m$), we have $w_1 = mu$ and $w_2 = \sum_{i=1}^{m} y^{(i)} x_2^{(i)}$ regardless of the batch size.

In DI, the prediction margin $p(\cdot)$ is used to indicate the confidence of $f$ in its prediction. It is defined as the margin (distance) of a data point from the decision boundary:

$$p(x) \triangleq y \cdot f(x). \tag{1}$$

The authors (Maini et al., 2021) show that the difference of the expected prediction margins of the two datasets $S_V$ and $S_0$ is $D\sigma^2$. The threshold can be set to $\lambda \in (0, D\sigma^2)$, and by estimating the difference of the prediction margins of $f_{SP}$ on $S_0$ and $S_V$, DI is able to distinguish whether that model is stolen.

Note that DI uses approximations of the prediction margins based on embeddings. The theoretical framework assumes that the approximations are accurate, so we can use them directly for the theoretical analysis (Equation 1). For the linear model, the margins can be computed analytically; in Section 2.2, we explain how the approximations of the margins are obtained.
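The $D\sigma^2$ margin gap can be checked with a small simulation. The sketch below is an illustration only, not the original authors' code: it uses the closed-form weights from the setup above, and the concrete values of `m`, `K`, `D`, `sigma` and the choice of `u` (all entries equal, with $\|u\|_2 = 1/\sqrt{m}$) are assumptions made for the example.

```python
import random


def margin_gap(m, K, D, sigma, seed=0):
    """Return (mean margin on the training set S_V, mean margin on a fresh set S_0)."""
    rng = random.Random(seed)
    # Hypothetical signal vector with ||u||_2 = 1 / sqrt(m) (an assumption).
    u = [1.0 / (m * K) ** 0.5] * K
    u_sq = sum(c * c for c in u)

    def sample(n):
        # Draw n pairs (y, x2); x1 = y * u is implied by y, so only the noise is stored.
        data = []
        for _ in range(n):
            y = rng.choice((-1, 1))
            x2 = [rng.gauss(0.0, sigma) for _ in range(D)]
            data.append((y, x2))
        return data

    train, public = sample(m), sample(m)

    # Closed-form weights from the setup: w1 = m * u, w2 = sum_i y_i * x2_i.
    w2 = [0.0] * D
    for y, x2 in train:
        for d in range(D):
            w2[d] += y * x2[d]

    def mean_margin(data):
        # y * f(x) = y * (w1 . (y * u) + w2 . x2) = m * ||u||^2 + y * (w2 . x2)
        total = 0.0
        for y, x2 in data:
            total += m * u_sq + y * sum(a * b for a, b in zip(w2, x2))
        return total / len(data)

    return mean_margin(train), mean_margin(public)


def average_gap(trials, m, K, D, sigma):
    """Average the train-minus-public margin gap over several random seeds."""
    gaps = []
    for seed in range(trials):
        on_train, on_public = margin_gap(m, K, D, sigma, seed=seed)
        gaps.append(on_train - on_public)
    return sum(gaps) / len(gaps)
```

Calling `average_gap(30, 100, 5, 100, 0.5)` should return a value close to $D\sigma^2 = 100 \cdot 0.25 = 25$; individual runs fluctuate noticeably, which is why the gap is averaged over seeds.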

2.2. EMBEDDING GENERATION

In order to use DI, one needs to generate embeddings of the samples. V queries their model $f_V$ with samples from their private dataset $S_V$ and public dataset $S_0$, and assigns the labels $b = 1$ and $b = 0$ respectively. The authors propose two methods of generating the embeddings: a white-box approach (MinGD) and a black-box one (Blind Walk). In this work, we use only Blind Walk as it outperforms MinGD in most experimental setups in the original work, and is more realistic, as it only requires access to the API of the suspect model. Blind Walk estimates the prediction margin of a sample by measuring its robustness to random noise. For a sample $(x, y)$, to compute the margin, first choose a random direction $\delta$, and take $k \in \mathbb{N}$ steps in that direction until the sample is misclassified, i.e. $f(x + k\delta) \neq y$. This is repeated multiple times to increase the size of the embedding. As reported in (Maini et al., 2021), obtaining embeddings for 100 samples can take up to 30,000 queries. Having obtained the embeddings, V trains a regression model $g_V$ that predicts the confidence that a sample contains private information from $S_V$.
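The Blind Walk procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: `predict` stands for any black-box label oracle, the directions are drawn from Gaussian noise, and `n_directions`, `max_steps` and `step_size` are hypothetical parameters chosen for the example.

```python
import random


def blind_walk_embedding(predict, x, y, n_directions=30, max_steps=200,
                         step_size=0.1, seed=0):
    """Estimate the margin of (x, y) along random directions.

    predict: black-box function mapping an input (list of floats) to a label.
    Returns one step count per direction; larger counts suggest a larger margin.
    """
    rng = random.Random(seed)
    embedding = []
    for _ in range(n_directions):
        # Random direction delta (Gaussian noise is an assumption of this sketch).
        delta = [step_size * rng.gauss(0.0, 1.0) for _ in x]
        k = 1
        # Walk along delta until f(x + k * delta) != y or the step budget is spent.
        while k < max_steps and predict(
                [xi + k * di for xi, di in zip(x, delta)]) == y:
            k += 1
        embedding.append(k)
    return embedding
```

Each direction costs up to `max_steps` queries, so one sample can need up to `n_directions * max_steps` queries in this sketch, which illustrates why query cost is a concern in the black-box setting.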

2.3. OWNERSHIP VERIFICATION

Using the scores from $g_V$ and the membership labels, V creates vectors $c$ and $c_V$ of equal size from $S_V$ and $S_0$, respectively. V then tests the null hypothesis $H_0: \mu < \mu_V$, where $\mu = \bar{c}$ and $\mu_V = \bar{c}_V$ are the mean confidence scores. The test either rejects $H_0$ and rules that the suspect model is 'stolen', or gives an inconclusive result. To verify whether $f_{SP}$ is stolen or independent, V obtains the embeddings by querying the model (using Blind Walk) with samples from $S_V$ and $S_0$. V then uses the embeddings to obtain the confidence scores from $g_V$, and performs a hypothesis test on the two distributions of scores.
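The decision step can be illustrated with a one-sided two-sample test on the regressor scores. The sketch below is an approximation under stated assumptions: it computes Welch's statistic but obtains the p-value from a normal approximation (rough for small samples such as $k = 10$), it assumes the sign convention that scores on private samples are higher for a stolen model, and the names `verify_ownership` and `alpha` are hypothetical. An exact t-test, for example `scipy.stats.ttest_ind` with `equal_var=False` and `alternative="greater"`, would be the more faithful choice where SciPy is available.

```python
from statistics import NormalDist, mean, variance


def verify_ownership(scores_private, scores_public, alpha=0.01):
    """One-sided test of whether private-sample scores exceed public-sample scores.

    Returns (delta_mu, p_value, flagged_as_stolen).
    """
    # Standard error of the difference of means (Welch, unequal variances).
    se = (variance(scores_private) / len(scores_private)
          + variance(scores_public) / len(scores_public)) ** 0.5
    delta_mu = mean(scores_private) - mean(scores_public)
    t_stat = delta_mu / se
    # Normal approximation to the t distribution (an assumption of this sketch).
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return delta_mu, p_value, p_value < alpha
```

A small p-value leads to the 'stolen' verdict; otherwise the result is inconclusive, mirroring the two outcomes described above.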

3. FALSE POSITIVES IN DATASET INFERENCE

To generate the embeddings for a specific sample in the private dataset $S_V$, DI requires querying the suspect model $f_{SP}$ hundreds of times. To reduce the total number of queries, DI was shown to be effective with only 10 private samples, with at least 95% confidence. Additionally, DI requires a large random noise dimension $D$, such that the probability of success increases to 1 as $D \to \infty$. In this section, we prove that these two assumptions are not realistic in the case of a linear model: 1) DI is susceptible to false positives (FPs) unless V reveals a large number of samples; 2) a large $D$ will harm the utility of the model (Section 3.1). V needs to use many private samples to guarantee a low false positive rate. Furthermore, we find that the theoretical results on linear suspect models, which say that the margins on different models are distinguishable under some strict conditions, do not hold for more realistic non-linear suspect models. Using a PAC-Bayesian margin-based generalization bound (Neyshabur et al., 2018), we prove that models trained on the same distribution are indistinguishable, and will trigger FPs (Section 3.2.1). Next, we provide empirical evidence for the existence of FPs (Section 3.2.2).

3.1. LINEAR SUSPECT MODELS

In Section 2, we set up a distribution $\mathcal{D}$ for linear models. The linear model $f$ should correctly classify most of the randomly picked data from this distribution. However, in a subspace of the linear case used by DI, we find that the dimension of the noise part of $x$ needs to be small; otherwise it will harm the utility of the model.

Lemma 1 (Need for Bounding Noise Dimension). Let $f$ be a linear model trained on $S \sim \mathcal{D}$. For a sample $(x, y)$ drawn from $\mathcal{D}$ independently of $S$, assuming that $\|u\|_2 \le \frac{1}{\sqrt{m}}$ and $\sigma^2 > \frac{1}{\sqrt{m}}$, the linear model $f$ correctly classifies $(x, y)$ with a probability larger than 0.9 only if $D < 10$.

The details of the proof are in Appendix A. Lemma 1 shows that if the dimension of $x_2$, which follows $\mathcal{N}(0, \sigma^2)$, is large, then the noise will dominate $f$ and mislead it into making incorrect predictions. For example, set $D = 1000$ and assume that the variance of $x_2$ is 0.25 (close to the CIFAR10 dataset). Then, $f$ can correctly classify a sample that is not in its training set with a probability of at most 0.69.

Theorem 1 (Existence of False Positives with Linear Suspect Models). Let $f_I$ be a linear classifier trained on the independent dataset $S_I \sim \mathcal{D}$ with accuracy more than 0.9. Assume that $|S_I| = m$, $\|u\|_2 \le \frac{1}{\sqrt{m}}$ and $\sigma^2 > \frac{1}{\sqrt{m}}$. Let $k$ be the number of samples required for the verification. Then, the probability that V mistakenly decides that $f_I$ is a stolen model satisfies

$$P[\Psi(f_I, S_V; \mathcal{D}) = 1] > 1 - \Phi\left(\frac{\sqrt{k}}{\sqrt{m}}\right),$$

where $\Psi$ is V's decision function (Maini et al., 2021):

$$\Psi(f_{SP}, S; \mathcal{D}) = \begin{cases} 1, & \text{if } f_{SP} \sim f_A, \\ 0, & \text{if } f_{SP} \sim f_I. \end{cases}$$

Proof. Recall that V tries to reveal only a few samples during the verification. Consider a distribution $\mathcal{D}$ where $\|u\|_2 \le \frac{1}{\sqrt{m}}$ and $\sigma^2 > \frac{1}{\sqrt{m}}$. Following the intuition from DI (Yeom et al., 2018), for satisfactory performance, DI must minimise both false positives and false negatives.
Hence, the objective function is defined as:

$$\min_{\lambda} \frac{P[\Psi(f_I, S_V; \mathcal{D}) = 1] + P[\Psi(f_V, S_V; \mathcal{D}) = 0]}{2},$$

where the margin of $\mathcal{D}$ is estimated using $S_V$ and $S_0$. Note that we are only interested in the false positives $P[\Psi(f_I, S_V; \mathcal{D}) = 1]$. Let $S_I = \{(x^{(i)}, y^{(i)}) \mid i = 1, \dots, m\}$, and let $S_*^k$ be a subset of $S_*$ consisting of $k$ samples. Then

$$\begin{aligned}
P[\Psi(f_I, S_V; \mathcal{D}) = 1]
&= P\left[E_{(x,y) \in S_V^k}[y f_I(x)] - E_{(x,y) \in S_0^k}[y f_I(x)] \ge \lambda\right] \\
&= P\left[E_{(x,y) \in S_V^k}\left[\sum_{i}^{m} y^{(i)} x_2^{(i)} \cdot x_2\right] - E_{(x,y) \in S_0^k}\left[\sum_{i}^{m} y^{(i)} x_2^{(i)} \cdot x_2\right] \ge \lambda\right] \\
&= P\left[\frac{1}{k} \sum_{j}^{k} \sum_{i}^{m} y^{(i)} x_2^{(i)} \cdot x_2^{(j)} - \frac{1}{k} \sum_{p}^{k} \sum_{i}^{m} y^{(i)} x_2^{(i)} \cdot x_2^{(p)} \ge \lambda\right].
\end{aligned}$$

Recall that $x_2^{(i)}$, $x_2^{(j)}$ and $x_2^{(p)}$ are $D$-dimensional vectors sampled independently from $\mathcal{N}(0, \sigma^2)$. Using the central limit theorem, we can approximate the terms. We have $\sum_{i}^{m} y^{(i)} x_2^{(i)} \sim \mathcal{N}(0, m\sigma^2)$. Then, we can approximate $\frac{1}{k} \sum_{j}^{k} \sum_{i}^{m} y^{(i)} x_2^{(i)} \cdot x_2^{(j)}$ by $t_1 \sim \mathcal{N}(0, \frac{mD}{k}\sigma^4)$ and $\frac{1}{k} \sum_{p}^{k} \sum_{i}^{m} y^{(i)} x_2^{(i)} \cdot x_2^{(p)}$ by $t_2 \sim \mathcal{N}(0, \frac{mD}{k}\sigma^4)$ (Maini et al., 2021). Thus, for their difference $t = t_1 - t_2$ we get $t \sim \mathcal{N}(0, \frac{2mD}{k}\sigma^4)$, and

$$P[\Psi(f_I, S_V; \mathcal{D}) = 1] = P[t \ge \lambda] = P\left[\sqrt{\frac{2mD}{k}}\,\sigma^2 Z \ge \lambda\right] = P\left[Z \ge \frac{\sqrt{k}\,\lambda}{\sqrt{2mD}\,\sigma^2}\right] = 1 - \Phi\left(\frac{\sqrt{k}\,\lambda}{\sqrt{2mD}\,\sigma^2}\right), \tag{5}$$

where $Z \sim \mathcal{N}(0, 1)$. The optimal threshold is given as $\lambda = \frac{D\sigma^2}{2}$, so

$$P[\Psi(f_I, S_V; \mathcal{D}) = 1] = 1 - \Phi\left(\frac{\sqrt{kD}}{2\sqrt{2m}}\right). \tag{6}$$

From Equation 6, we see that the probability of false positives depends on the fraction of points used for the verification, $\frac{k}{m}$, and on the size of $D$. Combining with Lemma 1, the proof is complete.

In other words, the success of DI is directly related to the number of samples used for the verification. This is similar to the analysis of the failure of membership inference in the original paper when $k$ is extremely low, e.g. only 10 samples. In the DI paper, it was explained that DI succeeds because it calculates the average margin over multiple verification samples, whereas membership inference fails as it relies on a per-sample decision.
So when the number of tested samples is small, the success rate of DI will be close to 0.5, just like for membership inference. In Figure 1, we show the probability of an FP (Equation 6) for different values of $k$; even for $k = 10000$ the probability is 0.309. Hence, even in the simple linear setup, $\Psi(f, S; \mathcal{D})$ has false positives with high probability, in particular when the fraction of tested samples is small.
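The two closed-form quantities of this section are easy to evaluate numerically. The sketch below is an illustration only: `accuracy_upper_bound` evaluates the bound $1 - \Phi(-4/\sqrt{D})$ used in the appendix proof of Lemma 1, and `false_positive_probability` evaluates Equation 6, i.e. $1 - \Phi(\sqrt{kD}/(2\sqrt{2m}))$. The function names are hypothetical, and the values $m = 50{,}000$ and $D = 10$ in the usage note are assumptions chosen because they are consistent with the 0.309 figure quoted for $k = 10000$.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def accuracy_upper_bound(D):
    """Lemma 1 bound: P[correct classification] <= 1 - Phi(-4 / sqrt(D))."""
    return 1.0 - PHI(-4.0 / sqrt(D))


def false_positive_probability(k, m, D):
    """Equation 6: 1 - Phi(sqrt(k * D) / (2 * sqrt(2 * m)))."""
    return 1.0 - PHI(sqrt(k * D) / (2.0 * sqrt(2.0 * m)))
```

Under the assumed values, `false_positive_probability(10_000, 50_000, 10)` evaluates to about 0.309, and the probability moves towards 0.5 as `k` shrinks. Similarly, `accuracy_upper_bound(D)` drops below 0.9 at `D = 10`, in line with Lemma 1.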

3.2. NON-LINEAR SUSPECT MODELS

Having demonstrated the limitations of the linear model, we now focus on non-linear suspect models. The intuition is based on margin-based generalization bounds. These bounds state that the expected error of the margin-based loss function is bounded, and that the bound depends mostly on the distribution (Neyshabur et al., 2018). Since DI assumes that all the datasets follow the distribution $\mathcal{D}$, our intuition is to directly use the generalization bounds and the triangle inequality to prove the similarity of the models trained on the same distribution.

3.2.1. THEORETICAL MOTIVATION

Let $f_w: \mathcal{X} \to \mathbb{R}^k$, with $\|x\| \le B$, be a real-valued classifier with parameters $w = \{W_i\}_{i=1}^{d}$. For any distribution $\mathcal{D}$, define the margin $p(f, x) = f(x)[y] - \max_{j \neq y} f(x)[j]$, and let $\gamma > 0$ be a margin threshold. The margin is the same as for the linear model with labels $y \in \{-1, +1\}$. Then, we define the margin loss function as:

$$L_\gamma(f, y) = P_{(x,y) \sim \mathcal{D}}\left[f(x)[y] - \max_{j \neq y} f(x)[j] \le \gamma\right].$$

Note that the PAC-Bayes framework (Neyshabur et al., 2018) provides guarantees for any classifier $f$ trained on data from a given distribution. We define the expected loss of a classifier $f$ on distribution $\mathcal{D}$ as $L_{\mathcal{D}} := E_{(x,y) \sim \mathcal{D}}[L(f(x), y)]$ and the empirical loss on a dataset $S$ as $\hat{L}_S := \frac{1}{m} \sum_{(x,y) \in S} L(f(x), y)$. Then, for a $d$-layer feed-forward network $f$ with parameters $w = \{W_i\}_{i=1}^{d}$ and ReLU activations, the empirical loss is very close to the expected loss (Neyshabur et al., 2018). For any $\sigma, \gamma > 0$, with probability $1 - \sigma$ over the training set, we have:

$$|L_{\mathcal{D}}(f_S) - \hat{L}_S(f_S)| \le O(\epsilon), \quad \text{where } \epsilon = \sqrt{\frac{B^2 d^2 h \ln(dh) \prod_{i=1}^{d} \|W_i\|_2^2 \sum_{i=1}^{d} \frac{\|W_i\|_F^2}{\|W_i\|_2^2} + \ln\frac{dm}{\sigma}}{\gamma^2 m}},$$

and $h$ is the upper bound on the dimension of $\{W_i\}_{i=1}^{d}$. This PAC-Bayes-based generalization guarantee states that for a model $f$, the distance between the empirical loss and the expected loss is bounded, and the bound can be very small when the model's margin is large. Thus, we can expect the margins of $f$ on any dataset that follows a given distribution to be similar. This contradicts the intuition of DI. Moreover, since DI assumes that $S_V$ and $S_I$ follow the same distribution $\mathcal{D}$, we can show that the margins for $f_V$ and $f_I$ are similar to each other.

Theorem 2 (k-independent False Positives with Non-linear Suspect Models). For the victim's private dataset $S_V \sim \mathcal{D}$ and an independent dataset $S_I \sim \mathcal{D}$, let $f_w$ be a $d$-layer feed-forward network with ReLU activations and parameters $w = \{W_i\}_{i=1}^{d}$. Assume that $f_V$ is trained on $S_V$, $f_I$ is trained on $S_I$, and $f_V$ and $f_I$ have the same structure.
Then, for any $B, d, h, \epsilon > 0$ and any $x \in \mathcal{X}$, there exists a prior $P$ on $w$ such that, with probability at least $\frac{1}{2}$,

$$|E(p(f_V, x) - p(f_I, x))| \le \epsilon.$$

The details of the proof are in Appendix A. Hence, for any two models trained on the same distribution, the expected margins for any sample are similar. Given that DI works by distinguishing the difference of margins for two models, it will result in false positives with probability at least $\frac{1}{2}$ (Theorem 2).

3.2.2. EMPIRICAL EVIDENCE

Having proved the existence of FPs for non-linear models, we now focus on confirming it empirically. First, recall the original experimental setup (Maini et al., 2021), and consider the following two models: 1) $f_V$ trained using $S_V$, and 2) $f_0$ trained using $S_0$. In the original formulation, e.g. for CIFAR10, CIFAR10-train (50,000 samples) is used as $S_V$, and CIFAR10-test (10,000 samples) is used as $S_0$. Recall that V uses their $S_V$ and $S_0$ to obtain the embeddings that are then used to train the regression model $g_V$. DI was shown to be effective against several post-processing techniques used to obtain dependent models, which are expected to be flagged as stolen (true positives). However, the independent model $f_0$ is trained on $S_0$, the same data that is used to train $g_V$. This means that the same dataset $S_0$ is used both to train $g_V$ and, subsequently, to evaluate it. This is likely to introduce a bias that overestimates the efficacy of $g_V$ and of DI as a whole. To address this, and to test whether DI works for a more reasonable data split, we use the following setup:

1) randomly split CIFAR10-train into two subsets ($A_{train}$ and $B_{train}$) of 25,000 samples each;
2) assign $S_V = A_{train}$, and train $f_V$ using it;
3) continue using CIFAR10-test as $S_0$ (nothing changes), and train $f_0$ using it;
4) train $g_V$ using the embeddings for $S_0$ and the new $S_V$, obtained from the new $f_V$;
5) assign $S_I = B_{train}$, the independent data of a third party I, who trains their model $f_I$.

This way, we have an independent model $f_I$ that was trained on data from the same distribution $\mathcal{D}$ as $S_V$, but on data that was not seen by $g_V$. Recall that to determine whether the model is stolen, DI obtains the embeddings for private ($S_V$) and public ($S_0$) samples. Then it measures the confidence for each of the embeddings using the regressor $g_V$. For a model derived from V's $S_V$, the mean difference ($\Delta\mu$) between the confidence assigned to $S_V$ and $S_0$ should be large.
If the model is not derived from $S_V$, the difference should be small. The decision is made using the hypothesis test that compares the distributions of measures from $g_V$. In Figure 2 we visualise the difference in the distributions for three models. For $f_V$ we observe two separable distributions with a large $\Delta\mu$, while for $f_0$ the difference is small: DI is working as intended. However, for $f_I$, even though $\Delta\mu$ is smaller than for $f_V$, it is sufficiently large to reject $H_0$ with high confidence. Therefore, $f_I$ is marked as stolen, a false positive. In Table 2 we provide $\Delta\mu$ and the associated p-values for multiple random splits.

Table 2: Verification of an independent model trained on the same data distribution triggers an FP. We also report the accuracy of the models on the test set. We provide the mean and standard deviation computed across five runs. Verification is done using $k = 10$ private samples. FPs become more significant as $k$ increases (see Appendix B).

| Model | Accuracy | $\Delta\mu$ | p-value |
|-------|-------------|--------------|---------------------------|
| $f_V$ | 0.87 ± 0.03 | 1.62 ± 0.08 | $10^{-18} \pm 10^{-18}$ |
| $f_I$ | 0.87 ± 0.03 | 1.14 ± 0.12 | $10^{-8} \pm 10^{-8}$ |
| $f_0$ | 0.64 ± 0.02 | -0.29 ± 0.12 | 0.46 ± 0.04 |

We discuss the implications of our findings in Section 5.
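The data split used in this experiment (two disjoint halves of CIFAR10-train for $S_V$ and $S_I$, with CIFAR10-test kept as $S_0$) can be sketched at the level of sample indices. This is a minimal sketch under stated assumptions: the function name, the seed handling and the index-based representation are hypothetical, and loading the actual images is left out.

```python
import random


def split_for_fp_experiment(n_train=50_000, seed=0):
    """Randomly split the training indices into two disjoint halves.

    Returns a dict with index lists for S_V (victim) and S_I (independent party).
    S_0 is the unchanged test set and is therefore not produced here.
    """
    rng = random.Random(seed)
    indices = list(range(n_train))
    rng.shuffle(indices)
    half = n_train // 2
    return {"S_V": indices[:half], "S_I": indices[half:]}
```

Using a different `seed` per run gives the multiple random splits over which the table statistics are averaged; the key property is that `S_V` and `S_I` never share a sample.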

4. FALSE NEGATIVES IN DATASET INFERENCE

Having demonstrated the existence of false positives, we now show that DI can suffer from false negatives (FNs). A can avoid detection by regularising $f_A$, and thus changing the prediction margins. This, in turn, will mislead DI into flagging $f_A$ as independent. Recall that Blind Walk relies on finding the prediction margin by querying perturbed samples designed to cause a misclassification. In order to avoid detection, A needs to make the prediction margin robust to such perturbations. We do so using adversarial training, a popular regularisation method used to provide robustness against adversarial examples. An adversary A who launches a model extraction attack against $f_V$, or steals V's $S_V$, can adversarially train $f_A$. During adversarial training, each training sample $(x, y)$ is replaced with an adversarial example that is misclassified, i.e. $f_A(x + \gamma) \neq y$. There exist many techniques for crafting adversarial examples. We use projected gradient descent (PGD) (Madry et al., 2018), and we set $\gamma = 10/255$ (under $l_\infty$).

We evaluate adversarial training as a way to avoid detection in a setting where A steals V's $S_V$ and trains their own model $f_A$. $f_A$ has the same architecture and hyperparameters as $f_V$, but is adversarially trained. Hence, the experiment is biased in favour of DI. In Figure 3 we visualise the difference in the distributions of scores assigned by $g_V$ to $f_A$ embeddings derived for $S_V$ and $S_0$. We observe that the distributions are not clearly separable and result in a low $\Delta\mu$, and hence $H_0$ cannot be rejected. Therefore, $f_A$ is marked as an independent model, a false negative.

Figure 3: Confidence scores assigned to embeddings by $g_V$ obtained from $f_A$. $\Delta\mu$ is small enough to trigger FNs.

In Table 3 we provide $\Delta\mu$ and the associated p-values for multiple runs. Note that adversarial training comes with an accuracy trade-off. In our experiments, the accuracy of $f_A$ goes from 0.92±0.01 to 0.86±0.01. We study how the amount of noise affects the verification in Appendix D.
Also, we discuss the resulting implications in Section 5.
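To make the evasion mechanism concrete, the sketch below shows $l_\infty$ PGD and one adversarial training step on a toy logistic-regression model. It is an illustration under stated assumptions, not the experimental code: the logistic model stands in for the neural networks used in the experiments, `eps = 10/255` follows the value given above, and the step size, number of steps, learning rate and all function names are hypothetical. The random start of Madry et al. is omitted to keep the example deterministic.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def loss(w, b, x, y):
    """Binary cross-entropy of the logistic model on one sample, y in {0, 1}."""
    p = min(max(sigmoid(logit(w, b, x)), 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def pgd_linf(w, b, x, y, eps=10 / 255, step=2 / 255, n_steps=10):
    """Craft an adversarial example inside the l_inf ball of radius eps around x."""
    x_adv = list(x)
    for _ in range(n_steps):
        p = sigmoid(logit(w, b, x_adv))
        grad = [(p - y) * wi for wi in w]  # gradient of the loss w.r.t. the input
        for i, g in enumerate(grad):
            moved = x_adv[i] + step * ((g > 0) - (g < 0))    # signed ascent step
            moved = min(max(moved, x[i] - eps), x[i] + eps)  # project onto the ball
            x_adv[i] = min(max(moved, 0.0), 1.0)             # keep a valid pixel range
    return x_adv


def adversarial_training_step(w, b, x, y, lr=0.1):
    """Replace (x, y) by its adversarial example and take one gradient step on it."""
    x_adv = pgd_linf(w, b, x, y)
    p = sigmoid(logit(w, b, x_adv))
    new_w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
    new_b = b - lr * (p - y)
    return new_w, new_b, x_adv
```

Training on `x_adv` instead of `x` pushes the decision boundary away from the training points within the `eps` ball, which is the change in prediction margins that the evasion relies on.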

5. DISCUSSION

Revealing private data. We have shown that DI requires revealing significantly more than 50 samples to avoid false positives in the case of linear models (Figure 1). Since the core assumption of DI is that $S_V$ is private, revealing too much of $S_V$ during the ownership verification constitutes a privacy threat. In neither of the settings described in Section 5 of the original DI paper can the victim query the model sufficiently without leaking the query data to the adversary. Additionally, it was shown that using more samples gives V more information about the prediction margin than using stronger embedding methods (Maini et al., 2021). Model owners that operate in sensitive domains such as healthcare or the insurance industry need to comply with strict data protection laws, and hence need to minimise the disclosure. One potential way to protect the privacy of the private samples used for DI ownership verification is to use oblivious inference (Liu et al., 2017; Juvekar et al., 2018). This way V could query $f_A$ without revealing $S_V$. Despite recent advances in efficient oblivious inference (Samragh et al., 2021; Watson et al., 2022; Samardzic et al., 2021; 2022), it requires all parties (including A!) to update their software stacks, which may not always be realistic.

Viability of ownership verification using training data. We have demonstrated that DI suffers from FPs when faced with an independent model trained on the same distribution. While it is reasonable to assume that V's data is private, the uniqueness of the distribution is difficult to guarantee in practice. For example, two model builders may have data from the same distribution because they purchased their training data from a vendor that generates per-client synthetic data from the same distribution (e.g., regional financial data).
In fact, two model builders working on the same narrow domain and independently building models that are intended to represent the same phenomenon, may very well end up using data from the same distribution. There are other methods that attempt to detect stolen models based on the dataset used to train them (Sablayrolles et al., 2020; Pan et al., 2022) . However, they rely on flaws in the model to establish the ownership (susceptibility to adversarial examples (Sablayrolles et al., 2020) or membership inference attacks (Pan et al., 2022) ). Intuitively, given a perfect membership inference attack, a fingerprinting scheme should be possible. However, recent work shows that for a balanced dataset, only a fraction of records is vulnerable to a confident membership inference attack (Carlini et al., 2022; Duddu et al., 2021) which in turn reduces the capabilities of a membership inference-based fingerprinting scheme. Therefore, any improvements to generalisation or robustness (such as adversarial training or purification (Nie et al., 2022) ) of ML models reduce the surface for ownership verification schemes.

White-box theft. Our experiments in Section 4 are limited to an adversary A that trains their own model: they either steal the data or conduct a model extraction attack. If A obtains an exact copy of the model, they might lack the data to fine-tune it with adversarial training. Hence, our findings do not apply to the white-box setting. We leave the examination of other threat models for future work.

Black-box vs. white-box verification setting. Our evaluation is focused on the black-box DI setting. We do not consider the white-box DI setting, which uses MinGD. While white-box DI is feasible in a scenario where V takes A (the holder of a suspect model) to court, requiring A to provide white-box access to the suspect model, prosecution is an expensive undertaking. Realistically, V is likely to first conduct black-box DI to decide whether the expense of prosecution is justified. Therefore, FPs in the black-box DI setting can cause substantial monetary loss to V.

6. CONCLUSION

We analyzed Dataset Inference (DI) (Maini et al., 2021), a promising fingerprinting scheme, to show theoretically and empirically that DI is prone to false positives in the case of independent models trained on distinct datasets drawn from the same distribution. This limits the applicability of DI to settings where a model builder uses a dataset with a definitively unique distribution. We also showed that an attacker can use adversarial training to regularise the decision boundaries of a stolen model to evade detection by DI at the cost of a modest (6pp) drop in accuracy. Nevertheless, DI is a promising ML fingerprinting scheme. Model owners can use our results to make informed decisions as to whether DI is appropriate for their particular settings.

Note that in Equation 12, since $x_2 \sim \mathcal{N}(0, \sigma^2 I)$, we have $\|x_2\|_2^2 \sim \sigma^2 \chi^2_D$, so $E[\|x_2^{(i)}\|_2^2] = D\sigma^2$. Consider a new dataset $S_0 \sim \mathcal{D}$. The expected prediction margin for the points in $S_0^+$ is

$$E_{S_0^+}[y f(x)] = yc + E_{S_0^+}\left[\sum_{i=1}^{m} y^{(i)} x_2^{(i)} \cdot x_2\right] = c.$$

Finally, we see that the difference between the prediction margins of the training set $S$ and the test set $S_0$ is

$$E_{S^+}[y f(x)] - E_{S_0^+}[y f(x)] = D\sigma^2.$$

DI's decision function. From the above analysis, we know that the statistical difference between the distributions of training and test data is $D\sigma^2$, which is usually numerically larger than 1. DI utilizes this difference to predict whether a potential adversary's model stole the victim's knowledge. Recall that $E_{S_0}[y f(x)] = c$ and $E_{S}[y f(x)] = c + D\sigma^2$. Let $\Psi(f, S; \mathcal{D})$ represent the dataset inference victim's decision function. It is defined as

$$\Psi(f, S; \mathcal{D}) = \begin{cases} 1, & \text{if } E_{(x,y) \in S}[y \cdot f(x)] - E_{\mathcal{D}}[y \cdot f(x)] \ge \lambda, \\ 0, & \text{otherwise,} \end{cases}$$

where $\lambda \in [0, D\sigma^2]$ is some threshold that the decision function uses to maximise true positives and minimise false positives.
Proof for Lemma 1 For a linear model f trained on distribution D where x = (x 1 , x 2 ), x 1 = yu,x 2 ∼ N (0, σ 2 ) and ||u|| 2 ≤ 1 √ m , f is expected to achieve high accuracy on any sample (x, y) sampled randomly from D which is independent of the training data set of f . Proof. Given a linear model f trained on dataset S ∼ D = {(x (i) , y (i) )|i = 1, ..., m}, and a test sample (x, y) sampled randomly from D which is independent of S, the probability that (x, y) is correctly classified by f can be represented as: P[yf (x) ≥ 0] = P[mu 2 + y m i y(i)x (i) 2 x 2 ≥ 0] = P[y m i y(i)x (i) 2 x 2 ≥ -mu 2 ] ≤ P[y m i y(i)x (i) 2 x 2 ≥ -1] (16) Since x 2 ∼ N (0, σ 2 ) are D-dimensional vectors, we can use central limit theorem to approximate the term. Thus, the internal term can be approximated by a variable t ∼ N (0, mDσ 4 ). Let Z ∼ N (0, 1), P[yf (x) ≥ 0] ≤ P[ √ mDσ 2 Z ≥ -1] = 1 -Φ(- 1 √ mDσ 2 ) ( ) where Φ is the normal CDF. For a distribution where the randomness σ 2 ≥ 1 √ m ≥ 1 4 √ m . P[yf (x) ≥ 0] ≤ 1 -Φ(- 4 √ D ), where Φ(-4 √ D ) ≈ 0.10. The linear model f can correctly classify a sample with a probability more than 0.9 only if D < 10. Then, P (x,y)∈S ≥ P (x,y)∈D/S . This completes the proof. Proof for Theorem 2 Let f w be a d-layer feed-forward model trained on distribution D with parameters w = {W i } d i=1 and the ReLU activation function. Assuming a training dataset S ∼ D, the model is given as f S = f w+u S , where u S is a random variable whose distribution may also depend on S. Since the key to analyze the margin is the output of the model, we first introduce Lemma 2 that analyzes the perturbation bound of the model trained on S and D. Lemma 2 (Perturbation Bound (Lemma 2) in (Neyshabur et al., 2018) ). For any B, d > 0, let f w : X → R k be a d-layer neural network with ReLU activations. 
Then for any $w$, any $x \in \mathcal{X}$, and any perturbation $u_S = \{U_i\}_{i=1}^{d}$ such that $\|U_i\|_2 \le \frac{1}{d}\|W_i\|_2$, the change in the output of the network can be bounded as follows:

$$|f_{w+u_S}(x) - f_w(x)| \le eB\Big(\prod_{i=1}^{d} \|W_i\|_2\Big) \sum_{i=1}^{d} \frac{\|U_i\|_2}{\|W_i\|_2}.$$

Since our proof is also based on Lemma 2, it is analogous to the analysis of the generalization bound in (Neyshabur et al., 2018) and is essentially the same for the first part.

Proof. The proof involves two parts. In the first part, we show the maximum allowed perturbation of parameters as shown in (Neyshabur et al., 2018). In the second part, we show that the margin difference of the models trained on $S_V$ and $S_I$ is also bounded by the perturbation of parameters.

Let $\beta = \big(\prod_{i=1}^{d} \|W_i\|_2\big)^{1/d}$, and consider a network with normalized weights $\widetilde{W}_i = \frac{\beta}{\|W_i\|_2} W_i$. Due to the homogeneity of the ReLU, we have $f_{\tilde{w}} = f_w$. We can also verify that $\prod_{i=1}^{d} \|W_i\|_2 = \prod_{i=1}^{d} \|\widetilde{W}_i\|_2$ and $\frac{\|W_i\|_F}{\|W_i\|_2} = \frac{\|\widetilde{W}_i\|_F}{\|\widetilde{W}_i\|_2}$. Therefore, it is sufficient to prove the theorem only for the normalized weights $\tilde{w}$, and hence w.l.o.g. we assume that for any layer $i$, $\|W_i\|_2 = \beta$.

Choose the prior distribution $P$ of $w$ to be $\mathcal{N}(0, \sigma^2 I)$, and consider the random perturbation $u_S = \{U_i\}_{i=1}^{d} \sim \mathcal{N}(0, \sigma^2 I)$. Since the prior cannot depend on the learned model $w$ or its norm, we set $\sigma$ based on the approximation $\tilde{\beta}$. For each value of $\tilde{\beta}$ on a pre-determined grid, we compute the PAC-Bayes bound, establishing the generalization guarantee for all $w$ for which $|\tilde{\beta} - \beta| \le \frac{1}{d}\beta$, and ensuring that each relevant value of $\beta$ is covered by some $\tilde{\beta}$ on the grid. We then take a union bound over all $\tilde{\beta}$ on the grid. For now, we consider a fixed $\tilde{\beta}$ and the $w$ for which $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$, and hence $\frac{1}{e}\beta^{d-1} \le \tilde{\beta}^{d-1} \le e\beta^{d-1}$. Since $u_S \sim \mathcal{N}(0, \sigma^2 I)$, we get the following bound for the spectral norm of $U_i$ (Tropp, 2012):

$$\mathbb{P}_{U_i \sim \mathcal{N}(0, \sigma^2 I)}\big[\|U_i\|_2 > t\big] \le 2h\, e^{-t^2 / 2h\sigma^2}.$$
Taking a union bound over the layers, we get that with probability at least $\frac{1}{\sqrt{2}}$, the spectral norm of the perturbation $U_i$ in each layer is bounded by $\sigma\sqrt{2h\ln(2dh)}$. Plugging this spectral norm bound into Lemma 2, we have that with probability at least $\frac{1}{\sqrt{2}}$ the maximum perturbation of the output is bounded as

$$\max_{x\in\mathcal{X}} |f_{w+u_S}(x) - f_w(x)| \le eB\beta^{d} \sum_{i} \frac{\|U_i\|_2}{\beta} \le e^2 dB\tilde{\beta}^{d-1}\sigma\sqrt{2h\ln(2dh)} \le \frac{\epsilon}{4},$$

where $\sigma = \frac{\epsilon}{42\, dB\tilde{\beta}^{d-1}\sqrt{2h\ln(2dh)}}$.

We can then compute the difference between the expected margins of $f_V$, which is trained on $S_V$, and $f_I$, which is trained on $S_I$. First, we bound the difference in margins between any model $f_S$ trained on $S \sim \mathcal{D}$ and the target model $f_{\mathcal{D}}$. For any verified dataset $\hat{S} \in \mathcal{D}$,

$$\begin{aligned}
|\mathbb{E}(p(f_S, x)) - \mathbb{E}(p(f_{\mathcal{D}}, x))|
&= \Big|\mathbb{E}\big(f_{w+u_S}(x)[y] - \max_{j \ne y} f_{w+u_S}(x)[j]\big) - \mathbb{E}\big(f_w(x)[y] - \max_{j \ne y} f_w(x)[j]\big)\Big| \\
&= \Big|\big(\mathbb{E}(f_{w+u_S}(x)[y]) - \mathbb{E}(f_w(x)[y])\big) - \big(\mathbb{E}(\max_{j \ne y} f_{w+u_S}(x)[j]) - \mathbb{E}(\max_{j \ne y} f_w(x)[j])\big)\Big| \\
&\le \max_{x\in\mathcal{X}} \big|f_{w+u_S}(x)[y] - f_w(x)[y]\big| + \max_{x\in\mathcal{X}} \big|\max_{j \ne y} f_{w+u_S}(x)[j] - \max_{j \ne y} f_w(x)[j]\big| \\
&\le 2\max_{x\in\mathcal{X}} |f_{w+u_S}(x) - f_w(x)| \le \frac{\epsilon}{2}.
\end{aligned}$$

So, for $f_V$ trained on $S_V$ and $f_I$ trained on $S_I$, we have with probability at least $\frac{1}{2}$ that the difference between their prediction margins is bounded by $\epsilon$:

$$|\mathbb{E}(p(f_V, x)) - \mathbb{E}(p(f_I, x))| \le |\mathbb{E}(p(f_V, x)) - \mathbb{E}(p(f_{\mathcal{D}}, x))| + |\mathbb{E}(p(f_I, x)) - \mathbb{E}(p(f_{\mathcal{D}}, x))| \le \epsilon.$$

In Figure 4, we show the results for verification, using Blind Walk, with more data (up to k = 100 private samples). As we increase the number of revealed private samples, the confidence of DI increases both for $f_V$ (true positive) and $f_I$ (false positive).
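The perturbation bound of Lemma 2, on which the first part of the proof of Theorem 2 relies, can be sanity-checked numerically. The sketch below is illustrative only. It assumes, following Neyshabur et al. (2018), that $B$ bounds the $\ell_2$ norm of the input and that the output change is measured in the $\ell_2$ norm. The depth, width and random weights are hypothetical, biases are omitted, and spectral norms are estimated by power iteration rather than computed exactly.

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def spectral_norm(M, iters=500):
    """Estimate ||M||_2 by power iteration on M^T M."""
    Mt = [list(col) for col in zip(*M)]
    v = [rng.gauss(0.0, 1.0) for _ in range(len(M[0]))]
    for _ in range(iters):
        v = matvec(Mt, matvec(M, v))
        n = norm(v)
        v = [a / n for a in v]
    return norm(matvec(M, v))

def forward(weights, x):
    """f_w(x) = W_d relu(W_{d-1} ... relu(W_1 x)), without biases."""
    for W in weights[:-1]:
        x = [max(0.0, a) for a in matvec(W, x)]
    return matvec(weights[-1], x)

d, h, B = 3, 8, 1.0                      # hypothetical depth, width, input norm
W = [rand_matrix(h, h) for _ in range(d)]
U = []
for Wi in W:
    Ui = rand_matrix(h, h)
    # Rescale U_i to about half of the allowed norm (1/d) * ||W_i||_2.
    s = 0.5 * spectral_norm(Wi) / (d * spectral_norm(Ui))
    U.append([[s * a for a in row] for row in Ui])
W_pert = [[[a + b for a, b in zip(rw, ru)] for rw, ru in zip(Wi, Ui)]
          for Wi, Ui in zip(W, U)]

x = [rng.gauss(0.0, 1.0) for _ in range(h)]
nx = norm(x)
x = [B * a / nx for a in x]              # place x on the sphere ||x||_2 = B

lhs = norm([a - b for a, b in zip(forward(W_pert, x), forward(W, x))])
w_norms = [spectral_norm(Wi) for Wi in W]
u_norms = [spectral_norm(Ui) for Ui in U]
rhs = math.e * B * math.prod(w_norms) * sum(un / wn for un, wn in zip(u_norms, w_norms))
print("output change:", lhs, "Lemma 2 bound:", rhs)
```

The bound is a worst-case statement, so the observed change for a random input is expected to sit well below it.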

C RELATED WORK

Model extraction detection and prevention. Detection methods rely on the fact that many extraction attacks have querying patterns that are distinguishable from benign ones (Juuti et al., 2019; Atli et al., 2020; Zheng et al., 2022; Quiring et al., 2018). All of these can be circumvented by an adversary who has access to natural data from the same domain as the victim model (Atli et al., 2020). Prevention techniques aim to slow down the attack by injecting noise into the predictions, designed to corrupt the training of the stolen model (Orekondy et al., 2020; Lee et al., 2019; Mazeika et al., 2022), or by making all clients participate in consensus-based cryptographic protocols (Dziedzic et al., 2022). Even though they increase the cost of the attack, they do not stop a determined attacker from stealing the model.

Ownership verification. There exist many watermarking schemes for neural networks (e.g. (Zhang et al., 2018; Uchida et al., 2017; Adi et al., 2018)) that have the same goal as DI. However, they were shown to be brittle (Lukas et al., 2022). It was shown that adversarial examples can be used to fingerprint a model (Lukas et al., 2021) or to watermark the dataset (Sablayrolles et al., 2020). However, adversarial training can be used to weaken both schemes (Lukas et al., 2021; Szyller & Asokan, 2022). On the other hand, if a model is sufficiently vulnerable to membership inference attacks, membership inference can be used to fingerprint it (Pan et al., 2022).

D VERIFICATION WITH MORE NOISE

Table 4: Impact of the amount of noise (maximum number of perturbation steps) added during verification on the success of DI (baseline: 50 steps). Using more noise does not prevent FNs against $f_A$. However, it increases the standard deviation across all experiments and has a negative effect on the verification of $f_0$. We report the mean and standard deviation computed over five runs. Verification is done using k = 10 private samples. FNs are highlighted in red.

V, who suspects that A might be using adversarial training to avoid detection, can carry out the verification with more noise in order to escape the guarantees that adversarial training provides to A. In the experiments presented in Section 4, the average noise added during Blind Walk is 0.12 ± 0.05 (under $\ell_\infty$), and adversarial training is done with γ = 10/255 (≈ 0.039). In this experiment, we vary the maximum number of steps taken by V, and hence the maximum amount of noise added during the verification. We consider {25, 50, 100, 200} steps (baseline: 50 steps), which correspond to {0.10 ± 0.03, 0.12 ± 0.05, 0.33 ± 15, 0.38 ± 23} noise added (under $\ell_\infty$) during the verification. Since V does not know which $f_{SP}$ is indeed stolen, in addition to $f_A$, we also conduct this experiment for $f_0$ (for {50, 100} steps).

In Table 4 we provide the results for the experiments with different amounts of noise. Using more steps does not improve the result against $f_A$ compared to the baseline: 1) the standard deviation of the p-value increases; 2) we do not observe any linear relationship between the noise and $\Delta\mu$ or the associated p-value. On the other hand, the confidence of the verification of $f_0$ decreases: the standard deviations of $\Delta\mu$ and its associated p-value increase. Nevertheless, the p-value remains sufficiently high. In conclusion, increasing the amount of noise during Blind Walk does not allow V to circumvent A's adversarial training.
Hence, DI remains susceptible to false negatives induced by adversarial training.
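To make the role of the step cap concrete, here is a toy sketch of a Blind-Walk-style query. It is an illustration under loud assumptions rather than the official DI code: the linear classifier `predict`, the uniformly sampled step direction, and the step size of 0.01 are all hypothetical stand-ins. The only point it makes is that the cap on the number of steps bounds the $\ell_\infty$ noise that can be added to a sample.

```python
import random

def predict(w, x):
    """Toy linear classifier standing in for the suspect model (an assumption)."""
    return 1 if sum(a * b for a, b in zip(w, x)) >= 0 else -1

def blind_walk(w, x, y, direction, max_steps):
    """Step from x along a fixed direction until the predicted label changes.

    Returns the number of steps taken, capped at max_steps.
    """
    point = list(x)
    for k in range(1, max_steps + 1):
        point = [p + d for p, d in zip(point, direction)]
        if predict(w, point) != y:
            return k
    return max_steps

rng = random.Random(0)
w = [1.0, -2.0, 0.5]                  # hypothetical model weights
x = [3.0, -1.0, 2.0]                  # hypothetical private sample
y = predict(w, x)
step = 0.01                           # hypothetical per-step l_inf budget
direction = [rng.uniform(-step, step) for _ in x]

results = []
for cap in (25, 50, 100, 200):        # step caps considered in this appendix
    k = blind_walk(w, x, y, direction, cap)
    noise = k * max(abs(d) for d in direction)   # l_inf norm of the added noise
    results.append((cap, k, noise))
    print(cap, k, noise)
```

For a fixed direction, raising the cap can only keep or increase the number of steps taken, so the maximum added noise grows with the cap, which matches the trend of the noise levels reported above.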



We use the official implementation of DI, together with its architectures and training loops. Our changes are limited to the data splits and to adding adversarial training.



Figure 1: Probability of an FP as a function of the fraction of revealed private samples, with D = 10, for a linear suspect model (Equation 6). V needs to use many private samples to guarantee a low false positive rate.

Figure 2: Left to right: $f_V$, $f_I$, $f_0$. Comparison of the distributions of the confidence scores assigned to the embeddings by $g_V$. $\Delta\mu$ is smaller for $f_I$ than for $f_V$ but large enough to trigger an FP.

Figure 4: Left: comparison of the verification confidence for $f_V$ and $f_I$. The FP becomes stronger (lower p-value) as more samples are revealed. Right: the same comparison, with $f_0$ included to show the desirable behaviour of an independent model.

Summary of the notation used throughout this work.

If $f_{SP}$ has distinguishable decision boundaries for private and public samples, DI deems it to be stolen; otherwise the model is deemed independent.

$f_A$ adversarially trained on $S_V$ results in a false negative. We also report the accuracy of the models on the test set. We report the mean and standard deviation computed across five runs. Verification is done using k = 10 private samples.

A EXISTENCE OF FALSE POSITIVES IN DATASET INFERENCE

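Calculating the prediction margin. We assume that the model weights are initialized to zero. For each sample $x$ in a dataset $S \sim \mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1, \dots, m\}$, $y \sim \{-1, +1\}$. The learning algorithm observes all samples in $S$ once and maximizes the loss function $L(x, y) = y \cdot f(x)$. For the learning rate $\alpha = 1$, the weights are updated as

$$w_1 \leftarrow w_1 + y x_1, \qquad w_2 \leftarrow w_2 + y x_2.$$

Recall that $x = (x_1, x_2) \in \mathbb{R}^{K+D}$. The weights of the linear model are $w_1 = mu$ and $w_2 = \sum_{i=1}^{m} y^{(i)} x_2^{(i)}$ when the training is completed. Writing out the linear classifier explicitly, we can easily calculate the prediction margin of each sample $(x, y)$ in $S$,

$$y f(x) = y\,(w_1 \cdot x_1 + w_2 \cdot x_2) = c + y \sum_{i=1}^{m} y^{(i)} x_2^{(i)} \cdot x_2, \qquad c = m u^2.$$

The expectation of the prediction margin for the points in the training set $S^+ = \{(x, 1) \mid (x, 1) \in S\}$ is

$$\mathbb{E}_{S^+}[y f(x)] = c + \mathbb{E}_{S^+}\Big[\sum_{i=1}^{m} y^{(i)} x_2^{(i)} \cdot x_2\Big] = c + D\sigma^2. \tag{12}$$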
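The step from the prediction margin to Equation 12 can be spelled out for the data-dependent term $y \sum_{i} y^{(i)} x_2^{(i)} \cdot x_2$, the same term that appears in the proof of Lemma 1. The following is a sketch under stated assumptions: each $x_2^{(i)}$ is taken to be a $D$-dimensional vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ coordinates, the samples are independent, and the index $j$ for the training point being evaluated, as well as the coordinate index $k$, are introduced here only for illustration.

```latex
\begin{align*}
y^{(j)} \sum_{i=1}^{m} y^{(i)} x_2^{(i)} \cdot x_2^{(j)}
  &= \underbrace{\big(y^{(j)}\big)^2 \,\big\|x_2^{(j)}\big\|_2^2}_{i = j}
   \;+\; \sum_{i \neq j} y^{(j)} y^{(i)} \, x_2^{(i)} \cdot x_2^{(j)}, \\
\mathbb{E}\Big[\big\|x_2^{(j)}\big\|_2^2\Big]
  &= \sum_{k=1}^{D} \mathbb{E}\Big[\big(x_{2,k}^{(j)}\big)^2\Big] = D\sigma^2, \\
\mathbb{E}\Big[x_2^{(i)} \cdot x_2^{(j)}\Big]
  &= 0 \qquad (i \neq j).
\end{align*}
```

Since $(y^{(j)})^2 = 1$, a training point contributes $D\sigma^2$ in expectation through its own term in the sum. A fresh point that was not used in training has no such term, and only the zero-mean cross terms remain.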

