TOWARDS ROBUST MODEL WATERMARK VIA REDUCING PARAMETRIC VULNERABILITY Anonymous

Abstract

Deep neural networks are valuable assets, given their commercial value and the costly annotation and computation resources required to train them. To protect the copyright of these deep models, backdoor-based ownership verification has recently become popular: the model owner watermarks the model by embedding a specific behavior before releasing it. The defender (usually the model owner) can then identify whether a suspicious third-party model was "stolen" from it based on the presence of this behavior. Unfortunately, these watermarks have proven vulnerable to removal attacks as simple as fine-tuning. To further explore this vulnerability, we investigate the parametric space and find that many watermark-removed models exist in the vicinity of the watermarked one, and these may be easily exploited by removal attacks. Inspired by this finding, we propose a minimax formulation to find these watermark-removed models and recover their watermark behavior. Extensive experiments demonstrate that our method improves the robustness of model watermarking against parametric changes and numerous watermark-removal attacks.

1. INTRODUCTION

While deep neural networks (DNNs) achieve great success in many applications (Krizhevsky et al., 2012; Devlin et al., 2018; Jumper et al., 2021) and bring substantial commercial benefits (Kepuska & Bohouta, 2018; Chen et al., 2018; Grigorescu et al., 2020), training such a deep model usually requires a huge amount of well-annotated data, massive computational resources, and careful tuning of hyper-parameters. These trained models are valuable assets for their owners and might be "stolen" by an adversary, e.g., through unauthorized copying. We should properly protect trained DNNs during model buying/selling or limited open-sourcing (e.g., only for non-commercial purposes). To protect the intellectual property (IP) embodied inside DNNs, several watermarking methods have been proposed (Uchida et al., 2017; Fan et al., 2019; Lukas et al., 2020; Chen et al., 2022). Among them, backdoor-based ownership verification is one of the most popular (Gu et al., 2019; Adi et al., 2018; Zhang et al., 2018; Li et al., 2022). Before releasing the protected DNN, the defender (usually the model owner) embeds some distinctive behaviors, such as predicting a predefined label for any image containing "ICLR" (watermark samples), as shown in Figure 4. Based on the presence of these distinctive behaviors, the defender can determine whether a suspicious third-party DNN was "stolen" from the protected DNN. The more likely a DNN predicts watermark samples as the predefined target label (i.e., the higher its watermark success rate), the more suspicious it is of being an unauthorized copy of the protected model. However, backdoor-based watermarking is vulnerable to simple removal attacks (Liu et al., 2018; Shafieinejad et al., 2021; Lukas et al., 2021; Li et al., 2022). For example, watermark behaviors can be easily erased by fine-tuning with a medium learning rate like 0.01 (see Figure A17 in Zhao et al. (2020)).
To explore this vulnerability, considering that fine-tuning takes the watermarked model as the starting point and continues to update its parameters on some clean data, we investigate how the watermark success rate (WSR) and benign accuracy (BA) change in the vicinity of the watermarked model in the parametric space. For easier comparison, we use the relative distance ∥θ − θ_w∥_2 / ∥θ_w∥_2 in the parametric space, where θ_w is the original watermarked model and corresponds to the origin of the coordinate axes (the black circle). As shown in Figure 1, we find that fine-tuning on clean data (black circle → red star) moves the model by a relative distance of only 0.14 and successfully decreases the WSR to a low value while keeping a high BA. What is worse, we can easily find a model with close-to-zero WSR along the adversarial direction within a relative distance of only 0.03. This suggests that there exist many watermark-removed models, with low WSR and high BA, in the vicinity of the original watermarked model, which gives different watermark-removal attacks a chance to easily erase watermark behaviors while keeping the accuracy on clean data. To alleviate this problem, we focus on eliminating these watermark-removed models in the vicinity of the original watermarked model during training. Specifically, we propose a minimax formulation, in which the maximization finds one of these watermark-removed neighbors (i.e., the worst-case counterpart in terms of WSR) and the minimization recovers its watermark behavior. In particular, when combining our method with prevailing BatchNorm-based DNNs, we propose to use clean data to normalize the watermark samples within BatchNorm during training, to mitigate the domain shift between defenses and attacks. Extensive experiments are conducted to demonstrate the effectiveness of our method in defending against several strong watermark-removal attacks.
Our main contributions are summarized as follows:
• We demonstrate that there exist many watermark-removed models in the vicinity of the watermarked model in the parametric space, which may be easily exploited by fine-tuning and other removal methods.
• We propose a minimax formulation that finds these watermark-removed models in the vicinity and recovers their watermark behaviors, mitigating this vulnerability in the parametric space. It effectively improves the robustness of the watermark against removal attacks.
• We conduct extensive experiments against several state-of-the-art watermark-removal attacks to demonstrate the effectiveness of our method. In addition, we conduct exploratory experiments to take a closer look at the mechanism of our method.

2. RELATED WORKS

Model Watermark and Verification. Model watermarking is a common method for ownership verification to protect the intellectual property (IP) embodied inside DNNs. The defender (usually the model owner) first watermarks the model by embedding some distinctive behaviors into it during the training process. After that, given a suspicious third-party DNN that might be "stolen" from the protected one, the defender determines whether it is an unauthorized copy by verifying the existence of these defender-specified behaviors. In general, existing watermark techniques can be categorized into two main types, i.e., white-box and black-box watermarks, based on whether defenders can access the source files of suspicious models. Currently, most existing white-box methods (Uchida et al., 2017; Chen et al., 2019; Tartaglione et al., 2021) embed the watermark into specific weights, or into model activations (Darvish Rouhani et al., 2019). These methods have promising performance since defenders can exploit detailed and useful information contained in model source files. However, in practice defenders usually can only query the suspicious third-party model and obtain its predictions (through its API), where these white-box methods cannot be used. In contrast, black-box methods only require model predictions. Specifically, they make protected models produce distinctive predictions on some predefined samples while behaving normally on benign data. For example, Zhang et al. (2018); Adi et al. (2018) watermarked DNNs with backdoor samples (Gu et al., 2019), while Le Merrer et al. (2020); Lukas et al. (2020) exploited adversarial samples (Szegedy et al., 2013). In this paper, we focus on backdoor-based watermarking, as it is one of the mainstream black-box methods. Watermark-removal Attack. Currently, there are several watermark-removal attacks that counter model watermarking. According to Lukas et al.
(2021), existing removal attacks can be divided into three main categories: 1) input pre-processing, 2) model extraction, and 3) model modification. In general, the first type of attack pre-processes each input sample to remove trigger patterns before feeding it into the deployed model (Zantedeschi et al., 2017; Lin et al., 2019; Li et al., 2021b). Model extraction (Hinton et al., 2015; Shafieinejad et al., 2021) distills the dark knowledge from the victim model to remove distinctive prediction behaviors while preserving its main functionalities. Model modification (Liu et al., 2018; Zhao et al., 2020; Li et al., 2021a; Wu & Wang, 2021) changes model weights while preserving the model structure. In this paper, we mainly focus on model-modification-based removal attacks, since input pre-processing has minor benefits for countering backdoor-based watermarks (Lukas et al., 2021) and model extraction usually requires a large number of training samples that are inaccessible to adversaries in practice (Lukas et al., 2020). Robust Black-box Model Watermark. Currently, there are also a few robust black-box model watermarks that are resistant to watermark-removal attacks under certain conditions. Specifically, Li et al. (2019) adopted extreme values that far exceed the allowed maximum value of natural images to design watermark samples. However, this cannot be used under strict black-box scenarios where only valid inputs are accepted. Recently, Lukas et al. (2020) designed a robust black-box method requiring up to 36 models to generate watermark samples, which makes it very time-consuming. Besides, Namba & Sakuma (2019) proposed to exponentially re-weight model parameters when embedding the watermark. Most recently, Bansal et al. (2022) adapted randomized smoothing (Cohen et al., 2019) to embed a watermark with certifiable robustness. Both Namba & Sakuma (2019) and Bansal et al.
(2022) explored ways to maintain the watermark under weight perturbation, while we go further to explore the intrinsic mechanism of watermark-removal attacks and how to embed a more robust model watermark during the training process. 

3. THE PROPOSED METHOD

The training loss on the clean dataset is L(θ, D_c) = E_{(x,y)∼D_c}[ℓ(f_θ(x), y)], where ℓ(·, ·) is usually the cross-entropy loss. Embedding Model Watermark. Defenders can inject watermark behaviors during the training procedure, where they usually use a watermarked dataset D_w = {(x'_1, y'_1), ..., (x'_M, y'_M)} containing M pairs of watermark samples and their corresponding labels. For example, if we expect the model to always predict class "0" for any input with "ICLR", we add "ICLR" to a clean image x_i to obtain the watermark sample x'_i, and label it as class "0" (y'_i = 0). If we achieve close-to-zero loss on the watermarked dataset D_w, the DNN successfully learns the connection between watermark samples and the target label. Thus, the training procedure with watermark embedding attempts to find the model parameters that minimize the training loss on both the clean training dataset D_c and the watermarked dataset D_w, as follows:

L(θ, D_c) + α · L(θ, D_w) = E_{(x,y)∼D_c}[ℓ(f_θ(x), y)] + α · E_{(x',y')∼D_w}[ℓ(f_θ(x'), y')].   (2)

Algorithm 1: APP-based watermarked model training
Input: clean dataset D_c, watermarked dataset D_w, learning rate η, perturbation budget ε, watermark loss coefficient α
repeat
3:  Sample mini-batch B_c = {(x_1, y_1), ..., (x_n, y_n)} from D_c
4:  g ← ∇_θ L(θ, B_c)
5:  Sample mini-batch B_w = {(x'_1, y'_1), ..., (x'_m, y'_m)} from D_w
6:  δ ← ε · (∇_θ L(θ, B_w; B_c) / ‖∇_θ L(θ, B_w; B_c)‖) · ‖θ‖_2
7:  g ← g + ∇_θ[α L(θ + δ, B_w; B_c)]   // L(·; B_c) denotes that clean samples are used in the estimation of BN statistics (i.e., c-BN)
8:  θ ← θ − η g
9:  until training converged
Output: Watermarked network f_θ(·)
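The combined objective in Eq. (2) can be sketched with a minimal numpy example; here a linear softmax classifier stands in for a DNN, and the function names are ours, not from the paper.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def watermark_training_loss(theta, X_c, y_c, X_w, y_w, alpha=0.01):
    """Eq. (2): clean loss plus the alpha-weighted loss on watermark samples,
    here for a linear classifier f_theta(x) = x @ theta."""
    loss_clean = cross_entropy(X_c @ theta, y_c)   # L(theta, D_c)
    loss_wm = cross_entropy(X_w @ theta, y_w)      # L(theta, D_w)
    return loss_clean + alpha * loss_wm
```

Minimizing this sum drives the model toward both high benign accuracy and the defender-specified behavior on watermark samples.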

3.2. ADVERSARIAL PARAMETRIC PERTURBATION

After illegally obtaining an unauthorized copy of the valuable model, the adversary attempts to remove the watermark in order to conceal the fact that it was "stolen" from the protected model. For example, the adversary starts from the original watermarked model f_{θ_w}(·) and continues to update its parameters using clean data. If there exist many models f_θ(·), θ ≠ θ_w, with a low WSR and high BA in the vicinity of the watermarked model, as shown in Figure 1, the adversary could easily find one of them and escape the watermark detection of the defender. To avoid this situation, the defender must make the watermark resistant to multiple removal attacks during training. Specifically, one necessary condition for robust watermarking is to eliminate these potential watermark-removed neighbors in the vicinity of the original watermarked model. Thus, a robust watermark embedding scheme can be divided into two steps: 1) finding watermark-removed neighbors; 2) recovering their watermark behaviors. Maximization to Find the Watermark-erased Counterparts. Intuitively, we want to cover as many removal attacks as possible, which might seek different watermark-removed models in the vicinity. Thus, we consider the worst case (the model with the lowest WSR) within a specific range. Given a feasible perturbation region B ≜ {δ : ‖δ‖_2 ≤ ε‖θ‖_2}, where ε > 0 is a given perturbation budget, we attempt to find an adversarial parametric perturbation δ ← arg max_{δ∈B} L(θ + δ, D_w). In general, δ is the worst-case weight perturbation that can be added to the watermarked model to generate a perturbed version f_{θ+δ}(·) with a low watermark success rate. Minimization to Recover the Watermark Behaviors. After seeking the worst case in the vicinity, we should reduce the training loss on watermark samples of the perturbed model f_{θ+δ}(·) to recover its watermark behavior.
Meanwhile, we always expect the model f_θ(·) to have a low training loss on the clean training data to maintain satisfactory utility. Therefore, the training with watermark embedding is formulated as: min_θ [ L(θ, D_c) + α · max_{δ∈B} L(θ + δ, D_w) ]. The Perturbation Generation. However, considering that the loss surface of a DNN is severely non-convex, it is intractable to solve the inner maximization exactly. Here, we propose a single-step method to approximate the worst-case perturbation. Besides, the appropriate perturbation magnitude varies across architectures. To address this, we restrict the perturbation magnitude to a relative size compared to the norm of the model parameters. In summary, the parametric perturbation is calculated as: δ ← ε‖θ‖_2 · ∇_θ L(θ, D_w) / ‖∇_θ L(θ, D_w)‖_2, where ε is a hyper-parameter controlling the relative perturbation magnitude. The Adversarial Parametric Perturbation (APP) plays a key role in our watermark embedding scheme, and we term our algorithm APP-based watermarked model training. The pseudo-code can be found in Algorithm 1. Specifically, we calculate the gradient on clean training data as in normal training in Line 4. In Line 6, we calculate the APP and normalize it by the norm of the model parameters. Based on the APP, we calculate the gradient of the perturbed model on the watermarked data and add it to the gradient from clean data in Line 7. We update the model parameters in Line 8, and repeat the above steps until training converges. In practical experiments, we find our proposed algorithm does not perform consistently well (see Table 2) and sometimes performs worse than the baseline. We conjecture this is caused by a domain shift between the defense and the attacks.
In particular, when computing the adversarial perturbation and recovering the watermark behavior, we only feed watermark samples into the DNN, so all inputs of each layer are normalized by statistics from watermark samples (see Lines 6-7 in Algorithm 1). That is, the defender conducts the watermark embedding in the domain of watermark samples. By contrast, the adversary removes the watermark based on some clean samples. A similar domain-shift problem is also observed in domain adaptation (Li et al., 2016).
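The single-step perturbation from Section 3.2, δ = ε‖θ‖₂ · ∇_θL(θ, D_w)/‖∇_θL(θ, D_w)‖₂, can be illustrated with a small numpy sketch (the function name and the zero-gradient guard are ours; a real implementation would apply this per parameter tensor of a DNN):

```python
import numpy as np

def app_perturbation(grad_w, theta, eps=0.02):
    """Single-step adversarial parametric perturbation:
    delta = eps * ||theta||_2 * grad_w / ||grad_w||_2,
    so the perturbation size is relative to the parameter norm."""
    grad_norm = np.linalg.norm(grad_w) + 1e-12  # guard against a zero gradient
    return eps * np.linalg.norm(theta) * grad_w / grad_norm
```

The perturbed parameters θ + δ are then used to compute the watermark loss whose gradient is added to the clean-data gradient, as in Algorithm 1.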

3.3. ESTIMATING BATCHNORM STATISTICS ON CLEAN INPUTS

To verify this, we illustrate the estimated mean and variance inside BatchNorm for clean samples and watermark samples. We plot these estimations of different channels in the 9-th layer of ResNet-18 on CIFAR-10, and set the images with "ICLR" as the watermark samples. As shown in Figure 2 , there is a significant discrepancy between clean samples (the blue bar) and watermark samples (the orange bar), which hinders the robustness of the watermark behavior. 
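The c-BN idea, normalizing watermark-batch activations with statistics estimated from clean samples, can be sketched as follows (a simplified per-channel version without learned affine parameters; the function name is ours):

```python
import numpy as np

def c_bn(x_w, x_c, gamma=1.0, beta=0.0, eps=1e-5):
    """Clean-sample BatchNorm (c-BN): normalize the watermark-batch
    activations x_w using the mean/variance estimated from the clean
    batch x_c, instead of from x_w itself."""
    mu = x_c.mean(axis=0)
    var = x_c.var(axis=0)
    return gamma * (x_w - mu) / np.sqrt(var + eps) + beta
```

Because the normalization statistics come from clean samples, the defender's embedding step sees the same per-layer input distribution that the adversary's clean-data removal attack will see.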

4. EXPERIMENTS

In this section, we conduct comprehensive experiments to evaluate the effectiveness of our proposed method, including a comparison with other watermark embedding schemes, ablation studies, and some exploratory experiments to have a closer look at our APP-based watermarked model training.

4.1. EXPERIMENT SETTINGS

Settings for Watermarked DNNs. We conduct experiments on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). Similar to Zhang et al. (2018), we consider three types of watermark samples: 1) Content: adding extra meaningful content to normal images ("ICLR" in our experiments); 2) Noise: adding meaningless randomly-generated noise to normal images; 3) Unrelated: using images from an unrelated domain (SVHN (Netzer et al., 2011) in our experiments). Figure 4 visualizes some samples of the different watermark types. To train watermarked DNNs, we use our method and several state-of-the-art baselines: 1) vanilla watermark training; 2) the exponentialized weight (EW) method (Namba & Sakuma, 2019); 3) the empirical verification method from certified watermarking (CW) (Bansal et al., 2022). We set '0' as the target label, i.e., the watermarked DNN always predicts watermark samples as class "airplane" on CIFAR-10 and as "beaver" on CIFAR-100. Specifically, we use 80% of the original training data to train the watermarked DNNs and reserve the remaining 20% for potential watermark-removal attacks. Before training, we modify or replace 1% of the training data to create watermark samples. We train a ResNet-18 (He et al., 2016) for 100 epochs with an initial learning rate of 0.1 and a weight decay of 5 × 10^-4. The learning rate is multiplied by 0.1 at the 50th and 75th epochs. For our APP method, we set the maximum perturbation size ε = 0.02 and the coefficient for the watermark loss α = 0.01. Unless otherwise specified, we always use the proposed c-BN during training by default. Settings for Removal Attacks.
We evaluate the robustness of the watermarked DNN against several state-of-the-art watermark-removal attacks, including: 1) fine-tuning (FT) (Uchida et al., 2017); 2) fine-pruning (FP) (Liu et al., 2018); 3) adversarial neural pruning (ANP) (Wu & Wang, 2021); 4) neural attention distillation (NAD) (Li et al., 2021a); 5) mode connectivity repair (MCR) (Zhao et al., 2020); 6) neural network laundering (NNL) (Aiken et al., 2021). In particular, we use a strong fine-tuning strategy to remove the watermark: we fine-tune watermarked models for 30 epochs using the SGD optimizer with an initial learning rate of 0.05 and a momentum of 0.9, and the learning rate is multiplied by 0.5 every 5 epochs. The relatively large initial learning rate provides larger parametric perturbations at the beginning, and the decayed learning rate helps the model converge better. More details about FT and the other methods can be found in Appendix B.4. Evaluation Metrics. We report performance mainly on two metrics: 1) the watermark success rate (WSR) on watermark samples, i.e., the ratio of watermark samples that are classified as the target label by the watermarked DNN; 2) the benign accuracy (BA) on clean test data. For a fair comparison, we exclude samples whose ground-truth labels already belong to the target class when evaluating the WSR. An ideal watermark embedding method therefore produces a model with high WSR and high BA, and keeps the WSR high after watermark-removal attacks.
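The WSR metric, with target-class samples excluded as described above, can be computed as in this short sketch (function name ours):

```python
import numpy as np

def watermark_success_rate(preds, true_labels, target):
    """WSR: fraction of watermark samples predicted as the target label,
    excluding samples whose ground-truth class already equals the target."""
    keep = true_labels != target
    if keep.sum() == 0:
        return 0.0
    return float((preds[keep] == target).mean())
```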

4.2. MAIN RESULTS

To verify the effectiveness of our proposed method, we compare its robustness against several watermark-removal attacks with that of 3 existing watermarking methods. All experiments are repeated over 3 runs with different random seeds. Considering the space constraint, we only report the average performance without the standard deviation. As shown in Table 1, our APP-based method successfully embeds the watermark behavior inside DNNs, achieving almost 100% WSR with a negligible BA drop (< 0.25%). Under watermark-removal attacks, our method consistently improves the remaining WSR and achieves the highest robustness in 17 of the 18 cases. In particular, with unrelated-domain inputs as the watermark samples, the average WSR of our method is reduced by only 6.20% under all removal attacks, while other methods suffer at least a 50.90% drop in WSR. We find that, although NNL is the strongest removal attack (all WSRs decrease below 27%) when watermark samples are images superimposed with some content or noise, it performs unsatisfactorily against unrelated-domain inputs as watermark samples. Note that the defender usually embeds the watermark before releasing the model and can choose any type of watermark sample. Therefore, with our proposed APP method, the defender is always able to painlessly embed robust watermarks into DNNs and defend against state-of-the-art removal attacks (sacrificing less than 6.2% of WSR after attacks). We have similar findings on CIFAR-100; the experimental results can be found in Appendix B.6.

4.3. ABLATION STUDIES

Here, we conduct several experiments to show the effects of the different parts of our method, including its components, varying perturbation magnitudes, and various target classes. In the following experiments, we always take the images containing meaningful content as the watermark samples by default unless otherwise specified. Effect of Different Components. Our method consists of two parts, i.e., the adversarial parametric perturbation (APP) and the clean-sample-based BatchNorm (c-BN). We evaluate the contribution of each component by training a watermarked DNN without APP and c-BN (i.e., the Vanilla method in our baselines), an APP-based DNN without c-BN, and an APP-based DNN with c-BN (our method), and evaluating their performance before and after the removal attacks. In Table 2, with APP alone, we already improve the average performance compared to the baseline (it reduces the average WSR drop from 65.81% to 37.23%). Unfortunately, it performs inconsistently and even obtains worse performance under the FP and NNL attacks. After combining with c-BN, our proposed APP improves the robustness further: it reduces the average WSR drop to 20.31% and performs better than the baseline in all cases. In conclusion, both are essential components and contribute to the robustness against watermark-removal attacks.
[Figure 5 caption fragment: the dashed line with the same color shows the performance when ε = 0; Left: before attacks, Right: after attacks.]
[Figure 6 caption: the results of our method and other baselines with various architectures against the FT attack. Our method consistently improves watermark robustness.]
Effect of Varying Perturbation Magnitude. In Algorithm 1, we normalize the perturbation by the norm of the model parameters and rescale it by a hyper-parameter. Here, we explore the effect of this relative perturbation magnitude hyper-parameter ε.
We illustrate the performance of the watermarked DNNs before and after removal attacks in Figure 5, and find that, within the region ε ≤ 4.0 × 10^-2, our method never brings an obvious accuracy drop while significantly improving the robustness after attacks, which indicates that our method achieves consistent performance over a large hyper-parameter range. Besides, we find the selection of the hyper-parameter ε is related more to the watermark embedding method itself than to the removal attacks (we observe similar trends against FT and NAD). This makes the selection of ε quite straightforward and gives simple guidance for tuning it in practical scenarios: even knowing nothing about the potential attack (suppose the adversary applies NAD), the defender can tune the hyper-parameter against the FT attack, and the resulting model also achieves satisfactory results against NAD. Detailed results against other attacks can be found in Appendix C.1. Effect of Various Target Classes. Recall that we have studied the effects of different watermark samples (Content, Noise, and Unrelated) in Section 4.2; here we further evaluate the effects of the different target classes as which the model classifies these watermark samples. We set the target class to 1, 2, 3, and 4, respectively, and obtain an average WSR of 85.69%, 72.99%, 85.72%, and 82.74% under all removal attacks, while the vanilla method only achieves 30.18%, 10.90%, 30.16%, and 18.06% (details can be found in Appendix C.2). This indicates our method consistently improves the robustness across various watermark samples and target classes. Effect of Different Architectures. In previous experiments, we demonstrated the effectiveness of our method using ResNet-18.
Here, we explore the effect of model architectures of different sizes, including MobileNetV2 (Sandler et al., 2018) (a tiny model), VGG16 (Simonyan & Zisserman, 2014), ResNet-18, and ResNet-50 (He et al., 2016) (a relatively large model), with the same hyper-parameters (especially ε). Generally, our method always achieves the highest (the height of the bars) and the most stable (the length of the lines) performance across architectures.

4.4. A CLOSER LOOK AT APP

In this section, we conduct more experiments to investigate the latent mechanism of APP, including the landscape of the watermarked model in the parametric space and the distribution of clean and watermark samples in the feature space. The Parametric Space. Visualizing the vicinity of our model, we find the APP-based watermarked model is able to keep the WSR high within a larger range compared to the vanilla one (which can be seen in Figure 1). In particular, our model is much more robust against parametric perturbations along the adversarial direction, which makes it harder for the adversary to find watermark-removed models in the vicinity of the protected model. The Feature Space. To dive into APP, we also visualize the hidden representations of clean samples and watermark samples using t-SNE (Van der Maaten & Hinton, 2008) for the different watermark embedding schemes. As shown in Figure 8, in the feature space of our model, the cluster of watermark samples is not only close to the cluster of the target class, but also covers a larger region of the feature space. This may explain why our method is more robust: moving all these watermark samples back to their original clusters takes more effort. Implementation details and more results can be found in Appendix D.

5. CONCLUSION

In this paper, we investigated the parametric space and found that there exist many watermark-removed models in the vicinity of the watermarked model, which may be easily exploited by removal attacks. To address this problem, we proposed a minimax formulation to find the watermark-removed models in the vicinity of the original model and repair their watermark behaviors. Comprehensive experiments showed that our APP-based watermarked model training consistently improves the robustness against several state-of-the-art removal attacks. We hope our method can help model owners better protect their intellectual property, thus facilitating DNN sharing and trading.

A DETAILS ABOUT VICINITY VISUALIZATION

To visualize the vicinity, we measure the watermark success rate (WSR) and benign accuracy (BA) on the plane spanned by the two directions d_adv and d_FT. Specifically, d_adv is the direction that erases the watermark, i.e., d_adv = ∇_θ L(θ, D_w), and d_FT is the direction from the original watermarked model θ_w to a fine-tuned model θ_FT, i.e., d_FT = θ_FT − θ_w. We fine-tune the original model θ_w for 40 iterations with the SGD optimizer using a learning rate of 0.05 to obtain θ_FT. We explore the vicinity by moving the original parameters along these two directions, recording the WSR and BA of each neighboring model. For easier comparison, we use the relative distance in the parametric space, i.e., θ = θ_w + α · (d_adv / ‖d_adv‖) · ‖θ_w‖ + β · (d_FT / ‖d_FT‖) · ‖θ_w‖, where (α, β) are the given coordinates. After obtaining a parameter θ in the vicinity, we further adjust BatchNorm by re-calculating the statistics on the clean dataset to restore benign accuracy. Finally, we evaluate this neighboring model and record its benign accuracy and watermark success rate.
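The parametric interpolation above can be sketched as follows (function name ours), so that a coordinate pair (α, β) with β = 0 corresponds exactly to a relative distance of |α| from θ_w:

```python
import numpy as np

def vicinity_point(theta_w, d_adv, d_ft, alpha, beta):
    """Neighbor of the watermarked model on the plane spanned by the
    adversarial direction d_adv and the fine-tuning direction d_ft:
    theta = theta_w + alpha * (d_adv/||d_adv||) * ||theta_w||
                    + beta  * (d_ft /||d_ft|| ) * ||theta_w||."""
    scale = np.linalg.norm(theta_w)
    u = d_adv / np.linalg.norm(d_adv)
    v = d_ft / np.linalg.norm(d_ft)
    return theta_w + alpha * scale * u + beta * scale * v
```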

B DETAILS ABOUT MAIN EXPERIMENTS

In this section, we first briefly introduce our baseline methods, then provide the detailed settings for our main experiments. We report the full results on CIFAR-10 and CIFAR-100 at the end.

B.1 MORE DETAILS ABOUT BASELINE METHODS

Vanilla model watermarking (Zhang et al., 2018) mixed the watermark samples with the clean samples and trained the model on the combined data. EW (Namba & Sakuma, 2019) trained the model with the exponentially reweighted parameters EW(θ, T) rather than the vanilla weights θ. The i-th element of the l-th parameter tensor θ^l is reweighted as EW(θ^l, T) = θ^l_exp, where θ^l_exp,i = [exp(|θ^l_i| T) / max_i exp(|θ^l_i| T)] · θ^l_i, and T is a hyper-parameter adjusting the intensity of the reweighting. As shown in the equation above, a weight element with a large absolute value remains almost unchanged after the reweighting, while one with a small absolute value decreases to nearly zero. This encourages the neural network to lean on the weights with large absolute values to make decisions, hence making the prediction less sensitive to small weight changes. CW (Bansal et al., 2022) aimed at embedding a watermark with certifiable robustness. They adopted the theory of randomized smoothing (Cohen et al., 2019) and watermarked the network using a gradient estimated with randomly perturbed weights. The gradient on the watermark batch B is calculated by g_θ = (1/k) Σ_{i=1}^{k} E_{G∼N(0, (iσ/k)^2 I)} E_{(x,y)∈B} [∇ℓ(x, y; θ + G)], where σ is the noise strength.
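The EW reweighting can be sketched in a few lines (function name ours; we subtract the maximum before exponentiating, which is mathematically equivalent to the expression above and numerically stable for large T):

```python
import numpy as np

def exponential_reweight(theta_l, T):
    """EW(theta^l, T): scale each element theta_i by
    exp(|theta_i| * T) / max_j exp(|theta_j| * T)."""
    a = np.abs(theta_l)
    w = np.exp((a - a.max()) * T)  # equals exp(a*T) / max_j exp(a_j*T)
    return w * theta_l
```

Elements with the largest magnitude are preserved (their scale factor is 1), while small-magnitude elements are driven toward zero, as the text describes.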

B.2 MORE DETAILS ABOUT WATERMARK-REMOVAL ATTACKS

FT (Uchida et al., 2017) removed the watermark by updating model parameters on additional held-out clean data. FP (Liu et al., 2018) presumed that watermarked neurons are less activated by clean data, and thus pruned the least-activated neurons in the last layer before the fully-connected layers. They further fine-tuned the pruned model to restore benign accuracy and suppress watermarked neurons. ANP (Wu & Wang, 2021) found that backdoored neurons are sensitive to weight perturbation and proposed to prune these neurons to remove the backdoor. NAD (Li et al., 2021a) utilized knowledge from a fine-tuned model, in which the watermark is partially removed, to guide the watermark unlearning. MCR (Zhao et al., 2020) found that there exists a high-accuracy pathway connecting two backdoored models in the parametric space, and that the interpolated models along the path usually do not contain backdoors; this property allows MCR to be applied to watermark removal. NNL (Aiken et al., 2021) first reconstructed the trigger using Neural Cleanse (Wang et al., 2019), then reset neurons that behave differently on clean data and on the reconstructed trigger data, and further fine-tuned the model to restore benign accuracy and suppress watermarked neurons.

B.3 MORE DETAILS ABOUT WATERMARK SETTINGS

Settings for EW. As suggested in the original paper (Namba & Sakuma, 2019), we fine-tune a pre-trained model to embed the watermark. We pre-train the model on the original dataset without injecting watermark samples, for 100 epochs using the SGD optimizer with an initial learning rate of 0.1; the learning rate decays by a factor of 10 at the 50th and 75th epochs. We then fine-tune the pre-trained model for 20 epochs to embed the watermark, with an initial learning rate of 0.1 that is dropped by a factor of 10 at the 10th and 15th epochs. Settings for CW. For a fair comparison, we adopt a learning rate schedule and a weight-decay factor identical to the other methods. Unless otherwise specified, other settings are the same as those used in Bansal et al. (2022). Settings for Our Method. For the classification loss term, we calculate the loss using a batch of 128 clean samples, while for the watermark term, we use a batch of 64 clean samples and 64 watermark samples to estimate the adversarial gradients.

B.4 MORE DETAILS ABOUT WATERMARK-REMOVAL SETTINGS

Settings for FT. We fine-tune the watermarked model for 30 epochs using the SGD optimizer with an initial learning rate of 0.05 and a momentum of 0.9; the learning rate is multiplied by 0.5 every five epochs.

Settings for FP. We prune 90% of the least-activated neurons in the last layer before the fully-connected layers; after pruning, we fine-tune the pruned model using the same training scheme as FT.

Settings for ANP. We set the pruning rate to 0.6, where all defenses share a similar BA, as shown in Figure 10.

Settings for NAD. The original NAD only experimented on WideResNet models. In our work, we compute the NAD loss over the outputs of the four main layers of ResNet, with all βs set to 1500. To obtain better watermark-removal performance, we use an initial learning rate of 0.02, which is larger than the 0.01 used in the original paper (Li et al., 2021a).

Settings for MCR. MCR finds a backdoor-erased model on the path connecting two backdoored models, but in our setting only one watermarked model is available. Hence the attacker must obtain the second model by fine-tuning the original watermarked model, and then perform MCR using the original and fine-tuned models. We split the attacker's dataset into two equal halves: one is used to fine-tune the model and the other to train the curve connecting the original and fine-tuned models. This fine-tuning is performed for 50 epochs with an initial learning rate of 0.05, which decays by a factor of 0.1 every 10 epochs. For the MCR results, t = 0 denotes the original model and t = 1 denotes the fine-tuned model. We report results with t = 0.9, where all defenses share a similar BA; see Figure 9.

Settings for NNL. We reconstruct the trigger using Neural Cleanse (Wang et al., 2019) for 15 epochs and reset the neurons that behave significantly differently under clean inputs and reconstructed-trigger inputs. We then fine-tune the model for 15 epochs with the SGD optimizer; the initial learning rate is 0.02 and is divided by 10 at the 10th epoch.
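As a rough illustration of the FP pruning step (names and shapes here are our own simplifications, not the attack's actual implementation), one can rank channels by their mean activation on clean data and zero out the least-activated fraction:

```python
import numpy as np

def prune_least_activated(weights, activations, prune_rate=0.9):
    """Zero out the channels that clean data activates least.

    weights:     array of shape (C_out, ...) -- one filter per output channel.
    activations: array of shape (N, C_out)  -- per-sample mean channel
                 activations measured on clean inputs.
    Returns a pruned copy of `weights`.
    """
    mean_act = activations.mean(axis=0)          # average activation per channel
    n_prune = int(prune_rate * len(mean_act))
    prune_idx = np.argsort(mean_act)[:n_prune]   # least-activated channels first
    pruned = weights.copy()
    pruned[prune_idx] = 0.0                      # disable those channels
    return pruned
```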

B.5 DETAILED RESULTS ON CIFAR-10

The detailed results on CIFAR-10 are shown in Table 3. Moreover, we can observe from Figure 9 and Figure 10 that, in terms of robustness, our method outperforms the other methods regardless of the threshold value used in MCR and ANP.

C.1 RESULTS ON CIFAR-100

To verify that our method applies to other datasets, we experiment on CIFAR-100; the results are shown below.

Modification to Attack Settings. As trigger reconstruction needs to scan 100 classes on CIFAR-100, we reduce the NC reconstruction epochs from 15 to 5 to speed it up. The ANP pruning threshold is set to 0.5 in the CIFAR-100 experiments to maintain benign accuracy.

Results. As shown in Table 4, similar to the previous results on CIFAR-10, our method generally achieves better watermark robustness than the other methods, with the exception that, for the Noise watermark, all watermark embedding schemes fail to protect the watermark against the FP and NNL attacks. Moreover, we can observe from Figure 11 and Figure 12 that, in terms of robustness, our models still outperform the other methods regardless of the threshold value for ANP and MCR.

We visualize some results of the Content watermark embedded with different perturbation magnitudes ϵ in Sec 4.3. Here, we provide more detailed results in numeric form in Table 5. Generally speaking, our method consistently improves the robustness of the watermark, with a watermark success rate higher than the other methods throughout all tested ϵ. Moreover, the amount of improvement against all evaluated attacks shows similar trends, and this consistent robustness improvement eases the selection of the perturbation magnitude ϵ. We also notice that the most robust watermark is obtained with ϵ = 4 × 10^-2 rather than the default setting ϵ = 2 × 10^-2, indicating that an ϵ specifically selected for the chosen watermark type may further improve robustness.

C.2 RESULTS WITH OTHER TARGET CLASSES

To demonstrate that our method applies to different target classes, we experimented with Content and set the target class y_t ∈ {1, 2, 3, 4}. Similar to the default scenario where y_t = 0, these four tests maintain average watermark success rates of 85.69%, 72.99%, 85.72%, and 82.74%, respectively, under all six removal attacks, while the standard baseline only achieves 30.18%, 10.90%, 30.16%, and 18.06% against the same attacks, indicating that our method achieves a stable robustness improvement regardless of the chosen target class (as shown in Tables 6 and 7).

D VISUALIZING THE FEATURE SPACE

To provide further understanding of the effectiveness of our method, we visualize how the hidden representations evolve along the adversarial direction and during the process of fine-tuning via t-SNE (Van der Maaten & Hinton, 2008).

Figure 13: t-SNE visualization of the vanilla watermarked model along the adversarial direction.

D.1 FEATURES ALONG WITH THE ADVERSARIAL DIRECTION

To show how the hidden representations evolve along the adversarial direction, we add a small adversarial perturbation to the watermarked model, with the perturbation magnitude growing by 2 × 10^-3 every step. As can be seen in Figures 13-15, the representations of watermark samples quickly mix with the clean representations under small perturbations. In contrast, our method manages to keep the watermark samples in a distinct cluster that remains distant from the untargeted clusters, as shown in Figure 16.
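A minimal sketch of this probing procedure, assuming the model parameters are flattened into a single vector and the adversarial direction is precomputed (function and variable names are ours, not the paper's):

```python
import numpy as np

def walk_along_direction(theta, d_adv, n_steps, step_size=2e-3):
    """Yield perturbed parameter vectors theta + i * step_size * d_adv / ||d_adv||,
    i.e. points along a unit adversarial direction with growing magnitude.
    Each point would then be loaded back into the model for t-SNE probing."""
    d_unit = d_adv / np.linalg.norm(d_adv)
    return [theta + (i * step_size) * d_unit for i in range(1, n_steps + 1)]
```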

D.2 FEATURE EVOLUTION DURING THE PROCESS OF FINE-TUNING

We also investigate how the hidden representations evolve during the early stage of fine-tuning. We fine-tune the watermarked models for 200 iterations using the SGD optimizer with a learning rate of 0.05 and visualize the representations via t-SNE every 50 iterations. As can be seen in Figures 17-19, the representations of watermark samples quickly mix with the clean representations in the early phase of fine-tuning, with the watermark success rate decreasing accordingly. In contrast, our method manages to keep the watermark samples in a distinct cluster that stays distant from the untargeted clusters throughout the fine-tuning process, as shown in Figure 20.

E ADDITIONAL RESULTS OF OTHER BASELINE DEFENSES

In our main experiments, we only compared our method with two SOTA methods (i.e., Namba & Sakuma (2019) and Bansal et al. (2022)) out of the four methods mentioned in Section 2, because these two methods share a threat model similar to ours. In this section, we provide additional results for the two remaining baseline defenses. The first one (Li et al., 2019) can be easily circumvented by clipping image pixels to the normal range (i.e., [0,1]) or by refusing to predict on such abnormal samples. Moreover, this defense is only feasible for models without normalization layers (e.g., batch normalization (Ioffe & Szegedy, 2015)): as shown in Table 8, training VGG-16 with batch normalization using this method leads to very low benign accuracy on CIFAR-10. The second one is CAE (Lukas et al., 2020); we compare CAE with our method using content-type watermark samples. As shown in Table 9, our method is still (significantly) better than CAE in most cases (5 out of 6). These results verify the effectiveness of our method again.

F ADDITIONAL RESULTS OF OTHER POTENTIAL BASELINES

Recall that our method exploits a min-max formulation with respect to model parameters. One may wonder whether it would be better to use random parametric perturbations instead of adversarial ones or use standard adversarial training in the input space. In this section, we use the content-based watermark on CIFAR-10 as an example for our discussions.

F.1 USING RANDOM INSTEAD OF ADVERSARIAL PARAMETRIC PERTURBATION

To explore whether using a random parametric perturbation (RPP) is better than our defense, we replace the adversarial parametric perturbation with a random one of the same magnitude in the minimization w.r.t. the model parameters θ. As shown in Table 10, although RPP achieves some improvement over the model trained without any defense (i.e., Vanilla), our method is still significantly better in almost all cases. These results verify the effectiveness of our method again.
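A minimal sketch of the RPP variant (the function name is ours): instead of solving the inner maximization, draw a random direction on the ϵ-sphere and perturb the parameters with it before the watermark-recovery update.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_parametric_perturbation(theta, epsilon):
    """Return theta perturbed by uniform-random noise of magnitude epsilon,
    used as a drop-in replacement for the adversarial perturbation."""
    noise = rng.standard_normal(theta.shape)
    noise *= epsilon / np.linalg.norm(noise)  # scale to magnitude epsilon
    return theta + noise
```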

F.2 USING STANDARD ADVERSARIAL TRAINING IN THE INPUT SPACE

To explore whether traditional adversarial training (AT) in the input space is better than our method, we conduct additional experiments performing traditional AT. In particular, we apply AT only to the watermark samples, rather than to all samples, to preserve high benign accuracy. As shown in Table 10, our method is still significantly better than AT, although AT yields mild improvements over training with no defense (i.e., Vanilla) in some cases.

In addition, to further verify that our method remains effective on simpler model architectures, we conduct additional experiments on CIFAR-10 with MobileNetV2. MobileNetV2 has 2.2M trainable parameters, significantly fewer than the 11.2M parameters of the ResNet18 used in our main experiments. As shown in Table 11, our method is still better than all baseline methods in this case, with an average WSR drop of 48.21%, whereas all baseline defenses suffer average WSR decreases of at least 71.63%. These results verify the effectiveness of our method again.
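To make the AT baseline concrete, here is a hedged single-step (FGSM-style) sketch for a toy linear scorer; the paper does not specify its AT setup at this level of detail, so the model, loss, and names below are purely illustrative.

```python
import numpy as np

def fgsm_step(W, x, y_target, epsilon):
    """One FGSM step on a watermark sample for a toy linear scorer f(x) = W @ x.
    We treat the loss as the negated target-class score, L = -W[y_target] @ x,
    so its input gradient is -W[y_target]; stepping along sign(grad) lowers the
    target-class score, i.e. produces an adversarial version of the sample."""
    grad = -W[y_target]                 # dL/dx for the toy loss above
    return x + epsilon * np.sign(grad)  # perturbed watermark sample
```

In AT restricted to watermark samples, each watermark batch would be replaced by such perturbed versions before the training update, while clean batches stay untouched.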



Footnotes. (1) People are allowed to buy and sell pre-trained models on platforms like the AWS Marketplace or BigML. (2) While many watermark methods were believed to be resistant to fine-tuning, they were only tested with small learning rates; for example, Bansal et al. (2022) only used a learning rate of 0.001 or even 0.0001. (3) There is also a certified verification in Bansal et al. (2022), which requires full access to the parameters of the suspicious model; it is out of our scope, and we only consider its empirical verification via API. (4) This is because NNL first reconstructs the watermark trigger (e.g., the content "ICLR" on watermark samples) and then removes the watermark behaviors. By contrast, when we use unrelated-domain inputs as watermark samples, there is no trigger pattern, leading to the failure of NNL.



Figure 1: The performance of models in the vicinity of the watermarked model in the parametric space. d_FT denotes the direction of fine-tuning and d_adv denotes the adversarial direction. Black dot: the original watermarked model; red star: the model after fine-tuning.

Figure 2: The distribution for clean samples and watermark samples.

Figure 3: The diagram of c-BN. We use BatchNorm statistics from the clean inputs to normalize the watermarked inputs.

To reduce the discrepancy, we propose clean-sample-based BatchNorm (c-BN). During forward propagation, we use BatchNorm statistics calculated from an extra batch of clean samples to normalize the watermark samples (the left part of Figure 3), while we keep the BatchNorm unchanged for clean samples (the right part of Figure 3). In the implementation, since each update of the model parameters always involves a batch of clean samples B_c and a batch of watermark samples B_w, we always calculate the BatchNorm statistics for each layer from the clean batch B_c and use them to normalize the inputs.
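A minimal NumPy sketch of c-BN for a single layer, re-implemented from the description above (the per-channel affine parameters and running statistics of a real BatchNorm layer are omitted for brevity, and the function name is ours):

```python
import numpy as np

def clean_sample_bn(B_c, B_w, eps=1e-5):
    """Normalize both batches, but compute the statistics from the clean
    batch B_c only, as in c-BN.

    B_c: clean batch,     shape (n, d)
    B_w: watermark batch, shape (m, d)
    """
    mu = B_c.mean(axis=0)                 # statistics from the clean batch only
    var = B_c.var(axis=0)
    norm = lambda x: (x - mu) / np.sqrt(var + eps)
    return norm(B_c), norm(B_w)           # watermark batch uses clean statistics
```

The key design choice is that B_w never contributes to mu or var, so the watermark samples cannot shift the normalization statistics seen by clean inputs.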

Figure 4: The illustration of different types of watermark inputs.

Figure 5: The results with various magnitudes ϵ. We use a dashed line of the same color to show the performance when ϵ = 0. Left: before attacks; Right: after attacks.

Figure 7: The performance of models in the vicinity of the APP-based watermarked model in the parametric space. d_FT denotes the direction of fine-tuning and d_adv denotes the adversarial direction. Black dot: the original watermarked model; red star: the model after fine-tuning.

Figure 8: The t-SNE visualization of hidden feature representations.

Figure 9: MCR results with varying thresholds on CIFAR-10.

Figure 11: MCR results with varying thresholds on CIFAR-100.

Figure 12: ANP results with varying thresholds on CIFAR-100.

Figure 14: t-SNE visualization of EW watermarked model along the adversarial direction.

Figure 16: t-SNE visualization of our watermarked model along the adversarial direction.

Figure 17: t-SNE visualization of vanilla watermarked model during the process of fine-tuning.

Figure 21: The WSR of models under ANP, NAD, and MCR.

In this paper, we consider the case where, before releasing the protected DNNs, the defender (usually the model owner) has full access to the training process and can embed any possible type of watermark inside the DNNs. For verification, the defender can only obtain predictions from the suspicious third-party model via its API (the black-box verification setting), which is more practical but also more challenging than the white-box setting where defenders can access model weights.

Algorithm 1 APP-based Watermarked Model Training. Input: network f_θ(·), clean training set D_c, watermarked training set D_w, batch size n for clean data, batch size m for watermarked data, learning rate η, perturbation magnitude ϵ.
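The minimax update of Algorithm 1 can be sketched on a toy problem as follows. The gradient functions are stand-ins for the clean-data and watermark losses, and the single signed-gradient inner step is a simplification of the actual inner maximization; everything here is illustrative rather than the paper's implementation.

```python
import numpy as np

def app_update(theta, grad_clean, grad_wm, eta, epsilon):
    """One minimax update: the inner step finds an epsilon-bounded parametric
    perturbation that maximally hurts the watermark loss (one FGSM-style step
    in weight space); the outer step updates theta so the watermark survives
    even at theta + delta, while keeping the clean loss small.

    grad_clean(theta): gradient of the clean-data loss.
    grad_wm(theta):    gradient of the watermark loss.
    """
    # Inner maximization (simplified to a single signed-gradient step).
    delta = epsilon * np.sign(grad_wm(theta))
    # Outer minimization: clean loss at theta, watermark loss at theta + delta.
    return theta - eta * (grad_clean(theta) + grad_wm(theta + delta))
```

On a 1-D toy problem with quadratic losses, iterating this update converges to a parameter that trades off the clean optimum against keeping the watermark loss small in the whole ϵ-neighborhood, which is the intended behavior of APP.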

Performance (average over 3 random runs) of 3 watermark-injection methods and 3 types of watermark inputs against 6 removal attacks on CIFAR-10. Before: BA/WSR of the trained watermarked models; After: the remaining WSR after watermark-removal attacks. AvgDrop indicates the average changes in WSR against all attacks.

The effect of the two components in our method.

Results on CIFAR-10. 'NA' denotes 'No Attack'.

Results on CIFAR-100. 'NA' denotes 'No Attack'.

Results of Content embedded with varying perturbation magnitude ϵ using our method. AVG denotes the average WSR/BA after watermark removal attacks.

Results of standard model watermark over content-type attack with different target labels.

Results of our model watermark over content-type attacks with different target labels. (The table was flattened during extraction; each row below lists the target label y_t followed by the reported BA/WSR values in the source order. The leading entries of the row for y_t = 0 are truncated in the source.)

y_t = 0: … | …85 | 92.14 | 88.41 | 90.35 | 89.36 | 91.15 | 91.01
y_t = 1: 93.57 | 91.82 | 91.87 | 88.64 | 90.22 | 88.89 | 91.37 | 90.91
y_t = 2: 93.59 | 91.68 | 91.92 | 89.02 | 90.11 | 89.21 | 90.92 | 90.92
y_t = 3: 93.46 | 91.70 | 91.86 | 87.49 | 90.18 | 89.24 | 91.31 | 90.75
y_t = 4: 93.51 | 91.68 | 91.77 | 88.80 | 90.00 | 88.92 | 91.06 | 90.82

The results of (Li et al., 2019) with different model architectures.

The results of CAE on CIFAR-10.

We conjecture that this failure is mostly because the batch statistics (mean & variance) calculated inside batch normalization are unduly affected by outliers caused by the extreme values in input pixels. We will further explore this problem in our future work.

E.2 DEEP NEURAL NETWORK FINGERPRINTING BY CONFERRABLE ADVERSARIAL EXAMPLES

Different from existing methods that adopt predefined patterns to generate watermark samples, Lukas et al. (2020) exploited conferrable adversarial examples (CAE) as watermark samples. Specifically, it is an ensemble-based defense requiring the training of 36 different models. Accordingly, it is very time-consuming and requires a large amount of training resources.

The results of RPP and AT on CIFAR-10.

The results with MobileNetV2 on CIFAR-10.

ADDITIONAL RESULTS ON OTHER MODEL ARCHITECTURES

In Section 4.3, we demonstrated that our method improves watermark robustness against the FT attack across various model architectures (i.e., MobileNetV2, VGG16, and ResNet50). To further verify that our method outperforms the baseline defenses across model architectures under different attacks, in this section we conduct additional experiments with more attacks (i.e., ANP, NAD, and MCR) beyond the FT-based ones. As shown in Figure 21, our method consistently improves watermark robustness across different model architectures under all attacks.

ETHICS STATEMENT

In this paper, we propose a minimax optimization-based method to embed a more robust model watermark. Our main goal is to assist model owners in better protecting their intellectual property, which has positive social effects. However, we note that our method may make backdoor attacks more resistant to current backdoor defenses; accordingly, it could be used for malicious purposes. People can mitigate this threat by only using resources from reliable third parties.

REPRODUCIBILITY STATEMENT

Detailed descriptions of the datasets, models, and training settings are provided in Appendix B. We provide part of the code and some checkpoints to reproduce our main results, and will release the remaining code for reproducing our method upon acceptance of the paper.

