CERTIFIED WATERMARKS FOR NEURAL NETWORKS

Abstract

Watermarking is a commonly used strategy to protect creators' rights to digital images, videos, and audio. Recently, watermarking methods have been extended to deep learning models: in principle, the watermark should be preserved when an adversary tries to copy the model. However, in practice, watermarks can often be removed by an intelligent adversary. Several papers have proposed watermarking methods that claim to be empirically resistant to different types of removal attacks, but these new techniques often fail in the face of new or better-tuned adversaries. In this paper, we propose the first certifiable watermarking method. Using the randomized smoothing technique proposed by Chiang et al. (2020), we show that our watermark is guaranteed to be unremovable unless the model parameters are changed by more than a certain ℓ2 threshold. In addition to being certifiable, our watermark is also empirically more robust than previous watermarking methods.

1. INTRODUCTION

With the rise of deep learning, there has been extraordinary growth in the use of neural networks in various computer vision and natural language understanding tasks. In parallel with this growth in applications, there has been exponential growth in the cost required to develop and train state-of-the-art models (Amodei & Hernandez, 2018). For example, the latest GPT-3 generative language model (Brown et al., 2020) is estimated to cost around 4.6 million dollars (Li, 2020) in TPU cost alone. This does not include the cost of acquiring and labeling data or paying engineers, which may be even greater. With up-front investment costs growing, if access to models is offered as a service, the incentive is strong for an adversary to try to steal the model, sidestepping the costly training process. Incentives are equally strong for companies to protect such a significant investment.

Watermarking techniques have long been used to protect the copyright of digital multimedia (Hartung & Kutter, 1999). The copyright holder hides some imperceptible information in images, videos, or sound. When they suspect a copyright violation, the source and destination of the multimedia can be identified, enabling appropriate follow-up actions (Hartung & Kutter, 1999). Recently, watermarking has been extended to deter the theft of machine learning models (Uchida et al., 2017; Zhang et al., 2018). The model owner either imprints a predetermined signature into the parameters of the model (Uchida et al., 2017) or trains the model to give predetermined predictions (Zhang et al., 2018) on a certain trigger set (e.g., images superimposed with a predetermined pattern). A strong watermark must also resist removal by a motivated adversary.
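To make the trigger-set idea concrete, the sketch below builds a trigger set by blending a secret pattern into clean images and verifies ownership by checking a suspect model's agreement on those images. The function names, the linear blending scheme, and the agreement threshold are illustrative assumptions, not the exact construction used by any of the cited papers.

```python
import numpy as np

def make_trigger_set(images, pattern, target_label, alpha=0.2):
    """Superimpose a fixed secret pattern on clean images to build a trigger set.

    images:  (N, H, W) array of pixel values in [0, 1]
    pattern: (H, W) array, the owner's secret pattern
    Returns the triggered images and their (identical) target labels.
    """
    triggered = np.clip((1 - alpha) * images + alpha * pattern, 0.0, 1.0)
    labels = np.full(len(images), target_label)
    return triggered, labels

def verify_ownership(predict_fn, trigger_images, target_label, threshold=0.9):
    """Black-box verification: claim ownership if the suspect model agrees
    with the trigger labels on at least `threshold` of the trigger set."""
    preds = predict_fn(trigger_images)
    agreement = np.mean(preds == target_label)
    return bool(agreement >= threshold)
```

During training, the owner would mix `(triggered, labels)` into the training data so the model learns the backdoor behavior; verification then needs only API access to `predict_fn`.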
Even though the watermarks of (Uchida et al., 2017; Zhang et al., 2018; Adi et al., 2018) initially claimed some resistance to various watermark removal attacks, it was later shown (Shafieinejad et al., 2019; Aiken et al., 2020) that these watermarks can in fact be removed with more sophisticated methods, using a combination of distillation, parameter regularization, and finetuning. To avoid the cat-and-mouse game of ever-stronger watermark techniques that are only later defeated by new adversaries, we propose the first certifiable watermark: unless the attacker changes the model parameters by more than a certain ℓ2 distance, the watermark is guaranteed to remain. To the best of our knowledge, our proposed watermarking technique is the first to provide a certificate against an ℓ2 adversary. Although the bound obtained by the certificate is relatively small, we see it as a first step towards developing watermarks with provable guarantees. Additionally, we empirically find that our certified watermark is more resistant to previously proposed watermark removal attacks (Shafieinejad et al., 2019; Aiken et al., 2020) than its counterparts; it is thus valuable even when a certificate is not required.

Watermark techniques (Uchida et al., 2017) proposed the first method of watermarking neural networks: they embed the watermark into the parameters of the network during training through regularization. However, this approach requires explicit inspection of the parameters for ownership verification. Later, (Zhang et al., 2018; Rouhani et al., 2018) improved upon this approach so that the watermark can be verified through API-only access to the model. Specifically, they embed the watermark by forcing the network to deliberately misclassify certain "backdoor" images. Ownership can then be verified through the adversary's API by testing its predictions on these images.
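The parameter-regularization approach can be sketched as follows: a secret random projection of a layer's weights is driven toward a secret bit-string by an extra loss term during training, and the bits are later recovered by thresholding the same projection. This is a simplified illustration of an Uchida-style embedding; the function names and the exact loss form are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def watermark_regularizer(w, X, b):
    """Embedding loss (sketch): push sigmoid(X @ w) toward the owner's secret
    bit-string b; added to the ordinary training loss as a regularizer.

    w: flattened weights of the target layer, shape (d,)
    X: secret random projection matrix, shape (k, d)
    b: secret bit-string in {0, 1}^k
    """
    s = 1.0 / (1.0 + np.exp(-(X @ w)))  # soft estimates of the embedded bits
    eps = 1e-12                          # numerical floor for the logs
    return float(-np.mean(b * np.log(s + eps) + (1 - b) * np.log(1 - s + eps)))

def extract_bits(w, X):
    """White-box verification: threshold the secret projection to recover bits."""
    return (X @ w > 0).astype(int)
```

Note that verification here requires access to the weights `w`, which is exactly the white-box limitation the trigger-set methods discussed above were designed to remove.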
In light of later and stronger watermark removal techniques (Aiken et al., 2020; Wang & Kerschbaum, 2019; Shafieinejad et al., 2019), several papers have proposed methods to improve neural network watermarking. (Wang & Kerschbaum, 2019) propose an improved white-box watermark that avoids the detection and removal techniques from (Wang & Kerschbaum, 2019). (Li et al., 2019) propose using values outside the range of representable images as the trigger-set pattern, and show that their watermark is quite resistant to a finetuning attack. However, since their trigger set does not consist of valid images, their method does not allow black-box ownership verification against a realistic API that accepts only actual images, while our proposed watermark is effective even in the black-box setting. (Szyller et al., 2019) propose watermarking methods for models housed behind an API. Unlike our method, theirs does not embed a watermark into the model weights themselves, and so cannot work in scenarios where the weights may be stolen directly, e.g., when the model is housed on mobile devices. Finally, (Lukas et al., 2019) propose using a particular type of adversarial example ("conferrable" adversarial examples) to construct the trigger set. This makes their watermark scheme resistant even to the strongest watermark removal attack: ground-up distillation, which, starting from a random initialization, trains a new network to imitate the stolen model (Shafieinejad et al., 2019). However, for their approach to be effective, they need to train a large number of models (72) on a large amount of data (e.g., requiring CINIC as opposed to CIFAR-10). While our approach does not achieve this impressive resistance to ground-up distillation, it is also much less costly.

Watermark removal attacks One concern for all these watermark methods is that a sufficiently motivated adversary may attempt to remove the watermark.
Even though (Zhang et al., 2018; Rouhani et al., 2018; Adi et al., 2018; Uchida et al., 2017) all claim that their methods resist watermark removal attacks such as finetuning, other authors (Aiken et al., 2020; Shafieinejad et al., 2019) later showed that, by combining regularization, finetuning, and pruning, these watermarks can be removed without compromising the prediction accuracy of the stolen model. Wang & Kerschbaum (2019) show that the watermark signals embedded by (Uchida et al., 2017) can be easily detected and overwritten, and (Chen et al., 2019) show that by leveraging both labeled and unlabeled data, the watermark can be removed more efficiently without compromising accuracy. Even if a watermark appears empirically resistant to currently known attacks, stronger attacks may eventually come along, prompting better watermark methods, and so on. To avoid this cycle, we propose the first certifiably unremovable watermark: provided the parameters are not modified by more than a given threshold ℓ2 distance, the watermark is guaranteed to be preserved.

Certified defenses for adversarial robustness Our work is inspired by recent work on certified adversarial robustness (Cohen et al., 2019; Chiang et al., 2019; Wong & Kolter, 2017; Mirman et al., 2018; Weng et al., 2018; Zhang et al., 2019; Eykholt et al., 2017; Levine & Feizi, 2019). Certified adversarial robustness involves not only training the model to be robust to adversarial attacks under a particular threat model, but also proving that no possible attack under a particular constraint could succeed. Specifically, in this paper we use the randomized smoothing technique first developed by (Cohen et al., 2019; Lecuyer et al., 2019) for classifiers and later extended by (Chiang et al., 2020) to regression models. However, as opposed to defending against an ℓ2-bounded adversary in image space, we are now defending against an ℓ2-bounded adversary in parameter space.
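The smoothing-in-parameter-space idea can be sketched as follows: the smoothed watermark accuracy is the expected trigger-set accuracy under Gaussian noise added to the parameters, estimated by Monte Carlo, and a percentile of those samples lower-bounds the smoothed value after any ℓ2-bounded parameter shift. This is a minimal sketch in the spirit of percentile smoothing for regression (Chiang et al., 2020); the function names, the sample-percentile estimate (which ignores finite-sample confidence intervals), and the choice of statistic are illustrative assumptions, not the paper's exact certification procedure.

```python
import numpy as np
from statistics import NormalDist

def sample_smoothed_accuracy(trigger_acc_fn, theta, sigma=0.1, n_samples=100, seed=0):
    """Monte-Carlo samples of f(theta + eps), eps ~ N(0, sigma^2 I), where f
    maps a parameter vector to trigger-set accuracy in [0, 1].
    The smoothed watermark accuracy g(theta) is the mean of these samples."""
    rng = np.random.default_rng(seed)
    return [trigger_acc_fn(theta + sigma * rng.standard_normal(theta.shape))
            for _ in range(n_samples)]

def certified_percentile_bound(samples, radius, sigma, p=0.5):
    """Percentile-smoothing bound (sketch): if the adversary moves the
    parameters by at most `radius` in l2, the p-th percentile of the smoothed
    accuracy cannot fall below the p'-th empirical percentile, where
    p' = Phi(Phi^{-1}(p) - radius / sigma)."""
    nd = NormalDist()
    p_shifted = nd.cdf(nd.inv_cdf(p) - radius / sigma)
    return float(np.percentile(samples, 100 * p_shifted))
```

A larger `sigma` widens the certified radius but adds more noise to the model, so in practice the noise level trades certificate strength against clean accuracy.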
Surprisingly, even though the certificate holds only when randomized smoothing is applied, we find empirically that our watermark, when evaluated in a black-box setting on the non-smoothed model, still exhibits stronger persistence than previous methods.

