CERTIFIED WATERMARKS FOR NEURAL NETWORKS

Abstract

Watermarking is a commonly used strategy to protect creators' rights to digital images, videos, and audio. Recently, watermarking methods have been extended to deep learning models: in principle, the watermark should be preserved when an adversary tries to copy the model. However, in practice, watermarks can often be removed by an intelligent adversary. Several papers have proposed watermarking methods that claim to be empirically resistant to different types of removal attacks, but these new techniques often fail in the face of new or better-tuned adversaries. In this paper, we propose the first certifiable watermarking method. Using the randomized smoothing technique proposed in Chiang et al., we show that our watermark is guaranteed to be unremovable unless the model parameters are changed by more than a certain ℓ2 threshold. In addition to being certifiable, our watermark is also empirically more robust than previous watermarking methods.

1. INTRODUCTION

With the rise of deep learning, there has been an extraordinary growth in the use of neural networks in various computer vision and natural language understanding tasks. In parallel with this growth in applications, the cost of developing and training state-of-the-art models has grown exponentially (Amodei & Hernandez, 2018). For example, the latest GPT-3 generative language model (Brown et al., 2020) is estimated to have cost around 4.6 million dollars (Li, 2020) in TPU cost alone. This does not include the cost of acquiring and labeling data or paying engineers, which may be even greater. With up-front investment costs growing, if access to models is offered as a service, there is a strong incentive for an adversary to steal the model and sidestep the costly training process. Incentives are equally strong for companies to protect such a significant investment.

Watermarking techniques have long been used to protect the copyright of digital multimedia (Hartung & Kutter, 1999). The copyright holder hides some imperceptible information in images, videos, or sound. When they suspect a copyright violation, the source and destination of the multimedia can be identified, enabling appropriate follow-up actions (Hartung & Kutter, 1999). Recently, watermarking has been extended to deter the theft of machine learning models (Uchida et al., 2017; Zhang et al., 2018). The model owner either imprints a predetermined signature into the parameters of the model (Uchida et al., 2017) or trains the model to give predetermined predictions (Zhang et al., 2018) on a certain trigger set (e.g., images superimposed with a predetermined pattern). A strong watermark must also resist removal by a motivated adversary.
Even though the watermarks in (Uchida et al., 2017; Zhang et al., 2018; Adi et al., 2018) initially claimed some resistance to various watermark removal attacks, it was later shown in (Shafieinejad et al., 2019; Aiken et al., 2020) that these watermarks can in fact be removed with more sophisticated methods, using a combination of distillation, parameter regularization, and fine-tuning. To avoid the cat-and-mouse game of ever-stronger watermarking techniques that are later defeated by new adversaries, we propose the first certifiable watermark: unless the attacker changes the model parameters by more than a certain ℓ2 distance, the watermark is guaranteed to remain. To the best of our knowledge, our proposed watermarking technique is the first to provide a certificate against an ℓ2 adversary. Although the bound obtained by the certificate is relatively small, we see it as a first step towards developing watermarks with provable guarantees. Additionally, we find empirically that our certified watermark is more resistant to previously proposed watermark removal attacks (Shafieinejad et al., 2019; Aiken et al., 2020) than its counterparts; it is thus valuable even when a certificate is not required.
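To make the flavor of such a certificate concrete, the sketch below computes a certified ℓ2 radius in the style of standard Gaussian randomized-smoothing bounds (the exact bound used in this paper is derived later from Chiang et al.'s technique, so the function name and the specific formula here are illustrative assumptions, not the paper's result). The idea: add Gaussian noise to the model parameters, estimate how often the noisy model still exhibits the watermark, and translate that margin into an ℓ2 radius within which the smoothed verdict cannot change.

```python
from statistics import NormalDist  # Python 3.8+ standard library

def certified_l2_radius(sigma: float, p_a: float, p_b: float) -> float:
    """Illustrative Gaussian-smoothing certificate (Cohen et al.-style bound).

    sigma -- std. dev. of the Gaussian noise added to the model parameters
    p_a   -- lower bound on the probability that a noisy copy of the model
             still exhibits the watermark (e.g., high trigger-set accuracy)
    p_b   -- upper bound on the probability of the runner-up outcome

    Returns the l2 radius: any parameter perturbation with l2 norm below
    this value cannot flip the smoothed watermark verdict.
    """
    nd = NormalDist()
    return (sigma / 2.0) * (nd.inv_cdf(p_a) - nd.inv_cdf(p_b))

# Example: with noise sigma = 1.0, if 90% of noisy copies still show the
# watermark and at most 10% show any other outcome, the certificate holds
# out to an l2 radius of about 1.28.
radius = certified_l2_radius(sigma=1.0, p_a=0.9, p_b=0.1)
print(f"certified l2 radius: {radius:.3f}")  # prints "certified l2 radius: 1.282"
```

Note the trade-off visible in the formula: a larger noise level sigma widens the certified radius, but only if the watermark still survives the noise with high probability (large p_a), which is why the certified radii reported by such methods tend to be modest.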

