DEEP WATERMARKS FOR ATTRIBUTING GENERATIVE MODELS

Abstract

Generative models have enabled the creation of content that is indistinguishable from natural data. The open-source development of such models has raised concerns about the risk of their misuse for malicious purposes. One potential risk-mitigation strategy is to attribute generative models via watermarking. Current watermarking methods exhibit a significant tradeoff between robust attribution accuracy and generation quality, and also lack principles for designing watermarks that improve this tradeoff. This paper investigates the use of latent semantic dimensions as watermarks, from which we analyze the effects of design variables, including the choice of watermarking dimensions, watermarking strength, and watermark capacity, on the accuracy-quality tradeoff. Compared with the previous SOTA, our method requires minimal compute and is more applicable to large-scale models. We use StyleGAN2 and a latent diffusion model to demonstrate the efficacy of our method.

1. INTRODUCTION

Generative models can now create synthetic content, such as images and audio, that is indistinguishable from natural data (Karras et al., 2020; Rombach et al., 2022; Ramesh et al., 2022; Hawthorne et al., 2022). This poses a serious threat when used with malicious intent, e.g., for disinformation (Breland, 2019) or malicious impersonation (Satter, 2019). Such potential threats delay the industrialization of generative models, as conservative model inventors hesitate to release their source code (Yu et al., 2020). For example, OpenAI withheld the source code of its GPT-2 model (Radford et al., 2019) due to concerns over potential malicious use (Brockman et al., 2020); the source code of DALL-E (Ramesh et al., 2021) and DALL-E 2 (Ramesh et al., 2022) is likewise unreleased for the same reason (Mishkin & Ahmad, 2022). One potential solution is model attribution (Yu et al., 2018; Kim et al., 2020; Yu et al., 2020), where a model distributor tweaks each user-end model so that it generates content with a model-specific watermark. In practice, we consider the scenario where the model distributor or a regulator maintains a database of user-specific keys, each corresponding to a user's downloaded model. When a malicious attempt has been made, the regulator can identify the responsible user by attribution. Additionally, we assume the distributed model is white-box, which makes a separate watermarking module appended on top of the generator trivial to defeat: a malicious user can simply remove such a module from the network. Instead, we propose a deep watermarking method that is free from this limitation by embedding the watermark directly into the generative model itself.
Formally, let a set of n generative models be G := {g_i(·)}_{i=1}^{n}, where g_i(·) : R^{d_z} → R^{d_x} maps an easy-to-sample distribution p_z to a watermarked data distribution p_{x,i} in the content space and is parameterized by a binary-coded key ϕ_i ∈ Φ := {0, 1}^{d_ϕ}. Let f(·) : R^{d_x} → Φ be a mapping that attributes contents to their source models. We consider four performance metrics of a watermarking mechanism. The attribution accuracy of g_i is defined as

A(g_i) = E_{z∼p_z}[1(f(g_i(z)) = ϕ_i)].    (1)

The generation quality of g_i measures the difference between p_{x,i} and the data distribution used for learning G, e.g., the Fréchet Inception Distance (FID) score (Heusel et al., 2017) for images. The Inception Score (IS) (Salimans et al., 2016) is also measured for p_{x,i} as an additional generation quality metric. Watermark secrecy is measured by the mean peak signal-to-noise ratio (PSNR) of individual images drawn from p_{x,i}; compared with generation quality, this metric focuses on how obvious watermarks are rather than on how well two content distributions match. Lastly, the watermark capacity is n = 2^{d_ϕ}. Existing watermarking methods exhibit a significant tradeoff between attribution accuracy and generation quality (and watermark secrecy), particularly when countermeasures against dewatermarking attempts, e.g., image postprocesses, are taken into consideration. For example, Kim et al. (2020) use shallow watermarks for image generators in the form of g_i(z) = g_0(z) + ϕ_i, where g_0(·) is an unwatermarked model, and show that the ϕ_i have to significantly alter the original contents to achieve good attribution accuracy against image blurring, causing an unfavorable drop in generation quality and watermark secrecy (Fig. 1(a)).
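The attribution accuracy in Eq. (1) and the shallow-watermark form g_i(z) = g_0(z) + ϕ_i can be illustrated with a small Monte Carlo sketch. This is a toy setup, not the paper's models: `g` uses a bounded nonlinearity as a stand-in for an unwatermarked generator g_0, `f` is a trivial sign-based decoder, and all names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_phi = 8                        # key length; watermark capacity n = 2**d_phi
phi = rng.integers(0, 2, d_phi)  # one model's binary key

def g(z, phi, strength):
    """Toy shallow watermark: g_i(z) = g_0(z) + strength * (2*phi - 1)."""
    g0 = 0.3 * np.tanh(z)        # stand-in unwatermarked generator, bounded in (-0.3, 0.3)
    return g0 + strength * (2 * phi - 1)

def f(x):
    """Toy attribution function: decode each key bit by thresholding at zero."""
    return (x > 0).astype(int)

def attribution_accuracy(phi, strength, n_samples=2000):
    """Monte Carlo estimate of A(g_i) = E_z[ 1(f(g_i(z)) == phi_i) ]."""
    z = rng.standard_normal((n_samples, d_phi))
    x = g(z, phi, strength)
    return np.mean(np.all(f(x) == phi, axis=1))

print(attribution_accuracy(phi, strength=1.0))  # strong watermark: accurate, but low secrecy
print(attribution_accuracy(phi, strength=0.1))  # weak watermark: secret, but accuracy collapses
```

Even in this toy setting the tradeoff is visible: a watermark strong enough to dominate the generator's own output variation is decoded perfectly but badly distorts the content, while a subtle one is frequently mis-decoded.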
To improve this tradeoff, we investigate in this paper deep watermarks of the form g_i(z) := g_0(ψ(z) + ϕ_i), where w := ψ(z) ∈ R^{d_w} contains disentangled semantic dimensions that allow a smoother mapping to the content space (Fig. 1). Such a ψ has been incorporated in popular models such as StyleGAN (SG) (Karras et al., 2019; 2020), where w is the style vector, and latent diffusion models (LDM) (Rombach et al., 2022), where w comes from a diffusion process. Existing studies on semantic editing have shown that R^{d_w} consists of linear semantic dimensions (Härkönen et al., 2020; Zhu et al., 2021). Inspired by this, we hypothesize that using subtle yet semantic changes as watermarks will improve the robustness of attribution accuracy against image postprocesses, and we thus investigate the performance of deep watermarks generated by perturbations along latent dimensions of R^{d_w}. Specifically, we take the latent dimensions to be eigenvectors of the covariance matrix of the latent distribution p_w, denoted by Σ_w.

Contributions. (1) We propose a novel intrinsic watermarking strategy that embeds the watermark directly into the generative model, as a means to achieve responsible white-box model distribution. (2) We prove and empirically verify that there exists an intrinsic tradeoff between attribution accuracy and generation quality. This tradeoff is affected by watermark variables including the choice of the watermarking space, the watermark strength, and the watermark capacity. Parametric studies on these variables for StyleGAN2 (SG2) and a latent diffusion model (LDM) lead to an improved accuracy-quality tradeoff over the previous SOTA. In addition, our method requires negligible compute compared with the previous SOTA, rendering it more applicable to popular large-scale models, including latent diffusion ones. (3) We show that using a postprocess-specific LPIPS metric for model attribution further improves attribution accuracy against image postprocesses.
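The latent-space construction above can be sketched numerically: estimate Σ_w from latent samples, take its eigenvectors as watermarking dimensions, and embed a key as signed perturbations along them. In this sketch a random linear map stands in for ψ, the choice of the d_ϕ lowest-variance eigenvectors is one illustrative design choice among those the paper studies, and the final decoding step only checks the embedding geometry in latent space (the actual estimator f must decode from generated content).

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_phi, strength = 16, 4, 0.5

# Stand-in for the latent distribution p_w: a random linear map of Gaussian noise.
A = rng.standard_normal((d_w, d_w))
w = rng.standard_normal((10_000, d_w)) @ A.T       # samples w = psi(z)

# Latent semantic dimensions: eigenvectors of the covariance matrix Sigma_w.
Sigma_w = np.cov(w, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma_w)         # eigenvalues in ascending order
V = eigvecs[:, :d_phi]                             # d_phi lowest-variance directions

# Embed a binary key phi as signed perturbations along the chosen eigenvectors.
phi = rng.integers(0, 2, d_phi)
delta = strength * V @ (2 * phi - 1)               # sum_j strength * (+/-1) * v_j
w_marked = w + delta                               # perturbed latents fed to the generator

# Geometry check: projecting the perturbation back onto V recovers the key exactly,
# since the eigenvectors are orthonormal.
phi_hat = ((w_marked - w) @ V > 0).astype(int)
assert np.all(phi_hat == phi)
```

Picking low-variance eigenvectors keeps the perturbation distinguishable from natural latent variation, while the strength parameter controls how far w_marked drifts from p_w, i.e., the same accuracy-quality knob studied in the parametric experiments.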

2. RELATED WORK

Model attribution through watermark encoding and decoding. Yu et al. (2020) propose to encode binary-coded keys into images through g_i(z) = g_0([z, ϕ_i]) and to decode them via another learnable function. This requires joint training of the encoder and decoder over R^{d_z} × Φ to empirically balance attribution accuracy and generation quality. Since watermark capacity is usually



Figure 1: (a) Visual comparison between deep watermarking (our method) and shallow watermarking (Kim et al., 2020). Our method uses subtle semantic changes, rather than strong noise, to maintain attribution accuracy against image postprocesses. (b) Schematic of deep watermarking: the same generator g and watermark estimator f are used for all watermarked models. Our method thus requires minimal compute and is scalable to large latent diffusion models.

