DEEP WATERMARKS FOR ATTRIBUTING GENERATIVE MODELS

Abstract

Generative models have enabled the creation of content that is indistinguishable from natural data. The open-source development of such models has raised concerns about their misuse for malicious purposes. One potential risk-mitigation strategy is to attribute generative models via watermarking. Current watermarking methods exhibit a significant tradeoff between robust attribution accuracy and generation quality, and they lack principles for designing watermarks that improve this tradeoff. This paper investigates the use of latent semantic dimensions as watermarks, from which we can analyze the effects of design variables, including the choice of watermarking dimensions, watermarking strength, and watermark capacity, on the accuracy-quality tradeoff. Compared with the previous SOTA, our method requires minimal compute and is more applicable to large-scale models. We use StyleGAN2 and a latent diffusion model to demonstrate the efficacy of our method.

Model attribution through watermark encoding and decoding. Yu et al. (2020) propose to encode binary-coded keys into images through g_i(z) = g_0([z, ϕ_i]) and to decode them via another learnable function. This requires joint training of the encoder and decoder over R^{d_z} × Φ to empirically balance attribution accuracy and generation quality. Since the watermark capacity (2^{d_ϕ}) is usually high, training is made tractable by sampling only a small subset of watermarks.

1. INTRODUCTION

Generative models can now create synthetic content such as images and audio that is indistinguishable from natural data (Karras et al., 2020; Rombach et al., 2022; Ramesh et al., 2022; Hawthorne et al., 2022). This poses a serious threat when used with malicious intent, such as disinformation (Breland, 2019) and malicious impersonation (Satter, 2019). Such potential threats delay the deployment of generative models, as conservative model inventors hesitate to release their source code (Yu et al., 2020). For example, OpenAI initially declined to release the source code of the GPT-2 model (Radford et al., 2019) due to concerns over potential malicious use (Brockman et al., 2020); the source code of DALL-E (Ramesh et al., 2021) and DALL-E 2 (Ramesh et al., 2022) was also withheld for the same reason (Mishkin & Ahmad, 2022). One potential solution is model attribution (Yu et al., 2018; Kim et al., 2020; Yu et al., 2020), where a model distributor tweaks each user-end model so that it generates content with model-specific watermarks. In practice, we consider the scenario where the model distributor or regulator maintains a database of user-specific keys, each corresponding to a user's downloaded model. When a malicious attempt is made, the regulator can identify the responsible user by attribution. We further assume the distributed model is white-box, which renders a separate watermarking module appended on top of the generator useless: a malicious user can simply remove such a module from the network. Instead, we propose a deep watermarking method that is free from this limitation by embedding the watermark directly into the generative model itself.
Formally, let a set of n generative models be G := {g_i(·)}_{i=1}^n, where g_i(·): R^{d_z} → R^{d_x} maps an easy-to-sample distribution p_z to a watermarked data distribution p_{x,i} in the content space and is parameterized by a binary-coded key ϕ_i ∈ Φ := {0, 1}^{d_ϕ}. Let f(·): R^{d_x} → Φ be a mapping that attributes content to its source model. We consider four performance metrics of a watermarking mechanism. The attribution accuracy of g_i is defined as

A(g_i) = E_{z∼p_z}[1(f(g_i(z)) = ϕ_i)].    (1)

The generation quality of g_i measures the difference between p_{x,i} and the data distribution used for learning G, e.g., the Fréchet Inception Distance (FID) score (Heusel et al., 2017) for images. The Inception score (IS) (Salimans et al., 2016) is also measured for p_{x,i} as an additional generation quality metric. Watermark secrecy is measured by the mean peak signal-to-noise ratio (PSNR) of individual images drawn from p_{x,i}. Compared with generation quality, this metric focuses on how obvious the watermarks are rather than how well two content distributions match. Lastly, the watermark capacity is n = 2^{d_ϕ}.

Figure 1: (a) Visual comparison between deep watermarking (our method) and shallow watermarking (Kim et al., 2020). Our method uses subtle semantic changes, rather than strong noise, to maintain attribution accuracy against image postprocesses. (b) Schematic of deep watermarking: The same generator g and watermark estimator f are used for all watermarked models. Our method thus requires minimal compute and is scalable to large latent diffusion models.

Existing watermarking methods exhibit a significant tradeoff between attribution accuracy and generation quality (and watermark secrecy), particularly when countermeasures against dewatermarking attempts, e.g., image postprocesses, are taken into consideration. For example, Kim et al.
(2020) use shallow watermarks for image generators in the form of g_i(z) = g_0(z) + ϕ_i, where g_0(·) is an unwatermarked model, and show that the ϕ_i's have to significantly alter the original contents to achieve good attribution accuracy against image blurring, causing an unfavorable drop in generation quality and watermark secrecy (Fig. 1(a)). To improve this tradeoff, we investigate in this paper deep watermarks of the form g_0(ψ(z) + ϕ_i) − g_0(ψ(z)), where w := ψ(z) ∈ R^{d_w} contains disentangled semantic dimensions that allow a smoother mapping to the content space (Fig. 1). Such a ψ has been incorporated in popular models such as StyleGAN (SG) (Karras et al., 2019; 2020), where w is the style vector, and latent diffusion models (LDM) (Rombach et al., 2022), where w comes from a diffusion process. Existing studies on semantic editing showed that R^{d_w} consists of linear semantic dimensions (Härkönen et al., 2020; Zhu et al., 2021). Inspired by this, we hypothesize that using subtle yet semantic changes as watermarks will improve the robustness of attribution accuracy against image postprocesses, and we thus investigate the performance of deep watermarks generated by perturbations along latent dimensions of R^{d_w}. Specifically, we take the latent dimensions to be eigenvectors of the covariance matrix of the latent distribution p_w, denoted by Σ_w.
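The eigen-decomposition of Σ_w described above can be sketched in a few lines of numpy. This is a minimal illustration on a synthetic latent distribution with a known dominant direction, not the paper's SG2/LDM latent space.

```python
import numpy as np

def latent_eigvecs(w_samples):
    """Eigenvectors of the latent covariance, sorted by descending eigenvalue.

    w_samples: (N, d_w) array of latent codes w = psi(z) drawn from p_w.
    Returns (eigvals, eigvecs) with eigvecs[:, k] the k-th principal direction.
    """
    w_centered = w_samples - w_samples.mean(axis=0)
    sigma_w = w_centered.T @ w_centered / (len(w_samples) - 1)
    eigvals, eigvecs = np.linalg.eigh(sigma_w)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort major to minor
    return eigvals[order], eigvecs[:, order]

# Toy latent distribution: axis-aligned Gaussian with distinct scales, so the
# top principal component should align with the first coordinate axis.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 1.0, 1.0, 0.5, 0.5, 0.1, 0.1])
w = rng.normal(size=(10_000, 8)) * scales
vals, vecs = latent_eigvecs(w)
assert vals[0] >= vals[-1]            # sorted major to minor
assert abs(vecs[0, 0]) > 0.9          # top PC aligns with the largest-scale axis
```

In the paper's setting the same routine is applied to latent samples ψ(z), and the columns of `vecs` are partitioned into the watermark basis V and its complement U.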

Contributions.

(1) We propose a novel intrinsic watermarking strategy that directly embeds the watermark into the generative model, as a means to achieve responsible white-box model distribution. (2) We prove and empirically verify that there exists an intrinsic tradeoff between attribution accuracy and generation quality. This tradeoff is affected by watermark variables including the choice of the watermarking space, the watermark strength, and its capacity. Parametric studies on these variables for StyleGAN2 (SG2) and a latent diffusion model (LDM) lead to an improved accuracy-quality tradeoff over the previous SOTA. In addition, our method requires negligible compute compared with the previous SOTA, rendering it more applicable to popular large-scale models, including latent diffusion ones. (3) We show that using a postprocess-specific LPIPS metric for model attribution further improves attribution accuracy against image postprocesses.

Because the watermark capacity (2^{d_ϕ}) is usually high, the joint encoder-decoder training of Yu et al. (2020) is made tractable only by sampling a small subset of watermarks. This method is thus computationally expensive and lacks a principled understanding of how the watermarking mechanism affects the accuracy-quality tradeoff. In contrast, our method does not require any additional training and relies mainly on a simple principal component analysis of the latent distribution.

Certifiable model attribution through shallow watermarks. Kim et al. (2020) propose shallow watermarks g_i(z) = g_0(z) + ϕ_i and linear classifiers for attribution. These simplifications allow the derivation of sufficient conditions on Φ to achieve certifiable attribution of G. Since the watermarks are added as noise rather than as semantic changes coherent with the generated contents, high watermark strength becomes necessary to maintain attribution accuracy under postprocesses. While this paper does not provide attribution certification for deep watermarks, we discuss the technical feasibility and challenges of achieving this goal.
StyleGAN and low-rank subspace. Our study focuses on popular image generation models that share an architecture rooted in SG: a uniform distribution is first transformed into a latent distribution (p_w), samples from which are then decoded into images. Härkönen et al. (2020) apply principal component analysis to the p_w distribution and find semantically meaningful editing directions. Zhu et al. (2021) use the local Jacobian (∇_w g) to derive perturbations that enable local semantic editing of generated images, and show that such semantic dimensions are shared across the latent space. In this study, we show that the mean Gram matrix for local editing (E_{w∼p_w}[∇_w g^T ∇_w g]) and the covariance of w (Σ_w) are qualitatively similar in that both reveal major-to-minor semantic dimensions through their eigenvectors.

GAN inversion. The model attribution problem can be formulated as a GAN inversion problem. Learning-based inversion (Perarnau et al., 2016; Bau et al., 2019) optimizes the parameters of an encoder network that maps an image to a latent code z. Optimization-based inversion (Abdal et al., 2019; Huh et al., 2020), on the other hand, solves for the latent code z that minimizes a distance metric between a given image and the generated image g(z). The learning-based method is computationally more efficient at inference time than the optimization-based method; however, optimization-based GAN inversion achieves superior quality of latent interpretation, a phenomenon referred to as the quality-time trade-off (Xia et al., 2022). In our method, we utilize optimization-based inversion, as faithful latent interpretation is critical in our application. To further enforce faithful latent interpretation, we incorporate existing techniques, e.g., parallel search, to solve this non-convex problem, but uniquely exploit the fact that watermarks are small latent perturbations to enable analysis of the accuracy-quality tradeoff.
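Optimization-based inversion, as described above, amounts to gradient descent on the reconstruction loss ∥g(w) − x∥². A minimal sketch on a toy differentiable generator (a tanh of a random linear map, chosen only so the Jacobian is analytic; the paper instead inverts SG2/LDM with LPIPS):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(16, 4))

def g(w):
    """Toy generator: R^4 -> R^16, injective since A has full column rank."""
    return np.tanh(A @ w)

def invert(x, steps=5000, lr=0.005):
    """Optimization-based inversion: gradient descent on ||g(w) - x||^2."""
    w = np.zeros(4)
    for _ in range(steps):
        r = g(w) - x                                      # residual
        jac = (1.0 - np.tanh(A @ w) ** 2)[:, None] * A    # Jacobian of g at w
        w -= lr * 2.0 * jac.T @ r                         # gradient step
    return w

w_true = rng.normal(size=4) * 0.5
w_hat = invert(g(w_true))
assert np.allclose(w_hat, w_true, atol=1e-3)   # inversion recovers the latent
```

Because this toy g is injective with a full-rank Jacobian everywhere, gradient descent has a single stationary point; real generators are not so well-behaved, which is why the paper additionally uses parallel search.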

3.1. NOTATIONS AND PRELIMINARIES

Notations. For x ∈ R^n and A ∈ R^{n×m}, denote by proj_A x the projection of x onto span(A), and by A^† the pseudo-inverse of A. For a parameter a, we denote by â its estimate and by ϵ_a = â − a the estimation error. ∇_x f is the gradient of f with respect to x, E_{x∼p_x}[·] is an expectation over p_x, and tr(B) (resp. det(B)) is the trace (resp. determinant) of B ∈ R^{n×n}. diag(λ) ∈ R^{n×n} diagonalizes λ ∈ R^n.

Deep watermarks. Contemporary generative models, e.g., SG2 and LDM, consist of a disentanglement mapping ψ: R^{d_z} → R^{d_w} from an easy-to-sample distribution p_z to a latent distribution p_w, followed by a generator g: R^{d_w} → R^{d_x} that maps w to the content space. In particular, ψ is a multilayer perceptron in SG and a diffusion process in a diffusion model. Existing studies showed that linear perturbations along principal components of ∇_w g enable semantic editing, and such perturbation directions are often applicable over w ∼ p_w (Härkönen et al., 2020; Zhu et al., 2021). Indeed, instead of a local analysis of the Jacobian, Härkönen et al. (2020) showed that principal component analysis directly on p_w also reveals semantic dimensions. This paper follows these findings and uses a subset of semantic dimensions as watermarks. Specifically, let U ∈ R^{d_w×(d_w−d_ϕ)} and V ∈ R^{d_w×d_ϕ} be orthonormal and complementary. Given a random seed z ∈ R^{d_z}, a user-specific key ϕ ∈ R^{d_ϕ}, and a strength σ ∈ R, let α = U^† proj_U ψ(z) ∈ R^{d_w−d_ϕ}; the watermarked latent variable is then

w_ϕ(α) = Uα + σVϕ,

where α ∼ p_α and p_α is induced by p_w. The user can then generate watermarked images g(w_ϕ(α)). The choice of (U, V) and σ affects attribution accuracy and generation quality, which we analyze in Sec. 3.2.

Attribution. To decode the user-specific key from an image g(w_ϕ(α)), we formulate the optimization problem

min_{α̂, ϕ̂} l(g(w_ϕ̂(α̂)), g(w_ϕ(α)))  s.t.  α̂_i ∈ [α_{l,i}, α_{u,i}], ∀i = 1, ..., d_w − d_ϕ.
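The embedding step above (w_ϕ = Uα + σVϕ with α the U-coordinates of ψ(z)) can be sketched directly in numpy. A random orthonormal basis stands in for the eigenvectors of Σ_w, and the dimensions are illustrative; the final assert shows that, at the latent level, the key is exactly recoverable by projecting onto V.

```python
import numpy as np

d_w, d_phi = 8, 3
rng = np.random.default_rng(0)

# Orthonormal complementary bases. In the paper U, V are complementary sets of
# eigenvectors of the latent covariance; here a random orthonormal basis.
Q, _ = np.linalg.qr(rng.normal(size=(d_w, d_w)))
U, V = Q[:, :d_w - d_phi], Q[:, d_w - d_phi:]
sigma = 0.5

def watermark_latent(w, phi):
    """w_phi = U @ alpha + sigma * V @ phi, with alpha the U-coordinates of w."""
    alpha = U.T @ w          # U orthonormal, so U^+ proj_U w reduces to U^T w
    return U @ alpha + sigma * V @ phi, alpha

w = rng.normal(size=d_w)                               # stands in for psi(z)
phi = rng.integers(0, 2, size=d_phi).astype(float)     # binary-coded key
w_phi, alpha = watermark_latent(w, phi)

# Since V^T U = 0 and V^T V = I, projecting onto V returns sigma * phi exactly.
phi_rec = np.round(V.T @ w_phi / sigma)
assert np.allclose(phi_rec, phi)
```

The hard part of attribution, of course, is that only the image g(w_ϕ(α)) is observed, not w_ϕ itself, which is why the optimization problem above is needed.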
Through experiments, we discovered that attribution accuracy can be improved by constraining α̂. Here the upper and lower bounds of α̂ are chosen based on the empirical limits observed from p_α. While the l_2 norm is used for analysis in Sec. 3.2, here we minimize LPIPS (Zhang et al., 2018), which measures the perceptual difference between two images. In practice, we introduce a penalty on α̂ with sufficiently large Lagrange multipliers and solve the resulting unconstrained problem. To avoid convergence to unfavorable local solutions, we also employ parallel search with n initial guesses of α̂ drawn through Latin hypercube sampling (LHS).
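The penalty-plus-parallel-search procedure can be sketched on a toy non-convex objective. The sinusoidal loss below is a hypothetical stand-in for the LPIPS objective, and the penalty weight is illustrative; `scipy.stats.qmc` provides the Latin hypercube sampler.

```python
import numpy as np
from scipy.stats import qmc
from scipy.optimize import minimize

# Toy surrogate for the attribution objective: non-convex in alpha, with the
# box constraint handled as a quadratic penalty (large multiplier mu).
alpha_true = np.array([0.7, -0.4])
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

def objective(a, mu=10.0):
    data_term = np.sum((np.sin(3 * a) - np.sin(3 * alpha_true)) ** 2)
    penalty = mu * np.sum(np.clip(a - hi, 0, None) ** 2
                          + np.clip(lo - a, 0, None) ** 2)
    return data_term + penalty

# Parallel search: initial guesses drawn by Latin hypercube sampling over the
# box, one local optimization per start, keep the best solution found.
sampler = qmc.LatinHypercube(d=2, seed=0)
starts = qmc.scale(sampler.random(n=20), lo, hi)
best = min((minimize(objective, x0) for x0 in starts), key=lambda r: r.fun)
assert best.fun < 1e-6   # at least one start reaches a global minimum
```

The combination matters: a single start frequently lands in a spurious local minimum of a non-convex loss, while LHS spreads the starts evenly over the feasible box.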

3.2. ACCURACY-QUALITY TRADEOFF

Attribution accuracy. Define J_w = ∇g(w) and H_w = J_w^T J_w. Let H̄_w = E_{w∼p_w}[H_w] be the mean Gram matrix, and H̄_ϕ = E_{α∼p_α}[H_{w_ϕ(α)}] its watermarked version. Let l: R^{d_x} × R^{d_x} → R be a distance metric between two images, and (α̂, ϕ̂) the estimates. To analyze how (V, U) affects the attribution accuracy, we use the following simplifications and assumptions: (A1) l(·,·) is the l_2 norm. (A2) Both ∥ϵ_α∥ and σ are small. In practice we achieve small ∥ϵ_α∥ through parallel search (see Appendix B). (A3) Since our focus is on ϵ_ϕ, we further assume that the estimate of α, denoted by α̂(α), is independent of ϕ, and that ϵ_α is constant. This allows us to ignore the subroutine for computing α̂(α) and turns the estimation problem into an optimization with respect to ϵ_ϕ only. Formally, we have the following proposition (see Appendix A.1 for the proof):

Proposition 1. ∃ c > 0 such that if σ ≤ c and ∥ϵ_α∥_2 ≤ c, the watermark estimation problem min_ϕ̂ E_{α∼p_α} ∥g(w_ϕ̂(α̂(α))) − g(w_ϕ(α))∥_2^2 has an error ϵ_ϕ = −(σ^2 V^T H̄_ϕ V)^{-1} V^T H̄_ϕ U ϵ_α.

Remarks: (1) As in the classic design of experiments, one can reduce ∥ϵ_ϕ∥ by maximizing det(V^T H̄_ϕ V), which sets the columns of V as the eigenvectors associated with the largest d_ϕ eigenvalues of H̄_ϕ. However, H̄_ϕ is not computable, because ϕ is unknown during estimation, nor is it tractable, because J_{w_ϕ(α)} is large in practice. To this end, we propose to use the covariance of p_w, denoted by Σ_w, to replace H̄_ϕ in experiments. In Appendix C, we support this approximation empirically by showing that Σ_w and H̄_w (the non-watermarked mean Gram matrix) are qualitatively similar in that the principal components of both matrices offer disentangled semantic dimensions. (2) Let the kth largest eigenvalue of H̄_w be γ_k.
By setting the columns of V as the eigenvectors of H̄_w associated with the largest d_ϕ eigenvalues, and by noting that ϕ̂ is accurate only when all of its elements match ϕ (equation 1), the worst-case estimation error is governed by γ_{d_ϕ}^{-1}. This means that higher key capacity, i.e., larger d_ϕ, leads to worse attribution accuracy. (3) From the proposition, ϵ_ϕ = 0 if V and U are complementary sets of eigenvectors of H̄_ϕ. In practice, this decoupling between ϵ_ϕ and ϵ_α cannot be achieved due to the assumptions and approximations we made.

Generation quality. For the purpose of analysis, we approximate the original latent distribution p_w by w = µ + Uα + Vβ, where α ∼ N(0, diag(λ_U)), β ∼ N(0, diag(λ_V)), and µ = E_{w∼p_w}[w]. λ_U ∈ R^{d_w−d_ϕ} and λ_V ∈ R^{d_ϕ} are calibrated to match p_w. Denote λ_{V,max} = max_i{λ_{V,i}}. A latent distribution watermarked by ϕ is similarly approximated as w_ϕ = µ + Uα + σVϕ. With mild abuse of notation, let g be the mapping from the latent space to a feature space (usually defined by an Inception network in FID) that is continuously differentiable. Let the mean and covariance matrix of w_i be µ_i and Σ_i, respectively. Denote by H̄_U = E_α[J_{µ+Uα}^T J_{µ+Uα}] the mean Gram matrix in the subspace of U, and let γ_{U,max} be the largest eigenvalue of H̄_U. We have the following proposition to upper bound ∥µ_0 − µ_1∥_2^2 and |tr(Σ_0) − tr(Σ_1)|, both of which are related to the FID score for measuring generation quality (see Appendix A.2 for the proof):

Proposition 2. For any τ > 0 and η ∈ (0, 1), ∃ c(τ, η) > 0 and ν > 0 such that if σ ≤ c(τ, η) and λ_{V,i} ≤ c(τ, η) for all i = 1, ..., d_ϕ, then ∥µ_0 − µ_1∥_2^2 ≤ σ^2 γ_{U,max} d_ϕ + τ and |tr(Σ_0 − Σ_1)| ≤ λ_{V,max} γ_{U,max} d_ϕ + 2νσ√d_ϕ + τ with probability at least 1 − η.

Remarks: Recall that for improving attribution accuracy, a practical approach is to choose V as the eigenvectors associated with the largest eigenvalues of Σ_w.
Notice that with the approximated distribution, α ∼ N(0, diag(λ_U)) and β ∼ N(0, diag(λ_V)), we have Σ_w = diag([λ_U^T, λ_V^T]^T). On the other hand, from Proposition 2, generation quality improves if we minimize λ_{V,max} by choosing V according to the smallest eigenvalues of Σ_w. In addition, a smaller key capacity (d_ϕ) and a lower strength (σ) also improve generation quality. Propositions 1 and 2 together reveal the intrinsic accuracy-quality tradeoff.

Watermark secrecy. Lastly, the analysis of watermark secrecy is straightforward using the same proof techniques: PSNR is a monotonically decreasing function of the MSE ∥g(µ + Uα) − g(µ + Uα + σVϕ)∥_2^2, so we can use the following proposition to analyze the effect of the watermark variables on secrecy (proof in Appendix A.3):

Proposition 3. For any τ > 0, ∃ c(τ) > 0 such that if σ ≤ c(τ), E_{α∼p_α} ∥g(µ + Uα) − g(µ + Uα + σVϕ)∥_2^2 ≤ σ^2 γ_{U,max} d_ϕ + τ.
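For a linear generator the mean-shift bound of Proposition 2 holds without residual terms (τ = 0), so it can be checked numerically. The sketch below uses a random linear map as g, for which the mean Gram matrix H̄_U is exactly A^T A; all dimensions and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_phi, d_x = 8, 3, 16
A = rng.normal(size=(d_x, d_w))              # linear generator g(w) = A @ w

# Orthonormal complementary bases U (quality subspace) and V (watermark).
Q, _ = np.linalg.qr(rng.normal(size=(d_w, d_w)))
U, V = Q[:, :d_w - d_phi], Q[:, d_w - d_phi:]
sigma, phi = 0.3, np.array([1.0, 0.0, 1.0])  # strength and binary key

# For linear g, the mean Gram matrix equals A^T A, so gamma_{U,max} is
# bounded by the largest eigenvalue of A^T A.
gamma_max = np.linalg.eigvalsh(A.T @ A).max()

# Mean shift between unwatermarked (mu_0) and watermarked (mu_1) outputs:
# mu_1 - mu_0 = sigma * A @ V @ phi, since E[alpha] = E[beta] = 0.
shift = sigma * A @ (V @ phi)
assert shift @ shift <= sigma**2 * gamma_max * d_phi   # Proposition 2 bound
```

The chain behind the assert is ∥AVϕ∥² ≤ γ_max ∥Vϕ∥² = γ_max ∥ϕ∥² ≤ γ_max d_ϕ, mirroring the proof in Appendix A.2.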

4. EXPERIMENTS

In this section, we present empirical evidence of the accuracy-quality tradeoff and show an improved tradeoff over the previous SOTA using deep watermarks. Experiments are conducted both with and without postprocesses, including image noising, blurring, JPEG compression, and their combination.

4.1. EXPERIMENT SETTINGS

Models, data, and metrics. We conduct experiments on SG2 (Karras et al., 2020) and LDM (Rombach et al., 2022) models trained on various datasets, including FFHQ (Karras et al., 2019), AFHQ-Cat, and AFHQ-Dog (Choi et al., 2020). Generation quality is measured by the Fréchet Inception Distance (FID) (Heusel et al., 2017) and Inception score (IS) (Salimans et al., 2016), attribution accuracy by equation 1, and watermark secrecy by PSNR.

Deep watermark dimensions. The dimensions of the semantic latent spaces of SG2 and LDM are 512 and 12,288, respectively. For both models we approximate Σ_w using 10K samples drawn from p_w. We define the watermark dimensions V as a subset of eigenvectors of Σ_w associated with consecutive eigenvalues: V := PC[i:j], where PC is the full set of principal components of Σ_w sorted by their variances in descending order, and i and j are the starting and ending indices of the subset.

Attribution. To compute the empirical accuracy, we use 1K samples drawn from p_z for each watermark ϕ, and 1K watermarks where each bit is drawn independently from a Bernoulli distribution with p = 0.5. In Table 1, we show that both the constraints on α̂ and parallel search with 20 initial guesses improve the empirical attribution accuracy across models and datasets. Notably, constrained estimation is essential for successful attribution of LDMs. In these experiments, V is chosen as the eigenvectors associated with the 64 smallest eigenvalues of Σ_w as a worst-case scenario for attribution accuracy. The results suggest that when image postprocesses are not considered as a potential threat model, the attribution accuracy, generation quality, capacity (2^64), and watermark secrecy are all acceptable using the proposed method. Fig. 2 visualizes and compares deep watermarks generated from small vs. large eigenvalues of Σ_w.
Watermarks corresponding to small eigenvalues are non-semantic, while those corresponding to large eigenvalues create semantic changes. We will later show that semantic yet subtle (perceptually insignificant) watermarks are necessary to counter image postprocesses.

Accuracy-quality tradeoff. Table 2 summarizes the tradeoff when we vary the choice of V and the strength σ, while fixing the watermark length d_ϕ to 64. Then, in Table 3, we sweep d_ϕ while keeping V as the PCs associated with the smallest eigenvalues of Σ_w and σ = 1. The experiments are conducted on SG2 and LDM on the FFHQ dataset. The empirical results in Table 2 are consistent with our analysis: accuracy decreases while generation quality improves when V is moved from major to minor principal components. For watermark strength, however, we observe that the positive effect of strength on accuracy, as predicted by Proposition 1, is limited to small σ. This is because larger σ causes pixel values to go out of bounds, causing a loss of information. In Table 3, we summarize the attribution accuracy, FID, and PSNR scores under 32- to 128-bit keys. Accuracy and generation quality, in particular the latter, are both affected by d_ϕ as predicted.
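The empirical accuracy protocol of Sec. 4.1 (Bernoulli-sampled keys, exact-match decoding over many generations) can be sketched with a toy linear generator and an idealized projection decoder. Everything here is an illustrative stand-in for the paper's LPIPS-based estimator, including the noise level.

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_phi, sigma = 16, 8, 1.0
Q, _ = np.linalg.qr(rng.normal(size=(d_w, d_w)))
U, V = Q[:, :d_w - d_phi], Q[:, d_w - d_phi:]

def decode(w_phi, noise=0.0):
    """Estimate the key bits from a (possibly perturbed) watermarked latent."""
    return V.T @ (w_phi + noise * rng.normal(size=d_w)) / sigma > 0.5

n_keys, n_samples, hits = 100, 100, 0
for _ in range(n_keys):
    phi = rng.binomial(1, 0.5, size=d_phi)      # Bernoulli(0.5) key bits
    for _ in range(n_samples):
        alpha = rng.normal(size=d_w - d_phi)
        w_phi = U @ alpha + sigma * V @ phi     # watermarked latent
        hits += np.array_equal(decode(w_phi, noise=0.05), phi)

accuracy = hits / (n_keys * n_samples)          # exact-match accuracy, eq. (1)
assert accuracy >= 0.99
```

Note how the accuracy definition in equation 1 is all-or-nothing: a single flipped bit counts the whole attribution as a failure, which is why larger d_ϕ hurts accuracy.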

4.3. WATERMARK PERFORMANCE WITH POSTPROCESSING

We now consider more realistic scenarios where generated images are postprocessed before they are attributed, either maliciously as an attempt to remove the watermarks or unintentionally. Under this setting, our method achieves a better accuracy-quality tradeoff than shallow watermarking in two realistic settings: (1) when noising and JPEG compression are used as unknown postprocesses, and (2) when the set of postprocesses, rather than the ones actually chosen, is known.

Postprocesses. To keep our solution realistic, we solve the attribution problem by assuming that the potential postprocesses are unknown:

min_{α̂, ϕ̂} l(g(w_ϕ̂(α̂)), T(g(w_ϕ(α))))  s.t.  α̂_i ∈ [α_{l,i}, α_{u,i}], ∀i = 1, ..., d_w − d_ϕ,

where T: R^{d_x} → R^{d_x} is a postprocess function and T(g(w_ϕ(α))) is the given image from which the watermark is to be estimated. We assume that T does not change the image in a semantically meaningful way, because otherwise the image would lose its value for either an attacker or a benign user. Since our method adds semantically meaningful perturbations to the images, we expect such postprocesses to remove only limited watermark information. For Blurring, we use Gaussian kernels with sizes sampled from [3, 7, 9, 16, 25] and standard deviations from [0.5, 1.0, 1.5, 2.0]. We randomly sample the JPEG quality from [80, 70, 60, 50]. These parameters are chosen to be mild so that images do not lose their semantic content. Combo randomly chooses a subset of the three postprocesses through a binomial distribution with p = 0.5 and uses the same postprocess parameters.

Modified LPIPS metric. In addition to testing the worst-case scenario where postprocesses are completely unknown, we also consider cases where they are known. While this is unrealistic for individual postprocesses, it is worth investigating under the assumption that the set of postprocesses, rather than the ones actually chosen, is known. In this scenario, we show that modifying LPIPS according to the postprocess improves the attribution accuracy.
To explain: LPIPS is originally trained on a so-called "two alternative forced choice" (2AFC) dataset. Each data point of 2AFC contains three images: a reference, p_0, and p_1, where p_0 and p_1 are distorted in different ways based on the reference. A human evaluator then ranks p_0 and p_1 by their similarity to the reference. Here we propose the following modification to the dataset for training a postprocess-specific metric. Similar to 2AFC, for each data point we draw a reference image x from the default generative model to be watermarked, and define p_0 as the watermarked version of x. p_1 is then a postprocessed version of x given a specific T (or random combinations for Combo). To match the setting of 2AFC, we sample 64 × 64 patches from x, p_0, and p_1 as training samples. We then rank patches from p_1 as being more similar to those of x than patches from p_0. With this setting, the resulting LPIPS metric becomes more sensitive to watermarks than to mild postprocesses. The detailed training of the modified LPIPS follows the vgg-lin configuration in Zhang et al. (2018). It should be noted that, unlike the previous SOTA where watermarks (Kim et al., 2020) or encoder-decoder models (Yu et al., 2020) are retrained based on the known attacks, our watermarking mechanism, and therefore its generalization performance, is agnostic to postprocesses.

Figure 3: Each image in the second row is a watermarked image robust against the corresponding postprocess. The third row illustrates postprocessed images. The last row depicts the differences between the original (first row) and watermarked (second row) images as heatmaps. Although our method introduces large pixel-value changes, the watermarks are less perceptible than those of the baseline method (see second row).

Accuracy-quality tradeoff. We summarize the watermark performance metrics on SG2 and FFHQ in Table 4. The attribution accuracy reported here is estimated using the strongest parameters of each attack.
For Combo, we sequentially apply Blurring+Noising+JPEG as a deterministic worst-case attack. To estimate attribution accuracy, we solve the estimation problem of Sec. 4.3 with the postprocesses applied. The proposed method: We choose V as a subset of 32 consecutive eigenvectors of Σ_w starting from the 1st, 17th, and 33rd eigenvectors, denoted respectively by PC[0:32], PC[16:48], and PC[32:64] in the table. The watermarking strength σ is set to 3. Attribution results from both a standard and a postprocess-specific LPIPS metric are reported in the UK (unknown) and KN (known) columns, respectively. Accuracies for our method are computed based on 100 random watermark samples out of 2^32, each with 100 random generations. The baseline: We compare with the shallow watermarking method of Kim et al. (2020) (denoted BL). When the postprocesses are known, BL performs postprocess-specific computation to derive shallow watermarks that are optimally robust against the known postprocess. Results in the UK and KN columns for BL are respectively without and with the postprocess-specific watermark computation. BL accuracies are computed based on 10 watermarks, each with 100 random generations. It is worth noting that the shallow watermarking method is not as scalable as ours, and increasing the key capacity decreases its overall attribution accuracy (see Kim et al., 2020). Also recall that the key length affects attribution accuracy (Proposition 1). We therefore conduct a fairer comparison to highlight the advantage of our method: we choose a subset of watermarks PC[32:40] (256 watermarks) and report performance in Table 5, where accuracies are computed using the same settings as before. Visual comparisons between our method (PC[32:40]) and the baseline can be found in Fig. 3: to maintain attribution accuracy, high-strength shallow watermarks, in the form of color patches, are needed around eyes and noses, and they significantly lower the generation quality.
In comparison, our method uses semantic changes that are robust to postprocesses. The semantic dimensions, however, need to be carefully chosen for the watermark to be perceptually subtle.

Watermark secrecy. In all experiments, our method has worse watermark secrecy than the baseline according to PSNR. This is because PSNR measures pixel-wise differences and thus does not favor semantic changes such as ours. Nonetheless, we argue that our method has better secrecy than the baseline, because subtle semantic changes across images are harder to recognize (and thus remove) than the common artifacts introduced by shallow watermarking (see Fig. 3).
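The postprocesses of Sec. 4.3 can be sketched with Pillow and numpy, using the blur and JPEG parameter ranges stated above. The noise standard deviation is an assumed value, since the noising parameters are not specified in the text.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

rng = np.random.default_rng(0)

def noising(img, std=5.0):
    """Additive Gaussian pixel noise; std=5 (8-bit units) is an assumed value."""
    arr = np.asarray(img, dtype=np.float32)
    arr += rng.normal(scale=std, size=arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def blurring(img):
    """Gaussian blur with radius drawn from the std values listed in Sec. 4.3."""
    return img.filter(ImageFilter.GaussianBlur(rng.choice([0.5, 1.0, 1.5, 2.0])))

def jpeg(img, qualities=(80, 70, 60, 50)):
    """JPEG round-trip at a randomly chosen quality level."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=int(rng.choice(qualities)))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def combo(img, p=0.5):
    """Randomly keep each postprocess with probability p (binomial selection)."""
    for t in (noising, blurring, jpeg):
        if rng.random() < p:
            img = t(img)
    return img

img = Image.fromarray(rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8))
out = combo(img)
assert out.size == img.size and out.mode == "RGB"
```

For the deterministic worst-case Combo attack used in Table 4, the three functions would instead be composed in the fixed order blurring, noising, jpeg.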

5. CONCLUSION

This paper investigated deep watermarking as a solution for attributing generative models. Our solution achieves a better tradeoff between attribution accuracy and generation quality than the previous SOTA based on shallow watermarks, and it has an extremely low computational cost compared with SOTA methods that require encoder-decoder training with high data complexity, rendering it more scalable to attributing large models with high-dimensional latent spaces.

Limitations and future directions:

(1) There is currently a lack of certification of the attribution accuracy, due to the nonlinear nature of both the watermarking and the watermark estimation processes. Formally, by considering both the generation and estimation processes as discrete-time dynamics, such certification would require forward reachability analysis of the watermarked contents and backward reachability analysis of the watermark, e.g., convex approximation of the supports of p_{x,i} and ϕ̂. It is worth investigating whether existing neural network certification methods can be applied. (2) Our method extracts watermarks from the training data. Even with feature decomposition, the number of features that can be used as watermarks is limited; thus, the accuracy-quality tradeoff is governed by the data. It would be interesting to see whether auxiliary datasets can help learn novel and perceptually insignificant watermarks, e.g., background patterns, that are robust against postprocesses.

A PROOF OF PROPOSITIONS

A.1 PROPOSITION 1

Define J_w = ∇g(w), H_w = J_w^T J_w, and H̄_ϕ = E_{α∼p_α}[H_{Uα+σVϕ}], where p_α is induced by p_w. Let x_ϕ(α) be a content parameterized by (α, ϕ). Denote by ϵ_a = â − a the estimation error from the ground-truth parameter a. Assume that the estimate α̂(α) is computed independently of ϕ, and that ϵ_α is constant. Proposition 1 states:

Proposition 1. ∃ c > 0 such that if σ ≤ c and ∥ϵ_α∥_2 ≤ c, the watermark estimation problem

min_ϕ̂ E_{α∼p_α} ∥g(Uα̂(α) + σVϕ̂) − x_ϕ(α)∥_2^2    (3)

has an estimation error ϵ_ϕ = −(σ^2 V^T H̄_ϕ V)^{-1} V^T H̄_ϕ U ϵ_α.

Proof. Let x̂ := g(Uα̂(α) + σVϕ̂). We have

x̂ = g(U(α + ϵ_α) + σV(ϕ + ϵ_ϕ)).

By Taylor expansion,

x̂ = g(Uα + σVϕ) + J_w(Uϵ_α + σVϵ_ϕ) + o(Uϵ_α + σVϵ_ϕ) = x_ϕ(α) + J_w(Uϵ_α + σVϵ_ϕ) + o(Uϵ_α + σVϵ_ϕ).

We then have

∥x_ϕ(α) − x̂∥_2^2 = ∥J_w(Uϵ_α + σVϵ_ϕ) + o(Uϵ_α + σVϵ_ϕ)∥_2^2 = ∥J_w(Uϵ_α + σVϵ_ϕ)∥_2^2 + o(Uϵ_α + σVϵ_ϕ)^T J_w(Uϵ_α + σVϵ_ϕ).

For any τ > 0, there exists c such that if σ ≤ c and ∥ϵ_α∥_2 ≤ c,

∥x_ϕ(α) − x̂∥_2^2 ≤ ∥J_w(Uϵ_α + σVϵ_ϕ)∥_2^2 + τ = σ^2 ϵ_ϕ^T V^T H_w V ϵ_ϕ + 2ϵ_ϕ^T V^T H_w U ϵ_α + ϵ_α^T U^T H_w U ϵ_α + τ.

Removing the terms independent of ϵ_ϕ, we reformulate equation 3 as

min_{ϵ_ϕ} σ^2 ϵ_ϕ^T V^T H̄_ϕ V ϵ_ϕ + 2ϵ_ϕ^T V^T H̄_ϕ U ϵ_α,

the solution of which is ϵ_ϕ = −(σ^2 V^T H̄_ϕ V)^{-1} V^T H̄_ϕ U ϵ_α.

A.2 PROPOSITION 2

Consider two distributions. The first is w_0 = µ + Uα + Vβ, where µ ∈ R^{d_w}, α ∼ N(0, diag(λ_U)), and β ∼ N(0, diag(λ_V)); diag(λ) is a diagonal matrix whose diagonal elements follow λ. The second is w_1 = µ + Uα + σVϕ, where σ > 0 and ϕ ∈ {0, 1}^{d_ϕ}. Let g: R^{d_w} → R^{d_x} be in C^1. Let the mean and covariance matrix of w_i be µ_i and Σ_i. Denote by H̄_U = E_α[J_{µ+Uα}^T J_{µ+Uα}] the mean Gram matrix, and let γ_{U,max} be the largest eigenvalue of H̄_U. Proposition 2 states:
Proposition 2. For any τ > 0 and η ∈ (0, 1), there exist c(τ, η) > 0 and ν > 0 such that if σ ≤ c(τ, η) and λ_{V,i} ≤ c(τ, η) for all i = 1, ..., d_ϕ, then ∥µ_0 − µ_1∥_2^2 ≤ σ^2 γ_{U,max} d_ϕ + τ and |tr(Σ_0 − Σ_1)| ≤ λ_{V,max} γ_{U,max} d_ϕ + 2νσ√d_ϕ + τ with probability at least 1 − η.

Proof. We start with ∥µ_0 − µ_1∥_2^2. From Taylor expansion and the independence between α and β, we have

µ_0 := E_{α,β}[g(µ + Uα + Vβ)] = E_α[g(µ + Uα)] + E_{α,β}[J_{µ+Uα}Vβ + o(J_{µ+Uα}Vβ)] = E_α[g(µ + Uα)] + E_{α,β}[o(J_{µ+Uα}Vβ)],

µ_1 := E_α[g(µ + Uα + σVϕ)] = E_α[g(µ + Uα) + o(σJ_{µ+Uα}Vϕ)] + σE_α[J_{µ+Uα}Vϕ] = µ_0 + σE_α[J_{µ+Uα}Vϕ] + E_{α,β}[o(J_{µ+Uα}V(σϕ − β))].

Let v = Vϕ. With orthonormal V and a binary-coded ϕ, we have

∥v∥_2^2 = ϕ^T V^T Vϕ = ∥ϕ∥_2^2 ≤ d_ϕ.    (5)

For the residual term ∥E_{α,β}[o(J_{µ+Uα}V(σϕ − β))]∥_2^2 and any τ > 0 and η ∈ (0, 1), there exists c(τ, η) > 0 such that if σ ≤ c(τ, η) and λ_{V,i} ≤ c(τ, η) for all i = 1, ..., d_ϕ, we have

Pr(∥E_{α,β}[o(J_{µ+Uα}V(σϕ − β))]∥_2^2 ≤ τ) ≥ 1 − η.    (6)

Lastly, we have

∥E_α[J_{µ+Uα}v]∥_2^2 ≤ E_α[v^T J_{µ+Uα}^T J_{µ+Uα} v] = v^T H̄_U v ≤ γ_{U,max}∥v∥_2^2 ≤ γ_{U,max} d_ϕ.    (7)

Combining equations 5, 6, and 7, we have, with probability at least 1 − η,

∥µ_0 − µ_1∥_2^2 ≤ σ^2 γ_{U,max} d_ϕ + τ.

For the covariances, let Σ_U = Cov(g(µ + Uα)). We have

Σ_0 := E_{α,β}[(g(µ + Uα + Vβ) − µ_0)(g(µ + Uα + Vβ) − µ_0)^T] = Σ_U + E_α[J_{µ+Uα}V diag(λ_V)V^T J_{µ+Uα}^T] + E_{α,β}[o(J_{µ+Uα}Vβ)(g(µ + Uα + Vβ) − µ_0)^T],

Σ_1 := E_α[(g(µ + Uα + σVϕ) − µ_1)(g(µ + Uα + σVϕ) − µ_1)^T] = Σ_U + σ^2 Cov(J_{µ+Uα}Vϕ) + 2σ Cov(g(µ + Uα), J_{µ+Uα}Vϕ + o(J_{µ+Uα}Vϕ)).
For tr(Σ_0), using the same treatment for the residual, we have for any τ > 0 and η ∈ (0, 1) that there exists c(τ, η) > 0 such that if λ_{V,i} ≤ c(τ, η) for all i = 1, ..., d_ϕ, the following upper bound holds with probability at least 1 − η:

    tr(Σ_0) ≤ tr(Σ_U) + tr(E_α[J_{μ+Uα} V diag(λ_V) V^T J_{μ+Uα}^T]) + τ
            ≤ tr(Σ_U) + λ_{V,max} tr(E_α[J_{μ+Uα} V V^T J_{μ+Uα}^T]) + τ
            = tr(Σ_U) + λ_{V,max} tr(E_α[V^T J_{μ+Uα}^T J_{μ+Uα} V]) + τ
            ≤ tr(Σ_U) + λ_{V,max} γ_{U,max} tr(V^T V) + τ
            ≤ tr(Σ_U) + λ_{V,max} γ_{U,max} d_ϕ + τ.   (10)

For the lower bound, we have tr(Σ_0) ≥ tr(Σ_U).

For tr(Σ_1), we first denote by J_i^T the ith row of J_{μ+Uα}, by Σ_{J_i} its covariance matrix, and by σ_i^2 the maximum eigenvalue of Σ_{J_i}. Then with binary-coded ϕ, we have

    Var(J_i^T Vϕ) = ϕ^T V^T Cov(J_i) V ϕ ≤ σ_i^2 d_ϕ.   (11)

Then let g_i (resp. v_i) be the ith element of g(μ + Uα) (resp. J_{μ+Uα} Vϕ), and σ_{U,i}^2 the ith diagonal element of Σ_U. Using equation 11, we have the following bound on the trace of the covariance between g(μ + Uα) and J_{μ+Uα} Vϕ:

    |tr(Cov(g(μ + Uα), J_{μ+Uα} Vϕ))| = |Σ_{i=1}^{d_x} Cov(g_i, v_i)| ≤ Σ_{i=1}^{d_x} σ_{U,i} σ_i √d_ϕ.

Lastly, by ignoring σ^2 terms and borrowing the same τ, η, and c(τ, η), we have with probability at least 1 − η:

    tr(Σ_1) ≤ tr(Σ_U) + 2σ tr(Cov(g(μ + Uα), J_{μ+Uα} Vϕ)) + τ ≤ tr(Σ_U) + 2σ Σ_{i=1}^{d_x} σ_{U,i} σ_i √d_ϕ + τ,

and

    tr(Σ_1) ≥ tr(Σ_U) − 2σ Σ_{i=1}^{d_x} σ_{U,i} σ_i √d_ϕ.

Combining the bounds, we obtain

    tr(Σ_0) − tr(Σ_1) ≤ λ_{V,max} γ_{U,max} d_ϕ + 2σ Σ_{i=1}^{d_x} σ_{U,i} σ_i √d_ϕ + τ,

and

    tr(Σ_0) − tr(Σ_1) ≥ −2σ Σ_{i=1}^{d_x} σ_{U,i} σ_i √d_ϕ − τ.

A.3 PROPOSITION 3

Proposition 3. For any τ > 0, there exists c(τ) > 0 such that if σ ≤ c(τ),

    E_{α∼p_α} ∥g(μ + Uα) − g(μ + Uα + σVϕ)∥_2^2 ≤ σ^2 γ_{U,max} d_ϕ + τ.

Proof. We can reuse the same techniques as in the previous proofs.
For any τ > 0, there exists c(τ) > 0 such that when σ ≤ c(τ), the bound on the MSE can be derived as follows:

    E_{α∼p_α} ∥g(μ + Uα) − g(μ + Uα + σVϕ)∥_2^2
        = E_{α∼p_α} ∥σ J_{μ+Uα} Vϕ + o(σ J_{μ+Uα} Vϕ)∥_2^2
        ≤ E_{α∼p_α} ∥σ J_{μ+Uα} Vϕ∥_2^2 + τ
        = σ^2 E_{α∼p_α}[ϕ^T V^T J_{μ+Uα}^T J_{μ+Uα} V ϕ] + τ
        ≤ σ^2 γ_{U,max} d_ϕ + τ.

B CONVERGENCE ON α

In the proofs, we assume that ∥ϵ_α∥_2 is small and constant. Here we show empirical estimation results with SG2 on the FFHQ, AFHQ-DOG, and AFHQ-CAT datasets. The results in Fig. 4(a) are averaged over 100 random α and 100 random ϕ, and use parallel search on α during the estimation.

C QUALITATIVE SIMILARITY BETWEEN H̄_w AND Σ_w

Since computing H̄_w for large models is intractable, here we train an SG2 on MNIST to estimate H̄_w. Fig. 4(b) summarizes perturbed images from a randomly chosen reference along the principal components of H̄_w and Σ_w. Note that both have quickly diminishing eigenvalues; therefore, most components other than the few major ones lead to imperceptible changes in the image space.
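As a sanity check on the estimation problem in equation 3 of Appendix A.1, the least-squares watermark decoding can be sketched for a toy linear generator g(w) = Aw, for which J_w = A everywhere. This is an illustrative sketch only, not the paper's implementation; all dimensions and variable names below are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_x, d_alpha, d_phi = 16, 32, 8, 4

A = rng.standard_normal((d_x, d_w))            # toy linear generator: g(w) = A @ w
g = lambda w: A @ w

# Orthonormal content basis U and watermark basis V (mutually orthogonal)
Q, _ = np.linalg.qr(rng.standard_normal((d_w, d_alpha + d_phi)))
U, V = Q[:, :d_alpha], Q[:, d_alpha:]

sigma = 0.1
phi = rng.integers(0, 2, d_phi).astype(float)  # ground-truth binary key
alpha = rng.standard_normal(d_alpha)
x = g(U @ alpha + sigma * V @ phi)             # watermarked content x_phi(alpha)

# Decoding: minimize ||g(U @ alpha_hat + sigma * V @ phi_hat) - x||^2 over phi_hat.
# For a linear g this is ordinary least squares on the residual.
alpha_hat = alpha + 1e-4 * rng.standard_normal(d_alpha)  # small eps_alpha
B = sigma * (A @ V)
phi_hat, *_ = np.linalg.lstsq(B, x - g(U @ alpha_hat), rcond=None)

key = (phi_hat > 0.5).astype(float)            # round back to a binary key
assert np.array_equal(key, phi)                # key recovered despite eps_alpha != 0
```

Consistent with Proposition 1, shrinking ϵ_α drives the decoding error ϵ_ϕ to zero, while taking σ too small makes the recovery ill-conditioned.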

D ABLATION STUDY

In this section, we estimate attribution accuracy under various attack parameters with multiple editing directions (see Tabs. 6, 7, 8, and 9). The image quality evaluation is available in Tab. 10, and more visualizations can be found in Fig. 5.
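For reference, the post-processing attacks swept in the ablation (Gaussian blur, additive Gaussian noise, JPEG compression, and their combination) can be sketched as below. This is a hedged sketch using PIL and NumPy with our own helper names; the exact parameter grids are those listed in the tables.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def blur_attack(img, sigma=1.0):
    # Gaussian blur; PIL's GaussianBlur takes the blur radius as its argument.
    return np.asarray(Image.fromarray(img).filter(ImageFilter.GaussianBlur(sigma)))

def noise_attack(img, sigma=0.1, seed=0):
    # Additive Gaussian noise applied in [0, 1] scale, clipped back to uint8.
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) / 255.0 + rng.normal(0.0, sigma, img.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).round().astype(np.uint8)

def jpeg_attack(img, quality=50):
    # JPEG re-compression at the given quality factor Q.
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf))

def combo_attack(img, blur_sigma=2.0, noise_sigma=0.2, quality=50):
    # Combination attack, e.g. the strongest "T4" setting of the ablation.
    return jpeg_attack(noise_attack(blur_attack(img, blur_sigma), noise_sigma), quality)

img = np.random.default_rng(0).integers(0, 256, (64, 64, 3), dtype=np.uint8)
out = combo_attack(img)
assert out.shape == img.shape and out.dtype == np.uint8
```

Attribution under the "KN" setting would additionally apply the same transform to the decoder's reference images before matching.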

E SIMPLE WATERMARKING STRATEGY OF GENERATIVE MODELS

In the main paper, we studied the deep watermarking methodology using SG2 and LDM. Both models have a multi-layer mapping network that maps a Gaussian distribution to a disentangled latent distribution. As discussed, this well-trained latent distribution enables the generative models to create realistic images and to embed watermarks without quality degradation. Introducing a disentanglement mapping network has therefore become mainstream in generative model design. However, we would also like to test the effectiveness of our approach on generative models without a disentanglement mapping, by proposing a naive deep watermarking strategy. To this end, we employ BigGAN (Brock et al., 2018). Unlike SG2 and LDM, BigGAN generates images in R^{d_x} from vectors z sampled directly from a d_z-dimensional Gaussian distribution. Instead of simply picking random dimensions and perturbing those elements of z, we applied the deep watermarking method of Sec. 3. Since z does not have disentangled properties, this method semantically changes the generated image. Nevertheless, this approach still achieves high attribution accuracy and competitive FID scores.
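As a sketch of this naive strategy (variable names are hypothetical, and in practice the key must be decoded from generated images rather than from latents), watermarking an entangled Gaussian latent z amounts to overwriting an orthonormal subspace V with the scaled key:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_phi, sigma = 128, 32, 1.0

# Random orthonormal basis of R^{d_z}; last d_phi columns act as watermark dims V
Q, _ = np.linalg.qr(rng.standard_normal((d_z, d_z)))
U, V = Q[:, :-d_phi], Q[:, -d_phi:]

def watermark_latent(z, phi, sigma=sigma):
    # Keep the content coordinates U^T z, overwrite the V-subspace with sigma*phi.
    return U @ (U.T @ z) + sigma * (V @ phi)

phi = rng.integers(0, 2, d_phi).astype(float)   # one user's binary key
z = rng.standard_normal(d_z)                    # BigGAN input noise
w = watermark_latent(z, phi)                    # feed w to the generator instead of z

# In latent space the key is exactly recoverable by projection onto V:
assert np.allclose(V.T @ w / sigma, phi)
```

Because z has no disentangled directions, perturbing along V semantically changes the generated image, as noted above.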

E.2 EXPERIMENTS

BigGAN's input z is a 128-dimensional Gaussian sample (d_z = 128), and we set d_ϕ = 32 for the following experiments. We test this approach on randomly selected ImageNet (Deng et al., 2009) classes, including cheeseburger, fountain, golden retriever, husky, Persian cat, and white wolf. For each class, we generate 1,000 images and measure attribution accuracy and generation quality (FID). As shown in Table 11 and Figure 6, this approach preserves both the quantitative and qualitative performance of the pretrained BigGAN.
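Attribution accuracy can then be measured by matching each decoded key against the distributor's database of registered user keys. A minimal nearest-key sketch (hypothetical names, not the paper's implementation) is:

```python
import numpy as np

def attribute(decoded_key, key_db):
    # Return the index of the registered user key nearest in Hamming distance.
    return int(np.argmin(np.abs(key_db - decoded_key).sum(axis=1)))

rng = np.random.default_rng(1)
key_db = rng.integers(0, 2, (100, 32))    # 100 users, each with a d_phi = 32 key
decoded = key_db[7].copy()
decoded[0] ^= 1                           # one bit corrupted by post-processing
assert attribute(decoded, key_db) == 7    # still attributed to user 7
```

Attribution accuracy is then the fraction of generated images whose decoded key maps back to the correct user.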



Figure 2: Visualization of watermarks along minor and major principal components of the covariance of the latent distribution. (Top) StyleGAN2. (Bottom) Latent Diffusion Model (LDM).

Figure 3: Comparison of generation quality between our method and the baseline at similar attribution accuracy. The first row shows original images generated without watermarking. Each image in the second row is a watermarked image robust to the corresponding post-process. The third row illustrates the post-processed images. The last row depicts the differences between the original (first row) and watermarked (second row) images as heatmaps. Even where our method shows large pixel-value changes, the watermarks are not perceptible compared with the baseline method (see second row).

Figure 4: (a) Average percentage error rate on α. (b) Comparison between watermarks guided by Σ_w and H̄_w. The editing strengths for the top two rows and bottom two rows are 0.05 and 0.2, respectively.

Attribution accuracy and generation quality of the proposed method. FID-BL is the baseline FID score. ↑ (↓) indicates higher (lower) is desired. Standard deviation is in parentheses. We present generation quality results in Table 1. Since the least-variant principal components are used as watermarks, generation quality (FID) and watermark secrecy (PSNR) are preserved. We note that a PSNR value ≥ 30 dB is conventionally considered acceptable for watermarks (Mahto
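The PSNR used as the secrecy measure above follows the standard definition; a small sketch (hypothetical helper name) is:

```python
import numpy as np

def psnr(x, y, peak=255.0):
    # Peak signal-to-noise ratio in dB between two images of the same shape.
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

x = np.zeros((8, 8), dtype=np.uint8)
y = np.full((8, 8), 5, dtype=np.uint8)   # uniform perturbation of 5 gray levels
v = psnr(x, y)                           # ~34.15 dB, above the 30 dB threshold
assert 34.1 < v < 34.2
```

A watermarked image whose per-pixel deviation from the original is this small would thus pass the conventional ≥ 30 dB secrecy bar.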

Tradeoff between attribution accuracy (Att.) and generation quality (FID, IS) under different watermarking directions (PC) and strengths (σ).

Attribution accuracy and generation quality for different watermark lengths. FID-BL is the baseline FID score. ↑ (↓) indicates higher (lower) is desired. Standard deviation is in parentheses.

Comparison of the accuracy-quality tradeoff between the proposed and baseline methods under image post-processes. StyleGAN2 and FFHQ; watermarking strength σ = 3. The FID score of the baseline method is 96.24. KN (UK) stands for attributability measured with (without) knowledge of the attack. Standard deviation is in parentheses.

Accuracy-quality tradeoff under the Combo attack. V is defined as the 8, 16, and 32 eigenvectors of Σ_w starting from the 33rd eigenvector. σ = 3. KN (UK) stands for attributability measured with (without) knowledge of the attack. Standard deviation is in parentheses.

Attributability table for the Blurring attack. σ refers to the standard deviation of the Gaussian blur filter (filter size 25). Results with (without) knowledge of the attack are reported under KN (UK).

Attributability table for the Noise attack. σ refers to the standard deviation of the Gaussian noise distribution. Results with (without) knowledge of the attack are reported under KN (UK).

Attributability table for the JPEG attack. Q refers to the quality metric of JPEG compression. Results with (without) knowledge of the attack are reported under KN (UK).

Attributability table for the combination attack. From T1 to T4, the attack parameters combine the weakest through the strongest parameters of each attack (e.g., T4 is [σ_blur = 2.0, σ_noise = 0.2, Q_JPEG = 50]). Results with (without) knowledge of the attack are reported under KN (UK).

Quality comparison table. Standard deviations are in parentheses; the baseline score is also given in parentheses.

Quality changes and accuracy of the proposed method. FID-BL is the baseline FID score of each class. ↑ (↓) indicates higher (lower) is desired.

