DECENTRALIZED ATTRIBUTION OF GENERATIVE MODELS

Abstract

Growing applications of generative models have led to new threats such as malicious personation and digital copyright infringement. One solution to these threats is model attribution, i.e., the identification of the user-end model from which the contents under question were generated. Existing studies showed the empirical feasibility of attribution through a centralized classifier trained on all user-end models. However, this approach is not scalable in reality, as the number of models grows over time. Neither does it provide an attributability guarantee. To this end, this paper studies decentralized attribution, which relies on binary classifiers associated with each user-end model. Each binary classifier is parameterized by a user-specific key and distinguishes its associated model distribution from the authentic data distribution. We develop sufficient conditions on the keys that guarantee an attributability lower bound. Our method is validated on the MNIST, CelebA, and FFHQ datasets. We also examine the trade-off between generation quality and robustness of attribution against adversarial post-processes.¹

* Equal contribution.
¹ https://github.com/ASU-Active-Perception-Group/decentralized_attribution_of_generative_models

This paper investigates the following question: What are the sufficient conditions on the keys so that the user-end generative models can achieve distinguishability individually and attributability collectively, while maintaining their generation quality?

Contributions We claim the following contributions:

1. We develop sufficient conditions on the keys for distinguishability and attributability, which connect these metrics with the geometry of the data distribution, the angles between keys, and the generation quality.

2. The sufficient conditions lead to simple design rules for the keys: keys should be (1) data compliant, i.e., φ^T x < 0 for x ∼ P_D, and (2) orthogonal to each other. We validate these rules using DCGAN (Radford et al., 2015) and StyleGAN (Karras et al., 2019a) on benchmark datasets including MNIST (LeCun & Cortes, 2010), CelebA (Liu et al., 2015), and FFHQ (Karras et al., 2019a). See Fig. 1 for a visualization of the attributable distributions perturbed from the authentic FFHQ dataset.

3. We empirically test the trade-off between generation quality and robust attributability under random post-processes including image blurring, cropping, noising, JPEG conversion, and a combination of all four.

1. INTRODUCTION

Figure 1: FFHQ dataset projected to the space spanned by two keys φ_1 and φ_2. We develop sufficient conditions for model attribution: perturbing the authentic dataset along different keys with mutual angles larger than a data-dependent threshold guarantees attributability of the perturbed distributions. (a) A threshold of 90 deg suffices for the benchmark datasets (MNIST, CelebA, FFHQ). (b) Smaller angles may not guarantee attributability.

Recent advances in generative models (Goodfellow et al., 2014) have enabled the creation of synthetic contents that are indistinguishable even by the naked eye (Pathak et al., 2016; Zhu et al., 2017; Zhang et al., 2017; Karras et al., 2017; Wang et al., 2018; Brock et al., 2018; Miyato et al., 2018; Choi et al., 2018; Karras et al., 2019a;b; Choi et al., 2019). These successes have raised serious concerns about emerging threats due to the applications of generative models (Kelly, 2019; Breland, 2019). This paper is concerned with two particular types of threats, namely malicious personation (Satter, 2019) and digital copyright infringement. In the former, the attacker uses generative models to create and disseminate inappropriate or illegal contents; in the latter, the attacker steals the ownership of a copyrighted content (e.g., an art piece created with the assistance of a generative model) by making modifications to it.

We study model attribution, a solution that may address both threats. Model attribution is defined as the identification of the user-end model from which the contents under question were generated. Existing studies demonstrated the empirical feasibility of attribution through a centralized classifier trained on all existing user-end models (Yu et al., 2018). However, this approach is not scalable in reality, where the number of models grows over time. Neither does it provide an attributability guarantee.
To this end, we propose in this paper a decentralized attribution scheme: instead of a centralized classifier, we use a set of binary linear classifiers, one associated with each user-end model. Each classifier is parameterized by a user-specific key and distinguishes its associated model distribution from the authentic data distribution. For correct attribution, we expect one-hot classification outcomes for generated contents, and a zero vector for authentic data. To achieve correct attribution, we study the sufficient conditions on the user-specific keys that guarantee an attributability lower bound. The resultant conditions are used to develop an algorithm for computing the keys. Lastly, we assume that attackers can post-process generated contents to potentially deny attribution, and study the trade-off between generation quality and robustness of attribution against post-processes.

Problem formulation We assume that for a given dataset D ⊂ R^{d_x}, the registry generates user-specific keys Φ := {φ_1, φ_2, ...}, where φ_i ∈ R^{d_x} and ||φ_i|| = 1; || · || is the l_2 norm. A user-end generative model is denoted by G_φ(·; θ): R^{d_z} → R^{d_x}, where z and x are the latent and output variables, respectively, and θ are the model parameters. When necessary, we will suppress θ and φ to reduce the notational burden. The dissemination of the user-end models is accompanied by a public service that tells whether a query content belongs to G_φ (labeled as 1) or not (labeled as -1). We model the underlying binary linear classifier as f_φ(x) = sign(φ^T x). Note that linear models are necessary for the development of the sufficient conditions of attribution presented in this paper; sufficient conditions for nonlinear classifiers are worth exploring in the future.
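As a concrete illustration of this protocol, the following sketch (with toy dimensions, keys, and data that are ours, not the paper's) implements the attribution rule implied by the classifiers f_φ: a content is attributed to user i only when its own classifier outputs 1 and all other classifiers output -1.

```python
import numpy as np

def attribute(x, keys):
    """Decentralized attribution: run every user's binary linear
    classifier f_phi(x) = sign(phi^T x) and return the index of the
    single key that fires, or None for authentic (all-negative) or
    ambiguous outcomes. `keys` is an (N, d) array of unit-norm keys."""
    scores = keys @ x                      # phi_i^T x for every user i
    positive = np.flatnonzero(scores > 0)  # classifiers that output +1
    if len(positive) == 1:                 # one-hot outcome: attributable
        return int(positive[0])
    return None                            # authentic data or ambiguous

# Toy example with two orthogonal, data-compliant keys in R^2.
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
x_authentic = np.array([-0.5, -0.5])       # phi^T x < 0 for both keys
x_model_0 = x_authentic + 2.0 * keys[0]    # output perturbed along key 0

assert attribute(x_authentic, keys) is None
assert attribute(x_model_0, keys) == 0
```

The one-hot requirement is what makes distinguishability of each model necessary but not sufficient for attributability: every other user's classifier must also reject the content.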
The following quantities are central to our investigation:

(1) Distinguishability of G_φ measures the accuracy of f_φ at classifying G_φ against D:

D(G_φ) := (1/2) E_{x∼P_{G_φ}, x_0∼P_D} [1(f_φ(x) = 1) + 1(f_φ(x_0) = -1)].   (1)

Here P_D is the authentic data distribution, and P_{G_φ} the user-end distribution dependent on φ. G_φ is (1-δ)-distinguishable for some δ ∈ (0, 1] when D(G_φ) ≥ 1 - δ.

(2) Attributability measures the averaged multi-class classification accuracy of each model distribution over the collection G := {G_{φ_1}, ..., G_{φ_N}}:

A(G) := (1/N) Σ_{i=1}^N E_{x∼P_{G_{φ_i}}} [1(φ_j^T x < 0 ∀ j ≠ i, and φ_i^T x > 0)].   (2)

G is (1-δ)-attributable when A(G) ≥ 1 - δ.

(3) Lastly, we denote by G(·; θ_0) (or G_0 for short) the root model trained on D, and assume P_{G_0} = P_D. We measure the (lack of) generation quality of G_φ by the FID score (Heusel et al., 2017) and the l_2 norm of the mean output perturbation:

Δx(φ) = E_{z∼P_z} [G_φ(z; θ) - G(z; θ_0)],   (3)

where P_z is the latent distribution.

From the definitions (Eq. (1) and Eq. (2)), achieving distinguishability is necessary for attributability. In the following, we first develop the sufficient conditions for distinguishability through Proposition 1 and Theorem 1, and then those for attributability through Theorem 2.

Distinguishability through watermarking First, consider constructing a user-end model G_φ by simply adding a perturbation Δx to the outputs of the root model G_0. Assuming that φ is data-compliant, this model can achieve distinguishability by solving the following problem with respect to Δx:

min_{||Δx|| ≤ ε} E_{x∼P_D} [max{1 - φ^T (x + Δx), 0}],   (4)

where ε > 0 represents a generation quality constraint. The following proposition reveals the connection between distinguishability, data geometry, and generation quality (proof in Appendix A):

Proposition 1. Let d_max(φ) := max_{x∼P_D} |φ^T x|. If ε ≥ 1 + d_max(φ), then Δx* = (1 + d_max(φ))φ solves Eq. (4), and φ^T (x + Δx*) > 0 for all x ∼ P_D.
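Proposition 1 admits a minimal numerical check; the synthetic dataset and the key below are illustrative assumptions, not the paper's data. Every watermarked sample ends up on the positive side of the classifier with margin at least 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "authentic" data and a data-compliant unit key:
# phi^T x < 0 for every sample in the toy dataset.
phi = np.array([1.0, 0.0, 0.0])
X = rng.uniform(-1.0, 0.0, size=(1000, 3))
X[:, 0] = rng.uniform(-1.0, -0.1, size=1000)  # ensure phi^T x < 0

d_max = np.max(np.abs(X @ phi))        # d_max(phi) = max_x |phi^T x|
dx_star = (1.0 + d_max) * phi          # optimal watermark from Prop. 1

# phi^T (x + dx*) = phi^T x + 1 + d_max >= 1 for every sample.
margins = (X + dx_star) @ phi
assert np.all(margins > 0)
assert np.min(margins) >= 1.0 - 1e-9
```

The margin of 1 is exactly the hinge-loss threshold in Eq. (4); spending less perturbation budget than 1 + d_max(φ) would leave some authentic samples inside the hinge.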
Watermarking through retraining user-end models The perturbation Δx* can potentially be reverse-engineered and removed when generative models are white-box to users (e.g., when models are downloaded by users). Therefore, we propose to instead retrain the user-end models G_φ using the perturbed dataset D_{γ,φ} := {G_0(z) + γφ | z ∼ P_z} with γ > 0, so that the perturbation is realized through the model architecture and weights. Specifically, the retraining fine-tunes G_0 so that G_φ(z) matches G_0(z) + γφ for z ∼ P_z. Since this matching will not be perfect, we use the following model to characterize the resultant G_φ:

G_φ(z) = G_0(z) + γφ + ϵ,   (5)

where the error ϵ ∼ N(µ, Σ). In Sec. 3 we provide statistics of µ and Σ on the benchmark datasets to show that the retraining captures the perturbations well (µ close to 0 and small variances in Σ). Updating Proposition 1 to account for ϵ leads to Theorem 1, where we show that γ needs to be no smaller than d_max(φ) in order for G_φ to achieve distinguishability (proof in Appendix B):

Theorem 1. Let d_max(φ) = max_{x∈D} |φ^T x|, σ²(φ) = φ^T Σφ, δ ∈ [0, 1], and let φ be a data-compliant key. Then D(G_φ) ≥ 1 - δ/2 if

γ ≥ σ(φ)√(2 log(1/δ)) + d_max(φ) - φ^T µ.   (6)

Remarks The computation of σ(φ) requires G_φ, which in turn requires γ. Therefore, an iterative search is needed to determine a γ that is small enough to limit the loss of generation quality, yet large enough for distinguishability (see Alg. 1).

Attributability

We can now derive the sufficient conditions for attributability of the generative models from a set of N keys (proof in Appendix C):

Theorem 2. Let d_min(φ) = min_{x∈D} |φ^T x|, d_max(φ) = max_{x∈D} |φ^T x|, σ²(φ) = φ^T Σφ, and δ ∈ [0, 1]. For keys φ and φ', let

a(φ, φ') := -1 + (d_max(φ') + d_min(φ') - 2φ'^T µ) / (σ(φ')√(2 log(1/δ)) + d_max(φ') - φ'^T µ).   (7)

Then A(G) ≥ 1 - Nδ if D(G_φ) ≥ 1 - δ for all G_φ ∈ G, and

φ^T φ' ≤ a(φ, φ')   (8)

for any pair of data-compliant keys φ and φ'.

Remarks When σ(φ') is negligible for all φ' and µ = 0, a(φ, φ') is approximately d_min(φ')/d_max(φ') > 0, in which case φ^T φ' ≤ 0 is sufficient for attributability. In Sec. 3 we empirically show that this approximation is plausible for the benchmark datasets.
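The angle bound in Eq. (7) (as reconstructed above) and its small-noise approximation can be evaluated directly. The helper below is an illustrative sketch: all scalar inputs, including the sample values, are assumed to be precomputed elsewhere.

```python
import numpy as np

def angle_bound(d_min, d_max, mu_proj, sigma, delta):
    """Theorem 2's upper bound a(phi, phi') on phi^T phi', in terms of
    scalars evaluated at phi': d_min(phi'), d_max(phi'),
    mu_proj = phi'^T mu, and sigma = sigma(phi')."""
    gamma = sigma * np.sqrt(2.0 * np.log(1.0 / delta)) + d_max - mu_proj
    return -1.0 + (d_max + d_min - 2.0 * mu_proj) / gamma

# With negligible retraining error (sigma -> 0, mu = 0), the bound
# reduces to d_min/d_max > 0, so orthogonal keys suffice.
a = angle_bound(d_min=0.2, d_max=1.5, mu_proj=0.0, sigma=1e-9, delta=1e-2)
assert abs(a - 0.2 / 1.5) < 1e-6
assert a > 0  # hence phi^T phi' <= 0 is sufficient
```

Because a(φ, φ') depends only on quantities of the existing key φ', a new key can be checked against all registered keys without retraining their models.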

3. EXPERIMENTS AND ANALYSIS

In this section we test Theorem 1, provide empirical support for the orthogonality of keys, and present experimental results on model attribution using MNIST, CelebA, and FFHQ. Note that tests of the theorems require estimation of Σ, which is costly for models with high-dimensional outputs; these tests are therefore performed only on MNIST and CelebA.

Key generation

We generate keys by iteratively solving the following convex problem:

φ_i = arg min_φ E_{x∼P_D, P_{G_0}} [max{1 + φ^T x, 0}] + Σ_{j=1}^{i-1} max{φ_j^T φ, 0}.   (9)

The orthogonality penalty is omitted for the first key. The solutions are normalized to unit l_2 norm before being inserted into the next problem. We note that P_D and P_{G_0} do not perfectly match in practice, and we therefore draw with equal chance from both distributions during the computation. The root models G_0 are trained using the standard DCGAN architecture for MNIST and CelebA, and StyleGAN for FFHQ. Training details are deferred to Appendix D.
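For concreteness, the objective of Eq. (9) can be written out as follows; the toy data and dimensions are illustrative assumptions, and any convex solver or subgradient method could then be applied to minimize it.

```python
import numpy as np

def key_objective(phi, X, prev_keys):
    """Eq. (9)'s objective: the data-compliance hinge
    E_x[max(1 + phi^T x, 0)] plus the orthogonality penalty
    sum_j max(phi_j^T phi, 0) over previously generated keys."""
    hinge = np.maximum(1.0 + X @ phi, 0.0).mean()
    penalty = sum(max(float(pj @ phi), 0.0) for pj in prev_keys)
    return hinge + penalty

rng = np.random.default_rng(0)
X = -rng.uniform(1.0, 3.0, size=(200, 8))  # toy "authentic" samples
e = np.eye(8)

# A data-compliant key orthogonal to an earlier key attains zero loss...
assert key_objective(e[0], X, prev_keys=[e[1]]) == 0.0
# ...while a non-compliant key pays the hinge penalty.
assert key_objective(-e[0], X, prev_keys=[]) > 1.0
```

The hinge margin of 1 mirrors Eq. (4): a key with zero loss satisfies φ^T x ≤ -1 on the samples, i.e., data compliance with a margin.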

User-end generative models

The training of G_φ follows Alg. 1, where γ is iteratively tuned to balance generation quality and distinguishability. For each γ, we collect a perturbed dataset D_{γ,φ} and solve the following training problem, starting from θ = θ_0:

min_θ E_{(z,x)∼D_{γ,φ}} ||G_φ(z; θ) - x||².   (10)

If the resultant model does not meet the distinguishability requirement due to the discrepancy between D_{γ,φ} and G_φ, the perturbation is updated as γ = αγ. In experiments, we use a standard normal distribution for P_z, and set δ = 10⁻² and α = 1.1.

Validation of Theorem 1 Here we validate the sufficient condition for distinguishability. Fig. 2a compares the LHS and RHS values of Eq. (6) for 100 distinguishable user-end models. The empirical distinguishability of these models is reported in Fig. 2e. Calculation of the RHS of Eq. (6) requires estimates of µ and Σ. To obtain them, we sample ϵ(z) = G_φ(z; θ) - G(z; θ_0) - γφ using 5000 samples of z ∼ P_z, where G_φ and γ are derived from Alg. 1; µ and Σ are then estimated for each φ. Fig. 2c and d present histograms of the elements of µ and Σ for two user-end models on the benchmark datasets. Results in Fig. 2a show that the sufficient condition for distinguishability (Eq. (6)) is satisfied for most of the sampled models through the training specified in Alg. 1. Lastly, we notice that the LHS values for MNIST are farther away from the equality line than those for CelebA. This is because the MNIST data distribution resides at corners of the unit box, so perturbations of the distribution are more likely to exceed the bounds on pixel values. Clamping these invalid pixel values reduces the effective perturbation length; to achieve distinguishability, Alg. 1 therefore seeks larger γs than needed. This issue is less pronounced for CelebA, where data points are rarely close to the boundaries. Fig. 2g presents the γ values of all user-end models.
Algorithm 1: Training of G_φ
input: φ, G_0; output: G_φ, γ
1. Set γ = d_max(φ).
2. Collect D_{γ,φ}.
3. Train G_φ by solving Eq. (10) using D_{γ,φ}.
4. Compute the empirical D(G_φ).
5. If D(G_φ) < 1 - δ, set γ = αγ and go to step 2.

Validation of Theorem 2 Recall that from Theorem 2, orthogonal keys are sufficient. To support this design rule, Fig. 2b presents the minimum RHS values of Eq. (8) for 100 user-end models. Specifically, for each φ_i, we compute a(φ_i, φ_j) (Eq. (7)) for j = 1, ..., i-1 and report min_j a(φ_i, φ_j), which sets an upper bound on the inner product between φ_i and all existing keys. The resultant min_j a(φ_i, φ_j) are all positive for MNIST and close to zero for CelebA. From this result, an angle of ≥ 94 deg, rather than 90 deg, should be enforced between any pair of keys for CelebA. However, since the conditions are sufficient, orthogonal keys still empirically achieve high attributability (Fig. 2f), although improvements can be made by further increasing the angle between keys. Also note that the current computation of keys (Eq. (9)) does not enforce a hard constraint on orthogonality, leading to slightly acute angles (87.7 deg) between keys for CelebA (Fig. 2h). On the other hand, the positive values in Fig. 2b for MNIST suggest that further reducing the angles between keys is acceptable if one needs to increase the total capacity of attributable models. However, doing so would require the derivation of new keys to rely on knowledge about all existing user-end models (in order to compute Eq. (7)).

Empirical results on benchmark datasets Tab. 1 reports the metrics of interest measured on the 100 user-end models for each of MNIST and CelebA, and on 20 models for FFHQ. All models are trained to be distinguishable, and by utilizing Theorem 2, they also achieve high attributability.
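The γ search in Alg. 1 can be sketched as follows; `train_fn` and `distinguishability_fn` are placeholders standing in for the Eq. (10) training step and the empirical estimate of Eq. (1), and the mock functions used in the example are purely hypothetical.

```python
def train_user_model(phi, d_max, train_fn, distinguishability_fn,
                     delta=1e-2, alpha=1.1, max_iters=200):
    """Sketch of Alg. 1: start from gamma = d_max(phi) and inflate gamma
    by alpha until the retrained model reaches 1 - delta empirical
    distinguishability."""
    gamma = d_max
    for _ in range(max_iters):
        model = train_fn(phi, gamma)        # fit G_phi to G_0(z) + gamma*phi
        if distinguishability_fn(model) >= 1.0 - delta:
            return model, gamma
        gamma *= alpha                      # enlarge the perturbation
    raise RuntimeError("distinguishability target not reached")

# Mock training whose distinguishability grows with gamma (hypothetical).
train = lambda phi, gamma: gamma
disting = lambda model: 1.0 - 0.5 / (1.0 + model)

model, gamma = train_user_model(phi=None, d_max=1.0,
                                train_fn=train,
                                distinguishability_fn=disting)
assert disting(model) >= 0.99
assert gamma >= 1.0
```

The geometric γ schedule keeps the perturbation, and hence the quality loss, close to the smallest value that still meets the distinguishability target.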
As a comparison, we demonstrate results where keys are 45 deg apart (φ^T φ' ≈ 0.71) using a separate set of 20 user-end models for each of MNIST and CelebA, and 5 models for FFHQ, in which case distinguishability no longer guarantees attributability. Regarding generation quality, the G_φs receive worse FID scores than G_0 due to the perturbations. We visualize samples from user-end models and the corresponding keys in Fig. 3. Note that for human faces, FFHQ in particular, the perturbations create light shades around eyes and lips, which is an unexpected but reasonable result.

Attribution robustness vs. generation quality We now consider the scenario where outputs of the generative models are post-processed (e.g., by adversaries) before being attributed. When the post-processes are known, we can take countermeasures through robust training, which intuitively leads to an additional loss of generation quality. To assess this trade-off between robustness and generation quality, we train G_φ against post-processes T: R^{d_x} → R^{d_x} drawn from a distribution P_T. Due to the potential nonlinearity of T and the lack of a theoretical guarantee in this scenario, we resort to the following robust training problem for deriving the user-end models:

min_{θ_i} E_{z∼P_z, T∼P_T} [max{1 - φ_i^T T(G_{φ_i}(z; θ_i)), 0} + C ||G_0(z) - G_{φ_i}(z; θ_i)||²],   (12)

where C is the hyper-parameter for generation quality. A detailed analysis and comparison for selecting C are provided in Appendix E. We consider five types of post-processes: blurring, cropping, noise, JPEG conversion, and the combination of these four. Examples of the post-processed images are shown in Fig. 5. Blurring uses Gaussian kernel widths uniformly drawn from (1/3){1, 3, 5, 7, 9}. Cropping crops images with uniformly drawn ratios between 80% and 100%, and scales the cropped images back to the original size using bilinear interpolation. Noise adds white noise with standard deviation uniformly drawn from [0, 0.3]. JPEG applies JPEG compression.
Combination performs each attack with a 50% chance, in the order of blurring, cropping, noise, and JPEG. For differentiability, we use existing implementations of differentiable blurring (Riba et al., 2020) and JPEG conversion (Zhu et al., 2018). For robust training, we apply the post-process to mini-batches with 50% probability. We performed comprehensive tests using DCGAN (on MNIST and CelebA), PGAN (on CelebA), and CycleGAN (on Cityscapes). Tab. 2 summarizes the average distinguishability, the attributability, the perturbation length ||Δx||, and the FID score with and without robust training of G_φ. Results are based on 20 models for each architecture-dataset pair, where keys are kept orthogonal and data compliant. From the results, defense against these post-processes can be achieved, except for Combination. Importantly, there is a clear trade-off between robustness and generation quality. This can be seen from Fig. 5, which compares samples with and without robust training from the tested models and datasets. Lastly, it is worth noting that the training formulation in Eq. (12) can also be applied to the training of non-robust user-end models in place of Eq. (10). However, the resultant model from Eq. (12) cannot be characterized by Eq. (5) with small µ and Σ: due to the nonlinearity of the training process of Eq. (12), the user-end model distribution is deformed while it is perturbed. This resulted in unsuccessful validation of the theorems, which led to the adoption of Eq. (10) for theorem-consistent training. Therefore, while the empirical results show the feasibility of achieving robust attributability using Eq. (12), counterparts to Theorems 1 and 2 in this nonlinear setting are yet to be developed.

Capacity of keys For real-world applications, we hope to maintain attributability for a large set of keys. Our study so far suggests that the capacity of keys is constrained by the data-compliance and orthogonality requirements.
While the empirical study showed the feasibility of computing keys through Eq. (9), finding the maximum number of feasible keys is a problem of optimal sphere packing on a segment of the unit sphere (Fig. 4). To explain: the unit sphere represents the identifiability requirement ||φ|| = 1; the feasible segment of the unit sphere is determined by the data-compliance and generation quality constraints; and the spheres to be packed have radii following the sufficient condition in Theorem 2. Such optimal packing problems are known open challenges (Cohn et al., 2017; Cohn, 2016). For real-world applications where a large capacity of attributable models is needed (which is the case in both the malicious personation and copyright infringement settings), it is necessary to find approximate solutions to this problem.
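As a toy illustration of how the orthogonality rule ties capacity to the output dimension (this construction is ours, not the paper's): when the data lie entirely in the negative orthant, the standard basis already supplies d_x mutually orthogonal, data-compliant keys.

```python
import numpy as np

# Orthogonal, data-compliant keys in an idealized setting: if every data
# point has all-negative coordinates, then phi = e_i satisfies
# phi^T x < 0 for each basis vector, giving d_x keys that are also
# pairwise orthogonal. Capacity thus scales with d_x under the
# orthogonality design rule.
d_x = 16
keys = np.eye(d_x)                           # e_1, ..., e_{d_x}

rng = np.random.default_rng(0)
X = -rng.uniform(0.5, 2.0, size=(200, d_x))  # toy data, all coords < 0

gram = keys @ keys.T
assert np.allclose(gram, np.eye(d_x))        # pairwise orthogonal
assert np.all(X @ keys.T < 0)                # every key is data compliant
```

Real data distributions occupy a smaller compliant segment of the sphere, which is why the packing problem above, rather than this idealized count, governs the true capacity.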

Generation quality control

From Proposition 1 and Theorem 1, the inevitable loss of generation quality is directly related to the length of the perturbation (γ), which in turn is related to d_max. Fig. 6 compares outputs from user-end models with different values of d_max. While it is possible to filter keys based on their corresponding d_max for generation quality control, here we discuss a potential direction for prescribing a subspace of keys within which quality can be controlled. To start, we denote by J(x) the Jacobian of G_0 with respect to its generator parameters θ_0. Our discussion concerns the matrix M = E_{x∼P_{G_0}}[J(x)] E_{x∼P_{G_0}}[J(x)]^T. A spectral analysis of M reveals that the eigenvectors of M with large eigenvalues are more structured than those with small ones (Fig. 7a). This finding is consistent with the definition of M: the largest eigenvectors of M represent the principal axes of all mean sensitivity vectors, where the mean is taken over the latent space. For MNIST, these eigenvectors overlap with the digits; for CelebA, they are structured color patterns. On the other hand, the smallest eigenvectors represent directions rarely covered by the sensitivity vectors, thus resembling random noise. Based on this finding, we test the hypothesis that keys more aligned with the eigenspace of the small eigenvalues have smaller d_max. We do so by computing the Pearson correlations between d_max and φ^T M φ using 100 models for each of MNIST and CelebA. The resultant correlations are 0.33 and 0.53, respectively. In addition, we compare outputs from models using the largest and the smallest eigenvectors of M as the keys in Fig. 7b. While a concrete human study is needed, the visual results suggest that using eigenvectors of M is a promising approach to segmenting the space of keys according to their induced generation quality.
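The spectral analysis above can be sketched as follows; a random matrix stands in for the mean Jacobian E[J(x)], since computing the Jacobian of a real generator is costly, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the mean Jacobian of G_0 w.r.t. its parameters, averaged
# over the latent space (here a random d_x-by-d_theta matrix).
d_x, d_theta = 32, 64
J_mean = rng.standard_normal((d_x, d_theta))
M = J_mean @ J_mean.T                  # M = E[J(x)] E[J(x)]^T, (d_x, d_x)

# M is symmetric PSD, so eigh applies; sort eigenvalues descending.
eigvals, eigvecs = np.linalg.eigh(M)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keys aligned with small-eigenvalue directions have small phi^T M phi,
# the quantity correlated with d_max in the experiments.
phi_large = eigvecs[:, 0]    # most "structured" direction
phi_small = eigvecs[:, -1]   # noise-like direction
assert phi_small @ M @ phi_small <= phi_large @ M @ phi_large
assert np.all(eigvals >= -1e-8)        # PSD up to numerical error
```

Choosing keys from the span of the smallest eigenvectors would then be the quality-controlled key subspace suggested in the text.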

5. RELATED WORK

Detection and attribution of model-generated contents This paper focuses on the attribution of contents from generative models rather than on the detection of hand-crafted manipulations (Agarwal & Farid, 2017; Popescu & Farid, 2005; O'Brien & Farid, 2012; Rao & Ni, 2016; Huh et al., 2018). Detection methods rely on fingerprints intrinsic to generative models (Odena et al., 2016; Zhang et al.). A recent work (2020) studied decentralized attribution: instead of using linear classifiers for attribution, they train a watermark encoder-decoder network that embeds (and reads) watermarks into (and from) the content, and compare the decoded watermark with user-specific ones. Their method does not provide sufficient conditions on the watermarks for attributability.

IP protection of digital contents and models Watermarks have conventionally been used for IP protection (Tirkel et al., 1993; Van Schyndel et al., 1994; Bi et al., 2007; Hsieh et al., 2001; Pereira & Pun, 2000; Zhu et al., 2018; Zhang et al., 2019a) without considering an attribution guarantee. Another approach to content IP protection is blockchain (Hasan & Salah, 2019). However, this approach requires meta data to be transferred along with the contents, which may not be realistic in adversarial settings; e.g., one can simply take a picture of a synthetic image to remove any meta data attached to the image file. Aside from the protection of contents, mechanisms for protecting the IP of models have also been studied (Uchida et al., 2017; Nagai et al., 2018; Le Merrer et al., 2019; Adi et al., 2018; Zhang et al., 2018; Fan et al., 2019; Szyller et al., 2019; Zhang et al., 2020). Model watermarking is usually done by adding watermarks into model weights (Uchida et al., 2017; Nagai et al., 2018), by embedding a unique input-output mapping into the model (Le Merrer et al., 2019; Adi et al., 2018; Zhang et al., 2018), or by introducing a passport mechanism so that model accuracy drops if the right passport is not inserted (Fan et al., 2019). While closely related, existing work on model IP protection focused on the distinguishability of individual models, rather than the attributability of a model set.

6. CONCLUSION

Motivated by emerging challenges with generative models, e.g., deepfake, this paper investigated the feasibility of decentralized attribution of such models. The study is based on a protocol where the registry generates user-specific keys that guide the watermarking of user-end models so that they are distinguishable from the authentic data. The outputs of user-end models can then be attributed by the registry through the binary classifiers parameterized by the keys. We developed sufficient conditions on the keys so that distinguishable user-end models achieve guaranteed attributability. These conditions led to simple rules for designing the keys. With concerns about adversarial post-processes, we further showed that robust attribution can be achieved using the same design rules, at an additional loss of generation quality. Lastly, we introduced two open challenges toward real-world applications of the proposed attribution scheme: the prescription of a key space with controlled generation quality, and the approximation of the capacity of keys.

A PROOF OF PROPOSITION 1

Proposition 1 states that if ε ≥ 1 + d_max(φ), then Δx* = (1 + d_max(φ))φ solves Eq. (4), and φ^T (x + Δx*) > 0 for all x ∼ P_D.

Proof. Let φ be a data-compliant key and let x be sampled from P_D. First, from the KKT conditions for Eq. (4) we can show that the solution Δx* is proportional to φ: Δx* = φ/µ*, where µ* ≥ 0 is the Lagrange multiplier. To minimize the objective, we seek µ* such that 1 - (x + Δx*)^T φ = 1 - x^T φ - 1/µ* ≤ 0 for all x. Since x^T φ < 0 (data compliance) and |x^T φ| ≤ d_max(φ), this requires 1/µ* = 1 + d_max(φ). Therefore, when ε ≥ 1 + d_max(φ), Δx* = (1 + d_max(φ))φ solves Eq. (4). And φ^T (x + Δx*) = φ^T (x + (1 + d_max(φ))φ) = φ^T x + 1 + d_max(φ) > 0.

B PROOF OF THEOREM 1

Theorem 1. Let d_max(φ) = max_{x∈D} |φ^T x|, σ²(φ) = φ^T Σφ, δ ∈ [0, 1], and let φ be a data-compliant key. Then D(G_φ) ≥ 1 - δ/2 if γ ≥ σ(φ)√(2 log(1/δ)) + d_max(φ) - φ^T µ.

Proof. We first note that due to the data compliance of the key, E_{x∼P_D} [1(φ^T x < 0)] = 1. Therefore D(G_φ) ≥ 1 - δ/2 iff E_{x∼P_{G_φ}} [1(φ^T x > 0)] ≥ 1 - δ, i.e., Pr(φ^T x > 0) ≥ 1 - δ for x ∼ P_{G_φ}. We now seek a lower bound for Pr(φ^T x > 0). To do so, let x and x_0 be sampled from P_{G_φ} and P_{G_0}, respectively. Then we have φ^T x = φ^T (x_0 + γφ + ϵ) = φ^T x_0 + γ + φ^T ϵ, and

Pr(φ^T x > 0) = Pr(φ^T ϵ > -φ^T x_0 - γ).

Since d_max(φ) ≥ -φ^T x_0, we have

Pr(φ^T x > 0) ≥ Pr(φ^T ϵ > d_max(φ) - γ) = Pr(φ^T (ϵ - µ) ≤ γ - d_max(φ) + φ^T µ).

The last sign switch is granted by the symmetry of the distribution of φ^T (ϵ - µ), which follows N(0, φ^T Σφ). A sufficient condition for Pr(φ^T x > 0) ≥ 1 - δ is then

Pr(φ^T (ϵ - µ) ≤ γ - d_max(φ) + φ^T µ) ≥ 1 - δ.

Recall the following tail bound for x ∼ N(0, σ²) and y ≥ 0: Pr(x ≤ σy) ≥ 1 - exp(-y²/2). Comparing this bound with the condition above, the sufficient condition becomes

γ ≥ σ(φ)√(2 log(1/δ)) + d_max(φ) - φ^T µ.

Table 4 summarizes the number of GPUs used and the training time for the non-robust models (Eq. (10) in the main text) and the robust models (Eq. (12) in the main text). Recall that we chose Eq. (10) for training the non-robust user-end models for consistency with the theorems, although Eq. (12) can be used to achieve attributability in practice, as is shown in the robust attribution study. The non-robust training takes longer due to the iteration over γ in Alg. 1.
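The Gaussian tail bound used in the proofs can be sanity-checked numerically against the exact normal CDF (a small check of ours, using only the standard library):

```python
import math

def normal_cdf(t):
    """Exact standard-normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Tail bound from the proofs: for x ~ N(0, sigma^2) and y >= 0,
# Pr(x <= sigma * y) >= 1 - exp(-y^2 / 2).
for y in [0.0, 0.5, 1.0, 2.0, 4.0]:
    assert normal_cdf(y) >= 1.0 - math.exp(-y * y / 2.0)

# Inverting the bound at 1 - delta yields the sqrt(2 log(1/delta)) term
# appearing in Theorems 1 and 2.
delta = 1e-2
y = math.sqrt(2.0 * math.log(1.0 / delta))
assert normal_cdf(y) >= 1.0 - delta
```

The bound is loose for small y but tightens quickly, which is why the √(2 log(1/δ)) term only mildly inflates γ for practical δ such as 10⁻².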

E ABLATION STUDY

Here we conduct an ablation study on the hyper-parameter C for the robust training formulation (Eq. (12)). Training with larger C focuses more on generation quality, thus sacrificing distinguishability and attributability. These effects are reported in Table 5 and Table 6. Due to limited time, the results here are averaged over five models for each C and data-model pair.

Attributability before (Bfr) and after (Aft) robust training, per post-process:

Model, C | Blur (Bfr/Aft) | Crop (Bfr/Aft) | Noise (Bfr/Aft) | JPEG (Bfr/Aft) | Combi. (Bfr/Aft)
DCGAN_M 100 | 0.00/0.87 | 0.00/0.85 | 0.73/0.90 | 0.10/0.95 | 0.00/0.13
DCGAN_M 1K | 0.00/0.75 | 0.00/0.80 | 0.63/0.80 | 0.10/0.91 | 0.00/0.05
DCGAN_C 10 | 0.00/0.98 | 0.00/0.99 | 0.89/0.93 | 0.07/0.98 | 0.00/0.70
DCGAN_C 100 | 0.00/0.95 | 0.00/0.93 | 0.82/0.85 | 0.02/0.93 | 0.00/0.61
DCGAN_C 1K | 0.00/0.90 | 0.00/0.89 | 0.77/0.81 | 0.00/0.88 | 0.00/0.43
PGAN 100 | 0.26/1.00 | 0.21/1.00 | 0.99/0.99 | 0.99/0.99 | 0.00/0.99
PGAN 1K | 0.21/0.99 | 0.00/0.99 | 0.97/0.98 | 0.98/0.99 | 0.00/0.54
PGAN 10K | 0.00/0.51 | 0.00/0.90 | 0.90/0.92 | 0.83/0.99 | 0.00/0.22
CycleGAN 1K | 0.00/0.99 | 0.00/0.97 | 0.97/0.99 | 0.45/0.99 | 0.00/0.77
CycleGAN 10K | 0.00/0.87 | 0.00/0.77 | 0.95/0.96 | 0.30/0.99 | 0.00/0.31



Figure 2: (a) Validation of Theorem 1: all points should be close to the diagonal line or to its right. (b) Support for orthogonal keys: the min. RHS values of Eq. (7) for all keys are either positive (MNIST) or close to zero (CelebA). (c,d) Statistics of µ and Σ for two sample user-end models for MNIST and CelebA; small µ and small diag(Σ) suggest a good match of G_φ to the perturbed data distributions. (e-h) Distinguishability, attributability, perturbation length, and orthogonality of 100 StyleGAN user-end models on FFHQ and 100 DCGAN user-end models on MNIST or CelebA, respectively.

Figure 3: Visualization of sample keys (1st row) and the corresponding user-end generated contents.

Figure 4: Capacity of keys as a sphere packing problem: The feasible space (arc) is determined by the data compliance and generation quality constraints, and the size of spheres by the minimal angle between keys.

Figure 5: Samples from user-end models with robust and non-robust training. For each subfigure, top: DCGAN on MNIST and CelebA; bottom: PGAN (CelebA) and CycleGAN (Cityscapes). For each dataset, top: samples from G_0 (after the worst-case post-process in (b-f)); mid: samples from G_φ (after robust training in (b-f)); btm (a): difference between non-robust G_φ and G_0; btm (b-h): difference between robust and non-robust G_φ.

Figure 6: MNIST, CelebA, and FFHQ examples from G φ s with (a-c) small d max and (d-f) large d max . All models are distinguishable and attributable. (Zooming in on pdf file is recommended.)


Figure 7: (a) Eigenvectors for the two largest and two smallest eigenvalues of M for DCGANs on MNIST (top) and CelebA (bottom). (b) Left column: Samples from G 0 ; Rest: G 0 -G φ where φ are the eigenvectors in (a).

Empirical averages of distinguishability (D), attributability (A(G)), ||Δx||, and FID scores. DCGAN_M (DCGAN_C) denotes DCGAN for MNIST (CelebA). Standard deviations in parentheses. FID_0: FID for G_0. ↓ means lower is better and ↑ means higher is better.

DCGAN_M: DCGAN for MNIST. DCGAN_C: DCGAN for CelebA. Dis.: distinguishability before (Bfr) and after (Aft) robust training. Att.: attributability. ||Δx|| and FID are after robust training. Standard deviations in parentheses. ↓ means lower is better and ↑ means higher is better. ||Δx|| and FID before robust training: DCGAN_M: ||Δx|| = 5.05, FID = 5.36; DCGAN_C: ||Δx|| = 5.63, FID = 53.91; PGAN: ||Δx|| = 9.29, FID = 21.62; CycleGAN: ||Δx|| = 55.85. FID does not apply to CycleGAN.


Hyper-parameters to train keys (φ) and generators (G φ ).

Training time (in minute) of one key (Eq.(9) in main text) and one generator (Eq.(10) in main text). DCGAN M : DCGAN for MNIST, DCGAN C : DCGAN for CelebA.

Distinguishability (top), attributability (btm) before (Bfr) and after (Aft) robust training. DCGAN_M: DCGAN for MNIST, DCGAN_C: DCGAN for CelebA.

Model, C | Blur (Bfr/Aft) | Crop (Bfr/Aft) | Noise (Bfr/Aft) | JPEG (Bfr/Aft) | Combi. (Bfr/Aft)
PGAN 100 | 0.50/0.98 | 0.50/0.99 | 0.96/0.99 | 0.96/0.99 | 0.50/0.81
PGAN 1K | 0.50/0.89 | 0.49/0.95 | 0.94/0.95 | 0.88/0.99 | 0.50/0.60
PGAN 10K | 0.50/0.61 | 0.50/0.76 | 0.89/0.90 | 0.76/0.98 | 0.50/0.51
CycleGAN 1K | 0.49/0.92 | 0.50/0.87 | 0.98/0.99 | 0.55/0.99 | 0.49/0.62
CycleGAN 10K | 0.49/0.70 | 0.50/0.66 | 0.94/0.96 | 0.52/0.98 | 0.50/0.51
DCGAN_M 10 | 0.02/0.94 | 0.03/0.88 | 0.77/0.95 | 0.16/0.98 | 0.00/0.26

||Δx|| (top) and FID score (btm). Standard deviations in parentheses. DCGAN_M: DCGAN for MNIST, DCGAN_C: DCGAN for CelebA, Combi.: Combination attack. Lower is better.

Model, C | Bfr | Blur | Crop | Noise | JPEG | Combi.
CycleGAN 1K | 55.85(3.67) | 68.03(3.62) | 80.03(3.59) | 55.47(1.60) | 57.42(2.00) | 83.94(4.66)
CycleGAN 10K | 49.66(5.01) | 58.64(3.70) | 66.05(3.47) | 53.14(0.44) | 54.52(2.30) | 66.24(5.29)
DCGAN_M 10 | 5.36(0.12) | 41.11(20.43) | 21.58(2.44) | 5.79(0.19) | 6.50(1.70) | 68.16(24.67)
DCGAN_M 100 | 5.32(0.11) | 23.83(14.29) | 18.39(3.70) | 5.41(0.18) | 5.46(0.11) | 36.05(16.20)
DCGAN_M 1K | 5.23(0.12) | 10.85(4.28) | 18.08(1.77) | 5.37(0.14) | 5.30(0.96) | 21.86(4.16)
DCGAN_C 10 | 53.91(2.20) | 73.62(6.70) | 98.86(9.51) | 59.51(1.60) | 60.35(2.57) | 87.29(9.29)
DCGAN_C 100 | 45.02(3.37) | 73.12(11.03) | 85.50(12.25) | 47.60(2.57) | 50.48(4.58) | 78.11(12.95)
DCGAN_C 1K | 40.85(3.41) | 55.63(7.97) | 72.11(13.81) | 40.87(3.03) | 45.46(5.03) | 57.13(7.20)
PGAN 100 | 21.62(1.73) | 28.15(3.43) | 47.94(5.71) | 25.43(2.19) | 22.86(2.06) | 45.16(7.87)
PGAN 1K | 19.05(3.14) | 25.19(5.26) | 43.48(12.24) | 19.20(2.96) | 19.05(2.82) | 35.07(8.72)
PGAN 10K | 16.75(1.87) | 18.96(2.65) | 37.01(8.74) | 16.94(1.89) | 17.39(2.33) | 26.63(4.44)

7. ACKNOWLEDGEMENTS

Support from NSF Robust Intelligence Program (1750082), ONR (N00014-18-1-2761), and Amazon AWS MLRA is gratefully acknowledged. We would like to express our gratitude to Ni Trieu (ASU) for providing us invaluable advice, and Zhe Wang, Joshua Feinglass, Sheng Cheng, Yongbaek Cho and Huiliang Shao for helpful comments.

C PROOF OF THEOREM 2

Theorem 2 states that A(G) ≥ 1 - Nδ if D(G_φ) ≥ 1 - δ for all G_φ ∈ G and φ^T φ' ≤ a(φ, φ') for any pair of data-compliant keys φ and φ'.

Proof. Let φ and φ' be any pair of keys, and let x and x_0 be sampled from P_{G_{φ'}} and P_{G_0}, respectively. We first derive a sufficient condition for Pr(φ^T x < 0) ≥ 1 - δ, where φ^T x = φ^T x_0 + γ' φ^T φ' + φ^T ϵ'. Using the same tail bound of the normal distribution as in the proof of Theorem 1, one can show that Pr(φ^T x < 0) ≥ 1 - δ whenever φ^T φ' ≤ a(φ, φ'). Note that Pr(A = 1, B = 1) = 1 - Pr(A = 0) - Pr(B = 0) + Pr(A = 0, B = 0) ≥ 1 - Pr(A = 0) - Pr(B = 0) for binary random variables A and B. With this, it is straightforward to show that when Pr(φ^T x < 0) ≥ 1 - δ for all keys φ ≠ φ' and Pr(φ'^T x > 0) ≥ 1 - δ for x ∼ P_{G_{φ'}}, each model distribution is correctly attributed with probability at least 1 - Nδ, which yields A(G) ≥ 1 - Nδ.

D TRAINING DETAILS D.1 METHOD

We trained user-end models based on the objective function (Eq. (10) in the main text). For datasets where the root models follow DCGAN and PGAN, the user-end models follow the same architecture. For the FFHQ dataset, where StyleGAN is used, we introduce an additional shallow convolutional network as a residual part, which is added to the original StyleGAN output to match the perturbed dataset D_{γ,φ}. In this case, the training using Eq. (10) is limited to the additional shallow network, while the StyleGAN weights are frozen. More specifically, denoting the combination of convolution, ReLU, and max-pooling by Conv-ReLU-Max, the shallow network consists of three Conv-ReLU-Max blocks and one fully connected layer. All of the convolution layers have 4×4 kernels, stride 2, and padding 1, and all of the max-pooling layers have 3×3 kernels and stride 2.

D.2 PARAMETERS

We adopt the Adam optimizer for training. Training hyper-parameters are summarized in Table 3 .

