DATA INSTANCE PRIOR FOR TRANSFER LEARNING IN GANS

Abstract

Recent advances in generative adversarial networks (GANs) have shown remarkable progress in generating high-quality images. However, this gain in performance depends on the availability of a large amount of training data. In limited data regimes, training typically diverges, and therefore the generated samples are of low quality and lack diversity. Previous works have addressed training in low-data settings by leveraging transfer learning and data augmentation techniques. We propose a novel transfer learning method for GANs in the limited data domain that leverages an informative data prior derived from self-supervised/supervised pre-trained networks trained on a diverse source domain. We perform experiments on several standard vision datasets using various GAN architectures (BigGAN, SNGAN, StyleGAN2) to demonstrate that the proposed method effectively transfers knowledge to domains with few target images, outperforming existing state-of-the-art techniques in terms of image quality and diversity. We also show the utility of the data instance prior in large-scale unconditional image generation and image editing tasks.

1. INTRODUCTION

Generative Adversarial Networks (GANs) have been at the forefront of modern high-quality image synthesis in recent years (Brock et al., 2018; Karras et al., 2020b; 2019). GANs have also demonstrated excellent performance on many related computer vision tasks such as image manipulation (Zhu et al., 2017; Isola et al., 2017), image editing (Plumerault et al., 2020; Shen et al., 2020; Jahanian et al., 2020) and compression (Tschannen et al., 2018). Despite this success in large-scale image synthesis, GAN training suffers from a number of drawbacks that arise in practice, such as training instability and mode collapse (Goodfellow et al., 2016; Arora et al., 2017). It has been observed that the issue of unstable training can be mitigated to an extent by using conditional GANs. However, this is expected, as learning a conditional model for each class is easier than learning the joint distribution. The disadvantages of GAN training have prompted research into several non-adversarial generative models (Hoshen et al., 2019; Bojanowski et al., 2018; Li & Malik, 2018; Kingma & Welling, 2014). These techniques are implicitly designed to overcome the mode collapse problem; however, the quality of the generated samples is still not on par with GANs. Current state-of-the-art deep generative models require a large volume of data and computational resources. The collection of large image datasets suitable for training (especially labeled data in the case of conditional GANs) can easily become a daunting task due to issues such as copyright and image quality, as well as the long training time required to reach state-of-the-art image generation performance. To curb these limitations, researchers have recently proposed techniques inspired by transfer learning (Noguchi & Harada, 2019; Wang et al., 2018; Mo et al., 2020) and data augmentation (Karras et al., 2020a; Zhao et al., 2020b; Zhang et al., 2019).
Advancements in data and computation efficiency for image synthesis can enable applications in data-deficient fields such as medicine (Yi et al., 2019), where labeled data can be difficult to obtain. Transfer learning is a promising area of research (Oquab et al., 2014; Pan & Yang, 2009) that leverages prior information acquired from large datasets to help train models on a target dataset under limited data and resource constraints. Extensive exploration of transfer learning for classification has shown excellent performance on various data-deficient downstream domains. Similar extensions of reusing pre-trained networks for transfer learning (i.e., fine-tuning a subset of pre-trained network weights from a data-rich domain) have also recently been employed for image synthesis in GANs (Wang et al., 2018; Noguchi & Harada, 2019; Mo et al., 2020; Wang et al., 2020; Zhao et al., 2020a) in the limited data regime. However, these approaches are still prone to overfitting on the sparse target data, and hence suffer from degraded image quality and diversity. In this work, we propose a simple yet effective way of transferring prior knowledge in unsupervised image generation given a small sample size (∼ 100-2000) of the target data distribution. Our approach is motivated by the formulation of the IMLE technique (Li & Malik, 2018), which seeks to obtain mode coverage of the target data distribution by learning a mapping between the latent and target distributions using a maximum likelihood criterion. We instead propose the use of data priors in GANs to match the representations of generated samples to real modes of the data. In contrast to (Li & Malik, 2018), we use the images generated using data priors to find the nearest-neighbor match to real modes in the generator's learned distribution.
In particular, we show that using an informative data instance prior in limited and large-scale unsupervised image generation substantially improves image synthesis performance. We show that these data priors can be derived from commonly used pre-trained computer vision networks (Simonyan & Zisserman, 2014; Zhang et al., 2018; Noguchi & Harada, 2019; Hoshen et al., 2019) or self-supervised data representations (Chen et al., 2020), without any violation of the target setting's requirements (i.e., ensuring that the pre-trained network has not been trained on few-shot classes in the few-shot learning setting, for instance). In the case of sparse training data, our approach of using data instance priors leverages a model pre-trained on a rich source domain to learn the target distribution. Different from previous works (Noguchi & Harada, 2019; Wang et al., 2020; 2018), which rely on fine-tuning models trained on a data-rich domain, we propose to leverage the feature representations of our source model as data instance priors to distill knowledge (Romero et al., 2015; Hinton et al., 2015) into our target generative model. We note that our technique of using data instance priors for transfer learning becomes fully unsupervised when the data priors are extracted from self-supervised pre-trained networks. Furthermore, in addition to image generation in the low-data domain, we achieve a state-of-the-art Fréchet inception distance (FID) score (Heusel et al., 2017) on large-scale unsupervised image generation and also show how this transfer learning framework supports several image editing tasks. We summarize our main contributions as follows:
• We propose Data Instance Prior (DIP), a novel transfer learning technique for GAN image synthesis in the low-data regime. We show that employing DIP in conjunction with existing few-shot image generation methods outperforms state-of-the-art results.
• We show that with as few as 100 images, our approach DIP results in the generation of diverse and high-quality images (see Figure 3).
• We demonstrate the utility of our approach in large-scale unsupervised GANs (Miyato et al., 2018; Brock et al., 2018), achieving a new state of the art in terms of image quality (Heusel et al., 2017) and diversity (Sajjadi et al., 2018; Metz et al., 2017).
• We show how our DIP framework by construction enables inversion of images and common image editing tasks (such as CutMix, inpainting, and image translation) in GANs.
We call our method a data instance prior (and not just a data prior) since it uses representations of instances as a prior, and not a data distribution itself.

2. RELATED WORK

Deep Generative Models: In recent years, there has been a surge in research on deep generative models. Popular approaches include variational auto-encoders (VAEs) (Rezende et al., 2014; Kingma & Welling, 2014), auto-regressive (AR) models (Van Oord et al., 2016; Van den Oord et al., 2016) and GANs (Goodfellow et al., 2014). VAE models learn by maximizing a variational lower bound on the likelihood of generating data from a given distribution. Auto-regressive approaches model the data distribution as a product of conditional probabilities to sequentially generate data. GANs comprise two networks, a generator and a discriminator, trained in a min-max optimization. Specifically, the generator aims to generate samples that fool the discriminator, while the discriminator learns to distinguish these generated samples from real samples. Several research efforts in GANs have focused on improving performance (Karras et al., 2018; Denton et al., 2015; Radford et al., 2016; Karras et al., 2020b; 2019; Brock et al., 2018; Zhang et al., 2019) and training stability (Salimans et al., 2016b; Gulrajani et al., 2017; Arjovsky et al., 2017; Miyato et al., 2018; Mao et al., 2017; Chen et al., 2019). Recently, latent space manipulation for semantic editing (Shen et al., 2020; Jahanian et al., 2020; Zhu et al., 2020; Plumerault et al., 2020) and few-shot image generation (Wang et al., 2020; Mo et al., 2020; Noguchi & Harada, 2019) have gained traction in an effort to mitigate the practical challenges of deploying GANs. Several other non-adversarial training approaches (Hoshen et al., 2019; Bojanowski et al., 2018; Li & Malik, 2018) have also been explored for generative modeling, which leverage supervised learning along with a perceptual loss (Zhang et al., 2018) for training such models.
Transfer Learning in GANs: While there has been extensive research in the area of transfer learning for classification models (Yosinski et al., 2014; Oquab et al., 2014; Tzeng et al., 2015; Pan & Yang, 2009; Donahue et al., 2014), relatively few efforts have explored it for data generation (Wang et al., 2018; 2020; Noguchi & Harada, 2019; Zhao et al., 2020a; Mo et al., 2020). (Wang et al., 2018) proposed to fine-tune a pre-trained GAN model (often having millions of parameters) from a data-rich source to adapt to a target domain with limited samples. This approach, however, often suffers from overfitting, as the final model parameters are updated using only a few samples from the target domain. To counter overfitting, (Noguchi & Harada, 2019) propose to update only the batch normalization parameters of the pre-trained GAN model. In this approach, however, the generator is not adversarially trained and instead uses a supervised L1 pixel-distance loss and a perceptual loss (Johnson et al., 2016; Zhang et al., 2018), which often leads to blurry images in the target domain. Based on the assumption that the source and target domain support sets are similar, (Wang et al., 2020) recently proposed to learn an additional mapping network that transforms the latent code for generating images of the target domain while keeping the other parameters frozen. We compare against all leading baselines, including (Noguchi & Harada, 2019; Wang et al., 2020), on their respective tasks, and show that our method DIP outperforms them substantially while being simple to implement. A related line of recent research aims to improve large-scale unsupervised image generation in GANs by employing self-supervision, in particular an auxiliary task of rotation prediction (Chen et al., 2019) or using one-hot labels obtained by clustering in the discriminator's (Liu et al., 2020) or an ImageNet classifier's feature space (Sage et al., 2018).
In contrast, our method utilizes data instance priors derived from the feature activations of self-supervised/supervised pre-trained networks to improve unsupervised few-shot and large-scale image generation, leading to a simpler formulation and higher performance, as shown in our experiments in Section 5.3 and Appendix A. Recently, some methods (Karras et al., 2020a; Zhao et al., 2020b; Zhang et al., 2019; Zhao et al., 2020c) have leveraged data augmentation to effectively increase the number of samples and prevent overfitting in GAN training. However, data augmentation techniques often alter the true data distribution, and the augmentations can leak into the generated images, as shown in (Zhao et al., 2020c; b). To overcome this, (Zhao et al., 2020b) recently proposed differentiable augmentation, and (Karras et al., 2020a) leveraged an adaptive discriminator augmentation mechanism. We instead focus on leveraging informative data instance priors and, in fact, show how our DIP method can be used in conjunction with (Zhao et al., 2020b) to improve performance.

3. PRELIMINARIES

We now present the background on generative models that is essential for our methodology.

Conditional GAN: It consists of a generator G : R^m × Y → R^p and a discriminator D : R^p × Y → [0, 1], which are trained adversarially to learn a target data distribution q(x|y), where x ∈ R^p and y ∈ Y, the space of class labels. The goal of G is to generate samples from noise z ∼ p(z), z ∈ R^m, given a conditional label y ∼ q(y), that match the target distribution, and the aim of D is to distinguish between real samples and those generated by G. Conditional GANs use class-level information y of the data in both the generator and the discriminator. The standard hinge loss (Tran et al., 2017) for training GANs is given by:

L_D = E_{y∼q(y)} E_{x∼q(x|y)} [max(0, 1 − D(x, y))] + E_{y∼q(y)} E_{z∼p(z)} [max(0, 1 + D(G(z|y), y))]
L_G = −E_{y∼q(y)} E_{z∼p(z)} [D(G(z|y), y)]     (1)

where the discriminator score D(x, y) depends on the input image (either real or fake) and its class y (Miyato & Koyama, 2018; Odena et al., 2017). Generally, the class information is passed into G through a one-hot vector concatenated with z or through conditional batch-norm layers (De Vries et al., 2017; Dumoulin et al., 2016).

Implicit Maximum Likelihood Estimation: IMLE (Li & Malik, 2018) is a non-adversarial technique that uses a maximum likelihood criterion to train the generative model. During training, each real sample of the data distribution is matched to a generated image through nearest neighbour search, and the distance between the two is minimized. The optimization objective to update the parameters of the network G in each training step is given by:

min_G E_{z_1,…,z_m ∼ N(0, I_d)} E_{x∼q(x)} ||G(z*(x)) − x||²_2,  where  z*(x) = argmin_{z ∈ {z_1,…,z_m}} ||G(z) − x||²_2     (2)

Here, the inner optimization of finding the latent code z*(x) from the batch {z_1, …, z_m} such that G(z*(x)) is nearest to the real image x is implemented using a nearest neighbor search technique (Li & Malik, 2017).
The above objective encourages each real example to be close to some generated sample from the trained generator.
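The IMLE matching step of Eq. 2 can be sketched numerically. Below is a minimal, illustrative numpy version using a toy linear generator G(z) = Wz (an assumption made only for this sketch, not the paper's architecture), so the gradient of the matched squared error is available in closed form:

```python
import numpy as np

def imle_step(W, xs, m=64, lr=0.05):
    """One IMLE-style update for a toy linear generator G(z) = W @ z.

    For each real example x, find the nearest of m freshly generated
    samples (z*(x) in Eq. 2) and descend on the averaged matched loss
    ||G(z*(x)) - x||^2."""
    d = W.shape[1]
    zs = np.random.randn(m, d)            # candidate latent codes z_1..z_m
    fakes = zs @ W.T                      # G(z_i) for all candidates
    grad = np.zeros_like(W)
    for x in xs:
        j = np.argmin(((fakes - x) ** 2).sum(axis=1))   # nearest neighbor z*(x)
        grad += 2.0 * np.outer(fakes[j] - x, zs[j])     # d||W z - x||^2 / dW
    return W - lr * grad / len(xs)
```

Each step draws fresh latent candidates, matches every real example to its nearest generated sample, and descends on the averaged matched loss, which is what encourages every real mode to have a nearby generated sample.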

4. DIP: PROPOSED METHODOLOGY

We propose DIP, a transfer learning framework for training GANs that exploits knowledge extracted from self-supervised/supervised networks pre-trained on a rich and diverse source domain, in the form of data instance priors. It has been shown that providing class label information in GANs significantly improves training stability and the quality of generated images as compared to the unconditional setting (Miyato & Koyama, 2018; Chen et al., 2019). However, in practice, GANs are prone to mode collapse, which is further exacerbated with sparse training data. We take motivation from the reconstructive framework of IMLE (Li & Malik, 2018) and propose to condition GANs on image instance priors that act as a regularizer to prevent mode collapse and discriminator overfitting.

Knowledge Transfer in GANs: GANs are a class of implicit generative models that minimize a divergence measure between the data distribution q(x) and the generator output distribution over G(z), where z ∼ p(z) denotes the latent distribution. Intuitively, minimizing this divergence objective ensures that each generated sample G(z) is close to some data example x ∼ q(x). However, it does not ensure the converse, i.e., that each data instance is close to a generated sample, which can result in mode dropping. To counter this, especially in the limited data regime, we propose to update the parameters of the model so that each real data example is close to some generated sample, similar to (Li & Malik, 2018), by using data instance priors as conditional labels in GANs. Moreover, to enable the transfer of knowledge, image features extracted from networks pre-trained on a large source domain are used as the instance-level prior. Given a pre-trained feature extractor C : R^p → R^d, x ∈ R^p, which is trained on a source domain using supervisory signals or self-supervision, we use its output C(x) as the condition in the generator.
G is conditioned on C(x) using conditional batch-norm (Dumoulin et al., 2016) whose input is G_emb(C(x)), where G_emb is a learnable matrix. During training, we enforce that G(z|C(x)) is close to the real image x in the discriminator feature space (similar to G(z*(x)) being close to x in Eq. 2). Let the discriminator be D = D_l ∘ D_f (∘ denotes composition), where D_f is the discriminator's last feature layer and D_l is the final linear layer. To enforce the above objective, we map C(x) to the dimension of the discriminator's feature layer D_f using a trainable matrix D_emb, and minimize the distance between D_emb(C(x)) and the D_f features of both the real image x and the generated image G(z|C(x)) in an adversarial manner. Hence, our final GAN training losses for the discriminator and generator are given by:

L_D = E_{x∼q(x)} [max(0, 1 − D(x, C(x)))] + E_{x∼q(x), z∼p(z)} [max(0, 1 + D(G(z|C(x)), C(x)))]
L_G = −E_{x∼q(x), z∼p(z)} [D(G(z|C(x)), C(x))]     (3)

where D(x, y) = D_emb(y) · D_f(x) + D_l(D_f(x))     (4)

Algorithm 1: DIP training
while not converged do
    for each discriminator step do
        Sample batch x ∼ q(x), z ∼ p(z)
        x_fake = G(z|C(x))
        Calculate D(x, C(x)) = D_f(x) · D_emb(C(x)) + D_l(D_f(x))
        Calculate D(x_fake, C(x)) = D_f(x_fake) · D_emb(C(x)) + D_l(D_f(x_fake))
        L_D = max(0, 1 − D(x, C(x))) + max(0, 1 + D(x_fake, C(x)))
        Update θ_D ← Adam(L_D, α, β_1, β_2)
    end
    Sample z ∼ p(z)
    Generate images x_fake = G(z|C(x))
    Calculate D(x_fake, C(x)) = D_f(x_fake) · D_emb(C(x)) + D_l(D_f(x_fake))
    L_G = −D(x_fake, C(x))
    Update θ_G ← Adam(L_G, α, β_1, β_2)
end
return θ_G, θ_D

In the above formulation, the first term in Eq. 4 is the projection loss, as in (Miyato & Koyama, 2018), between the input image and the conditional embedding of the discriminator. Since the conditional embedding is extracted from a pre-trained network, the above training objective leads to feature-level knowledge distillation from C. It also acts as a regularizer on the discriminator, reducing its overfitting in the limited data setting, as shown in Figure 2.
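The scoring function of Eq. 4 and the hinge losses of Eq. 3 can be sketched numerically. In this minimal numpy version, D_f, D_l and D_emb are random-weight stand-ins introduced purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, prior_dim, img_dim = 128, 512, 3072

# Hypothetical stand-ins for the real networks (illustration only):
# D_f = discriminator feature trunk, D_l = final linear layer,
# D_emb = trainable projection of the prior C(x) into feature space.
W_f = rng.normal(size=(feat_dim, img_dim)) * 0.01
w_l = rng.normal(size=feat_dim) * 0.1
W_emb = rng.normal(size=(feat_dim, prior_dim)) * 0.01

def D_f(x):
    return np.tanh(W_f @ x)

def score(x, c):
    """Eq. 4: D(x, y) = D_emb(y) . D_f(x) + D_l(D_f(x))."""
    f = D_f(x)
    return float((W_emb @ c) @ f + w_l @ f)

def hinge_losses(x_real, x_fake, c):
    """Eq. 3 hinge losses for a single (real, fake, prior) triple."""
    L_D = max(0.0, 1 - score(x_real, c)) + max(0.0, 1 + score(x_fake, c))
    L_G = -score(x_fake, c)
    return L_D, L_G
```

The projection term (W_emb @ c) @ f is what ties the discriminator's feature space to the pre-trained prior, which is the source of the distillation effect described above.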
The gap between the discriminator score (D_l ∘ D_f) of training and validation images keeps increasing, and FID quickly degrades for the baseline model as compared to DIP when trained on only 10% of the CIFAR-100 data. Moreover, enforcing the feature D_f(G(z|C(x))) to be similar to D_emb(C(x)) encourages that, for each real sample, there exists a generated sample close to it, and hence promotes mode coverage of the target data distribution. We demonstrate that the above use of data instance priors from a pre-trained feature extractor, while designed for a limited data setting, also benefits large-scale image generation. Our overall methodology is illustrated in Figure 1, and pseudo code is given in Algorithm 1.

Random image generation at inference: Given the training set D_image = {x_j}_{j=1}^n of sample size n and its corresponding training data prior set D_prior = {C(x_j)}_{j=1}^n, the generator requires access to D_prior for sample generation. In the case of few-shot image generation, where the size of D_prior is limited (∼ 100), to create more variations we generate images conditioned on interpolations of two randomly sampled priors. In the case of large-scale image generation, to avoid storing D_prior for the complete training set (possibly on the order of millions of samples), we propose to cluster (Sculley, 2010) or build a Gaussian Mixture Model (GMM) (Xu & Jordan, 1996) on D_prior and store only the cluster centroids, enabling memory-efficient sampling of the conditional prior at inference as follows:

G(z | λ·p_1 + (1 − λ)·p_2),  where p_1, p_2 ∼ D_prior and λ ∼ Uniform[0, 1]
G(z | μ + N(0, I_k)),  where μ ∼ K-MeansCentroids(G_emb(D_prior)),  or
G(z | N(μ, Σ)),  where (μ, Σ) ∼ GMM(G_emb(D_prior))

Controlled image generation through semantic diffusion: We observed that the high-level semantics (e.g., smile, hair, gender, glasses in the case of faces) of a generated image G(z|C(x)) rely on the conditional prior C(x). Complementarily, variations in the latent code z ∼ N(0, I) induce fine-grained changes such as skin texture, face shape, etc. This suggests that one can exploit our conditional prior C(x) to gain some control over the high-level semantics of image generation. We show that altering an image x (through CutMix, CutOut, etc.) and using C(x) of the altered image as our new input prior helps generate samples with the desired attributes, as shown in Fig. 5. In a similar manner, DIP also allows the generation of images from certain cues (such as sketch-to-image generation, as shown in Fig. 5 and the Appendix). We note that sample generation at inference in this case can simply be done by using C(x) as the condition in G.
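The sampling schemes above can be sketched directly. This is a minimal numpy version; the stored centroids stand in for K-MeansCentroids(G_emb(D_prior)):

```python
import numpy as np

rng = np.random.default_rng(1)

def interpolated_prior(priors):
    """Few-shot sampling: interpolate two randomly drawn stored priors
    with lambda ~ Uniform[0, 1]."""
    i, j = rng.choice(len(priors), size=2, replace=False)
    lam = rng.uniform()
    return lam * priors[i] + (1 - lam) * priors[j]

def centroid_prior(centroids):
    """Large-scale sampling: Gaussian perturbation around a stored centroid,
    so the full prior set D_prior need not be kept in memory."""
    mu = centroids[rng.integers(len(centroids))]
    return mu + rng.normal(size=mu.shape)
```

Either function returns a conditioning vector that would then be fed to the generator as G(z | prior).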

5. EXPERIMENTS

We perform extensive experiments to highlight the efficacy of our data instance prior module DIP in unsupervised training based on the SNGAN (Miyato et al., 2018), BigGAN (Brock et al., 2018) and StyleGAN2 (Karras et al., 2020b) architectures. For extracting image prior information, we use the last-layer activations of: (1) a Vgg16 (Simonyan & Zisserman, 2014) classification network trained on ImageNet; and (2) a SimCLR (Chen et al., 2020) network trained using self-supervision on ImageNet. We conduct experiments in few-shot (∼ 25-100 images), limited (∼ 2k-5k images) and large-scale (∼ 50k-1M images) data settings. For evaluation, we use FID (Heusel et al., 2017) and precision and recall scores (Kynkäänniemi et al., 2019) to assess the quality and mode-coverage/diversity of the generated images.
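For reference, the FID used throughout the experiments compares Gaussian fits of real and generated feature sets. A minimal numpy sketch; the trace of the matrix square root is computed from the eigenvalues of the covariance product, avoiding an explicit sqrtm:

```python
import numpy as np

def fid(feats_a, feats_b):
    """Frechet distance between Gaussian fits of two feature sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).

    Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues of
    S1 @ S2, which are real and non-negative for PSD covariances."""
    mu1, mu2 = feats_a.mean(axis=0), feats_b.mean(axis=0)
    s1 = np.cov(feats_a, rowvar=False)
    s2 = np.cov(feats_b, rowvar=False)
    eigs = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eigs.real, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)
```

In practice, the feature sets are Inception activations of real and generated images; identical sets give a score of (numerically) zero, and larger scores indicate a bigger distribution gap.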

5.1. FEW-SHOT IMAGE GENERATION

Baselines and Datasets: For comprehensive evaluation, we compare and augment our methodology DIP with SNGAN trained from scratch and the following leading baselines: Batch Statistics Adaptation (BSA) (Noguchi & Harada, 2019), TransferGAN (Wang et al., 2018), FreezeD (Mo et al., 2020) and DiffAugment (Zhao et al., 2020b). In the case of BSA, a non-adversarial variant, GLANN (Hoshen et al., 2019), is used, which optimizes for image embeddings and a generative model through perceptual loss. We use our data priors to distill knowledge over these image embeddings. For more training and hyperparameter details, please refer to Appendix A. We perform experiments on 100 randomly chosen images at 128 × 128 resolution from: (1) Anime, (2) Faces, and (3) Flower.
[Table 1 caption: Precision and Recall scores are based on (Kynkäänniemi et al., 2019). FID is computed between 10k, 7k and 5k generated and 10k, 7k and 251 real samples for Anime, Faces and Flower, respectively. * denotes results directly reported from the paper.]

Results

Using DIP shows consistent improvements in FID and Recall over all baseline methods, as shown in Table 1. Fig. 3 shows samples generated via interpolation between conditional embeddings for models trained via DIP-Vgg on DiffAugment and vanilla DiffAugment. These results qualitatively show the significant improvement obtained using our DIP-based transfer learning approach. In comparison, the baseline, vanilla DiffAugment, fails to generate realistic interpolations and, for the most part, presents memorized training set images. DIP trained from scratch performs better than FreezeD and TransferGAN, but worse than DiffAugment+DIP. We present additional ablation studies in Appendix A, including more qualitative comparisons, to study the benefit of using DIP in few-shot image generation.
[Table 3: FID ↓, Precision ↑ and Recall ↑ for each method and inference scheme on CIFAR-10, CIFAR-100, FFHQ, LSUN-Bedroom and ImageNet 32×32.]

5.2. LIMITED DATA IMAGE GENERATION

In many practical scenarios, we have access to a moderate number of images (1k-5k) instead of just a few examples; however, this limited data may still not be enough to achieve stable GAN training. We show the benefit of our approach in this setting and compare our results with: MineGAN (Wang et al., 2020), TransferGAN, FreezeD, and DiffAugment. We perform experiments on three 128 × 128 resolution datasets: FFHQ2k, Places2.5k and CUB6k, following (Wang et al., 2020). FFHQ2k contains 2k training samples from the FFHQ (Karras et al., 2019) dataset. Places2.5k is a subset of the Places365 dataset (Zhou et al., 2014) with 500 examples each sampled from 5 classes (alley, arch, art gallery, auditorium, ballroom). CUB6k is the complete training split of the CUB-200 dataset (Wah et al., 2011). We use the widely used BigGAN (Brock et al., 2018) architecture, pre-trained on ImageNet, for fine-tuning. Table 2 shows our results; using DIP consistently improves FID over existing baselines by a significant margin. More implementation details are given in Appendix B, and sample images generated via our approach are shown in Fig. 8. We also compare our approach with DiffAugment on the CIFAR-10 and CIFAR-100 datasets while varying the number of samples used for training in Table 9 in Appendix B.

5.3. LARGE-SCALE IMAGE GENERATION

To show the usefulness of our method for large-scale image generation, we carry out experiments on the CIFAR-10, CIFAR-100 (Krizhevsky et al., 2010) and ImageNet-32 × 32 datasets with 50k, 50k and ∼ 1.2M training images, respectively, at 32 × 32 resolution. For a higher 128 × 128 resolution, we perform experiments on the FFHQ and LSUN-Bedroom (Yu et al., 2015) datasets with 63k and 3M training samples. We use a ResNet-based architecture for both the discriminator and generator, similar to BigGAN (Brock et al., 2018), for all our experiments. We also compare DIP with SS-GAN (Chen et al., 2019) and Self-Conditional GAN (Liu et al., 2020). Implementation and training hyperparameter details are provided in Appendix C. Table 3 reports the FID, precision and recall scores on the generated samples and the test set for the baselines and our approach (DIP). For K-means clustering on G_emb(D_prior), we set the number of clusters to 10K, and for GMM, the number of components is fixed to 1K for all datasets. DIP achieves better FID, precision and recall scores compared to leading baselines. Sample qualitative results are shown in the Appendix (Figure 11). To further analyze the role of the prior in our methodology, we train on CIFAR-100 with DIP using priors from different pre-trained models. As shown in Table 5, the FID metric remains relatively similar across different priors when compared to the baseline. We also evaluate the quality of inverted images at 128 × 128 resolution on the FFHQ and LSUN datasets using the Inference via Optimization Measure (IvOM) (Metz et al., 2017) to emphasize the high instance-level data coverage in the prior space of GANs trained through our approach (details on IvOM calculation are provided in Appendix C). Table 4 shows the IvOM and FID metrics between inverted and real query images. Figure 4 shows sample inverted images.
From both qualitative and quantitative perspectives, we observe that models trained via DIP invert a given query image better than the corresponding baselines.

Semantic Diffusion and Variations

As described in Section 4, our conditional DIP module provides a framework to alter input images and thus generate images with specific semantics. Figure 5 demonstrates how controlled semantic diffusion can be leveraged in several image manipulation tasks. We perform: (1) Custom Editing by using CutMix (i.e., pasting a desired portion of one image onto another); (2) Inpainting by using CutOut (i.e., removing random portions of an example image); (3) Sketch-to-Image by providing the features of a sketch as the conditional prior; and (4) Colorization by using the features of a given grayscale image for conditioning. As evident from Figure 5, our trained generator is able to generalize well through its ability to diffuse the semantic information from altered (CutMix and CutOut) as well as out-of-domain (sketch and grayscale) images.
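The CutMix-based custom editing above can be sketched as: alter the pixels, then recompute the prior on the altered image. The feature extractor C below is a hypothetical random-projection stand-in for the paper's pre-trained Vgg16/SimCLR features:

```python
import numpy as np

rng = np.random.default_rng(0)

def cutmix(x_a, x_b, top, left, h, w):
    """Paste an h x w patch of x_b into x_a (images are CHW arrays)."""
    out = x_a.copy()
    out[:, top:top + h, left:left + w] = x_b[:, top:top + h, left:left + w]
    return out

# Hypothetical stand-in for the pre-trained extractor C (Vgg16/SimCLR
# activations in the paper); here just a fixed random projection.
W_c = rng.normal(size=(512, 3 * 32 * 32)) * 0.01
def C(x):
    return np.tanh(W_c @ x.ravel())

x_a = rng.random((3, 32, 32))
x_b = rng.random((3, 32, 32))
# New prior from the edited image; feeding it to G(z | edited_prior) would
# diffuse the pasted region's semantics into generated samples.
edited_prior = C(cutmix(x_a, x_b, top=8, left=8, h=16, w=16))
```

The same pattern applies to the other edits: CutOut, sketches, or grayscale images are simply passed through C to produce the new conditioning prior.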

6. CONCLUSION

In this work, we present a novel data instance prior based transfer learning approach to improve the quality and diversity of images generated using GANs when only a few training samples are available. By leveraging features from a rich source domain as priors in limited unsupervised image synthesis, we show the utility of our simple yet effective approach on various standard vision datasets and GAN architectures. We demonstrate the efficacy of our approach in image generation with limited data, where it achieves new state-of-the-art performance, as well as in large-scale settings. Furthermore, using our framework of training via instance-level priors, we show how easily common image editing tasks can be performed by manipulating these priors. As future work, it would be interesting to explore the application of prior information in other potential image editing tasks.

A FEW-SHOT IMAGE GENERATION

We use an embedding for each of the 100 training images to ensure minimal difference between the baseline and our approach without increasing the number of parameters. We also experimented with self-modulated (Chen et al., 2018) and unconditional training, which resulted in either training collapse or worse results for all approaches. In DiffAugment, we use three augmentations: translation, cutout, and color, with the consistency regularization hyperparameter set to 10, and training is done from scratch following the implementation in their paper (Zhao et al., 2020b). In FreezeD, we freeze the first five blocks of the discriminator and fine-tune the rest. We use spectral normalization for both the generator and discriminator during training, with a batch size of 25, 4 discriminator steps, G and D learning rates of 2e-4, z dimension of 120, and a maximum of 30K training steps. During evaluation, moving average weights (Salimans et al., 2016a) of the generator are used in all experiments unless stated otherwise.
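The generator moving average used during evaluation can be sketched as a standard exponential moving average over parameters; the decay value below is an illustrative assumption, not taken from the paper:

```python
import numpy as np

def ema_update(avg_params, params, decay=0.999):
    """One exponential-moving-average step over generator parameters;
    the averaged copy is the one used at evaluation time."""
    return {k: decay * avg_params[k] + (1.0 - decay) * params[k]
            for k in params}
```

After each training step, the averaged parameters drift slowly toward the current weights, smoothing out per-step noise in the generator.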
For FID calculation, we select the snapshot with the best FID, similar to (Chen et al., 2019; Zhao et al., 2020b). For calculating precision and recall based on the k-nearest neighbor graph of inception features, as in (Kynkäänniemi et al., 2019), we use k = 10 for Precision and k = 40 for Recall. For StyleGAN2, G_emb is a 2-layer MLP with ReLU non-linearity which maps C(x) to a 512-dimensional generator conditional space. It is then concatenated with a random noise z of dimension 512, which is used as input to the mapping network. D_emb is a linear transformation matrix, and the discriminator loss is the projection loss combined with the real/fake loss. Training is done with a batch size of 16 for DiffAugment and 8 for FreezeD, for 20k steps. In the case of BSA, we show that DIP can be used to improve the results on similar non-adversarial generative models. Specifically, we perform experiments with GLANN, which is a two-step training procedure. In the first step, the generator and the embeddings {e_i} = {G_emb(C(x_i))} are optimized via:

argmin_{G, G_emb} Σ_i L_perceptual(G(G_emb(C(x_i))), x_i)

We fine-tune the pre-trained generator with a batch size of 50 and a learning rate of 0.01 for 4000 epochs. During the second step of IMLE optimization, we use a 3-layer MLP with z dimension of 64 and train for 500 epochs with a learning rate of 0.05.

Comparison with Logo-GAN (Sage et al., 2018): Logo-GAN has shown the advantage of using features from a pre-trained ImageNet network in unconditional training by assigning a class label to each instance based on clustering in the feature space. We compare our approach with this method in the few-shot data setting. To implement Logo-GAN, we perform class-conditional training (Miyato et al., 2018) using labels obtained by K-means clustering on Vgg16 features of the 100-shot Anime dataset. The results reported in Table 7 show the benefit of directly using features as data instance priors instead of only assigning labels based on feature clustering.
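The k-NN precision and recall computation referenced above (Kynkäänniemi et al., 2019) can be sketched in numpy. Each set's manifold is approximated by the union of balls reaching each point's k-th nearest neighbour; this brute-force version is for illustration on small sets:

```python
import numpy as np

def kth_nn_radii(feats, k):
    """Distance from each point to its k-th nearest neighbour within feats."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]          # column 0 is the point itself

def coverage(query, support, k):
    """Fraction of query points inside the union of k-NN balls around
    support points (the manifold estimate of Kynkaanniemi et al., 2019)."""
    radii = kth_nn_radii(support, k)
    d = np.linalg.norm(query[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def precision_recall(real_feats, fake_feats, k_p=10, k_r=40):
    # Precision: fakes covered by the real manifold; Recall: the converse.
    return (coverage(fake_feats, real_feats, k_p),
            coverage(real_feats, fake_feats, k_r))
```

The defaults k_p=10 and k_r=40 mirror the values stated above; the features would be inception activations in the actual evaluation.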

B LIMITED DATA IMAGE GENERATION

Experiments on CIFAR-10 and CIFAR-100 We experiment with unconditional BigGAN and StyleGAN2 models on CIFAR-10 and CIFAR-100 while varying the amount of training data, as done in (Zhao et al., 2020b). We compare DIP with DiffAugment in all settings, and the results are shown in Table 9.

Implementation details of the experiments on 128-resolution datasets in Section 5.2 We use our approach in conjunction with existing methodologies in a similar way as the few-shot setting, with G_emb and D_emb as linear transformation matrices which map the data priors into the generator's conditional input space of dimension 128 and the discriminator feature space of dimension 1536. During baseline training, we use self-modulation (Chen et al., 2018) in the batch-norm layers, similar to (Chen et al., 2019; Schonfeld et al., 2020). In DiffAugment, we use three augmentations (translation, cutout, and color) with the consistency regularization hyperparameter set to 10.
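The three augmentations named above can be illustrated with minimal single-channel versions; DiffAugment implements them as differentiable tensor operations, whereas these plain-Python analogues only show the geometry:

```python
def translate(img, dx, dy):
    """Shift an H x W image (nested lists) by (dx, dy), zero padding."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out

def cutout(img, cy, cx, size):
    """Zero out a size x size square whose top-left sits half a square
    above/left of (cy, cx)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    half = size // 2
    for y in range(max(0, cy - half), min(h, cy - half + size)):
        for x in range(max(0, cx - half), min(w, cx - half + size)):
            out[y][x] = 0.0
    return out

def color_brightness(img, shift):
    """Add a brightness shift to every pixel (one of the color ops)."""
    return [[p + shift for p in row] for row in img]
```

In DiffAugment the same transform (with randomly sampled shift/offset) is applied to both real and generated images before they reach the discriminator.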

C LARGE-SCALE IMAGE GENERATION

Test for memorization in the trained model To analyze memorization in GANs, we evaluate the recently proposed test to detect data copying (Meehan et al., 2020). The test measures whether generated samples are closer to the training set than a separate test set is, in the Inception feature space, using a three-sample Mann-Whitney U test (Mann & Whitney, 1947). A test statistic C_T << 0 indicates overfitting and data copying, whereas C_T >> 0 indicates underfitting. We average the test statistic over 5 trials and report the results in Table 10. We can see that using data instance priors during GAN training does not lead to data copying according to the test statistic, except on the FFHQ dataset, where both the DIP and baseline C_T values are negative.
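A simplified version of this statistic can be sketched as follows. The full test of (Meehan et al., 2020) z-scores Mann-Whitney U statistics within cells of a space partition; this sketch applies a single global U test to nearest-training-neighbor distances, which preserves the sign convention (copying ⇒ strongly negative):

```python
import math

def nearest_dist(points, ref):
    """Distance from each point to its nearest neighbour in ref."""
    return [min(math.dist(p, r) for r in ref) for p in points]

def mann_whitney_z(a, b):
    """z-scored Mann-Whitney U for sample a vs. b (no tie correction)."""
    n, m = len(a), len(b)
    u = sum(1.0 for x in a for y in b if x > y) \
        + 0.5 * sum(1.0 for x in a for y in b if x == y)
    mean = n * m / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    return (u - mean) / sd

def data_copy_stat(train, test, generated):
    """C_T-style statistic: if generated samples sit much closer to the
    training set than held-out test samples do, the z-score is negative
    (data copying); much farther gives a positive score (underfitting)."""
    d_gen = nearest_dist(generated, train)
    d_test = nearest_dist(test, train)
    return mann_whitney_z(d_gen, d_test)
```

Near-duplicates of training points drive the statistic negative, while samples far from the training manifold drive it positive.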

Image inversion

To invert a query image x_q using our trained model, we optimize the prior (after passing it through G_emb) that is used to condition each resolution block independently. Here, C_i (after passing it through G_emb) is the prior that conditions the i-th resolution block, i ∈ {1, ..., k}, and we minimize the reconstruction error between the generated image and x_q over {C_i}. For faster and better convergence, we initialize all C_i as G_emb(C(x_q)). The optimization is performed via back-propagation using the Adam optimizer with a learning rate of 0.1.

Implementation Details We use a single linear layer to transform the pre-trained image features into the generator's conditional input space of 128 dimensions and the discriminator feature space of 1024 dimensions, respectively. A hierarchical latent structure similar to (Brock et al., 2018) is used during DIP training. During evaluation with K-means and GMM on ImageNet and LSUN-Bedroom, we first randomly sample 200K training images and then fit the distribution, since clustering on the complete training set, which is on the order of millions of images, is infeasible. In training the unconditional baseline, we use self-modulation (Chen et al., 2018). In SSGAN, for the rotation loss we use the default weights of 0.2 for the generator and 1.0 for the discriminator, as mentioned in (Chen et al., 2019). For training Self-Conditional GAN (Liu et al., 2020), we set the number of clusters to 100 for all datasets. For CIFAR-10 and CIFAR-100, we re-cluster every 25k iterations with 25k samples, and for ImageNet, every 75k iterations with 50k samples, following the default implementation in (Liu et al., 2020). Following standard practice (Zhang et al., 2019), we calculate FID, Precision, and Recall between the test split and an equal number of generated images for CIFAR-10, CIFAR-100, and ImageNet 32×32, i.e., 10k, 10k, and 50k images, respectively. For the FFHQ and LSUN-Bedroom datasets, we use 7k and 30k generated and real (disjoint from training) samples, respectively.
For all datasets and methods, training is done with a batch size of 64, G and D learning rates of 0.0002, a z dimension of 120, and spectral normalization in both the generator and discriminator networks. Training runs for 100k steps for all datasets except ImageNet, which is trained for 200k steps, and moving average weights of the generator are used during evaluation.
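The prior-space inversion described above can be illustrated with a toy linear "generator" and plain gradient descent. The real method optimizes one prior per resolution block through the trained GAN with Adam (learning rate 0.1); the weights, learning rate, and zero initialization below are illustrative only:

```python
# A fixed 3x2 "generator": maps a 2-D prior c to a 3-pixel image.
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

def g(c):
    """Toy generator G(c) = W c."""
    return [sum(w * ci for w, ci in zip(row, c)) for row in W]

def invert(x_q, steps=500, lr=0.05):
    """Recover a prior c minimising ||G(c) - x_q||^2 by gradient descent.
    (The paper initialises with G_emb(C(x_q)); we start from zero.)"""
    c = [0.0, 0.0]
    for _ in range(steps):
        r = [gi - xi for gi, xi in zip(g(c), x_q)]        # residual
        grad = [2.0 * sum(W[i][j] * r[i] for i in range(3))  # d||r||^2/dc_j
                for j in range(2)]
        c = [ci - lr * gj for ci, gj in zip(c, grad)]
    return c
```

With a linear generator the objective is convex, so gradient descent recovers the prior exactly; through a real GAN the loss is non-convex, which is why the initialization at G_emb(C(x_q)) matters.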



Footnotes:
The code provided with BSA was not reproducible, hence this choice.
www.gwern.net/Danbooru2018
https://github.com/mit-han-lab/data-efficient-gans
https://github.com/sangwoomo/FreezeD
https://github.com/yedidh/glann
https://github.com/mit-han-lab/data-efficient-gans/tree/master/DiffAugment-stylegan2
https://github.com/ajbrock/BigGAN-PyTorch



Figure 1: Overview of our proposed technique, Data Instance Priors (DIP), for transfer learning in GANs. Top: DIP training with the feature C(x) of a real sample x as a conditional prior in the conditional GAN framework of (Miyato & Koyama, 2018). C is a network pre-trained on a rich source domain from which we wish to transfer knowledge. Bottom: Inference over the trained GAN involves learning a distribution over the set of training data priors {C(x)} to enable sampling of conditional priors.

Figure 2: Comparison between DIP and Baseline when trained on 10% data of CIFAR-100. Left: FID (in PyTorch) of baseline training starts increasing very early in training (around 15k steps) compared to the FID of DIP training. Middle: For the DIP model, discriminator scores on training and validation images remain similar to each other and consistently higher than the score of generated images. Right: For the baseline model, discriminator scores on training and validation images start diverging and training collapses.

Figure 3: 100-shot image interpolation between instance-level prior for DiffAugment (Zhao et al., 2020b) (Rows 1,3,5) and DiffAugment + DIP-Vgg16 (Rows 2,4,6) on Anime, Faces and Flower datasets respectively.

(2) FFHQ (Karras et al., 2019) and (3) Oxford 102 flowers (Nilsback & Zisserman, 2008) (we restrict to the Passion flower class following (Noguchi & Harada, 2019), to avoid overlap with ImageNet classes) datasets. The above datasets are chosen to ensure that there is no class label intersection with ImageNet classes. For methods with pre-training, we finetune SNGAN pre-trained on ImageNet, as done in (Noguchi & Harada, 2019). We also show additional results at 256 × 256 resolution, as well as on additional datasets (Panda, Grumpy Cat, Obama) with StyleGAN-2 (Karras et al., 2020b), in Appendix A.

Figure 4: Images generated through IvOM for randomly sampled test set images on FFHQ and LSUN-Bedroom. (Top to Bottom:) Original images, Baseline, Baseline + DIP-Vgg16, Baseline + DIP-SimCLR.

For more qualitative results, please see Fig. 10 in the Appendix. We can also use interpolation, noise injection, and Mixup in the conditional space to generate meaningful variations of a given image, as shown in Fig. 9 in the Appendix.

Figure 7: Few-shot interpolation samples between instance-level priors: Scratch (Row 1), Scratch + DIP-Vgg16 (Row 2), FreezeD (Row 3), FreezeD + DIP-Vgg16 (Row 4), DiffAugment (Row 5), DiffAugment + DIP-Vgg16 (Row 6)

In the limited data settings (5% and 10% of the training data), augmenting DiffAugment with DIP gives the best results in terms of FID for both the BigGAN and StyleGAN2 architectures. When trained on the complete training dataset, DIP slightly outperforms DiffAugment on the BigGAN architecture. The BigGAN model used for training on CIFAR-10 and CIFAR-100 is the same as the one used for the large-scale experiments in Section 5.3. In DiffAugment with the BigGAN architecture, we use all three augmentations (translation, cutout, and color), along with the consistency regularization hyperparameter set to 10. In DiffAugment + DIP, the consistency regularization hyperparameter is changed to 1. For experiments on the StyleGAN2 architecture, we use the code-base of DiffAugment.
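The consistency regularization term referred to above (following Zhang et al., 2019) penalizes the discriminator for scoring an image and its augmented view differently. A minimal scalar-logit sketch (the original regularizes discriminator outputs; the weight corresponds to the hyperparameter 10, or 1 for DiffAugment + DIP):

```python
def consistency_loss(scores_real, scores_aug, weight=10.0):
    """Mean squared difference between discriminator scores on images
    and on their augmented views, scaled by the CR weight."""
    n = len(scores_real)
    return weight * sum((a - b) ** 2
                        for a, b in zip(scores_real, scores_aug)) / n
```

The term is added to the discriminator loss, discouraging D from keying on augmentation artifacts rather than image content.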

Figure 8: Sample generated images from limited data training: FreezeD (Row 1), FreezeD + DIP-Vgg16 (Row 2), DiffAugment (Row 3) and DiffAugment + DIP-Vgg16 (Row 4).

Training uses a learning rate of 2e-4 and a z dimension of 120. For DiffAugment, the batch size is 32, the number of D steps is 4, and the rest of the hyperparameters are the same. Training is done for 30k steps for DiffAugment and FreezeD, and 5k steps for the rest. The moving average weights of the generator are used for evaluation. We use the pre-trained network from (Brock et al., 2018) for finetuning.

Figure 11: Samples generated by our DIP-Vgg16 approach on large-scale image generation

Few-shot image generation results using 100 training images (↓: lower is better; ↑: higher is better).



Comparison of FID, Precision and Recall metrics of DIP with Baseline and SSGAN for large-scale image generation. Best values obtained by using the complete training set as D_prior are underlined, and the best values among all other approaches are in bold.

100-shot image generation results using the StyleGAN-2 (Karras et al., 2020b) architecture on the Panda, Grumpy Cat and Obama datasets. FID is computed between 5k generated images and the complete training dataset. * denotes values directly reported from the paper (Zhao et al., 2020b).

Comparison of loss functions in few-shot image generation using 100 training images (FID: lower is better). H is the hinge loss, NS is the non-saturating loss, and W is the Wasserstein loss.

GLANN's two-step training procedure is as follows: (1) optimize for image embeddings {e_i} of all training images {x_i} jointly with a generator network G using a perceptual loss; and (2) learn a sampling function T : z → e through IMLE for generating random images during inference. To add the data instance prior to the training procedure of GLANN, instead of directly optimizing for {e_i}, we optimize the modified first-step objective in which the embeddings are computed as {e_i} = {G_emb(C(x_i))}.
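Step (2), the IMLE fit of the sampling function T, can be sketched with a linear T in place of the paper's 3-layer MLP. Each step pulls, for every target embedding, the nearest generated sample toward that target; all dimensions and values are illustrative:

```python
import math

def t_apply(theta, z):
    """Linear sampler T(z) = theta @ z for 2-D z and embeddings."""
    return [theta[0][0] * z[0] + theta[0][1] * z[1],
            theta[1][0] * z[0] + theta[1][1] * z[1]]

def imle_step(theta, targets, zs, lr=0.1):
    """One IMLE step: for each target embedding e, find the nearest
    generated sample T(z) and take a gradient step pulling it toward e."""
    outs = [t_apply(theta, z) for z in zs]
    for e in targets:
        j = min(range(len(zs)), key=lambda k: math.dist(outs[k], e))
        z = zs[j]
        r = [o - t for o, t in zip(t_apply(theta, z), e)]  # residual
        for i in range(2):  # gradient of ||T(z) - e||^2 w.r.t. theta
            theta[i][0] -= lr * 2.0 * r[i] * z[0]
            theta[i][1] -= lr * 2.0 * r[i] * z[1]
    return theta
```

In practice the latent codes zs are redrawn from a Gaussian at every step; fixing them (as in the check below) only keeps the example deterministic.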

Comparison of FID on CIFAR-10 and CIFAR-100 while varying the amount of data used during training. All approaches above are trained with random horizontal flip augmentation of real images. BigGAN-DiffAugment includes consistency regularization (Zhang et al., 2019) following the implementation provided by the authors (Zhao et al., 2020b). Best FID values are reported for each model.

Test for evaluating data copying and memorization in GANs (Meehan et al., 2020) for different approaches and datasets. A test statistic C_T << 0 denotes overfitting and data copying, and C_T >> 0 represents underfitting.


Memorization in GANs To evaluate whether trained GANs actually generate novel images instead of only memorizing the training set, we calculate the FID between images randomly sampled from the training set (with repetition) and the separate test set for the Anime and FFHQ datasets. For the Anime dataset, we get an FID of 81.23, and for FFHQ, 100.07. Comparing these numbers to Table 1, we observe that only by using DIP with existing algorithms do we achieve a better FID score, suggesting the usefulness of the proposed approach.

Few-shot image generation with StyleGAN-2 For 256 × 256 resolution datasets with the StyleGAN-2 architecture, we follow (Zhao et al., 2020b) and perform experiments on the 100-shot Obama, Panda, and Grumpy Cat datasets with models pre-trained on the FFHQ (Karras et al., 2019) dataset. Table 6 shows consistent improvement in FID scores when training with DIP, irrespective of the baseline training algorithm, except on the Grumpy Cat dataset. We hypothesize that this is because the prior features of this dataset have low diversity and are not informative enough to improve performance with data instance prior training.

Impact of loss function We analyze how DIP performs when the loss function is changed. We compare the following three loss functions: the hinge loss used originally in our experiments, the non-saturating loss (Goodfellow et al., 2014), and the Wasserstein loss (Arjovsky et al., 2017). Table 8 shows the corresponding results when DIP is used with FreezeD and DiffAugment. We observe that with FreezeD + DIP, the Wasserstein loss significantly outperforms the non-saturating and hinge losses. With DiffAugment, the hinge loss performs best, followed by the non-saturating and Wasserstein losses.
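The three objectives compared in Table 8 can be written out explicitly on raw discriminator scores; a minimal sketch (in practice the Wasserstein loss is paired with a Lipschitz constraint such as spectral normalization):

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def _softplus(x):
    """Numerically stable log(1 + e^x)."""
    return max(x, 0.0) + math.log(1.0 + math.exp(-abs(x)))

# Hinge loss (H)
def d_hinge(real, fake):
    return _mean([max(0.0, 1.0 - r) for r in real]) + \
           _mean([max(0.0, 1.0 + f) for f in fake])

def g_hinge(fake):
    return -_mean(fake)

# Non-saturating loss (NS)
def d_ns(real, fake):
    return _mean([_softplus(-r) for r in real]) + \
           _mean([_softplus(f) for f in fake])

def g_ns(fake):
    return _mean([_softplus(-f) for f in fake])

# Wasserstein loss (W)
def d_wass(real, fake):
    return _mean(fake) - _mean(real)

def g_wass(fake):
    return -_mean(fake)
```

All three take lists of discriminator scores; only the shaping of the score into a loss differs, which is what the ablation isolates.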

Implementation Details

In the SN-GAN architecture, while training with the data instance prior, G_emb and D_emb are matrices which linearly transform the pre-trained features into the generator conditional space of dimension 128 and the discriminator feature space of dimension 1024.
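The role of D_emb can be sketched via the projection discriminator of (Miyato & Koyama, 2018), with the prior feature C(x) taking the place of a class embedding; the tiny dimensions and weights below are illustrative:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    """Matrix-vector product with m as a list of rows."""
    return [dot(row, v) for row in m]

def projection_score(phi_x, prior, d_emb, psi_w, psi_b=0.0):
    """Projection-discriminator output: an unconditional real/fake term
    psi(phi(x)) plus an inner product between the linearly projected
    prior D_emb @ C(x) and the discriminator features phi(x)."""
    uncond = dot(psi_w, phi_x) + psi_b        # real/fake head
    cond = dot(matvec(d_emb, prior), phi_x)   # projection term
    return uncond + cond
```

The conditional term rewards generated images whose discriminator features align with the projected instance prior, which is how the prior conditions the adversarial game on the discriminator side.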

