DATA INSTANCE PRIOR FOR TRANSFER LEARNING IN GANS

Abstract

Recent advances in generative adversarial networks (GANs) have shown remarkable progress in generating high-quality images. However, this gain in performance depends on the availability of a large amount of training data. In limited-data regimes, training typically diverges, and the generated samples are consequently of low quality and lack diversity. Previous works have addressed training in the low-data setting by leveraging transfer learning and data augmentation techniques. We propose a novel transfer learning method for GANs in the limited-data domain that leverages an informative data prior derived from self-supervised or supervised networks pre-trained on a diverse source domain. We perform experiments on several standard vision datasets using various GAN architectures (BigGAN, SNGAN, StyleGAN2) to demonstrate that the proposed method effectively transfers knowledge to domains with few target images, outperforming existing state-of-the-art techniques in terms of image quality and diversity. We also show the utility of the data instance prior in large-scale unconditional image generation and image editing tasks.

1. INTRODUCTION

Generative Adversarial Networks (GANs) have been at the forefront of high-quality image synthesis in recent years (Brock et al., 2018; Karras et al., 2020b; 2019). GANs have also demonstrated excellent performance on many related computer vision tasks such as image manipulation (Zhu et al., 2017; Isola et al., 2017), image editing (Plumerault et al., 2020; Shen et al., 2020; Jahanian et al., 2020) and compression (Tschannen et al., 2018). Despite this success in large-scale image synthesis, GAN training suffers from a number of drawbacks in practice, such as training instability and mode collapse (Goodfellow et al., 2016; Arora et al., 2017). It has been observed that unstable training can be mitigated to an extent by using conditional GANs; however, this is expected, as learning a conditional model for each class is easier than learning the joint distribution. The disadvantages of GAN training have prompted research into several non-adversarial generative models (Hoshen et al., 2019; Bojanowski et al., 2018; Li & Malik, 2018; Kingma & Welling, 2014). These techniques are implicitly designed to overcome the mode collapse problem; however, the quality of their generated samples is still not on par with GANs. Current state-of-the-art deep generative models require a large volume of data and computational resources. Collecting large image datasets suitable for training (especially labeled data in the case of conditional GANs) can easily become a daunting task due to issues such as copyright and image quality, as well as the training time required to reach state-of-the-art generation performance. To address these limitations, researchers have recently proposed techniques inspired by transfer learning (Noguchi & Harada, 2019; Wang et al., 2018; Mo et al., 2020) and data augmentation (Karras et al., 2020a; Zhao et al., 2020b; Zhang et al., 2019).
Advancements in data and computation efficiency for image synthesis can enable applications in data-deficient fields such as medicine (Yi et al., 2019), where labeled data can be difficult to obtain. Transfer learning is a promising area of research (Oquab et al., 2014; Pan & Yang, 2009) that leverages prior information acquired from large datasets to help train models on a target dataset under limited data and resource constraints. Transfer learning has been extensively explored for classification, showing excellent performance on various data-deficient downstream domains. Similar strategies of reusing pre-trained networks (i.e., fine-tuning a subset of pre-trained network weights from a data-rich domain) have also recently been employed for image synthesis with GANs (Wang et al., 2018; Noguchi & Harada, 2019; Mo et al., 2020; Wang et al., 2020; Zhao et al., 2020a) in the limited-data regime. However, these approaches are still prone to overfitting on the sparse target data, and hence suffer from degraded image quality and diversity. In this work, we propose a simple yet effective way of transferring prior knowledge in unsupervised image generation given a small sample size (∼100-2000) of the target data distribution. Our approach is motivated by the IMLE formulation (Li & Malik, 2018), which seeks mode coverage of the target data distribution by learning a mapping between the latent and target distributions using a maximum likelihood criterion. We instead propose using data priors in GANs to match the representations of generated samples to real modes of the data. In contrast to Li & Malik (2018), we use the images generated using data priors to find the nearest-neighbor match to real modes in the generator's learned distribution.
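As a point of reference, the nearest-neighbor maximum-likelihood idea behind IMLE can be illustrated with a toy sketch: for each real sample, find the nearest generated sample and pull it toward the real point under a squared-error loss. The linear generator, dimensions, learning rate, and step count below are hypothetical stand-ins chosen purely for illustration; they are not the formulation or hyperparameters used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

real = rng.normal(size=(8, 2))       # toy "real images" in R^2
W = rng.normal(size=(2, 2)) * 0.1    # toy linear generator G(z) = z @ W

def generate(n):
    z = rng.normal(size=(n, 2))
    return z, z @ W

def mean_nn_dist(fake):
    # mean squared distance from each real point to its nearest generated sample
    d = ((real[:, None, :] - fake[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).mean()

_, fake = generate(256)
init_d = mean_nn_dist(fake)

lr = 0.05
for _ in range(300):
    z, fake = generate(32)
    d = ((real[:, None, :] - fake[None, :, :]) ** 2).sum(-1)
    nn = d.argmin(axis=1)            # nearest generated sample per real point
    # gradient of (1/n) * sum_i ||z_i @ W - x_i||^2 w.r.t. W (constant factor dropped)
    grad = z[nn].T @ (fake[nn] - real) / len(real)
    W = W - lr * grad

_, fake = generate(256)
final_d = mean_nn_dist(fake)
print(final_d < init_d)
```

Because every real point, not every generated one, is guaranteed a match, this objective encourages mode coverage rather than mode collapse, which is the property our method seeks to retain.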
In particular, we show that using an informative data instance prior substantially improves performance in both limited-data and large-scale unsupervised image generation. These data priors can be derived from commonly used pre-trained computer vision networks (Simonyan & Zisserman, 2014; Zhang et al., 2018; Noguchi & Harada, 2019; Hoshen et al., 2019) or self-supervised data representations (Chen et al., 2020), without violating the target setting's requirements (e.g., ensuring that the pre-trained network has not been trained on the few-shot classes in a few-shot learning setting). In the case of sparse training data, our approach of using data instance priors leverages a model pre-trained on a rich source domain to learn the target distribution. Unlike previous works (Noguchi & Harada, 2019; Wang et al., 2020; 2018), which fine-tune models trained on a data-rich domain, we propose to use the feature representations of the source model as data instance priors to distill knowledge (Romero et al., 2015; Hinton et al., 2015) into the target generative model. We note that our transfer learning technique becomes fully unsupervised when the data priors are extracted from self-supervised pre-trained networks. Beyond image generation in the low-data domain, we also achieve state-of-the-art Fréchet inception distance (FID) scores (Heusel et al., 2017) on large-scale unsupervised image generation, and we show how this transfer learning framework supports several image editing tasks. We summarize our main contributions as follows:

• We propose Data Instance Prior (DIP), a novel transfer learning technique for GAN image synthesis in the low-data regime. Employing DIP in conjunction with existing few-shot image generation methods outperforms state-of-the-art results. We show that with as few as 100 images, DIP generates diverse and high-quality images (see Figure 3).

• We demonstrate the utility of our approach in large-scale unsupervised GANs (Miyato et al., 2018; Brock et al., 2018), achieving a new state of the art in terms of image quality (Heusel et al., 2017) and diversity (Sajjadi et al., 2018; Metz et al., 2017).

• We show how the DIP framework, by construction, enables inversion of images and common image editing tasks (such as CutMix, in-painting, and image translation) in GANs.

We call our method a data instance prior (rather than simply a data prior) because it uses representations of individual instances as a prior, not a data distribution itself.
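To make the instance-prior idea concrete, the following minimal sketch matches generated samples to per-instance feature embeddings from a frozen network. The random projection `phi` is a hypothetical stand-in for an actual pre-trained extractor (e.g., VGG or SimCLR embeddings), and the cosine matching and distillation loss are illustrative simplifications, not the exact training objective.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "pre-trained" feature extractor: a fixed random projection stands in
# for real embeddings (e.g., VGG or SimCLR features); purely illustrative.
P = rng.normal(size=(2, 16))

def phi(x):
    f = x @ P
    return f / np.linalg.norm(f, axis=1, keepdims=True)  # unit-norm features

real = rng.normal(size=(10, 2))   # toy "real images"
priors = phi(real)                # data instance priors: one per real image

fake = rng.normal(size=(64, 2))   # stand-in for generator outputs G(z)

# Match each generated sample to its nearest real instance prior by cosine
# similarity in the frozen feature space.
sim = phi(fake) @ priors.T        # (64, 10) cosine similarities
idx = sim.argmax(axis=1)

# Distillation-style loss: pull each generated sample's features toward its
# matched instance prior (the quantity a generator update would minimize).
loss = ((phi(fake) - priors[idx]) ** 2).sum(axis=1).mean()
print(idx.shape, loss >= 0.0)
```

Because the matching happens in the feature space of the source model rather than in pixel space, knowledge from the data-rich source domain shapes which real modes each generated sample is pulled toward.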

2. RELATED WORK

Deep Generative Models In recent years, there has been a surge in research on deep generative models. Popular approaches include variational auto-encoders (VAEs) (Rezende et al., 2014; Kingma & Welling, 2014), auto-regressive (AR) models (Van Oord et al., 2016; Van den Oord et al., 2016) and GANs (Goodfellow et al., 2014). VAEs learn by maximizing a variational lower bound on the likelihood of the data under a given distribution. Auto-regressive approaches model the data distribution as a product of conditional probabilities, generating data sequentially. GANs comprise two networks, a generator and a discriminator, trained in a min-max optimization. Specifically, the generator aims to generate samples that fool the discriminator, while the discriminator learns to distinguish these generated samples from real samples. Several research efforts in GANs have focused on improving performance (Karras et al., 2018; Denton et al., 2015; Radford et al., 2016; Karras et al., 2020b; 2019; Brock et al., 2018; Zhang et al., 2019) and

