IS SYNTHETIC DATA FROM GENERATIVE MODELS READY FOR IMAGE RECOGNITION?

Abstract

Recent text-to-image generation models have shown promising results in generating high-fidelity photo-realistic images. Though the results are astonishing to human eyes, how applicable these generated images are to recognition tasks remains under-explored. In this work, we extensively study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks, focusing on two perspectives: synthetic data for improving classification models in data-scarce settings (i.e. zero-shot and few-shot), and synthetic data for large-scale model pre-training for transfer learning. We showcase the power and shortcomings of synthetic data from existing generative models, and propose strategies for better applying synthetic data to recognition tasks.

1. INTRODUCTION

Over the past decade, deep learning powered by large-scale annotated data has revolutionized the field of image recognition. However, manually collecting a large-scale labeled dataset is costly and time-consuming, and recent concerns about data privacy and usage rights further hinder this process. In parallel, generative models that aim to model real-data distributions can now produce high-fidelity photo-realistic images. In particular, recent text-to-image generation models (Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022b) have made major breakthroughs in synthesizing high-quality images from text descriptions. This prompts us to ask: is synthetic data from generative models ready for image recognition tasks?

There are a few early attempts at exploring synthetic data from generative models for image recognition tasks. Besnier et al. (2020) use a class-conditional GAN (BigGAN (Brock et al., 2018) trained on the 1000 ImageNet classes) to generate images for training image classifiers. Zhang et al. (2021) leverage StyleGAN (Karras et al., 2019) to produce synthetic labeled data for object-part segmentation. Jahanian et al. (2021) manipulate the latent space of a GAN model to produce multi-view images for contrastive learning. Although promising, these early works either address tasks at a small scale or only in a specific setting. Moreover, they all focus on GAN-based models, and none explore the revolutionary text-to-image generation models, which hold more promise for recognition tasks. In this paper, we present the first study of state-of-the-art text-to-image generation models for image recognition. With the power of text-to-image generation, we can hopefully not only generate massive amounts of high-quality labeled data, but also achieve domain customization by generating synthetic data targeted at a specific label space, i.e. the label space of a downstream task.
Our study is carried out on an open-sourced text-to-image generation model, GLIDE (Nichol et al., 2021). We attempt to uncover the benefits and pitfalls of synthetic data for image recognition through the lens of two questions: 1) is synthetic data from generative models ready for improving classification models? 2) can synthetic data be a feasible source for transfer learning (i.e. model pre-training)? It is worth noting that for 1), we only study the zero-shot and few-shot settings, because the positive impact of synthetic data diminishes as more shots become available. In addition, we build most of our investigations on the state-of-the-art method CLIP (Radford et al., 2021), with the feature extractor initialized from large-scale pre-trained weights and kept frozen.

Our Findings. First, in the zero-shot setting, i.e. no real-world data are available, we demonstrate that synthetic data can significantly improve classification results on 17 diverse datasets: top-1 accuracy is increased by 4.31% on average, and by as much as 17.86% on the EuroSAT dataset. To better leverage synthetic data in this setting, we also investigate strategies to increase data diversity, reduce data noise, and enhance data reliability. This is achieved by designing diversified text prompts and measuring the correlation between text and synthesized data with CLIP features. Second, in the few-shot setting, i.e. when a few real images are available, synthetic data are also beneficial, albeit not as significantly as in the zero-shot task, and help us achieve a new state of the art. Our observations show that the domain gap between synthetic data and downstream task data is one challenge to further improving the effectiveness of synthetic data for classifier learning. Fortunately, in this setting, the available real samples provide useful information about the data distribution of the downstream task.
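The two strategies above, diversified prompts and CLIP-based correlation filtering, can be sketched as follows. This is a minimal illustration only: it assumes CLIP text/image embeddings have already been computed (random vectors stand in for them here), and the template strings, function names, and `keep_ratio` parameter are our own illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def diversify_prompts(class_name, templates=None):
    """Build multiple text prompts per class to increase the diversity
    of synthesized images (the templates below are illustrative)."""
    if templates is None:
        templates = [
            "a photo of a {}",
            "a close-up photo of a {}",
            "a photo of a {} in the wild",
        ]
    return [t.format(class_name) for t in templates]

def clip_filter(image_embs, text_emb, keep_ratio=0.5):
    """Keep the synthetic images whose (CLIP) image embedding has the
    highest cosine similarity to the class text embedding, discarding
    low-correlation, likely-noisy generations."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    scores = image_embs @ text_emb                 # cosine similarities
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep_idx = np.argsort(scores)[::-1][:n_keep]   # top-scoring images first
    return keep_idx, scores

# Toy run with random stand-in "embeddings" (8 images, 4-dim features).
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(8, 4))
text_emb = rng.normal(size=4)
keep, scores = clip_filter(image_embs, text_emb, keep_ratio=0.25)
print(diversify_prompts("satellite image of a forest")[0])
print(len(keep))  # 2 images retained out of 8
```

In practice the embeddings would come from a real CLIP encoder; the filtering logic itself is unchanged.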
We thus propose using real images as guidance in the generation process to reduce the domain gap and improve effectiveness. Third, in large-scale model pre-training for transfer learning, our study shows that synthetic data are suitable and effective for model pre-training, delivering superior transfer learning performance and even outperforming ImageNet pre-training. In particular, synthetic data work surprisingly well in unsupervised model pre-training, and favor ViT-based backbones. We also demonstrate that enlarging the label space (i.e. the set of text prompts) for data generation increases data amount and diversity, bringing further performance gains. Moreover, synthetic data can work collaboratively with real data (i.e. ImageNet): we obtain improved performance when the model is initialized with ImageNet pre-trained weights.
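Real-image guidance can be realized by partially noising a real image and running the learned reverse diffusion from that intermediate step, in the spirit of SDEdit (Meng et al., 2021), which is cited below. The sketch here is a toy numpy version under that assumption: the noise schedule is illustrative and an identity function stands in for the learned denoiser, so only the control flow is meaningful.

```python
import numpy as np

def forward_noise(x0, alpha_bar_t, rng):
    """Diffuse an image to an intermediate timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def guided_generate(x_real, alpha_bars, t0, denoise_step, rng):
    """Real-image-guided sampling sketch: instead of starting from pure
    Gaussian noise at step T, partially noise a real image to step t0,
    then run the reverse process from there, so the synthetic sample
    stays close to the downstream data distribution."""
    x = forward_noise(x_real, alpha_bars[t0], rng)
    for t in range(t0, -1, -1):
        x = denoise_step(x, t)  # placeholder for the learned reverse step
    return x

# Toy run: 100-step schedule, guidance from step 30, identity "denoiser".
rng = np.random.default_rng(0)
alpha_bars = np.linspace(0.99, 0.01, 100)  # illustrative noise schedule
x_real = np.ones((4, 4))                   # stand-in for a real image
x_syn = guided_generate(x_real, alpha_bars, t0=30,
                        denoise_step=lambda x, t: x, rng=rng)
print(x_syn.shape)  # (4, 4)
```

Choosing a smaller `t0` keeps the sample closer to the real guide image; a larger `t0` allows more variation from the generator.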

2. RELATED WORKS

Synthetic Data for Image Recognition. Synthetic data for image recognition mainly come in two forms: 1) synthetic datasets generated from traditional simulation pipelines; 2) synthetic images output by generative models. The first type (Dosovitskiy et al., 2015; Peng et al., 2017; Richter et al., 2016) is usually generated from a traditional pipeline with a specific data source, e.g. synthetic 2D renderings of 3D models or scenes from graphics engines. However, this way of generating synthetic datasets has several drawbacks: 1) data from manually defined pipelines may have a considerable gap with real-world data; 2) such datasets take up huge storage space and are costly to share and transfer; 3) data amount and diversity are bounded by the specific data source. Compared with synthetic datasets, generative models are a more efficient means of synthetic data representation, with favorable advantages: 1) they can produce high-fidelity photo-realistic images closer to real data, since they are trained on real-world data; 2) they are highly condensed compared to the synthetic data itself and take up much less storage space; 3) they offer a potentially unlimited synthetic data size. Only recently have a few works attempted to explore synthetic data generated from generative models for image recognition.

Text-to-Image Diffusion Models. Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021) have recently emerged as a class of promising and powerful generative models. As likelihood-based models, diffusion models match the underlying data distribution q(x_0) by learning to reverse a noising process, so novel images can be sampled from a prior Gaussian distribution via the learned reverse path.
Thanks to their high sample quality, good mode coverage, and promising training stability, diffusion models are quickly becoming a new trend in both unconditional (Ho et al., 2020; Nichol & Dhariwal, 2021; Ho et al., 2022) and conditional (Dhariwal & Nichol, 2021; Rombach et al., 2022; Lugmayr et al., 2022; Saharia et al., 2022a; Meng et al., 2021; Saharia et al., 2022c) image synthesis.
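For concreteness, the noising and reverse processes mentioned above are commonly written in the standard DDPM form (Ho et al., 2020); this is the textbook formulation, not notation specific to this paper:

```latex
% Forward (noising) process with variance schedule \beta_t:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big),
\quad \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s).

% A network \epsilon_\theta learns the reverse path by predicting the
% injected noise, typically via the simplified objective:
L_{\mathrm{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon \sim \mathcal{N}(0,\,I)}
\big[\,\lVert \epsilon - \epsilon_\theta(x_t, t)\rVert^2\,\big].
```

Sampling then starts from x_T drawn from a standard Gaussian and applies the learned reverse transitions step by step.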



Besnier et al. (2020) use a class-conditional GAN to train classifiers for the same classes. Zhang et al. (2021) leverage the latent code of StyleGAN (Karras et al., 2019) to produce labels for object-part segmentation. While they achieve promising results, both works are task-specific and only applied at a small scale. Jahanian et al. (2021) use a GAN-based generator to produce multiple views for unsupervised contrastive representation learning. These works, however, all build on traditional GAN-based models; in contrast, our work investigates the best publicly released text-to-image generation model, which offers a new ability to customize data for different downstream label spaces.

Code is available at https://github.com/CVMI-Lab/SyntheticData.

